Commit a9c4114

docs: add information about pii classification feature (#1517)
* Update duplicates_pandas.py (#1427): fixes Bug Report #1384, where a dataset with categorical features caused a memory error even on a tiny dataset.
* chore(actions): update sonarsource/sonarqube-scan-action action to v2.0.1
* chore(actions): update actions/checkout action to v4
* docs: setup new docs with mkdocs (#1418)
* chore(actions): update actions/checkout action to v4
* fix: remove the duplicated cardinality threshold under categorical and text settings
* fix: fixate matplotlib upper version
* docs: change from `zap` to `sparkles` (#1447)
* fix: template {{ file_name }} error in HTML wrapper (#1380): update javascript.html and style.html
* feat: add density histogram (#1458): add histogram density option, add a unit test, discard weights if they exceed max_bins
* docs: update README.html (#1461): update URLs of use cases, main integrations, and common issues
* fix: bug when creating a new report (#1440)
* fix: gen wordcloud only for non-empty cols (#1459)
* fix: table template ignoring text format (#1462): also fixes the timeseries unit test and code formatting
* fix: to_category mishandling pd.NA (#1464)
* docs: add 📊 for Key features (#1451), see also #1445 (comment)
* docs: fix hyperlink related to package name change (#1457)
* chore(deps): increase numpy upper limit (#1467): fixate numpy version for spark
* chore(deps): fix numba package version, and filter warns (#1468): skip isort linter on init
* chore(deps): update dependency typeguard to v4 (#1324)
* docs: update docs with advent of code
* docs: update links for fabric
* chore(actions): update actions/setup-python action to v5
* docs: add information about PII classification & management

Co-authored-by: boris-kogan <139680785+boris-kogan@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Vasco Ramos <vasco.ramos@ydata.ai>
Co-authored-by: ricardodcpereira <ricardo.pereira@ydata.ai>
Co-authored-by: Anselm Hahn <Anselm.Hahn@gmail.com>
Co-authored-by: Joge <87136119+jogecodes@users.noreply.github.com>
Co-authored-by: Alex Barros <alexbarros@users.noreply.github.com>
Co-authored-by: Miriam Seoane Santos <68821478+miriamspsantos@users.noreply.github.com>
Co-authored-by: Chris Mahoney <44449504+chrimaho@users.noreply.github.com>
Co-authored-by: Azory YData Bot <azory@ydata.ai>
Co-authored-by: martin-kokos <4807476+martin-kokos@users.noreply.github.com>
Co-authored-by: Martin Mokry <martin-kokos@users.noreply.github.com>
Co-authored-by: Maciej Bukczynski <maciej@darkhorseanalytics.com>
Co-authored-by: Fabiana Clemente <fabianaclemente@Fabianas-MacBook-Air.local>
1 parent 8d4d347 commit a9c4114

5 files changed

Lines changed: 85 additions & 16 deletions

docs/features/collaborative_data_profiling.md

Lines changed: 8 additions & 5 deletions
```diff
@@ -1,16 +1,19 @@
-# Data Catalog - A collaborative experience to profile datasets & relational databases
+# Data Catalog **
+A collaborative experience to profile datasets & relational databases
 
-!!! note "Data Catalog with data quality profiling"
+!!! info "** YData's Enterprise feature"
+
+    This feature is only available for users of [YData Fabric](https://ydata.ai).
 
-[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **data catalog**
-and **collaborative** experience for datasets and database profiling at scale!
+[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **Data catalog**
 
 [YData Fabric](https://ydata.ai/products/fabric) is a Data-Centric AI
 development platform. YData Fabric provides all capabilities of
 ydata-profiling in a hosted environment combined with a guided UI
 experience.
 
-[Fabric's Data Catalog](https://ydata.ai/products/data_catalog)
+[Fabric's Data Catalog](https://ydata.ai/products/data_catalog),
+a scalable and interactive version of ydata-profiling,
 provides a comprehensive and powerful tool designed to enable data
 professionals, including data scientists and data engineers, to manage
 and understand data within an organization. The Data Catalog act as a
```
docs/features/pii_identification_management.md

Lines changed: 59 additions & 0 deletions

```diff
@@ -0,0 +1,59 @@
+# Personally identifiable information (PII) identification & management **
+
+!!! info "** YData's Enterprise feature"
+
+    This feature is only available for users of [YData Fabric](https://ydata.ai).
+
+[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) and
+start your journey into **data management** with automated PII identification.
+
+Personally Identifiable Information **(PII)** refers to any information that can be used to identify an individual.
+This includes, but is not limited to, names, addresses, phone numbers, social security numbers, email addresses,
+and financial information. PII is crucial in today's digital age, where data is extensively collected, stored,
+and processed.
+
+[YData Fabric Data Catalog](https://ydata.ai/products/data_catalog), a scalable and interactive version of ydata-profiling,
+integrates into the data profiling experience an advanced machine learning solution based on a Named Entity Recognition (NER) model
+combined with traditional rule-based pattern identification, allowing PII to be detected efficiently.
+
+:fontawesome-brands-youtube:{ .youtube }
+<a href="https://www.youtube.com/clip/UgkxBntXvAvCQ6I39Cp2KZRD4Ug9-NPzG1o1"><u>See Fabric's Data Catalog PII identification in action</u></a>.
+
+## Why Fabric Catalog automated PII identification?
+
+The relevance of automating PII identification lies in the need to protect individuals' privacy and comply
+with various data protection regulations. Mishandling of, or unauthorized access to, PII can lead to severe consequences
+such as identity theft, financial fraud, and breaches of privacy. With the increasing volume of data being generated,
+manual identification of PII becomes impractical and error-prone.
+
+Additionally, a robust PII management solution is essential for organizations to establish and maintain
+a secure approach to handling sensitive information, fostering trust and adhering to legal requirements.
+
+## Why Fabric to manage dataset PII identification
+
+Besides automated PII identification, *Fabric Catalog* offers several key benefits in the context of data governance,
+privacy compliance, and overall data management, through automated data profiling and metadata management:
+
+### Compliance with Privacy Regulations
+
+Many countries and regions have stringent data protection regulations (such as GDPR, CCPA, or HIPAA)
+that require organizations to handle PII responsibly. A dedicated platform ensures that PII is correctly classified,
+helping organizations comply with legal requirements and avoid potential penalties.
+
+### Data Profiling for Accuracy
+
+Data profiling involves analyzing and understanding the structure and content of data. By incorporating data profiling
+capabilities into the platform, organizations can ensure accurate identification and classification of PII.
+This helps maintain the integrity of data and reduces the risk of misclassification.
+
+### Efficient Management of PII
+
+As the volume of data continues to grow, manually managing and editing PII classifications becomes impractical.
+A platform streamlines this process, making it more efficient and reducing the likelihood of errors.
+It allows organizations to keep track of PII across various datasets and systems.
+
+### Facilitating Data Governance
+
+Data governance involves establishing policies and processes to ensure high data quality, security, and compliance.
+A PII management solution enhances data governance efforts by providing a centralized hub for overseeing PII classifications,
+metadata, and related policies.
```
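The page above describes combining an NER model with rule-based pattern matching to flag PII. The rule-based half can be sketched in a few lines of plain Python; note this is purely illustrative (the `PII_PATTERNS` table, its regexes, and the `detect_pii` helper are hypothetical, not Fabric's implementation, which additionally runs an NER model for entities such as names and locations):

```python
import re

# Hypothetical example patterns; production PII rule sets are far more
# extensive and are paired with an NER model for free-text entities.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict:
    """Map each PII category to the matches found in `text` (omitted if none)."""
    return {
        name: pattern.findall(text)
        for name, pattern in PII_PATTERNS.items()
        if pattern.search(text)
    }

print(detect_pii("Reach Jane at jane.doe@example.com or +1 555 123 4567."))
```

Rule-based matching catches well-structured identifiers (emails, phone numbers, SSN-like strings) cheaply; the NER model the docs mention is what covers unstructured PII such as personal names, which regexes cannot reliably detect.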

docs/features/sensitive_data.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -56,3 +56,7 @@ pd.read_csv("filename.csv", dtype={"phone": str})
 Note that the type detection is hard. That is why
 [visions](https://github.com/dylan-profiler/visions), a type system to
 help developers solve these cases, was developed.
+
+## Automated PII classification & management
+
+You can find more details about this feature [here](pii_identification_management.md).
```
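The hunk above extends the sensitive-data page, whose `pd.read_csv("filename.csv", dtype={"phone": str})` tip deserves a runnable illustration. A minimal sketch (the CSV contents are made up for the example): forcing a sensitive column to `str` stops pandas from inferring it as numeric and, for instance, silently stripping leading zeros from phone numbers.

```python
import io

import pandas as pd

# Made-up CSV contents; with numeric type inference the leading
# zeros in the phone numbers would be lost.
csv_data = "name,phone\nAlice,0031612345678\nBob,0611111111\n"

inferred = pd.read_csv(io.StringIO(csv_data))                      # phone -> int64
as_text = pd.read_csv(io.StringIO(csv_data), dtype={"phone": str}) # phone -> str

print(inferred["phone"].tolist())  # integers, leading zeros gone
print(as_text["phone"].tolist())   # strings, leading zeros preserved
```

This matters for profiling because a column mangled at load time will be profiled (and PII-classified) incorrectly no matter how good the detector is.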

docs/index.md

Lines changed: 13 additions & 11 deletions
```diff
@@ -9,14 +9,15 @@ understanding and preparing data for analysis in a single line of code! If you'r
 
 !!! tip "Advent of Code - Get featured on ydata-profiling"
 
-    *“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to get more involved with open-source software, but no one’s given you an entry point?
+    *“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to
+    get more involved with open-source software, but no one’s given you an entry point?
 
     That's why we joined [The Advent of code this year](https://zilliz.com/advent-of-code). Contribute to ydata-profiling and win some 🐼🐼 swag!
 
     How can you be part of it?
 
     - Give us some love with a Github ⭐
-    - Write an article or create a tutorial like other [members the communit already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
+    - Write an article or create a tutorial like other [members the community already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
     - Feeling adventurous? Contribute with a PR. We have a list of [great issues to get you started.](https://github.com/ydataai/ydata-profiling/issues?q=label%3A%22getting+started+%E2%98%9D%22+)
 
 ![ydata-profiling report](_static/img/ydata-profiling.gif)
@@ -55,15 +56,16 @@ YData-profiling can be used to deliver a variety of different applications. The
 
 Check out the [free Community Version](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community).
 
-| Features & functionalities | Description |
-|----------------------------|-------------|
-| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple version of the same dataset |
-| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
-| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
-| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
-| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
-| [Customizing the report's appearance](features/custom_report_appearance.md) | Changing the appearance of the report's page and of the contained visualizations |
-| [Profiling Databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which allows to consume data from different types of storages such as RDBMs (Azure SQL, PostGreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
+| Features & functionalities | Description |
+|----------------------------|-------------|
+| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple versions of the same dataset |
+| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
+| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
+| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
+| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
+| [Customizing the report's appearance](features/custom_report_appearance.md) | Changing the appearance of the report's page and of the contained visualizations |
+| [Profiling Relational databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which allows you to consume data from different types of storages such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
+| [PII classification & management **](features/pii_identification_management.md) | Automated PII classification and management through a UI experience |
 
 ### Tutorials
 
```

mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -15,6 +15,7 @@ nav:
     - Dataset metadata: 'features/metadata.md'
     - Datasets catalog **: 'features/collaborative_data_profiling.md'
     - Sensitive data: 'features/sensitive_data.md'
+    - Automated PII classification & management **: 'features/pii_identification_management.md'
     - Time-series: 'features/time_series_datasets.md'
     - Comparing datasets: 'features/comparing_datasets.md'
     - Big data: 'features/big_data.md'
```
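Every entry added to the mkdocs `nav` must point at a page that exists under `docs/`, or the site build reports it as missing. A tiny hypothetical sanity check, not part of this commit or of mkdocs itself (the `missing_nav_targets` helper and the sample data are illustrative):

```python
def missing_nav_targets(nav, existing_pages):
    """Return nav targets that do not correspond to an existing docs page."""
    return [
        target
        for entry in nav
        for target in entry.values()
        if target not in existing_pages
    ]

# Simplified slice of the nav above; pretend the new PII page was
# referenced in mkdocs.yml but never committed under docs/.
nav = [
    {"Sensitive data": "features/sensitive_data.md"},
    {"Automated PII classification & management **": "features/pii_identification_management.md"},
]
existing_pages = {"features/sensitive_data.md"}

print(missing_nav_targets(nav, existing_pages))
```

This commit avoids that failure mode by adding the nav entry and the `docs/features/pii_identification_management.md` page in the same change.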
