Commit a9c4114

docs: add information about pii classification feature (#1517)
* Update duplicates_pandas.py (#1427): fixes Bug Report #1384, where a dataset with categorical features caused a memory error even on a tiny dataset.
* chore(actions): update sonarsource/sonarqube-scan-action action to v2.0.1
* chore(actions): update actions/checkout action to v4
* docs: setup new docs with mkdocs (#1418)
* chore(actions): update actions/checkout action to v4
* fix: remove the duplicated cardinality threshold under categorical and text settings
* fix: fixate matplotlib upper version
* docs: change from `zap` to `sparkles` (#1447)
* fix: template {{ file_name }} error in HTML wrapper (#1380): update javascript.html and style.html
* feat: add density histogram (#1458): add histogram density option, add a unit test, discard weights if they exceed max_bins
* docs: update README.html (#1461): update URLs of use cases, main integrations, and common issues
* fix: bug when creating a new report (#1440)
* fix: gen wordcloud only for non-empty cols (#1459)
* fix: table template ignoring text format (#1462): also fixes the timeseries unit test and code formatting
* fix: to_category mishandling pd.NA (#1464)
* docs: add 📊 for Key features (#1451), see also #1445 (comment)
* docs: fix hyperlink related to package name change (#1457)
* chore(deps): increase numpy upper limit (#1467): fixate numpy version for spark
* chore(deps): fix numba package version, and filter warns (#1468): skip isort linter on init
* chore(deps): update dependency typeguard to v4 (#1324)
* docs: update docs with advent of code
* docs: update links for fabric
* chore(actions): update actions/setup-python action to v5
* docs: add information about PII classification & management

Co-authored-by: boris-kogan <139680785+boris-kogan@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Vasco Ramos <vasco.ramos@ydata.ai>
Co-authored-by: ricardodcpereira <ricardo.pereira@ydata.ai>
Co-authored-by: Anselm Hahn <Anselm.Hahn@gmail.com>
Co-authored-by: Joge <87136119+jogecodes@users.noreply.github.com>
Co-authored-by: Alex Barros <alexbarros@users.noreply.github.com>
Co-authored-by: Miriam Seoane Santos <68821478+miriamspsantos@users.noreply.github.com>
Co-authored-by: Chris Mahoney <44449504+chrimaho@users.noreply.github.com>
Co-authored-by: Azory YData Bot <azory@ydata.ai>
Co-authored-by: martin-kokos <4807476+martin-kokos@users.noreply.github.com>
Co-authored-by: Martin Mokry <martin-kokos@users.noreply.github.com>
Co-authored-by: Maciej Bukczynski <maciej@darkhorseanalytics.com>
Co-authored-by: Fabiana Clemente <fabianaclemente@Fabianas-MacBook-Air.local>
1 parent 8d4d347 commit a9c4114

5 files changed

Lines changed: 85 additions & 16 deletions

docs/features/collaborative_data_profiling.md

Lines changed: 8 additions & 5 deletions
```diff
@@ -1,16 +1,19 @@
-# Data Catalog - A collaborative experience to profile datasets & relational databases
+# Data Catalog **
+A collaborative experience to profile datasets & relational databases
 
-!!! note "Data Catalog with data quality profiling"
+!!! info "** YData's Enterprise feature"
+
+    This feature is only available for users of [YData Fabric](https://ydata.ai).
 
-[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **data catalog**
-and **collaborative** experience for datasets and database profiling at scale!
+[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **Data catalog**
 
 [YData Fabric](https://ydata.ai/products/fabric) is a Data-Centric AI
 development platform. YData Fabric provides all capabilities of
 ydata-profiling in a hosted environment combined with a guided UI
 experience.
 
-[Fabric's Data Catalog](https://ydata.ai/products/data_catalog)
+[Fabric's Data Catalog](https://ydata.ai/products/data_catalog),
+a scalable and interactive version of ydata-profiling,
 provides a comprehensive and powerful tool designed to enable data
 professionals, including data scientists and data engineers, to manage
 and understand data within an organization. The Data Catalog act as a
```
docs/features/pii_identification_management.md

Lines changed: 59 additions & 0 deletions

```diff
@@ -0,0 +1,59 @@
+# Personally identifiable information (PII) identification & management **
+
+!!! info "** YData's Enterprise feature"
+
+    This feature is only available for users of [YData Fabric](https://ydata.ai).
+
+[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) and
+start your journey into **data management** with automated PII identification.
+
+Personally Identifiable Information **(PII)** refers to any information that can be used to identify an individual.
+This includes, but is not limited to, names, addresses, phone numbers, social security numbers, email addresses,
+and financial information. PII is crucial in today's digital age, where data is extensively collected, stored,
+and processed.
+
+[YData Fabric Data Catalog](https://ydata.ai/products/data_catalog), a scalable and interactive version of ydata-profiling,
+integrates into the data profiling experience an advanced machine learning solution based on a Named Entity Recognition (NER) model
+combined with traditional rule-based pattern identification, allowing PII to be detected efficiently.
+
+:fontawesome-brands-youtube:{ .youtube }
+<a href="https://www.youtube.com/clip/UgkxBntXvAvCQ6I39Cp2KZRD4Ug9-NPzG1o1"><u>See Fabric's Data Catalog PII identification in action</u></a>.
+
+## Why Fabric Catalog automated PII identification?
+
+The relevance of automating PII identification lies in the need to protect individuals' privacy and comply
+with various data protection regulations. Mishandling of, or unauthorized access to, PII can lead to severe consequences
+such as identity theft, financial fraud, and breaches of privacy. With the increasing volume of data being generated,
+manual identification of PII becomes impractical and error-prone.
+
+Additionally, a robust PII management solution is essential for organizations to establish and maintain
+a secure approach to handling sensitive information, fostering trust and adhering to legal requirements.
+
+## Why Fabric to manage dataset PII identification
+
+Besides automated PII identification, *Fabric Catalog* offers several key benefits in the context of data governance,
+privacy compliance, and overall data management, through automated data profiling and metadata management:
+
+### Compliance with Privacy Regulations
+
+Many countries and regions have stringent data protection regulations (such as GDPR, CCPA, or HIPAA)
+that require organizations to handle PII responsibly. A dedicated platform ensures that PII is correctly classified,
+helping organizations comply with legal requirements and avoid potential penalties.
+
+### Data Profiling for Accuracy
+
+Data profiling involves analyzing and understanding the structure and content of data. By incorporating data profiling
+capabilities into the platform, organizations can ensure accurate identification and classification of PII.
+This helps maintain the integrity of data and reduces the risk of misclassification.
+
+### Efficient Management of PII
+
+As the volume of data continues to grow, manually managing and editing PII classifications becomes impractical.
+A platform streamlines this process, making it more efficient and reducing the likelihood of errors.
+It allows organizations to keep track of PII across various datasets and systems.
+
+### Facilitating Data Governance
+
+Data governance involves establishing policies and processes to ensure high data quality, security, and compliance.
+A PII management solution enhances data governance efforts by providing a centralized hub for overseeing PII classifications,
+metadata, and related policies.
```
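The page above describes combining an NER model with rule-based pattern matching to flag PII. The rule-based half can be sketched in a few lines of plain Python; note this is purely illustrative (the `PII_PATTERNS` table, its regexes, and the `detect_pii` helper are hypothetical, not Fabric's implementation, which additionally runs an NER model for entities such as names and locations):

```python
import re

# Hypothetical example patterns; production PII rule sets are far more
# extensive and are paired with an NER model for free-text entities.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict:
    """Map each PII category to the matches found in `text` (omitted if none)."""
    return {
        name: pattern.findall(text)
        for name, pattern in PII_PATTERNS.items()
        if pattern.search(text)
    }

print(detect_pii("Reach Jane at jane.doe@example.com or +1 555 123 4567."))
```

Rule-based matching catches well-structured identifiers (emails, phone numbers, SSN-like strings) cheaply; the NER model the docs mention is what covers unstructured PII such as personal names, which regexes cannot reliably detect.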

docs/features/sensitive_data.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -56,3 +56,7 @@ pd.read_csv("filename.csv", dtype={"phone": str})
 Note that the type detection is hard. That is why
 [visions](https://github.com/dylan-profiler/visions), a type system to
 help developers solve these cases, was developed.
+
+## Automated PII classification & management
+
+You can find more details about this feature [here](pii_identification_management.md).
```
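The hunk above extends the sensitive-data page, whose `pd.read_csv("filename.csv", dtype={"phone": str})` tip deserves a runnable illustration. A minimal sketch (the CSV contents are made up for the example): forcing a sensitive column to `str` stops pandas from inferring it as numeric and, for instance, silently stripping leading zeros from phone numbers.

```python
import io

import pandas as pd

# Made-up CSV contents; with numeric type inference the leading
# zeros in the phone numbers would be lost.
csv_data = "name,phone\nAlice,0031612345678\nBob,0611111111\n"

inferred = pd.read_csv(io.StringIO(csv_data))                      # phone -> int64
as_text = pd.read_csv(io.StringIO(csv_data), dtype={"phone": str}) # phone -> str

print(inferred["phone"].tolist())  # integers, leading zeros gone
print(as_text["phone"].tolist())   # strings, leading zeros preserved
```

This matters for profiling because a column mangled at load time will be profiled (and PII-classified) incorrectly no matter how good the detector is.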

docs/index.md

Lines changed: 13 additions & 11 deletions
```diff
@@ -9,14 +9,15 @@ understanding and preparing data for analysis in a single line of code! If you'r
 
 !!! tip "Advent of Code - Get featured on ydata-profiling"
 
-    *“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to get more involved with open-source software, but no one’s given you an entry point?
+    *“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to
+    get more involved with open-source software, but no one’s given you an entry point?
 
     That's why we joined [The Advent of code this year](https://zilliz.com/advent-of-code). Contribute to ydata-profiling and win some 🐼🐼 swag!
 
     How can you be part of it?
 
     - Give us some love with a Github ⭐
-    - Write an article or create a tutorial like other [members the communit already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
+    - Write an article or create a tutorial like other [members the community already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
     - Feeling adventurous? Contribute with a PR. We have a list of [great issues to get you started.](https://github.com/ydataai/ydata-profiling/issues?q=label%3A%22getting+started+%E2%98%9D%22+)
 
 ![ydata-profiling report](_static/img/ydata-profiling.gif)
@@ -55,15 +56,16 @@ YData-profiling can be used to deliver a variety of different applications. The
 
 Check out the [free Community Version](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community).
 
-| Features & functionalities | Description |
-|----------------------------|-------------|
-| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple version of the same dataset |
-| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
-| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
-| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
-| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
-| [Customizing the report's appearance](features/custom_report_appearance.md) | Changing the appearance of the report's page and of the contained visualizations |
-| [Profiling Databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which allows to consume data from different types of storages such as RDBMs (Azure SQL, PostGreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
+| Features & functionalities | Description |
+|----------------------------|-------------|
+| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple versions of the same dataset |
+| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
+| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
+| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
+| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
+| [Customizing the report's appearance](features/custom_report_appearance.md) | Changing the appearance of the report's page and of the contained visualizations |
+| [Profiling Relational databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which allows you to consume data from different types of storages such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
+| [PII classification & management **](features/pii_identification_management.md) | Automated PII classification and management through a UI experience |
 
 ### Tutorials
 
```

mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -15,6 +15,7 @@ nav:
     - Dataset metadata: 'features/metadata.md'
     - Datasets catalog **: 'features/collaborative_data_profiling.md'
     - Sensitive data: 'features/sensitive_data.md'
+    - Automated PII classification & management **: 'features/pii_identification_management.md'
     - Time-series: 'features/time_series_datasets.md'
     - Comparing datasets: 'features/comparing_datasets.md'
     - Big data: 'features/big_data.md'
```
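Every entry added to the mkdocs `nav` must point at a page that exists under `docs/`, or the site build reports it as missing. A tiny hypothetical sanity check, not part of this commit or of mkdocs itself (the `missing_nav_targets` helper and the sample data are illustrative):

```python
def missing_nav_targets(nav, existing_pages):
    """Return nav targets that do not correspond to an existing docs page."""
    return [
        target
        for entry in nav
        for target in entry.values()
        if target not in existing_pages
    ]

# Simplified slice of the nav above; pretend the new PII page was
# referenced in mkdocs.yml but never committed under docs/.
nav = [
    {"Sensitive data": "features/sensitive_data.md"},
    {"Automated PII classification & management **": "features/pii_identification_management.md"},
]
existing_pages = {"features/sensitive_data.md"}

print(missing_nav_targets(nav, existing_pages))
```

This commit avoids that failure mode by adding the nav entry and the `docs/features/pii_identification_management.md` page in the same change.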
