Commit 88c4450
feat: Add auto-classification support for storage service containers (open-metadata#26495)
* Add schema support for container auto-classification
Extend container entity schema to support sample data storage, enabling
PII detection and classification workflows on storage service containers.
Changes:
- Add sampleData field to container.json for storing sample data
- Create storageServiceAutoClassificationPipeline.json schema defining
configuration for storage service auto-classification pipelines
- Update workflow.json to include StorageServiceAutoClassificationPipeline
as a supported pipeline type
This provides the schema foundation for running auto-classification
workflows on S3, GCS, and other storage service containers.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add backend support for container sample data and classification
Implement Java backend functionality to handle sample data ingestion,
storage, and PII masking for container entities.
Changes:
- ContainerRepository: Add sample data retrieval and storage operations
- EntityRepository: Extend sample data support to container entities
- ContainerResource: Add REST endpoint for container sample data ingestion
- PIIMasker: Extend PII masking to support container entities
This enables the backend to process and store sample data from storage
service containers and apply PII masking rules during data retrieval.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Extend classifiable entity types to include containers
Add Container to the ClassifiableEntityType union, enabling PII detection
and auto-classification workflows to process storage service containers
alongside database tables.
Changes:
- Update ClassifiableEntityType from Table-only to Union[Table, Container]
- Import Container entity type
- Update module docstring to reflect current support
This type extension allows the PII processor to handle both database
tables and storage containers uniformly.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add container sample data ingestion to OpenMetadata API
Implement container-specific API mixin for sample data operations and
integrate it into the main OpenMetadata client.
Changes:
- Add OMetaContainerMixin with ingest_container_sample_data method
- Handle binary data encoding (base64) and serialization errors
- Register mixin in OpenMetadata class hierarchy
- Mirror table sample data ingestion patterns for consistency
This provides the Python API layer for ingesting sample data from
storage service containers into OpenMetadata.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
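The binary-data handling mentioned in the entry above can be sketched roughly as follows. This is a minimal illustration, not the actual `ingest_container_sample_data` implementation: the helper names `encode_cell` and `serialize_rows` are hypothetical, and the error handling is an assumption about what "handle serialization errors" might look like.

```python
import base64
import json
from typing import Any, List


def encode_cell(value: Any) -> Any:
    """Base64-encode binary values so they survive JSON serialization."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    return value


def serialize_rows(rows: List[List[Any]]) -> str:
    """Serialize sample rows to JSON, encoding binary cells first.

    Hypothetical helper: the real mixin posts the payload to the
    OpenMetadata REST API instead of returning a string.
    """
    encoded = [[encode_cell(cell) for cell in row] for row in rows]
    try:
        return json.dumps(encoded)
    except (TypeError, ValueError) as exc:
        # Assumed behaviour: report the failure rather than sending bad data.
        raise ValueError(f"Could not serialize sample data: {exc}") from exc
```

Encoding before serialization keeps the payload valid JSON even when a sampled file contains raw bytes.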
* Implement storage service samplers for S3 and GCS
Add sampler implementations for storage services to extract sample data
from structured containers (Parquet, CSV) for auto-classification.
Changes:
- Create base StorageSamplerInterface for storage service sampling
- Implement S3Sampler for AWS S3 containers with structured file support
- Implement GCSSampler for Google Cloud Storage containers
- Support column extraction and data sampling for structured formats
- Handle dataModel-based column definitions from containers
Storage samplers read container metadata, fetch file contents, and
generate sample datasets for downstream PII detection.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update PII processor to support container entities
Extend the base PII processor to handle both Table and Container
entities with unified column extraction logic.
Changes:
- Add _get_entity_columns helper to extract columns from Table or Container
- Handle Container entities with optional dataModel.columns structure
- Improve column matching with safe fallback for missing columns
- Use generic entity reference in error reporting
- Add early return when entity has no columns to process
This enables PII detection to run on storage containers the same way
it processes database tables.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
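A rough sketch of the unified column extraction described in the entry above. The dataclasses are simplified stand-ins for the generated OpenMetadata entity models, and `get_entity_columns` only approximates what the `_get_entity_columns` helper is described as doing; the real signature may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Column:
    name: str


@dataclass
class Table:
    # Tables carry their columns directly.
    columns: List[Column] = field(default_factory=list)


@dataclass
class DataModel:
    columns: List[Column] = field(default_factory=list)


@dataclass
class Container:
    # Containers only have columns when a dataModel is present.
    dataModel: Optional[DataModel] = None


def get_entity_columns(entity: Union[Table, Container]) -> List[Column]:
    """Return the entity's columns, or an empty list when none exist."""
    if isinstance(entity, Table):
        return entity.columns or []
    data_model = entity.dataModel
    if data_model is None:
        return []
    return data_model.columns or []
```

An empty result lets the caller return early instead of running PII detection on an entity with nothing to classify.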
* Add storage service support to sampler processor
Extend the sampler processor to handle both database and storage service
entities with appropriate sampler class selection.
Changes:
- Detect service type from source config (Database vs Storage)
- Import StorageServiceAutoClassificationPipeline
- Handle both Table and Container entity types in _run method
- Add column validation for Container entities (via dataModel.columns)
- Create storage-specific sampler interfaces for S3 and GCS
- Update sampler_interface to support Container entities
- Improve error messages with entity type context
The processor now dynamically selects database or storage samplers based
on the pipeline configuration type.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add storage fetcher strategy for container classification
Implement fetcher strategy pattern for storage services to retrieve
containers for auto-classification workflows.
Changes:
- Add StorageFetcherStrategy to handle storage service entity fetching
- Update EntityFetcher to select appropriate strategy based on service type
- Support both DatabaseService and StorageService in strategy selection
- Import StorageService type for service detection
- Improve error messages with specific service type information
The fetcher now dynamically creates database or storage-specific
strategies to retrieve entities based on pipeline configuration.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
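The strategy selection described in the entry above might look something like this. It is a sketch under assumptions: `DatabaseService` and `StorageService` are empty stubs for the real entity types, `DatabaseFetcherStrategy` and `select_strategy` are illustrative names, and the strategies yield plain strings rather than real entities.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class DatabaseService:
    """Stub standing in for the real DatabaseService entity."""


class StorageService:
    """Stub standing in for the real StorageService entity."""


class FetcherStrategy(ABC):
    def __init__(self, entity_names: Iterable[str]) -> None:
        self.entity_names: List[str] = list(entity_names)

    @abstractmethod
    def fetch(self) -> Iterator[str]:
        """Yield the entities to classify."""


class DatabaseFetcherStrategy(FetcherStrategy):
    def fetch(self) -> Iterator[str]:
        for name in self.entity_names:
            yield f"table:{name}"


class StorageFetcherStrategy(FetcherStrategy):
    def fetch(self) -> Iterator[str]:
        for name in self.entity_names:
            yield f"container:{name}"


def select_strategy(service: object, entity_names: Iterable[str]) -> FetcherStrategy:
    """Pick a fetcher strategy from the service type (hypothetical helper)."""
    if isinstance(service, DatabaseService):
        return DatabaseFetcherStrategy(entity_names)
    if isinstance(service, StorageService):
        return StorageFetcherStrategy(entity_names)
    # Name the offending type so the error message is specific.
    raise NotImplementedError(
        f"No fetcher strategy for service type {type(service).__name__}"
    )
```

Keeping the selection in one function means a new service type needs one new strategy class and one new branch.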
* Register auto-classification pipeline in storage service specs
Add AutoClassification pipeline support to S3 and GCS storage service
specifications, enabling UI and workflow registration.
Changes:
- Add AutoClassification to S3ServiceSpec supported pipelines
- Add AutoClassification to GCSServiceSpec supported pipelines
- Import StorageServiceAutoClassificationPipeline in both specs
This registers the auto-classification workflow type for storage
services in the ingestion framework's service registry.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add container support to metadata sink and patch operations
Extend metadata sink and patch mixin to handle container entities,
enabling sample data ingestion and tag updates for containers.
Changes:
- Add Container to MetadataRestSink entity type handling
- Implement container sample data ingestion in sink._run
- Add Container to PatchMixin tag operations
- Import Container entity type in both modules
This completes the metadata ingestion pipeline by allowing the sink
to persist sample data and classification tags for container entities.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update classification workflow for storage service support
Extend the auto-classification workflow to handle both database and
storage service pipelines with unified step orchestration.
Changes:
- Import StorageServiceAutoClassificationPipeline
- Add type checking for both Database and Storage pipeline configs
- Remove unnecessary cast, use direct type checks
- Add validation warning for unsupported config types
- Preserve enableAutoClassification flag behavior for both types
The workflow now supports running PII detection and classification
on both database tables and storage containers based on config type.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add unit tests for container classification components
Add test coverage for container-specific fetcher and sampler components.
Changes:
- Add test_container_fetcher.py for StorageFetcherStrategy tests
- Add test_container_sampler_processor.py for container sampler tests
Tests validate:
- Storage service fetcher strategy selection and instantiation
- Container sampler processor initialization and execution
- Proper handling of Container entities vs Table entities
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Reorganize integration tests by entity type
Restructure auto-classification integration tests into separate
directories for databases and containers to improve organization.
Changes:
- Move database classification tests to databases/ subdirectory
- Move conftest.py, init.sql, and test_tag_processor.py into databases/
- Container tests already organized in containers/ subdirectory
- Remove old flat test structure
This organization makes it clearer which tests target database entities
vs storage container entities in classification workflows.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Properly retrieve sample data
* Update generated TypeScript types
* Apply Gitar bot suggestions
* Fix tests
* feat: Add supportsProfiler to storage connection schemas
Add supportsProfiler field to storage connection schemas (S3, GCS, ADLS,
Custom Storage) to enable auto-classification pipeline support for storage
services. This aligns with the backend changes in PR open-metadata#26495 that added
container auto-classification functionality.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* feat: Add UI support for storage service auto-classification
- Update IngestionWorkflowUtils to route storage services to storage-specific
auto-classification schema
- Modify getSupportedPipelineTypes to filter pipeline types based on service
category (storage services only show AutoClassification, not Profiler)
- Update AddIngestionButton to pass serviceCategory parameter
- Add unit test to verify storage services only get AutoClassification option
This enables users to configure and run auto-classification agents on storage
services (S3, GCS, ADLS) for PII detection on containers.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: Add BucketArn field to S3BucketResponse model
AWS S3 API now returns a BucketArn field in list_buckets() responses.
Add this optional field to prevent Pydantic extra_forbidden validation errors.
Error: BucketArn Extra inputs are not permitted [type=extra_forbidden]
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: Add Container permissions to AutoClassificationBotPolicy
Add Container entity permissions to AutoClassificationBotPolicy to allow the
autoClassification-bot to apply tags and sample data to storage containers.
Previously, the bot only had permissions for Table entities, causing
permission denied errors when running auto-classification on storage services.
Changes:
- Add Container rule with EditAll and ViewAll operations to policy seed data
- Create migrations for MySQL and PostgreSQL to update existing installations
Error fixed: Principal: CatalogPrincipal{name='autoclassification-bot'}
operations [EditTags] not allowed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update generated TypeScript types
* fix: Add fallback for storage service type detection in sampler
Add fallback logic to detect storage services by source type name when
the pipeline config type check fails. This handles cases where the Airflow
environment might not have the updated schema/package with
StorageServiceAutoClassificationPipeline.
Changes:
- Add fallback detection for s3, gcs, azuredatalake, customstorage
- Add debug logging for service type detection
- Preserve primary isinstance check for proper type detection
This fixes the "No module named 'metadata.ingestion.source.database.gcs'"
error when running storage auto-classification pipelines.
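The fallback described above can be sketched as a two-step check. The class below is an empty stub for the real `StorageServiceAutoClassificationPipeline`, and `is_storage_service` is a hypothetical function name; only the four source type names come from the commit message.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Source type names listed in the commit message.
STORAGE_SOURCE_TYPES = {"s3", "gcs", "azuredatalake", "customstorage"}


class StorageServiceAutoClassificationPipeline:
    """Stub standing in for the generated pipeline config model."""


def is_storage_service(pipeline_config: Any, source_type: str) -> bool:
    """Detect a storage service, falling back to the source type name."""
    # Primary check: the pipeline config has the expected type.
    if isinstance(pipeline_config, StorageServiceAutoClassificationPipeline):
        logger.debug("Storage service detected from pipeline config type")
        return True
    # Fallback: an environment with an outdated package may parse the config
    # as a different model, so compare the source type name instead.
    if source_type.lower() in STORAGE_SOURCE_TYPES:
        logger.debug("Storage service detected from source type %r", source_type)
        return True
    logger.debug("Treating source type %r as a database service", source_type)
    return False
```

With this check, a `gcs` source is routed to the storage samplers even when the config type check fails, instead of being looked up under the database sources.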
* Guide to support new entities in classification agent
* docs: Update auto-classification guide with debugging learnings
Add critical troubleshooting information discovered during container
classification debugging:
1. storeSampleData defaults to false
- Sample data NOT ingested unless explicitly enabled
- Document why this is by design (avoid large datasets)
- Add troubleshooting steps to verify flag is set
2. Service type detection fallback pattern
- Explain why fallback is needed (Airflow package caching)
- Show complete implementation with source type lists
- Add debug logging pattern
3. Troubleshooting section
- Sample data not appearing: check storeSampleData, database, logs
- Module import errors: service type detection issues
- PII tags not applied: config and data issues
4. Common pitfalls additions
- Emphasize storeSampleData default value
- Service type detection in cached environments
These updates reflect real debugging scenarios and will help future
developers avoid the same issues.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
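The first troubleshooting point above comes down to a flag that is off by default. A minimal sketch, assuming a simplified config object: `SamplerConfig` and `rows_to_ingest` are hypothetical names, and only the `storeSampleData` flag and its `false` default come from the commit message.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class SamplerConfig:
    """Hypothetical, simplified stand-in for the pipeline configuration."""

    # Off by default, so large datasets are not stored unless requested.
    storeSampleData: bool = False


def rows_to_ingest(
    config: SamplerConfig, sampled_rows: List[List[Any]]
) -> Optional[List[List[Any]]]:
    """Return the rows to persist, or None when storage is disabled."""
    if not config.storeSampleData:
        return None
    return sampled_rows
```

If sample data does not appear after a run, checking that this flag is set explicitly is the first step.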
* Apply gitar bot suggestions
* Fix suggestions, linting, and SonarCloud issues
* More gitar bot suggestions
* Fix compile error
* Fix linting
* Fix broken tests
* Fix unorganized import
* Improve config parsing
This lets us correctly resolve the polymorphic properties of `source` when the config does not provide enough fields for Pydantic to discriminate between models (e.g., mistaking a database source config for a storage source config).
* Gitar bot comment
* Fix s3 source test
* Apply comments from reviews
* Extract candidate column logic in samplers
* Fix tests
* Fix container customization test
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 1c8b300 commit 88c4450
84 files changed
Lines changed: 5064 additions & 102 deletions
File tree
- bootstrap/sql/migrations/native/1.13.0
- mysql
- postgres
- docs/auto-classification
- ingestion
- src/metadata
- ingestion
- api
- ometa
- mixins
- sink
- source/storage
- gcs
- s3
- pii
- profiler/source
- fetcher
- sampler
- storage
- gcs
- s3
- workflow
- tests
- integration/auto_classification
- containers
- databases
- unit
- observability/profiler
- sampler
- topology/storage
- openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests
- openmetadata-service/src/main
- java/org/openmetadata/service
- governance/workflows/elements/nodes/automatedTask/createAndRunIngestionPipeline
- jdbi3
- resources/storages
- security/mask
- resources/json/data/policy
- openmetadata-spec/src/main/resources/json/schema
- entity
- data
- services/connections/storage
- metadataIngestion
- openmetadata-ui/src/main/resources/ui
- playwright/constant
- src
- components
- Database/SampleDataTable
- Settings/Services/Ingestion
- generated
- api
- automations
- services
- ingestionPipelines
- entity
- automations
- data
- services
- connections
- database
- pipeline
- storage
- ingestionPipelines
- metadataIngestion
- pages/ContainerPage
- rest
- utils