Skip to content

Refactor(ingestion): introduce ClassifiableEntityAdapter to eliminate scattered isinstance checks#27716

Open
edg956 wants to merge 2 commits intomainfrom
feat/refactor-entity-classification
Open

Refactor(ingestion): introduce ClassifiableEntityAdapter to eliminate scattered isinstance checks#27716
edg956 wants to merge 2 commits intomainfrom
feat/refactor-entity-classification

Conversation

@edg956
Copy link
Copy Markdown
Contributor

@edg956 edg956 commented Apr 24, 2026

Describe your changes:

Refactors the auto-classification pipeline to eliminate 6 scattered isinstance(entity, Table/Container) checks introduced in #26495. Replaces them with a ClassifiableEntityAdapter strategy pattern backed by a single O(1) registry.

What changed:

  • New sampler/entity_adapters.pyEntityAdapter ABC + TableAdapter, ContainerAdapter, and two dict registries (_BY_ENTITY, _BY_PIPELINE) with adapter_for() / adapter_for_pipeline() lookup functions
  • New sampler/config_utils.pybuild_database_service_conn_config() free function extracted from SamplerProcessor._copy_service_config()
  • sampler/processor.py — deleted _run_for_table, _run_for_container, _copy_service_config; replaced with single _run_for_entity() using adapter dispatch (~120 lines removed, ~30 added)
  • pii/base_processor.py_get_entity_columns now delegates to adapter_for(entity).get_columns(entity)
  • ometa/mixins/patch_mixin.py — deleted 4 isinstance helpers (_get_fields_for_entity, _prepare_table_destination, _prepare_container_destination, _prepare_destination_for_column_tags); replaced with single method using adapter
  • ingestion/sink/metadata_rest.py — deleted dead _patch_entity_column_tags singledispatch (both registered handlers were byte-identical); callsite replaced with direct metadata.patch_column_tags()
  • docs/auto-classification/add-support-for-another-entity.md — updated to reflect the adapter pattern

Why:
Adding DashboardDataModel (already referenced in pii/types.py) would have required touching all 6 sites. With this refactor it requires: one new EntityAdapter subclass + two registry entries + extend ClassifiableEntityType union + extend workflow/classification.py isinstance tuple. All sampling, PII, and patch logic picks it up automatically.

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Refactor(ingestion): introduce ClassifiableEntityAdapter to eliminate scattered isinstance checks
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added tests around the new logic.
  • For connector/ingestion changes: I updated the documentation.

@edg956 edg956 requested a review from a team as a code owner April 24, 2026 13:50
@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Apr 24, 2026
Comment on lines +497 to +500
from metadata.sampler.entity_adapters import adapter_for

adapter = adapter_for(table)
fields = adapter.patch_fields if adapter else ["tags"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: patch_column_tags can fetch entity with wrong fields when adapter is None

In patch_column_tags, when adapter_for(table) returns None (unsupported entity type), fields defaults to ["tags"]. The method then proceeds to fetch the entity and call _prepare_destination_for_column_tags, which independently calls adapter_for(table) again and returns None, causing a silent no-op. While not a crash, this is wasteful (fetches entity from API for nothing) and the redundant adapter_for lookup in _prepare_destination_for_column_tags is unnecessary — the adapter should be resolved once and passed through, or the method should bail out early when adapter is None.

Suggested fix:

adapter = adapter_for(table)
if adapter is None:
    logger.warning(
        "Unsupported entity type for column tag patching: %s",
        type(table).__name__,
    )
    return None
fields = adapter.patch_fields
# ...then pass adapter to _prepare_destination_for_column_tags
# instead of looking it up again inside that method

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

Comment on lines +159 to +167
_BY_ENTITY: Dict[type, EntityAdapter] = {
Table: TableAdapter(),
Container: ContainerAdapter(),
}

_BY_PIPELINE: Dict[type, EntityAdapter] = {
DatabaseServiceAutoClassificationPipeline: TableAdapter(),
StorageServiceAutoClassificationPipeline: ContainerAdapter(),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Duplicate adapter instances in _BY_ENTITY and _BY_PIPELINE

The module-level registries _BY_ENTITY and _BY_PIPELINE each instantiate their own TableAdapter() and ContainerAdapter(), resulting in 4 adapter objects when 2 would suffice. Since adapters are stateless singletons by design, consider creating a single instance of each and referencing it in both dictionaries. This also makes the relationship between the two registries more explicit.

Suggested fix:

_TABLE_ADAPTER = TableAdapter()
_CONTAINER_ADAPTER = ContainerAdapter()

_BY_ENTITY: Dict[type, EntityAdapter] = {
    Table: _TABLE_ADAPTER,
    Container: _CONTAINER_ADAPTER,
}

_BY_PIPELINE: Dict[type, EntityAdapter] = {
    DatabaseServiceAutoClassificationPipeline: _TABLE_ADAPTER,
    StorageServiceAutoClassificationPipeline: _CONTAINER_ADAPTER,
}

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 24, 2026

Code Review ⚠️ Changes requested 0 resolved / 2 findings

Refactored entity classification, but requires addressing potential data fetch errors in patch_column_tags when adapters are missing and eliminating redundant adapter instantiations in global registries.

⚠️ Bug: patch_column_tags can fetch entity with wrong fields when adapter is None

📄 ingestion/src/metadata/ingestion/ometa/mixins/patch_mixin.py:497-500 📄 ingestion/src/metadata/ingestion/ometa/mixins/patch_mixin.py:458-460

In patch_column_tags, when adapter_for(table) returns None (unsupported entity type), fields defaults to ["tags"]. The method then proceeds to fetch the entity and call _prepare_destination_for_column_tags, which independently calls adapter_for(table) again and returns None, causing a silent no-op. While not a crash, this is wasteful (fetches entity from API for nothing) and the redundant adapter_for lookup in _prepare_destination_for_column_tags is unnecessary — the adapter should be resolved once and passed through, or the method should bail out early when adapter is None.

Suggested fix
adapter = adapter_for(table)
if adapter is None:
    logger.warning(
        "Unsupported entity type for column tag patching: %s",
        type(table).__name__,
    )
    return None
fields = adapter.patch_fields
# ...then pass adapter to _prepare_destination_for_column_tags
# instead of looking it up again inside that method
💡 Quality: Duplicate adapter instances in _BY_ENTITY and _BY_PIPELINE

📄 ingestion/src/metadata/sampler/entity_adapters.py:159-167

The module-level registries _BY_ENTITY and _BY_PIPELINE each instantiate their own TableAdapter() and ContainerAdapter(), resulting in 4 adapter objects when 2 would suffice. Since adapters are stateless singletons by design, consider creating a single instance of each and referencing it in both dictionaries. This also makes the relationship between the two registries more explicit.

Suggested fix
_TABLE_ADAPTER = TableAdapter()
_CONTAINER_ADAPTER = ContainerAdapter()

_BY_ENTITY: Dict[type, EntityAdapter] = {
    Table: _TABLE_ADAPTER,
    Container: _CONTAINER_ADAPTER,
}

_BY_PIPELINE: Dict[type, EntityAdapter] = {
    DatabaseServiceAutoClassificationPipeline: _TABLE_ADAPTER,
    StorageServiceAutoClassificationPipeline: _CONTAINER_ADAPTER,
}
🤖 Prompt for agents
Code Review: Refactored entity classification, but requires addressing potential data fetch errors in `patch_column_tags` when adapters are missing and eliminating redundant adapter instantiations in global registries.

1. ⚠️ Bug: patch_column_tags can fetch entity with wrong fields when adapter is None
   Files: ingestion/src/metadata/ingestion/ometa/mixins/patch_mixin.py:497-500, ingestion/src/metadata/ingestion/ometa/mixins/patch_mixin.py:458-460

   In `patch_column_tags`, when `adapter_for(table)` returns `None` (unsupported entity type), `fields` defaults to `["tags"]`. The method then proceeds to fetch the entity and call `_prepare_destination_for_column_tags`, which independently calls `adapter_for(table)` again and returns `None`, causing a silent no-op. While not a crash, this is wasteful (fetches entity from API for nothing) and the redundant `adapter_for` lookup in `_prepare_destination_for_column_tags` is unnecessary — the adapter should be resolved once and passed through, or the method should bail out early when adapter is None.

   Suggested fix:
   adapter = adapter_for(table)
   if adapter is None:
       logger.warning(
           "Unsupported entity type for column tag patching: %s",
           type(table).__name__,
       )
       return None
   fields = adapter.patch_fields
   # ...then pass adapter to _prepare_destination_for_column_tags
   # instead of looking it up again inside that method

2. 💡 Quality: Duplicate adapter instances in _BY_ENTITY and _BY_PIPELINE
   Files: ingestion/src/metadata/sampler/entity_adapters.py:159-167

   The module-level registries `_BY_ENTITY` and `_BY_PIPELINE` each instantiate their own `TableAdapter()` and `ContainerAdapter()`, resulting in 4 adapter objects when 2 would suffice. Since adapters are stateless singletons by design, consider creating a single instance of each and referencing it in both dictionaries. This also makes the relationship between the two registries more explicit.

   Suggested fix:
   _TABLE_ADAPTER = TableAdapter()
   _CONTAINER_ADAPTER = ContainerAdapter()
   
   _BY_ENTITY: Dict[type, EntityAdapter] = {
       Table: _TABLE_ADAPTER,
       Container: _CONTAINER_ADAPTER,
   }
   
   _BY_PIPELINE: Dict[type, EntityAdapter] = {
       DatabaseServiceAutoClassificationPipeline: _TABLE_ADAPTER,
       StorageServiceAutoClassificationPipeline: _CONTAINER_ADAPTER,
   }

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@edg956 edg956 changed the title Refactor entity classification Refactor(ingestion): introduce ClassifiableEntityAdapter to eliminate scattered isinstance checks Apr 24, 2026
@sonarqubecloud
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown
Contributor

🔴 Playwright Results — 3 failure(s), 22 flaky

✅ 3947 passed · ❌ 3 failed · 🟡 22 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🔴 Shard 1 294 1 2 4
🔴 Shard 2 750 1 8 8
🟡 Shard 3 729 0 3 7
🟡 Shard 4 756 0 3 18
✅ Shard 5 687 0 0 41
🔴 Shard 6 731 1 6 8

Genuine Failures (failed on all attempts)

Pages/SearchSettings.spec.ts › Restore default search settings (shard 1)
Error: �[2mexpect(�[22m�[31mreceived�[39m�[2m).�[22mtoEqual�[2m(�[22m�[32mexpected�[39m�[2m) // deep equality�[22m

�[32m- Expected  - 0�[39m
�[31m+ Received  + 5�[39m

�[33m@@ -45,10 +45,15 @@�[39m
�[2m        "boost": 20,�[22m
�[2m        "field": "displayName.keyword",�[22m
�[2m        "matchType": "exact",�[22m
�[2m      },�[22m
�[2m      Object {�[22m
�[31m+       "boost": 20,�[39m
�[31m+       "field": "name.keyword",�[39m
�[31m+       "matchType": "exact",�[39m
�[31m+     },�[39m
�[31m+     Object {�[39m
�[2m        "boost": 10,�[22m
�[2m        "field": "name",�[22m
�[2m        "matchType": "phrase",�[22m
�[2m      },�[22m
�[2m      Object {�[22m
Features/DomainFilterQueryFilter.spec.ts › Domain filter should persist across page navigation (shard 2)
�[31mTest timeout of 180000ms exceeded.�[39m
Pages/Users.spec.ts › Check permissions for Data Steward (shard 6)
ReferenceError: getApiContext is not defined
🟡 22 flaky test(s) (passed on retry)
  • Flow/Tour.spec.ts › Tour should work from URL directly (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/DataProductRename.spec.ts › should rename data product and verify assets are still associated (shard 2, 1 retry)
  • Features/DataQuality/DataObservabilityGovernanceTab.spec.ts › standalone DQ dashboard still shows the filter bar (shard 2, 1 retry)
  • Features/DataQuality/TableLevelTests.spec.ts › Table Row Count To Be Between (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Assets from selected domain should be visible in explore page (shard 2, 1 retry)
  • Features/ExploreQuickFilters.spec.ts › tier with assigned asset appears in dropdown, tier without asset does not (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/Tasks/TaskComments.spec.ts › admin should be able to add comment to any task (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/SchemaTable.spec.ts › schema table test (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set & Update all CP types on apiCollection (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Database (shard 4, 1 retry)
  • Pages/Entity.spec.ts › User as Owner with unsorted list (shard 4, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Topic (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Spreadsheet (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Admin user can edit teams from the user profile (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant