Skip to content

fix(athena): ingest Iceberg table properties from $properties metatable#27715

Merged
harshsoni2024 merged 10 commits intomainfrom
athena_iceberg
Apr 28, 2026
Merged

fix(athena): ingest Iceberg table properties from $properties metatable#27715
harshsoni2024 merged 10 commits intomainfrom
athena_iceberg

Conversation

@ulixius9
Copy link
Copy Markdown
Member

@ulixius9 ulixius9 commented Apr 24, 2026

fix #27623

Summary

  • The Athena connector was reading table properties from Glue's Parameters map. For Iceberg tables that map only contains catalog pointers (table_type, metadata_location) — so native Iceberg properties (write.parquet.compression-codec, and any user-set keys like kpler.* written by PyIceberg/Spark/Airflow pipelines) never reached OpenMetadata.
  • Switch to SELECT key, value FROM "<schema>"."<table>$properties" for Iceberg tables. Non-Iceberg tables short-circuit and skip the extra query.
  • Sanitise property names: anything that isn't [A-Za-z0-9_] becomes __. Keep the original name as displayName so the UI still shows the real key.
  • MD5-hash names whose sanitised length exceeds 256 chars (the server-side limit).
  • Skip properties whose value is null or "".
  • Plumb table_type through get_table_extensions so non-Iceberg tables are a cheap no-op; remove the thread-local props context and the ineffective _processed_prop_lock (dedup under GIL is adequate since registration is idempotent).

Reproducing the bug before the fix

  1. Create an Iceberg table in Athena.
  2. Set a property through PyIceberg (lands in metadata.json only, not Glue Parameters):
    table.transaction().set_properties(**{"kpler.owner": "team-a"})
  3. SELECT * FROM "<table>$properties" → shows the property.
  4. aws glue get-table ... | jq .Parameters → does not show the property.
  5. Ingest via the Athena connector → property was missing from OM. With this change, it's picked up.

Test plan

  • 35 new unit tests in tests/unit/topology/database/test_athena.py covering:
    • early-exit branches (missing type ref, non-Iceberg, empty query result, all values filtered)
    • name sanitisation (dots, hyphens, mixed) and displayName preservation
    • 256-char boundary and MD5 hashing for longer names
    • null / empty-string value filtering; "0" and whitespace preserved
    • dedup across tables, and failure-path behaviour (failed registration isn't cached; one prop's failure doesn't block siblings)
    • _fetch_iceberg_properties return shape, exception handling, $properties metatable targeting, value coercion to string
  • Existing tests (test_query_table_names_and_types, test_query_table_names_and_types_iceberg, etc.) still pass
  • Manual end-to-end: reproduced the customer scenario (Iceberg-only props), ran ingestion, confirmed the extension field is populated with sanitised keys and original-key display names

🤖 Generated with Claude Code


Summary by Gitar

  • Configuration update:
    • Added includeCustomProperties toggle to DatabaseServiceMetadataPipeline schema to control ingestion of custom properties.
  • Sanitization change:
    • Updated PROPERTY_NAME_INVALID_CHARS_PATTERN to permit dots (.) and hyphens (-) in property names, preserving them instead of replacing them with double underscores.

This will update automatically on new commits.

Glue Parameters only carry Iceberg catalog pointers (table_type,
metadata_location) — they don't surface native Iceberg properties
like write.parquet.compression-codec or any user-set keys (e.g.
kpler.*) written by PyIceberg/Spark/Airflow. Those live inside
metadata.json and are exposed via Athena's <table>$properties
metatable.

- Switch get_table_extensions to query $properties for Iceberg
  tables; skip non-Iceberg tables to avoid wasted Athena queries
- Sanitise property names (non-alphanumeric/underscore → __),
  preserve the original name as displayName
- MD5-hash sanitised names longer than 256 chars
- Skip null and empty-string values
- Plumb table_type through get_table_extensions; remove the
  thread-local props context and the ineffective processed-prop
  lock (idempotent registration is fine under GIL)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 13:12
@ulixius9 ulixius9 requested a review from a team as a code owner April 24, 2026 13:12
@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Apr 24, 2026
Comment on lines +398 to +404
sanitized_name = PROPERTY_NAME_INVALID_CHARS_PATTERN.sub(
PROPERTY_NAME_REPLACEMENT, prop_name
)
if len(sanitized_name) > PROPERTY_NAME_MAX_LENGTH:
sanitized_name = hashlib.md5(
prop_name.encode("utf-8"), usedforsecurity=False
).hexdigest()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: Sanitized property names can collide, silently dropping values

The replacement strategy ([^A-Za-z0-9_]__) maps distinct property names to the same sanitized key. For example, kpler.owner and kpler-owner both become kpler__owner. When both appear in the same $properties result set:

  1. The first is registered and written to registered_properties.
  2. The second is seen as "already processed" (dedup) and its value silently overwrites the first in registered_properties.

This means the final value stored in OM is non-deterministic (depends on dict iteration order) and one property value is lost without any log message.

Consider appending a short hash suffix when a collision is detected, or at minimum logging a warning when two original names map to the same sanitized key.

Suggested fix:

# After computing sanitized_name, detect collision:
if sanitized_name in registered_properties:
    logger.warning(
        f"Sanitized name collision: [{prop_name}] maps to [{sanitized_name}] "
        f"which is already used by another property — skipping."
    )
    continue

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

Comment on lines +432 to +434
query = text(
f'SELECT key, value FROM "{schema_name}"."{table_name}$properties"'
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Security: SQL identifiers interpolated via f-string without escaping

_fetch_iceberg_properties builds the query with an f-string: f'SELECT key, value FROM "{schema_name}"."{table_name}$properties"'. If a schema or table name contains a double-quote character, the quoting breaks and arbitrary SQL could be injected. The risk is low because these names originate from the Glue catalogue (not direct user input), but defence-in-depth suggests escaping the identifiers.

SQLAlchemy provides quoted_name or the dialect's identifier_preparer.quote_identifier() for safe quoting. Alternatively, a simple schema_name.replace('"', '""') (standard SQL double-quote escaping) would suffice.

Suggested fix:

safe_schema = schema_name.replace('"', '""')
safe_table = table_name.replace('"', '""')
query = text(
    f'SELECT key, value FROM "{safe_schema}"."{safe_table}$properties"'
)

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the Athena ingestion connector to correctly ingest Iceberg table properties by querying Athena’s "<table>$properties" metatable (instead of relying on Glue Parameters / inspector options), and adds unit coverage for the new behavior.

Changes:

  • Add table_type plumbing into get_table_extensions so non-Iceberg tables can skip extension work cheaply.
  • Implement Iceberg $properties fetching + property-name sanitization/length handling in the Athena source.
  • Replace/expand Athena unit tests to cover early exits, sanitization, hashing, filtering, dedup, and query behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
ingestion/src/metadata/ingestion/source/database/common_db_source.py Passes table_type into get_table_extensions from yield_table to enable connector-specific fast paths.
ingestion/src/metadata/ingestion/source/database/athena/metadata.py Implements Iceberg $properties query, filters values, sanitizes/limits property names, and registers custom properties.
ingestion/tests/unit/topology/database/test_athena.py Adds extensive unit tests for Iceberg property ingestion behavior and name/value handling.

Comment on lines +398 to +405
sanitized_name = PROPERTY_NAME_INVALID_CHARS_PATTERN.sub(
PROPERTY_NAME_REPLACEMENT, prop_name
)
if len(sanitized_name) > PROPERTY_NAME_MAX_LENGTH:
sanitized_name = hashlib.md5(
prop_name.encode("utf-8"), usedforsecurity=False
).hexdigest()
if sanitized_name not in self._processed_prop:
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanitizing property names by replacing non [A-Za-z0-9_] chars with __ can cause collisions (e.g., a-b and a.b both become a__b). With the current _processed_prop/registered_properties keyed by sanitized_name, one of the original properties will be silently dropped/overwritten. Consider adding deterministic collision handling (e.g., append a short hash of the original key when a collision is detected) so all distinct Iceberg properties can be preserved.

Copilot uses AI. Check for mistakes.
Comment thread ingestion/src/metadata/ingestion/source/database/athena/metadata.py
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 24, 2026

🟡 Playwright Results — all passed (13 flaky)

✅ 3962 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 296 0 3 4
🟡 Shard 2 739 0 6 8
🟡 Shard 3 747 0 1 7
🟡 Shard 4 758 0 1 18
✅ Shard 5 687 0 0 41
🟡 Shard 6 735 0 2 8
🟡 13 flaky test(s) (passed on retry)
  • Flow/Tour.spec.ts › Tour should work from help section (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from welcome screen (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from URL directly (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values To Be Unique (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Quick filters should persist when domain filter is applied and cleared (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/Glossary/LargeGlossaryPerformance.spec.ts › should handle large number of glossary terms with pagination (shard 2, 1 retry)
  • Features/Workflows/WorkflowOssRestrictions.spec.ts › clicking a node in view mode opens read-only config sidebar (no save or delete buttons) (shard 3, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@ulixius9 ulixius9 requested a review from a team as a code owner April 27, 2026 09:14
@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 27, 2026

Jest test Coverage

UI tests summary

Lines Statements Branches Functions
Coverage: 61%
62.04% (61977/99885) 42.14% (33134/78616) 45.24% (9825/21713)

Copilot AI review requested due to automatic review settings April 27, 2026 13:01
Copilot AI review requested due to automatic review settings April 28, 2026 11:02
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 28, 2026

Code Review 👍 Approved with suggestions 0 resolved / 2 findings

Ingests Iceberg table properties via the $properties metatable. Address potential name collisions in the sanitization logic and replace f-string interpolation with parameterized queries to prevent SQL injection.

💡 Edge Case: Sanitized property names can collide, silently dropping values

📄 ingestion/src/metadata/ingestion/source/database/athena/metadata.py:398-404

The replacement strategy ([^A-Za-z0-9_]__) maps distinct property names to the same sanitized key. For example, kpler.owner and kpler-owner both become kpler__owner. When both appear in the same $properties result set:

  1. The first is registered and written to registered_properties.
  2. The second is seen as "already processed" (dedup) and its value silently overwrites the first in registered_properties.

This means the final value stored in OM is non-deterministic (depends on dict iteration order) and one property value is lost without any log message.

Consider appending a short hash suffix when a collision is detected, or at minimum logging a warning when two original names map to the same sanitized key.

Suggested fix
# After computing sanitized_name, detect collision:
if sanitized_name in registered_properties:
    logger.warning(
        f"Sanitized name collision: [{prop_name}] maps to [{sanitized_name}] "
        f"which is already used by another property — skipping."
    )
    continue
💡 Security: SQL identifiers interpolated via f-string without escaping

📄 ingestion/src/metadata/ingestion/source/database/athena/metadata.py:432-434

_fetch_iceberg_properties builds the query with an f-string: f'SELECT key, value FROM "{schema_name}"."{table_name}$properties"'. If a schema or table name contains a double-quote character, the quoting breaks and arbitrary SQL could be injected. The risk is low because these names originate from the Glue catalogue (not direct user input), but defence-in-depth suggests escaping the identifiers.

SQLAlchemy provides quoted_name or the dialect's identifier_preparer.quote_identifier() for safe quoting. Alternatively, a simple schema_name.replace('"', '""') (standard SQL double-quote escaping) would suffice.

Suggested fix
safe_schema = schema_name.replace('"', '""')
safe_table = table_name.replace('"', '""')
query = text(
    f'SELECT key, value FROM "{safe_schema}"."{safe_table}$properties"'
)
🤖 Prompt for agents
Code Review: Ingests Iceberg table properties via the $properties metatable. Address potential name collisions in the sanitization logic and replace f-string interpolation with parameterized queries to prevent SQL injection.

1. 💡 Edge Case: Sanitized property names can collide, silently dropping values
   Files: ingestion/src/metadata/ingestion/source/database/athena/metadata.py:398-404

   The replacement strategy (`[^A-Za-z0-9_]` → `__`) maps distinct property names to the same sanitized key. For example, `kpler.owner` and `kpler-owner` both become `kpler__owner`. When both appear in the same `$properties` result set:
   1. The first is registered and written to `registered_properties`.
   2. The second is seen as "already processed" (dedup) and its value silently overwrites the first in `registered_properties`.
   
   This means the final value stored in OM is non-deterministic (depends on dict iteration order) and one property value is lost without any log message.
   
   Consider appending a short hash suffix when a collision is detected, or at minimum logging a warning when two original names map to the same sanitized key.

   Suggested fix:
   # After computing sanitized_name, detect collision:
   if sanitized_name in registered_properties:
       logger.warning(
           f"Sanitized name collision: [{prop_name}] maps to [{sanitized_name}] "
           f"which is already used by another property — skipping."
       )
       continue

2. 💡 Security: SQL identifiers interpolated via f-string without escaping
   Files: ingestion/src/metadata/ingestion/source/database/athena/metadata.py:432-434

   `_fetch_iceberg_properties` builds the query with an f-string: `f'SELECT key, value FROM "{schema_name}"."{table_name}$properties"'`. If a schema or table name contains a double-quote character, the quoting breaks and arbitrary SQL could be injected. The risk is low because these names originate from the Glue catalogue (not direct user input), but defence-in-depth suggests escaping the identifiers.
   
   SQLAlchemy provides `quoted_name` or the dialect's `identifier_preparer.quote_identifier()` for safe quoting. Alternatively, a simple `schema_name.replace('"', '""')` (standard SQL double-quote escaping) would suffice.

   Suggested fix:
   safe_schema = schema_name.replace('"', '""')
   safe_table = table_name.replace('"', '""')
   query = text(
       f'SELECT key, value FROM "{safe_schema}"."{safe_table}$properties"'
   )

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

@harshsoni2024 harshsoni2024 merged commit 46390da into main Apr 28, 2026
78 of 81 checks passed
@harshsoni2024 harshsoni2024 deleted the athena_iceberg branch April 28, 2026 17:49
jatinmasaram pushed a commit to jatinmasaram/OpenMetadata that referenced this pull request May 2, 2026
jaya6400 pushed a commit to jaya6400/OpenMetadata that referenced this pull request May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Connector] Support Ingestion of Custom Iceberg Table Properties from Athena

4 participants