fix(athena): ingest Iceberg table properties from $properties metatable#27715
fix(athena): ingest Iceberg table properties from $properties metatable#27715harshsoni2024 merged 10 commits intomainfrom
Conversation
Glue Parameters only carry Iceberg catalog pointers (table_type, metadata_location) — they don't surface native Iceberg properties like write.parquet.compression-codec or any user-set keys (e.g. kpler.*) written by PyIceberg/Spark/Airflow. Those live inside metadata.json and are exposed via Athena's <table>$properties metatable. - Switch get_table_extensions to query $properties for Iceberg tables; skip non-Iceberg tables to avoid wasted Athena queries - Sanitise property names (non-alphanumeric/underscore → __), preserve the original name as displayName - MD5-hash sanitised names longer than 256 chars - Skip null and empty-string values - Plumb table_type through get_table_extensions; remove the thread-local props context and the ineffective processed-prop lock (idempotent registration is fine under GIL) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| sanitized_name = PROPERTY_NAME_INVALID_CHARS_PATTERN.sub( | ||
| PROPERTY_NAME_REPLACEMENT, prop_name | ||
| ) | ||
| if len(sanitized_name) > PROPERTY_NAME_MAX_LENGTH: | ||
| sanitized_name = hashlib.md5( | ||
| prop_name.encode("utf-8"), usedforsecurity=False | ||
| ).hexdigest() |
There was a problem hiding this comment.
💡 Edge Case: Sanitized property names can collide, silently dropping values
The replacement strategy ([^A-Za-z0-9_] → __) maps distinct property names to the same sanitized key. For example, kpler.owner and kpler-owner both become kpler__owner. When both appear in the same $properties result set:
- The first is registered and written to
registered_properties. - The second is seen as "already processed" (dedup) and its value silently overwrites the first in
registered_properties.
This means the final value stored in OM is non-deterministic (depends on dict iteration order) and one property value is lost without any log message.
Consider appending a short hash suffix when a collision is detected, or at minimum logging a warning when two original names map to the same sanitized key.
Suggested fix:
# After computing sanitized_name, detect collision:
if sanitized_name in registered_properties:
logger.warning(
f"Sanitized name collision: [{prop_name}] maps to [{sanitized_name}] "
f"which is already used by another property — skipping."
)
continue
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| query = text( | ||
| f'SELECT key, value FROM "{schema_name}"."{table_name}$properties"' | ||
| ) |
There was a problem hiding this comment.
💡 Security: SQL identifiers interpolated via f-string without escaping
_fetch_iceberg_properties builds the query with an f-string: f'SELECT key, value FROM "{schema_name}"."{table_name}$properties"'. If a schema or table name contains a double-quote character, the quoting breaks and arbitrary SQL could be injected. The risk is low because these names originate from the Glue catalogue (not direct user input), but defence-in-depth suggests escaping the identifiers.
SQLAlchemy provides quoted_name or the dialect's identifier_preparer.quote_identifier() for safe quoting. Alternatively, a simple schema_name.replace('"', '""') (standard SQL double-quote escaping) would suffice.
Suggested fix:
safe_schema = schema_name.replace('"', '""')
safe_table = table_name.replace('"', '""')
query = text(
f'SELECT key, value FROM "{safe_schema}"."{safe_table}$properties"'
)
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
There was a problem hiding this comment.
Pull request overview
Updates the Athena ingestion connector to correctly ingest Iceberg table properties by querying Athena’s "<table>$properties" metatable (instead of relying on Glue Parameters / inspector options), and adds unit coverage for the new behavior.
Changes:
- Add
table_typeplumbing intoget_table_extensionsso non-Iceberg tables can skip extension work cheaply. - Implement Iceberg
$propertiesfetching + property-name sanitization/length handling in the Athena source. - Replace/expand Athena unit tests to cover early exits, sanitization, hashing, filtering, dedup, and query behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ingestion/src/metadata/ingestion/source/database/common_db_source.py |
Passes table_type into get_table_extensions from yield_table to enable connector-specific fast paths. |
ingestion/src/metadata/ingestion/source/database/athena/metadata.py |
Implements Iceberg $properties query, filters values, sanitizes/limits property names, and registers custom properties. |
ingestion/tests/unit/topology/database/test_athena.py |
Adds extensive unit tests for Iceberg property ingestion behavior and name/value handling. |
| sanitized_name = PROPERTY_NAME_INVALID_CHARS_PATTERN.sub( | ||
| PROPERTY_NAME_REPLACEMENT, prop_name | ||
| ) | ||
| if len(sanitized_name) > PROPERTY_NAME_MAX_LENGTH: | ||
| sanitized_name = hashlib.md5( | ||
| prop_name.encode("utf-8"), usedforsecurity=False | ||
| ).hexdigest() | ||
| if sanitized_name not in self._processed_prop: |
There was a problem hiding this comment.
Sanitizing property names by replacing non [A-Za-z0-9_] chars with __ can cause collisions (e.g., a-b and a.b both become a__b). With the current _processed_prop/registered_properties keyed by sanitized_name, one of the original properties will be silently dropped/overwritten. Consider adding deterministic collision handling (e.g., append a short hash of the original key when a collision is detected) so all distinct Iceberg properties can be preserved.
🟡 Playwright Results — all passed (13 flaky)✅ 3962 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 86 skipped
🟡 13 flaky test(s) (passed on retry)
How to debug locally# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace |
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
Code Review 👍 Approved with suggestions 0 resolved / 2 findingsIngests Iceberg table properties via the $properties metatable. Address potential name collisions in the sanitization logic and replace f-string interpolation with parameterized queries to prevent SQL injection. 💡 Edge Case: Sanitized property names can collide, silently dropping values📄 ingestion/src/metadata/ingestion/source/database/athena/metadata.py:398-404 The replacement strategy (
This means the final value stored in OM is non-deterministic (depends on dict iteration order) and one property value is lost without any log message. Consider appending a short hash suffix when a collision is detected, or at minimum logging a warning when two original names map to the same sanitized key. Suggested fix💡 Security: SQL identifiers interpolated via f-string without escaping📄 ingestion/src/metadata/ingestion/source/database/athena/metadata.py:432-434
SQLAlchemy provides Suggested fix🤖 Prompt for agentsOptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|



fix #27623
Summary
Parametersmap. For Iceberg tables that map only contains catalog pointers (table_type,metadata_location) — so native Iceberg properties (write.parquet.compression-codec, and any user-set keys likekpler.*written by PyIceberg/Spark/Airflow pipelines) never reached OpenMetadata.SELECT key, value FROM "<schema>"."<table>$properties"for Iceberg tables. Non-Iceberg tables short-circuit and skip the extra query.[A-Za-z0-9_]becomes__. Keep the original name asdisplayNameso the UI still shows the real key.nullor"".table_typethroughget_table_extensionsso non-Iceberg tables are a cheap no-op; remove the thread-local props context and the ineffective_processed_prop_lock(dedup under GIL is adequate since registration is idempotent).Reproducing the bug before the fix
metadata.jsononly, not GlueParameters):SELECT * FROM "<table>$properties"→ shows the property.aws glue get-table ... | jq .Parameters→ does not show the property.Test plan
tests/unit/topology/database/test_athena.pycovering:displayNamepreservation"0"and whitespace preserved_fetch_iceberg_propertiesreturn shape, exception handling,$propertiesmetatable targeting, value coercion to stringtest_query_table_names_and_types,test_query_table_names_and_types_iceberg, etc.) still passextensionfield is populated with sanitised keys and original-key display names🤖 Generated with Claude Code
Summary by Gitar
includeCustomPropertiestoggle toDatabaseServiceMetadataPipelineschema to control ingestion of custom properties.PROPERTY_NAME_INVALID_CHARS_PATTERNto permit dots (.) and hyphens (-) in property names, preserving them instead of replacing them with double underscores.This will update automatically on new commits.