fix(databricks): capture nested column descriptions and fix SQLAlchemy 2.x compat #27766
Conversation
…y 2.x

Two issues addressed in the Databricks legacy connector:

1. Nested struct/array field descriptions were never captured. Spark's `DataType.simpleString` output (which `DESCRIBE TABLE` emits) strips nested COMMENT clauses, so the field-level descriptions users define on STRUCT members never reached OpenMetadata. Switch to `DESCRIBE TABLE EXTENDED <table> AS JSON` (Databricks Runtime 16.4+), which returns structured per-field metadata including comments. On older runtimes the AS JSON query errors and the connector silently degrades to top-level-only descriptions — no regression.
2. `result.values()` removed in SQLAlchemy 2.x. The `Row.values()` method was removed; attribute access on a Row now falls back to column lookup, so the existing calls failed with "Could not locate column in row for column 'values'", silently dropping schema and table descriptions. Replaced with `tuple(result)` at all three call sites (`get_table_comment`, `get_schema_description`, `get_table_description`).

Adds 20 unit tests covering the JSON walker (struct nesting, array<struct> path semantics, map skipping, missing/malformed payloads), the query wrapper's failure paths (mocked), and the description applier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR improves the Databricks legacy ingestion connector by (1) extracting and applying nested field comments for complex columns using Databricks’ DESCRIBE ... AS JSON payloads, and (2) fixing SQLAlchemy 2.x incompatibilities that were preventing schema/table comments from being captured.
Changes:
- Add a JSON-based nested-comment extraction path (`DESCRIBE TABLE EXTENDED ... AS JSON`) with graceful fallback when unsupported.
- Apply extracted nested comments onto the parsed OpenMetadata `Column` tree for complex types.
- Replace `Row.values()` usages with `tuple(row)` to restore compatibility with SQLAlchemy 2.x row behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `ingestion/src/metadata/ingestion/source/database/databricks/metadata.py` | Adds nested comment extraction + application logic and updates SA row handling for comments. |
| `ingestion/tests/unit/topology/database/test_databricks_nested_comments.py` | Adds unit coverage for the JSON walker, failure paths, and description application onto nested columns. |
🟡 Playwright Results — all passed (13 flaky): ✅ 3985 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 86 skipped

🟡 13 flaky test(s) (passed on retry)

How to debug locally: download the `playwright-test-results-<shard>` artifact, unzip it, then run `npx playwright show-trace path/to/trace.zip` to view a trace.
…ow compat

Per reviewer feedback, two new test groups guard against silent regressions:

- `TestDescribeJsonLazyFetch`: array<struct> triggers the AS JSON fetch via the regex gate, while array<primitive> and map skip it. Locks in the `^array\s*<\s*struct\b` regex behavior so a future tweak can't silently drop nested descriptions on array-of-struct columns.
- `TestSqlAlchemy2RowCompat`: `get_table_comment` must work on an SA 2.x-style Row (no `.values()` method). Uses a tuple subclass that raises on `.values()` so any revert from `tuple(result)` to `result.values()` fails the test instead of silently dropping schema/table descriptions on SA 2.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
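The regex gate described above can be exercised directly. Assuming the pattern is equivalent to `^array\s*<\s*struct\b`, as the commit message states (the `needs_nested_fetch` wrapper below is illustrative, not the connector's actual function name), a quick check shows which raw type strings would trigger the lazy `AS JSON` fetch:

```python
import re

# Pattern quoted in the commit message: only array-of-struct columns
# should trigger the DESCRIBE ... AS JSON fetch.
ARRAY_OF_STRUCT = re.compile(r"^array\s*<\s*struct\b")


def needs_nested_fetch(raw_type: str) -> bool:
    """Gate the expensive AS JSON fetch on array<struct<...>> columns."""
    return ARRAY_OF_STRUCT.match(raw_type.strip()) is not None


print(needs_nested_fetch("array<struct<a: int, b: string>>"))  # True
print(needs_nested_fetch("array<int>"))                        # False
print(needs_nested_fetch("map<string, struct<x: int>>"))       # False
```

The `\b` after `struct` keeps the gate from matching types like `array<structish_udt>`, while `\s*` tolerates whitespace around the angle bracket.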
Code Review ✅ Approved — 2 resolved / 2 findings

Nested column descriptions are now captured and SQLAlchemy 2.x compatibility is restored. The redundant DESCRIBE AS JSON queries have been removed, and the null-guard for schema names is now covered by tests.

✅ 2 resolved:
- ✅ Performance: DESCRIBE AS JSON query runs for every table, even without complex columns
- ✅ Quality: New null-guard for db_name/schema lacks test coverage
```python
def _fetch_nested_descriptions_via_describe_json(
    connection,
    db_name: str | None,
    schema: str | None,
    table_name: str,
) -> dict[str, dict[tuple[str, ...], str]]:
    """Run ``DESCRIBE TABLE EXTENDED <fqn> AS JSON`` and return a per-column
    map of ``{field_path_tuple: comment}``.

    ``DESCRIBE ... AS JSON`` is supported on Databricks Runtime 16.4+ and
    returns a structured payload with ``columns[].type.fields[].comment`` on
    nested struct fields — the only SQL path that exposes nested COMMENTs
    (Spark's regular ``simpleString`` output strips them).

    Returns an empty dict on any failure (older runtime, JSON parse error,
    schema variation, or missing db/schema) so the caller cleanly degrades
    to top-level-only descriptions.
    """
    if not db_name or not schema:
        return {}
    try:
        result = connection.execute(
            text(f"DESCRIBE TABLE EXTENDED `{db_name}`.`{schema}`.`{table_name}` AS JSON")
        ).fetchone()
        if not result or not result[0]:
            return {}
        payload = json.loads(result[0])
    except Exception as err:  # pylint: disable=broad-except
        logger.debug(
            f"DESCRIBE AS JSON unavailable or unparseable for "
            f"{db_name}.{schema}.{table_name}: {err}"
        )
        return {}

    return _build_column_descriptions_map(payload)
```
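For context, a walker matching the docstring's contract might look like the sketch below. The `columns[].type.fields[].comment` shape comes from the docstring; the `element_type` key for array element types and the exact nesting are assumptions, and the real `_build_column_descriptions_map` in the PR may differ:

```python
from typing import Any


def build_column_descriptions_map(
    payload: dict[str, Any],
) -> dict[str, dict[tuple[str, ...], str]]:
    """Return ``{column_name: {field_path_tuple: comment}}`` for nested fields."""
    result: dict[str, dict[tuple[str, ...], str]] = {}

    def walk(
        type_obj: Any,
        path: tuple[str, ...],
        acc: dict[tuple[str, ...], str],
    ) -> None:
        if not isinstance(type_obj, dict):
            return
        # struct: recurse into each named field, collecting comments
        for field in type_obj.get("fields", []) or []:
            fname = field.get("name")
            if not fname:
                continue
            fpath = path + (fname,)
            comment = field.get("comment")
            if comment:
                acc[fpath] = comment
            walk(field.get("type", {}), fpath, acc)
        # array<struct<...>>: descend into the element type without
        # extending the path, so element struct members attach directly
        # to the column (the "array<struct> path semantics" in the tests)
        element_type = type_obj.get("element_type")
        if element_type:
            walk(element_type, path, acc)
        # map types carry no "fields"/"element_type" here, so they are
        # skipped naturally — matching the "map skipping" tests

    for column in payload.get("columns", []) or []:
        acc: dict[tuple[str, ...], str] = {}
        walk(column.get("type", {}), (), acc)
        if acc:
            result[column["name"]] = acc
    return result
```

The per-column `{path_tuple: comment}` shape lets the caller attach descriptions by walking the parsed `Column` tree with the same field paths.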
If we are using `DESCRIBE TABLE EXTENDED {db_name}.{schema}.{table_name} AS JSON` (which I think is great, to be honest), we should use this opportunity to adopt it in all the places where we currently run `DESCRIBE TABLE EXTENDED`.
We should also use this opportunity to call that query only once. From the logs, it seems we are running it 4 times per table.
We should run it once, parse the result, and cache it until the table processing is done.
One good thing we should do is keep the regular query as a backup and parse it the same way.
I would create a new pydantic class or pydantic dataclass that knows how to construct itself from the result of both the JSON and the regular query.
It would be interesting to explore whether we could use slots and set `frozen` for correctness and small performance gains; some docs:
https://wiki.python.org/moin/UsingSlots
https://pydantic.dev/docs/validation/latest/api/pydantic/config/#pydantic.config.ConfigDict.frozen
https://pydantic.dev/docs/validation/latest/api/pydantic/dataclasses/
```python
@reflection.cache
def get_columns(self, connection, table_name, schema=None, **kw):
```
This method is fairly big; it could be worth extracting some parts of it to make it easier to read.
```python
def _build_column_descriptions_map(
    payload: Any,
```
We should avoid using `Any` as much as possible. Can't we have a specific type for this, even if that means creating some type aliases?
```python
def _build_column_descriptions_map(
    payload: Any,
) -> dict[str, dict[tuple[str, ...], str]]:
```
It would be nice to have some context on the semantics of the expected return value; type aliases could help here.



fix #27667
Summary
`DataType.simpleString` (used by `DESCRIBE TABLE`) strips nested `COMMENT '...'` clauses, so per-field descriptions on `STRUCT` members never reached OpenMetadata. We now run `DESCRIBE TABLE EXTENDED <fqn> AS JSON` (Databricks Runtime 16.4+) once per table, walk the structured `columns[].type.fields[].comment` payload, and attach descriptions onto child columns by field path. On older runtimes the query errors and we silently fall back to top-level-only descriptions — no regression.

`Row.values()` was removed in SA 2.x; attribute access on a Row now falls back to column lookup, so `result.values()` was failing with `Could not locate column in row for column 'values'` and silently dropping schema/table descriptions. Replaced with `tuple(result)` at three call sites (`get_table_comment`, `get_schema_description`, `get_table_description`).

Repro (before this PR)
```sql
CREATE TABLE mayur_test.customer_profiles (
customer_id STRING COMMENT 'Unique customer identifier',
personal_info STRUCT<
first_name: STRING COMMENT 'Customer first name',
contact: STRUCT<email: STRING COMMENT 'Primary email address'> COMMENT 'Contact details'
> COMMENT 'Basic personal information'
) USING DELTA;
```
Top-level `personal_info` got a description, but `first_name`, `contact`, `contact.email`, etc. all came through with `description: null`. Same for every other commented nested field.
Test plan
Notes
The same "nested COMMENT lost via simpleString" pattern likely affects other Spark/Hive-based connectors (Hive, Athena, Trino, ClickHouse, standalone Deltalake). Not addressed in this PR — recommend tracking those as separate issues per real bug reports, since each connector needs a different source-of-truth (no universal `AS JSON` outside Databricks 16.4+).
🤖 Generated with Claude Code
Summary by Gitar
- Lazy fetch via `_fetch_nested_descriptions_via_describe_json` to avoid unnecessary API round-trips for tables containing only primitive types.
- The `AS JSON` metadata query runs at most once per table regardless of the number of complex columns encountered.

This will update automatically on new commits.