
fix(databricks): capture nested column descriptions and fix SQLAlchemy 2.x compat#27766

Open
ulixius9 wants to merge 6 commits into main from databricks_nested_comment

Conversation


@ulixius9 ulixius9 commented Apr 27, 2026

Fixes #27667

Summary

  • Nested struct/array field descriptions are now captured for the Databricks legacy connector. Spark's DataType.simpleString (used by DESCRIBE TABLE) strips nested COMMENT '...' clauses, so per-field descriptions on STRUCT members never reached OpenMetadata. We now run DESCRIBE TABLE EXTENDED <fqn> AS JSON (Databricks Runtime 16.4+) once per table, walk the structured columns[].type.fields[].comment payload, and attach descriptions onto child columns by field path. On older runtimes the query errors and we silently fall back to top-level-only descriptions — no regression.
  • SQLAlchemy 2.x compatibility fix. Row.values() was removed in SA 2.x; attribute access on a Row now falls back to column lookup, so result.values() was failing with Could not locate column in row for column 'values' and silently dropping schema/table descriptions. Replaced with tuple(result) at three call sites (get_table_comment, get_schema_description, get_table_description).
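The second fix can be illustrated with a minimal, self-contained sketch. `Row2x` below is a hypothetical stand-in for a SQLAlchemy 2.x `Row`: it iterates as a tuple, has no `.values()` method, and attribute access falls through to a column lookup that fails.

```python
class Row2x(tuple):
    """Hypothetical stand-in for a SQLAlchemy 2.x Row (no .values())."""

    def __getattr__(self, name):
        # Mimics SA 2.x: unknown attributes become failed column lookups
        raise KeyError(f"Could not locate column in row for column {name!r}")


result = Row2x(("a table comment",))

# Before: result.values() -> fails on SA 2.x (no such method or column)
# After:  iterate the row itself; works on both SA 1.4 and 2.x
comment = tuple(result)[0]
print(comment)  # a table comment
```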

Repro (before this PR)

```sql
CREATE TABLE mayur_test.customer_profiles (
    customer_id STRING COMMENT 'Unique customer identifier',
    personal_info STRUCT<
        first_name: STRING COMMENT 'Customer first name',
        contact: STRUCT<email: STRING COMMENT 'Primary email address'> COMMENT 'Contact details'
    > COMMENT 'Basic personal information'
) USING DELTA;
```

Top-level `personal_info` got a description, but `first_name`, `contact`, `contact.email`, etc. all came through with `description: null`. Same for every other commented nested field.
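The walk the fix performs can be sketched roughly like this. The payload below is hand-written and abbreviated (the real one comes from `DESCRIBE ... AS JSON`), and the real walker also handles array elements and skips maps; this sketch covers struct nesting only.

```python
import json


def walk_fields(fields, prefix=()):
    """Return {field_path_tuple: comment} for struct members, recursively."""
    out = {}
    for field in fields or []:
        path = prefix + (field["name"],)
        if comment := field.get("comment"):
            out[path] = comment
        ftype = field.get("type")
        # Recurse into nested structs; sketch ignores array/map handling
        if isinstance(ftype, dict) and ftype.get("name") == "struct":
            out.update(walk_fields(ftype.get("fields"), path))
    return out


# Abbreviated, hand-written payload for the repro table above
payload = json.loads("""
{"columns": [
  {"name": "customer_id", "type": "string", "comment": "Unique customer identifier"},
  {"name": "personal_info", "comment": "Basic personal information",
   "type": {"name": "struct", "fields": [
     {"name": "first_name", "type": "string", "comment": "Customer first name"},
     {"name": "contact", "comment": "Contact details",
      "type": {"name": "struct", "fields": [
        {"name": "email", "type": "string", "comment": "Primary email address"}
      ]}}
   ]}}
]}
""")

descriptions = {
    col["name"]: walk_fields(col["type"]["fields"])
    for col in payload["columns"]
    if isinstance(col["type"], dict)
}
print(descriptions["personal_info"][("contact", "email")])  # Primary email address
```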

Test plan

  • 20 new unit tests in `ingestion/tests/unit/topology/database/test_databricks_nested_comments.py` covering:
    • JSON walker for nested struct, `array` (no path level added), map skipping, fields missing comments, non-dict payloads
    • Query wrapper failure paths (older runtime / malformed JSON / empty result) — all return `{}` cleanly
    • Description applier (top-level, deeply nested, existing-description preservation, no-op on primitive columns)
  • Existing 21 Databricks tests (`test_databricks_get_columns.py`, `test_databricks_ordinal_position.py`, `test_column_type_parser.py`) still pass
  • Re-ingest a Unity Catalog table with `STRUCT<...>` field-level comments on a DBR 16.4+ workspace and confirm nested descriptions appear in the API/UI
  • Re-ingest the same table on a DBR <16.4 workspace and confirm ingestion still succeeds (top-level comments only, debug log shows "DESCRIBE AS JSON unavailable")
  • Confirm schema and table descriptions now flow through (the SA 2.x fix) — `Could not locate column in row for column 'values'` warnings should disappear

Notes

The same "nested COMMENT lost via simpleString" pattern likely affects other Spark/Hive-based connectors (Hive, Athena, Trino, ClickHouse, standalone Deltalake). Not addressed in this PR — recommend tracking those as separate issues per real bug reports, since each connector needs a different source-of-truth (no universal `AS JSON` outside Databricks 16.4+).

🤖 Generated with Claude Code


Summary by Gitar

  • Performance optimization:
    • Implemented lazy loading for _fetch_nested_descriptions_via_describe_json to avoid unnecessary API round-trips for tables containing only primitive types.
    • Ensured the AS JSON metadata query runs at most once per table regardless of the number of complex columns encountered.
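The lazy, at-most-once-per-table behavior summarized above can be sketched like this. `fetch_descriptions` stands in for the real `AS JSON` query, and `get_columns_sketch` is an illustrative name, not the connector's actual signature; a sentinel distinguishes "never fetched" from "fetched, got nothing".

```python
_UNSET = object()  # sentinel: "never fetched" vs "fetched, empty result"


def get_columns_sketch(raw_columns, fetch_descriptions):
    """Fetch nested descriptions lazily: at most one query per table."""
    nested = _UNSET
    for _name, col_type in raw_columns:
        # Only the first complex column triggers the extra round-trip;
        # primitive-only tables never pay for it.
        if col_type in {"struct", "array", "map"} and nested is _UNSET:
            nested = fetch_descriptions()
    return raw_columns, (None if nested is _UNSET else nested)
```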


…y 2.x

Two issues addressed in the Databricks legacy connector:

1. Nested struct/array field descriptions were never captured. Spark's
   DataType.simpleString output (which DESCRIBE TABLE emits) strips nested
   COMMENT clauses, so the field-level descriptions users define on STRUCT
   members never reached OpenMetadata. Switch to DESCRIBE TABLE EXTENDED
   <table> AS JSON (Databricks Runtime 16.4+) which returns structured
   per-field metadata including comments. On older runtimes the AS JSON
   query errors and the connector silently degrades to top-level-only
   descriptions — no regression.

2. result.values() removed in SQLAlchemy 2.x. The Row.values() method was
   removed; attribute access on a Row now falls back to column lookup, so
   the existing calls failed with "Could not locate column in row for column
   'values'", silently dropping schema and table descriptions. Replaced with
   tuple(result) at all three call sites (get_table_comment,
   get_schema_description, get_table_description).

Adds 20 unit tests covering the JSON walker (struct nesting, array<struct>
path semantics, map skipping, missing/malformed payloads), the query
wrapper's failure paths (mocked), and the description applier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 12:23
@ulixius9 ulixius9 requested a review from a team as a code owner April 27, 2026 12:23
@github-actions github-actions Bot added the `Ingestion` and `safe to test` labels Apr 27, 2026
Comment thread ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

This PR improves the Databricks legacy ingestion connector by (1) extracting and applying nested field comments for complex columns using Databricks’ DESCRIBE ... AS JSON payloads, and (2) fixing SQLAlchemy 2.x incompatibilities that were preventing schema/table comments from being captured.

Changes:

  • Add a JSON-based nested-comment extraction path (DESCRIBE TABLE EXTENDED ... AS JSON) with graceful fallback when unsupported.
  • Apply extracted nested comments onto the parsed OpenMetadata Column tree for complex types.
  • Replace Row.values() usages with tuple(row) to restore compatibility with SQLAlchemy 2.x row behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Adds nested comment extraction + application logic and updates SA row handling for comments.
ingestion/tests/unit/topology/database/test_databricks_nested_comments.py Adds unit coverage for the JSON walker, failure paths, and description application onto nested columns.

Comment thread ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Outdated
@github-actions
Contributor

github-actions Bot commented Apr 27, 2026

🟡 Playwright Results — all passed (13 flaky)

✅ 3985 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 86 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| ✅ Shard 1 | 299 | 0 | 0 | 4 |
| 🟡 Shard 2 | 750 | 0 | 4 | 8 |
| 🟡 Shard 3 | 745 | 0 | 1 | 7 |
| 🟡 Shard 4 | 773 | 0 | 2 | 18 |
| 🟡 Shard 5 | 685 | 0 | 2 | 41 |
| 🟡 Shard 6 | 733 | 0 | 4 | 8 |
🟡 13 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/Glossary/GlossaryRemoveOperations.spec.ts › should add and remove reviewer from glossary term (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 2 retries)
  • Pages/CustomProperties.spec.ts › Duration (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Directory (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted user not visible in owner selection for table (shard 5, 1 retry)
  • Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

```shell
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace
```

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

…ow compat

Per reviewer feedback, two new test groups guard against silent regressions:

- TestDescribeJsonLazyFetch: array<struct> triggers the AS JSON fetch via
  the regex gate, while array<primitive> and map skip it. Locks in the
  ^array\s*<\s*struct\b regex behavior so a future tweak can't silently
  drop nested descriptions on array-of-struct columns.

- TestSqlAlchemy2RowCompat: get_table_comment must work on a SA 2.x-style
  Row (no .values() method). Uses a tuple subclass that raises on .values()
  so any revert from `tuple(result)` to `result.values()` fails the test
  instead of silently dropping schema/table descriptions on SA 2.x.
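The `^array\s*<\s*struct\b` gate named in the commit message can be exercised standalone. Lower-casing the type string before matching is an assumption of this sketch, not necessarily what the connector does.

```python
import re

# Only array-of-struct types carry nested field comments worth fetching;
# array<primitive> and map must not trigger the extra AS JSON query.
ARRAY_OF_STRUCT = re.compile(r"^array\s*<\s*struct\b")


def triggers_fetch(col_type: str) -> bool:
    """True only for array<struct...> type strings (sketch)."""
    return bool(ARRAY_OF_STRUCT.match(col_type.strip().lower()))


print(triggers_fetch("array<struct<a: int>>"))  # True
print(triggers_fetch("array<int>"))             # False
print(triggers_fetch("map<string, int>"))       # False
```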

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented Apr 30, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Nested column descriptions are now captured and SQLAlchemy 2.x compatibility is restored. The redundant DESCRIBE AS JSON queries have been removed, and the null-guard for schema names is now covered by tests.

✅ 2 resolved
Performance: DESCRIBE AS JSON query runs for every table, even without complex columns

📄 ingestion/src/metadata/ingestion/source/database/databricks/metadata.py:291-293 📄 ingestion/src/metadata/ingestion/source/database/databricks/metadata.py:348-350
_fetch_nested_descriptions_via_describe_json is called unconditionally at the top of get_columns (line 291-293), issuing an extra DESCRIBE TABLE EXTENDED ... AS JSON round-trip for every single table — including tables with only primitive columns that will never use the result. On a catalog with thousands of tables, this doubles the number of DESCRIBE queries during ingestion.

Consider deferring the call until a complex column is actually encountered (i.e., move it inside the if col_type in {"struct", "array", "map"} branch), or lazily cache the result so it's fetched at most once per table only when needed.

Quality: New null-guard for db_name/schema lacks test coverage

📄 ingestion/src/metadata/ingestion/source/database/databricks/metadata.py:142-143 📄 ingestion/tests/unit/topology/database/test_databricks_nested_comments.py:211-225
The new early-return guard if not db_name or not schema: return {} at line 142-143 of metadata.py is a sensible defensive addition, but there is no corresponding unit test in TestFetchNestedDescriptionsViaDescribeJson. Adding a quick parametrized test ensures this path stays exercised if the function signature changes later.


Comment on lines +124 to +155
```python
def _fetch_nested_descriptions_via_describe_json(
    connection,
    db_name: str | None,
    schema: str | None,
    table_name: str,
) -> dict[str, dict[tuple[str, ...], str]]:
    """Run ``DESCRIBE TABLE EXTENDED <fqn> AS JSON`` and return a per-column
    map of ``{field_path_tuple: comment}``.

    ``DESCRIBE ... AS JSON`` is supported on Databricks Runtime 16.4+ and
    returns a structured payload with ``columns[].type.fields[].comment`` on
    nested struct fields — the only SQL path that exposes nested COMMENTs
    (Spark's regular ``simpleString`` output strips them).

    Returns an empty dict on any failure (older runtime, JSON parse error,
    schema variation, or missing db/schema) so the caller cleanly degrades
    to top-level-only descriptions.
    """
    if not db_name or not schema:
        return {}
    try:
        result = connection.execute(
            text(
                f"DESCRIBE TABLE EXTENDED `{db_name}`.`{schema}`.`{table_name}` AS JSON"
            )
        ).fetchone()
        if not result or not result[0]:
            return {}
        payload = json.loads(result[0])
    except Exception as err:  # pylint: disable=broad-except
        logger.debug(
            f"DESCRIBE AS JSON unavailable or unparseable for "
            f"{db_name}.{schema}.{table_name}: {err}"
        )
        return {}

    return _build_column_descriptions_map(payload)
```
Contributor


If we are using `DESCRIBE TABLE EXTENDED {db_name}.{schema}.{table_name} AS JSON` (which I think is great, to be honest), we should use this opportunity to actually use it in all the places where we are running DESCRIBE TABLE EXTENDED.

We should also use this opportunity to only call that query once. From the logs, it seems we are running that query 4 times per table.
We should run it once, parse the result and cache it until the table processing is done.

One good thing that we should do is keep the regular query as a backup and parse it the same way.

I would create a new pydantic class or pydantic dataclass that knows how to construct itself from the result of both the JSON and the regular query.
It would be interesting to explore whether we could use slots and set frozen for correctness and small performance gains; some docs:
https://wiki.python.org/moin/UsingSlots
https://pydantic.dev/docs/validation/latest/api/pydantic/config/#pydantic.config.ConfigDict.frozen
https://pydantic.dev/docs/validation/latest/api/pydantic/dataclasses/



```python
@reflection.cache
def get_columns(self, connection, table_name, schema=None, **kw):
```
Contributor


This method is fairly big; it could be worth extracting some parts of it to make it easier to read.



```python
def _build_column_descriptions_map(
    payload: Any,
```
Contributor


We should avoid using `Any` as much as possible. Can't we have a specific type for this, even if that means creating some type aliases, for instance?


```python
def _build_column_descriptions_map(
    payload: Any,
) -> dict[str, dict[tuple[str, ...], str]]:
```
Contributor


It would be nice to have some context on the semantics of the expected return value; type aliases could help here too.

@mohittilala mohittilala added the `To release`, `python`, and `databricks` labels May 1, 2026
@mohittilala mohittilala removed this from Shipping May 1, 2026
@sonarqubecloud

sonarqubecloud Bot commented May 1, 2026


Labels

  • databricks
  • Ingestion
  • python: Pull requests that update python code
  • safe to test: Add this label to run secure Github workflows on PRs
  • To release: Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Nested Column Descriptions Missing in Databricks Metadata

4 participants