
fix(databricks): capture nested column descriptions and fix SQLAlchemy 2.x compat#27766

Open
ulixius9 wants to merge 6 commits into main from databricks_nested_comment

Conversation


@ulixius9 ulixius9 commented Apr 27, 2026

Fixes #27667

Summary

  • Nested struct/array field descriptions are now captured for the Databricks legacy connector. Spark's DataType.simpleString (used by DESCRIBE TABLE) strips nested COMMENT '...' clauses, so per-field descriptions on STRUCT members never reached OpenMetadata. We now run DESCRIBE TABLE EXTENDED <fqn> AS JSON (Databricks Runtime 16.4+) once per table, walk the structured columns[].type.fields[].comment payload, and attach descriptions onto child columns by field path. On older runtimes the query errors and we silently fall back to top-level-only descriptions — no regression.
  • SQLAlchemy 2.x compatibility fix. Row.values() was removed in SA 2.x; attribute access on a Row now falls back to column lookup, so result.values() was failing with Could not locate column in row for column 'values' and silently dropping schema/table descriptions. Replaced with tuple(result) at three call sites (get_table_comment, get_schema_description, get_table_description).
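The second fix can be illustrated with a minimal, self-contained sketch. `Row2x` below is a hypothetical stand-in for a SQLAlchemy 2.x `Row`: it iterates as a tuple, has no `.values()` method, and attribute access falls through to a column lookup that fails.

```python
class Row2x(tuple):
    """Hypothetical stand-in for a SQLAlchemy 2.x Row (no .values())."""

    def __getattr__(self, name):
        # Mimics SA 2.x: unknown attributes become failed column lookups
        raise KeyError(f"Could not locate column in row for column {name!r}")


result = Row2x(("a table comment",))

# Before: result.values() -> fails on SA 2.x (no such method or column)
# After:  iterate the row itself; works on both SA 1.4 and 2.x
comment = tuple(result)[0]
print(comment)  # a table comment
```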

Repro (before this PR)

```sql
CREATE TABLE mayur_test.customer_profiles (
    customer_id STRING COMMENT 'Unique customer identifier',
    personal_info STRUCT<
        first_name: STRING COMMENT 'Customer first name',
        contact: STRUCT<email: STRING COMMENT 'Primary email address'> COMMENT 'Contact details'
    > COMMENT 'Basic personal information'
) USING DELTA;
```

Top-level `personal_info` got a description, but `first_name`, `contact`, `contact.email`, etc. all came through with `description: null`. Same for every other commented nested field.
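The walk the fix performs can be sketched roughly like this. The payload below is hand-written and abbreviated (the real one comes from `DESCRIBE ... AS JSON`), and the real walker also handles array elements and skips maps; this sketch covers struct nesting only.

```python
import json


def walk_fields(fields, prefix=()):
    """Return {field_path_tuple: comment} for struct members, recursively."""
    out = {}
    for field in fields or []:
        path = prefix + (field["name"],)
        if comment := field.get("comment"):
            out[path] = comment
        ftype = field.get("type")
        # Recurse into nested structs; sketch ignores array/map handling
        if isinstance(ftype, dict) and ftype.get("name") == "struct":
            out.update(walk_fields(ftype.get("fields"), path))
    return out


# Abbreviated, hand-written payload for the repro table above
payload = json.loads("""
{"columns": [
  {"name": "customer_id", "type": "string", "comment": "Unique customer identifier"},
  {"name": "personal_info", "comment": "Basic personal information",
   "type": {"name": "struct", "fields": [
     {"name": "first_name", "type": "string", "comment": "Customer first name"},
     {"name": "contact", "comment": "Contact details",
      "type": {"name": "struct", "fields": [
        {"name": "email", "type": "string", "comment": "Primary email address"}
      ]}}
   ]}}
]}
""")

descriptions = {
    col["name"]: walk_fields(col["type"]["fields"])
    for col in payload["columns"]
    if isinstance(col["type"], dict)
}
print(descriptions["personal_info"][("contact", "email")])  # Primary email address
```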

Test plan

  • 20 new unit tests in `ingestion/tests/unit/topology/database/test_databricks_nested_comments.py` covering:
    • JSON walker for nested struct, `array` (no path level added), map skipping, fields missing comments, non-dict payloads
    • Query wrapper failure paths (older runtime / malformed JSON / empty result) — all return `{}` cleanly
    • Description applier (top-level, deeply nested, existing-description preservation, no-op on primitive columns)
  • Existing 21 Databricks tests (`test_databricks_get_columns.py`, `test_databricks_ordinal_position.py`, `test_column_type_parser.py`) still pass
  • Re-ingest a Unity Catalog table with `STRUCT<...>` field-level comments on a DBR 16.4+ workspace and confirm nested descriptions appear in the API/UI
  • Re-ingest the same table on a DBR <16.4 workspace and confirm ingestion still succeeds (top-level comments only, debug log shows "DESCRIBE AS JSON unavailable")
  • Confirm schema and table descriptions now flow through (the SA 2.x fix) — `Could not locate column in row for column 'values'` warnings should disappear

Notes

The same "nested COMMENT lost via simpleString" pattern likely affects other Spark/Hive-based connectors (Hive, Athena, Trino, ClickHouse, standalone Deltalake). Not addressed in this PR — recommend tracking those as separate issues per real bug reports, since each connector needs a different source-of-truth (no universal `AS JSON` outside Databricks 16.4+).

🤖 Generated with Claude Code


Summary by Gitar

  • Performance optimization:
    • Implemented lazy loading for _fetch_nested_descriptions_via_describe_json to avoid unnecessary API round-trips for tables containing only primitive types.
    • Ensured the AS JSON metadata query runs at most once per table regardless of the number of complex columns encountered.
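The lazy, at-most-once-per-table behavior summarized above can be sketched like this. `fetch_descriptions` stands in for the real `AS JSON` query, and `get_columns_sketch` is an illustrative name, not the connector's actual signature; a sentinel distinguishes "never fetched" from "fetched, got nothing".

```python
_UNSET = object()  # sentinel: "never fetched" vs "fetched, empty result"


def get_columns_sketch(raw_columns, fetch_descriptions):
    """Fetch nested descriptions lazily: at most one query per table."""
    nested = _UNSET
    for _name, col_type in raw_columns:
        # Only the first complex column triggers the extra round-trip;
        # primitive-only tables never pay for it.
        if col_type in {"struct", "array", "map"} and nested is _UNSET:
            nested = fetch_descriptions()
    return raw_columns, (None if nested is _UNSET else nested)
```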


…y 2.x

Two issues addressed in the Databricks legacy connector:

1. Nested struct/array field descriptions were never captured. Spark's
   DataType.simpleString output (which DESCRIBE TABLE emits) strips nested
   COMMENT clauses, so the field-level descriptions users define on STRUCT
   members never reached OpenMetadata. Switch to DESCRIBE TABLE EXTENDED
   <table> AS JSON (Databricks Runtime 16.4+) which returns structured
   per-field metadata including comments. On older runtimes the AS JSON
   query errors and the connector silently degrades to top-level-only
   descriptions — no regression.

2. result.values() removed in SQLAlchemy 2.x. The Row.values() method was
   removed; attribute access on a Row now falls back to column lookup, so
   the existing calls failed with "Could not locate column in row for column
   'values'", silently dropping schema and table descriptions. Replaced with
   tuple(result) at all three call sites (get_table_comment,
   get_schema_description, get_table_description).

Adds 20 unit tests covering the JSON walker (struct nesting, array<struct>
path semantics, map skipping, missing/malformed payloads), the query
wrapper's failure paths (mocked), and the description applier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 12:23
@ulixius9 ulixius9 requested a review from a team as a code owner April 27, 2026 12:23
@github-actions github-actions Bot added the `Ingestion` and `safe to test` labels Apr 27, 2026
Comment thread ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

This PR improves the Databricks legacy ingestion connector by (1) extracting and applying nested field comments for complex columns using Databricks’ DESCRIBE ... AS JSON payloads, and (2) fixing SQLAlchemy 2.x incompatibilities that were preventing schema/table comments from being captured.

Changes:

  • Add a JSON-based nested-comment extraction path (DESCRIBE TABLE EXTENDED ... AS JSON) with graceful fallback when unsupported.
  • Apply extracted nested comments onto the parsed OpenMetadata Column tree for complex types.
  • Replace Row.values() usages with tuple(row) to restore compatibility with SQLAlchemy 2.x row behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Adds nested comment extraction + application logic and updates SA row handling for comments.
ingestion/tests/unit/topology/database/test_databricks_nested_comments.py Adds unit coverage for the JSON walker, failure paths, and description application onto nested columns.

Comment thread ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Outdated
@github-actions
Contributor

github-actions Bot commented Apr 27, 2026

🟡 Playwright Results — all passed (13 flaky)

✅ 3985 passed · ❌ 0 failed · 🟡 13 flaky · ⏭️ 86 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| ✅ Shard 1 | 299 | 0 | 0 | 4 |
| 🟡 Shard 2 | 750 | 0 | 4 | 8 |
| 🟡 Shard 3 | 745 | 0 | 1 | 7 |
| 🟡 Shard 4 | 773 | 0 | 2 | 18 |
| 🟡 Shard 5 | 685 | 0 | 2 | 41 |
| 🟡 Shard 6 | 733 | 0 | 4 | 8 |
🟡 13 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/Glossary/GlossaryRemoveOperations.spec.ts › should add and remove reviewer from glossary term (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 2 retries)
  • Pages/CustomProperties.spec.ts › Duration (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Directory (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted user not visible in owner selection for table (shard 5, 1 retry)
  • Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

```shell
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace
```

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

…ow compat

Per reviewer feedback, two new test groups guard against silent regressions:

- TestDescribeJsonLazyFetch: array<struct> triggers the AS JSON fetch via
  the regex gate, while array<primitive> and map skip it. Locks in the
  ^array\s*<\s*struct\b regex behavior so a future tweak can't silently
  drop nested descriptions on array-of-struct columns.

- TestSqlAlchemy2RowCompat: get_table_comment must work on a SA 2.x-style
  Row (no .values() method). Uses a tuple subclass that raises on .values()
  so any revert from `tuple(result)` to `result.values()` fails the test
  instead of silently dropping schema/table descriptions on SA 2.x.
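The `^array\s*<\s*struct\b` gate named in the commit message can be exercised standalone. Lower-casing the type string before matching is an assumption of this sketch, not necessarily what the connector does.

```python
import re

# Only array-of-struct types carry nested field comments worth fetching;
# array<primitive> and map must not trigger the extra AS JSON query.
ARRAY_OF_STRUCT = re.compile(r"^array\s*<\s*struct\b")


def triggers_fetch(col_type: str) -> bool:
    """True only for array<struct...> type strings (sketch)."""
    return bool(ARRAY_OF_STRUCT.match(col_type.strip().lower()))


print(triggers_fetch("array<struct<a: int>>"))  # True
print(triggers_fetch("array<int>"))             # False
print(triggers_fetch("map<string, int>"))       # False
```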

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented Apr 30, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Nested column descriptions are now captured and SQLAlchemy 2.x compatibility is restored. The redundant DESCRIBE AS JSON queries have been removed, and the null-guard for schema names is now covered by tests.

✅ 2 resolved
Performance: DESCRIBE AS JSON query runs for every table, even without complex columns

📄 ingestion/src/metadata/ingestion/source/database/databricks/metadata.py:291-293 📄 ingestion/src/metadata/ingestion/source/database/databricks/metadata.py:348-350
_fetch_nested_descriptions_via_describe_json is called unconditionally at the top of get_columns (line 291-293), issuing an extra DESCRIBE TABLE EXTENDED ... AS JSON round-trip for every single table — including tables with only primitive columns that will never use the result. On a catalog with thousands of tables, this doubles the number of DESCRIBE queries during ingestion.

Consider deferring the call until a complex column is actually encountered (i.e., move it inside the if col_type in {"struct", "array", "map"} branch), or lazily cache the result so it's fetched at most once per table only when needed.

Quality: New null-guard for db_name/schema lacks test coverage

📄 ingestion/src/metadata/ingestion/source/database/databricks/metadata.py:142-143 📄 ingestion/tests/unit/topology/database/test_databricks_nested_comments.py:211-225
The new early-return guard if not db_name or not schema: return {} at line 142-143 of metadata.py is a sensible defensive addition, but there is no corresponding unit test in TestFetchNestedDescriptionsViaDescribeJson. Adding a quick parametrized test ensures this path stays exercised if the function signature changes later.


Comment on lines +124 to +155
```python
def _fetch_nested_descriptions_via_describe_json(
    connection,
    db_name: str | None,
    schema: str | None,
    table_name: str,
) -> dict[str, dict[tuple[str, ...], str]]:
    """Run ``DESCRIBE TABLE EXTENDED <fqn> AS JSON`` and return a per-column
    map of ``{field_path_tuple: comment}``.

    ``DESCRIBE ... AS JSON`` is supported on Databricks Runtime 16.4+ and
    returns a structured payload with ``columns[].type.fields[].comment`` on
    nested struct fields — the only SQL path that exposes nested COMMENTs
    (Spark's regular ``simpleString`` output strips them).

    Returns an empty dict on any failure (older runtime, JSON parse error,
    schema variation, or missing db/schema) so the caller cleanly degrades
    to top-level-only descriptions.
    """
    if not db_name or not schema:
        return {}
    try:
        result = connection.execute(
            text(
                f"DESCRIBE TABLE EXTENDED `{db_name}`.`{schema}`.`{table_name}` AS JSON"
            )
        ).fetchone()
        if not result or not result[0]:
            return {}
        payload = json.loads(result[0])
    except Exception as err:  # pylint: disable=broad-except
        logger.debug(
            f"DESCRIBE AS JSON unavailable or unparseable for "
            f"{db_name}.{schema}.{table_name}: {err}"
        )
        return {}

    return _build_column_descriptions_map(payload)
```
Contributor


If we are using `DESCRIBE TABLE EXTENDED {db_name}.{schema}.{table_name} AS JSON` (which I think is great, to be honest), we should use this opportunity to actually use it in all the places where we are running DESCRIBE TABLE EXTENDED.

We should also use this opportunity to only call that query once. From the logs, it seems we are running that query 4 times per table.
We should run it once, parse the result and cache it until the table processing is done.

One good thing that we should do is keep the regular query as a backup and parse it the same way.

I would create a new pydantic class or pydantic dataclass that knows how to construct itself from the result of both the JSON and the regular query.
It would be interesting to explore whether we could use slots and set frozen for correctness and small performance gains; some docs:
https://wiki.python.org/moin/UsingSlots
https://pydantic.dev/docs/validation/latest/api/pydantic/config/#pydantic.config.ConfigDict.frozen
https://pydantic.dev/docs/validation/latest/api/pydantic/dataclasses/



```python
@reflection.cache
def get_columns(self, connection, table_name, schema=None, **kw):
```
Contributor


This method is fairly big; it could be worth extracting some parts of it to make it easier to read.



```python
def _build_column_descriptions_map(
    payload: Any,
```
Contributor


We should avoid using `Any` as much as possible. Can't we have a specific type for this, even if that means creating some type aliases, for instance?


```python
def _build_column_descriptions_map(
    payload: Any,
) -> dict[str, dict[tuple[str, ...], str]]:
```
Contributor


It would be nice to have some context on the semantics of the expected return value; type aliases could help here too.

@mohittilala mohittilala added the `To release`, `python`, and `databricks` labels May 1, 2026
@mohittilala mohittilala removed this from Shipping May 1, 2026
@sonarqubecloud

sonarqubecloud Bot commented May 1, 2026


Labels

  • databricks
  • Ingestion
  • python: Pull requests that update python code
  • safe to test: Add this label to run secure Github workflows on PRs
  • To release: Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Nested Column Descriptions Missing in Databricks Metadata

4 participants