Conversation
Pull request overview
This PR improves Databricks ingestion robustness by preventing `get_columns` from crashing (and silently returning no columns) when `DESCRIBE TABLE EXTENDED` returns rows with an empty/`None` `col_type`.
Changes:
- Add unit/regression coverage for malformed `DESCRIBE TABLE EXTENDED` rows (empty/`None` `col_type`).
- Tighten `_get_column_rows` filtering and add defensive parsing in `get_columns` to skip unparseable types instead of raising.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `ingestion/tests/unit/topology/database/test_databricks_get_columns.py` | Adds regression tests covering `None`/empty `col_type` rows for Databricks `get_columns` and `_get_column_rows`. |
| `ingestion/src/metadata/ingestion/source/database/databricks/metadata.py` | Updates row filtering and adds a guard around regex parsing of `col_type` to avoid runtime errors. |
```python
# Filter out empty rows, comment markers, and rows missing a usable col_type.
# DESCRIBE TABLE EXTENDED can return rows where col_type is empty/None for
# Delta/streaming partition markers; downstream regex at line ~200 requires
# a non-empty string.
return [
    row
    for row in column_rows
    if row[0] and row[0] != "# col_name" and isinstance(row[1], str) and row[1]
]
```
The new `_get_column_rows` filter drops rows where `col_type` is empty/`None`. However, Databricks/Hive DESCRIBE output uses section header rows like `# Partition Information` / `# Detailed Table Information` that often have an empty `col_type` (see the existing `test_databricks_ordinal_position.py`, which provides `("# Partition Information", "", "")`). If those header rows are filtered out, `get_columns` will never hit the break condition and can start treating table/partition metadata rows (e.g., `Location`/`Provider`) as columns. Consider preserving header rows (e.g., allow `row[0].startswith('#')` except `# col_name`, or explicitly whitelist the break-marker values) while still dropping truly malformed column rows with a missing `col_type`.
Suggested change:

```python
# Filter out empty rows and the DESCRIBE schema header row, while preserving
# section marker rows such as "# Partition Information" and
# "# Detailed Table Information". Those marker rows often have an empty
# col_type, but downstream parsing relies on them to stop before table
# metadata rows like Location/Provider are treated as columns.
return [
    row
    for row in column_rows
    if row[0]
    and row[0] != "# col_name"
    and (
        row[0].startswith("#")
        or (isinstance(row[1], str) and row[1])
    )
]
```
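To make the concern concrete, here is a minimal, runnable sketch. The row tuples and `parse_columns` helper are illustrative stand-ins (not the real `get_columns` or actual DESCRIBE output): pre-filtering rows with an empty `col_type` also removes the section markers, so table metadata leaks in as columns, while breaking on any `#`-prefixed row stops cleanly.

```python
# Hypothetical DESCRIBE TABLE EXTENDED output: (col_name, col_type, comment).
rows = [
    ("id", "bigint", None),
    ("name", "string", None),
    ("# Partition Information", "", ""),   # marker row with empty col_type
    ("# col_name", "data_type", "comment"),
    ("id", "bigint", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Location", "s3://bucket/path", ""),  # table metadata, NOT a column
]

def parse_columns(rows):
    """Stop at the first '#'-prefixed marker row, as the review suggests."""
    columns = []
    for col_name, col_type, _comment in rows:
        if not col_name:
            continue
        if col_name.startswith("#"):
            break  # section marker: end of the column list
        columns.append((col_name, col_type))
    return columns

# With marker rows preserved, parsing stops before table metadata:
print(parse_columns(rows))  # -> [('id', 'bigint'), ('name', 'string')]

# The filter under review drops the marker rows, so the break never fires
# and 'Location' is misread as a column:
bad_filter = [
    r for r in rows if r[0] and r[0] != "# col_name" and r[1]
]
print(parse_columns(bad_filter))
# -> [('id', 'bigint'), ('name', 'string'), ('id', 'bigint'),
#     ('Location', 's3://bucket/path')]
```

The second call shows the failure mode the reviewer describes: once the markers are gone, duplicate column rows and table metadata rows all pass the loop.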
🔴 Playwright Results — 1 failure, 22 flaky
✅ 3688 passed · ❌ 1 failed · 🟡 22 flaky · ⏭️ 89 skipped
Genuine failures (failed on all attempts): 1
```python
raw_col_type = col_type
type_match = re.search(r"^\w+", col_type)
if type_match is None:
    logger.warning(
        f"Skipping column '{col_name}' in {schema}.{table_name}: "
        f"unparseable col_type '{col_type}'"
    )
    continue
```
I am not sold on this. From research it seems this should never happen; I added it here as a defensive way to continue column extraction in any case, but I also do not like the possibility of ending up with incomplete column metadata.
I am not sure whether a proper failure that skips column extraction would be better here.
The Python checkstyle failed. Please run the checkstyle fixes locally. You can install the pre-commit hooks to run them automatically.
```python
def test_unexpected_exception_in_row_does_not_drop_other_columns(self, mock_rows):
    """Broad per-row try/except ensures one bad column doesn't lose the
    rest of the table's columns via the outer sql_column_handler catch."""
    mock_rows.return_value = [
        ("good1", "bigint", None),
        # col_type that passes the truthy guard and the regex, but a
        # downstream failure is simulated by an unknown type mapping
        # returning NullType (already handled — should still append).
        ("exotic", "quintuplet", None),
        ("good2", "string", None),
    ]
```
`test_unexpected_exception_in_row_does_not_drop_other_columns` doesn't currently trigger an exception inside `get_columns` (unknown types are handled via `KeyError` -> `NullType`). As written, it won't fail if the new broad per-row exception handling is removed. Adjust the test to force an actual exception in per-row processing (e.g., patch the `_type_map` lookup or `re.search`/the complex-type subquery to raise) and assert that subsequent columns are still returned.
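As a sketch of what this comment asks for (the loop and mocking target below are illustrative stand-ins, not the actual OpenMetadata test file or `metadata.py` code), one way to force a genuine per-row exception is to patch `re.search` to raise for one poisoned `col_type` and assert that the surviving columns are still returned:

```python
import re
from unittest.mock import patch

real_search = re.search  # keep a reference before patching

def exploding_search(pattern, string, *args, **kwargs):
    """Raise only for the poisoned col_type to simulate a per-row failure."""
    if string == "boom_type":
        raise RuntimeError("simulated per-row failure")
    return real_search(pattern, string, *args, **kwargs)

def get_columns_like(rows):
    """Stand-in for the per-row loop in get_columns with a broad try/except."""
    result = []
    for col_name, col_type, _comment in rows:
        try:
            match = re.search(r"^\w+", col_type)
            result.append((col_name, match.group(0)))
        except Exception:  # pylint: disable=broad-except
            continue  # one bad row must not drop the rest of the table
    return result

rows = [
    ("good1", "bigint", None),
    ("bad", "boom_type", None),   # poisoned: re.search raises for this row
    ("good2", "string", None),
]
with patch("re.search", side_effect=exploding_search):
    cols = get_columns_like(rows)

print(cols)  # -> [('good1', 'bigint'), ('good2', 'string')]
```

Unlike the `"quintuplet"` fixture, this test would actually fail if the broad per-row `try/except` were removed, because the `RuntimeError` would propagate out of the loop.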
…arkers

DESCRIBE TABLE EXTENDED emits '#'-prefixed section markers beyond the hardcoded whitelist (e.g. '# Metadata Columns' from Spark v2 DescribeTableExec), causing re.search on a None col_type to raise TypeError and sql_column_handler to drop all columns for the table. Treat any '#'-prefixed row, empty col_type, or col_type with no leading word chars as end-of-columns and break. Add a per-row try/except so unexpected failures skip one column instead of the whole table. Use len(result) for ordinal_position to keep values contiguous.
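A self-contained sketch of the end state this commit message describes — the function name and row tuples are illustrative, not the actual `metadata.py` implementation:

```python
import re

def parse_describe_rows(rows):
    """Illustrative version of the column loop after this change."""
    result = []
    for col_name, col_type, _comment in rows:
        # Any '#'-prefixed row, empty/None col_type, or col_type without a
        # leading word character marks the end of the column section.
        if (
            col_name.startswith("#")
            or not col_type
            or re.search(r"^\w+", col_type) is None
        ):
            break
        try:
            result.append(
                {
                    "name": col_name,
                    "type": re.search(r"^\w+", col_type).group(0),
                    # len(result) + 1 keeps ordinal positions contiguous
                    # even when an earlier bad row was skipped.
                    "ordinal_position": len(result) + 1,
                }
            )
        except Exception:  # pylint: disable=broad-except
            continue  # skip one bad column instead of dropping the table
    return result

rows = [
    ("id", "bigint", None),
    ("payload", "struct<a:int,b:string>", None),
    ("# Metadata Columns", "", ""),   # unknown marker: stop, don't crash
    ("Location", "s3://bucket/x", ""),
]
cols = parse_describe_rows(rows)
print([c["name"] for c in cols])  # -> ['id', 'payload']
```

The unknown `# Metadata Columns` marker ends the column section cleanly instead of raising `TypeError`, and the metadata row after it is never misread as a column.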
Force-pushed dbe2749 to 0edfac7 (compare).
Code Review ✅ Approved — 1 resolved / 1 finding
Corrects Databricks column type parsing and removes the unused …
✅ 1 resolved — Quality: test helper
…arkers (#27610)

DESCRIBE TABLE EXTENDED emits '#'-prefixed section markers beyond the hardcoded whitelist (e.g. '# Metadata Columns' from Spark v2 DescribeTableExec), causing re.search on a None col_type to raise TypeError and sql_column_handler to drop all columns for the table. Treat any '#'-prefixed row, empty col_type, or col_type with no leading word chars as end-of-columns and break. Add a per-row try/except so unexpected failures skip one column instead of the whole table. Use len(result) for ordinal_position to keep values contiguous.

Co-authored-by: Pablo Takara <pjt1991@gmail.com>



Describe your changes:
Fixes
I worked on ... because ...
Type of change:
Checklist:
Fixes <issue-number>: <short explanation>

Summary by Gitar
- Treat any `#`-prefixed marker or empty `col_type` as end-of-columns to prevent `TypeError`.
- Add a per-row `try`/`except` to column processing to ensure metadata for surviving columns is still captured.
- Handle unparseable `col_type` values by skipping the column and logging a warning.
- Add unit tests for `get_columns` to verify section boundary handling and failure recovery.

This will update automatically on new commits.