Conversation
Pull request overview
This PR improves Databricks ingestion robustness by preventing `get_columns` from crashing (and silently returning no columns) when `DESCRIBE TABLE EXTENDED` returns rows with an empty/`None` `col_type`.
Changes:
- Add unit/regression coverage for malformed `DESCRIBE TABLE EXTENDED` rows (empty/`None` `col_type`).
- Tighten `_get_column_rows` filtering and add defensive parsing in `get_columns` to skip unparseable types instead of raising.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `ingestion/tests/unit/topology/database/test_databricks_get_columns.py` | Adds regression tests covering `None`/empty `col_type` rows for Databricks `get_columns` and `_get_column_rows`. |
| `ingestion/src/metadata/ingestion/source/database/databricks/metadata.py` | Updates row filtering and adds a guard around regex parsing of `col_type` to avoid runtime errors. |
```python
# Filter out empty rows, comment markers, and rows missing a usable col_type.
# DESCRIBE TABLE EXTENDED can return rows where col_type is empty/None for
# Delta/streaming partition markers; downstream regex at line ~200 requires
# a non-empty string.
return [
    row
    for row in column_rows
    if row[0] and row[0] != "# col_name" and isinstance(row[1], str) and row[1]
]
```
The new `_get_column_rows` filter drops rows where `col_type` is empty/`None`. However, Databricks/Hive DESCRIBE output uses section header rows like `# Partition Information` / `# Detailed Table Information` that often have an empty `col_type` (see the existing `test_databricks_ordinal_position.py`, which provides `("# Partition Information", "", "")`). If those header rows are filtered out, `get_columns` will never hit the break condition and can start treating table/partition metadata rows (e.g., `Location`/`Provider`) as columns. Consider preserving header rows (e.g., allow `row[0].startswith('#')` except `# col_name`, or explicitly whitelist the break-marker values) while still dropping truly malformed column rows with a missing `col_type`.
Suggested change:

```python
# Filter out empty rows and the DESCRIBE schema header row, while preserving
# section marker rows such as "# Partition Information" and
# "# Detailed Table Information". Those marker rows often have an empty
# col_type, but downstream parsing relies on them to stop before table
# metadata rows like Location/Provider are treated as columns.
return [
    row
    for row in column_rows
    if row[0]
    and row[0] != "# col_name"
    and (
        row[0].startswith("#")
        or (isinstance(row[1], str) and row[1])
    )
]
```
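To make the concern concrete, here is a minimal, runnable sketch. The row tuples and `parse_columns` helper are illustrative stand-ins (not the real `get_columns` or actual DESCRIBE output): pre-filtering rows with an empty `col_type` also removes the section markers, so table metadata leaks in as columns, while breaking on any `#`-prefixed row stops cleanly.

```python
# Hypothetical DESCRIBE TABLE EXTENDED output: (col_name, col_type, comment).
rows = [
    ("id", "bigint", None),
    ("name", "string", None),
    ("# Partition Information", "", ""),   # marker row with empty col_type
    ("# col_name", "data_type", "comment"),
    ("id", "bigint", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Location", "s3://bucket/path", ""),  # table metadata, NOT a column
]

def parse_columns(rows):
    """Stop at the first '#'-prefixed marker row, as the review suggests."""
    columns = []
    for col_name, col_type, _comment in rows:
        if not col_name:
            continue
        if col_name.startswith("#"):
            break  # section marker: end of the column list
        columns.append((col_name, col_type))
    return columns

# With marker rows preserved, parsing stops before table metadata:
print(parse_columns(rows))  # -> [('id', 'bigint'), ('name', 'string')]

# The filter under review drops the marker rows, so the break never fires
# and 'Location' is misread as a column:
bad_filter = [
    r for r in rows if r[0] and r[0] != "# col_name" and r[1]
]
print(parse_columns(bad_filter))
# -> [('id', 'bigint'), ('name', 'string'), ('id', 'bigint'),
#     ('Location', 's3://bucket/path')]
```

The second call shows the failure mode the reviewer describes: once the markers are gone, duplicate column rows and table metadata rows all pass the loop.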
🔴 Playwright Results — 1 failure, 22 flaky
✅ 3688 passed · ❌ 1 failed · 🟡 22 flaky · ⏭️ 89 skipped
Genuine failures (failed on all attempts): 1
```python
raw_col_type = col_type
type_match = re.search(r"^\w+", col_type)
if type_match is None:
    logger.warning(
        f"Skipping column '{col_name}' in {schema}.{table_name}: "
        f"unparseable col_type '{col_type}'"
    )
    continue
```
I am not sold on this. From research it seems this should never happen; I added it here as a defensive way to continue column extraction in any case, but I also do not like the possibility of ending up with incomplete column metadata.
I am not sure whether a proper failure that skips column extraction would be better here.
The Python checkstyle failed. Please run the checkstyle fixes locally. You can install the pre-commit hooks to run them automatically.
```python
def test_unexpected_exception_in_row_does_not_drop_other_columns(self, mock_rows):
    """Broad per-row try/except ensures one bad column doesn't lose the
    rest of the table's columns via the outer sql_column_handler catch."""
    mock_rows.return_value = [
        ("good1", "bigint", None),
        # col_type that passes the truthy guard and the regex, but a
        # downstream failure is simulated by an unknown type mapping
        # returning NullType (already handled — should still append).
        ("exotic", "quintuplet", None),
        ("good2", "string", None),
    ]
```
`test_unexpected_exception_in_row_does_not_drop_other_columns` doesn't currently trigger an exception inside `get_columns` (unknown types are handled via `KeyError` -> `NullType`). As written, it won't fail if the new broad per-row exception handling is removed. Adjust the test to force an actual exception in per-row processing (e.g., patch the `_type_map` lookup or `re.search`/the complex-type subquery to raise) and assert that subsequent columns are still returned.
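As a sketch of what this comment asks for (the loop and mocking target below are illustrative stand-ins, not the actual OpenMetadata test file or `metadata.py` code), one way to force a genuine per-row exception is to patch `re.search` to raise for one poisoned `col_type` and assert that the surviving columns are still returned:

```python
import re
from unittest.mock import patch

real_search = re.search  # keep a reference before patching

def exploding_search(pattern, string, *args, **kwargs):
    """Raise only for the poisoned col_type to simulate a per-row failure."""
    if string == "boom_type":
        raise RuntimeError("simulated per-row failure")
    return real_search(pattern, string, *args, **kwargs)

def get_columns_like(rows):
    """Stand-in for the per-row loop in get_columns with a broad try/except."""
    result = []
    for col_name, col_type, _comment in rows:
        try:
            match = re.search(r"^\w+", col_type)
            result.append((col_name, match.group(0)))
        except Exception:  # pylint: disable=broad-except
            continue  # one bad row must not drop the rest of the table
    return result

rows = [
    ("good1", "bigint", None),
    ("bad", "boom_type", None),   # poisoned: re.search raises for this row
    ("good2", "string", None),
]
with patch("re.search", side_effect=exploding_search):
    cols = get_columns_like(rows)

print(cols)  # -> [('good1', 'bigint'), ('good2', 'string')]
```

Unlike the `"quintuplet"` fixture, this test would actually fail if the broad per-row `try/except` were removed, because the `RuntimeError` would propagate out of the loop.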
…arkers

DESCRIBE TABLE EXTENDED emits '#'-prefixed section markers beyond the hardcoded whitelist (e.g. '# Metadata Columns' from Spark v2 DescribeTableExec), causing re.search on a None col_type to raise TypeError and sql_column_handler to drop all columns for the table. Treat any '#'-prefixed row, empty col_type, or col_type with no leading word chars as end-of-columns and break. Add a per-row try/except so unexpected failures skip one column instead of the whole table. Use len(result) for ordinal_position to keep values contiguous.
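A self-contained sketch of the end state this commit message describes — the function name and row tuples are illustrative, not the actual `metadata.py` implementation:

```python
import re

def parse_describe_rows(rows):
    """Illustrative version of the column loop after this change."""
    result = []
    for col_name, col_type, _comment in rows:
        # Any '#'-prefixed row, empty/None col_type, or col_type without a
        # leading word character marks the end of the column section.
        if (
            col_name.startswith("#")
            or not col_type
            or re.search(r"^\w+", col_type) is None
        ):
            break
        try:
            result.append(
                {
                    "name": col_name,
                    "type": re.search(r"^\w+", col_type).group(0),
                    # len(result) + 1 keeps ordinal positions contiguous
                    # even when an earlier bad row was skipped.
                    "ordinal_position": len(result) + 1,
                }
            )
        except Exception:  # pylint: disable=broad-except
            continue  # skip one bad column instead of dropping the table
    return result

rows = [
    ("id", "bigint", None),
    ("payload", "struct<a:int,b:string>", None),
    ("# Metadata Columns", "", ""),   # unknown marker: stop, don't crash
    ("Location", "s3://bucket/x", ""),
]
cols = parse_describe_rows(rows)
print([c["name"] for c in cols])  # -> ['id', 'payload']
```

The unknown `# Metadata Columns` marker ends the column section cleanly instead of raising `TypeError`, and the metadata row after it is never misread as a column.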
Force-pushed dbe2749 to 0edfac7 (compare).
Code Review ✅ Approved — 1 resolved / 1 finding
Corrects Databricks column type parsing and removes the unused …
✅ 1 resolved — Quality: test helper
…arkers (#27610)

DESCRIBE TABLE EXTENDED emits '#'-prefixed section markers beyond the hardcoded whitelist (e.g. '# Metadata Columns' from Spark v2 DescribeTableExec), causing re.search on a None col_type to raise TypeError and sql_column_handler to drop all columns for the table. Treat any '#'-prefixed row, empty col_type, or col_type with no leading word chars as end-of-columns and break. Add a per-row try/except so unexpected failures skip one column instead of the whole table. Use len(result) for ordinal_position to keep values contiguous.

Co-authored-by: Pablo Takara <pjt1991@gmail.com>



Describe your changes:
Fixes
I worked on ... because ...
Type of change:
Checklist:
Fixes <issue-number>: <short explanation>

Summary by Gitar
- Treat any `#`-prefixed marker or empty `col_type` as end-of-columns to prevent `TypeError`.
- Add a per-row `try`/`except` to column processing to ensure metadata for surviving columns is still captured.
- Handle unparseable `col_type` values by skipping the column and logging a warning.
- Add unit tests for `get_columns` to verify section boundary handling and failure recovery.

This will update automatically on new commits.