fix(clickhouse): add downstream lineage for MATERIALIZED VIEW TO clause #27628
Jtss-ux wants to merge 1 commit into open-metadata:main
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
6e4e90b to 0f3b173
@harshach @nikhilchennam - I've addressed the edge case flagged by Gitar (tightened the regex to stop at ENGINE/POPULATE/SETTINGS) and added 6 unit tests covering the main variants (simple TO, ENGINE clause, IF NOT EXISTS, ON CLUSTER, no-TO passthrough, and end-to-end lineage validation). The branch is also rebased on latest main. Happy to make any further adjustments! |
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
🟡 Playwright Results: all passed (22 flaky)
✅ 3951 passed · ❌ 0 failed · 🟡 22 flaky · ⏭️ 86 skipped
🟡 22 flaky test(s) (passed on retry)
How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
|
34f92d4 to 99bb91c
99bb91c to e89a464
_CLICKHOUSE_MV_TO_RE = re.compile(
    r"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+"
    r"(?:IF\s+NOT\s+EXISTS\s+)?"  # optional IF NOT EXISTS
    r"(?:`[^`]+`|\S+)\s+"  # skip MV name (handles quoted names with spaces)
💡 Edge Case: MV-name skip group fails for multi-part backtick-quoted names
The regex group that skips the MV name on line 85, (?:`[^`]+`|\S+)\s+, only matches a single backtick-quoted segment. For a multi-part backtick-quoted MV name like `my schema`.`my mv`, it would match `my schema` and then expect whitespace, but instead encounter `.`, causing the entire regex to fail and silently skip the rewrite.
This is an edge case (ClickHouse MV names with spaces in both schema and table parts are rare), but it would result in a silent false-negative where the lineage is not captured.
Suggested fix:
Change the MV-name skip group to repeat, matching the same pattern as the target capture:
r"(?:(?:`[^`]+`|\S+)\.)*(?:`[^`]+`|\S+)\s+" # skip MV name (multi-part, quoted)
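The failure mode described in this finding can be reproduced in isolation. The sketch below is not the PR's code: it pairs the two skip groups quoted in the review with a simplified prefix and a hypothetical tail that only checks for the TO keyword, which is enough to show the difference between them. The names SINGLE, MULTI, PREFIX, TAIL and matches are illustrative.

```python
import re

# Skip group as quoted in the review: one segment only.
SINGLE = r"(?:`[^`]+`|\S+)\s+"
# Replacement suggested in the review: dot-separated segments.
MULTI = r"(?:(?:`[^`]+`|\S+)\.)*(?:`[^`]+`|\S+)\s+"

PREFIX = r"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+(?:IF\s+NOT\s+EXISTS\s+)?"
# Hypothetical tail (assumption): only require that TO follows the MV name.
TAIL = r"TO\b"


def matches(skip_group: str, sql: str) -> bool:
    """Return True if PREFIX + skip_group + TAIL matches the start of sql."""
    return re.match(PREFIX + skip_group + TAIL, sql, re.IGNORECASE) is not None


quoted = "CREATE MATERIALIZED VIEW `my schema`.`my mv` TO db.target AS SELECT 1"
plain = "CREATE MATERIALIZED VIEW db.mv TO db.target AS SELECT 1"

print(matches(SINGLE, quoted))  # single-segment group on a two-part quoted name
print(matches(MULTI, quoted))   # repeated group on the same name
print(matches(SINGLE, plain))   # unquoted names work with either group
```

With the single-segment group, the quoted two-part name does not match, so a rewrite keyed on this pattern would be skipped without any error. The repeated group matches both inputs.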
# The character class for <target> handles backtick-quoted segments with spaces
# and stops at the first whitespace / opening paren NOT inside backticks.
💡 Quality: Stray indentation on module-level comment
Line 80 has a 4-space indent on a module-level comment ( # The character class for <target>...). While Python ignores comment indentation, this visually suggests it belongs to a code block rather than continuing the module-level documentation block above it.
Code Review 👍 Approved with suggestions · 9 resolved / 11 findings
Implements downstream lineage for ClickHouse MATERIALIZED VIEW TO clauses and resolves multiple regex and null-pointer issues. Address the remaining regex mismatch for multi-part backtick-quoted names and correct the stray module-level indentation.
💡 Edge Case: MV-name skip group fails for multi-part backtick-quoted names
📄 ingestion/src/metadata/ingestion/lineage/parser.py:85
💡 Quality: Stray indentation on module-level comment
📄 ingestion/src/metadata/ingestion/lineage/parser.py:80-81
✅ 9 resolved
✅ Quality: Move



What does this PR do?
Fixes missing downstream lineage for ClickHouse CREATE MATERIALIZED VIEW ... TO <target> statements (closes #26265).
When ClickHouse defines a Materialized View with a TO <schema>.<table> clause, OpenMetadata's lineage parser previously registered the MV name as the downstream write-target, not the TO table. As a result, downstream lineage links were silently dropped for all ClickHouse Materialized Views using the TO clause.

Root cause
sqllineage (and all three of its backends: SqlGlot, SqlFluff, SqlParse) treats CREATE MATERIALIZED VIEW mv_name as a CREATE targeting mv_name. The TO <target> clause is ClickHouse-specific DDL that none of the generic SQL parsers understand, so the correct downstream table is silently ignored.

Fix
A pre-compiled module-level regex _CLICKHOUSE_MV_TO_RE in parser.py detects any CREATE MATERIALIZED VIEW ... TO <target> ... AS SELECT ... statement and rewrites it to CREATE TABLE <target> AS SELECT ... before the query reaches the lineage runner.

Why normalise at this layer?
Fixing it in clean_raw_query() means all three downstream parsers benefit automatically with a single, testable change. No synthetic queries are emitted, so query history and Elasticsearch are not polluted with DDL that never ran.

SQL forms covered
All three forms from the ClickHouse docs and the original bug report:
ENGINE, POPULATE, SETTINGS, and DEFINER clauses between the TO target and AS SELECT are all correctly skipped.

Changes
ingestion/src/metadata/ingestion/lineage/parser.py
- _CLICKHOUSE_MV_TO_RE constant with full inline documentation
- clean_raw_query() to use the compiled regex
ingestion/tests/unit/lineage/test_sql_lineage.py
- ClickhouseMaterializedViewLineageTest covering all SQL variants

Checklist
- black and isort formatting applied (py-checkstyle passes)
- re module
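
For readers who want to see the normalisation described under "Fix" in concrete terms, here is a minimal sketch. It is not the PR's implementation: the pattern _MV_TO_SKETCH_RE and the function rewrite_mv_to are hypothetical names written for this illustration, and the real _CLICKHOUSE_MV_TO_RE may differ in how it handles clause boundaries. The sketch assumes the body starts with AS SELECT and does not attempt forms such as AS WITH ... or AS (SELECT ...).

```python
import re

# Illustrative pattern only (assumption); not the PR's _CLICKHOUSE_MV_TO_RE.
_NAME = r"(?:(?:`[^`]+`|[^\s`.(]+)\.)*(?:`[^`]+`|[^\s`.(]+)"

_MV_TO_SKETCH_RE = re.compile(
    r"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+"
    r"(?:IF\s+NOT\s+EXISTS\s+)?"          # optional IF NOT EXISTS
    + _NAME + r"\s+"                      # skip the MV name
    + r"(?:ON\s+CLUSTER\s+\S+\s+)?"       # optional ON CLUSTER <name>
    + r"TO\s+(?P<target>" + _NAME + r")"  # capture the TO target
    + r".*?\bAS\s+(?=SELECT\b)",          # skip clauses up to AS SELECT
    re.IGNORECASE | re.DOTALL,
)


def rewrite_mv_to(query: str) -> str:
    """Rewrite CREATE MATERIALIZED VIEW ... TO <target> ... AS SELECT ...
    into CREATE TABLE <target> AS SELECT ...; return other queries unchanged."""
    match = _MV_TO_SKETCH_RE.match(query)
    if match is None:
        return query
    return f"CREATE TABLE {match.group('target')} AS {query[match.end():]}"


print(rewrite_mv_to(
    "CREATE MATERIALIZED VIEW db.mv TO db.target AS SELECT a FROM db.src"
))
```

Because the rewritten statement is ordinary CREATE TABLE ... AS SELECT, a generic SQL lineage parser can treat the TO table as the write target without knowing anything about ClickHouse.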