fix(clickhouse): add downstream lineage for MATERIALIZED VIEW TO clause #27628

Open
Jtss-ux wants to merge 1 commit into open-metadata:main from Jtss-ux:fix/clickhouse-materialized-view-lineage

Conversation

@Jtss-ux Jtss-ux commented Apr 22, 2026

What does this PR do?

Fixes missing downstream lineage for ClickHouse CREATE MATERIALIZED VIEW ... TO <target> statements (closes #26265).

When ClickHouse defines a Materialized View with a TO <schema>.<table> clause, OpenMetadata's lineage parser previously registered the MV name as the downstream write-target — not the TO table. This means downstream lineage links were silently dropped for all ClickHouse Materialized Views using the TO clause.

Root cause

sqllineage (and all three of its backends: SqlGlot, SqlFluff, SqlParse) treat CREATE MATERIALIZED VIEW mv_name as a CREATE targeting mv_name. The TO <target> clause is ClickHouse-specific DDL that none of the generic SQL parsers understand, so the correct downstream table is silently ignored.

Fix

A pre-compiled module-level regex _CLICKHOUSE_MV_TO_RE in parser.py detects any CREATE MATERIALIZED VIEW ... TO <target> ... AS SELECT ... statement and rewrites it to CREATE TABLE <target> AS SELECT ... before the query reaches the lineage runner.

Why normalise at this layer?
Fixing it in clean_raw_query() means all three downstream parsers benefit automatically with a single, testable change. No synthetic queries are emitted, so query history and Elasticsearch are not polluted with DDL that never ran.
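
The normalisation described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names `MV_TO_RE` and `rewrite_mv_to` are hypothetical stand-ins for `_CLICKHOUSE_MV_TO_RE` and the logic inside `clean_raw_query()`, and the real pattern in `parser.py` may differ.

```python
import re

# Hypothetical stand-in for the PR's _CLICKHOUSE_MV_TO_RE (assumption, not the real pattern).
_IDENT = r"(?:`[^`]+`|[^\s.(`]+)"          # one name part, optionally backtick-quoted
_NAME = rf"{_IDENT}(?:\.{_IDENT})*"        # dotted name such as schema.table

MV_TO_RE = re.compile(
    r"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+"
    r"(?:IF\s+NOT\s+EXISTS\s+)?"            # optional IF NOT EXISTS
    rf"{_NAME}\s+"                          # skip the MV name
    r"(?:(?!\bAS\s+SELECT\b).)*?"           # skip ON CLUSTER / REFRESH, never past AS SELECT
    rf"\bTO\s+(?P<target>{_NAME})"          # capture the TO target
    r"(?:(?!\bAS\s+SELECT\b).)*?"           # skip column list, ENGINE, DEFINER, ...
    r"\bAS\s+(?P<select>SELECT\b.*)$",      # keep the SELECT body unchanged
    re.IGNORECASE | re.DOTALL,
)


def rewrite_mv_to(query: str) -> str:
    """Rewrite CREATE MATERIALIZED VIEW ... TO t ... AS SELECT to CREATE TABLE t AS SELECT."""
    match = MV_TO_RE.match(query)
    if match is None:
        return query  # no TO clause: pass the query through unchanged
    return f"CREATE TABLE {match['target']} AS {match['select']}"


if __name__ == "__main__":
    print(rewrite_mv_to(
        "CREATE MATERIALIZED VIEW schema.mv TO schema.target "
        "AS SELECT * FROM schema.source"
    ))
```

Because the rewrite happens before any parser sees the query, a statement without a TO clause is returned untouched and keeps its existing behaviour.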

SQL forms covered

All three forms from the ClickHouse docs and the original bug report:

-- 1. Simple form
CREATE MATERIALIZED VIEW [IF NOT EXISTS] schema.mv [ON CLUSTER c]
TO schema.target AS SELECT * FROM schema.source;

-- 2. Column-list form (first example in #26265)
CREATE MATERIALIZED VIEW schema.mv
TO schema.target (column_01, column_02)
AS SELECT column_01, column_02 FROM schema.source;

-- 3. REFRESH EVERY + DEFINER form (second example in #26265)
CREATE MATERIALIZED VIEW schema.mv
REFRESH EVERY 3 HOUR
TO schema.target (column_01, column_02)
DEFINER = user SQL SECURITY DEFINER
AS SELECT * FROM schema.source;

ENGINE, POPULATE, SETTINGS, and DEFINER clauses between the TO target and AS SELECT are all correctly skipped.

Changes

  • ingestion/src/metadata/ingestion/lineage/parser.py

    • Added pre-compiled _CLICKHOUSE_MV_TO_RE constant with full inline documentation
    • Updated clean_raw_query() to use the compiled regex
  • ingestion/tests/unit/lineage/test_sql_lineage.py

    • Added 8 unit tests in ClickhouseMaterializedViewLineageTest covering all SQL variants

Checklist

  • Unit tests for all three SQL forms from the bug report
  • black and isort formatting applied (py-checkstyle passes)
  • No new imports — uses existing re module
  • Branch up to date with open-metadata:main

@Jtss-ux Jtss-ux requested a review from a team as a code owner April 22, 2026 11:26
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/lineage/parser.py Outdated
Comment thread ingestion/src/metadata/ingestion/lineage/parser.py Outdated
@Jtss-ux Jtss-ux force-pushed the fix/clickhouse-materialized-view-lineage branch from 6e4e90b to 0f3b173 Compare April 22, 2026 11:46

@Jtss-ux
Author

Jtss-ux commented Apr 22, 2026

@harshach @nikhilchennam - I've addressed the edge case flagged by Gitar (tightened the regex to stop at ENGINE/POPULATE/SETTINGS) and added 6 unit tests covering the main variants (simple TO, ENGINE clause, IF NOT EXISTS, ON CLUSTER, no-TO passthrough, and end-to-end lineage validation). The branch is also rebased on latest main. Happy to make any further adjustments!

@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label Apr 23, 2026
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Contributor

github-actions Bot commented Apr 23, 2026

🟡 Playwright Results — all passed (22 flaky)

✅ 3951 passed · ❌ 0 failed · 🟡 22 flaky · ⏭️ 86 skipped

Shard | Passed | Failed | Flaky | Skipped
🟡 Shard 1 | 296 | 0 | 3 | 4
🟡 Shard 2 | 750 | 0 | 9 | 8
🟡 Shard 3 | 729 | 0 | 3 | 7
🟡 Shard 4 | 756 | 0 | 3 | 18
✅ Shard 5 | 687 | 0 | 0 | 41
🟡 Shard 6 | 733 | 0 | 4 | 8

🟡 22 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Data Product - customization should work (shard 1, 1 retry)
  • Features/CustomizeDetailPage.spec.ts › customize tab label should only render if it's customize by user (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › EditAll User: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with TEST_CASE.EDIT_ALL can see edit action on test case (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › 3-level domain hierarchy: SubSubDomain assets visible when SubDomain selected (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Search suggestions should be filtered by selected domain (shard 2, 2 retries)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Table (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Domain Rbac (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Delete Table (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Create glossary, change language to Dutch, and delete glossary (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@Jtss-ux Jtss-ux force-pushed the fix/clickhouse-materialized-view-lineage branch from 34f92d4 to 99bb91c Compare April 25, 2026 20:58
Comment thread ingestion/src/metadata/ingestion/lineage/parser.py Outdated
@Jtss-ux Jtss-ux force-pushed the fix/clickhouse-materialized-view-lineage branch from 99bb91c to e89a464 Compare April 25, 2026 21:08
_CLICKHOUSE_MV_TO_RE = re.compile(
r"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+"
r"(?:IF\s+NOT\s+EXISTS\s+)?" # optional IF NOT EXISTS
r"(?:`[^`]+`|\S+)\s+" # skip MV name (handles quoted names with spaces)

💡 Edge Case: MV-name skip group fails for multi-part backtick-quoted names

The regex group that skips the MV name on line 85, (?:`[^`]+`|\S+)\s+, only matches a single backtick-quoted segment. For a multi-part backtick-quoted MV name like `my schema`.`my mv`, it would match `my schema` and then expect whitespace, but instead encounter a dot, causing the entire regex to fail and silently skip the rewrite.

This is an edge case (ClickHouse MV names with spaces in both schema and table parts are rare), but it would result in a silent false-negative where the lineage is not captured.

Suggested fix:

Change the MV-name skip group to repeat, matching the same pattern as the target capture:

  r"(?:(?:`[^`]+`|\S+)\.)*(?:`[^`]+`|\S+)\s+"  # skip MV name (multi-part, quoted)
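
To see the difference between the two skip groups, the sketch below embeds each one in a truncated pattern that only covers the prefix up to the TO keyword. The surrounding prefix and the sample statements are assumptions made for illustration; only the two skip groups are taken from this review.

```python
import re

# Assumed prefix; the full pattern in parser.py contains more than this.
_PREFIX = r"^\s*CREATE\s+MATERIALIZED\s+VIEW\s+(?:IF\s+NOT\s+EXISTS\s+)?"

# Skip group as currently written: a single quoted or unquoted segment.
single_segment = re.compile(
    _PREFIX + r"(?:`[^`]+`|\S+)\s+" + r"TO\s+", re.IGNORECASE
)

# Suggested skip group: any number of dot-separated segments.
multi_part = re.compile(
    _PREFIX + r"(?:(?:`[^`]+`|\S+)\.)*(?:`[^`]+`|\S+)\s+" + r"TO\s+", re.IGNORECASE
)

quoted = "CREATE MATERIALIZED VIEW `my schema`.`my mv` TO db.target AS SELECT 1"
plain = "CREATE MATERIALIZED VIEW db.mv TO db.target AS SELECT 1"

if __name__ == "__main__":
    for label, pattern in (("single", single_segment), ("multi", multi_part)):
        print(label, bool(pattern.match(quoted)), bool(pattern.match(plain)))
```

With these inputs the single-segment pattern matches only the unquoted name, while the repeated form matches both.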


Comment on lines +80 to +81
# The character class for <target> handles backtick-quoted segments with spaces
# and stops at the first whitespace / opening paren NOT inside backticks.

💡 Quality: Stray indentation on module-level comment

Line 80 has a 4-space indent on a module-level comment (    # The character class for <target>...). While Python ignores comment indentation, this visually suggests it belongs to a code block rather than continuing the module-level documentation block above it.


@gitar-bot

gitar-bot Bot commented Apr 25, 2026

Code Review 👍 Approved with suggestions 9 resolved / 11 findings

Implements downstream lineage for ClickHouse MATERIALIZED VIEW TO clauses and resolves multiple regex and null-pointer issues. Address the remaining regex mismatch for multi-part backtick-quoted names and correct the stray module-level indentation.

💡 Edge Case: MV-name skip group fails for multi-part backtick-quoted names

📄 ingestion/src/metadata/ingestion/lineage/parser.py:85

The regex group that skips the MV name on line 85, (?:`[^`]+`|\S+)\s+, only matches a single backtick-quoted segment. For a multi-part backtick-quoted MV name like `my schema`.`my mv`, it would match `my schema` and then expect whitespace, but instead encounter a dot, causing the entire regex to fail and silently skip the rewrite.

This is an edge case (ClickHouse MV names with spaces in both schema and table parts are rare), but it would result in a silent false-negative where the lineage is not captured.

Suggested fix
Change the MV-name skip group to repeat, matching the same pattern as the target capture:

  r"(?:(?:`[^`]+`|\S+)\.)*(?:`[^`]+`|\S+)\s+"  # skip MV name (multi-part, quoted)
💡 Quality: Stray indentation on module-level comment

📄 ingestion/src/metadata/ingestion/lineage/parser.py:80-81

Line 80 has a 4-space indent on a module-level comment (    # The character class for <target>...). While Python ignores comment indentation, this visually suggests it belongs to a code block rather than continuing the module-level documentation block above it.

✅ 9 resolved
Quality: Move import re to module level

📄 ingestion/src/metadata/ingestion/lineage/parser.py:507
The import re statement is placed inside clean_raw_query. Since re is a stdlib module and the method is a classmethod that may be called frequently, the import should be at the top of the file with the other imports. This aligns with PEP 8 and the rest of the codebase's import style.

Edge Case: Regex captures extra clauses (ENGINE/POPULATE) as target table

📄 ingestion/src/metadata/ingestion/lineage/parser.py:521
ClickHouse CREATE MATERIALIZED VIEW supports optional clauses between the TO <target> and AS SELECT, such as ENGINE = MergeTree() ORDER BY id or POPULATE. The non-greedy (.*?) in group 2 will absorb these into the target table name.

Example: CREATE MATERIALIZED VIEW mv TO target POPULATE AS SELECT * FROM src
→ group 2 = target POPULATE
→ rewritten as CREATE TABLE target POPULATE AS SELECT * FROM src

This would cause sqllineage to fail to parse the rewritten query, producing no lineage at all. While POPULATE is deprecated, ENGINE clauses are common in production ClickHouse schemas.
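
The absorption described in this finding can be reproduced with a small sketch. Both patterns below are illustrative assumptions rather than the exact regex from any revision of the PR: `loose` uses a non-greedy `(.*?)` for the target, and `tight` captures a single whitespace-free token and skips the remaining clauses outside the group.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Assumed "loose" form: the lazy group grows until it reaches AS SELECT.
loose = re.compile(
    r"CREATE\s+MATERIALIZED\s+VIEW\s+(\S+)\s+TO\s+(.*?)\s+AS\s+SELECT", FLAGS
)

# Assumed "tight" form: the target is one token; extra clauses are skipped uncaptured.
tight = re.compile(
    r"CREATE\s+MATERIALIZED\s+VIEW\s+(\S+)\s+TO\s+(\S+).*?\s+AS\s+SELECT", FLAGS
)

populate = "CREATE MATERIALIZED VIEW mv TO target POPULATE AS SELECT * FROM src"
engine = (
    "CREATE MATERIALIZED VIEW mv TO target "
    "ENGINE = MergeTree() ORDER BY id AS SELECT * FROM src"
)

if __name__ == "__main__":
    for sql in (populate, engine):
        print(repr(loose.match(sql).group(2)), repr(tight.match(sql).group(2)))
```

In both statements the loose pattern's second group includes the trailing clause text, while the tight pattern yields only `target`.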

Quality: No unit tests for ClickHouse MV TO rewrite logic

The existing test file ingestion/tests/unit/test_query_parser.py has tests for other clean_raw_query transformations (COPY GRANTS, MERGE INTO, COPY FROM, CREATE TRIGGER, etc.), but no tests were added for the new ClickHouse MATERIALIZED VIEW TO rewrite. This regex has non-trivial matching behavior that should be covered.

Suggested test cases:

  • Basic: CREATE MATERIALIZED VIEW db.mv TO db.target AS SELECT * FROM db.src
  • With IF NOT EXISTS
  • Without TO clause (should pass through unchanged)
  • With schema-qualified and backtick-quoted names
  • With ON CLUSTER clause before TO
Bug: NullPointerException in mergeTags when entity has no tags

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SuggestionRepository.java:193-207
The mergeTags method iterates over existingTags with a for-each loop (line 210) without a null check. EntityInterface.getTags() returns null by default, so when a suggestion of type SuggestTagLabel is accepted on an entity that has no tags, entity.getTags() passes null as existingTags, causing a NullPointerException.

This is reachable via the acceptSuggestion path at line 179–180 in SuggestionWorkflow.acceptSuggestion.

Bug: acceptAllSuggestions passes wrong status to permission check

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/feeds/SuggestionsResource.java:298-299
In acceptAllSuggestions, checkPermissionsForAcceptOrRejectSuggestion is called with SuggestionStatus.Rejected (line 299) instead of SuggestionStatus.Accepted. While the authorization logic is identical for both statuses, the status is used in the error message when authorization fails. This means unauthorized users see an error about rejecting when they tried to accept, which is confusing and suggests a copy-paste mistake from the rejectAllSuggestions endpoint.

...and 4 more resolved from earlier reviews

@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Labels

safe to test Add this label to run secure Github workflows on PRs

Development

Successfully merging this pull request may close these issues.

[Clickhouse Linage] Missing downsteam for MATERIALIZED VIEW

2 participants