Add source filter and indexed hash prefix to cert tag batch query#27847
sonika-shah merged 4 commits into main
The certification tag batch query (`TagUsageDAO.getCertTagsInternalBatch`) was hitting ~12 seconds per call on instances with deep classification hierarchies. It fired ~5,800 times per Data Insights (DI) run, contributing ~19 hrs of cumulative DB time per DI run.

Two missing index-friendly predicates caused the slowness:
1. No `source = ?` filter: the query couldn't use `idx_tag_usage_target_exact (source, targetFQNHash, state) INCLUDE (tagFQN, labelType)`, whose covering INCLUDE has `tagFQN`.
2. `tagFQN LIKE 'Certification.%'` on the raw column: there's no LIKE-friendly index on raw `tagFQN`, only on `tagfqn_lower text_pattern_ops` and `tagFQNHash`. The LIKE always ran as a post-filter on every row the IN clause returned.

Fix:
- Add `source = :source` filter (Certifications are always Classification, source = 0).
- Switch `tagFQN LIKE :tagFQNPrefix` → `tagFQNHash LIKE :tagFQNHashPrefix`, with the hash prefix pre-computed via `FullyQualifiedName.buildHash` so the query hits the indexed hash column.

Same SQL on MySQL and Postgres; no `@ConnectionAwareSqlQuery` split needed.

Also a correctness improvement: the `source = 0` filter excludes glossary terms (source = 1) that happen to have FQNs starting with "Certification.". Previously such glossary terms could be incorrectly returned as certifications; now they're excluded as expected.

Test:
- Added `test_certBatch_bulkFetchReturnsCorrectCertsPerEntity` in `TagResourceIT`. It exercises the bulk fetch path with three schemas (cert-tagged / untagged / non-cert-tagged) and asserts each gets the right certification (or null) in the listed response. This locks in source-filter correctness and prevents future regressions where a non-cert tag could leak into the certification field.
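As a quick consistency check on the figures quoted at the top of this description, the per-call latency and the call count do multiply out to the cumulative number. This treats ~12 s as a flat average per call, which is an assumption rather than something measured here:

```latex
12\ \tfrac{\text{s}}{\text{call}} \times 5{,}800\ \text{calls}
  = 69{,}600\ \text{s}
  = \frac{69{,}600}{3{,}600}\ \text{h}
  \approx 19.3\ \text{h}
```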
edfb21d to e832508
Pull request overview
Optimizes certification tag batch fetching by making the getCertTagsInternalBatch query index-friendly (adding a source predicate and filtering by tagFQNHash prefix), and updates repository call sites accordingly to pass the new parameters. Adds an integration test intended to validate correctness of the bulk certification fetch path.
Changes:
- Update `CollectionDAO.TagUsageDAO.getCertTagsInternalBatch` SQL to filter by `source` and `tagFQNHash LIKE :prefix`.
- Update `EntityRepository` call sites to pass `TagSource.CLASSIFICATION` and a `FullyQualifiedName.buildHash(certClassification) + ".%"` prefix.
- Add a new IT covering bulk list behavior for certification population.
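
The prefix passed by the call sites can be pictured with the sketch below. It assumes the hierarchical hash is the MD5 hex digest of each dot-separated FQN segment, rejoined with dots. `CertHashPrefixSketch`, `md5Hex`, `buildHashSketch`, and `likePrefix` are hypothetical names, the tag name `Gold` is only an example, and the real `FullyQualifiedName.buildHash` may differ (for instance in how it treats quoted segments).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class CertHashPrefixSketch {

  // Hypothetical per-segment hash: lowercase hex MD5 of one FQN segment.
  static String md5Hex(String part) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(part.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 digest unavailable", e);
    }
  }

  // Assumed hierarchical layout: hash each dot-separated segment, rejoin with dots.
  // Simplification: quoted segments that themselves contain dots are not handled.
  static String buildHashSketch(String fqn) {
    String[] parts = fqn.split("\\.");
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) {
        out.append('.');
      }
      out.append(md5Hex(parts[i]));
    }
    return out.toString();
  }

  // LIKE pattern matching every tag hash under the given classification.
  static String likePrefix(String classification) {
    return buildHashSketch(classification) + ".%";
  }

  public static void main(String[] args) {
    String pattern = likePrefix("Certification");
    String childHash = buildHashSketch("Certification.Gold");
    // Strip the trailing % to get the literal prefix the LIKE anchors on.
    String literalPrefix = pattern.substring(0, pattern.length() - 1);
    System.out.println(pattern);
    System.out.println(childHash.startsWith(literalPrefix)); // prints true
  }
}
```

Under this assumption every child hash starts with its parent's hash plus a dot, so the `LIKE` is a left-anchored prefix match, which is the shape an index on the hash column can typically serve.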
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java | Updates certification fetches to pass source and hashed prefix into the batch DAO query. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Makes cert batch query use source and tagFQNHash prefix for index usage and correctness. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/TagResourceIT.java | Adds a new integration test to validate bulk certification fetching behavior. |
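
The correctness side of the change can be illustrated in memory. The sketch below is not the DAO code: `CertFilterSketch`, `Row`, and the sample rows are hypothetical, the hash strings are placeholders rather than real MD5 values, and both predicates are simplified to the conditions named in the PR description.

```java
import java.util.ArrayList;
import java.util.List;

class CertFilterSketch {
  static final int SOURCE_CLASSIFICATION = 0; // source values as stated in the PR description
  static final int SOURCE_GLOSSARY = 1;

  // Hypothetical stand-in for a tag_usage row, reduced to the columns the predicates read.
  static final class Row {
    final int source;
    final String tagFQN;
    final String tagFQNHash; // placeholder strings, not real hashes
    final String targetFQNHash;

    Row(int source, String tagFQN, String tagFQNHash, String targetFQNHash) {
      this.source = source;
      this.tagFQN = tagFQN;
      this.tagFQNHash = tagFQNHash;
      this.targetFQNHash = targetFQNHash;
    }
  }

  // Old shape: targetFQNHash IN (...) AND tagFQN LIKE '<fqnPrefix>%'
  static List<Row> oldFilter(List<Row> rows, List<String> targets, String fqnPrefix) {
    List<Row> out = new ArrayList<>();
    for (Row r : rows) {
      if (targets.contains(r.targetFQNHash) && r.tagFQN.startsWith(fqnPrefix)) {
        out.add(r);
      }
    }
    return out;
  }

  // New shape: source = :source AND targetFQNHash IN (...) AND tagFQNHash LIKE '<hashPrefix>%'
  static List<Row> newFilter(List<Row> rows, int source, List<String> targets, String hashPrefix) {
    List<Row> out = new ArrayList<>();
    for (Row r : rows) {
      if (r.source == source
          && targets.contains(r.targetFQNHash)
          && r.tagFQNHash.startsWith(hashPrefix)) {
        out.add(r);
      }
    }
    return out;
  }

  // One real certification, one glossary term with a look-alike FQN, one unrelated tag.
  static List<Row> sampleRows() {
    return List.of(
        new Row(SOURCE_CLASSIFICATION, "Certification.Gold", "hCert.hGold", "t1"),
        new Row(SOURCE_GLOSSARY, "Certification.Internal", "hCert.hInternal", "t2"),
        new Row(SOURCE_CLASSIFICATION, "PII.Sensitive", "hPII.hSensitive", "t3"));
  }

  public static void main(String[] args) {
    List<String> targets = List.of("t1", "t2", "t3");
    List<Row> rows = sampleRows();
    // The old predicate also returns the glossary row; the new one does not.
    System.out.println(oldFilter(rows, targets, "Certification.").size());
    System.out.println(newFilter(rows, SOURCE_CLASSIFICATION, targets, "hCert.").size());
  }
}
```

Note that the glossary row shares the same hash prefix as the real certification in this sample, so in the sketch it is the `source` condition, not the hash prefix, that keeps it out of the result.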

The Java checkstyle failed. Please run … You can install the pre-commit hooks with …
…ernalBatch signature
Code Review ✅ Approved (Gitar)
Implements source filtering and indexed hash prefixes in the certification tag batch query. No issues found.
🟡 Playwright Results: all passed (19 flaky)
✅ 3979 passed · ❌ 0 failed · 🟡 19 flaky · ⏭️ 86 skipped
🟡 19 flaky test(s) (passed on retry)
How to debug locally:

    # Download playwright-test-results-<shard> artifact and unzip
    npx playwright show-trace path/to/trace.zip # view trace

Changes have been cherry-picked to the 1.12.7 branch.
…7847)
* Add source filter and use indexed hash prefix in cert tag batch query
* Fix duplicate schema names in cert batch test, trim verbose comments
* Update EntityRepositoryCertificationTest mocks for new getCertTagsInternalBatch signature
* fix check style

(cherry picked from commit 4a2f42f)
Changes have been cherry-picked to the 1.13 branch.
…7847)
* Add source filter and use indexed hash prefix in cert tag batch query
* Fix duplicate schema names in cert batch test, trim verbose comments
* Update EntityRepositoryCertificationTest mocks for new getCertTagsInternalBatch signature
* fix check style

(cherry picked from commit 4a2f42f)
…en-metadata#27847)
* Add source filter and use indexed hash prefix in cert tag batch query
* Fix duplicate schema names in cert batch test, trim verbose comments
* Update EntityRepositoryCertificationTest mocks for new getCertTagsInternalBatch signature
* fix check style



Summary

The certification tag batch query (`TagUsageDAO.getCertTagsInternalBatch`) was running at ~12 seconds per call on instances with heavy classification hierarchies and fired ~5,800 times per Data Insights run, contributing roughly 19 hours of cumulative DB time per DI run on a customer instance with deep nested containers.

Root cause

Two missing index-friendly predicates in the existing SQL:
1. No `source` filter: the query couldn't use `idx_tag_usage_target_exact (source, targetFQNHash, state) INCLUDE (tagFQN, labelType)`, whose covering INCLUDE has `tagFQN`.
2. `tagFQN LIKE 'Certification.%'` on the raw column: there's no LIKE-friendly index on raw `tagFQN`. Only `tagfqn_lower text_pattern_ops` and `tagFQNHash` are indexed for LIKE patterns. The LIKE always ran as a post-filter on every row the IN clause returned.

Changes

`CollectionDAO.TagUsageDAO.getCertTagsInternalBatch`

Caller updates (`EntityRepository`)

Two call sites: `getCertification()` (single-entity GET) and `batchFetchCertification()` (bulk LIST). Both updated to:
- Pass `TagLabel.TagSource.CLASSIFICATION.ordinal()` as `source`.
- Pass `FullyQualifiedName.buildHash(certClassification) + ".%"` instead of the raw `certClassification + ".%"`.

The hash is computed once per call via the existing `FullyQualifiedName.buildHash` helper (the same MD5 used by `@BindFQN` when storing the row), so the LIKE prefix matches the hierarchical hash format actually stored in `tag_usage.tagFQNHash`.

Correctness improvement (bonus)

The new `source = 0` filter excludes glossary terms (`source = 1`) that happen to have FQNs starting with "Certification.". Previously such glossary terms could be incorrectly returned as certifications via the unconstrained LIKE; now they're correctly excluded.

Cross-DB compatibility

All constructs (`source = ?`, `targetFQNHash IN (...)`, `tagFQNHash LIKE 'prefix%'`, `ORDER BY`) work identically on MySQL and Postgres. No `@ConnectionAwareSqlQuery` split needed.

Index usage (verified via EXPLAIN ANALYZE on RDS)

- `idx_tag_usage_target_fqn_hash` or `idx_tag_usage_target_source` for the `targetFQNHash IN (...)` clause
- `idx_tag_usage_join_source` (Postgres) / `idx_tag_usage_tag_fqn_hash` (MySQL) for the `tagFQNHash LIKE 'hash.%'` clause
- `source = 0` filter unlocks the `idx_tag_usage_target_exact` covering scan when the planner picks it

Tests

- `test_certificationTagNotLeakingIntoTagsField` (in `TagResourceIT`) already covers the happy path: single GET and bulk LIST both populate `certification` correctly and the cert tag does not leak into `tags`. Continues to pass with the new SQL.
- `test_certBatch_bulkFetchReturnsCorrectCertsPerEntity` exercises the bulk fetch path with three schemas:
  - cert-tagged: `certification` populated correctly
  - untagged: `certification == null` (no false positives from the IN list)
  - non-cert-tagged: `certification == null` (regression test for the `source` filter + hash prefix)

Performance impact
🤖 Generated with Claude Code