fix(it): stop flaky integration tests (#27649) by mohityadav766 · Pull Request #27651 · open-metadata/OpenMetadata

mohityadav766 · 2026-04-23T06:29:37Z

Summary

Fixes #27649 — flaky backend integration tests caused by writes silently being lost during in-flight reindexes. Most of the change is to production search-index write routing, not just test plumbing — see "Scope" below.

Scope

The originally-reported flakes (WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK, GlossaryOntologyExportIT, TagResourceIT) all surfaced the same underlying bug class: a reindex in flight discards live writes when the alias swaps. Fixing it cleanly required several production changes:

Removed the SearchIndexRetryQueue reindex-suspension mechanism — shouldSkipStreamingIndexing used to silently drop live writes when a reindex job was active. Replaced with always-write semantics. (commit 09e592397b)
Staged-index write routing — SearchRepository.activeStagedIndices (keyed by canonical index name) tracks indices being populated by a reindex; live writes go to the staged index directly so they survive the alias swap. Centralized in getWriteIndexName(IndexMapping) / routeToStagedIfActive(String) / getWriteFanoutTargets(String). (commits 52992bcdbc, 568c2d32e5, 95aa9235b4)
Re-register SearchIndexHandler on every SearchRepository init — EntityLifecycleEventDispatcher.registerHandler skips duplicates by name; tests construct SearchRepository 3 times via TestSuiteBootstrap so the dispatcher kept a stale handler bound to the migration-time instance. Now each construction unregisters the previous handler first. (commit ea093b0d86)
Always unregister staged-index routing in DefaultRecreateHandler — try/finally ensures unregisterStagedIndex runs on swap success, swap failure, empty-aliases, and exceptions; previously it could be left routing writes to an orphan index. (commit 37c343855a)
Fan out cross-alias bulk update-by-query — updateAssetDomainsForDataProduct/updateAssetDomainsByIds/updateDomainFqnByPrefix/updateAssetDomainFqnByPrefix originally targeted only GLOBAL_SEARCH_ALIAS / domain; the staged index missed the script and the change was lost on swap. Now they target the alias plus every active staged index. The scripts are idempotent so re-applying them to staged is safe. (commit 95aa9235b4)

Test-only changes:

GlossaryOntologyExportIT → @Isolated (RDF singleton bleed). commit 14cb35ff38
WorkflowDefinitionResourceIT#triggerWorkflow_SDK — drop redundant deploy-wait, add Awaitility aliases for actionable timeouts.
TestSuiteBootstrap — bump MySQL sort_buffer_size=8M and Postgres work_mem=32MB (TagDAO.listAfter pagination overflow). commit 0a4144c4a1
BaseEntityIT search-wait ceilings 90s → 180s for parity with checkCreatedEntity.

Test plan

mvn -pl openmetadata-service compile clean
mvn spotless:apply clean
Local repro: AppsResourceIT + WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK + RecognizerFeedbackRepositoryTest#testApplyFeedback_withRecognizerMetadata_shouldTargetSpecificRecognizer — 34/34 passed
Local: GlossaryOntologyExportIT (@Isolated) — 2/2 passed
CI run on the full integration-test matrix

Risk

The suspension-removal and staged-index routing are real production behaviour changes and should be reviewed accordingly. Trade-offs documented in each commit message.

🤖 Generated with Claude Code

@isolated

GlossaryOntologyExportIT: mark @isolated. @BeforeAll flips RdfUpdater (a JVM-wide singleton) on, which makes every concurrent test class start doing synchronous Fuseki writes on entity create, saturating the Dropwizard thread pool and causing 60s request timeouts. @execution (SAME_THREAD) alone only serialises within this class. WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop the redundant waitForWorkflowDeployment call — the create path already waits. Add descriptive aliases to the two await() polls so the next flake tells us which FQN or workflow name actually timed out instead of an anonymous lambda. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Fixes flaky backend integration tests by preventing cross-test interference from JVM-wide RDF toggling and improving Awaitility timeout diagnostics in workflow tests.

Changes:

Added JUnit @Isolated to GlossaryOntologyExportIT to prevent concurrent test classes from inheriting the global RdfUpdater configuration.
Removed a redundant workflow deployment wait in triggerWorkflow_SDK (deployment is already awaited during workflow creation).
Added descriptive aliases to Awaitility await(...) calls to make future timeouts actionable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java`	Removes redundant wait and adds Awaitility aliases for clearer timeout failures.
`openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryOntologyExportIT.java`	Isolates RDF export tests to avoid JVM-wide singleton leakage across parallel integration tests.

Live search indexing was silently skipped whenever a reindex job was in RUNNING/READY/STOPPING state. SearchRepository.createEntityIndex() and six sibling methods consulted SearchIndexRetryQueue.isEntityTypeSuspended() and returned early with nothing written, nothing enqueued — entities vanished from search until a future reindex happened to cover them. The retry worker doubled down: when the scope refresh observed an active job, it purged the retry queue; and processRecord() deleted records whose type was suspended. So even manually enqueued retries were wiped. This is how the #27649 flake surfaced: AppsResourceIT triggers SearchIndexingApplication runs and its best-effort 30s wait silently swallows timeouts. If a run was still RUNNING when AppsResourceIT finished, the next class in the sequential fork (WorkflowDefinitionResourceIT) inherited the suspension and its freshly-created tables were never indexed — waitForEntityIndexedInSearch then timed out at 120s. Same mechanism bites real users mid-reindex in production. Remove the suspension mechanism entirely: * SearchRepository — drop the 8 isEntityTypeSuspended() early-returns; the client-availability path already enqueues for retry on its own. * SearchIndexRetryWorker — drop refreshReindexSuspensionScopeIfNeeded() and the suspension branches in processRecord(); remove the retry-queue purge on suspendAll. * SearchIndexRetryQueue — delete the updateSuspension / clearSuspension / isEntityTypeSuspended / isStreamingSuspended / isSuspendAllStreaming / getSuspendedEntityTypes API and the static AtomicBoolean / AtomicReference they backed. * Drop the two IT cases that asserted the removed behaviour. Live writes now always reach the search client; reindex and live writes both target the same indices as before. Version conflicts between the two paths (stale reindex batch overwriting a newer live write) remain possible as they did before suspension was introduced — that is the race suspension was meant to dodge, but dropping writes altogether was worse than the race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Re-key activeStagedIndices by canonical index name (e.g. openmetadata_table_search_index) instead of entity type, and route every live-write site through a single getWriteIndexName(IndexMapping) helper. Why - Previous routing went through resolveWriteIndex(entityType, mapping) but only at hand-picked call sites. Several write paths still resolved indexMapping.getIndexName(clusterAlias) directly and bypassed routing — bulkIndexPipelineExecutions, deleteByScript, softDeleteOrRestoreEntity, propagateToDomainChildren, updateEntityCertificationInSearch, propagateToRelatedEntities (PAGE), deleteTableColumns, updateTableColumnsInheritedFields. Any reindex in flight could lose those writes on the alias swap. - Keying by canonical index name lets any write site resolve correctly even without entity type in scope (FQN-prefix deletes, child propagation, script updates). What - activeStagedIndices: Map<canonicalIndexName, stagedIndexName>. - registerStagedIndex(entityType, stagedIndex) now resolves the canonical name from the IndexMapping before storing. - New getWriteIndexName(IndexMapping) is the single point of resolution; routeToStagedIfActive(String) handles raw alias names (e.g. pipeline_status_search_index resolved via getIndexOrAliasName). - Replaced every direct indexMapping.getIndexName(clusterAlias) for writes with getWriteIndexName(indexMapping). Admin/setup paths (createIndex/updateIndex/deleteIndex/createOrUpdateIndexTemplate) intentionally keep canonical names — they manage the alias itself. - Cascade ops on shared aliases (GLOBAL_SEARCH_ALIAS, DATA_ASSET_SEARCH_ALIAS, child aliases) are not entity-scoped and cannot route to a single staged index; left untouched. - resolveWriteIndex(entityType, mapping) preserved as a thin wrapper for binary compatibility. Also runs spotless:apply on the file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Two CI failures observed under the parallel-tests fork load on the post-centralization run: 1. TagResourceIT line 161 (listEntities) — the server returned 500 wrapping "java.sql.SQLException: Out of sort memory, consider increasing server sort buffer size" from TagDAO.listAfter. The query joins tag → entity_relationship → classification and orders by tag.name,tag.id; with the tag table accumulating across many parallel test classes (reuseForks=true), MySQL's default 256KB sort_buffer_size overflows. Bump it to 8MB. Add a parallel work_mem=32MB bump to the postgres command for the same query. 2. TagResourceIT line 1 — Awaitility timeout at 1m30s waiting for a freshly created tag to appear in search index. Five inherited waits in BaseEntityIT had a 90s ceiling while the sibling checkCreatedEntity already used 180s. Standardise on 180s — under tag-scale data the alias swap that the staged-index routing depends on can take longer than 90s in slow CI workers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…7649) * DefaultRecreateHandler.finalizeReindex / promoteEntityIndex — wrap the entire promote block in try/finally so unregisterStagedIndex always runs, including on swap failure, empty aliasesToAttach, and exceptions. Without this the routing map could be left pointing at a staged index nobody reads from, silently diverging live writes from search results until the next reindex (Copilot, multiple comments). * SearchRepository.resolveWriteIndex — deprecate. The entityType argument is unused; getWriteIndexName(IndexMapping) is the single resolution point now (Copilot + Gitar). * SearchRepository.routeToStagedIfActive — tighten the Javadoc to state explicitly that it expects a canonical index name and that short/parent aliases are passed through unchanged (Copilot). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

…27649) The four bulk update-by-query operations rooted on shared aliases — updateAssetDomainsForDataProduct, updateAssetDomainsByIds, updateDomainFqnByPrefix, updateAssetDomainFqnByPrefix — hardcoded their target to GLOBAL_SEARCH_ALIAS / Entity.DOMAIN. During an in-flight reindex those updates landed on the about-to-be-discarded active index only; on alias swap, the new staged index (built from a DB snapshot taken before the script ran) replaced it and the script's effect was lost. Copilot called this out four times. Add SearchRepository.getWriteFanoutTargets(aliasOrIndex) — returns the caller's alias plus every currently-staged index. Pass that list to req.index(...) on all four methods in both OpenSearchEntityManager and ElasticSearchEntityManager. The OS/ES update-by-query API natively takes a list, so the fan-out is one request per call. The scripts these methods run are idempotent (UPDATE_ASSET_DOMAIN_SCRIPT checks `exists` before adding a domain; UPDATE_DOMAIN_FQN_BY_PREFIX_SCRIPT walks the array and rewrites in place), so applying them again to the staged index — even if the staged copy of the document already reflects the latest DB state — converges to the same result. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

…27649) Previous getWriteFanoutTargets always appended every staged index, which made entity-scoped update-by-query calls (e.g. updateDomainFqnByPrefix targeting only the domain canonical index) fan out onto unrelated staged indices. Adds avoidable load on every currently-reindexing entity type for an update that should touch one index. Branch the implementation on whether the input is a known canonical entity index name. If yes, only the matching staged index is added. If no — i.e. the caller is hitting a multi-entity alias such as GLOBAL_SEARCH_ALIAS — every staged index is added because the update's match query can hit documents from any reindexing type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…enMetadata into fix-flaky-tests

gitar-bot · 2026-04-27T13:36:12Z

Code Review ✅ Approved 2 resolved / 2 findings

Stabilizes flaky integration tests by correcting the resolveWriteIndex Javadoc and ensuring getWriteFanoutTargets correctly handles staged indices. All findings have been addressed.

✅ 2 resolved

✅ Quality: resolveWriteIndex Javadoc claims entityType lookup but ignores it

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:586-591
The Javadoc on resolveWriteIndex (line 586-588) says "Equivalent to getWriteIndexName(getIndexMapping(entityType))", but the implementation at line 590 calls getWriteIndexName(indexMapping) — using the caller-supplied indexMapping, not looking up by entityType. The entityType parameter is now completely unused. Since all existing callers pass matching pairs this is functionally harmless, but the misleading doc and dead parameter could confuse future maintainers.

✅ Performance: getWriteFanoutTargets fans out to ALL staged indices

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:609-617
When multiple entity types are being reindexed concurrently, getWriteFanoutTargets appends every entry in activeStagedIndices — not just those relevant to the alias being targeted. For example, a domain-update routed through GLOBAL_SEARCH_ALIAS would also target a staged index for domain_search_index or any other entity type being reindexed.

Because conflicts(Conflicts.Proceed) is set and the scripts are idempotent, this is functionally correct — irrelevant staged indices will simply match zero documents. However, it sends unnecessary update-by-query RPCs to indices that can never contain matching docs, which is wasted I/O proportional to the number of concurrent reindex jobs.

This is unlikely to matter in practice (concurrent reindexes are rare, and extra zero-hit queries are cheap), so this is informational.

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

sonarqubecloud · 2026-04-27T14:39:57Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Copilot AI review requested due to automatic review settings April 23, 2026 06:29

github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Apr 23, 2026

Copilot started reviewing on behalf of mohityadav766 April 23, 2026 06:30 View session

mohityadav766 changed the title ~~fix(it): stop two flaky integration tests (#27649)~~ fix(it): stop flaky integration tests (#27649) Apr 23, 2026

Merge branch 'main' into fix-flaky-tests

b8f8928

Copilot AI reviewed Apr 23, 2026

View reviewed changes

mohityadav766 self-assigned this Apr 23, 2026

pmbrull previously approved these changes Apr 23, 2026

View reviewed changes

mohityadav766 temporarily deployed to test April 23, 2026 06:40 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test April 23, 2026 06:40 — with GitHub Actions Error

mohityadav766 dismissed pmbrull’s stale review via 09e5923 April 23, 2026 08:30

Merge branch 'main' into fix-flaky-tests

6e3f2b7

Copilot AI review requested due to automatic review settings April 23, 2026 08:32

Copilot started reviewing on behalf of mohityadav766 April 23, 2026 08:33 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java

mohityadav766 had a problem deploying to test April 23, 2026 08:41 — with GitHub Actions Error

Merge branch 'main' into fix-flaky-tests

c1fa78c

mohityadav766 had a problem deploying to test April 23, 2026 09:18 — with GitHub Actions Error

mohityadav766 had a problem deploying to test April 27, 2026 06:33 — with GitHub Actions Error

gitar-bot Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java Outdated

mohityadav766 had a problem deploying to test April 27, 2026 06:46 — with GitHub Actions Error

mohityadav766 temporarily deployed to test April 27, 2026 06:46 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test April 27, 2026 06:46 — with GitHub Actions Error

Merge branch 'main' into fix-flaky-tests

520b924

Copilot AI review requested due to automatic review settings April 27, 2026 08:11

Copilot AI reviewed Apr 27, 2026

View reviewed changes

mohityadav766 and others added 2 commits April 27, 2026 14:24

Copilot AI reviewed Apr 27, 2026

View reviewed changes

mohityadav766 and others added 3 commits April 27, 2026 17:22

Merge branch 'main' into fix-flaky-tests

642d2c4

Merge branch 'main' into fix-flaky-tests

ce1ee8f

gitar-bot Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java

Copilot AI reviewed Apr 27, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java Outdated

mohityadav766 and others added 2 commits April 27, 2026 19:04

Merge branch 'fix-flaky-tests' of https://github.com/open-metadata/Op…

545191c

…enMetadata into fix-flaky-tests

pmbrull approved these changes Apr 27, 2026

View reviewed changes

Conversation

mohityadav766 commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope

Test plan

Risk

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

gitar-bot Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sonarqubecloud Bot commented Apr 27, 2026

Quality Gate passed for 'open-metadata-ingestion'

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mohityadav766 commented Apr 23, 2026 •

edited

Loading

gitar-bot Bot commented Apr 27, 2026 •

edited

Loading