fix(it): stop flaky integration tests (#27649)#27651
Conversation
GlossaryOntologyExportIT: mark @isolated. @BeforeAll flips RdfUpdater (a JVM-wide singleton) on, which makes every concurrent test class start doing synchronous Fuseki writes on entity create, saturating the Dropwizard thread pool and causing 60s request timeouts. @execution (SAME_THREAD) alone only serialises within this class. WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop the redundant waitForWorkflowDeployment call — the create path already waits. Add descriptive aliases to the two await() polls so the next flake tells us which FQN or workflow name actually timed out instead of an anonymous lambda. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Fixes flaky backend integration tests by preventing cross-test interference from JVM-wide RDF toggling and improving Awaitility timeout diagnostics in workflow tests.
Changes:
- Added JUnit
@IsolatedtoGlossaryOntologyExportITto prevent concurrent test classes from inheriting the globalRdfUpdaterconfiguration. - Removed a redundant workflow deployment wait in
triggerWorkflow_SDK(deployment is already awaited during workflow creation). - Added descriptive aliases to Awaitility
await(...)calls to make future timeouts actionable.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java |
Removes redundant wait and adds Awaitility aliases for clearer timeout failures. |
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryOntologyExportIT.java |
Isolates RDF export tests to avoid JVM-wide singleton leakage across parallel integration tests. |
Live search indexing was silently skipped whenever a reindex job was in RUNNING/READY/STOPPING state. SearchRepository.createEntityIndex() and six sibling methods consulted SearchIndexRetryQueue.isEntityTypeSuspended() and returned early with nothing written, nothing enqueued — entities vanished from search until a future reindex happened to cover them. The retry worker doubled down: when the scope refresh observed an active job, it purged the retry queue; and processRecord() deleted records whose type was suspended. So even manually enqueued retries were wiped. This is how the #27649 flake surfaced: AppsResourceIT triggers SearchIndexingApplication runs and its best-effort 30s wait silently swallows timeouts. If a run was still RUNNING when AppsResourceIT finished, the next class in the sequential fork (WorkflowDefinitionResourceIT) inherited the suspension and its freshly-created tables were never indexed — waitForEntityIndexedInSearch then timed out at 120s. Same mechanism bites real users mid-reindex in production. Remove the suspension mechanism entirely: * SearchRepository — drop the 8 isEntityTypeSuspended() early-returns; the client-availability path already enqueues for retry on its own. * SearchIndexRetryWorker — drop refreshReindexSuspensionScopeIfNeeded() and the suspension branches in processRecord(); remove the retry-queue purge on suspendAll. * SearchIndexRetryQueue — delete the updateSuspension / clearSuspension / isEntityTypeSuspended / isStreamingSuspended / isSuspendAllStreaming / getSuspendedEntityTypes API and the static AtomicBoolean / AtomicReference they backed. * Drop the two IT cases that asserted the removed behaviour. Live writes now always reach the search client; reindex and live writes both target the same indices as before. Version conflicts between the two paths (stale reindex batch overwriting a newer live write) remain possible as they did before suspension was introduced — that is the race suspension was meant to dodge, but dropping writes altogether was worse than the race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-key activeStagedIndices by canonical index name (e.g. openmetadata_table_search_index) instead of entity type, and route every live-write site through a single getWriteIndexName(IndexMapping) helper. Why - Previous routing went through resolveWriteIndex(entityType, mapping) but only at hand-picked call sites. Several write paths still resolved indexMapping.getIndexName(clusterAlias) directly and bypassed routing — bulkIndexPipelineExecutions, deleteByScript, softDeleteOrRestoreEntity, propagateToDomainChildren, updateEntityCertificationInSearch, propagateToRelatedEntities (PAGE), deleteTableColumns, updateTableColumnsInheritedFields. Any reindex in flight could lose those writes on the alias swap. - Keying by canonical index name lets any write site resolve correctly even without entity type in scope (FQN-prefix deletes, child propagation, script updates). What - activeStagedIndices: Map<canonicalIndexName, stagedIndexName>. - registerStagedIndex(entityType, stagedIndex) now resolves the canonical name from the IndexMapping before storing. - New getWriteIndexName(IndexMapping) is the single point of resolution; routeToStagedIfActive(String) handles raw alias names (e.g. pipeline_status_search_index resolved via getIndexOrAliasName). - Replaced every direct indexMapping.getIndexName(clusterAlias) for writes with getWriteIndexName(indexMapping). Admin/setup paths (createIndex/updateIndex/deleteIndex/createOrUpdateIndexTemplate) intentionally keep canonical names — they manage the alias itself. - Cascade ops on shared aliases (GLOBAL_SEARCH_ALIAS, DATA_ASSET_SEARCH_ALIAS, child aliases) are not entity-scoped and cannot route to a single staged index; left untouched. - resolveWriteIndex(entityType, mapping) preserved as a thin wrapper for binary compatibility. Also runs spotless:apply on the file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI failures observed under the parallel-tests fork load on the post-centralization run: 1. TagResourceIT line 161 (listEntities) — the server returned 500 wrapping "java.sql.SQLException: Out of sort memory, consider increasing server sort buffer size" from TagDAO.listAfter. The query joins tag → entity_relationship → classification and orders by tag.name,tag.id; with the tag table accumulating across many parallel test classes (reuseForks=true), MySQL's default 256KB sort_buffer_size overflows. Bump it to 8MB. Add a parallel work_mem=32MB bump to the postgres command for the same query. 2. TagResourceIT line 1 — Awaitility timeout at 1m30s waiting for a freshly created tag to appear in search index. Five inherited waits in BaseEntityIT had a 90s ceiling while the sibling checkCreatedEntity already used 180s. Standardise on 180s — under tag-scale data the alias swap that the staged-index routing depends on can take longer than 90s in slow CI workers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…7649) * DefaultRecreateHandler.finalizeReindex / promoteEntityIndex — wrap the entire promote block in try/finally so unregisterStagedIndex always runs, including on swap failure, empty aliasesToAttach, and exceptions. Without this the routing map could be left pointing at a staged index nobody reads from, silently diverging live writes from search results until the next reindex (Copilot, multiple comments). * SearchRepository.resolveWriteIndex — deprecate. The entityType argument is unused; getWriteIndexName(IndexMapping) is the single resolution point now (Copilot + Gitar). * SearchRepository.routeToStagedIfActive — tighten the Javadoc to state explicitly that it expects a canonical index name and that short/parent aliases are passed through unchanged (Copilot). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…27649) The four bulk update-by-query operations rooted on shared aliases — updateAssetDomainsForDataProduct, updateAssetDomainsByIds, updateDomainFqnByPrefix, updateAssetDomainFqnByPrefix — hardcoded their target to GLOBAL_SEARCH_ALIAS / Entity.DOMAIN. During an in-flight reindex those updates landed on the about-to-be-discarded active index only; on alias swap, the new staged index (built from a DB snapshot taken before the script ran) replaced it and the script's effect was lost. Copilot called this out four times. Add SearchRepository.getWriteFanoutTargets(aliasOrIndex) — returns the caller's alias plus every currently-staged index. Pass that list to req.index(...) on all four methods in both OpenSearchEntityManager and ElasticSearchEntityManager. The OS/ES update-by-query API natively takes a list, so the fan-out is one request per call. The scripts these methods run are idempotent (UPDATE_ASSET_DOMAIN_SCRIPT checks `exists` before adding a domain; UPDATE_DOMAIN_FQN_BY_PREFIX_SCRIPT walks the array and rewrites in place), so applying them again to the staged index — even if the staged copy of the document already reflects the latest DB state — converges to the same result. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…27649) Previous getWriteFanoutTargets always appended every staged index, which made entity-scoped update-by-query calls (e.g. updateDomainFqnByPrefix targeting only the domain canonical index) fan out onto unrelated staged indices. Adds avoidable load on every currently-reindexing entity type for an update that should touch one index. Branch the implementation on whether the input is a known canonical entity index name. If yes, only the matching staged index is added. If no — i.e. the caller is hitting a multi-entity alias such as GLOBAL_SEARCH_ALIAS — every staged index is added because the update's match query can hit documents from any reindexing type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enMetadata into fix-flaky-tests
Code Review ✅ Approved 2 resolved / 2 findingsStabilizes flaky integration tests by correcting the resolveWriteIndex Javadoc and ensuring getWriteFanoutTargets correctly handles staged indices. All findings have been addressed. ✅ 2 resolved✅ Quality: resolveWriteIndex Javadoc claims entityType lookup but ignores it
✅ Performance: getWriteFanoutTargets fans out to ALL staged indices
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|



Summary
Fixes #27649 — flaky backend integration tests caused by writes silently being lost during in-flight reindexes. Most of the change is to production search-index write routing, not just test plumbing — see "Scope" below.
Scope
The originally-reported flakes (
WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK,GlossaryOntologyExportIT,TagResourceIT) all surfaced the same underlying bug class: a reindex in flight discards live writes when the alias swaps. Fixing it cleanly required several production changes:SearchIndexRetryQueuereindex-suspension mechanism —shouldSkipStreamingIndexingused to silently drop live writes when a reindex job was active. Replaced with always-write semantics. (commit09e592397b)SearchRepository.activeStagedIndices(keyed by canonical index name) tracks indices being populated by a reindex; live writes go to the staged index directly so they survive the alias swap. Centralized ingetWriteIndexName(IndexMapping)/routeToStagedIfActive(String)/getWriteFanoutTargets(String). (commits52992bcdbc,568c2d32e5,95aa9235b4)SearchIndexHandleron everySearchRepositoryinit —EntityLifecycleEventDispatcher.registerHandlerskips duplicates by name; tests constructSearchRepository3 times viaTestSuiteBootstrapso the dispatcher kept a stale handler bound to the migration-time instance. Now each construction unregisters the previous handler first. (commitea093b0d86)DefaultRecreateHandler—try/finallyensuresunregisterStagedIndexruns on swap success, swap failure, empty-aliases, and exceptions; previously it could be left routing writes to an orphan index. (commit37c343855a)updateAssetDomainsForDataProduct/updateAssetDomainsByIds/updateDomainFqnByPrefix/updateAssetDomainFqnByPrefixoriginally targeted onlyGLOBAL_SEARCH_ALIAS/domain; the staged index missed the script and the change was lost on swap. Now they target the alias plus every active staged index. The scripts are idempotent so re-applying them to staged is safe. (commit95aa9235b4)Test-only changes:
GlossaryOntologyExportIT→@Isolated(RDF singleton bleed). commit14cb35ff38WorkflowDefinitionResourceIT#triggerWorkflow_SDK— drop redundant deploy-wait, add Awaitility aliases for actionable timeouts.TestSuiteBootstrap— bump MySQLsort_buffer_size=8Mand Postgreswork_mem=32MB(TagDAO.listAfter pagination overflow). commit0a4144c4a1BaseEntityITsearch-wait ceilings 90s → 180s for parity withcheckCreatedEntity.Test plan
mvn -pl openmetadata-service compilecleanmvn spotless:applycleanAppsResourceIT+WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK+RecognizerFeedbackRepositoryTest#testApplyFeedback_withRecognizerMetadata_shouldTargetSpecificRecognizer— 34/34 passedGlossaryOntologyExportIT(@Isolated) — 2/2 passedRisk
The suspension-removal and staged-index routing are real production behaviour changes and should be reviewed accordingly. Trade-offs documented in each commit message.
🤖 Generated with Claude Code