Skip to content

fix(it): stop flaky integration tests (#27649)#27651

Merged
pmbrull merged 19 commits intomainfrom
fix-flaky-tests
Apr 27, 2026
Merged

fix(it): stop flaky integration tests (#27649)#27651
pmbrull merged 19 commits intomainfrom
fix-flaky-tests

Conversation

@mohityadav766
Copy link
Copy Markdown
Member

@mohityadav766 mohityadav766 commented Apr 23, 2026

Summary

Fixes #27649 — flaky backend integration tests caused by writes silently being lost during in-flight reindexes. Most of the change is to production search-index write routing, not just test plumbing — see "Scope" below.

Scope

The originally-reported flakes (WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK, GlossaryOntologyExportIT, TagResourceIT) all surfaced the same underlying bug class: a reindex in flight discards live writes when the alias swaps. Fixing it cleanly required several production changes:

  1. Removed the SearchIndexRetryQueue reindex-suspension mechanismshouldSkipStreamingIndexing used to silently drop live writes when a reindex job was active. Replaced with always-write semantics. (commit 09e592397b)
  2. Staged-index write routingSearchRepository.activeStagedIndices (keyed by canonical index name) tracks indices being populated by a reindex; live writes go to the staged index directly so they survive the alias swap. Centralized in getWriteIndexName(IndexMapping) / routeToStagedIfActive(String) / getWriteFanoutTargets(String). (commits 52992bcdbc, 568c2d32e5, 95aa9235b4)
  3. Re-register SearchIndexHandler on every SearchRepository initEntityLifecycleEventDispatcher.registerHandler skips duplicates by name; tests construct SearchRepository 3 times via TestSuiteBootstrap so the dispatcher kept a stale handler bound to the migration-time instance. Now each construction unregisters the previous handler first. (commit ea093b0d86)
  4. Always unregister staged-index routing in DefaultRecreateHandlertry/finally ensures unregisterStagedIndex runs on swap success, swap failure, empty-aliases, and exceptions; previously it could be left routing writes to an orphan index. (commit 37c343855a)
  5. Fan out cross-alias bulk update-by-queryupdateAssetDomainsForDataProduct/updateAssetDomainsByIds/updateDomainFqnByPrefix/updateAssetDomainFqnByPrefix originally targeted only GLOBAL_SEARCH_ALIAS / domain; the staged index missed the script and the change was lost on swap. Now they target the alias plus every active staged index. The scripts are idempotent so re-applying them to staged is safe. (commit 95aa9235b4)

Test-only changes:

  • GlossaryOntologyExportIT@Isolated (RDF singleton bleed). commit 14cb35ff38
  • WorkflowDefinitionResourceIT#triggerWorkflow_SDK — drop redundant deploy-wait, add Awaitility aliases for actionable timeouts.
  • TestSuiteBootstrap — bump MySQL sort_buffer_size=8M and Postgres work_mem=32MB (TagDAO.listAfter pagination overflow). commit 0a4144c4a1
  • BaseEntityIT search-wait ceilings 90s → 180s for parity with checkCreatedEntity.

Test plan

  • mvn -pl openmetadata-service compile clean
  • mvn spotless:apply clean
  • Local repro: AppsResourceIT + WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK + RecognizerFeedbackRepositoryTest#testApplyFeedback_withRecognizerMetadata_shouldTargetSpecificRecognizer — 34/34 passed
  • Local: GlossaryOntologyExportIT (@Isolated) — 2/2 passed
  • CI run on the full integration-test matrix

Risk

The suspension-removal and staged-index routing are real production behaviour changes and should be reviewed accordingly. Trade-offs documented in each commit message.

🤖 Generated with Claude Code

GlossaryOntologyExportIT: mark @isolated. @BeforeAll flips RdfUpdater (a
JVM-wide singleton) on, which makes every concurrent test class start
doing synchronous Fuseki writes on entity create, saturating the
Dropwizard thread pool and causing 60s request timeouts. @execution
(SAME_THREAD) alone only serialises within this class.

WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop the redundant
waitForWorkflowDeployment call — the create path already waits. Add
descriptive aliases to the two await() polls so the next flake tells
us which FQN or workflow name actually timed out instead of an
anonymous lambda.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 06:29
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Apr 23, 2026
@mohityadav766 mohityadav766 changed the title fix(it): stop two flaky integration tests (#27649) fix(it): stop flaky integration tests (#27649) Apr 23, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes flaky backend integration tests by preventing cross-test interference from JVM-wide RDF toggling and improving Awaitility timeout diagnostics in workflow tests.

Changes:

  • Added JUnit @Isolated to GlossaryOntologyExportIT to prevent concurrent test classes from inheriting the global RdfUpdater configuration.
  • Removed a redundant workflow deployment wait in triggerWorkflow_SDK (deployment is already awaited during workflow creation).
  • Added descriptive aliases to Awaitility await(...) calls to make future timeouts actionable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java Removes redundant wait and adds Awaitility aliases for clearer timeout failures.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryOntologyExportIT.java Isolates RDF export tests to avoid JVM-wide singleton leakage across parallel integration tests.

@mohityadav766 mohityadav766 self-assigned this Apr 23, 2026
pmbrull
pmbrull previously approved these changes Apr 23, 2026
Live search indexing was silently skipped whenever a reindex job was in
RUNNING/READY/STOPPING state. SearchRepository.createEntityIndex() and
six sibling methods consulted SearchIndexRetryQueue.isEntityTypeSuspended()
and returned early with nothing written, nothing enqueued — entities
vanished from search until a future reindex happened to cover them.

The retry worker doubled down: when the scope refresh observed an active
job, it purged the retry queue; and processRecord() deleted records
whose type was suspended. So even manually enqueued retries were wiped.

This is how the #27649 flake surfaced: AppsResourceIT triggers
SearchIndexingApplication runs and its best-effort 30s wait silently
swallows timeouts. If a run was still RUNNING when AppsResourceIT
finished, the next class in the sequential fork
(WorkflowDefinitionResourceIT) inherited the suspension and its
freshly-created tables were never indexed — waitForEntityIndexedInSearch
then timed out at 120s. Same mechanism bites real users mid-reindex in
production.

Remove the suspension mechanism entirely:

* SearchRepository — drop the 8 isEntityTypeSuspended() early-returns;
  the client-availability path already enqueues for retry on its own.
* SearchIndexRetryWorker — drop refreshReindexSuspensionScopeIfNeeded()
  and the suspension branches in processRecord(); remove the retry-queue
  purge on suspendAll.
* SearchIndexRetryQueue — delete the updateSuspension / clearSuspension
  / isEntityTypeSuspended / isStreamingSuspended / isSuspendAllStreaming
  / getSuspendedEntityTypes API and the static AtomicBoolean /
  AtomicReference they backed.
* Drop the two IT cases that asserted the removed behaviour.

Live writes now always reach the search client; reindex and live
writes both target the same indices as before. Version conflicts
between the two paths (stale reindex batch overwriting a newer live
write) remain possible as they did before suspension was introduced —
that is the race suspension was meant to dodge, but dropping writes
altogether was worse than the race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 08:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Re-key activeStagedIndices by canonical index name (e.g.
openmetadata_table_search_index) instead of entity type, and route
every live-write site through a single getWriteIndexName(IndexMapping)
helper.

Why
- Previous routing went through resolveWriteIndex(entityType, mapping)
  but only at hand-picked call sites. Several write paths still
  resolved indexMapping.getIndexName(clusterAlias) directly and
  bypassed routing — bulkIndexPipelineExecutions, deleteByScript,
  softDeleteOrRestoreEntity, propagateToDomainChildren,
  updateEntityCertificationInSearch, propagateToRelatedEntities (PAGE),
  deleteTableColumns, updateTableColumnsInheritedFields. Any reindex
  in flight could lose those writes on the alias swap.
- Keying by canonical index name lets any write site resolve correctly
  even without entity type in scope (FQN-prefix deletes, child
  propagation, script updates).

What
- activeStagedIndices: Map<canonicalIndexName, stagedIndexName>.
- registerStagedIndex(entityType, stagedIndex) now resolves the
  canonical name from the IndexMapping before storing.
- New getWriteIndexName(IndexMapping) is the single point of
  resolution; routeToStagedIfActive(String) handles raw alias names
  (e.g. pipeline_status_search_index resolved via
  getIndexOrAliasName).
- Replaced every direct indexMapping.getIndexName(clusterAlias) for
  writes with getWriteIndexName(indexMapping). Admin/setup paths
  (createIndex/updateIndex/deleteIndex/createOrUpdateIndexTemplate)
  intentionally keep canonical names — they manage the alias itself.
- Cascade ops on shared aliases (GLOBAL_SEARCH_ALIAS,
  DATA_ASSET_SEARCH_ALIAS, child aliases) are not entity-scoped and
  cannot route to a single staged index; left untouched.
- resolveWriteIndex(entityType, mapping) preserved as a thin wrapper
  for binary compatibility.

Also runs spotless:apply on the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 08:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

mohityadav766 and others added 2 commits April 27, 2026 14:24
Two CI failures observed under the parallel-tests fork load on the
post-centralization run:

1. TagResourceIT line 161 (listEntities) — the server returned 500
   wrapping "java.sql.SQLException: Out of sort memory, consider
   increasing server sort buffer size" from TagDAO.listAfter. The
   query joins tag → entity_relationship → classification and orders
   by tag.name,tag.id; with the tag table accumulating across many
   parallel test classes (reuseForks=true), MySQL's default 256KB
   sort_buffer_size overflows. Bump it to 8MB. Add a parallel
   work_mem=32MB bump to the postgres command for the same query.

2. TagResourceIT line 1 — Awaitility timeout at 1m30s waiting for a
   freshly created tag to appear in search index. Five inherited
   waits in BaseEntityIT had a 90s ceiling while the sibling
   checkCreatedEntity already used 180s. Standardise on 180s — under
   tag-scale data the alias swap that the staged-index routing
   depends on can take longer than 90s in slow CI workers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…7649)

* DefaultRecreateHandler.finalizeReindex / promoteEntityIndex — wrap
  the entire promote block in try/finally so unregisterStagedIndex
  always runs, including on swap failure, empty aliasesToAttach, and
  exceptions. Without this the routing map could be left pointing at
  a staged index nobody reads from, silently diverging live writes
  from search results until the next reindex (Copilot, multiple
  comments).

* SearchRepository.resolveWriteIndex — deprecate. The entityType
  argument is unused; getWriteIndexName(IndexMapping) is the single
  resolution point now (Copilot + Gitar).

* SearchRepository.routeToStagedIfActive — tighten the Javadoc to
  state explicitly that it expects a canonical index name and that
  short/parent aliases are passed through unchanged (Copilot).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

mohityadav766 and others added 3 commits April 27, 2026 17:22
…27649)

The four bulk update-by-query operations rooted on shared aliases —
updateAssetDomainsForDataProduct, updateAssetDomainsByIds,
updateDomainFqnByPrefix, updateAssetDomainFqnByPrefix — hardcoded their
target to GLOBAL_SEARCH_ALIAS / Entity.DOMAIN. During an in-flight
reindex those updates landed on the about-to-be-discarded active
index only; on alias swap, the new staged index (built from a DB
snapshot taken before the script ran) replaced it and the script's
effect was lost. Copilot called this out four times.

Add SearchRepository.getWriteFanoutTargets(aliasOrIndex) — returns the
caller's alias plus every currently-staged index. Pass that list to
req.index(...) on all four methods in both OpenSearchEntityManager and
ElasticSearchEntityManager. The OS/ES update-by-query API natively
takes a list, so the fan-out is one request per call.

The scripts these methods run are idempotent (UPDATE_ASSET_DOMAIN_SCRIPT
checks `exists` before adding a domain; UPDATE_DOMAIN_FQN_BY_PREFIX_SCRIPT
walks the array and rewrites in place), so applying them again to the
staged index — even if the staged copy of the document already reflects
the latest DB state — converges to the same result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

mohityadav766 and others added 2 commits April 27, 2026 19:04
…27649)

Previous getWriteFanoutTargets always appended every staged index,
which made entity-scoped update-by-query calls (e.g.
updateDomainFqnByPrefix targeting only the domain canonical index)
fan out onto unrelated staged indices. Adds avoidable load on every
currently-reindexing entity type for an update that should touch one
index.

Branch the implementation on whether the input is a known canonical
entity index name. If yes, only the matching staged index is added.
If no — i.e. the caller is hitting a multi-entity alias such as
GLOBAL_SEARCH_ALIAS — every staged index is added because the
update's match query can hit documents from any reindexing type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 27, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Stabilizes flaky integration tests by correcting the resolveWriteIndex Javadoc and ensuring getWriteFanoutTargets correctly handles staged indices. All findings have been addressed.

✅ 2 resolved
Quality: resolveWriteIndex Javadoc claims entityType lookup but ignores it

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:586-591
The Javadoc on resolveWriteIndex (line 586-588) says "Equivalent to getWriteIndexName(getIndexMapping(entityType))", but the implementation at line 590 calls getWriteIndexName(indexMapping) — using the caller-supplied indexMapping, not looking up by entityType. The entityType parameter is now completely unused. Since all existing callers pass matching pairs this is functionally harmless, but the misleading doc and dead parameter could confuse future maintainers.

Performance: getWriteFanoutTargets fans out to ALL staged indices

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:609-617
When multiple entity types are being reindexed concurrently, getWriteFanoutTargets appends every entry in activeStagedIndices — not just those relevant to the alias being targeted. For example, a domain-update routed through GLOBAL_SEARCH_ALIAS would also target a staged index for domain_search_index or any other entity type being reindexed.

Because conflicts(Conflicts.Proceed) is set and the scripts are idempotent, this is functionally correct — irrelevant staged indices will simply match zero documents. However, it sends unnecessary update-by-query RPCs to indices that can never contain matching docs, which is wasted I/O proportional to the number of concurrent reindex jobs.

This is unlikely to matter in practice (concurrent reindexes are rare, and extra zero-hit queries are cheap), so this is informational.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Integration/Unit Tests backend

3 participants