Skip to content

fix(search): restore live settings on per-entity promote path#27920

Open
harshach wants to merge 18 commits intomainfrom
harshach/search-alias-promote
Open

fix(search): restore live settings on per-entity promote path#27920
harshach wants to merge 18 commits intomainfrom
harshach/search-alias-promote

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

@harshach harshach commented May 5, 2026

Summary

Reindex promotion succeeds but the swapped-in index keeps the bulk-build settings (refresh_interval=-1, replicas=0, translog=async), so live writes after promotion are buffered indefinitely and never become searchable until a manual _refresh. This surfaces in production as "indexing is fast, but searches/UIs return nothing on freshly created entities."

Two coupled bugs in the per-entity distributed promote path:

  • DefaultRecreateHandler.promoteEntityIndex was missing the live-settings restore. finalizeReindex calls applyLiveServingSettings + maybeForceMerge before the alias swap; the per-entity sibling jumped straight from doc-count check to alias swap. Mirror the call sequence.
  • DistributedSearchIndexExecutor constructed a second RecreateIndexHandler without wiring jobData. With jobData=null, applyLiveServingSettings resolves no overrides, buildRevertJson returns null, and the call silently no-ops — so even if the previous fix were in place, the per-entity path would still miss settings restoration. Pass currentJob.getJobConfiguration() into the handler at construction, mirroring what DistributedIndexingStrategy already does.

Provenance

These two source changes are cherry-picks of b272de85f9 and 68dad610ed, which currently sit unmerged on the context_center feature branch (PR #27558) — a Knowledge → Context Center rename PR that is unrelated to search indexing. Splitting them out so they can ship independently.

A regression test (testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap) is added to lock the behavior in: it builds a handler with bulk overrides on EventPublisherJob, calls promoteEntityIndex, and asserts searchClient.updateIndexSettings is invoked with refresh_interval=1s, number_of_replicas=1, translog.durability=request before the alias swap. Verified locally: the test passes with both source commits applied; reverting either one fails the assertion with Wanted but not invoked.

Test plan

  • mvn test -pl openmetadata-service -Dtest='DefaultRecreateHandlerTest,DistributedSearchIndexExecutorTest,DistributedIndexingStrategyTest,SearchIndexExecutorControlFlowTest' → 136 tests, 0 failures
  • Sanity-check the regression test fails when applyLiveServingSettings/maybeForceMerge are removed from promoteEntityIndex (Mockito reports Wanted but not invoked)
  • Manual: trigger a Search Indexing app run with bulk overrides configured (replicas=0, refresh=-1, translog=async) and confirm post-promotion _settings show refresh_interval=1s, not -1

🤖 Generated with Claude Code


Summary by Gitar

  • Search promotion logic:
    • Unified finalizeReindex and promoteEntityIndex into a single promote path to prevent feature drift.
    • Added applyLiveServingSettings and maybeForceMerge to promoteEntityIndex to ensure production-ready index settings are restored before alias swapping.
  • Handler configuration:
    • Fixed a bug where DistributedSearchIndexExecutor constructed a handler without jobData, causing settings restoration to no-op.
  • Reliability and error handling:
    • Implemented a deferred canonical-index deletion process (3-step swap) with post-state checks (indexExists, getAliases) to ensure alias integrity.
    • Added state tracking (failedPromotions, dataLossPromotions) to surface job failures and data-loss scenarios in DistributedIndexingStrategy and SearchIndexExecutor.
  • Regression testing:
    • Added SearchIndexAliasPromotionIT to verify live-setting restoration and idempotent alias promotion.
    • Added extensive unit tests in DefaultRecreateHandlerTest for promotion edge cases, including error handling and state verification.

This will update automatically on new commits.

harshach and others added 3 commits May 5, 2026 13:59
DefaultRecreateHandler exposes two finalization paths:

  - finalizeReindex(...)        — centralized end-of-job promotion. Calls
                                  applyLiveServingSettings + maybeForceMerge
                                  before the alias swap, reverting the bulk
                                  overrides (refresh_interval=-1, replicas=0,
                                  async translog) back to live values
                                  (refresh=1s, replicas=1, durable translog).

  - promoteEntityIndex(ctx, ok) — per-entity promotion. Used by the distributed
                                  search-indexer's "promote as soon as all
                                  partitions for an entity complete" callback
                                  (DistributedSearchIndexExecutor.promoteEntityIndex).
                                  Swaps the alias and cleans up old indices —
                                  but never restored live settings.

When an entity finishes its partitions before the final reconciliation
(typically the smallest entities — e.g. knowledge `page` with ~11 rows),
its index is promoted via the per-entity path, the alias swap succeeds,
and the bulk-build overrides become the new live settings. refresh_interval
stays at -1 in production, so live writes after the reindex are buffered in
the translog and never reach searchable segments until a manual _refresh.
Externally this surfaces as "create an article, hierarchy is empty until I
re-trigger reindex" — exactly the user-reported bug.

Mirror the finalizeReindex sequence by calling applyLiveServingSettings
(and maybeForceMerge for parity) at the top of the promote block in
promoteEntityIndex, before the alias swap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DefaultRecreateHandler.applyLiveServingSettings reads from the handler's
jobData field (live + bulk index-settings overrides on the EventPublisherJob).
The per-entity distributed-promotion path in DistributedSearchIndexExecutor
created its own DefaultRecreateHandler instance and never called
withJobData(jobData) on it. With jobData=null, buildRevertJson returns null
and applyLiveServingSettings silently no-ops — meaning the previous fix
(b272de8) never actually re-applied live settings on the per-entity
promote path, even though the call was reached.

currentJob.getJobConfiguration() is the EventPublisherJob the strategy
created. Wire it into the new handler at construction time, mirroring the
withJobData call DistributedIndexingStrategy already makes on the strategy's
own handler instance.

With this change, the per-entity promote path now logs

  "Applying live serving settings to staged index '...' for entity 'page':
   {\"number_of_replicas\":1,\"refresh_interval\":\"1s\", ...}"

before the alias swap, and post-promotion `_settings` show
refresh_interval=1s instead of the stuck -1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Asserts that DefaultRecreateHandler.promoteEntityIndex calls
searchClient.updateIndexSettings with the live-revert JSON
(refresh_interval=1s, replicas=1, translog.durability=request) before
swapping the alias, given a handler with bulk overrides wired through
withJobData. Without the two preceding fixes the assertion fails with
"Wanted but not invoked" — applyLiveServingSettings was never reached
on the per-entity promotion path, so the staged index inherited
refresh_interval=-1 and post-promotion live writes never became
searchable until a manual _refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 21:20
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 5, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes the distributed per-entity search-index promotion path so a promoted staged index is returned to live-serving settings before it starts serving reads/writes. This sits in the search reindexing flow and aligns the per-entity distributed promote path with the existing full finalizeReindex behavior.

Changes:

  • Restores live-serving index settings and optional force-merge in promoteEntityIndex before alias promotion.
  • Passes distributed job configuration into the recreate handler so per-entity promotion can resolve bulk/live index settings.
  • Adds a regression test around DefaultRecreateHandler.promoteEntityIndex.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java Adds live-settings restore + force-merge to the per-entity promotion path.
openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java Wires currentJob configuration into the recreate handler used by distributed per-entity promotion.
openmetadata-service/src/test/java/org/openmetadata/service/search/DefaultRecreateHandlerTest.java Adds a regression test for live-settings restoration during per-entity promotion.

harshach and others added 2 commits May 5, 2026 14:37
DefaultRecreateHandlerTest.PromoteEntityIndexTests:
  - testPromoteEntityIndexAppliesSettingsBeforeAliasSwap: InOrder
    verification that updateIndexSettings runs BEFORE swapAliases. Catches a
    swap-then-revert misordering (which would briefly serve live reads against
    refresh=-1 settings).
  - testPromoteEntityIndexForceMergesWhenConfigured: forceMerge(staged, 1) is
    invoked when bulkIndexSettings.forceMergeOnPromote=true. Catches a
    regression where the force-merge call gets dropped without anyone
    noticing.
  - testPromoteEntityIndexSkipsSettingsWithoutJobData: locks in the safe no-op
    behavior when a handler is constructed without withJobData. Documents that
    no-jobData → no settings call (vs. crash or silent revert to defaults).

DistributedSearchIndexExecutorTest:
  - initializeEntityTrackerWiresJobDataIntoDefaultRecreateHandler: triggers
    the private initializeEntityTracker with currentJob holding a populated
    jobConfiguration and verifies recreateHandler.withJobData(jobData) is
    called on the per-entity handler. This catches the second half of the
    original regression: even if applyLiveServingSettings is reached on
    promoteEntityIndex, jobData=null makes it a silent no-op. Future edits
    that drop the wiring or move handler construction elsewhere will fail
    here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triggers SearchIndexingApplication with bulkIndexSettings configured
(refresh_interval=-1, number_of_replicas=0, translog.durability=async),
waits for the run to terminate, then queries _settings on the promoted
table_search_index alias against the real OpenSearch/Elasticsearch
container (via TestSuiteBootstrap.createSearchClient()). Asserts that
each concrete index resolved by the alias has the live values applied
(refresh=1s, replicas=1, translog.durability=request) and not the bulk
overrides.

This is the end-to-end counterpart to the unit-level regression test in
DefaultRecreateHandlerTest. Catches the same class of bug at the layer
where it actually surfaced in production: an alias swap that completed
successfully according to logs but left the new live index unsearchable
because refresh was disabled and writes were buffered indefinitely.

Modeled on SearchIndexingFieldsParityIT for run-trigger / poll structure;
adds the post-completion _settings verification step that no other IT
performs today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dant test

DefaultRecreateHandler: move applyLiveServingSettings + maybeForceMerge
inside the try/finally that unregisters the staged index. Without this, a
transient OS/ES failure during _settings update or _forcemerge propagated out
before the finally ran, leaving searchRepository.unregisterStagedIndex
permanently registered — so live writes kept routing to a staged index
nothing reads from. Same fix applied to finalizeReindex for consistency
(its window is shorter since it runs at end-of-job, but the leak shape is
identical). Per gitar-bot review.

DefaultRecreateHandlerTest:
  - testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap: replace
    independent verify() calls with InOrder so the test actually locks the
    "settings before alias swap" ordering its name and the PR description
    promise. A swap-then-revert refactor would have passed before this.
    Per Copilot review.
  - Drop testPromoteEntityIndexAppliesSettingsBeforeAliasSwap (the standalone
    InOrder test added in the previous commit) — folded back into the test
    above, which now covers both ordering and JSON content in one place.
  - Add testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure —
    regression test for the gitar-bot fix above. Verified to fail with
    "IllegalState connection reset" when the calls are moved back outside
    the try block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 21:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

The why-it's-in-a-try and why-jobData-is-wired blocks read like commit
messages, not code annotations. Tests and commit history carry the
rationale; the code itself reads fine without them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-state verification I added in 9a7fa49 (indexExists after delete,
getAliases.contains(canonical) after add) called the search client
directly. If those calls themselves threw — network timeout, transport
error — the exception escaped promoteWithDeferredCanonicalDelete and was
caught by the outer phase=exception handler with markPromotionFailed(
dataLoss=false). For the getAliases case the canonical index has already
been deleted at that point, so dataLoss=false misclassifies a real data-
unavailability state.

Three small helpers:
  - safeIndexExists(client, index, entityType): gate-time check; returns
    false on throw (conservative — skip delete attempt; if canonical
    actually exists, step-3 addAliases will fail with name collision and
    the alias-attached post-check will record the right blast radius).
  - checkIndexExists(client, index): tristate Boolean for post-delete
    check; null on throw means "couldn't determine state".
  - checkAliasAttached(client, staged, alias): tristate Boolean for
    post-add check; null on throw means "couldn't determine state".

Caller logic:
  - delete-canonical post-check returning null → markPromotionFailed(
    dataLoss=false). Conservative: we don't know if delete actually took.
  - add-aliases post-check returning null → markPromotionFailed(
    dataLoss=true). Canonical IS deleted; alias state unknown is the
    worst case.

Tests:
  - testPromoteEntityIndexHandlesIndexExistsPostCheckThrow: gate returns
    true, post-delete check throws → failed but NOT data loss.
  - testPromoteEntityIndexHandlesGetAliasesPostCheckThrow: post-add check
    throws → failed AND data loss (canonical already gone).

Per gitar-bot review on PR #27920. All 40 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 02:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

…ic IT

Six findings from copilot-pull-request-reviewer.

1+4. Track canonicalDeleted via positive evidence only (DefaultRecreateHandler).
   ES/OS indexExists/getAliases/listIndicesByPrefix all swallow transport
   errors and return false/empty. A "negative" probe result cannot be
   distinguished from "probe failed". The previous shape blindly trusted
   probe values and could misclassify a transient failure as data loss
   when canonical was actually still serving (gate-says-no-but-actually-yes
   case) or, conversely, as not-data-loss when canonical was actually
   gone. Now: a `canonicalDeleted` boolean defaults to false and only
   flips to true after both the delete call and a positive post-state
   check confirm the index is gone. dataLoss classification uses this
   flag — never claim data loss without positive evidence canonical was
   deleted. Added regression test
   `testPromoteEntityIndexAmbiguousGateProbeIsNotDataLoss` for the gate-
   ambiguity case.

2. Wire onJobFailed in DistributedIndexingStrategy.
   Previously only SearchIndexExecutor emitted the listener callback for
   FAILED. The distributed strategy returned the status but never invoked
   listeners.onJobFailed, so jobData.failure stayed empty and the
   AppRunRecord/WebSocket update had no failure context. Now mirrors the
   single-server behavior with an IllegalStateException naming the
   data-loss entities.

3. IT: assertEquals("request", durability) instead of assertNotEquals
   ("async"). The non-equals assertion would pass if a silent translog-
   revert drop left durability at any non-async cluster default, missing
   the regression. Pin the exact configured value.

5. IT: assert exactly one concrete index resolves the alias, not
   at-least-one. A broken swap that leaves the alias attached to BOTH
   the pre-existing live index AND the new *_rebuild_* index would
   satisfy "any rebuild present" but duplicate search results in
   production. Use assertEquals(1, settingsByIndex.size()).

6. IT hermeticity: snapshot/restore SearchIndexingApplication's
   appConfiguration. /v1/apps/trigger/{name} merges payload into the
   persisted config, so without restore later tests in the suite
   inherit this test's bulkIndexSettings / liveIndexSettings /
   useDistributedIndexing values — making suite ordering change what
   they exercise. Both IT methods now do a try/finally with
   snapshotAppConfig() + restoreAppConfig().

All 151 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment on lines +350 to +354
Boolean stillExistsAfterDelete = checkIndexExists(searchClient, canonicalIndex);
if (stillExistsAfterDelete == null) {
// post-check itself failed; we have no positive evidence of deletion; bail conservatively
LOG.error(
"[ALIAS_PROMOTE_FAILED phase=delete-canonical entity={} stagedIndex={} canonicalIndex={} reason=post-check-failed]",
Comment on lines +390 to +399
Boolean aliasAttached = checkAliasAttached(searchClient, stagedIndex, canonicalIndex);
if (aliasAttached == null) {
LOG.error(
"[ALIAS_PROMOTE_FAILED phase=swap2 entity={} stagedIndex={} canonicalIndex={} reason=post-check-failed canonicalDeleted={}] "
+ "{}: alias-attached check failed.",
entityType,
stagedIndex,
canonicalIndex,
canonicalDeleted,
canonicalDeleted ? "DATA UNAVAILABLE" : "DEGRADED");
harshach and others added 2 commits May 5, 2026 21:54
restoreAppConfig POSTs to /v1/apps/trigger/{name}, which both merges the
body into the persisted config AND starts a new run. Returning without
waiting left SearchIndexingApplication running into the next test class
(AppsResourceIT.test_triggerApp_200), which then timed out for 2 minutes
on "already running".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The terminal-state check compared status against uppercase "SUCCESS",
"FAILED", "COMPLETED", but appRunRecord.json defines the status enum
in lowercase ("success", "failed", "completed", ...). The check never
matched and the 30s wait silently fell through to the catch block,
making it a no-op. test_triggerApp_200 then relied on its 2-minute
"already running" trigger retry, which timed out whenever a longer
reindex (e.g. SearchIndexingFieldsParityIT's "all entities" reindex)
was still in flight.

Switch the terminal check to "not running and not started"
case-insensitively, and raise the ceiling to 5 minutes so the wait
actually covers a long in-flight reindex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment on lines 134 to 138
if (canonicalIndex == null || stagedIndex == null) {
LOG.error(
"Cannot finalize reindex for entity '{}'. canonicalIndex={}, stagedIndex={}",
"Cannot promote index for entity '{}'. canonicalIndex={}, stagedIndex={}",
entityType,
canonicalIndex,
Comment on lines +252 to +267
@SuppressWarnings("unchecked")
private static Map<String, Object> snapshotAppConfig(HttpClient httpClient) {
try {
String body =
httpClient.executeForString(
HttpMethod.GET, "/v1/apps/name/" + APP_NAME + "?fields=appConfiguration", null, null);
JsonNode root = MAPPER.readTree(body);
JsonNode cfg = root.path("appConfiguration");
if (cfg.isMissingNode() || cfg.isNull()) {
return Map.of();
}
return MAPPER.convertValue(cfg, Map.class);
} catch (Exception ex) {
// Snapshot is best-effort; the test still runs even if we can't restore.
return Map.of();
}
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 6, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Consolidates the per-entity promotion path to correctly restore live settings and wires job configurations to prevent index setting drift. Integration tests were hardened to improve reliability and address previous findings regarding resource leaks and race conditions.

✅ 3 resolved
Edge Case: applyLiveServingSettings/maybeForceMerge exception skips unregisterStagedIndex

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:330-331 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:334 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:408-409
Lines 330-331 in promoteEntityIndex call applyLiveServingSettings and maybeForceMerge outside the try/catch/finally block (lines 334-410). If either throws (e.g., Elasticsearch connectivity issue during updateIndexSettings), the finally at line 408 that calls searchRepository.unregisterStagedIndex(entityType, stagedIndex) is never reached. This leaves the staged index permanently registered, potentially routing live writes to an un-promoted index that nothing reads from.

This mirrors the same pattern in finalizeReindex (lines 139-140 are also outside its try block), so it's a pre-existing design choice being replicated here. However, since the per-entity path runs mid-job (not at terminal cleanup), the window for a transient ES failure is wider.

Quality: Rest5Client created but never closed in integration test

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchIndexAliasPromotionIT.java:250
In readIndexSettings, TestSuiteBootstrap.createSearchClient() returns a Rest5Client that is never closed. Over repeated test runs or if the test is retried, this leaks HTTP connections to Elasticsearch. Use try-with-resources or close in a finally block.

Edge Case: Post-state checks can throw unhandled, leaving promotion in limbo

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:344 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java:371
The new post-state verification calls at lines 344 and 371 (searchClient.indexExists(canonicalIndex) and searchClient.getAliases(stagedIndex)) are outside any try-catch block. If these calls throw (network timeout, transport error), the exception propagates unhandled from promoteWithDeferredCanonicalDelete.

The getAliases call at line 371 is especially dangerous: at that point the canonical index has already been deleted. An unhandled exception means:

  • The canonical index is gone (data loss state)
  • markPromotionFailed is never called, so the failure isn't recorded
  • The alias may or may not have been attached

Similarly, if indexExists at line 344 throws after deleteIndexWithBackoff succeeded, we never proceed to add the canonical alias.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants