fix(search): restore live settings on per-entity promote path#27920
Conversation
DefaultRecreateHandler exposes two finalization paths:
- finalizeReindex(...) — centralized end-of-job promotion. Calls
applyLiveServingSettings + maybeForceMerge
before the alias swap, reverting the bulk
overrides (refresh_interval=-1, replicas=0,
async translog) back to live values
(refresh=1s, replicas=1, durable translog).
- promoteEntityIndex(ctx, ok) — per-entity promotion. Used by the distributed
search-indexer's "promote as soon as all
partitions for an entity complete" callback
(DistributedSearchIndexExecutor.promoteEntityIndex).
Swaps the alias and cleans up old indices —
but never restores live settings.
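The two settings profiles in play can be sketched as plain maps (values taken from the description above; the helper names are illustrative, not the handler's actual API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Main {
    // Bulk-build overrides applied while reindexing (illustrative helper).
    public static Map<String, Object> bulkOverrides() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("refresh_interval", "-1");       // disable refresh during bulk writes
        m.put("number_of_replicas", 0);        // no replication while rebuilding
        m.put("translog.durability", "async"); // async translog for throughput
        return m;
    }

    // Live-serving values the promote path must restore before the alias swap.
    public static Map<String, Object> liveSettings() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("refresh_interval", "1s");
        m.put("number_of_replicas", 1);
        m.put("translog.durability", "request");
        return m;
    }

    public static void main(String[] args) {
        System.out.println(liveSettings().get("refresh_interval")); // 1s
    }
}
```

An index promoted with the left map still in place never makes new documents searchable, which is the bug described below.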
When an entity finishes its partitions before the final reconciliation
(typically the smallest entities — e.g. knowledge `page` with ~11 rows),
its index is promoted via the per-entity path, the alias swap succeeds,
and the bulk-build overrides become the new live settings. refresh_interval
stays at -1 in production, so live writes after the reindex are buffered in
the translog and never reach searchable segments until a manual _refresh.
Externally this surfaces as "create an article, hierarchy is empty until I
re-trigger reindex" — exactly the user-reported bug.
Mirror the finalizeReindex sequence by calling applyLiveServingSettings
(and maybeForceMerge for parity) at the top of the promote block in
promoteEntityIndex, before the alias swap.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
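The corrected ordering can be sketched with a recording fake in place of the real search client. The interface and method names here are simplified stand-ins mirroring the calls named above, not OpenMetadata's actual signatures:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // Hypothetical slice of the search-client surface used by promotion.
    public interface SearchClient {
        void updateIndexSettings(String index, String settingsJson);
        void forceMerge(String index, int maxSegments);
        void swapAliases(String stagedIndex, String canonicalAlias);
    }

    /** Corrected per-entity promote ordering: restore live settings (and
     *  optionally force-merge) BEFORE the alias swap, mirroring finalizeReindex. */
    public static void promoteEntityIndex(
            SearchClient client, String staged, String alias,
            String liveSettingsJson, boolean forceMergeOnPromote) {
        client.updateIndexSettings(staged, liveSettingsJson); // applyLiveServingSettings
        if (forceMergeOnPromote) {
            client.forceMerge(staged, 1);                     // maybeForceMerge
        }
        client.swapAliases(staged, alias);                    // only now start serving
    }

    /** Runs the promote path against a fake that records call order. */
    public static List<String> recordCalls(boolean forceMerge) {
        List<String> calls = new ArrayList<>();
        SearchClient recording = new SearchClient() {
            public void updateIndexSettings(String i, String s) { calls.add("settings"); }
            public void forceMerge(String i, int n) { calls.add("forceMerge"); }
            public void swapAliases(String s, String a) { calls.add("swap"); }
        };
        promoteEntityIndex(recording, "page_rebuild_1", "page_search_index",
                "{\"refresh_interval\":\"1s\"}", forceMerge);
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(recordCalls(true)); // [settings, forceMerge, swap]
    }
}
```

The recording fake is the same shape of check the regression test below enforces with Mockito's InOrder: settings first, swap last.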
DefaultRecreateHandler.applyLiveServingSettings reads from the handler's jobData field (live + bulk index-settings overrides on the EventPublisherJob). The per-entity distributed-promotion path in DistributedSearchIndexExecutor created its own DefaultRecreateHandler instance and never called withJobData(jobData) on it. With jobData=null, buildRevertJson returns null and applyLiveServingSettings silently no-ops — meaning the previous fix (b272de8) never actually re-applied live settings on the per-entity promote path, even though the call was reached.

currentJob.getJobConfiguration() is the EventPublisherJob the strategy created. Wire it into the new handler at construction time, mirroring the withJobData call DistributedIndexingStrategy already makes on the strategy's own handler instance.

With this change, the per-entity promote path now logs "Applying live serving settings to staged index '...' for entity 'page': {\"number_of_replicas\":1,\"refresh_interval\":\"1s\", ...}" before the alias swap, and post-promotion `_settings` show refresh_interval=1s instead of the stuck -1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Asserts that DefaultRecreateHandler.promoteEntityIndex calls searchClient.updateIndexSettings with the live-revert JSON (refresh_interval=1s, replicas=1, translog.durability=request) before swapping the alias, given a handler with bulk overrides wired through withJobData.

Without the two preceding fixes the assertion fails with "Wanted but not invoked" — applyLiveServingSettings was never reached on the per-entity promotion path, so the staged index inherited refresh_interval=-1 and post-promotion live writes never became searchable until a manual _refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Fixes the distributed per-entity search-index promotion path so a promoted staged index is returned to live-serving settings before it starts serving reads/writes. This sits in the search reindexing flow and aligns the per-entity distributed promote path with the existing full finalizeReindex behavior.
Changes:
- Restores live-serving index settings and optional force-merge in `promoteEntityIndex` before alias promotion.
- Passes distributed job configuration into the recreate handler so per-entity promotion can resolve bulk/live index settings.
- Adds a regression test around `DefaultRecreateHandler.promoteEntityIndex`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/search/DefaultRecreateHandler.java | Adds live-settings restore + force-merge to the per-entity promotion path. |
| openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java | Wires currentJob configuration into the recreate handler used by distributed per-entity promotion. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/DefaultRecreateHandlerTest.java | Adds a regression test for live-settings restoration during per-entity promotion. |
DefaultRecreateHandlerTest.PromoteEntityIndexTests:
- testPromoteEntityIndexAppliesSettingsBeforeAliasSwap: InOrder
verification that updateIndexSettings runs BEFORE swapAliases. Catches a
swap-then-revert misordering (which would briefly serve live reads against
refresh=-1 settings).
- testPromoteEntityIndexForceMergesWhenConfigured: forceMerge(staged, 1) is
invoked when bulkIndexSettings.forceMergeOnPromote=true. Catches a
regression where the force-merge call gets dropped without anyone
noticing.
- testPromoteEntityIndexSkipsSettingsWithoutJobData: locks in the safe no-op
behavior when a handler is constructed without withJobData. Documents that
no-jobData → no settings call (vs. crash or silent revert to defaults).
DistributedSearchIndexExecutorTest:
- initializeEntityTrackerWiresJobDataIntoDefaultRecreateHandler: triggers
the private initializeEntityTracker with currentJob holding a populated
jobConfiguration and verifies recreateHandler.withJobData(jobData) is
called on the per-entity handler. This catches the second half of the
original regression: even if applyLiveServingSettings is reached on
promoteEntityIndex, jobData=null makes it a silent no-op. Future edits
that drop the wiring or move handler construction elsewhere will fail
here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triggers SearchIndexingApplication with bulkIndexSettings configured (refresh_interval=-1, number_of_replicas=0, translog.durability=async), waits for the run to terminate, then queries _settings on the promoted table_search_index alias against the real OpenSearch/Elasticsearch container (via TestSuiteBootstrap.createSearchClient()). Asserts that each concrete index resolved by the alias has the live values applied (refresh=1s, replicas=1, translog.durability=request) and not the bulk overrides.

This is the end-to-end counterpart to the unit-level regression test in DefaultRecreateHandlerTest. It catches the same class of bug at the layer where it actually surfaced in production: an alias swap that completed successfully according to logs but left the new live index unsearchable because refresh was disabled and writes were buffered indefinitely.

Modeled on SearchIndexingFieldsParityIT for run-trigger / poll structure; adds the post-completion _settings verification step that no other IT performs today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dant test
DefaultRecreateHandler: move applyLiveServingSettings + maybeForceMerge
inside the try/finally that unregisters the staged index. Without this, a
transient OS/ES failure during the _settings update or _forcemerge propagated
out before the finally ran, skipping searchRepository.unregisterStagedIndex —
so the staged index stayed permanently registered and live writes kept routing
to an index nothing reads from. Same fix applied to finalizeReindex for consistency
(its window is shorter since it runs at end-of-job, but the leak shape is
identical). Per gitar-bot review.
DefaultRecreateHandlerTest:
- testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap: replace
independent verify() calls with InOrder so the test actually locks the
"settings before alias swap" ordering that its name and the PR description
promise. A swap-then-revert refactor would have passed before this.
Per Copilot review.
- Drop testPromoteEntityIndexAppliesSettingsBeforeAliasSwap (the standalone
InOrder test added in the previous commit) — folded back into the test
above, which now covers both ordering and JSON content in one place.
- Add testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure —
regression test for the gitar-bot fix above. Verified to fail with
"IllegalState connection reset" when the calls are moved back outside
the try block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The why-it's-in-a-try and why-jobData-is-wired blocks read like commit messages, not code annotations. Tests and commit history carry the rationale; the code itself reads fine without them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-state verification I added in 9a7fa49 (indexExists after delete, getAliases.contains(canonical) after add) called the search client directly. If those calls themselves threw — network timeout, transport error — the exception escaped promoteWithDeferredCanonicalDelete and was caught by the outer phase=exception handler with markPromotionFailed(dataLoss=false). For the getAliases case the canonical index has already been deleted at that point, so dataLoss=false misclassifies a real data-unavailability state.

Three small helpers:
- safeIndexExists(client, index, entityType): gate-time check; returns false on throw (conservative — skip the delete attempt; if canonical actually exists, step-3 addAliases will fail with a name collision and the alias-attached post-check will record the right blast radius).
- checkIndexExists(client, index): tristate Boolean for the post-delete check; null on throw means "couldn't determine state".
- checkAliasAttached(client, staged, alias): tristate Boolean for the post-add check; null on throw means "couldn't determine state".

Caller logic:
- delete-canonical post-check returning null → markPromotionFailed(dataLoss=false). Conservative: we don't know if the delete actually took.
- add-aliases post-check returning null → markPromotionFailed(dataLoss=true). Canonical IS deleted; unknown alias state is the worst case.

Tests:
- testPromoteEntityIndexHandlesIndexExistsPostCheckThrow: gate returns true, post-delete check throws → failed but NOT data loss.
- testPromoteEntityIndexHandlesGetAliasesPostCheckThrow: post-add check throws → failed AND data loss (canonical already gone).

Per gitar-bot review on PR #27920. All 40 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
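The tristate-probe idea can be sketched in isolation. The helper names mirror the ones described above, but the one-method client interface is a simplified assumption, not the real OpenSearch/Elasticsearch client:

```java
public class Main {
    // Simplified stand-in for the search client; the real one is much larger.
    public interface SearchClient {
        boolean indexExists(String index) throws Exception;
    }

    /** Tristate probe: TRUE/FALSE when the client answered, null when the
     *  probe itself threw -- "couldn't determine state". */
    public static Boolean checkIndexExists(SearchClient client, String index) {
        try {
            return client.indexExists(index);
        } catch (Exception transportError) {
            return null;
        }
    }

    /** Post-add classification: an unknown alias state (null) counts as data
     *  loss only when the canonical index is already gone -- the worst case. */
    public static boolean dataLossAfterAliasCheck(Boolean aliasAttached, boolean canonicalDeleted) {
        if (Boolean.TRUE.equals(aliasAttached)) {
            return false; // alias confirmed serving reads
        }
        return canonicalDeleted; // unknown or detached: data loss iff canonical deleted
    }

    public static void main(String[] args) {
        SearchClient flaky = index -> { throw new Exception("read timed out"); };
        System.out.println(checkIndexExists(flaky, "page_search_index")); // null
        System.out.println(dataLossAfterAliasCheck(null, true));          // true
    }
}
```

A plain boolean return could not distinguish "index is gone" from "probe failed"; the Boolean/null tristate keeps those cases apart so the caller can pick the conservative branch for each phase.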
…ic IT
Six findings from copilot-pull-request-reviewer.
1+4. Track canonicalDeleted via positive evidence only (DefaultRecreateHandler).
ES/OS indexExists/getAliases/listIndicesByPrefix all swallow transport
errors and return false/empty. A "negative" probe result cannot be
distinguished from "probe failed". The previous shape blindly trusted
probe values and could misclassify a transient failure as data loss
when canonical was actually still serving (gate-says-no-but-actually-yes
case) or, conversely, as not-data-loss when canonical was actually
gone. Now: a `canonicalDeleted` boolean defaults to false and only
flips to true after both the delete call and a positive post-state
check confirm the index is gone. dataLoss classification uses this
flag — never claim data loss without positive evidence canonical was
deleted. Added regression test
`testPromoteEntityIndexAmbiguousGateProbeIsNotDataLoss` for the gate-
ambiguity case.
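A minimal sketch of the positive-evidence rule. It assumes a client whose probes throw on transport errors; the real ES/OS helpers swallow them and return false, which is exactly why the flag only trusts an affirmative post-check rather than any probe value:

```java
public class Main {
    // Simplified stand-in for the search client used by promotion.
    public interface SearchClient {
        void deleteIndex(String index) throws Exception;
        boolean indexExists(String index) throws Exception;
    }

    /** canonicalDeleted starts false and flips true only when the delete call
     *  succeeded AND a post-check positively confirms the index is gone. Any
     *  throw leaves the flag false, so an ambiguous transient failure can
     *  never be classified as data loss. */
    public static boolean deleteWithPositiveEvidence(SearchClient client, String canonical) {
        boolean canonicalDeleted = false;
        try {
            client.deleteIndex(canonical);
            if (!client.indexExists(canonical)) { // positive confirmation of deletion
                canonicalDeleted = true;
            }
        } catch (Exception ambiguous) {
            // transport error anywhere in the sequence: no positive evidence
        }
        return canonicalDeleted;
    }

    public static void main(String[] args) {
        SearchClient healthy = new SearchClient() {
            public void deleteIndex(String index) {}
            public boolean indexExists(String index) { return false; } // confirmed gone
        };
        SearchClient flakyProbe = new SearchClient() {
            public void deleteIndex(String index) {}
            public boolean indexExists(String index) throws Exception {
                throw new Exception("transport timeout"); // state unknowable
            }
        };
        System.out.println(deleteWithPositiveEvidence(healthy, "page_search_index"));    // true
        System.out.println(deleteWithPositiveEvidence(flakyProbe, "page_search_index")); // false
    }
}
```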
2. Wire onJobFailed in DistributedIndexingStrategy.
Previously only SearchIndexExecutor emitted the listener callback for
FAILED. The distributed strategy returned the status but never invoked
listeners.onJobFailed, so jobData.failure stayed empty and the
AppRunRecord/WebSocket update had no failure context. Now mirrors the
single-server behavior with an IllegalStateException naming the
data-loss entities.
3. IT: assertEquals("request", durability) instead of assertNotEquals
("async"). The non-equals assertion would pass if a silent translog-
revert drop left durability at any non-async cluster default, missing
the regression. Pin the exact configured value.
5. IT: assert exactly one concrete index resolves the alias, not
at-least-one. A broken swap that leaves the alias attached to BOTH
the pre-existing live index AND the new *_rebuild_* index would
satisfy "any rebuild present" but duplicate search results in
production. Use assertEquals(1, settingsByIndex.size()).
6. IT hermeticity: snapshot/restore SearchIndexingApplication's
appConfiguration. /v1/apps/trigger/{name} merges payload into the
persisted config, so without restore later tests in the suite
inherit this test's bulkIndexSettings / liveIndexSettings /
useDistributedIndexing values — making suite ordering change what
they exercise. Both IT methods now do a try/finally with
snapshotAppConfig() + restoreAppConfig().
All 151 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
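The hermeticity pattern from finding 6 (snapshot, mutate, always restore in a finally) can be sketched with an in-memory stand-in for the persisted app configuration; the store and method names are illustrative, not the IT's actual helpers:

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // In-memory stand-in for the persisted app configuration.
    public static final Map<String, Object> persistedConfig = new HashMap<>();

    public static Map<String, Object> snapshotAppConfig() {
        return new HashMap<>(persistedConfig);
    }

    public static void restoreAppConfig(Map<String, Object> saved) {
        persistedConfig.clear();
        persistedConfig.putAll(saved);
    }

    /** Hermetic IT shape: mutate config for one test, always restore it so
     *  later tests in the suite don't inherit this test's overrides. */
    public static void runWithOverrides(Map<String, Object> overrides, Runnable testBody) {
        Map<String, Object> saved = snapshotAppConfig();
        try {
            persistedConfig.putAll(overrides); // like /v1/apps/trigger merging the payload
            testBody.run();
        } finally {
            restoreAppConfig(saved); // runs even when the test body throws
        }
    }

    public static void main(String[] args) {
        persistedConfig.put("useDistributedIndexing", false);
        runWithOverrides(Map.of("useDistributedIndexing", true), () -> {});
        System.out.println(persistedConfig.get("useDistributedIndexing")); // false
    }
}
```

Because the restore sits in a finally, suite ordering no longer decides what later tests exercise.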
Boolean stillExistsAfterDelete = checkIndexExists(searchClient, canonicalIndex);
if (stillExistsAfterDelete == null) {
  // post-check itself failed; we have no positive evidence of deletion; bail conservatively
  LOG.error(
      "[ALIAS_PROMOTE_FAILED phase=delete-canonical entity={} stagedIndex={} canonicalIndex={} reason=post-check-failed]",

Boolean aliasAttached = checkAliasAttached(searchClient, stagedIndex, canonicalIndex);
if (aliasAttached == null) {
  LOG.error(
      "[ALIAS_PROMOTE_FAILED phase=swap2 entity={} stagedIndex={} canonicalIndex={} reason=post-check-failed canonicalDeleted={}] "
          + "{}: alias-attached check failed.",
      entityType,
      stagedIndex,
      canonicalIndex,
      canonicalDeleted,
      canonicalDeleted ? "DATA UNAVAILABLE" : "DEGRADED");
restoreAppConfig POSTs to /v1/apps/trigger/{name}, which both merges the
body into the persisted config AND starts a new run. Returning without
waiting left SearchIndexingApplication running into the next test class
(AppsResourceIT.test_triggerApp_200), which then timed out for 2 minutes
on "already running".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The terminal-state check compared status against uppercase "SUCCESS",
"FAILED", "COMPLETED", but appRunRecord.json defines the status enum
in lowercase ("success", "failed", "completed", ...). The check never
matched and the 30s wait silently fell through to the catch block,
making it a no-op. test_triggerApp_200 then relied on its 2-minute
"already running" trigger retry, which timed out whenever a longer
reindex (e.g. SearchIndexingFieldsParityIT's "all entities" reindex)
was still in flight.
Switch the terminal check to "not running and not started"
case-insensitively, and raise the ceiling to 5 minutes so the wait
actually covers a long in-flight reindex.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
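The corrected terminal-state predicate can be sketched as follows (the lowercase status strings come from appRunRecord.json as described above; the method name is illustrative):

```java
import java.util.Locale;

public class Main {
    /** A run is terminal when it is neither running nor started. Comparing
     *  case-insensitively tolerates both the lowercase enum values from
     *  appRunRecord.json and any uppercase variants from older callers. */
    public static boolean isTerminal(String status) {
        if (status == null) {
            return false; // no record yet: keep waiting
        }
        String s = status.toLowerCase(Locale.ROOT);
        return !s.equals("running") && !s.equals("started");
    }

    public static void main(String[] args) {
        System.out.println(isTerminal("success")); // true
        System.out.println(isTerminal("RUNNING")); // false
    }
}
```

The "not running and not started" shape also covers future terminal states (e.g. a new "cancelled" value) without another enum-list update, unlike the original allow-list of "SUCCESS"/"FAILED"/"COMPLETED".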
if (canonicalIndex == null || stagedIndex == null) {
  LOG.error(
-     "Cannot finalize reindex for entity '{}'. canonicalIndex={}, stagedIndex={}",
+     "Cannot promote index for entity '{}'. canonicalIndex={}, stagedIndex={}",
      entityType,
      canonicalIndex,
@SuppressWarnings("unchecked")
private static Map<String, Object> snapshotAppConfig(HttpClient httpClient) {
  try {
    String body =
        httpClient.executeForString(
            HttpMethod.GET, "/v1/apps/name/" + APP_NAME + "?fields=appConfiguration", null, null);
    JsonNode root = MAPPER.readTree(body);
    JsonNode cfg = root.path("appConfiguration");
    if (cfg.isMissingNode() || cfg.isNull()) {
      return Map.of();
    }
    return MAPPER.convertValue(cfg, Map.class);
  } catch (Exception ex) {
    // Snapshot is best-effort; the test still runs even if we can't restore.
    return Map.of();
  }
}
Code Review ✅ Approved (3 resolved / 3 findings)

Consolidates the per-entity promotion path to correctly restore live settings and wires job configuration to prevent index-setting drift. Integration tests were hardened to improve reliability and address previous findings regarding resource leaks and race conditions.

✅ 3 resolved

✅ Edge Case: applyLiveServingSettings/maybeForceMerge exception skips unregisterStagedIndex
✅ Quality: Rest5Client created but never closed in integration test
✅ Edge Case: Post-state checks can throw unhandled, leaving promotion in limbo
Summary
Reindex promotion succeeds but the swapped-in index keeps the bulk-build settings (`refresh_interval=-1`, `replicas=0`, `translog=async`), so live writes after promotion are buffered indefinitely and never become searchable until a manual `_refresh`. This surfaces in production as "indexing is fast, but searches/UIs return nothing on freshly created entities."

Two coupled bugs in the per-entity distributed promote path:

1. `DefaultRecreateHandler.promoteEntityIndex` was missing the live-settings restore. `finalizeReindex` calls `applyLiveServingSettings` + `maybeForceMerge` before the alias swap; the per-entity sibling jumped straight from doc-count check to alias swap. Mirror the call sequence.
2. `DistributedSearchIndexExecutor` constructed a second `RecreateIndexHandler` without wiring `jobData`. With `jobData=null`, `applyLiveServingSettings` resolves no overrides, `buildRevertJson` returns `null`, and the call silently no-ops — so even if the previous fix were in place, the per-entity path would still miss settings restoration. Pass `currentJob.getJobConfiguration()` into the handler at construction, mirroring what `DistributedIndexingStrategy` already does.

Provenance
These two source changes are cherry-picks of `b272de85f9` and `68dad610ed`, which currently sit unmerged on the `context_center` feature branch (PR #27558) — a Knowledge → Context Center rename PR that is unrelated to search indexing. Splitting them out so they can ship independently.

A regression test (`testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap`) is added to lock the behavior in: it builds a handler with bulk overrides on `EventPublisherJob`, calls `promoteEntityIndex`, and asserts `searchClient.updateIndexSettings` is invoked with `refresh_interval=1s`, `number_of_replicas=1`, `translog.durability=request` before the alias swap. Verified locally: the test passes with both source commits applied; reverting either one fails the assertion with `Wanted but not invoked`.

Test plan

- `mvn test -pl openmetadata-service -Dtest='DefaultRecreateHandlerTest,DistributedSearchIndexExecutorTest,DistributedIndexingStrategyTest,SearchIndexExecutorControlFlowTest'` → 136 tests, 0 failures
- With `applyLiveServingSettings`/`maybeForceMerge` removed from `promoteEntityIndex`, Mockito reports `Wanted but not invoked`
- Run a reindex with bulk overrides (`replicas=0`, `refresh=-1`, `translog=async`) and confirm post-promotion `_settings` show `refresh_interval=1s`, not `-1`

🤖 Generated with Claude Code
Summary by Gitar
- Consolidated `finalizeReindex` and `promoteEntityIndex` into a single `promote` path to prevent feature drift.
- Added `applyLiveServingSettings` and `maybeForceMerge` to `promoteEntityIndex` to ensure production-ready index settings are restored before alias swapping.
- Fixed a bug where `DistributedSearchIndexExecutor` constructed a handler without `jobData`, causing settings restoration to no-op.
- Added post-state checks (`indexExists`, `getAliases`) to ensure alias integrity.
- Added failure tracking (`failedPromotions`, `dataLossPromotions`) to surface job failures and data-loss scenarios in `DistributedIndexingStrategy` and `SearchIndexExecutor`.
- Added `SearchIndexAliasPromotionIT` to verify live-setting restoration and idempotent alias promotion.
- Expanded `DefaultRecreateHandlerTest` for promotion edge cases, including error handling and state verification.

This will update automatically on new commits.