Skip to content

fix(os-migration): harden startup validation, add emergency halt path, fix thread-safe formatter#35632

Merged
fabrizzio-dotCMS merged 10 commits into
mainfrom
fix/os-migration-startup-hardening-35631
May 11, 2026
Merged

fix(os-migration): harden startup validation, add emergency halt path, fix thread-safe formatter#35632
fabrizzio-dotCMS merged 10 commits into
mainfrom
fix/os-migration-startup-hardening-35631

Conversation

@fabrizzio-dotCMS
Copy link
Copy Markdown
Member

Closes #35631

Summary

  • IndexStartupValidator: connectivity failures are now hard errors (throws DotRuntimeException); validateIndexingConfig() returns boolean so callers can react without catching exceptions.
  • IndexConfigHelper: added MigrationPhase.reset() and haltMigration() — runtime-only emergency rollback that writes ordinal 0 back to FEATURE_FLAG_OPEN_SEARCH_PHASE without a restart.
  • ContentletIndexAPIImpl: calls haltMigration() when startup validation fails; hasEmptyIndices() is now phase-aware (empty OS index in phases 1/2 no longer triggers unintended full ES reindex).
  • ESIndexResource: replaces non-thread-safe SimpleDateFormat with immutable DateTimeFormatter to prevent corrupted index names under concurrent requests.
  • docs: adds OPENSEARCH_MIGRATION_TEST_PLAN.md — 530-line test plan covering all migration phases, rollback scenarios, and dual-write validation.

Test plan

  • Phase 0 (ES-only): startup unchanged, checkAndInitializeIndex() skips OS validator.
  • Phase 1/2 with OS unreachable: validateIndexingConfig() returns falsehaltMigration() resets phase to 0 → fatal log emitted → dotCMS falls back to ES-only.
  • Phase 1/2 with empty OS index + populated ES: no full reindex triggered; ES index preserved.
  • Phase 3 with empty OS index: full reindex is triggered (expected).
  • Concurrent calls to /api/v1/esindex/create: index names are well-formed (no SimpleDateFormat race).
  • See docs/backend/OPENSEARCH_MIGRATION_TEST_PLAN.md for full test matrix.

🤖 Generated with Claude Code

…, fix thread-safe formatter

Closes #35631

- IndexStartupValidator: promote OS connectivity failures to hard errors
  (throws DotRuntimeException); validateIndexingConfig() now returns boolean
  so callers can react instead of only catching exceptions.

- IndexConfigHelper: add MigrationPhase.reset() and haltMigration() for
  runtime emergency rollback — writes ordinal 0 back to
  FEATURE_FLAG_OPEN_SEARCH_PHASE without a restart.

- ContentletIndexAPIImpl: wire haltMigration() when validateIndexingConfig()
  returns false; fix hasEmptyIndices() to be phase-aware so an empty OS
  index in phases 1/2 does not trigger an unintended full ES reindex.

- ESIndexResource: replace non-thread-safe SimpleDateFormat (timestampFormatter)
  with immutable DateTimeFormatter (threadSafeTimestampFormatter) to prevent
  corrupted index names under concurrent requests.

- docs: add OPENSEARCH_MIGRATION_TEST_PLAN.md (530-line test plan covering
  all 6 migration phases, rollback scenarios, and dual-write validation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 8, 2026

Claude finished @fabrizzio-dotCMS's task in 2m 7s —— View job


Review

  • Read changed Java files
  • Analyze IndexStartupValidator changes
  • Analyze IndexConfigHelper.haltMigration / reset
  • Analyze ContentletIndexAPIImpl (haltMigration call, hasEmptyIndices phase-awareness)
  • Analyze ESIndexResource DateTimeFormatter swap
  • Post review

The shape is good — version + endpoint checks gate boot, halt path is runtime-only so testers can recover, and the formatter swap is correct. A few things look wrong or worth a second look:

1. Phase 3 "abort loudly" doesn't actually abort — the surrounding try/catch swallows it

ContentletIndexAPIImpl.java:528 throws a DotRuntimeException when validation fails in Phase 3, with a comment saying it's safer to abort than serve stale ES data. But checkAndInitializeIndex() is wrapped in try { … } catch (Exception e) { Logger.fatal(…); } (line 551), which catches the throw and lets startup proceed.

End state in Phase 3 + OS unreachable today:

  • Logger.error in validateIndexingConfig() ← from the validator
  • DotRuntimeException thrown ← intended abort
  • Caught by outer handler → Logger.fatal("Failed to create new indexes: …")
  • Method returns, boot continues, no indexes get created

That's effectively the same outcome as the non-Phase-3 path, just with a different log line. If the intent is genuinely "do not boot with stale ES," either let the exception escape, narrow the outer catch, or set a flag that disables traffic. Otherwise the Phase 3 branch is documentation, not behavior.

2. The thread-safe formatter swap is incomplete

PR says SimpleDateFormat was replaced "to prevent corrupted index names under concurrent requests." Only ESIndexResource.java was updated. The same shared SimpleDateFormat timestampFormatter (declared as an interface constant on ContentletIndexAPI:33, still annotated @Deprecated(forRemoval = true)) is still used by:

  • IndiciesInfo.java:78 (declaration), :143 (parse), :160 (format)
  • dotmarketing/sitesearch/ajax/SiteSearchAjaxAction.java:117 — formatting a brand-new index name in a request handler (same race the PR was fixing)
  • dotmarketing/portlets/cmsmaintenance/ajax/IndexAjaxAction.java:126 — same pattern
  • ContentletIndexAPIImpl.java:1316IndiciesInfo.timestampFormatter.parse(ts)

The race in SiteSearchAjaxAction and IndexAjaxAction is the exact same shape as the one in ESIndexResource. Either finish migrating callers and delete the deprecated field, or scope the PR description to "fixed the one in ESIndexResource."

3. Possible behavior regression in hasEmptyIndices() for the initial Phase 0 → Phase 1 transition

hasEmptyIndices() now returns isEsWorkingIndexEmpty() for phases 0–2. The comment explains why this avoids spurious reindexes during dual-write catchup, which is correct for steady-state. But the first time an existing installation flips from phase 0 to phase 1, ES is populated and OS is empty — the old code would have triggered an automatic backfill reindex; the new code won't. If backfill is expected to happen through an explicit operator action / endpoint, fine, but worth a one-liner in the test plan / phase-1 cutover docs so nobody finds out the hard way.

4. haltMigration() may silently no-op depending on the config source

MigrationPhase.reset()'s javadoc already calls this out: if FEATURE_FLAG_OPEN_SEARCH_PHASE lives in ConfigSystemTable (DB-backed config), Config.setProperty writes to in-memory props but the DB value still wins on the next current() read. That means in a real prod scenario where the flag was toggled via the dotCMS config UI, haltMigration() does nothing — and the fatal log gives a false sense that fallback occurred. Two suggestions:

  • Read current() again after Config.setProperty and log loudly if it didn't actually flip, so operators know they need to clear the system-table entry.
  • Add a test plan row for "halt path with flag set via system-table" — the current matrix only covers the happy in-memory case.

5. Minor

  • validateIndexingConfig() already logs Logger.error on failure; the caller in checkAndInitializeIndex() logs another Logger.error before haltMigration() (line 534). One of the two is redundant.
  • PR description says "530-line test plan"; the file is 788 lines. Cosmetic.
  • normalizeEndpoint uses default port 9200 when no port is given — fine for the common case, but two endpoints foo and foo:9200 will compare equal even if one was actually intended to be foo:443 behind a proxy. Probably acceptable given the "best-effort" framing already in the javadoc.

Not blocking

The MigrationPhase.reset() design (runtime-only flag flip + warn log) is a reasonable tester-friendly choice and matches the author's stated rationale in the review thread.


• Branch: fix/os-migration-startup-hardening-35631

@github-actions github-actions Bot added Area : Backend PR changes Java/Maven backend code Area : Documentation PR changes documentation files labels May 8, 2026
- ESIndexResource: fix LocalTime.now() → LocalDateTime.now() so the
  yyyyMMddHHmmss formatter no longer throws UnsupportedTemporalTypeException
  at runtime; drop unused java.util.Date import
- IndexStartupValidator: fix stale class Javadoc (remove 'demoted to warning'
  claim), fix validateOSVersion Javadoc (remove 'JVM exit'), fix log message
  to say 'halting OS migration / fall back to ES-only' instead of 'cannot start'
- IndexConfigHelper.MigrationPhase.reset(): add Javadoc warning that
  system-table config source shadows Config.setProperty(), making reset() a
  no-op when the phase flag was set via system-table
- ContentletIndexAPIImpl.checkAndInitializeIndex(): add phase-3 guard — refuse
  to auto-rollback to ES when OS validation fails in PHASE_3_OPENSEARCH_ONLY
  (ES may be decommissioned/stale); throw DotRuntimeException so operator sees
  a clear fatal log instead of a silent demotion to a stale ES index
- ContentletIndexAPIImpl.hasEmptyIndices(): collapse redundant phase-0 branch
  (phases 0–2 all fall through to isEsWorkingIndexEmpty())
- ContentletIndexAPIImpl: remove leading spaces from two error/log strings
  (lines 791, 796)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rdening

Adds Scenarios G–K (TC-001–TC-025) covering startup validation, haltMigration()
emergency halt, phase-aware index init, thread-safe formatter, multi-node cluster
Phase 1, content/CT lifecycle sync, index delete behavior, and REST security.

Updates Scenario B-7 to reflect breaking change from PR #35632 (haltMigration()
now triggered on OS connectivity failure instead of silent boot). Updates
Scenario C-3 similarly. Adds QA Epic and Fix PR links to header. Renumbers
sections to accommodate new scenarios (F moved to 15, DB Queries→16, Bugs→17,
Execution Order→18, Out of Scope→19). Bugs table gains 3 new rows for PR #35632.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread dotCMS/src/main/java/com/dotcms/content/index/IndexStartupValidator.java Outdated
fabrizzio-dotCMS and others added 2 commits May 11, 2026 09:57
…rtupValidator

Logger.fatal was halting startup unnecessarily for a non-fatal OS config
error — the fallback to ES-only is intentional and should not surface as
a fatal-level log entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etIndexAPIImpl

Consistent with the same fix in IndexStartupValidator — halting the
migration is a controlled fallback, not a fatal condition, so Logger.error
is the correct severity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eference

The file had been overwritten with a QA test plan by the design-fix-test-plan
skill. Restores it to its canonical architecture documentation content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fabrizzio-dotCMS fabrizzio-dotCMS added this pull request to the merge queue May 11, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks May 11, 2026
@fabrizzio-dotCMS fabrizzio-dotCMS added this pull request to the merge queue May 11, 2026
Merged via the queue into main with commit f07b1df May 11, 2026
51 checks passed
@fabrizzio-dotCMS fabrizzio-dotCMS deleted the fix/os-migration-startup-hardening-35631 branch May 11, 2026 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code Area : Documentation PR changes documentation files

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

fix(os-migration): harden startup validation, add emergency halt path, fix thread-safe formatter

3 participants