Streaming daemon: Phase 2 layer 1 — hot store + lifecycle (#816) #820

Draft
chowbao wants to merge 8 commits into streaming-phase1-daemon from streaming-phase2-lifecycle

Conversation

@chowbao chowbao commented Jun 24, 2026

Contributor

Part of #816 — Phase 2 (Live ingestion + lifecycle), layer 1 of 2. Stacked on #PR1C (streaming-phase1-daemon, Phase 1).

The hot tier + lifecycle machinery:

  • the per-chunk hotchunk DB (multi-CF: ledgers + events + tx-hash; transient/ready state machine; read-only freeze view)
  • the hot catalog key family + one-write protocol, and hotsource (the ready-hot-DB read source for backfillSource + freeze)
  • progress hot refinement (positional hot-key term + the probe-based MaxCommittedSeq read)
  • the lifecycle goroutine (freeze via the primitives' hot source → index-aware discard → prune) and the live ingestion loop, driving the hot tier through ingest.HotService
  • folded-in fix: deletes the now-dead RunHot/RunCold stream-drain orchestration + its ingest_test cases (verified zero production callers)
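
As a rough illustration of the transient/ready state machine mentioned in the list above, here is a minimal in-memory sketch. The type and method names (`hotCatalog`, `putTransient`, `flipReady`, `discard`, `readable`) and the exact transition rules are assumptions made for this example; the real catalog is durable and its API may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// hotState is a hypothetical stand-in for the value of a hot catalog key.
type hotState int

const (
	hotAbsent    hotState = iota // no key: the chunk has no hot DB
	hotTransient                 // DB is being created; not safe to read
	hotReady                     // DB is complete and may serve reads
)

// hotCatalog is an in-memory stand-in for the durable catalog (assumption).
type hotCatalog struct {
	states map[uint32]hotState
}

func newHotCatalog() *hotCatalog {
	return &hotCatalog{states: make(map[uint32]hotState)}
}

var errBadTransition = errors.New("invalid hot-state transition")

// putTransient marks a chunk as in progress; only allowed from absent.
func (c *hotCatalog) putTransient(chunk uint32) error {
	if c.states[chunk] != hotAbsent {
		return errBadTransition
	}
	c.states[chunk] = hotTransient
	return nil
}

// flipReady publishes a chunk; only allowed from transient.
func (c *hotCatalog) flipReady(chunk uint32) error {
	if c.states[chunk] != hotTransient {
		return errBadTransition
	}
	c.states[chunk] = hotReady
	return nil
}

// discard removes the key regardless of its current state.
func (c *hotCatalog) discard(chunk uint32) {
	delete(c.states, chunk)
}

// readable reports whether a read source may open the chunk's hot DB.
func (c *hotCatalog) readable(chunk uint32) bool {
	return c.states[chunk] == hotReady
}

func main() {
	cat := newHotCatalog()
	fmt.Println(cat.flipReady(7) != nil) // flipping an absent chunk is rejected
	_ = cat.putTransient(7)
	fmt.Println(cat.readable(7)) // a transient chunk is not readable
	_ = cat.flipReady(7)
	fmt.Println(cat.readable(7)) // a ready chunk is readable
}
```

The point of the two-step shape is that a crash between `putTransient` and `flipReady` leaves a key that readers ignore and recovery can discard.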

The machinery is tested standalone here (seeded hot DBs + direct ticks); the daemon wiring lands in the capstone layer.

Verification: go build + go vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/... (cgo RocksDB toolchain). Note: the full cmd/stellar-rpc binary link requires the Rust libpreflight/libxdr2json (CI make build-libs); go vet ./cmd/stellar-rpc/ type-checks the entrypoint locally. golangci-lint runs in CI.

Stack: streaming-phase2-lifecycle → streaming-phase1-daemon

@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 3d12c9e to 84ff8c2 Compare June 24, 2026 20:39
@chowbao chowbao force-pushed the streaming-phase2-lifecycle branch 2 times, most recently from df8ec80 to 17b5c39 Compare June 24, 2026 20:51
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 84ff8c2 to 419f7ec Compare June 24, 2026 22:05
@chowbao chowbao force-pushed the streaming-phase2-lifecycle branch from 17b5c39 to 145c1cc Compare June 24, 2026 22:05
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 04f9931 to 7f8e58f Compare June 25, 2026 04:09
@chowbao chowbao force-pushed the streaming-phase2-lifecycle branch from e15575a to bc56b0a Compare June 25, 2026 04:25
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 7f8e58f to f3431cd Compare June 25, 2026 05:19
@chowbao chowbao force-pushed the streaming-phase2-lifecycle branch from bc56b0a to 440443b Compare June 25, 2026 12:24
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from fa0083f to cbc80ab Compare June 25, 2026 14:42
@chowbao chowbao force-pushed the streaming-phase2-lifecycle branch from 440443b to c4944fb Compare June 25, 2026 15:07
chowbao added a commit that referenced this pull request Jun 25, 2026
…hanges

Rebased onto the updated #820 and propagated #817's API changes into the Phase 2
live-ingestion/daemon layer:
- window -> tx-hash index rename + key prefix index: -> txhash_index:
  (TxHashIndexCoverage.Index, Catalog.txhashIndex), Catalog.Get/Has -> get/has,
  config sections regrouped (cfg.Retention/Layout/Storage/Ingestion), pins via
  PinLayout.
- daemon.go merge: kept #821's live-ingestion wiring (LifecycleConfig + Core) and
  deduped the HotProbe line (#821's Phase-2 wiring already set it, so #820's
  HotProbe fix is redundant here).
- removed the #819 cold-only catch-up E2E (TestRunDaemon_CatchUpMaterializes...)
  + its someTxBackend/oneTxLCMBytes helpers: #821's daemon now requires
  Boundaries.Core and runs a continuous live loop, so a cold-only "catch up then
  return" test can't fit — and TestE2E_DaemonLifecycle covers it end to end.

Mechanical propagation only; build/vet/test -short green (the heavy lifecycle E2E
stays -short-gated).
chowbao added 7 commits June 25, 2026 14:23
Rebased onto #817's foundations after the round-2 review reworked them:
the #824 geometry+catalog subpackage split, Windows support dropped, and
the crash-test hooks removed. The primitives spine (processChunk,
buildTxhashIndex, buildThenSweep + the cold backfill source order) stays in
package streaming and now imports the new subpackages:

  - *Catalog -> *catalog.Catalog. The two former *Catalog helpers
    (txhashBinInputs, windowDemotedTxhashRefs) become free functions in
    streaming over the catalog's exported API, since the type now lives in
    another package and methods can't be added across the boundary.
  - Key/state/layout/index types and the fsync barrier (BarrierNewFile)
    resolve through geometry.*; the ArtifactSet -> ingest.Config translation
    stays in streaming (ingestConfigFor) so catalog keeps its one-way
    dependency on geometry alone (the #824 split invariant).

Crash-test hooks dropped to match #817. The in-method ordering observations
(afterMarkFreezing / afterBarrier / afterIndexMark / afterCommitBeforeSweep)
are gone; the §7.6 crash matrix is reconstructed hook-free through the public
protocol and the buildTxhashIndex(commit) / buildThenSweep(commit+sweep) seam,
asserting recovery convergence on the durable states a crash leaves behind.
The pure mid-method ordering checks are deferred to the fault-injection harness
(#823), mirroring catalog's TestCrashSafety_FileWrittenKeyNotFlipped.

go build, go vet, gofmt and go test (streaming + catalog + geometry + ingest,
RocksDB 10.9.1 cgo toolchain) are all green.

Part of #815.

… service wiring (closes #815)

Derived progress/watermark (recomputed from durable keys), the resolve catalog
diff -> Plan, executePlan (one bounded worker pool; index builds wait on their
in-coverage chunk builds; withRetries with exponential backoff), and the
cold-only startStreaming (networkTip-bounded catch-up loop -> serveReads
handoff; no hot tier / live loop / lifecycle goroutine in Phase 1). Wires the
daemon entrypoint (LoadConfig -> validateConfig -> locks -> supervised loop),
the CLI full-history-streaming subcommand, and the folded-in cold metrics:
the daemon builds a Prometheus registry and drives the cold tier through
ingest.ColdService + NewPrometheusSink (ProcessConfig.Sink).
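
For readers unfamiliar with the retry shape mentioned above, here is a minimal sketch of retrying with exponential backoff. `retryWithBackoff` and `simulate` are hypothetical names written for this example; they are not the PR's `withRetries`, whose signature and policy are not shown here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls fn up to attempts times (attempts must be >= 1),
// doubling the delay after each failure. It returns early with ctx.Err()
// if ctx is cancelled while waiting between attempts.
func retryWithBackoff(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// simulate runs an operation that fails `failures` times before succeeding
// and reports how many calls were made plus the final error.
func simulate(failures, attempts int) (int, error) {
	calls := 0
	err := retryWithBackoff(context.Background(), attempts, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errors.New("transient failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(simulate(2, 5))  // succeeds on the third call
	fmt.Println(simulate(10, 3)) // exhausts all three attempts
}
```

Selecting on `ctx.Done()` during the wait keeps a cancelled daemon from sleeping through its own shutdown.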

Closes #815.

Comment-only reviewability pass over the orchestration + daemon layer
(no code changes — verified via comment-stripped diff). Keeps each
canonical explanation once, shrinks per-function docs to what is unique,
and collapses multi-line inline re-explanations; invariants, the "why"
behind non-obvious choices, design-doc citations, and all directive
comments are preserved. ~25% fewer comment words across the eight files.

Also corrects stale references that drifted from earlier layers:
  - resolve.go / execute.go: IndexBuild and BuildConfig live in txindex.go,
    not the never-named "build.go".
  - startup.go: validateConfig is wired now (RunDaemon calls it before
    startStreaming); drop the "Phase D / not done here" note.
  - doc.go: the file map now lists the files this cold-only package
    actually has (the hot tier, lifecycle, recovery, and audit files are
    Phase 2), instead of files that do not exist yet.
Foundations (#817) renamed Paths.LockRoots() -> RootsToLock() to disambiguate
from the package-level LockRoots() that acquires the flocks. This adapts the
daemon's lock-acquisition call after rebasing onto those foundations; the
package func call LockRoots(...) is unchanged.

Surfaced by the rebase as a build break (paths.LockRoots undefined); no
behavior change. build/vet/test -short green.

… exit

Review follow-ups to the Phase 1 orchestration + daemon layer:

- Executor test coverage:
  - Cancel while an INDEX build is parked in its dependency wait (holding no
    slot) now has a direct test: it must unblock via <-gctx.Done() and never
    run on a chunk that never froze. The existing ContextCancelAborts only
    covered a parked CHUNK build.
  - The Rebuild metric emitted from the real executePlan index path is now
    asserted end-to-end (one Rebuild per IndexBuild, chunks == Hi-Lo+1); it was
    previously only exercised by a direct unit call on the sink.
  Both run green under -race.

- startup.go: log a clean-shutdown line after ServeReads returns. In cold-only
  Phase 1 the production ServeReads is a no-op (reads stay on the v1 SQLite
  daemon until the #772 cutover), so the daemon exits immediately after
  catch-up — the new line makes that an explicit, expected event rather than
  looking like a misconfiguration. The "serving reads" log is reworded to
  "handing off to the read server" to match.

- observability.go: note on the Metrics interface that LastCommitted /
  ChunkBoundary / Freeze / Prune / ColdTierBytes are Phase-2-wired and have no
  caller in this cold-only layer, so a reader doesn't hunt for one.
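
The cancel-while-parked case from the first bullet above can be sketched roughly as follows. `waitForChunks`, `buildIndex`, and the two scenario helpers are hypothetical names for this example, and modelling each chunk dependency as a channel that is closed on freeze is an assumption, not the executor's actual design.

```go
package main

import (
	"context"
	"fmt"
)

// waitForChunks blocks until every dependency channel is closed, or until ctx
// is cancelled. It returns ctx.Err() on cancellation so the caller can skip
// the index build entirely.
func waitForChunks(ctx context.Context, deps []<-chan struct{}) error {
	for _, frozen := range deps {
		select {
		case <-frozen:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// buildIndex runs build only after all chunk dependencies have frozen.
// The wait happens before any work starts, so a parked build holds nothing.
func buildIndex(ctx context.Context, deps []<-chan struct{}, build func()) error {
	if err := waitForChunks(ctx, deps); err != nil {
		return err
	}
	build()
	return nil
}

// cancelWhileParked models the tested scenario: the chunk never freezes and
// the context is cancelled, so the build must not run.
func cancelWhileParked() (ran bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	neverFrozen := make(chan struct{}) // never closed
	cancel()
	err = buildIndex(ctx, []<-chan struct{}{neverFrozen}, func() { ran = true })
	return ran, err
}

// buildAfterFreeze models the happy path: the chunk is already frozen.
func buildAfterFreeze() (ran bool, err error) {
	frozen := make(chan struct{})
	close(frozen)
	err = buildIndex(context.Background(), []<-chan struct{}{frozen}, func() { ran = true })
	return ran, err
}

func main() {
	fmt.Println(cancelWhileParked())
	fmt.Println(buildAfterFreeze())
}
```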

gofmt clean; go test -race green on the streaming package (RocksDB 10.9.1 cgo).

Closes #815.

Closes the one #815 acceptance criterion previously proven only at the
primitive level: that the daemon, booted from one TOML, catches up to the
tip and materializes all three cold data types PLUS the window index
through the REAL entrypoint (RunDaemon -> validateConfig -> catchUp ->
executePlan -> processChunk -> buildTxhashIndex -> buildThenSweep), then
serves. The existing daemon happy-path test sits the tip inside chunk 0,
so its catch-up is a deliberate no-op.

The injected backend serves a complete chunk 0 of mostly zero-tx ledgers
with a sparse few carrying one transaction, so the chunk's txhash .bin has
keys to index (a wholly zero-tx chunk cannot build an index). cpi=1 makes
the single-chunk window index terminal, so the test asserts the full
txhash lifecycle: ledgers + events frozen on disk, one frozen index
coverage [0,0] with its .idx present, and the per-chunk .bin key demoted +
swept. workers=1 also exercises the index-waits-on-its-chunk path through
the daemon.

Cheap ledger bodies keep the full ~10k-ledger catch-up under ~0.3s, so it
stays in -short. Test-only change; no production code touched.

Part of #815.

Rebased the orchestration/daemon layer onto the reorganized #818 (geometry +
catalog subpackages) and propagated #817's API changes:
- qualify moved symbols: catalog.{Catalog,ArtifactSet,NewArtifactSet,AllArtifacts,
  NewCatalog}; geometry.{Kind*,State*,Layout,NewLayout,TxHashIndexID,
  TxHashIndexCoverage,TxHashIndexLayout,NewTxHashIndexLayout,MaxChunksPerTxhashIndex,
  LastCompleteChunkAt}
- window -> tx-hash-index rename (Windows->TxHashIndexLayout, FrozenCoverage->
  FrozenTxHashIndex, AllIndexKeys->AllTxHashIndexKeys, IndexBuild.Window->.Index)
- config regroup: cfg.Backfill.ChunksPerTxhashIndex->cfg.Layout.*,
  cfg.Streaming.*->cfg.Retention.*/cfg.Storage.*; daemon_test config literals updated
- tests reach the index layout via the public cat.TxHashIndexLayout() (txhashIndex
  is now an unexported field of the catalog package)

build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from cbc80ab to ae91d20 Compare June 25, 2026 19:14
Restacked on the split/no-hooks #819 and ported the hot tier across the new
package boundary:
- hot key schema -> geometry (HotState/HotReady/HotTransient, exported
  HotChunkKey/ParseHotChunkKey/HotChunkPrefix); hot catalog methods -> catalog
  (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys)
- processChunk hot-source branch + progress hot refinement
  (lastCommittedLedger(cat, probe), highestReadyChunkSigned, refineWithHotDB)
- new files: pkg/stores/hotchunk, streaming/{eligibility,hotsource,ingest,lifecycle}
- daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe)
- crash-hooks REMOVED to match #817/#818 (the split makes cat.hooks unreachable
  from streaming); the one beforeHotTransient hook test is dropped, the rest are
  the structural crash tests #817/#818 established
- propagated renames: window->tx-hash-index, RetentionGate->RetentionFloor,
  cat.Has->public HotState, cat.layout->Layout()

build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
@chowbao chowbao force-pushed the streaming-phase2-lifecycle branch from c4944fb to aeca6a0 Compare June 25, 2026 20:02
chowbao added a commit that referenced this pull request Jun 25, 2026
Rebased the live-ingestion capstone onto the reorganized #820 and propagated:
- qualify moved symbols (geometry./catalog.) in daemon.go, startup.go, e2e_test.go
- window->tx-hash-index + RetentionGate->RetentionFloor renames; cat.layout->Layout(),
  cat.Has->public HotState shim, .IndexFilePath->.TxHashIndexFilePath
- config regroup: cfg.Streaming.CaptiveCoreConfig -> cfg.Ingestion.CaptiveCoreConfig
- restored #821's daemon_test.go (drops the cold-only catch-up test the full daemon
  supersedes; adds the supervise/backend-tip/boundaries tests) + the HotProbe/Core wiring
- avoided the txhash_txhash_index find-replace corruption (was only in the dropped restack)

build + vet + go test -short green EXCEPT the lifecycle E2E, whose generated TOML
still uses the pre-regroup [streaming]/[backfill] schema (follow-up; per maintainer
the stack will be re-rebased).
@chowbao chowbao force-pushed the streaming-phase1-daemon branch 6 times, most recently from 14aa4c8 to aafbe0d Compare June 26, 2026 14:02
chowbao added a commit that referenced this pull request Jun 26, 2026
Relocate the one-write protocol ordering helper (mark -> create -> barrier ->
flip) from backfill/process.go to catalog_protocol.go, where the protocol's
states and mark/flip steps already live, and export it as catalog.OneWrite.
processChunk and buildTxhashIndex now call it across the package boundary.

It is a zero-dependency pure function and catalog never imports backfill, so
there is no import cycle; #820's hot-tier openHotTierForChunk adopts it as the
third caller by import alone, with no later relocation. Addresses the #818
review thread that asked to establish the shared helper here rather than
deferring the move to #820.
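
To make the ordering concrete, here is a minimal sketch of a helper that enforces mark -> create -> barrier -> flip. The lowercase `oneWrite` and its four-callback signature are assumptions for illustration; the actual signature of catalog.OneWrite is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// oneWrite runs the four protocol steps in order and stops at the first
// error, so flip is never reached unless mark, create, and barrier all
// succeeded. (Hypothetical signature, not the real catalog.OneWrite.)
func oneWrite(mark, create, barrier, flip func() error) error {
	for _, step := range []func() error{mark, create, barrier, flip} {
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}

// trace runs oneWrite with recording steps. failAt names a step that should
// fail ("" for none). It returns the steps that executed, in order.
func trace(failAt string) (string, error) {
	var order string
	step := func(name string) func() error {
		return func() error {
			order += name + " "
			if name == failAt {
				return errors.New("injected failure")
			}
			return nil
		}
	}
	err := oneWrite(step("mark"), step("create"), step("barrier"), step("flip"))
	return order, err
}

func main() {
	fmt.Println(trace(""))
	fmt.Println(trace("barrier"))
}
```

Because the helper takes only callbacks, it needs nothing from its callers' packages, which is consistent with the no-import-cycle point above.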
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 2967c7a to 240aa5e Compare June 26, 2026 20:46