Skip to content

Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815)#819

Draft
chowbao wants to merge 13 commits into
feature/full-historyfrom
streaming-phase1-daemon
Draft

Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815)#819
chowbao wants to merge 13 commits into
feature/full-historyfrom
streaming-phase1-daemon

Conversation

@chowbao

@chowbao chowbao commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Closes #815 — Phase 1 (Backfill), layer 3 of 3. Stacked on PR1B (streaming-phase1-primitives).

Orchestration + the daemon entrypoint, completing the cold-only backfill daemon:

  • derived progress/watermark (recomputed from durable keys), resolve (catalog-diff → Plan), executePlan (one bounded worker pool; index builds wait on their in-coverage chunk builds; withRetries with exponential backoff)
  • the cold-only startStreaming: a networkTip-bounded catch-up loop → serveReads handoff (no hot tier / live loop / lifecycle in Phase 1)
  • the daemon entrypoint (LoadConfig → validateConfig → locks → supervised loop) + the full-history CLI subcommand
  • folded-in fix: the daemon builds a Prometheus registry and drives the cold tier through ingest.ColdService + NewPrometheusSink (ProcessConfig.Sink)

This branch is a runnable backfill daemon: boots from one TOML, catches up from a backend, and serves.

Verification: go build + go vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/... (cgo RocksDB toolchain). Note: the full cmd/stellar-rpc binary link requires the Rust libpreflight/libxdr2json (CI make build-libs); go vet ./cmd/stellar-rpc/ type-checks the entrypoint locally. golangci-lint runs in CI.

Stack: streaming-phase1-daemonstreaming-phase1-primitives

@chowbao

chowbao commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Note (non-blocking): a zero-txhash-key window → infinite restart loop on synthetic/standalone networks

Surfaced while writing the daemon catch-up E2E (TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex, fa0083f):

A complete chunk whose 10,000 ledgers are all transaction-free produces an empty txhash .bin. buildTxhashIndex then fails with txhash: cannot build a cold index with zero keys; that error propagates runBackfill → startStreaming, and superviseStreaming classifies it as restartable (not a fatal sentinel). So the daemon restarts (~5s backoff), re-derives the watermark, re-resolves the same still-unbuilt window, fails again → an infinite restart loop (~7s/cycle). It is not slow ingest — a successful catch-up of the same chunk runs in ~0.3s.

Not a production concern: mainnet/testnet never produce an all-zero-tx 10k-ledger chunk. But a standalone/quickstart network, or a naive synthetic test backend (e.g. an all-zero-tx fixture), hits it easily. The new E2E sidesteps it by giving the chunk a sparse few one-tx ledgers so the .bin has keys to index.

Decision: no fix for now — it can't happen in prod. If ever revisited: either build a degenerate/empty index for a zero-key window, or make the zero-key error a fatal sentinel so superviseStreaming surfaces it instead of retrying forever.

@chowbao chowbao force-pushed the streaming-phase1-primitives branch from 72368b2 to 531dca9 Compare June 25, 2026 14:36
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from fa0083f to cbc80ab Compare June 25, 2026 14:42
chowbao added a commit that referenced this pull request Jun 25, 2026
…hanges

Rebased onto the updated #819 and propagated #817's API changes into the Phase 2
hot-store/lifecycle layer:
- window -> tx-hash index rename + key prefix index: -> txhash_index:
  (TxHashIndexCoverage.Window -> .Index, Catalog.windows -> .txhashIndex); merged
  the hot-catalog methods (HotChunkKeys/ReadyHotChunkKeys/hotChunkKeysWith) onto
  #817's renamed Catalog.
- Catalog.Get/Has -> get/has; config sections regrouped (cfg.Retention.*,
  cfg.Layout.*, cfg.Storage.*, cfg.Ingestion.*); pins via PinLayout.
- RetentionGate -> RetentionFloor: eligibility.go uses floor.Excludes(c) and
  floor.Excludes(layout.LastChunk(idx)) for the discard/prune scans; the reader-
  retention test dropped its Admits/Floor assertions (deferred to #772) and keeps
  the straddling-window prune-survival check via Excludes.
- renamed the lifecycle test's oneTxLCMBytes -> oneTxLCMRand (the #819 daemon-E2E
  added a same-named helper).

Also fixes a latent daemon bug surfaced by the catch-up E2E: startConfig built
ProcessConfig without a HotProbe, so any real catch-up aborted with "HotProbe is
nil" and the supervisor looped. Wire the production NewRocksHotProbe.

build/vet/test -short green (streaming + hotchunk + ingest + stores).
chowbao added a commit that referenced this pull request Jun 25, 2026
…hanges

Rebased onto the updated #820 and propagated #817's API changes into the Phase 2
live-ingestion/daemon layer:
- window -> tx-hash index rename + key prefix index: -> txhash_index:
  (TxHashIndexCoverage.Index, Catalog.txhashIndex), Catalog.Get/Has -> get/has,
  config sections regrouped (cfg.Retention/Layout/Storage/Ingestion), pins via
  PinLayout.
- daemon.go merge: kept #821's live-ingestion wiring (LifecycleConfig + Core) and
  deduped the HotProbe line (#821's Phase-2 wiring already set it, so #820's
  HotProbe fix is redundant here).
- removed the #819 cold-only catch-up E2E (TestRunDaemon_CatchUpMaterializes...)
  + its someTxBackend/oneTxLCMBytes helpers: #821's daemon now requires
  Boundaries.Core and runs a continuous live loop, so a cold-only "catch up then
  return" test can't fit — and TestE2E_DaemonLifecycle covers it end to end.

Mechanical propagation only; build/vet/test -short green (the heavy lifecycle E2E
stays -short-gated).
@chowbao chowbao force-pushed the streaming-phase1-primitives branch 2 times, most recently from 0433d50 to 2e5403a Compare June 25, 2026 18:48
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from cbc80ab to ae91d20 Compare June 25, 2026 19:14
chowbao added a commit that referenced this pull request Jun 25, 2026
Rebased the orchestration/daemon layer onto the reorganized #818 (geometry +
catalog subpackages) and propagated #817's API changes:
- qualify moved symbols: catalog.{Catalog,ArtifactSet,NewArtifactSet,AllArtifacts,
  NewCatalog}; geometry.{Kind*,State*,Layout,NewLayout,TxHashIndexID,
  TxHashIndexCoverage,TxHashIndexLayout,NewTxHashIndexLayout,MaxChunksPerTxhashIndex,
  LastCompleteChunkAt}
- window -> tx-hash-index rename (Windows->TxHashIndexLayout, FrozenCoverage->
  FrozenTxHashIndex, AllIndexKeys->AllTxHashIndexKeys, IndexBuild.Window->.Index)
- config regroup: cfg.Backfill.ChunksPerTxhashIndex->cfg.Layout.*,
  cfg.Streaming.*->cfg.Retention.*/cfg.Storage.*; daemon_test config literals updated
- tests reach the index layout via the public cat.TxHashIndexLayout() (txhashIndex
  is now an unexported field of the catalog package)

build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
@chowbao chowbao force-pushed the streaming-phase1-primitives branch from 2e5403a to 03147f0 Compare June 25, 2026 19:51
chowbao added a commit that referenced this pull request Jun 25, 2026
Restacked on the split/no-hooks #819 and ported the hot tier across the new
package boundary:
- hot key schema -> geometry (HotState/HotReady/HotTransient, exported
  HotChunkKey/ParseHotChunkKey/HotChunkPrefix); hot catalog methods -> catalog
  (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys)
- processChunk hot-source branch + progress hot refinement
  (lastCommittedLedger(cat, probe), highestReadyChunkSigned, refineWithHotDB)
- new files: pkg/stores/hotchunk, streaming/{eligibility,hotsource,ingest,lifecycle}
- daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe)
- crash-hooks REMOVED to match #817/#818 (the split makes cat.hooks unreachable
  from streaming); the one beforeHotTransient hook test is dropped, the rest are
  the structural crash tests #817/#818 established
- propagated renames: window->tx-hash-index, RetentionGate->RetentionFloor,
  cat.Has->public HotState, cat.layout->Layout()

build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
@chowbao chowbao force-pushed the streaming-phase1-primitives branch 5 times, most recently from ccebeb0 to d169a0d Compare June 25, 2026 21:05
@chowbao chowbao force-pushed the streaming-phase1-primitives branch 4 times, most recently from e39a052 to 808eb62 Compare June 25, 2026 22:34
@chowbao chowbao force-pushed the streaming-phase1-daemon branch 3 times, most recently from 169d4f6 to 65bd508 Compare June 26, 2026 00:26
@@ -0,0 +1,333 @@
package fullhistory

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

daemon files could stay at the root of fullhistory/

We can always move it later and have the file reorg/restructure deferred to #824

@chowbao chowbao force-pushed the streaming-phase1-daemon branch 2 times, most recently from f3bbce1 to 14aa4c8 Compare June 26, 2026 02:55
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 14aa4c8 to aafbe0d Compare June 26, 2026 14:02
@socket-security

socket-security Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedgolang/​go.uber.org/​goleak@​v1.3.0100100100100100

View full report

@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 2967c7a to 240aa5e Compare June 26, 2026 20:46
chowbao added a commit that referenced this pull request Jun 26, 2026
… loop

Per tamirms's #818 review ("the executor's own retry logic (#819) could
share the same primitive too"), replace the hand-rolled withRetries +
retryBackoffFor + ctxSleep with cenkalti/backoff — the same primitive
backfillSource's waitForCoverage already uses. Behavior-preserving:
exponential x2 from RetryBackoff, capped at maxRetryBackoff, no jitter,
MaxRetries+1 total attempts, ctx-cancellation aborts. Drops the retrySleep
test seam; the retry tests now assert the policy's NextBackOff sequence
directly and use a tiny base for the call-count tests.
Base automatically changed from streaming-phase1-primitives to feature/full-history June 27, 2026 17:47
chowbao added 12 commits June 27, 2026 23:14
Rebase the daemon layer onto the split #818 and follow the issue #824
package layout instead of the old single streaming package:

- backfill/: resolve.go + execute.go join process.go/txindex.go (the
  planner + bounded-worker executor). RunBackfill and the RunChunk/
  RunIndex test seams are exported so the daemon package's catch-up
  tests can inject fakes across the package boundary.
- observability/: the daemon's control-plane Metrics sink (NopMetrics
  and MetricsOrNop exported for backfill, which consumes the interface).
- fullhistory/ (top level, package fullhistory): daemon, startup
  (catch-up + serve), config_validate, progress, doc.

Adopt #817's chunks_per_txhash_index-is-a-constant change in the daemon
layer: drop the cpi config field + its form/restart validation, replace
PinLayout(cpi, earliest) with PinEarliestLedger(earliest), and build the
TxHashIndexLayout from geometry.ChunksPerTxhashIndex. The daemon E2E test
now exercises a non-terminal (rolling) index coverage, since one complete
chunk no longer finalizes a 1000-chunk window; the terminal demote+sweep
stays unit-tested in backfill/txindex_test.

Propagate #818's window->index naming into the daemon: windowsOverlapping
-> indexesOverlapping, the TxHashIndexLayout vars -> txLayout.

closes #815
…l tests

Both packages spawn goroutines (the errgroup worker pool in executePlan,
the supervised daemon loop), so a leaked goroutine should fail the suite
rather than slip through. Add goleak.VerifyTestMain to each package's
TestMain and promote go.uber.org/goleak to a direct dependency.
#815's "Daemon / CLI entrypoint" deliverable: register a
full-history-streaming cobra subcommand with a --config loader that runs
fullhistory.RunDaemon under a SIGINT/SIGTERM-cancelled context, so the
supervised loop shuts down cleanly on signal.
The package is fullhistory, not streaming; remove the redundant prefix
from this layer's error and log messages. Capitalized identifier leads
(ExecConfig/StartConfig/Boundaries ...) are reworded lowercase to satisfy
stylecheck ST1005. The parent layers (#817/#818) still carry the prefix.
Shorter placeholder name (likely to change at the #772 cutover); drops
the redundant -streaming suffix from the command token.
…seams

The runChunk/runIndex seams took a third ExecConfig arg that no
implementation read — the production closures and every test seam ignore
it (`_ ExecConfig`) and capture cfg/procCfg/buildCfg instead. Drop it from
the seam types, both default closures, the two call sites, and the test
signatures.
#818's review unified its bulk-source seams: the separate ChunkSource
(OpenStream) and BackendWaiter (WaitForCoverage) collapsed into one
backfill.Backend interface (ledgerbackend.LedgerStream + Tip), with the
coverage wait now internal to backfill (waitForCoverage, driven by
Backend.Tip). Adapt the daemon layer to the frozen #818 surface:

- Boundaries.Backend is now a backfill.Backend; drop the BackendWaiter
  field, its validate() check, and the ProcessConfig wiring.
- backendTip adapts a backfill.Backend to NetworkTipBackend via Tip, so
  the catch-up tip and the freeze's coverage frontier are one source and
  cannot disagree. Its old WaitForCoverage half (and the four
  TestBackendTip_WaitForCoverage* tests) are dropped — that logic now
  lives in and is tested by backfill.waitForCoverage.
- E2E + adapter tests use a fakeBackend (LedgerStream + Tip) in place of
  the old countingChunkSource/fakeWaiter/fakeLedgerBackend fakes.
The resolver's kind-rules comment used windowFirstChunk/windowLastChunk —
names that read like identifiers but don't exist; the code calls
txLayout.FirstChunk(w)/LastChunk(w). Name the real API so the pseudo-code
is greppable and accurate. Prose still says "window" for the index unit,
matching #818's txindex.go convention and the design doc.
… loop

Per tamirms's #818 review ("the executor's own retry logic (#819) could
share the same primitive too"), replace the hand-rolled withRetries +
retryBackoffFor + ctxSleep with cenkalti/backoff — the same primitive
backfillSource's waitForCoverage already uses. Behavior-preserving:
exponential x2 from RetryBackoff, capped at maxRetryBackoff, no jitter,
MaxRetries+1 total attempts, ctx-cancellation aborts. Drops the retrySleep
test seam; the retry tests now assert the policy's NextBackOff sequence
directly and use a tiny base for the call-count tests.
backendTip/newBackendTip had no production caller (buildProductionBoundaries
wires notConfiguredTip + a nil Backend); only a test exercised it. Per the
#817 review pattern (remove speculative API ahead of its caller — Admits,
Floor, PutEarliestLedger were all cut), drop it until #772 wires the real
backend/read path. The fakeBackend test helper loses its now-unused tipErr.
The len(needs)==0 early return is unreachable: an empty needs produces an
empty ids slice, which the existing len(ids)==0 guard already returns nil for.
Wire the control-plane metrics that meter work Phase-1 backfill actually does,
rather than leaving them all for Phase 2:

- Watermark: re-emit per catch-up pass so watermark_ledger and
  retention_floor_ledger track the rising floor instead of freezing at the
  pre-catch-up value.
- ColdTierBytes: sample the cold-tier footprint once per pass as backfill
  writes it (exported observability.MeasureColdTierBytes).
- Freeze: record plan sizes (chunks/indexes built) per RunBackfill pass.
- Prune: count the eager sweep's reclaimed artifacts in buildThenSweep
  (threaded observability.Metrics through BuildConfig).

LastCommitted and ChunkBoundary stay unwired: live-ingestion (Phase 2) signals
with no Phase-1 referent.

recordingMetrics now captures Freeze/Prune; the RunBackfill and terminal-sweep
tests assert they fire.
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from 8165fc9 to f32d3b2 Compare June 28, 2026 03:15
Collapse multi-line doc blocks to one-liners and drop redundant comments across
the daemon, executor, resolver, progress, and observability files. Pure comment
reduction — no logic change.

Also fix docs the Phase-1 metric wiring left stale: the Metrics interface
'callers today' note, and the Freeze/Prune descriptions that named Phase-2-only
callers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant