Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815)#819
Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815)#819chowbao wants to merge 13 commits into
Conversation
3d12c9e to
84ff8c2
Compare
672d01f to
55cabde
Compare
84ff8c2 to
419f7ec
Compare
aac0f4b to
1b47f8e
Compare
04f9931 to
7f8e58f
Compare
1b47f8e to
72368b2
Compare
7f8e58f to
f3431cd
Compare
|
Note (non-blocking): a zero-txhash-key window → infinite restart loop on synthetic/standalone networks Surfaced while writing the daemon catch-up E2E ( A complete chunk whose 10,000 ledgers are all transaction-free produces an empty txhash Not a production concern: mainnet/testnet never produce an all-zero-tx 10k-ledger chunk. But a standalone/quickstart network, or a naive synthetic test backend (e.g. an all-zero-tx fixture), hits it easily. The new E2E sidesteps it by giving the chunk a sparse few one-tx ledgers so the Decision: no fix for now — it can't happen in prod. If ever revisited: either build a degenerate/empty index for a zero-key window, or make the zero-key error a fatal sentinel so |
72368b2 to
531dca9
Compare
fa0083f to
cbc80ab
Compare
…hanges Rebased onto the updated #819 and propagated #817's API changes into the Phase 2 hot-store/lifecycle layer: - window -> tx-hash index rename + key prefix index: -> txhash_index: (TxHashIndexCoverage.Window -> .Index, Catalog.windows -> .txhashIndex); merged the hot-catalog methods (HotChunkKeys/ReadyHotChunkKeys/hotChunkKeysWith) onto #817's renamed Catalog. - Catalog.Get/Has -> get/has; config sections regrouped (cfg.Retention.*, cfg.Layout.*, cfg.Storage.*, cfg.Ingestion.*); pins via PinLayout. - RetentionGate -> RetentionFloor: eligibility.go uses floor.Excludes(c) and floor.Excludes(layout.LastChunk(idx)) for the discard/prune scans; the reader- retention test dropped its Admits/Floor assertions (deferred to #772) and keeps the straddling-window prune-survival check via Excludes. - renamed the lifecycle test's oneTxLCMBytes -> oneTxLCMRand (the #819 daemon-E2E added a same-named helper). Also fixes a latent daemon bug surfaced by the catch-up E2E: startConfig built ProcessConfig without a HotProbe, so any real catch-up aborted with "HotProbe is nil" and the supervisor looped. Wire the production NewRocksHotProbe. build/vet/test -short green (streaming + hotchunk + ingest + stores).
…hanges Rebased onto the updated #820 and propagated #817's API changes into the Phase 2 live-ingestion/daemon layer: - window -> tx-hash index rename + key prefix index: -> txhash_index: (TxHashIndexCoverage.Index, Catalog.txhashIndex), Catalog.Get/Has -> get/has, config sections regrouped (cfg.Retention/Layout/Storage/Ingestion), pins via PinLayout. - daemon.go merge: kept #821's live-ingestion wiring (LifecycleConfig + Core) and deduped the HotProbe line (#821's Phase-2 wiring already set it, so #820's HotProbe fix is redundant here). - removed the #819 cold-only catch-up E2E (TestRunDaemon_CatchUpMaterializes...) + its someTxBackend/oneTxLCMBytes helpers: #821's daemon now requires Boundaries.Core and runs a continuous live loop, so a cold-only "catch up then return" test can't fit — and TestE2E_DaemonLifecycle covers it end to end. Mechanical propagation only; build/vet/test -short green (the heavy lifecycle E2E stays -short-gated).
0433d50 to
2e5403a
Compare
cbc80ab to
ae91d20
Compare
Rebased the orchestration/daemon layer onto the reorganized #818 (geometry + catalog subpackages) and propagated #817's API changes: - qualify moved symbols: catalog.{Catalog,ArtifactSet,NewArtifactSet,AllArtifacts, NewCatalog}; geometry.{Kind*,State*,Layout,NewLayout,TxHashIndexID, TxHashIndexCoverage,TxHashIndexLayout,NewTxHashIndexLayout,MaxChunksPerTxhashIndex, LastCompleteChunkAt} - window -> tx-hash-index rename (Windows->TxHashIndexLayout, FrozenCoverage-> FrozenTxHashIndex, AllIndexKeys->AllTxHashIndexKeys, IndexBuild.Window->.Index) - config regroup: cfg.Backfill.ChunksPerTxhashIndex->cfg.Layout.*, cfg.Streaming.*->cfg.Retention.*/cfg.Storage.*; daemon_test config literals updated - tests reach the index layout via the public cat.TxHashIndexLayout() (txhashIndex is now an unexported field of the catalog package) build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
2e5403a to
03147f0
Compare
Restacked on the split/no-hooks #819 and ported the hot tier across the new package boundary: - hot key schema -> geometry (HotState/HotReady/HotTransient, exported HotChunkKey/ParseHotChunkKey/HotChunkPrefix); hot catalog methods -> catalog (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys) - processChunk hot-source branch + progress hot refinement (lastCommittedLedger(cat, probe), highestReadyChunkSigned, refineWithHotDB) - new files: pkg/stores/hotchunk, streaming/{eligibility,hotsource,ingest,lifecycle} - daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe) - crash-hooks REMOVED to match #817/#818 (the split makes cat.hooks unreachable from streaming); the one beforeHotTransient hook test is dropped, the rest are the structural crash tests #817/#818 established - propagated renames: window->tx-hash-index, RetentionGate->RetentionFloor, cat.Has->public HotState, cat.layout->Layout() build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
ccebeb0 to
d169a0d
Compare
e39a052 to
808eb62
Compare
169d4f6 to
65bd508
Compare
| @@ -0,0 +1,333 @@ | |||
| package fullhistory | |||
There was a problem hiding this comment.
daemon files could stay at the root of fullhistory/
We can always move it later and have the file reorg/restructure deferred to #824
f3bbce1 to
14aa4c8
Compare
14aa4c8 to
aafbe0d
Compare
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
2967c7a to
240aa5e
Compare
… loop Per tamirms's #818 review ("the executor's own retry logic (#819) could share the same primitive too"), replace the hand-rolled withRetries + retryBackoffFor + ctxSleep with cenkalti/backoff — the same primitive backfillSource's waitForCoverage already uses. Behavior-preserving: exponential x2 from RetryBackoff, capped at maxRetryBackoff, no jitter, MaxRetries+1 total attempts, ctx-cancellation aborts. Drops the retrySleep test seam; the retry tests now assert the policy's NextBackOff sequence directly and use a tiny base for the call-count tests.
Rebase the daemon layer onto the split #818 and follow the issue #824 package layout instead of the old single streaming package: - backfill/: resolve.go + execute.go join process.go/txindex.go (the planner + bounded-worker executor). RunBackfill and the RunChunk/ RunIndex test seams are exported so the daemon package's catch-up tests can inject fakes across the package boundary. - observability/: the daemon's control-plane Metrics sink (NopMetrics and MetricsOrNop exported for backfill, which consumes the interface). - fullhistory/ (top level, package fullhistory): daemon, startup (catch-up + serve), config_validate, progress, doc. Adopt #817's chunks_per_txhash_index-is-a-constant change in the daemon layer: drop the cpi config field + its form/restart validation, replace PinLayout(cpi, earliest) with PinEarliestLedger(earliest), and build the TxHashIndexLayout from geometry.ChunksPerTxhashIndex. The daemon E2E test now exercises a non-terminal (rolling) index coverage, since one complete chunk no longer finalizes a 1000-chunk window; the terminal demote+sweep stays unit-tested in backfill/txindex_test. Propagate #818's window->index naming into the daemon: windowsOverlapping -> indexesOverlapping, the TxHashIndexLayout vars -> txLayout. closes #815
…l tests Both packages spawn goroutines (the errgroup worker pool in executePlan, the supervised daemon loop), so a leaked goroutine should fail the suite rather than slip through. Add goleak.VerifyTestMain to each package's TestMain and promote go.uber.org/goleak to a direct dependency.
#815's "Daemon / CLI entrypoint" deliverable: register a full-history-streaming cobra subcommand with a --config loader that runs fullhistory.RunDaemon under a SIGINT/SIGTERM-cancelled context, so the supervised loop shuts down cleanly on signal.
Shorter placeholder name (likely to change at the #772 cutover); drops the redundant -streaming suffix from the command token.
…seams The runChunk/runIndex seams took a third ExecConfig arg that no implementation read — the production closures and every test seam ignore it (`_ ExecConfig`) and capture cfg/procCfg/buildCfg instead. Drop it from the seam types, both default closures, the two call sites, and the test signatures.
#818's review unified its bulk-source seams: the separate ChunkSource (OpenStream) and BackendWaiter (WaitForCoverage) collapsed into one backfill.Backend interface (ledgerbackend.LedgerStream + Tip), with the coverage wait now internal to backfill (waitForCoverage, driven by Backend.Tip). Adapt the daemon layer to the frozen #818 surface: - Boundaries.Backend is now a backfill.Backend; drop the BackendWaiter field, its validate() check, and the ProcessConfig wiring. - backendTip adapts a backfill.Backend to NetworkTipBackend via Tip, so the catch-up tip and the freeze's coverage frontier are one source and cannot disagree. Its old WaitForCoverage half (and the four TestBackendTip_WaitForCoverage* tests) are dropped — that logic now lives in and is tested by backfill.waitForCoverage. - E2E + adapter tests use a fakeBackend (LedgerStream + Tip) in place of the old countingChunkSource/fakeWaiter/fakeLedgerBackend fakes.
The resolver's kind-rules comment used windowFirstChunk/windowLastChunk — names that read like identifiers but don't exist; the code calls txLayout.FirstChunk(w)/LastChunk(w). Name the real API so the pseudo-code is greppable and accurate. Prose still says "window" for the index unit, matching #818's txindex.go convention and the design doc.
… loop Per tamirms's #818 review ("the executor's own retry logic (#819) could share the same primitive too"), replace the hand-rolled withRetries + retryBackoffFor + ctxSleep with cenkalti/backoff — the same primitive backfillSource's waitForCoverage already uses. Behavior-preserving: exponential x2 from RetryBackoff, capped at maxRetryBackoff, no jitter, MaxRetries+1 total attempts, ctx-cancellation aborts. Drops the retrySleep test seam; the retry tests now assert the policy's NextBackOff sequence directly and use a tiny base for the call-count tests.
backendTip/newBackendTip had no production caller (buildProductionBoundaries wires notConfiguredTip + a nil Backend); only a test exercised it. Per the #817 review pattern (remove speculative API ahead of its caller — Admits, Floor, PutEarliestLedger were all cut), drop it until #772 wires the real backend/read path. The fakeBackend test helper loses its now-unused tipErr.
The len(needs)==0 early return is unreachable: an empty needs produces an empty ids slice, which the existing len(ids)==0 guard already returns nil for.
Wire the control-plane metrics that meter work Phase-1 backfill actually does, rather than leaving them all for Phase 2: - Watermark: re-emit per catch-up pass so watermark_ledger and retention_floor_ledger track the rising floor instead of freezing at the pre-catch-up value. - ColdTierBytes: sample the cold-tier footprint once per pass as backfill writes it (exported observability.MeasureColdTierBytes). - Freeze: record plan sizes (chunks/indexes built) per RunBackfill pass. - Prune: count the eager sweep's reclaimed artifacts in buildThenSweep (threaded observability.Metrics through BuildConfig). LastCommitted and ChunkBoundary stay unwired: live-ingestion (Phase 2) signals with no Phase-1 referent. recordingMetrics now captures Freeze/Prune; the RunBackfill and terminal-sweep tests assert they fire.
8165fc9 to
f32d3b2
Compare
Collapse multi-line doc blocks to one-liners and drop redundant comments across the daemon, executor, resolver, progress, and observability files. Pure comment reduction — no logic change. Also fix docs the Phase-1 metric wiring left stale: the Metrics interface 'callers today' note, and the Freeze/Prune descriptions that named Phase-2-only callers.
Closes #815 — Phase 1 (Backfill), layer 3 of 3. Stacked on PR1B (
streaming-phase1-primitives).Orchestration + the daemon entrypoint, completing the cold-only backfill daemon:
resolve(catalog-diff →Plan),executePlan(one bounded worker pool; index builds wait on their in-coverage chunk builds;withRetrieswith exponential backoff)startStreaming: anetworkTip-bounded catch-up loop →serveReadshandoff (no hot tier / live loop / lifecycle in Phase 1)full-historyCLI subcommandingest.ColdService+NewPrometheusSink(ProcessConfig.Sink)This branch is a runnable backfill daemon: boots from one TOML, catches up from a backend, and serves.
Verification:
go build+go vet+go test -shortgreen on./cmd/stellar-rpc/internal/fullhistory/...(cgo RocksDB toolchain). Note: the fullcmd/stellar-rpcbinary link requires the Rustlibpreflight/libxdr2json(CImake build-libs);go vet ./cmd/stellar-rpc/type-checks the entrypoint locally.golangci-lintruns in CI.Stack:
streaming-phase1-daemon→streaming-phase1-primitives