stellar · chowbao · May 21, 2026
diff --git a/cmd/stellar-rpc/scripts/bench-fullhistory/README.md b/cmd/stellar-rpc/scripts/bench-fullhistory/README.md
@@ -121,9 +121,13 @@ Shared flags:
 | flag | meaning |
 |---|---|
 | `--types=ledgers,txhash,events` | which data types to ingest (any subset; required) |
-| `--source=pack\|bsb` | `pack` reads a local cold packfile; `bsb` reads from a GCS `BufferedStorageBackend` |
+| `--source=pack\|bsb\|lcm` | `pack` reads a local cold packfile; `bsb` reads from a GCS `BufferedStorageBackend`; `lcm` reads a framed `LedgerCloseMeta` file from stellar-core `apply-load` (see [Synthetic ledgers](#synthetic-ledgers-via-apply-load)) |
 | `--cold-dir=DIR` | source cold-store dir (required for `--source=pack`) |
 | `--bucket-path=...` | GCS `destination_bucket_path` (for `--source=bsb`); ADC credentials required |
+| `--lcm-file=FILE` | apply-load `meta.xdr` (required for `--source=lcm`) |
+| `--lcm-checkpoint=N` | skip leading ledgers with seq ≤ N (apply-load setup ledgers; for `--source=lcm`) |
+| `--lcm-fix-tx-hashes` | repair apply-load's tx-hash/envelope mismatch so the roundtrip reader can consume the meta (default `true`; `--source=lcm`) |
+| `--lcm-allow-partial` | allow a short final chunk when the run was sized below 10k ledgers (default `true`; `--source=lcm`) |
 | `--bsb-buffer-size`, `--bsb-num-workers` | BSB prefetch tuning |
 | `--chunk=N` | first chunk ID to ingest (required) |
 | `--xdr-views` | extract via zero-copy XDR views instead of `UnmarshalBinary` + struct walk |
@@ -192,6 +196,80 @@ bench-fullhistory cold-ingest --types=txhash --source=pack \
 bench-fullhistory build-txhash-index --in-dir=/path/to/out/cold/txhash
 ```
 
+## Synthetic ledgers via `apply-load`
+
+When you don't have (or don't want) real pubnet chunks, you can generate
+**fully synthetic, density-controlled** packfiles with stellar-core's
+`apply-load` command. `apply-load-gen.sh` drives the whole pipeline:
+
+```
+apply-load  →  meta.xdr (framed LedgerCloseMeta)  →  cold-ingest --source=lcm  →  packfiles  →  build-txhash-index
+```
+
+```sh
+# A small SAC run is enough to exercise the read benches: TPS is set by
+# per-ledger DENSITY, not ledger count, so a few hundred ledgers already hit
+# the profile's target throughput.
+CORE_BIN=/path/to/stellar-core PROFILE=sac NUM_LEDGERS=300 \
+  ./apply-load-gen.sh
+```
+
+**Workload profiles** (`PROFILE=`) map to apply-load's model transactions and
+target throughputs. TPS = txs-per-ledger ÷ block-time, and the target is taken at
+the network's **600 ms block time** (`CLOSE_TIME_MS` default), so the per-ledger
+tx count = `TPS × 0.6`:
+
+| `PROFILE` | model tx (`APPLY_LOAD_MODEL_TX`) | target | txs/ledger @600ms |
+|---|---|---|---|
+| `sac` | `sac` (Stellar Asset Contract transfer) | ~10k SAC TPS | 6,000 |
+| `token` (`oz`) | `custom_token` (OpenZeppelin-style token) | ~9k OZ TPS | 5,400 |
+| `soroswap` | `soroswap` (AMM swap, real mainnet wasm) | ~2.5k TPS | 1,500 |
+
+> The ledger header `closeTime` is whole **seconds** in XDR, so a 600 ms block
+> cadence can't be a timestamp — it's modeled purely by per-ledger density.
+
+Key env knobs: `NUM_LEDGERS` (total ledgers to generate; **prefer this for a
+quick run** — the final chunk may be partial), `CHUNKS` (10k-ledger chunks to
+fill, default 16; ignored when `NUM_LEDGERS` is set), `CLOSE_TIME_MS` (block
+time for the TPS math, default 600), `TXS_PER_LEDGER` (override the derived
+density), `CLUSTERS` (`APPLY_LOAD_LEDGER_MAX_DEPENDENT_TX_CLUSTERS` — parallel
+apply threads; **generation-speed only**, default 8, don't exceed 8),
+`TYPES`, `CHUNK_WORKERS`, `OUT_ROOT`, `KEEP_META`, `BENCH_BIN`.
+
+**Requirements & caveats:**
+
+- Needs a stellar-core built with **`BUILD_TESTS`** (the CI build tagged
+  `…~buildtests`) — `apply-load` + `ARTIFICIALLY_GENERATE_LOAD_FOR_TESTING`
+  are test-only.
+- **Cost scales with density, not just count.** apply-load close time grows with
+  txs/ledger and accumulated state: `sac` (1 fat batched tx/ledger) runs at
+  ~0.1 s/ledger, but `token`/`soroswap` apply ~9k txs/ledger at ~9 s/ledger and
+  rising. A full 10k-ledger chunk of dense Soroban load is **hours to days** —
+  so size dense profiles with a small `NUM_LEDGERS` (a few hundred), which still
+  meets the TPS target.
+- **apply-load tx-hash fixup (automatic).** `apply-load`'s streamed meta records
+  the same transactions in the tx-set and in `TxProcessing`, but the stored
+  result hash does **not** equal the envelope's real hash, so the go-stellar-sdk
+  ingest `LedgerTransactionReader` (which pairs envelope↔result by hash) rejects
+  it with *"unknown tx hash in LedgerCloseMeta"* — breaking the roundtrip
+  tx-page / tx-hash benches. `cold-ingest --source=lcm` repairs this by default
+  (`--lcm-fix-tx-hashes`): it pairs each result to its envelope via the
+  fee-charged account and stamps the correct hash. See `lcm_fixup.go`.
+- The `lcm` source assigns ledger sequences **positionally** per chunk (chunk 1
+  → seqs 10002…20001, etc.), skipping apply-load setup ledgers
+  (`--lcm-checkpoint`). The final chunk may be **partial** when the run was
+  sized below a full chunk (`--lcm-allow-partial`, on by default); the read
+  benches clamp their cursors to each chunk's actual ledger range.
+- **`cold-events` works for `sac` and `soroswap`, not `token`.** The corpus
+  builder needs enough unique *terms* (contract anchors + topic values) to fill
+  the K-bucket sweep (≥ max K, default 15) — it does **not** require 3 distinct
+  contracts. `sac` (one SAC contract whose `transfer` events vary `from`/`to`
+  over thousands of accounts) and `soroswap` (router + pair contracts) both
+  reach 15 terms from a single/few contracts. `token` (`custom_token`) emits
+  events that are not 4-topic, so it yields no usable terms — use `sac` or
+  `soroswap` for event benches. `cold-ledgers`/`cold-txpage`/`cold-txhash` work
+  for all profiles.
+
 ## Interpreting ingest output
 
 - **`total wall`** — end-to-end wall time. For multi-chunk cold runs it is
@@ -210,7 +288,8 @@ bench-fullhistory build-txhash-index --in-dir=/path/to/out/cold/txhash
 - `bench_concurrent_runner.go`, `bench_grid.go` — the `--query-concurrency` sweep scaffolding.
 - `bench_{hot,cold}_ingest.go` — ingest drivers.
 - `ingest_{ledgers,txhash,events}.go` — per-type ingesters + collectors.
-- `ingester.go`, `ledger.go`, `extract_{views,parsed}.go`, `sources.go` — ingest plumbing.
+- `ingester.go`, `ledger.go`, `extract_{views,parsed}.go`, `sources.go` — ingest plumbing (`sources.go` has the `pack`/`bsb`/`lcm` ledger sources).
+- `apply-load-gen.sh` — synthetic-ledger driver: stellar-core `apply-load` → `meta.xdr` → packfiles.
 - `bench_build_txhash_index.go`, `streamhash_merge.go` — phase-2 index build.
 - `corpus.go`, `cache*.go`, `tx_hash_helpers.go`, `metrics_helpers.go` — shared helpers.
 - `run-all-benches.sh` — suite driver (builds once, runs every read + ingest
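
The profile table in the README hunk above derives txs/ledger from the target TPS and the block time (`TPS × 0.6` at 600 ms). A minimal sketch of that arithmetic follows. The helper name `txs_per_ledger` is hypothetical; this diff does not show how `apply-load-gen.sh` actually computes the value, so treat it as an illustration of the formula only.

```sh
# Hypothetical helper (not taken from apply-load-gen.sh):
# per-ledger tx count = target TPS × block time in seconds.
# $1 = target TPS, $2 = block time in ms (defaults to 600, like CLOSE_TIME_MS).
txs_per_ledger() {
  echo $(( $1 * ${2:-600} / 1000 ))
}

txs_per_ledger 10000   # sac      -> 6000
txs_per_ledger 9000    # token    -> 5400
txs_per_ledger 2500    # soroswap -> 1500
```

Integer arithmetic is enough here because every target in the table is a multiple of 5 TPS at 600 ms; a non-default block time could truncate, which is one reason an explicit `TXS_PER_LEDGER` override is useful.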

diff --git a/cmd/stellar-rpc/scripts/bench-fullhistory/SYNTHETIC-LEDGERS.md b/cmd/stellar-rpc/scripts/bench-fullhistory/SYNTHETIC-LEDGERS.md
@@ -0,0 +1,129 @@
+# Synthetic-ledger generation + benchmarking — runbook
+
+End-to-end recipe to generate controllable synthetic full-history datasets with
+stellar-core `apply-load` and run the read bench suite on them. This is the
+hands-off path: prepare the host once, then `synthetic-run.sh` does
+generate → bench → (optional) upload for every profile.
+
+Scripts (all in this directory):
+- `apply-load-gen.sh` — generate ONE profile (apply-load → meta.xdr → cold packfiles + tx-hash index)
+- `bench-suite.sh` — run the cold/hot read benches against generated stores
+- `synthetic-run.sh` — orchestrator: loop profiles → generate → bench → optional GCS upload
+
+## 1. Host prerequisites
+
+### a) stellar-core with `apply-load` (BUILD_TESTS)
+Released/Docker cores **strip** `apply-load`. Install the `~buildtests` build from
+SDF's unstable apt channel (Ubuntu 24.04 / noble shown):
+
+```sh
+sudo wget -qO /etc/apt/keyrings/SDF.asc https://apt.stellar.org/SDF.asc
+echo "deb [signed-by=/etc/apt/keyrings/SDF.asc] https://apt.stellar.org noble unstable" \
+  | sudo tee /etc/apt/sources.list.d/SDF-unstable.list
+sudo apt-get update
+apt-cache madison stellar-core | grep buildtests        # pick the newest build for the protocol you want
+sudo apt-get install -y stellar-core=<EXACT-…~buildtests-version>   # pin: it sorts below stable
+stellar-core apply-load --help                          # must succeed
+```
+
+### b) Go + RocksDB (to build the `bench-fullhistory` binary)
+The bench binary uses cgo against RocksDB v10 (grocksdb v1.10.x). The system
+`librocksdb` (8.x) is too old.
+
+```sh
+# Go: match go.mod's toolchain (1.26 at time of writing) — e.g. /usr/local/go
+# RocksDB v10.9.1 (shared lib + headers):
+PREFIX=$HOME/.rocksdb ./scripts/install-rocksdb.sh        # repo root script
+
+export CGO_CFLAGS="-I$HOME/.rocksdb/include"
+export CGO_LDFLAGS="-L$HOME/.rocksdb/lib -lrocksdb"
+export LD_LIBRARY_PATH="$HOME/.rocksdb/lib"               # needed at RUN time too
+```
+
+### c) Disk + RAM — the two real constraints
+- **Disk:** use a fast **local** volume (NVMe instance store, not network EBS) for
+  `OUT_ROOT`. The transient `meta.xdr` is large (a 10k-ledger SAC chunk is 100+ GB
+  before it's deleted post-ingest). Budget hundreds of GB free.
+- **RAM: this caps how many ledgers you can generate.** Each dense apply-load run holds
+  in-memory Soroban state that **grows with ledger count**. Measured: SAC at
+  6000 tx/ledger ≈ **8.5 MB/ledger** → ~32 GB at 3,760 ledgers, ~85 GB at 10,000.
+
+  | box RAM | sac/token (6000/5400 tx/ledger) | soroswap (1500 tx/ledger) |
+  |---|---|---|
+  | 61 GB (c6id.8xlarge) | ~6,000 ledgers | ~20,000 (2 chunks) |
+  | 128 GB | ~14,000 | full chunks easily |
+  | 256 GB | ~28,000 (≈3 chunks) | many chunks |
+
+  **A full 10k-ledger chunk of 10k-TPS SAC needs ~96–128 GB RAM.** If a run exceeds
+  RAM, the kernel OOM-kills apply-load mid-generation. Size `NUM_LEDGERS` to the box.
+
+## 2. Profiles and the TPS model
+
+`MODEL_TX` + per-ledger density define the workload. TPS is taken at a **600 ms**
+block time (`CLOSE_TIME_MS`), so per-ledger tx count = `TPS × 0.6`:
+
+| PROFILE | model tx | target | tx/ledger @600ms |
+|---|---|---|---|
+| `sac` | SAC transfer | 10,000 TPS | 6,000 |
+| `token` (`oz`) | custom_token | 9,000 TPS | 5,400 |
+| `soroswap` | AMM swap | 2,500 TPS | 1,500 |
+
+Notes baked into the scripts:
+- `BATCH_SAC=1` so each transfer is its own tx (pack tx-density == TPS target).
+- `CLUSTERS=8` (`APPLY_LOAD_LEDGER_MAX_DEPENDENT_TX_CLUSTERS`) — generation-speed
+  only; don't exceed 8 (known multi-threaded-apply perf issues above that).
+- `HTTP_PORT=0` so parallel generations don't collide on core's HTTP port.
+- The streamed meta needs a **tx-hash fixup** (cold-ingest does it by default,
+  `--lcm-fix-tx-hashes`) or the roundtrip txpage/txhash benches reject it; the
+  network passphrase is pubnet's, to match the bench binary. (Details in this dir's README.)
+- `cold-events`/`hot-events` work for `sac` and `soroswap`; **not** `token`
+  (custom_token events aren't 4-topic).
+
+## 3. Run it
+
+```sh
+cd cmd/stellar-rpc/scripts/bench-fullhistory
+# Export the §1b env (CGO_*, LD_LIBRARY_PATH) first. Size NUM_LEDGERS to your RAM (§1c).
+# PARALLEL=0 is sequential (safe); use 1 only if combined RSS fits RAM. GCS_DEST is optional (upload).
+CORE_BIN=/usr/bin/stellar-core \
+OUT_ROOT=/mnt/nvme/synth \
+PROFILES="sac token soroswap" \
+NUM_LEDGERS=6000 \
+PARALLEL=0 \
+GCS_DEST=gs://rpc-full-history/synthetic-ledgers/<run-name> \
+  ./synthetic-run.sh
+  ./synthetic-run.sh
+```
+
+For a long unattended run, detach it:
+```sh
+setsid nohup env CORE_BIN=… OUT_ROOT=… NUM_LEDGERS=… ./synthetic-run.sh > run.out 2>&1 < /dev/null &
+```
+
+`soroswap` reaches full 10k chunks on modest RAM, so a common split is:
+`NUM_LEDGERS=20000 PROFILES=soroswap` (2 chunks) plus
+`NUM_LEDGERS=<RAM-safe> PROFILES="sac token"`.
+
+## 4. Outputs
+
+```
+$OUT_ROOT/<profile>/cold/{ledgers/00000/*.pack, txhash.idx, events/00000/*}
+$OUT_ROOT/<profile>/work/apply-load.cfg          # exact config (reproducibility input)
+$OUT_ROOT/bench-results/run-<ts>/<profile>/*.csv # latency/throughput sweeps
+```
+
+Point the read benches at a cold store directly, e.g.:
+```sh
+LD_LIBRARY_PATH=$HOME/.rocksdb/lib ./bench-fullhistory cold-txpage \
+  --cold-dir=$OUT_ROOT/sac/cold/ledgers --page-size=20 --iters=200 \
+  --query-concurrency=1,4,8,16 --xdr-views --out=results-sac
+```
+
+## 5. Reproducibility caveat
+
+The **config + genesis are deterministic** (same root account each run), but the
+**transactions are not byte-reproducible**: stellar-core seeds its RNG from
+wall-clock time and `apply-load` exposes no seed. Runs match in *shape*
+(profile/density/op-mix), not bytes. To pin an exact dataset, keep the generated
+cold packs (and their SHA256s) — that's the canonical artifact. Uploading to GCS
+(`GCS_DEST`) is also how you make NVMe-instance-store output durable (it's wiped
+on instance stop/terminate).
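
The reproducibility caveat above recommends keeping the cold packs together with their SHA256s. One way to record and re-check them is sketched below, assuming GNU coreutils `sha256sum` is available on the host. The function names and the `<dir>.SHA256SUMS` manifest location are illustrative assumptions, not something the scripts in this PR produce.

```sh
# Hypothetical helpers (not part of synthetic-run.sh).
# write_manifest DIR: hash every file under DIR into DIR.SHA256SUMS (next to DIR,
# so the manifest never hashes itself). Paths are stored relative to DIR.
write_manifest() {
  dir=${1%/}
  ( cd "$dir" && find . -type f -exec sha256sum {} + | sort -k 2 ) > "$dir.SHA256SUMS"
}

# verify_manifest DIR: re-hash DIR against DIR.SHA256SUMS; non-zero exit on any mismatch.
verify_manifest() {
  dir=${1%/}
  ( cd "$dir" && sha256sum -c --quiet - ) < "$dir.SHA256SUMS"
}

# Example (paths follow the layout in §4):
#   write_manifest  "$OUT_ROOT/sac/cold"
#   verify_manifest "$OUT_ROOT/sac/cold"
```

Storing relative paths keeps the manifest valid after the store is copied elsewhere (for example, downloaded back from the `GCS_DEST` bucket), and sorting by path makes two manifests easy to diff.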