Bench vs p3 nightly publish by jotabulacios · Pull Request #590 · yetanotherco/lambda_vm

jotabulacios · 2026-05-18T19:24:03Z

New bench_vs_plonky3/ proves a shared Fibonacci AIR on both Lambda and Plonky3. Plonky3's MMCS is configured scalar (no vector lanes) so Merkle compression is one Keccak-f per call on both sides. bench_vs_plonky3/run.sh runs the bench, reports median + CV, and writes TSV + metrics to --report-dir.
Workflow runs the bench after the SP1 + ethrex steps and appends a "Lambda VM vs Plonky3" section to the same Slack post.

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

…ch_vs

…tion to bench_vs

--steps flag for nightly 1M/2M/4M/8M benchmarks

CLI flag

bench_vs/run.sh by switching to sed

claude · 2026-05-19T19:14:45Z

Review Summary

This PR adds a new bench_vs_plonky3 crate with a Fibonacci STARK benchmark comparing Lambda VM against Plonky3, plus CI automation.

Medium

Diagnostic output swallowed by command substitution (run.sh:357–358): run_prover is called as lambda_median=$(run_prover ...). Failure messages and log output written to stdout are captured and discarded. Redirect to stderr. (See inline comment.)
P3TimingLayer undercounts time under Rayon parallelism (prove_bench.rs:362–379): When Rayon enters a span concurrently on N threads, only the first thread's interval is recorded; the other N−1 are silently dropped. CPU-parallel phases (FFT, Merkle) will show ~1/N of their real wall-clock contribution in the breakdown. (See inline comment.)
max_log_arity = 1 is non-representative of Plonky3 production config (plonky3_config.rs:66): Forcing radix-2 FRI folding at 2^22 domain requires 22 fold rounds; typical P3 production configs use arity 8–16 (~7–8 rounds). The headline Lambda/P3 ratio is influenced by this sub-optimal P3 setting. Please document the reasoning or adjust the arity. (See inline comment.)

Low

maxrss as u64 cast without sign check (prove_bench.rs:413–416): On Linux ru_maxrss is i64; a negative return value wraps silently to a huge number. Add a guard. (See inline comment.)
P3TimingLayer is duplicated between prove_bench.rs:330–394 and lib.rs:215–267; the two copies will diverge. Consolidate in lib.rs and re-export. (See inline comment.)

jotabulacios · 2026-05-22T17:54:03Z

/claude /codex

github-actions · 2026-05-22T17:56:12Z

Codex Code Review

Findings

High Security: The nightly workflow builds and runs unpinned code from Plonky3 main on a self-hosted runner. The new manifest uses git dependencies without a pinned rev, and the workflow explicitly refreshes them before building. Cargo build scripts or dependency code from a compromised/upstream-changed repo would execute on the persistent self-hosted runner. Pin the Plonky3 revision for CI, or run this job in an isolated disposable environment without sensitive filesystem/network access.
bench_vs_plonky3/Cargo.toml, .github/workflows/bench-vs-nightly.yml
Medium Bug: The benchmark intentionally updates Plonky3 to latest main, but the generated artifact only records the Lambda repo SHA and git_tree=dirty; it does not record the resolved Plonky3 commit. That makes nightly results non-reproducible and hard to audit, especially if performance changes or upstream breaks. Record the resolved git revision from Cargo.lock/cargo metadata, or archive the post-update lockfile in bench_vs_artifacts.
.github/workflows/bench-vs-nightly.yml, bench_vs_plonky3/run.sh

I attempted cargo check -p bench-vs-plonky3 --bin prove_bench --benches, but it could not run because rustup tried to create files under the read-only /home/runner/.rustup.

claude · 2026-05-22T17:58:30Z

Review: Bench vs Plonky3

Overall this is a well-structured, well-documented benchmarking crate. The fairness choices (scalar MMCS, matching AIR shape, same blowup/queries) are clearly reasoned and the README is thorough. A few issues worth fixing:

Medium

Verify benchmark skew (benches/stark_comparison.rs:154): matched_params_config() is called inside b.iter, so config construction is measured on every verify iteration. The Lambda verify bench only constructs a lightweight DefaultTranscript per iteration. Move the config outside the iter closure — see inline comment.

Low

Population vs sample variance (run.sh cv_pct_file): Uses biased population variance (÷ n) rather than sample variance (÷ n−1). For the default 10 runs this reports CV ~5% lower than the statistically correct value — inline comment has a one-line fix.

Info

Unpinned git deps (Cargo.toml): Intentional for nightly tracking, but a brief comment near [dependencies] would help future contributors understand that cargo update -p p3-* is expected before nightly runs.
P3TimingLayer in parallel proofs: Rayon tasks can create multiple spans with the same name concurrently. Each closes independently, so --breakdown output shows repeated span names rather than one aggregated line. Fine for a debug tool, but worth knowing when reading breakdown output.

No security or correctness issues found in the AIR, prover, or verifier wiring.

jotabulacios · 2026-05-26T13:22:48Z

/codex /claude

github-actions · 2026-05-26T13:24:17Z

Codex Code Review

Findings

High Security: .github/workflows/bench-vs-nightly.yml:65 adds a scheduled self-hosted runner step that runs cargo update against unpinned Plonky3 git dependencies from bench_vs_plonky3/Cargo.toml:15. The next step builds and runs that freshly fetched code on the self-hosted, bench runner. Any compromised upstream commit or dependency build script can execute arbitrary code on the runner and access runner-local state and the Slack webhook used later in the job. Pin Plonky3 to a reviewed rev and remove the nightly cargo update, or run this benchmark in a disposable isolated environment with no secrets.

Verification

I attempted cargo check -p bench-vs-plonky3 --locked, but the sandbox cannot write to /home/runner/.rustup/tmp, so I could not complete a compile check.

claude · 2026-05-26T13:42:36Z

Review: bench_vs_plonky3 nightly benchmark

Overall the crate is well-structured and the fairness considerations (scalar MMCS, matched blowup/queries, base-field AIR path, extension field differences) are clearly documented. Two actionable items:

Medium — CI: P3 steps missing `continue-on-error: true` (see inline)

The ethrex steps carry continue-on-error: true so an optional benchmark failure doesn't lose the SP1 artifact. The two new P3 steps don't follow that pattern. If cargo update hits a network blip or Plonky3 HEAD breaks the API, GitHub Actions skips the artifact upload, discarding all collected SP1 results for that night. The Slack post still fires (if: always()) with no artifact attached.

Both new steps need continue-on-error: true (see inline suggestion).

Low — `prove_bench.rs`: `Instant::now()` precedes subscriber installation in breakdown path (see inline)

In the --breakdown branch of run_p3, the timer starts before set_default(subscriber). set_default is a thread-local write (nanoseconds), so the delta is negligible, but it means prove_total in breakdown mode technically includes setup overhead that is absent in the standard (nightly) path. Not a problem for published numbers since the nightly does not pass --breakdown\.

Low — `prove_bench.rs`: `partial_cmp().unwrap()` on f64 span durations (line ~256)

span_data.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()) panics on NaN. Durations from Instant::elapsed() are never NaN in practice, but sort_by(|a, b| b.1.total_cmp(&a.1)) is the idiomatic panic-free alternative (stable since Rust 1.62).

Nit — `run_lambda` uses `_proof` on an actively-used variable

_proof is Rust convention for intentionally unused. The proof is later passed to serde_cbor::to_vec and Verifier::verify. Renaming to proof removes the misleading signal.

No issues with the AIR constraint correctness, timing windows for the headline benchmark, or the Keccak scalar-MMCS configuration.

jotabulacios · 2026-05-26T15:36:48Z

/claude

claude · 2026-05-26T15:42:44Z

Code Review - Bench vs Plonky3 Nightly. Overall: Well-structured. Fairness docs and trace-equivalence test are solid. Issues below. (1) BUG Low - Wrong comment path: publish_bench_vs.sh says 'Built when bench_vs_artifacts/p3/headline/metrics.txt exists' but the actual path is bench_vs_artifacts/p3/metrics.txt - no headline/ dir. (2) BUG Low - Dead code: plonky3_fibonacci.rs exports pub fn rows_for_lambda_trace but it is never called in the crate. Remove if unused. (3) MEDIUM - Silent failures: Cargo.toml pins all p3 crates to the Plonky3 git repo with no rev. Both new CI steps use continue-on-error: true so a Plonky3 breaking change silently drops the P3 section from nightly reports with no alert. Add a Slack warning when bench_vs_artifacts/p3/metrics.txt is absent. (4) Nit - p3_fmt_seconds uses %.0fms which rounds to 0ms for verify times under 0.5ms; use %.1fms.

jotabulacios · 2026-05-26T17:56:25Z

/claude

claude · 2026-05-26T18:00:01Z

Review: Bench vs Plonky3

Overview: Adds a bench_vs_plonky3/ crate with a shared Fibonacci AIR proven by both Lambda STARK and Plonky3, a run.sh harness with median/CV statistics, and wires the result into the nightly Slack report. The approach is sound and the fairness rationale (scalar MMCS, same blowup/queries, extension field degree parity) is well-documented.

Bug

Low — partial_cmp().unwrap() vs total_cmp() (inline on lib.rs:229)
The instruments_breakdown test sorts spans with partial_cmp().unwrap(); prove_bench.rs correctly uses total_cmp. Duration values shouldn't be NaN in practice, but the inconsistency is a latent panic.

Design Note

Unpinned Plonky3 git deps (inline on Cargo.toml)
All p3-* entries use git = "..." without rev =. The cargo update -p p3-* step in CI intentionally advances these nightly, and continue-on-error: true prevents blocking other steps on failure. The risk is that a silent API break in Plonky3 produces no workflow failure and just drops the P3 section from Slack — potentially for several days before anyone notices. Worth documenting this tradeoff inline.

No issues found in:

Rust unsafe usage (peak_rss_kb / getrusage) — correctly guarded by return-code check before assume_init
Cryptographic correctness — no new crypto, just wrapping existing Lambda and P3 provers
AIR correctness — transition constraints match the Fibonacci recurrence; lambda_pair_trace_matches_plonky3_trace provides a cell-by-cell check
Mutex handling in span_timing.rs — spans lock is released before results lock is acquired (no deadlock risk)
Shell arithmetic in publish_bench_vs.sh — all formatted values are numeric (awk output), so JSON injection into the Slack payload is not a practical concern

MauroToscano and others added 30 commits February 25, 2026 19:00

bench vs others

f7f842b

Standarize guest

472ddc3

Fix decimals

88171d0

Benchmark vs sp1

ef7d5bb

Add lambda program

e11346a

Remove stray blank line from .gitignore

687d8d8

Apply suggestion from @gabrielbosio

0069659

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Add instruments

dabe54c

Merge branch 'main' of github.com:yetanotherco/lambda_vm

19e569e

Merge branch 'main' of github.com:yetanotherco/lambda_

3955884

Merge branch 'main' into bench_vs

15984c1

Merge branch 'bench_vs' of github.com:yetanotherco/lambda_vm into ben…

c2dca55

…ch_vs

Merge branch 'main' into bench_vs

8d043eb

Merge branch 'main' into bench_vs

4858ad8

Merge

5605c6b

Fix syscall number

af6aab3

Merge branch 'main' into bench_vs

a1169d9

Add runtime private inputs, nightly CI workflow, and 500M step projec…

687af82

…tion to bench_vs

Merge branch 'main' into bench_vs

5ea0c40

save work

0315a80

Fix bench_vs 5x projection inflation and add

3f3c7ac

--steps flag for nightly 1M/2M/4M/8M benchmarks

Merge branch 'main' into bench_vs

1288f27

Switch bench_vs projection to measured cycles and add --cycles

e9e9fbd

CLI flag

Fix grep pipefail in

1e1459e

bench_vs/run.sh by switching to sed

Merge branch 'main' into bench_vs

8d5c152

Add bench_vs_plonky3

2e65d5a

lint

1fd1a58

Replace summary.md with metrics.txt in bench_vs_plonky3 and add README

c9f9df9

remove line

ba4c7cd

add README

f7fe4d3

jotabulacios added 2 commits May 20, 2026 11:50

address comments

bb7d8ba

add missing file

cd41dd7

claude Bot reviewed May 22, 2026

View reviewed changes

Comment thread bench_vs_plonky3/benches/stark_comparison.rs

claude Bot reviewed May 22, 2026

View reviewed changes

Comment thread bench_vs_plonky3/run.sh

claude Bot reviewed May 22, 2026

View reviewed changes

Comment thread bench_vs_plonky3/Cargo.toml

address comments

a6a27a9