Skip to content

Bench vs p3 nightly publish#590

Open
jotabulacios wants to merge 55 commits into
mainfrom
bench_vs_p3_nightly_publish
Open

Bench vs p3 nightly publish#590
jotabulacios wants to merge 55 commits into
mainfrom
bench_vs_p3_nightly_publish

Conversation

@jotabulacios
Copy link
Copy Markdown
Collaborator

@jotabulacios jotabulacios commented May 18, 2026

New bench_vs_plonky3/ proves a shared Fibonacci AIR on both Lambda and Plonky3. Plonky3's MMCS is configured scalar (no vector lanes) so Merkle compression is one Keccak-f per call on both sides. bench_vs_plonky3/run.sh runs the bench, reports median + CV, and writes TSV + metrics to --report-dir.
Workflow runs the bench after the SP1 + ethrex steps and appends a "Lambda VM vs Plonky3" section to the same Slack post.

@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Review Summary

This PR adds a new bench_vs_plonky3 crate with a Fibonacci STARK benchmark comparing Lambda VM against Plonky3, plus CI automation.

Medium

  • Diagnostic output swallowed by command substitution (run.sh:357–358): run_prover is called as lambda_median=$(run_prover ...). Failure messages and log output written to stdout are captured and discarded. Redirect to stderr. (See inline comment.)

  • P3TimingLayer undercounts time under Rayon parallelism (prove_bench.rs:362–379): When Rayon enters a span concurrently on N threads, only the first thread's interval is recorded; the other N−1 are silently dropped. CPU-parallel phases (FFT, Merkle) will show ~1/N of their real wall-clock contribution in the breakdown. (See inline comment.)

  • max_log_arity = 1 is non-representative of Plonky3 production config (plonky3_config.rs:66): Forcing radix-2 FRI folding at 2^22 domain requires 22 fold rounds; typical P3 production configs use arity 8–16 (~7–8 rounds). The headline Lambda/P3 ratio is influenced by this sub-optimal P3 setting. Please document the reasoning or adjust the arity. (See inline comment.)

Low

  • maxrss as u64 cast without sign check (prove_bench.rs:413–416): On Linux ru_maxrss is i64; a negative return value wraps silently to a huge number. Add a guard. (See inline comment.)

  • P3TimingLayer is duplicated between prove_bench.rs:330–394 and lib.rs:215–267; the two copies will diverge. Consolidate in lib.rs and re-export. (See inline comment.)

@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/claude /codex

@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  1. High Security: The nightly workflow builds and runs unpinned code from Plonky3 main on a self-hosted runner. The new manifest uses git dependencies without a pinned rev, and the workflow explicitly refreshes them before building. Cargo build scripts or dependency code from a compromised/upstream-changed repo would execute on the persistent self-hosted runner. Pin the Plonky3 revision for CI, or run this job in an isolated disposable environment without sensitive filesystem/network access.
    bench_vs_plonky3/Cargo.toml, .github/workflows/bench-vs-nightly.yml

  2. Medium Bug: The benchmark intentionally updates Plonky3 to latest main, but the generated artifact only records the Lambda repo SHA and git_tree=dirty; it does not record the resolved Plonky3 commit. That makes nightly results non-reproducible and hard to audit, especially if performance changes or upstream breaks. Record the resolved git revision from Cargo.lock/cargo metadata, or archive the post-update lockfile in bench_vs_artifacts.
    .github/workflows/bench-vs-nightly.yml, bench_vs_plonky3/run.sh

I attempted cargo check -p bench-vs-plonky3 --bin prove_bench --benches, but it could not run because rustup tried to create files under the read-only /home/runner/.rustup.

Comment thread bench_vs_plonky3/benches/stark_comparison.rs
Comment thread bench_vs_plonky3/run.sh
Comment thread bench_vs_plonky3/Cargo.toml
@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Review: Bench vs Plonky3

Overall this is a well-structured, well-documented benchmarking crate. The fairness choices (scalar MMCS, matching AIR shape, same blowup/queries) are clearly reasoned and the README is thorough. A few issues worth fixing:

Medium

  • Verify benchmark skew (benches/stark_comparison.rs:154): matched_params_config() is called inside b.iter, so config construction is measured on every verify iteration. The Lambda verify bench only constructs a lightweight DefaultTranscript per iteration. Move the config outside the iter closure — see inline comment.

Low

  • Population vs sample variance (run.sh cv_pct_file): Uses biased population variance (÷ n) rather than sample variance (÷ n−1). For the default 10 runs this reports CV ~5% lower than the statistically correct value — inline comment has a one-line fix.

Info

  • Unpinned git deps (Cargo.toml): Intentional for nightly tracking, but a brief comment near [dependencies] would help future contributors understand that cargo update -p p3-* is expected before nightly runs.
  • P3TimingLayer in parallel proofs: Rayon tasks can create multiple spans with the same name concurrently. Each closes independently, so --breakdown output shows repeated span names rather than one aggregated line. Fine for a debug tool, but worth knowing when reading breakdown output.

No security or correctness issues found in the AIR, prover, or verifier wiring.

@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/codex /claude

@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  • High Security: .github/workflows/bench-vs-nightly.yml:65 adds a scheduled self-hosted runner step that runs cargo update against unpinned Plonky3 git dependencies from bench_vs_plonky3/Cargo.toml:15. The next step builds and runs that freshly fetched code on the self-hosted, bench runner. Any compromised upstream commit or dependency build script can execute arbitrary code on the runner and access runner-local state and the Slack webhook used later in the job. Pin Plonky3 to a reviewed rev and remove the nightly cargo update, or run this benchmark in a disposable isolated environment with no secrets.

Verification

I attempted cargo check -p bench-vs-plonky3 --locked, but the sandbox cannot write to /home/runner/.rustup/tmp, so I could not complete a compile check.

Comment thread .github/workflows/bench-vs-nightly.yml
Comment thread bench_vs_plonky3/src/bin/prove_bench.rs
@claude
Copy link
Copy Markdown

claude Bot commented May 26, 2026

Review: bench_vs_plonky3 nightly benchmark

Overall the crate is well-structured and the fairness considerations (scalar MMCS, matched blowup/queries, base-field AIR path, extension field differences) are clearly documented. Two actionable items:


Medium — CI: P3 steps missing continue-on-error: true (see inline)

The ethrex steps carry continue-on-error: true so an optional benchmark failure doesn't lose the SP1 artifact. The two new P3 steps don't follow that pattern. If cargo update hits a network blip or Plonky3 HEAD breaks the API, GitHub Actions skips the artifact upload, discarding all collected SP1 results for that night. The Slack post still fires (if: always()) with no artifact attached.

Both new steps need continue-on-error: true (see inline suggestion).


Low — prove_bench.rs: Instant::now() precedes subscriber installation in breakdown path (see inline)

In the --breakdown branch of run_p3, the timer starts before set_default(subscriber). set_default is a thread-local write (nanoseconds), so the delta is negligible, but it means prove_total in breakdown mode technically includes setup overhead that is absent in the standard (nightly) path. Not a problem for published numbers since the nightly does not pass --breakdown\.


Low — prove_bench.rs: partial_cmp().unwrap() on f64 span durations (line ~256)

span_data.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()) panics on NaN. Durations from Instant::elapsed() are never NaN in practice, but sort_by(|a, b| b.1.total_cmp(&a.1)) is the idiomatic panic-free alternative (stable since Rust 1.62).


Nit — run_lambda uses _proof on an actively-used variable

_proof is Rust convention for intentionally unused. The proof is later passed to serde_cbor::to_vec and Verifier::verify. Renaming to proof removes the misleading signal.


No issues with the AIR constraint correctness, timing windows for the headline benchmark, or the Keccak scalar-MMCS configuration.

@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/claude

@claude
Copy link
Copy Markdown

claude Bot commented May 26, 2026

Code Review - Bench vs Plonky3 Nightly. Overall: Well-structured. Fairness docs and trace-equivalence test are solid. Issues below. (1) BUG Low - Wrong comment path: publish_bench_vs.sh says 'Built when bench_vs_artifacts/p3/headline/metrics.txt exists' but the actual path is bench_vs_artifacts/p3/metrics.txt - no headline/ dir. (2) BUG Low - Dead code: plonky3_fibonacci.rs exports pub fn rows_for_lambda_trace but it is never called in the crate. Remove if unused. (3) MEDIUM - Silent failures: Cargo.toml pins all p3 crates to the Plonky3 git repo with no rev. Both new CI steps use continue-on-error: true so a Plonky3 breaking change silently drops the P3 section from nightly reports with no alert. Add a Slack warning when bench_vs_artifacts/p3/metrics.txt is absent. (4) Nit - p3_fmt_seconds uses %.0fms which rounds to 0ms for verify times under 0.5ms; use %.1fms.

Comment thread .github/scripts/publish_bench_vs.sh Outdated
Comment thread bench_vs_plonky3/src/plonky3_fibonacci.rs Outdated
Comment thread .github/workflows/bench-vs-nightly.yml
Comment thread .github/scripts/publish_bench_vs.sh Outdated
@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/claude

Comment thread bench_vs_plonky3/src/lib.rs
Comment thread bench_vs_plonky3/Cargo.toml
@claude
Copy link
Copy Markdown

claude Bot commented May 26, 2026

Review: Bench vs Plonky3

Overview: Adds a bench_vs_plonky3/ crate with a shared Fibonacci AIR proven by both Lambda STARK and Plonky3, a run.sh harness with median/CV statistics, and wires the result into the nightly Slack report. The approach is sound and the fairness rationale (scalar MMCS, same blowup/queries, extension field degree parity) is well-documented.


Bug

Low — partial_cmp().unwrap() vs total_cmp() (inline on lib.rs:229)
The instruments_breakdown test sorts spans with partial_cmp().unwrap(); prove_bench.rs correctly uses total_cmp. Duration values shouldn't be NaN in practice, but the inconsistency is a latent panic.


Design Note

Unpinned Plonky3 git deps (inline on Cargo.toml)
All p3-* entries use git = "..." without rev =. The cargo update -p p3-* step in CI intentionally advances these nightly, and continue-on-error: true prevents blocking other steps on failure. The risk is that a silent API break in Plonky3 produces no workflow failure and just drops the P3 section from Slack — potentially for several days before anyone notices. Worth documenting this tradeoff inline.


No issues found in:

  • Rust unsafe usage (peak_rss_kb / getrusage) — correctly guarded by return-code check before assume_init
  • Cryptographic correctness — no new crypto, just wrapping existing Lambda and P3 provers
  • AIR correctness — transition constraints match the Fibonacci recurrence; lambda_pair_trace_matches_plonky3_trace provides a cell-by-cell check
  • Mutex handling in span_timing.rsspans lock is released before results lock is acquired (no deadlock risk)
  • Shell arithmetic in publish_bench_vs.sh — all formatted values are numeric (awk output), so JSON injection into the Slack payload is not a practical concern

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants