Add confidence intervals to main code by lindsaydbrin · Pull Request #130 · ServiceNow/eva

lindsaydbrin · 2026-05-29T16:26:09Z

Claude-generated summary to be revised later:

Every eva metrics run now emits 95% bootstrap confidence intervals on every CI-bearing scalar in metrics_summary.json. CIs are deterministic within a run (same run_dir.name → byte-identical bounds) and independent across runs.

Scope (purely additive — no existing field renamed or changed):

- overall_scores.{COMPOSITE}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}: for all 6 composites.
- overall_scores.pass_k.{COMPOSITE}.{stat}_ci_{lower,upper}: for pass_at_1, pass_at_k, pass_power_k_observed (multi-trial only). pass_power_k_theoretical stays bare.
- per_metric.{name}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}.
- per_metric.{name}.pass_k.{stat}_ci_{lower,upper} (multi-trial only).
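For readers consuming the new fields, here is a minimal sketch of how one entry might be rendered. The helper name `format_mean_with_ci` is hypothetical, the presence of a `mean` key next to the CI fields is inferred from the commit messages, and the numbers in `example` are made up for illustration rather than taken from an eva run.

```python
def format_mean_with_ci(entry, digits=3):
    """Render 'mean [lower, upper] (n=...)' from one summary entry.

    Hypothetical helper: it only assumes the field names listed in the
    scope above plus a `mean` point estimate.
    """
    mean = entry.get("mean")
    lower = entry.get("mean_ci_lower")
    upper = entry.get("mean_ci_upper")
    n = entry.get("mean_ci_n_scenarios")
    if mean is None:
        return "n/a"
    if lower is None or upper is None:
        # Empty-data entries are described as carrying null CIs.
        return f"{mean:.{digits}f} (no CI)"
    return f"{mean:.{digits}f} [{lower:.{digits}f}, {upper:.{digits}f}] (n={n})"


# Illustrative entry only; these values are not results from eva.
example = {
    "mean": 0.8,
    "mean_ci_lower": 0.75,
    "mean_ci_upper": 0.85,
    "mean_ci_n_scenarios": 50,
}
print(format_mean_with_ci(example))  # 0.800 [0.750, 0.850] (n=50)
```

Because the change is additive, a reader written this way should also tolerate summaries produced before metrics_version 2.1.0, where the CI keys are simply absent.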

Statistical design (matches the paper-validated approach in analysis/eva-bench-stats/): percentile bootstrap on per-scenario sample means, N_BOOT=2000, ALPHA=0.05, seed derived from run_dir.name via SHA-256 (cross-process-stable).
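The design above can be sketched roughly as follows. This is not the code in src/eva/utils/bootstrap.py: the function name `percentile_bootstrap_ci` is hypothetical, the sketch uses only the standard library, and it takes nearest-rank percentiles where the real module may interpolate. Only N_BOOT and ALPHA come from the PR description.

```python
import random

N_BOOT = 2000  # number of bootstrap resamples, as stated in the PR
ALPHA = 0.05   # two-sided level, giving a 95% interval


def percentile_bootstrap_ci(values, seed, n_boot=N_BOOT, alpha=ALPHA):
    """Percentile-bootstrap CI for the mean of per-scenario values.

    Hypothetical sketch: `values` holds one number per scenario.
    """
    if not values:
        return None, None  # no data, so no interval
    rng = random.Random(seed)  # seeded generator makes the result repeatable
    n = len(values)
    # Resample scenarios with replacement and record each resample's mean.
    boot_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_boot)
    )
    # Nearest-rank percentiles of the bootstrap distribution (an assumption).
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return boot_means[lo_idx], boot_means[hi_idx]
```

Calling the function twice with the same `values` and `seed` returns identical bounds, which is the property the PR relies on for byte-identical re-aggregation.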

Commits:

f6f3fae3 — Add src/eva/utils/bootstrap.py primitives module + 13 unit tests.
d34ae593 — Emit composite CIs in aggregation.py + 15 tests; bump metrics_version 2.0.0 → 2.1.0.
45a7b009 — Emit per-metric CIs in runner.py; thread run_seed(run_dir.name) through _save_summary and run_aggregate_only; 6 tests.

New pure-Python module providing percentile bootstrap CI primitives: bootstrap_resample, bootstrap_ci, assign_bootstrap_cis helper, plus a SHA-stable run_seed for cross-process-deterministic per-run seeding. Constants N_BOOT=2000, ALPHA=0.05, BASE_SEED=42. No eva imports, so it is safe to use from anywhere in the package.

13 unit tests cover the primitives plus a cross-process determinism check that guards against accidental use of Python's salted hash().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
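The determinism check matters because Python's built-in hash() of a string is salted per process unless PYTHONHASHSEED is fixed, so it cannot serve as a reproducible seed. A minimal sketch of a SHA-256-based alternative is shown below. How the real run_seed combines BASE_SEED with the run name, and how many digest bytes it keeps, are not stated in the commit message, so both choices here are assumptions.

```python
import hashlib

BASE_SEED = 42  # constant named in the commit message


def run_seed_sketch(run_name, base_seed=BASE_SEED):
    """Derive a process-stable integer seed from a run name.

    Hypothetical stand-in for run_seed: mixing the base seed into the
    hashed string and truncating to 32 bits are assumptions.
    """
    payload = f"{base_seed}:{run_name}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # The first 4 bytes give an integer in [0, 2**32).
    return int.from_bytes(digest[:4], "big")
```

Unlike hash(run_name), this value is the same in every interpreter process, so a later aggregate-only pass can reproduce the seed from the directory name alone.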

Adds 95% percentile-bootstrap CIs to every CI-bearing composite scalar in the aggregation layer:

- compute_run_level_aggregates emits mean_ci_lower, mean_ci_upper, and mean_ci_n_scenarios on every composite entry (pass/derived composites get the CI on mean; success_rate stays bare).
- _compute_aggregate_pass_k emits {stat}_ci_lower and {stat}_ci_upper for pass_at_1, pass_at_k, and pass_power_k_observed (pass_power_k_theoretical stays bare as a deterministic transform).
- Both functions accept a seed kwarg threaded by the runner in a follow-up commit.

Bootstrap unit is the scenario, not the trial: two new private helpers (_scenario_means_for_metric, _scenario_values_for_composite) collapse multi-trial records to one value per scenario before resampling. For k=1 runs each record is its own scenario.

Adds 15 unit tests across TestScenarioGrouping, TestRunLevelCompositeCIs, and TestRunLevelPassKCIs covering field shape, point-estimate bracketing, seed determinism, and null-CI handling for empty-data composites.

Bumps metrics_version 2.0.0 -> 2.1.0 (additive schema change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
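To illustrate why the scenario is the bootstrap unit, here is a rough sketch of collapsing trial-level records to one value per scenario. The record layout (`scenario_id` and `value` keys) and the function name are hypothetical, and using the per-scenario mean as the collapsed value is an assumption based on the helper name _scenario_means_for_metric.

```python
from collections import defaultdict


def scenario_means_sketch(records):
    """Collapse trial-level records into one mean per scenario.

    Hypothetical record shape: {"scenario_id": str, "value": float}.
    """
    by_scenario = defaultdict(list)
    for rec in records:
        by_scenario[rec["scenario_id"]].append(rec["value"])
    # Sort by scenario id so the output order does not depend on input order.
    return [sum(vals) / len(vals) for _, vals in sorted(by_scenario.items())]


records = [
    {"scenario_id": "s1", "value": 1.0},
    {"scenario_id": "s1", "value": 0.0},
    {"scenario_id": "s2", "value": 1.0},
]
print(scenario_means_sketch(records))  # [0.5, 1.0]
```

Resampling these collapsed values, rather than individual trials, keeps repeated trials of one scenario from being treated as independent observations.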

Extends MetricsRunner to populate the per-metric half of the CI schema and wire the run-dependent seed end-to-end:

- _build_per_metric_aggregates emits mean_ci_lower, mean_ci_upper, and mean_ci_n_scenarios on every per-metric entry, and {stat}_ci_lower and {stat}_ci_upper inside pass_k sub-blocks (pass_at_1, pass_at_k, pass_power_k_observed).
- _save_summary and run_aggregate_only compute seed = run_seed(run_dir.name) once and thread it through both aggregators, so re-running aggregate-only on the same run yields byte-identical CIs and different runs get independent Monte-Carlo noise.

Per-metric aggregates reuse aggregation._scenario_means_for_metric to collapse trials before bootstrapping; pass_k blocks share the assign_bootstrap_cis helper with the composite path.

Adds 6 unit tests across TestPerMetricCIs and TestRunSeedIntegration covering per-metric field shape, null-CI handling, same-seed byte-identity, and across-run independence.

(metrics_version was bumped to 2.1.0 in the preceding commit; --no-verify was used to skip the per-commit version-bump reminder.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
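The shared pass_k key layout could look something like the sketch below. The signature of the real assign_bootstrap_cis helper is not shown in this PR text, so `attach_ci_fields` and its `bounds` argument are hypothetical. Only the stat names and the `{stat}_ci_lower` / `{stat}_ci_upper` pattern come from the description.

```python
# Stats that receive CI fields; pass_power_k_theoretical is left bare.
CI_STATS = ("pass_at_1", "pass_at_k", "pass_power_k_observed")


def attach_ci_fields(block, bounds):
    """Write {stat}_ci_lower / {stat}_ci_upper into a pass_k block in place.

    Hypothetical helper: `bounds` maps a stat name to a (lower, upper)
    pair, and stats without bounds get null CIs.
    """
    for stat in CI_STATS:
        lower, upper = bounds.get(stat, (None, None))
        block[f"{stat}_ci_lower"] = lower
        block[f"{stat}_ci_upper"] = upper
    return block
```

Keeping the key naming in one place means the composite and per-metric paths cannot drift apart in how they spell the CI fields.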

lindsaydbrin and others added 3 commits May 29, 2026 11:59

lindsaydbrin wants to merge 3 commits into main from pr/lindsay/add_CIs
