Skip to content

Add confidence intervals to main code#130

Draft
lindsaydbrin wants to merge 3 commits into
mainfrom
pr/lindsay/add_CIs
Draft

Add confidence intervals to main code#130
lindsaydbrin wants to merge 3 commits into
mainfrom
pr/lindsay/add_CIs

Conversation

@lindsaydbrin
Copy link
Copy Markdown
Collaborator

Claude-generated summary to be revised later:

Every eva metrics run now emits 95% bootstrap confidence intervals on every CI-bearing scalar in metrics_summary.json. CIs are deterministic within a run (same run_dir.name → byte-identical
bounds) and independent across runs.

Scope (purely additive — no existing field renamed or changed):

  • overall_scores.{COMPOSITE}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios} — for all 6 composites.
  • overall_scores.pass_k.{COMPOSITE}.{stat}_ci_{lower,upper} — for pass_at_1, pass_at_k, pass_power_k_observed (multi-trial only). pass_power_k_theoretical stays bare.
  • per_metric.{name}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}.
  • per_metric.{name}.pass_k.{stat}_ci_{lower,upper} (multi-trial only).

Statistical design (matches the paper-validated approach in analysis/eva-bench-stats/): percentile bootstrap on per-scenario sample means, N_BOOT=2000, ALPHA=0.05, seed derived from
run_dir.name via SHA-256 (cross-process-stable).

Commits:

  1. f6f3fae3 — Add src/eva/utils/bootstrap.py primitives module + 13 unit tests.
  2. d34ae593 — Emit composite CIs in aggregation.py + 15 tests; bump metrics_version 2.0.0 → 2.1.0.
  3. 45a7b009 — Emit per-metric CIs in runner.py; thread run_seed(run_dir.name) through _save_summary and run_aggregate_only; 6 tests.

lindsaydbrin and others added 3 commits May 29, 2026 11:59
New pure-Python module providing percentile bootstrap CI primitives:
bootstrap_resample, bootstrap_ci, assign_bootstrap_cis helper, plus a
SHA-stable run_seed for cross-process-deterministic per-run seeding.
Constants N_BOOT=2000, ALPHA=0.05, BASE_SEED=42. No eva imports — safe
to use from anywhere in the package.

13 unit tests cover the primitives plus a cross-process determinism
check that guards against accidental use of Python's salted hash().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 95% percentile-bootstrap CIs to every CI-bearing composite scalar
in the aggregation layer:

- compute_run_level_aggregates emits mean_ci_lower/upper/n_scenarios on
  every composite entry (pass/derived composites get the CI on mean;
  success_rate stays bare).
- _compute_aggregate_pass_k emits stat_ci_lower/upper for pass_at_1,
  pass_at_k, and pass_power_k_observed (theoretical stays bare as a
  deterministic transform).
- Both functions accept a seed kwarg threaded by the runner in a
  follow-up commit.

Bootstrap unit is the scenario, not the trial: two new private helpers
(_scenario_means_for_metric, _scenario_values_for_composite) collapse
multi-trial records to one value per scenario before resampling. For
k=1 runs each record is its own scenario.

Adds 15 unit tests across TestScenarioGrouping, TestRunLevelCompositeCIs,
and TestRunLevelPassKCIs covering field shape, point-estimate bracketing,
seed determinism, and null-CI handling for empty-data composites.

Bumps metrics_version 2.0.0 -> 2.1.0 (additive schema change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends MetricsRunner to populate the per-metric half of the CI schema
and wire the run-dependent seed end-to-end:

- _build_per_metric_aggregates emits mean_ci_lower/upper/n_scenarios
  on every per-metric entry and stat_ci_lower/upper inside pass_k
  sub-blocks (pass_at_1, pass_at_k, pass_power_k_observed).
- _save_summary and run_aggregate_only compute seed = run_seed(run_dir.name)
  once and thread it through both aggregators, so re-running aggregate-only
  on the same run yields byte-identical CIs and different runs get
  independent Monte-Carlo noise.

Per-metric aggregates reuse aggregation._scenario_means_for_metric to
collapse trials before bootstrapping; pass_k blocks share the
assign_bootstrap_cis helper with the composite path.

Adds 6 unit tests across TestPerMetricCIs and TestRunSeedIntegration
covering per-metric field shape, null-CI handling, same-seed
byte-identity, and across-run independence.

(metrics_version bumped to 2.1.0 in the preceding commit; --no-verify
used to skip the per-commit version-bump reminder.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant