Add confidence intervals to main code#130
Draft
lindsaydbrin wants to merge 3 commits into
Draft
Conversation
New pure-Python module providing percentile bootstrap CI primitives: bootstrap_resample, bootstrap_ci, assign_bootstrap_cis helper, plus a SHA-stable run_seed for cross-process-deterministic per-run seeding. Constants N_BOOT=2000, ALPHA=0.05, BASE_SEED=42. No eva imports — safe to use from anywhere in the package. 13 unit tests cover the primitives plus a cross-process determinism check that guards against accidental use of Python's salted hash(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 95% percentile-bootstrap CIs to every CI-bearing composite scalar in the aggregation layer: - compute_run_level_aggregates emits mean_ci_lower/upper/n_scenarios on every composite entry (pass/derived composites get the CI on mean; success_rate stays bare). - _compute_aggregate_pass_k emits stat_ci_lower/upper for pass_at_1, pass_at_k, and pass_power_k_observed (theoretical stays bare as a deterministic transform). - Both functions accept a seed kwarg threaded by the runner in a follow-up commit. Bootstrap unit is the scenario, not the trial: two new private helpers (_scenario_means_for_metric, _scenario_values_for_composite) collapse multi-trial records to one value per scenario before resampling. For k=1 runs each record is its own scenario. Adds 15 unit tests across TestScenarioGrouping, TestRunLevelCompositeCIs, and TestRunLevelPassKCIs covering field shape, point-estimate bracketing, seed determinism, and null-CI handling for empty-data composites. Bumps metrics_version 2.0.0 -> 2.1.0 (additive schema change). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends MetricsRunner to populate the per-metric half of the CI schema and wire the run-dependent seed end-to-end: - _build_per_metric_aggregates emits mean_ci_lower/upper/n_scenarios on every per-metric entry and stat_ci_lower/upper inside pass_k sub-blocks (pass_at_1, pass_at_k, pass_power_k_observed). - _save_summary and run_aggregate_only compute seed = run_seed(run_dir.name) once and thread it through both aggregators, so re-running aggregate-only on the same run yields byte-identical CIs and different runs get independent Monte-Carlo noise. Per-metric aggregates reuse aggregation._scenario_means_for_metric to collapse trials before bootstrapping; pass_k blocks share the assign_bootstrap_cis helper with the composite path. Adds 6 unit tests across TestPerMetricCIs and TestRunSeedIntegration covering per-metric field shape, null-CI handling, same-seed byte-identity, and across-run independence. (metrics_version bumped to 2.1.0 in the preceding commit; --no-verify used to skip the per-commit version-bump reminder.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Claude-generated summary to be revised later:
Every
eva metricsrun now emits 95% bootstrap confidence intervals on every CI-bearing scalar inmetrics_summary.json. CIs are deterministic within a run (samerun_dir.name→ byte-identicalbounds) and independent across runs.
Scope (purely additive — no existing field renamed or changed):
overall_scores.{COMPOSITE}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}— for all 6 composites.overall_scores.pass_k.{COMPOSITE}.{stat}_ci_{lower,upper}— forpass_at_1,pass_at_k,pass_power_k_observed(multi-trial only).pass_power_k_theoreticalstays bare.per_metric.{name}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios}.per_metric.{name}.pass_k.{stat}_ci_{lower,upper}(multi-trial only).Statistical design (matches the paper-validated approach in
analysis/eva-bench-stats/): percentile bootstrap on per-scenario sample means,N_BOOT=2000,ALPHA=0.05, seed derived fromrun_dir.namevia SHA-256 (cross-process-stable).Commits:
f6f3fae3— Addsrc/eva/utils/bootstrap.pyprimitives module + 13 unit tests.d34ae593— Emit composite CIs inaggregation.py+ 15 tests; bumpmetrics_version2.0.0 → 2.1.0.45a7b009— Emit per-metric CIs inrunner.py; threadrun_seed(run_dir.name)through_save_summaryandrun_aggregate_only; 6 tests.