Add threshold sweep tool for perturbation scoring by dangng2004 · Pull Request #90 · ChicagoHAI/OpenAIReview

dangng2004 · 2026-05-21T20:54:07Z

Summary

threshold_sweep.py recomputes detection recall at a range of fuzzy-coverage thresholds by reusing already-completed reviewer outputs
LLM-judge decisions are threshold-independent, so they're cached on disk: one judgment per (perturbation, comment) pair, then sweeping is a pure re-aggregation
Pairs with the --threshold / --substring-gate flags added in PR #87 (Fix substring-match scorer bug; add gate/threshold flags), so the operating point can be chosen from data rather than guessed
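
To make the "pure re-aggregation" point concrete, here is a minimal sketch of what a sweep over cached decisions could look like. It is not the code in threshold_sweep.py: `PairRecord`, `recall_at`, and `sweep` are hypothetical names, and the detection rule (a perturbation counts as detected when at least one comment both clears the coverage threshold and has a positive judge verdict) is an assumption about how the scorer combines the two signals.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class PairRecord:
    """One (perturbation, comment) pair. Hypothetical layout, not the PR's schema."""
    perturbation_id: str
    comment_id: str
    coverage: float    # fuzzy-coverage score, assumed to lie in [0, 1]
    judge_match: bool  # cached LLM-judge verdict; does not depend on the threshold


def recall_at(records: Iterable[PairRecord], all_ids: Set[str], threshold: float) -> float:
    """Fraction of perturbations detected at the given coverage threshold."""
    if not all_ids:
        return 0.0
    # Assumed rule: detected if any comment passes the threshold AND the judge agreed.
    detected = {
        r.perturbation_id
        for r in records
        if r.judge_match and r.coverage >= threshold
    }
    return len(detected & all_ids) / len(all_ids)


def sweep(records: List[PairRecord], all_ids: Set[str],
          thresholds: Iterable[float]) -> List[Tuple[float, float]]:
    """Re-aggregate the same cached records at each threshold; no LLM calls are made."""
    return [(t, recall_at(records, all_ids, t)) for t in thresholds]


if __name__ == "__main__":
    records = [
        PairRecord("p1", "c1", 0.90, True),
        PairRecord("p1", "c2", 0.40, True),
        PairRecord("p2", "c1", 0.60, True),
        PairRecord("p3", "c1", 0.95, False),  # high coverage, but the judge said no
    ]
    for threshold, recall in sweep(records, {"p1", "p2", "p3"}, [0.5, 0.7, 1.0]):
        print(f"{threshold:.2f},{recall:.3f}")
```

Passing `all_ids` separately keeps perturbations that drew no comments at all in the denominator, which a records-only count would silently drop.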

Test plan

Point at an existing results/<run>/<reviewer-model>/ tree and confirm it produces a CSV of recall-by-threshold
Confirm second run uses the cache and skips LLM-judge calls
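
Both checks in the test plan can be illustrated with a small sketch, assuming a JSON file as the on-disk cache and a two-column CSV as the output. The names `JudgeCache` and `write_sweep_csv`, the `perturbation::comment` key format, and the `threshold,recall` header are all hypothetical and may not match what threshold_sweep.py actually does.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


class JudgeCache:
    """Caches one judge verdict per (perturbation, comment) pair in a JSON file."""

    def __init__(self, path: Path, judge: Callable[[str, str], bool]) -> None:
        self.path = path
        self.judge = judge
        self.calls = 0  # number of real judge invocations made by this instance
        self.data: Dict[str, bool] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get(self, perturbation_id: str, comment_id: str) -> bool:
        key = f"{perturbation_id}::{comment_id}"  # hypothetical key format
        if key not in self.data:
            # Cache miss: call the judge once and persist the verdict immediately.
            self.calls += 1
            self.data[key] = bool(self.judge(perturbation_id, comment_id))
            self.path.write_text(json.dumps(self.data, indent=2, sort_keys=True))
        return self.data[key]


def write_sweep_csv(rows: Iterable[Tuple[float, float]], path: Path) -> None:
    """Write (threshold, recall) rows under an assumed two-column header."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["threshold", "recall"])
        writer.writerows(rows)
```

Constructing a second `JudgeCache` on the same file and asking for the same pairs should leave its `calls` counter at zero, which is one way to confirm the second run skipped the LLM judge.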

🤖 Generated with Claude Code


dangng2004 marked this pull request as draft May 21, 2026 20:58

dangng2004 wants to merge 1 commit into main from feat/threshold-sweep
