hinanohart/poolcheck

poolcheck

Combinatorial group testing for verifier-cost-limited reasoning.

When you verify a chain of reasoning, the expensive part is the verifier (an LLM judge, a process reward model, a test suite). poolcheck spends fewer verifier calls by pooling items into a single query — "does any step in this pool contain a mistake?" — and decoding the answers to localize the defective items, instead of checking each of the n items one at a time.

For k defectives among n items, the number of pooled queries can be ~k·log(n/k) instead of n, under an explicit noise model that this library makes you state.
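To get a feel for the scaling, the sketch below evaluates k·log2(n/k) for a few sizes. The base-2 logarithm and the helper name `pooled_query_scale` are assumptions made for illustration, not part of poolcheck. The `~` hides constant factors and noise, so read the output as an order of magnitude rather than a budget.

```python
import math

def pooled_query_scale(n: int, k: int) -> float:
    """Return k * log2(n / k): the scaling term only, constants omitted."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return k * math.log2(n / k)

# Compare the scaling term with the n queries a per-item check would spend.
for n, k in [(32, 1), (1024, 1), (1024, 8)]:
    scale = pooled_query_scale(n, k)
    print(f"n={n:5d} k={k}: k*log2(n/k) = {scale:5.1f}  vs per-item = {n}")
```

For n=32, k=1 the term is 5, yet the simulated headline table needs about 16 pooled queries to reach F1 = 0.990 even without noise. That gap is the constant factor the `~` leaves out.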

Status: v0.1.0a1 (pre-alpha). The decoders are correct and tested; the headline numbers come from a simulated noise model, not a deployed judge. See Claims and s0_report.json before relying on anything here.


Install

pip install poolcheck            # core (numpy + scipy only)
pip install 'poolcheck[hf]'      # + Hugging Face Inference verifier

Quickstart

import numpy as np
from poolcheck import ItemSet, NoiseChannel, SimulatedJudge, localize

# 8 chain-of-thought steps; the faulty one is step 5 (unknown to the decoder).
items = ItemSet.from_cot([f"step {i}" for i in range(8)])

# A judge that misses real errors 40% of the time and false-alarms 10%
# (a lenient operating point grounded in the LLM-as-judge literature).
noise = NoiseChannel(alpha_fa=0.10, beta_md=0.40)
judge = SimulatedJudge(truth={5}, noise=noise, n=8, rng=np.random.default_rng(0))

result = localize(items, judge, budget=12, noise=noise, k=1, rng=np.random.default_rng(0))
print(result.defectives)   # -> localized faulty step(s)

Measure the simulated budget → accuracy frontier from the CLI:

poolcheck frontier --n 32 --k 1 --alpha 0.1 --beta 0.4

How it works

  1. Design a pooling (test) matrix — which items go in which pooled query. Default is a near-constant-column-weight design (outperforms i.i.d. Bernoulli, arXiv:1612.07122); a deterministic Reed-Solomon (Kautz-Singleton) d-disjunct design is also provided.
  2. Query the verifier once per pool.
  3. Decode the defective set. Noiseless: COMP / DD / SCOMP. Noisy: a per-item separate-decoding log-likelihood-ratio decoder tuned to the channel's asymmetric false-alarm (alpha_fa) and missed-detect (beta_md) rates. A threshold trades precision for recall.
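The three steps above can be sketched in plain NumPy. This is an illustration under stated assumptions, not poolcheck's implementation: it uses an i.i.d. Bernoulli design (simpler than the near-constant-column-weight default), every function name is hypothetical, and the log-likelihood-ratio scorer assumes exactly one defective with both noise rates strictly between 0 and 1.

```python
import numpy as np

def random_design(n_tests, n_items, p, rng):
    """Boolean pooling matrix: entry [t, i] is True when item i is in pool t."""
    return rng.random((n_tests, n_items)) < p

def ask_pools(design, truth, alpha_fa, beta_md, rng):
    """One yes/no answer per pool from a simulated judge with asymmetric noise."""
    has_defect = design[:, sorted(truth)].any(axis=1)
    p_yes = np.where(has_defect, 1.0 - beta_md, alpha_fa)
    return rng.random(design.shape[0]) < p_yes

def comp_decode(design, answers):
    """Noiseless COMP: clear every item seen in a negative pool, keep the rest."""
    cleared = design[~answers].any(axis=0)
    return set(np.flatnonzero(~cleared).tolist())

def llr_decode(design, answers, p, alpha_fa, beta_md, threshold=0.0):
    """Score each item separately (assumes one defective, 0 < alpha_fa, beta_md < 1)."""
    # P(yes | item i is the defective): depends only on whether i is in the pool.
    yes_h1 = np.where(design, 1.0 - beta_md, alpha_fa)
    # P(yes | item i is clean): the one defective is in the pool with probability p.
    yes_h0 = p * (1.0 - beta_md) + (1.0 - p) * alpha_fa
    lik_h1 = np.where(answers[:, None], yes_h1, 1.0 - yes_h1)
    lik_h0 = np.where(answers, yes_h0, 1.0 - yes_h0)[:, None]
    scores = np.log(lik_h1 / lik_h0).sum(axis=0)
    # Raising the threshold favors precision; lowering it favors recall.
    return set(np.flatnonzero(scores > threshold).tolist())

rng = np.random.default_rng(0)
n_items, truth, p = 32, {5}, 0.5

# Noiseless case: 16 pools. COMP never drops a true defective but may keep extras.
small = random_design(16, n_items, p, rng)
comp_set = comp_decode(small, ask_pools(small, truth, 0.0, 0.0, rng))

# Noisy case: deliberately many pools so the illustration is stable.
big = random_design(600, n_items, p, rng)
noisy_answers = ask_pools(big, truth, 0.10, 0.40, rng)
llr_set = llr_decode(big, noisy_answers, p, 0.10, 0.40)

print("COMP candidates:", sorted(comp_set))
print("LLR decision:", sorted(llr_set))
```

The noisy run uses far more pools than items only to keep the demo deterministic in spirit; it says nothing about the budgets poolcheck needs in practice.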

The core (design, decode, adaptive, noise, frontier) never imports a verifier and never touches the network — it is fully deterministic given a seed.

Headline (simulated)

Generated by scripts/measure.py (seed=0, n=32, k=1, 300 trials). Simulated noise model only — not a deployed-judge benchmark.

noiseless (alpha_fa=0.0, beta_md=0.0)
Per-item baseline: 32 queries, F1=1.000 (95% CI [1.000, 1.000])

| pooled queries | F1 | 95% CI | vs per-item baseline |
|---|---|---|---|
| 5 | 0.027 | [0.010, 0.047] | worse |
| 8 | 0.323 | [0.270, 0.377] | worse |
| 12 | 0.923 | [0.893, 0.950] | worse |
| 16 | 0.990 | [0.977, 1.000] | matches (fewer queries) |
| 24 | 1.000 | [1.000, 1.000] | matches (fewer queries) |

lenient_judge (alpha_fa=0.1, beta_md=0.4)
Per-item baseline: 32 queries, F1=0.264 (95% CI [0.236, 0.292])

| pooled queries | F1 | 95% CI | vs per-item baseline |
|---|---|---|---|
| 5 | 0.100 | [0.067, 0.133] | worse |
| 8 | 0.130 | [0.093, 0.170] | worse |
| 12 | 0.270 | [0.220, 0.323] | matches (fewer queries) |
| 16 | 0.387 | [0.330, 0.443] | better (fewer queries) |
| 24 | 0.580 | [0.523, 0.637] | better (fewer queries) |

"vs per-item baseline" compares the pooled F1 bootstrap CI to the baseline CI (which uses n per-item queries): non-overlapping above = better, overlapping = statistically indistinguishable (matches), non-overlapping below = worse. Pooled always uses fewer queries than the baseline.

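The comparison rule for the "vs per-item baseline" column can be written as a short function. This is a sketch of the rule as described, with a hypothetical name `compare_to_baseline`; treating intervals that touch at an endpoint as overlapping is an assumption, chosen because it agrees with the noiseless rows where [0.977, 1.000] against [1.000, 1.000] is labeled "matches".

```python
def compare_to_baseline(pooled_ci, baseline_ci):
    """Classify a pooled F1 interval against the per-item baseline interval."""
    p_lo, p_hi = pooled_ci
    b_lo, b_hi = baseline_ci
    if p_lo > b_hi:      # pooled interval lies entirely above the baseline
        return "better"
    if p_hi < b_lo:      # pooled interval lies entirely below the baseline
        return "worse"
    return "matches"     # intervals overlap: statistically indistinguishable

# Intervals copied from the lenient_judge table above.
baseline = (0.236, 0.292)
rows = [(8, (0.093, 0.170)), (12, (0.220, 0.323)), (16, (0.330, 0.443))]
for budget, ci in rows:
    print(budget, compare_to_baseline(ci, baseline))
# prints: 8 worse, 12 matches, 16 better (one per line)
```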
Claims and non-claims

CLAIM. poolcheck decodes a defective set (e.g. the first faulty CoT step) from ~k·log(n/k) pooled verifier queries instead of n per-item queries, under an explicit asymmetric false-alarm / missed-detect noise model. The headline budget → accuracy frontier is computed with a simulated judge oracle (deterministic, no API key, reproducible via scripts/measure.py); its noise parameters are grounded in published single-item LLM-judge error rates (see s0_report.json).

NON-CLAIMS.

  • No speedup is claimed for any specific deployed model. Every headline number is from the simulated noise model above, not a live-judge benchmark.
  • The "1-good-among-N" regime is explicitly out of scope. Picking the single good answer out of N candidates is the k = N-1 regime, where group testing provably cannot beat individual testing (Θ(n) tests when k = ω(n/log n), arXiv:2006.01325, arXiv:2106.06878). ItemSet.from_candidates is a k << N shortlist-narrowing convenience only — it is not the headline.
  • poolcheck is not a process reward model. It does not score steps; it localizes defectives. ItemSet.priors is an unused experimental seam (off by default); supplying priors lets a downstream PRM bias the decoder, but that path is unbenchmarked here.
  • The pooled-query premise is unverified in this release. The simulated noise uses single-item literature rates and assumes they also hold for pooled queries. Whether pooling degrades a real judge's FA/MD (residual risk #1) is OPEN — see below.

Did pooling break my judge? (s0-gate)

The one empirical question poolcheck cannot answer for you: does asking your judge "are any of these N steps wrong?" make it noticeably worse than asking about one step at a time? Measure it on your own judge:

poolcheck s0-gate --cases your_cases.json --verifier hf:Qwen/Qwen2.5-7B-Instruct \
    --pool-sizes 4 8

PASS (pooled FA/MD ≤ 1.5× single, with bootstrap CIs) means the simulated advantage should transfer. FAIL means it may not. This build ships the tool but did not run it against a live judge (no inference token was available in the build environment); see s0_report.json.
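As a rough picture of what the 1.5× criterion measures, the sketch below computes the pooled-to-single error-rate ratio with a percentile bootstrap interval. It is not the `s0-gate` implementation: the function name, the 0/1 error encoding, and the choice to apply the 1.5 limit to the point estimate while only reporting the interval are all assumptions. Run it once for false alarms and once for missed detections.

```python
import numpy as np

def error_rate_ratio(single_errors, pooled_errors, n_boot=2000, seed=0):
    """Return (pooled rate / single rate, 95% percentile-bootstrap CI of the ratio).

    Each input is a sequence of 0/1 flags, one per judge query (1 = judge erred).
    """
    single = np.asarray(single_errors, dtype=float)
    pooled = np.asarray(pooled_errors, dtype=float)
    if single.mean() == 0:
        raise ValueError("single-item error rate is zero; the ratio is undefined")
    rng = np.random.default_rng(seed)
    # Resample each set of queries with replacement and recompute its error rate.
    s = single[rng.integers(0, single.size, size=(n_boot, single.size))].mean(axis=1)
    q = pooled[rng.integers(0, pooled.size, size=(n_boot, pooled.size))].mean(axis=1)
    ok = s > 0  # drop resamples where the ratio is undefined
    lo, hi = np.percentile(q[ok] / s[ok], [2.5, 97.5])
    return pooled.mean() / single.mean(), (float(lo), float(hi))

# Made-up example data: 10% false alarms single-item, 12% pooled.
single_fa = [1] * 10 + [0] * 90
pooled_fa = [1] * 12 + [0] * 88
ratio, ci = error_rate_ratio(single_fa, pooled_fa)
passes = ratio <= 1.5  # assumed reading of "pooled FA/MD <= 1.5x single"
print(f"FA ratio {ratio:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], pass={passes}")
```

With small case files the interval is wide, so a point estimate under 1.5 can still sit next to an upper bound well above it.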

Public API

Verifier · ItemSet · localize · Strategy · frontier (plus supporting types NoiseChannel, SimulatedJudge, DeterministicJudge, LocalizeResult).

License

MIT. © 2026 the poolcheck authors.
