Combinatorial group testing for verifier-cost-limited reasoning.
When you verify a chain of reasoning, the expensive part is the verifier (an
LLM judge, a process reward model, a test suite). poolcheck spends fewer verifier
calls by pooling items into a single query — "does any step in this pool
contain a mistake?" — and decoding the answers to localize the defective
items, instead of checking each of the n items one at a time.
For k defectives among n items, the number of pooled queries can be
~k·log(n/k) instead of n — under an explicit noise model that this library
makes you state.
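To make the scaling concrete, the sketch below compares the per-item cost n with k·log(n/k) and with the noiseless counting bound ⌈log2 C(n, k)⌉. The base-2 logarithm and both function names are assumptions made for illustration; they are not part of the poolcheck API, and the figures are noiseless reference points rather than guarantees.

```python
import math

def k_log_n_over_k(n: int, k: int) -> float:
    """Scaling term k * log2(n / k); base 2 is an assumption for illustration."""
    return k * math.log2(n / k)

def counting_bound(n: int, k: int) -> int:
    """Fewest yes/no answers that can distinguish all C(n, k) defective sets (noiseless)."""
    return math.ceil(math.log2(math.comb(n, k)))

for n, k in [(8, 1), (32, 1), (1024, 8)]:
    print(
        f"n={n:5d} k={k}: per-item={n:5d}  "
        f"k*log2(n/k)={k_log_n_over_k(n, k):6.1f}  "
        f"counting bound={counting_bound(n, k)}"
    )
```

For n=32, k=1 both figures come to 5 queries. A random non-adaptive design under noise needs more than that, which is why the budget is a parameter rather than a constant.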
Status:
v0.1.0a1 (pre-alpha). The decoders are correct and tested; the headline numbers come from a simulated noise model, not a deployed judge. See Claims and `s0_report.json` before relying on anything here.
pip install poolcheck # core (numpy + scipy only)
pip install 'poolcheck[hf]' # + Hugging Face Inference verifier

import numpy as np
from poolcheck import ItemSet, NoiseChannel, SimulatedJudge, localize
# 8 chain-of-thought steps; the faulty one is step 5 (unknown to the decoder).
items = ItemSet.from_cot([f"step {i}" for i in range(8)])
# A judge that misses real errors 40% of the time and false-alarms 10%
# (a lenient operating point grounded in the LLM-as-judge literature).
noise = NoiseChannel(alpha_fa=0.10, beta_md=0.40)
judge = SimulatedJudge(truth={5}, noise=noise, n=8, rng=np.random.default_rng(0))
result = localize(items, judge, budget=12, noise=noise, k=1, rng=np.random.default_rng(0))
print(result.defectives) # -> localized faulty step(s)

Measure the simulated budget → accuracy frontier from the CLI:
poolcheck frontier --n 32 --k 1 --alpha 0.1 --beta 0.4

- Design a pooling (test) matrix: which items go in which pooled query.
  Default is a near-constant-column-weight design (outperforms i.i.d. Bernoulli,
  arXiv:1612.07122); a deterministic Reed-Solomon (Kautz-Singleton)
  d-disjunct design is also provided.
- Query the verifier once per pool.
- Decode the defective set. Noiseless: COMP / DD / SCOMP. Noisy: a per-item
  separate-decoding log-likelihood-ratio decoder tuned to the channel's
  asymmetric false-alarm (`alpha_fa`) and missed-detect (`beta_md`) rates. A threshold trades precision for recall.
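The three steps above can be sketched end to end for the noiseless case with plain NumPy. This is an illustration, not poolcheck's implementation: all function names are hypothetical, the design uses an exact column weight sampled without replacement (a simplification of the near-constant-column-weight design), and the decoder is textbook COMP, which clears every item that appears in a negative pool and reports the rest.

```python
import numpy as np

def constant_column_weight_design(n_items, n_tests, weight, rng):
    """Each item joins exactly `weight` distinct pools, chosen at random."""
    A = np.zeros((n_tests, n_items), dtype=bool)
    for j in range(n_items):
        rows = rng.choice(n_tests, size=weight, replace=False)
        A[rows, j] = True
    return A

def run_pools(A, defectives):
    """Noiseless verifier: a pool is positive iff it contains a defective item."""
    x = np.zeros(A.shape[1], dtype=bool)
    x[list(defectives)] = True
    return (A & x).any(axis=1)

def comp_decode(A, outcomes):
    """COMP: any item seen in a negative pool is clean; report everything else."""
    in_negative_pool = A[~outcomes].any(axis=0)
    return set(np.flatnonzero(~in_negative_pool).tolist())

rng = np.random.default_rng(0)
A = constant_column_weight_design(n_items=8, n_tests=6, weight=3, rng=rng)
outcomes = run_pools(A, defectives={5})
decoded = comp_decode(A, outcomes)
print(sorted(decoded))  # contains 5; may also contain false positives
```

In the noiseless case COMP never misses a true defective, so the result is a superset of the truth; DD and SCOMP refine that superset, and none of this carries over unchanged once the verifier is noisy.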
The core (design, decode, adaptive, noise, frontier) never imports a
verifier and never touches the network — it is fully deterministic given a seed.
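As an illustration of a verifier-free, seed-deterministic noisy decode, here is a minimal per-item log-likelihood-ratio sketch. It is not poolcheck's decoder: the function names are hypothetical, the pools are i.i.d. Bernoulli for brevity, and the score assumes the other items in a pool are independently defective with probability `prior`.

```python
import numpy as np

def simulate_outcomes(A, truth, alpha_fa, beta_md, rng):
    """Noisy OR channel: clean pools fire w.p. alpha_fa, defective pools are missed w.p. beta_md."""
    x = np.zeros(A.shape[1], dtype=bool)
    x[list(truth)] = True
    has_defect = (A & x).any(axis=1)
    u = rng.random(A.shape[0])
    return np.where(has_defect, u >= beta_md, u < alpha_fa)

def separate_llr(A, y, alpha_fa, beta_md, prior):
    """Per-item score: sum over pools containing the item of log P(y | defective) / P(y | clean)."""
    if not (0 < alpha_fa < 1 and 0 < beta_md < 1):
        raise ValueError("this sketch needs 0 < alpha_fa, beta_md < 1")
    A = np.asarray(A, dtype=bool)
    y = np.asarray(y, dtype=bool)
    pool_sizes = A.sum(axis=1)
    # Chance that some *other* item makes the pool defective (i.i.d. prior assumption).
    p_other = 1.0 - (1.0 - prior) ** np.maximum(pool_sizes - 1, 0)
    p_pos_clean = p_other * (1 - beta_md) + (1 - p_other) * alpha_fa  # P(y=1 | item clean)
    p_pos_def = 1 - beta_md                                            # P(y=1 | item defective)
    per_test = np.where(
        y,
        np.log(p_pos_def / p_pos_clean),
        np.log(beta_md / (1 - p_pos_clean)),
    )
    return A.T.astype(float) @ per_test  # pools not containing the item contribute 0

def run(seed):
    rng = np.random.default_rng(seed)
    A = rng.random((12, 8)) < 0.3  # 12 pools over 8 items
    y = simulate_outcomes(A, {5}, alpha_fa=0.10, beta_md=0.40, rng=rng)
    return separate_llr(A, y, alpha_fa=0.10, beta_md=0.40, prior=1 / 8)

scores = run(0)
print(np.argsort(scores)[::-1][:3])  # highest-scoring candidates first
```

Items whose score exceeds a chosen threshold are reported as defective; raising the threshold favors precision, lowering it favors recall. Calling `run` twice with the same seed returns identical scores.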
Generated by scripts/measure.py (seed=0, n=32, k=1, 300 trials). Simulated noise model only — not a deployed-judge benchmark.
noiseless (alpha_fa=0.0, beta_md=0.0)
Per-item baseline: 32 queries, F1=1.000 (95% CI [1.000, 1.000])
| pooled queries | F1 | 95% CI | vs per-item baseline |
|---|---|---|---|
| 5 | 0.027 | [0.010, 0.047] | worse |
| 8 | 0.323 | [0.270, 0.377] | worse |
| 12 | 0.923 | [0.893, 0.950] | worse |
| 16 | 0.990 | [0.977, 1.000] | matches (fewer queries) |
| 24 | 1.000 | [1.000, 1.000] | matches (fewer queries) |
lenient_judge (alpha_fa=0.1, beta_md=0.4)
Per-item baseline: 32 queries, F1=0.264 (95% CI [0.236, 0.292])
| pooled queries | F1 | 95% CI | vs per-item baseline |
|---|---|---|---|
| 5 | 0.100 | [0.067, 0.133] | worse |
| 8 | 0.130 | [0.093, 0.170] | worse |
| 12 | 0.270 | [0.220, 0.323] | matches (fewer queries) |
| 16 | 0.387 | [0.330, 0.443] | better (fewer queries) |
| 24 | 0.580 | [0.523, 0.637] | better (fewer queries) |
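The last column of these tables is an interval comparison between each pooled CI and the per-item baseline CI. A minimal sketch of that rule, checked against three lenient_judge rows; the function name and the choice to treat touching intervals as overlapping are assumptions:

```python
def compare_to_baseline(pooled_ci, baseline_ci):
    """Label a pooled CI relative to the baseline CI by interval overlap."""
    pooled_lo, pooled_hi = pooled_ci
    base_lo, base_hi = baseline_ci
    if pooled_lo > base_hi:
        return "better"   # pooled interval lies entirely above the baseline
    if pooled_hi < base_lo:
        return "worse"    # pooled interval lies entirely below the baseline
    return "matches"      # intervals overlap (or touch)

baseline = (0.236, 0.292)  # lenient_judge per-item CI
rows = [(8, (0.093, 0.170)), (12, (0.220, 0.323)), (16, (0.330, 0.443))]
for budget, ci in rows:
    print(budget, compare_to_baseline(ci, baseline))
# 8 worse
# 12 matches
# 16 better
```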
"vs per-item baseline" compares the pooled F1 bootstrap CI to the baseline CI (which uses n per-item queries): non-overlapping above = better, overlapping = statistically indistinguishable (matches), non-overlapping below = worse. Pooled always uses fewer queries than the baseline.
CLAIM. poolcheck decodes a defective set (e.g. the first faulty CoT step)
from ~k·log(n/k) pooled verifier queries instead of n per-item queries,
under an explicit asymmetric false-alarm / missed-detect noise model. The
headline budget → accuracy frontier is computed with a simulated judge oracle
(deterministic, no API key, reproducible via scripts/measure.py); its noise
parameters are grounded in published single-item LLM-judge error rates
(see s0_report.json).
NON-CLAIMS.
- No speedup is claimed for any specific deployed model. Every headline number is from the simulated noise model above, not a live-judge benchmark.
- The "1-good-among-N" regime is explicitly out of scope. Picking the single
  good answer out of `N` candidates is the `k = N-1` regime, where group testing provably cannot beat individual testing (`Θ(n)` tests when `k = ω(n/log n)`, arXiv:2006.01325, arXiv:2106.06878). `ItemSet.from_candidates` is a `k << N` shortlist-narrowing convenience only; it is not the headline.
- poolcheck is not a process reward model. It does not score steps; it
  localizes defectives. `ItemSet.priors` is an unused experimental seam (off by default); supplying priors lets a downstream PRM bias the decoder, but that path is unbenchmarked here.
- The pooled-query premise is unverified in this release. The simulated noise uses single-item literature rates and assumes they also hold for pooled queries. Whether pooling degrades a real judge's FA/MD (residual risk #1) is OPEN; see below.
The one empirical question poolcheck cannot answer for you: does asking your judge "are any of these N steps wrong?" make it noticeably worse than asking about one step at a time? Measure it on your own judge:
poolcheck s0-gate --cases your_cases.json --verifier hf:Qwen/Qwen2.5-7B-Instruct \
  --pool-sizes 4 8

PASS (pooled FA/MD ≤ 1.5× single, with bootstrap CIs) means the simulated
advantage should transfer. FAIL means it may not. This build ships the tool but
did not run it against a live judge (no inference token was available in the
build environment); see s0_report.json.
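The PASS criterion can be approximated by hand on point estimates. The sketch below is an assumption-laden illustration: the function names and the `(truth, verdict)` input shape are hypothetical, and it ignores the bootstrap CIs, because how `s0-gate` folds them into the decision is not described here.

```python
def error_rates(truths, verdicts):
    """truths[i]: the queried unit really contains an error; verdicts[i]: the judge flagged it."""
    pairs = list(zip(truths, verdicts))
    negatives = sum(1 for t, _ in pairs if not t)
    positives = sum(1 for t, _ in pairs if t)
    false_alarms = sum(1 for t, v in pairs if v and not t)
    misses = sum(1 for t, v in pairs if t and not v)
    fa = false_alarms / negatives if negatives else 0.0
    md = misses / positives if positives else 0.0
    return fa, md

def gate_passes(single, pooled, max_ratio=1.5):
    """Point-estimate check: pooled FA and MD each within max_ratio of the single-item rates."""
    (fa_single, md_single), (fa_pooled, md_pooled) = single, pooled
    return fa_pooled <= max_ratio * fa_single and md_pooled <= max_ratio * md_single

single = (0.10, 0.40)  # (false-alarm rate, missed-detect rate) for single-item queries
print(gate_passes(single, pooled=(0.12, 0.55)))  # True: both within 1.5x
print(gate_passes(single, pooled=(0.20, 0.45)))  # False: false alarms doubled
```

With small case sets the point estimates are noisy, which is presumably why the tool reports bootstrap CIs alongside the ratio.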
Verifier · ItemSet · localize · Strategy · frontier
(plus supporting types NoiseChannel, SimulatedJudge, DeterministicJudge,
LocalizeResult).
MIT. © 2026 the poolcheck authors.