poolcheck

Combinatorial group testing for verifier-cost-limited reasoning.

When you verify a chain of reasoning, the expensive part is the verifier (an LLM judge, a process reward model, a test suite). poolcheck spends fewer verifier calls by pooling items into a single query — "does any step in this pool contain a mistake?" — and decoding the answers to localize the defective items, instead of checking each of the n items one at a time.

For k defectives among n items, the number of pooled queries can scale as ~k·log(n/k) rather than n, under an explicit noise model that the library requires you to state.
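As a rough sanity check on that scaling: T yes/no answers can distinguish at most 2^T candidate defective sets, so at least ceil(log2 C(n, k)) queries are needed even with a noiseless verifier. The sketch below compares that counting bound with k·log2(n/k) and with the n per-item queries. The helper name `counting_bound` is hypothetical and not part of poolcheck.

```python
import math


def counting_bound(n: int, k: int) -> int:
    """Smallest T with 2**T >= C(n, k), computed in exact integer arithmetic."""
    return (math.comb(n, k) - 1).bit_length()


for n, k in [(8, 1), (32, 1), (32, 2), (1024, 4)]:
    approx = k * math.log2(n / k)
    print(
        f"n={n:5d} k={k}  counting bound={counting_bound(n, k):3d}  "
        f"k*log2(n/k)={approx:5.1f}  per-item queries={n}"
    )
```

For n=32, k=1 the bound is 5 queries, which is also the smallest budget in the headline tables. This is a lower bound only: practical designs, and noisy judges in particular, need more queries than this.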

Status: v0.1.0a1 (pre-alpha). The decoders are correct and tested; the headline numbers come from a simulated noise model, not a deployed judge. See "Claims and non-claims" and s0_report.json before relying on anything here.

Install

pip install poolcheck            # core (numpy + scipy only)
pip install 'poolcheck[hf]'      # + Hugging Face Inference verifier

Quickstart

import numpy as np
from poolcheck import ItemSet, NoiseChannel, SimulatedJudge, localize

# 8 chain-of-thought steps; the faulty one is step 5 (unknown to the decoder).
items = ItemSet.from_cot([f"step {i}" for i in range(8)])

# A judge that misses real errors 40% of the time and false-alarms 10%
# (a lenient operating point grounded in the LLM-as-judge literature).
noise = NoiseChannel(alpha_fa=0.10, beta_md=0.40)
judge = SimulatedJudge(truth={5}, noise=noise, n=8, rng=np.random.default_rng(0))

result = localize(items, judge, budget=12, noise=noise, k=1, rng=np.random.default_rng(0))
print(result.defectives)   # -> localized faulty step(s)

Measure the simulated budget → accuracy frontier from the CLI:

poolcheck frontier --n 32 --k 1 --alpha 0.1 --beta 0.4

How it works

1. Design a pooling (test) matrix: which items go in which pooled query. Default is a near-constant-column-weight design (outperforms i.i.d. Bernoulli, arXiv:1612.07122); a deterministic Reed-Solomon (Kautz-Singleton) d-disjunct design is also provided.
2. Query the verifier once per pool.
3. Decode the defective set. Noiseless: COMP / DD / SCOMP. Noisy: a per-item separate-decoding log-likelihood-ratio decoder tuned to the channel's asymmetric false-alarm (alpha_fa) and missed-detect (beta_md) rates. A threshold trades precision for recall.
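The three steps above can be sketched in plain NumPy for the noiseless case. This is an illustrative stand-in, not poolcheck's implementation: the function names are hypothetical, the design places each item in exactly `weight` pools (one simple constant-column-weight variant), and the decoder is textbook COMP.

```python
import numpy as np


def constant_weight_design(n_items, n_tests, weight, rng):
    """Boolean matrix A[t, i]: True if item i is in pool t. Each item joins `weight` pools."""
    A = np.zeros((n_tests, n_items), dtype=bool)
    for i in range(n_items):
        pools = rng.choice(n_tests, size=weight, replace=False)
        A[pools, i] = True
    return A


def query_noiseless(A, truth):
    """A pool answers positive iff it contains at least one defective item."""
    x = np.zeros(A.shape[1], dtype=bool)
    x[list(truth)] = True
    return (A & x).any(axis=1)


def comp_decode(A, y):
    """COMP: clear every item that appears in a negative pool; keep the rest."""
    cleared = A[~y].any(axis=0)
    return set(np.flatnonzero(~cleared).tolist())


rng = np.random.default_rng(0)
A = constant_weight_design(n_items=8, n_tests=6, weight=3, rng=rng)
truth = {5}
y = query_noiseless(A, truth)
decoded = comp_decode(A, y)
print(sorted(decoded))  # always contains 5; may also contain items no negative pool cleared
```

With a noiseless verifier COMP never drops a true defective, because a defective item can only appear in positive pools. It can keep extra items, which is what DD and SCOMP are designed to reduce.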

The core (design, decode, adaptive, noise, frontier) never imports a verifier and never touches the network — it is fully deterministic given a seed.
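To make the noisy decoding step concrete, here is a minimal sketch of a per-item log-likelihood-ratio score under an asymmetric channel. It rests on assumptions that are mine rather than the library's: each item is defective independently with prior k/n, pools not containing an item contribute nothing to its score, and both error rates lie strictly between 0 and 1. The function name `llr_scores` is hypothetical.

```python
import numpy as np


def llr_scores(A, y, alpha_fa, beta_md, k):
    """Sum, over the pools containing each item, of log P(y | defective) / P(y | clean)."""
    A = np.asarray(A, dtype=float)  # A[t, i] = 1 if item i is in pool t
    y = np.asarray(y, dtype=bool)   # y[t] = True if the judge flagged pool t
    n = A.shape[1]
    pool_size = A.sum(axis=1)
    # Chance that some OTHER item in the pool is defective, under the k/n prior.
    q = 1.0 - (1.0 - k / n) ** np.maximum(pool_size - 1.0, 0.0)
    p_pos_if_defective = 1.0 - beta_md
    p_pos_if_clean = q * (1.0 - beta_md) + (1.0 - q) * alpha_fa
    llr_pos = np.log(p_pos_if_defective / p_pos_if_clean)
    llr_neg = np.log((1.0 - p_pos_if_defective) / (1.0 - p_pos_if_clean))
    per_pool = np.where(y, llr_pos, llr_neg)
    return per_pool @ A


# Four items, four pools of size two; pools 0 and 1 were flagged.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])
y = [True, True, False, False]
scores = llr_scores(A, y, alpha_fa=0.10, beta_md=0.40, k=1)

prior = 1 / 4
threshold = np.log((1 - prior) / prior)  # offset the prior odds; raise it for more precision
flagged = np.flatnonzero(scores > threshold)
print(scores.round(3), flagged)
```

Item 1 sits in both flagged pools and gets the highest score. Raising the threshold flags fewer items (higher precision) and lowering it flags more (higher recall).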

Headline (simulated)

Generated by scripts/measure.py (seed=0, n=32, k=1, 300 trials). Simulated noise model only — not a deployed-judge benchmark.

noiseless (alpha_fa=0.0, beta_md=0.0)
Per-item baseline: 32 queries, F1=1.000 (95% CI [1.000, 1.000])

pooled queries	F1	95% CI	vs per-item baseline
5	0.027	[0.010, 0.047]	worse
8	0.323	[0.270, 0.377]	worse
12	0.923	[0.893, 0.950]	worse
16	0.990	[0.977, 1.000]	matches (fewer queries)
24	1.000	[1.000, 1.000]	matches (fewer queries)

lenient_judge (alpha_fa=0.1, beta_md=0.4)
Per-item baseline: 32 queries, F1=0.264 (95% CI [0.236, 0.292])

pooled queries	F1	95% CI	vs per-item baseline
5	0.100	[0.067, 0.133]	worse
8	0.130	[0.093, 0.170]	worse
12	0.270	[0.220, 0.323]	matches (fewer queries)
16	0.387	[0.330, 0.443]	better (fewer queries)
24	0.580	[0.523, 0.637]	better (fewer queries)

"vs per-item baseline" compares the pooled F1 bootstrap CI to the baseline CI (which uses n per-item queries): non-overlapping above = better, overlapping = statistically indistinguishable (matches), non-overlapping below = worse. Pooled always uses fewer queries than the baseline.
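The comparison rule reduces to an interval-overlap check. The sketch below is one way to write it and is not taken from scripts/measure.py: the function name is hypothetical, and it assumes that intervals touching at an endpoint count as overlapping.

```python
def compare_to_baseline(pooled_ci, baseline_ci):
    """Classify a pooled F1 confidence interval against the per-item baseline interval."""
    lo, hi = pooled_ci
    base_lo, base_hi = baseline_ci
    if lo > base_hi:
        return "better"   # pooled interval lies entirely above the baseline
    if hi < base_lo:
        return "worse"    # pooled interval lies entirely below the baseline
    return "matches"      # intervals overlap: statistically indistinguishable


# Rows from the lenient_judge table (baseline CI [0.236, 0.292]).
baseline = (0.236, 0.292)
for budget, ci in [(5, (0.067, 0.133)), (12, (0.220, 0.323)), (16, (0.330, 0.443))]:
    print(budget, compare_to_baseline(ci, baseline))
```

Applied to the rows of both tables, this rule gives the same labels as the "vs per-item baseline" column.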

Claims and non-claims

CLAIM. poolcheck decodes a defective set (e.g. the first faulty CoT step) from ~k·log(n/k) pooled verifier queries instead of n per-item queries, under an explicit asymmetric false-alarm / missed-detect noise model. The headline budget → accuracy frontier is computed with a simulated judge oracle (deterministic, no API key, reproducible via scripts/measure.py); its noise parameters are grounded in published single-item LLM-judge error rates (see s0_report.json).

NON-CLAIMS.

No speedup is claimed for any specific deployed model. Every headline number is from the simulated noise model above, not a live-judge benchmark.
The "1-good-among-N" regime is explicitly out of scope. Picking the single good answer out of N candidates is the k = N-1 regime, where group testing provably cannot beat individual testing (Θ(n) tests when k = ω(n/log n), arXiv:2006.01325, arXiv:2106.06878). ItemSet.from_candidates is a k << N shortlist-narrowing convenience only — it is not the headline.
poolcheck is not a process reward model. It does not score steps; it localizes defectives. ItemSet.priors is an unused experimental seam (off by default); supplying priors lets a downstream PRM bias the decoder, but that path is unbenchmarked here.
The pooled-query premise is unverified in this release. The simulated noise uses single-item error rates from the literature and assumes they also hold for pooled queries. Whether pooling degrades a real judge's FA/MD rates (residual risk #1) is OPEN; see "Did pooling break my judge?" below.

Did pooling break my judge? (s0-gate)

The one empirical question poolcheck cannot answer for you: does asking your judge "are any of these N steps wrong?" make it noticeably worse than asking about one step at a time? Measure it on your own judge:

poolcheck s0-gate --cases your_cases.json --verifier hf:Qwen/Qwen2.5-7B-Instruct \
    --pool-sizes 4 8

PASS (pooled FA/MD ≤ 1.5× single, with bootstrap CIs) means the simulated advantage should transfer. FAIL means it may not. This build ships the tool but did not run it against a live judge (no inference token was available in the build environment); see s0_report.json.
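The PASS condition can be illustrated on point estimates. The sketch below is a simplification under stated assumptions, not the s0-gate implementation: it leaves out the bootstrap CIs, assumes both clean and faulty cases are present, and uses hypothetical function names.

```python
import numpy as np


def error_rates(verdicts, labels):
    """Return (false-alarm rate, missed-detect rate) from judge verdicts and true labels."""
    verdicts = np.asarray(verdicts, dtype=bool)  # True = judge flagged an error
    labels = np.asarray(labels, dtype=bool)      # True = an error is really present
    fa = float(verdicts[~labels].mean())         # flagged although clean
    md = float((~verdicts[labels]).mean())       # not flagged although faulty
    return fa, md


def gate_passes(single, pooled, max_ratio=1.5):
    """PASS if both pooled rates stay within max_ratio times the single-item rates."""
    fa_single, md_single = single
    fa_pooled, md_pooled = pooled
    return fa_pooled <= max_ratio * fa_single and md_pooled <= max_ratio * md_single


rates = error_rates(verdicts=[1, 0, 0, 0, 1, 0], labels=[0, 0, 0, 0, 1, 1])
print(rates)                                       # FA over 4 clean cases, MD over 2 faulty cases
print(gate_passes((0.10, 0.40), (0.12, 0.55)))     # both ratios within 1.5x
print(gate_passes((0.10, 0.40), (0.12, 0.70)))     # MD ratio exceeds 1.5x
```

A point-estimate check like this can pass or fail by chance on a small case set, which is presumably why the real gate reports bootstrap CIs alongside the ratio.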

Public API

Verifier · ItemSet · localize · Strategy · frontier (plus supporting types NoiseChannel, SimulatedJudge, DeterministicJudge, LocalizeResult).
