CRSBench is the benchmark suite for OSS-CRS, the open-source orchestration framework for LLM-based autonomous bug-finding and bug-fixing systems (Cyber Reasoning Systems). It provides curated benchmarks and an evaluation harness for measuring any OSS-CRS-compatible CRS on vulnerability discovery and program repair.
Unlike traditional fuzzing benchmarks (e.g., FuzzBench) that only report coverage/crashes, CRSBench stores complete ground truth to track whether vulnerabilities are actually found and correctly patched.
| Metric | Value |
|---|---|
| Benchmarks | 124 (87 Delta + 37 Full) |
| Upstream projects | 82 |
| Vulnerabilities (CPVs) | 315 |
| C / C++ | 63 benchmarks, 123 vulnerabilities |
| JVM (Java) | 61 benchmarks, 192 vulnerabilities |
| Distinct CWEs | 91 (covers 21 of the 2025 CWE Top 25) |
| Vulnerabilities per harness | 1.65 average, 12 max |
| PoV variants per vulnerability | 3.89 average |

Full breakdown and regeneration steps: docs/reference/benchmark-statistics.md.
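As a quick sanity check on the summary table, the sketch below confirms that the per-mode and per-language counts add up to the stated totals. All figures are copied from the table; the per-language ratio at the end is derived here for illustration and is not a published CRSBench statistic.

```python
# Figures copied from the summary table above.
benchmarks_by_mode = {"Delta": 87, "Full": 37}
benchmarks_by_language = {"C / C++": 63, "JVM (Java)": 61}
vulns_by_language = {"C / C++": 123, "JVM (Java)": 192}

total_benchmarks = 124
total_vulns = 315

# Each split of the benchmark set should sum to the same total.
assert sum(benchmarks_by_mode.values()) == total_benchmarks
assert sum(benchmarks_by_language.values()) == total_benchmarks
assert sum(vulns_by_language.values()) == total_vulns

# Derived for illustration only: average vulnerabilities per benchmark, by language.
vulns_per_benchmark = {
    lang: round(vulns_by_language[lang] / benchmarks_by_language[lang], 2)
    for lang in benchmarks_by_language
}
print(vulns_per_benchmark)  # {'C / C++': 1.95, 'JVM (Java)': 3.15}
```

This ratio is per benchmark, so it is not directly comparable to the 1.65 per-harness average in the table, since a benchmark can contain more than one harness.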
CRSBench is Linux-only and requires Docker. The smallest first run is a
queue-backed single-host experiment against the sanity suite.
git clone https://github.com/sslab-gatech/CRSBench.git && cd CRSBench
git submodule update --init --recursive
uv sync
./scripts/setup-third-party.sh
uv run crsbench prepare
uv run crsbench prepare --coverage # for the bundled starter CRS

Configure environment variables. CRSBench auto-loads .env from the repo root;
edit it for distributed Redis, LiteLLM credentials, etc. CRSBench currently
requires you to bring your own LiteLLM endpoint, either via the local helper
at scripts/litellm-helper.py or an external proxy. Refer to the
LiteLLM docs for configuring providers, routing,
and keys. See
docs/getting-started/configuration.md
for the CRSBench-side wiring.
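To illustrate what a `.env` file contributes, here is a minimal sketch of a pre-flight check that reads `KEY=VALUE` lines and reports which required keys are unset. This is not CRSBench's own loader: `parse_env_text` and `missing_keys` are hypothetical helpers, and the only key taken from this README is `HF_TOKEN`, which is needed for the gated dataset download.

```python
import os


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env_values, required, environ=None):
    """Return required keys that are empty in both the .env values and the process environment."""
    environ = os.environ if environ is None else environ
    return [k for k in required if not env_values.get(k) and not environ.get(k)]


# Hypothetical .env content, used only for this demonstration.
sample = """
# Hugging Face token for the gated dataset
HF_TOKEN=hf_example
EMPTY_VALUE=
"""
parsed = parse_env_text(sample)
print(missing_keys(parsed, ["HF_TOKEN", "EMPTY_VALUE"], environ={}))  # ['EMPTY_VALUE']
```

In practice you would pass the text of the real `.env` file, for example `parse_env_text(open(".env").read())`, and list whichever keys your deployment needs.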
cp .env.example .env

Request access to the HuggingFace dataset (gated). Open
https://huggingface.co/datasets/sslab-gatech/crsbench-dataset and accept the
Data Use Agreement - access is granted after manual approval. Once approved,
authenticate (either set HF_TOKEN=hf_... in .env, or run hf auth login)
and download the sanity suite:
uv run hf auth login # or set HF_TOKEN in .env
uv run crsbench download --benchmark-suite smoke/sanity

Run the bundled quick-start config
experiment-configs/smoke-testing/first-run.yaml.
It targets the smoke/sanity suite (2 benchmarks, 3 harnesses) with the
bundled atlantis-multilang-given_fuzzer CRS, runs 3 trial jobs in parallel,
and does not need external LLM credentials (runtime.litellm.skip: true):
uv run python scripts/valkey-helper.py start
uv run crsbench worker --experiment-config experiment-configs/smoke-testing/first-run.yaml # terminal 1
uv run crsbench run --experiment-config experiment-configs/smoke-testing/first-run.yaml # terminal 2

Start with Getting Started:
- Install
- Configuration
- First Experiment
- Experiments - bug-finding, bug-fixing, discovery, replay, merge
- Deployment - single-machine, multi-machine, GCE cloud
Other entry points:
- Benchmark format contract: docs/RFC.md
- Full docs hub: docs/README.md
- Contributing: CONTRIBUTING.md
CRSBench/
├── benchmarks/ # Benchmark projects (RFC format)
├── crsbench/ # Main Python package
│ ├── builder/ # OSS-Fuzz variant building
│ ├── evaluation/ # CRS execution & verification
│ ├── distributed/ # Multi-machine execution (Redis/RQ)
│ ├── benchmark/ # Packaging, canary, seed tools
│ ├── dataset/ # HuggingFace upload/download
│ ├── validation/ # Format validation & schemas
│ ├── reporting/ # Reports & dashboard
│ └── statistics/ # Benchmark statistics
├── oss-crs/ # OSS-CRS runtime and registry (submodule)
├── third_party/oss-fuzz/ # Managed OSS-Fuzz checkout (sparse)
└── docs/ # Documentation hub
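The layout above can double as a checklist after cloning. The sketch below reports any top-level directory from the tree that is missing or empty, which is a common symptom of skipping `git submodule update --init --recursive` or `./scripts/setup-third-party.sh`. The helper name `missing_or_empty` is hypothetical and not part of the `crsbench` package; the demonstration runs against a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path

# Top-level directories taken from the layout tree above.
EXPECTED_DIRS = ["benchmarks", "crsbench", "oss-crs", "third_party/oss-fuzz", "docs"]


def missing_or_empty(repo_root):
    """Return expected directories that do not exist or contain no entries."""
    root = Path(repo_root)
    problems = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            problems.append(rel)
    return problems


# Demonstration on a throwaway directory: one populated dir, one empty dir.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "benchmarks").mkdir()
    (Path(tmp) / "benchmarks" / "placeholder.txt").write_text("x")
    (Path(tmp) / "oss-crs").mkdir()  # empty, like an uninitialized submodule
    demo_missing = missing_or_empty(tmp)

print(demo_missing)  # ['crsbench', 'oss-crs', 'third_party/oss-fuzz', 'docs']
```

Against a real checkout, call `missing_or_empty(".")` from the repository root; an empty list means every directory in the tree is present and populated.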
CRSBench is licensed under MIT. Bundled upstream source code retains its original license - see LICENSE-THIRD-PARTY.md.