CRSBench - Cyber Reasoning System Benchmark Suite

CRSBench is the benchmark suite for OSS-CRS, the open-source orchestration framework for LLM-based autonomous bug-finding and bug-fixing systems (Cyber Reasoning Systems). It provides curated benchmarks and an evaluation harness for measuring any OSS-CRS-compatible CRS on vulnerability discovery and program repair.

Unlike traditional fuzzing benchmarks (e.g., FuzzBench), which report only coverage and crashes, CRSBench stores complete ground truth, so an evaluation can check whether each vulnerability was actually found and correctly patched.

Benchmark Statistics

Metric	Value
Benchmarks	124 (87 Delta + 37 Full)
Upstream projects	82
Vulnerabilities (CPVs)	315
C / C++	63 benchmarks, 123 vulnerabilities
JVM (Java)	61 benchmarks, 192 vulnerabilities
Distinct CWEs	91 (covers 21 of the 2025 CWE Top 25)
Vulnerabilities per harness	1.65 average, 12 max
PoV variants per vulnerability	3.89 average

Full breakdown and regeneration steps: docs/reference/benchmark-statistics.md.
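
As a quick consistency check on the table above, the per-kind and per-language rows should add up to the same totals. The sketch below only re-adds the numbers quoted in the table; the variable names are illustrative and nothing here is read from the repository.

```shell
# Numbers copied from the statistics table; variable names are illustrative.
delta=87; full=37            # benchmarks by kind
c_bench=63; jvm_bench=61     # benchmarks by language
c_vulns=123; jvm_vulns=192   # vulnerabilities by language

total_by_kind=$((delta + full))
total_by_lang=$((c_bench + jvm_bench))
total_vulns=$((c_vulns + jvm_vulns))

echo "benchmarks by kind:     $total_by_kind"
echo "benchmarks by language: $total_by_lang"
echo "vulnerabilities:        $total_vulns"
```

Both benchmark sums come to 124 and the vulnerability sum comes to 315, matching the Benchmarks and Vulnerabilities (CPVs) rows.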

Quick Start

CRSBench is Linux-only and requires Docker. The smallest first run is a queue-backed single-host experiment against the sanity suite.
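
Since the setup assumes Linux, Docker, git, and uv, a small preflight check can catch a missing prerequisite before anything is cloned. This is a minimal sketch, not part of CRSBench: the preflight function is a hypothetical helper, and the tool list is inferred from the commands used in this section.

```shell
# Hypothetical helper: reports missing prerequisites, returns 1 if any are missing.
preflight() {
  missing=0
  if [ "$(uname -s)" != "Linux" ]; then
    echo "unsupported OS: $(uname -s) (CRSBench is Linux-only)"
    missing=1
  fi
  for tool in git docker uv; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool"
      missing=1
    fi
  done
  # Only probe the daemon when the docker binary exists.
  if command -v docker >/dev/null 2>&1 && ! docker info >/dev/null 2>&1; then
    echo "docker is installed but the daemon is not reachable"
    missing=1
  fi
  return "$missing"
}

preflight || echo "preflight failed: fix the items above before continuing"
```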

git clone https://github.com/sslab-gatech/CRSBench.git && cd CRSBench
git submodule update --init --recursive
uv sync
./scripts/setup-third-party.sh
uv run crsbench prepare
uv run crsbench prepare --coverage   # for the bundled starter CRS

Configure environment variables. CRSBench auto-loads .env from the repo root; edit it for distributed Redis, LiteLLM credentials, etc. CRSBench currently requires you to bring your own LiteLLM endpoint, either via the local helper at scripts/litellm-helper.py or an external proxy. Refer to the LiteLLM docs for configuring providers, routing, and keys. See docs/getting-started/configuration.md for the CRSBench-side wiring.

cp .env.example .env

Request access to the gated HuggingFace dataset. Open https://huggingface.co/datasets/sslab-gatech/crsbench-dataset and accept the Data Use Agreement; access is granted after manual approval. Once approved, authenticate (either set HF_TOKEN=hf_... in .env or run hf auth login) and download the sanity suite:

uv run hf auth login   # or set HF_TOKEN in .env
uv run crsbench download --benchmark-suite smoke/sanity
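
If the download fails with an authorization error, one thing to rule out is a missing token. The sketch below checks whether HF_TOKEN is set in the environment or has a non-empty entry in .env. The has_hf_token function is a hypothetical helper, and the check assumes a plain HF_TOKEN=value line; it does not verify that the token is valid or that dataset access has been approved.

```shell
# Hypothetical helper: succeeds if HF_TOKEN is exported or set in the given env file.
has_hf_token() {
  env_file="${1:-.env}"
  if [ -n "${HF_TOKEN:-}" ]; then
    return 0
  fi
  if [ -f "$env_file" ] && grep -Eq '^HF_TOKEN=.+' "$env_file"; then
    return 0
  fi
  return 1
}

if has_hf_token .env; then
  echo "HF_TOKEN found"
else
  echo "HF_TOKEN not found: set it in .env or run 'hf auth login'"
fi
```

A cached login from hf auth login is not detected by this check, so a "not found" result is only a hint.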

Run the bundled quick-start config experiment-configs/smoke-testing/first-run.yaml. It targets the smoke/sanity suite (2 benchmarks, 3 harnesses) with the bundled atlantis-multilang-given_fuzzer CRS, runs 3 trial jobs in parallel, and does not need external LLM credentials (runtime.litellm.skip: true):

uv run python scripts/valkey-helper.py start
uv run crsbench worker --experiment-config experiment-configs/smoke-testing/first-run.yaml   # terminal 1
uv run crsbench run    --experiment-config experiment-configs/smoke-testing/first-run.yaml   # terminal 2
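
When a second terminal is inconvenient, the same three commands can be driven from one shell by putting the worker in the background. This is a sketch under assumptions: run_first_experiment is a hypothetical wrapper, and it assumes that crsbench run blocks until the experiment finishes and that the worker can be stopped with a plain kill afterwards. Neither assumption is confirmed by the text above.

```shell
CONFIG="experiment-configs/smoke-testing/first-run.yaml"

# Hypothetical wrapper around the three commands shown above.
run_first_experiment() {
  uv run python scripts/valkey-helper.py start || return 1

  # Worker in the background (terminal 1 in the two-terminal flow).
  uv run crsbench worker --experiment-config "$CONFIG" &
  worker_pid=$!

  # Driver in the foreground (terminal 2 in the two-terminal flow).
  status=0
  uv run crsbench run --experiment-config "$CONFIG" || status=$?

  # Assumption: the worker is safe to stop once the run command returns.
  kill "$worker_pid" 2>/dev/null || true
  wait "$worker_pid" 2>/dev/null || true
  return "$status"
}
```

Call run_first_experiment from the repository root. The two-terminal form remains easier for a first run because the worker and driver logs stay separate.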

Documentation

Start with Getting Started:

Install
Configuration
First Experiment
Experiments - bug-finding, bug-fixing, discovery, replay, merge
Deployment - single-machine, multi-machine, GCE cloud

Other entry points:

Benchmark format contract: docs/RFC.md
Full docs hub: docs/README.md
Contributing: CONTRIBUTING.md

Architecture

CRSBench/
├── benchmarks/              # Benchmark projects (RFC format)
├── crsbench/                # Main Python package
│   ├── builder/             #   OSS-Fuzz variant building
│   ├── evaluation/          #   CRS execution & verification
│   ├── distributed/         #   Multi-machine execution (Redis/RQ)
│   ├── benchmark/           #   Packaging, canary, seed tools
│   ├── dataset/             #   HuggingFace upload/download
│   ├── validation/          #   Format validation & schemas
│   ├── reporting/           #   Reports & dashboard
│   └── statistics/          #   Benchmark statistics
├── oss-crs/                 # OSS-CRS runtime and registry (submodule)
├── third_party/oss-fuzz/    # Managed OSS-Fuzz checkout (sparse)
└── docs/                    # Documentation hub

License

CRSBench is licensed under MIT. Bundled upstream source code retains its original license - see LICENSE-THIRD-PARTY.md.

Related Projects

FuzzBench - fuzzer evaluation platform
OSS-Fuzz - continuous fuzzing for open source
AIxCC - AI Cyber Challenge
