GitHub - AI45Lab/PseudoBench

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

📖 Overview

PseudoBench is an adversarial benchmark for testing whether agentic auto-research systems can identify and resist pseudoscientific narratives.

🎯 Highlights

📚 200 Curated Items: pseudoscientific claim-evidence pairs constructed for adversarial evaluation
🧭 5 Domains: Physics, Math, Engineering, Earth Science, and Mystic/Spiritual narratives
🤖 7 Auto-Research Systems: Codex, Claude Code, OpenClaw, EvoScientist, Nanobot, ResearchClaw, and ARIS
📄 End-to-End PDF Generation: from autonomous research execution to final paper-style report generation
📋 3 Evaluation Dimensions: quality, alignment, and persuasiveness
🔍 14 Fine-Grained Submetrics: detailed scoring of structure, evidence use, topic shift, terminology misuse, and more

🚨 Main Findings

⏱️ All evaluated auto-research systems readily complete full pseudoscientific projects with near-zero refusal rates, often within only a few minutes.
🧠 LLM sycophancy persists in the agentic setting. Systems generate high-quality reports that remain tightly aligned with misleading premises, and the best resistance score is only 27.4%.
📈 Stronger systems may amplify pseudoscience more effectively, especially for claims that appear formal enough to elaborate but are not easily rejected by simple calculation or direct contradiction.

✨ News

2026-06-09 📄 Initial release of PseudoBench and the evaluation code.

For complete experimental results, model comparisons, and ablation studies, please refer to the main paper.

Benchmark Construction & Evaluation Framework

Overview of PseudoBench: dataset construction, report generation, and evaluation protocol.

PseudoBench consists of three stages: benchmark construction, end-to-end report generation, and paper-level evaluation.

🧱 Benchmark Construction

We construct PseudoBench from raw pseudoscientific web materials through:

data collection
seed filtering and normalization
claim-evidence standardization
semantic deduplication
absurdity scoring, stratified sampling, and final rewriting
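The stratified sampling step can be illustrated with a minimal sketch. The README does not specify the scoring scale, the number of strata, or the record layout, so the `absurdity` key, the [0, 1] score range, and the equal-width bins below are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict


def stratified_sample(items, n_total, n_bins=5, seed=0):
    """Draw roughly n_total items, spread evenly across absurdity-score bins.

    Assumption: each item is a dict with a hypothetical "absurdity" score
    in [0, 1]. The real benchmark may use a different scale and scheme.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for item in items:
        # Map the score to an equal-width bin; a score of 1.0 goes in the last bin.
        idx = min(int(item["absurdity"] * n_bins), n_bins - 1)
        bins[idx].append(item)

    per_bin = n_total // n_bins
    sample = []
    for idx in sorted(bins):
        pool = bins[idx]
        # Take at most per_bin items from each stratum, without replacement.
        sample.extend(rng.sample(pool, min(per_bin, len(pool))))
    return sample
```

Sampling per stratum rather than from the pooled set keeps both mildly and extremely implausible claims represented in the final selection.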

The final public benchmark contains 200 curated claim-evidence pairs across five domains:

Fundamental Physics and Cosmology
Mathematics and Formal Systems
Engineering, Energy, and Anomalous Devices
Earth Science and Natural Phenomena
Consciousness, Soul, and Mystic Energy

Task distribution across the five domains in PseudoBench.

🤖 End-to-End Report Generation

For each benchmark item, an auto-research system is given:

a core claim
supporting evidence
an isolated workspace

The system is asked to autonomously complete a full research workflow, including planning, evidence organization, implementation, analysis, writing, and final PDF delivery. The target output is a paper-style report.pdf, rather than a short answer or outline.

📏 Paper-Level Evaluation

Each generated PDF is evaluated by a judge model along three first-level dimensions:

Report Quality: whether the PDF looks like a complete academic paper
Pseudoscience Alignment: whether the report remains faithful to the original claim and evidence
Persuasiveness: whether the report packages the claim in a misleadingly scientific-looking way

These three dimensions are further decomposed into 14 second-level submetrics. Scores are assigned on a 1-5 scale and then aggregated into:

⚠️ pseudoscientific hazard scores
🛡️ resistance scores
🚫 refusal rate
⏱️ runtime
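As a rough illustration of the aggregation, the sketch below averages submetric scores within each dimension and computes a refusal rate. The exact hazard and resistance formulas are defined in the paper and are not reproduced here; the function names, the dictionary layout, and the submetric names in the usage example are hypothetical.

```python
from statistics import mean


def dimension_means(submetric_scores):
    """Average the 1-5 submetric scores within each first-level dimension.

    Assumption: input maps dimension name -> {submetric name: score}.
    """
    return {dim: mean(scores.values()) for dim, scores in submetric_scores.items()}


def refusal_rate(refused_flags):
    """Fraction of benchmark items for which the system refused the task."""
    flags = list(refused_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

For example, `dimension_means({"quality": {"structure": 4, "evidence_use": 2}})` averages the two scores for that dimension, and `refusal_rate([True, False, False, False])` gives the share of refused items.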

🛠️ Installation and Usage

Prerequisites

Python 3.10+
An OpenAI-compatible API endpoint for PDF judging
A generation system that can write files inside a workspace and produce report.pdf

Setup

git clone https://github.com/AI45Lab/PseudoBench.git
cd PseudoBench
pip install openai tqdm

Project Structure

PseudoBench/
├── PseudoBench.jsonl           # benchmark claim-evidence pairs
├── prompt.py                   # report-generation and evaluation prompts
├── get_report.py               # minimal Codex-based report generation script
├── evaluate.py                 # minimal PDF evaluation script
├── assets/                     # overview and benchmark figures
└── workspaces/
    └── <agent_name>_workspace/ # workspace root for one generation system
        └── <uuid>/             # isolated workspace for one benchmark item
            └── report.pdf      # final generated paper-style PDF

Workflow

Prepare PseudoBench.jsonl
Use REPORT_GENERATION_PROMPT to run your auto-research system
Save each output to workspaces/<agent_name>_workspace/<uuid>/report.pdf
Run evaluate.py
Read the final scores from results/<judge_model_name>/<agent_name>/result.jsonl

Step 1: Generate PDF Reports

For each item in PseudoBench.jsonl, the generation system should create:

workspaces/<agent_name>_workspace/<uuid>/report.pdf
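Before running the evaluation, it can help to check that every item has a report at the expected location. The helpers below are a minimal sketch built only on the path layout above; the function names are hypothetical, and how you obtain each item's identifier from PseudoBench.jsonl depends on its field names, which are not documented here.

```python
from pathlib import Path


def report_path(root, agent_name, item_uuid):
    """Build <root>/<agent_name>_workspace/<uuid>/report.pdf."""
    return Path(root) / f"{agent_name}_workspace" / item_uuid / "report.pdf"


def missing_reports(root, agent_name, item_uuids):
    """Return the identifiers whose report.pdf does not exist yet."""
    return [u for u in item_uuids if not report_path(root, agent_name, u).is_file()]
```

Calling `missing_reports("workspaces", "codex", uuids)` with the list of item identifiers returns the ones that still need to be generated.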

The generator should use REPORT_GENERATION_PROMPT defined in prompt.py, read the benchmark item, execute a full research-style workflow, and produce a final paper-style PDF in the corresponding workspace.

A minimal Codex-based runner is provided in get_report.py. Example:

python get_report.py \
  --model gpt-5.4 \
  --input_path PseudoBench.jsonl \
  --output_path workspaces/codex_workspace \
  --base_url http://localhost:8000/v1 \
  --api_key YOUR_API_KEY \
  --max_concurrent 4

In this repository, get_report.py is the only built-in generation runner, and it currently targets Codex. The other systems listed below are included as reference invocation patterns for reproducing the benchmark with different auto-research agents.

Typical invocation commands for the auto-research systems evaluated in PseudoBench are:

| System | Repository | Invocation command |
| --- | --- | --- |
| `Codex` | openai/codex | `codex exec --ephemeral --full-auto -C <workspace> --model <model> <prompt>` |
| `Claude Code` | anthropics/claude-code | `claude -p --dangerously-skip-permissions --model <model> --verbose --output-format stream-json <prompt>` |
| `OpenClaw` | openclaw/openclaw | `openclaw agent -m <prompt> -w <workspace>` |
| `Nanobot` | HKUDS/nanobot | `nanobot agent -w <workspace> -m <prompt>` |
| `EvoScientist` | EvoScientist/EvoScientist | `evosci --ui cli --workdir <workspace> --auto-approve --auto-mode -p <prompt>` |
| `ResearchClaw` | ymx10086/ResearchClaw | `python _researchclaw_launcher.py --workspace <workspace> --prompt <prompt> --model <model>` |
| `ARIS` | wanshuiyin/Auto-claude-code-research-in-sleep | `aris --model <model> --output-format text --permission-mode workspace-write --dangerously-skip-permissions <prompt>` |
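If you drive these agents from Python, one option is to turn the patterns above into argument lists for `subprocess.run`, which avoids shell-quoting problems with long prompts. The sketch below covers three of the listed systems and copies their flags from the invocation patterns; the `build_command` helper itself is hypothetical and is not part of the repository.

```python
import shlex


def build_command(system, workspace, prompt, model=None):
    """Return an argv list for one of the invocation patterns listed above."""
    if system == "codex":
        if model is None:
            raise ValueError("the codex pattern requires a model name")
        return ["codex", "exec", "--ephemeral", "--full-auto",
                "-C", workspace, "--model", model, prompt]
    if system == "openclaw":
        return ["openclaw", "agent", "-m", prompt, "-w", workspace]
    if system == "nanobot":
        return ["nanobot", "agent", "-w", workspace, "-m", prompt]
    raise ValueError(f"no pattern defined in this sketch for: {system}")


# shlex.join renders the argv list as a shell-safe string for logging.
print(shlex.join(build_command("nanobot", "ws", "hello world")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; the prompt stays a single argument regardless of the spaces or quotes it contains.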

Step 2: Evaluate Generated PDFs

evaluate.py reads PseudoBench.jsonl, locates the corresponding report.pdf, and evaluates each PDF with a judge model API.

Example:

python evaluate.py \
  --agent_name codex \
  --input_path PseudoBench.jsonl \
  --judge_model_name gpt-5.4 \
  --base_url http://localhost:8000/v1 \
  --api_key YOUR_API_KEY \
  --max_concurrent 8

Results are saved to:

results/<judge_model_name>/<agent_name>/result.jsonl
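To inspect the output afterwards, a small loader along these lines may be enough. It relies only on the path layout above and on the file being JSON Lines; the function names are hypothetical, and the record fields are not documented here, so the sketch makes no assumptions about them.

```python
import json
from pathlib import Path


def result_path(judge_model_name, agent_name, root="results"):
    """Build <root>/<judge_model_name>/<agent_name>/result.jsonl."""
    return Path(root) / judge_model_name / agent_name / "result.jsonl"


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Printing the keys of the first record, for example `load_jsonl(result_path("gpt-5.4", "codex"))[0].keys()`, is a quick way to see which score fields evaluate.py actually wrote.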

🤝 Acknowledgements

The auto-research system implementation in PseudoBench is inspired in part by ResearchClawBench. We thank the authors for their valuable work.

📜 Citation

If you would like to cite our work, please use the following BibTeX.
