DocShell-R1

DocShell-R1 is a minimal, auditable prototype for a multimodal document coding agent. It converts a PDF into a repository-like executable workspace and solves questions through typed actions over files, OCR text, page images, crops, extracted table-like text, evidence notes, and scratch Python scripts.

This implementation intentionally does not depend on prior VRJ/HiPaC experiment documents. It is a clean v1 implementation of:

workspace construction from a PDF
repository-like page/region/text/notes/scratch layout
typed action language
stable tool loop
training-free pilot agent
trajectory export for SFT
dense reward scaffolding for later SPO/RL
lightweight evaluation metrics

Quick Start

cd /workspace/ydy/docshell_r1
source /workspace/ydy/miniconda3/etc/profile.d/conda.sh
conda activate /workspace/ydy/docshell_r1/.conda
pip install -e ".[dev]"

docshell-r1 build --pdf ../2505.22019v1_VRAG-RL.pdf --out runs/vrag_workspace
docshell-r1 ask --workspace runs/vrag_workspace --episode runs/episodes/q_title --question "What is the paper title?"
docshell-r1 export-sft --workspace runs/vrag_workspace \
  --episode runs/episodes/q_title_sft \
  --question "What is the paper title?" \
  --answer "VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning" \
  --out runs/sample_sft.jsonl

Workspace Split

DocShell-R1 separates reusable document preprocessing from per-question agent state:

Document workspace: question-agnostic PDF assets such as pages, text, regions, overview, and indexes.
Episode workspace: question-specific state such as blackboard, evidence notes, trace, final answer, and scratch scripts.

For MMLongBench-style evaluation, build one document workspace per PDF and one episode workspace per question.
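
The one-workspace-per-PDF, one-episode-per-question split can be sketched as a small path-planning helper. This is only an illustration: the `documents/` and `episodes/` directory names and the functions `document_workspace`, `episode_workspace`, and `plan_workspaces` are hypothetical and are not taken from the DocShell-R1 code.

```python
from pathlib import Path


def document_workspace(runs_root: Path, pdf_path: Path) -> Path:
    """One question-agnostic workspace per PDF, keyed by the PDF's file stem."""
    return runs_root / "documents" / pdf_path.stem


def episode_workspace(runs_root: Path, pdf_path: Path, question_id: str) -> Path:
    """One question-specific workspace per (PDF, question) pair."""
    return runs_root / "episodes" / pdf_path.stem / question_id


def plan_workspaces(runs_root, questions):
    """Group episode workspaces under the document workspace they reuse.

    `questions` is an iterable of (pdf_path, question_id) pairs.
    Returns a dict: document workspace -> list of episode workspaces.
    """
    plan = {}
    for pdf, qid in questions:
        pdf = Path(pdf)
        doc_ws = document_workspace(runs_root, pdf)
        plan.setdefault(doc_ws, []).append(episode_workspace(runs_root, pdf, qid))
    return plan
```

With this layout, two questions about the same PDF share one document workspace, so the preprocessing is built once and only the episode state is duplicated.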

Experiment Logs

MMLongBench-Doc multimodal experiment log: records the DocShell-R2 VLM-native implementation, Claude/GPT gateway checks, visual workspace build, 25-question pilot, full 1082-question run, scores, artifacts, and next optimization targets.
PRC-Bench experiment log: records the earlier DocShell-R1 PRC-Bench adaptation and its iterations.

Tool Language

Supported v1 actions:

LIST_DIR(path)
READ_FILE(path, start_line, end_line)
GREP(pattern, path)
WRITE_FILE(path, content)
RUN_PYTHON_FILE(path)
PATCH_FILE(path, old, new)
SEARCH_TEXT(query, top_k)
RETRIEVE_PAGE(query, top_k)
READ_PAGE(page_id)
LIST_REGIONS(page_id)
READ_REGION(region_id)
CROP(page_id, bbox_or_region_id)
READ_CROP(crop_path)
EXTRACT_TABLE(page_id_or_region_id)
EXTRACT_CHART(page_id_or_region_id)
MAKE_NOTE(content, evidence_refs)
VERIFY_CLAIM(claim, evidence_refs)
VERIFY_COMPLETENESS(question, evidence_refs)
ABSTAIN(reason, checked_refs)
PYTHON(code)
VERIFY(answer, evidence_refs)
FINAL(answer, evidence_refs, confidence, unanswerable)

The training-free pilot agent uses these same tools, so every answer includes an auditable trace.
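
To make the idea of a typed action concrete, here is a minimal parsing sketch. It assumes actions are written in Python call syntax with literal arguments, which this README does not specify; `KNOWN_ACTIONS` and `parse_action` are hypothetical names, not part of the DocShell-R1 API.

```python
import ast

# The 22 action names listed above.
KNOWN_ACTIONS = {
    "LIST_DIR", "READ_FILE", "GREP", "WRITE_FILE", "RUN_PYTHON_FILE",
    "PATCH_FILE", "SEARCH_TEXT", "RETRIEVE_PAGE", "READ_PAGE", "LIST_REGIONS",
    "READ_REGION", "CROP", "READ_CROP", "EXTRACT_TABLE", "EXTRACT_CHART",
    "MAKE_NOTE", "VERIFY_CLAIM", "VERIFY_COMPLETENESS", "ABSTAIN", "PYTHON",
    "VERIFY", "FINAL",
}


def parse_action(text: str):
    """Parse 'NAME(arg, ..., key=value)' into (name, args, kwargs).

    Only literal arguments are accepted; nothing in `text` is executed.
    Raises ValueError for malformed input or unknown action names.
    """
    try:
        call = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not a valid action: {text!r}") from exc
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not an action call: {text!r}")
    name = call.func.id
    if name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {name}")
    # literal_eval raises ValueError on anything that is not a plain literal.
    args = [ast.literal_eval(node) for node in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, args, kwargs
```

For example, `parse_action('READ_FILE("text/page_001.txt", 1, 20)')` yields the action name plus a list of three literal arguments, which a tool loop could then record in the trace before dispatching.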

Agents

--agent heuristic (default): the deterministic pilot in agent.py.
--agent llm: drives the same 22 tools via a DeepSeek-Chat–compatible API. Set DEEPSEEK_API_KEY, then pass --agent llm --model deepseek-chat. Override --api-base / --api-key-env for other OpenAI-compatible providers.
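
A rough sketch of how the provider overrides could be resolved is shown below. The function `resolve_llm_config`, its return shape, and the default base URL are assumptions for illustration; only the `--model`, `--api-base`, and `--api-key-env` flags and the `DEEPSEEK_API_KEY` variable come from this README.

```python
import os


def resolve_llm_config(
    model="deepseek-chat",
    api_base="https://api.deepseek.com",  # assumed default, not confirmed by the README
    api_key_env="DEEPSEEK_API_KEY",
    env=None,
):
    """Return the settings an OpenAI-compatible client would need.

    `env` defaults to os.environ; pass a dict to make the lookup testable.
    """
    env = os.environ if env is None else env
    api_key = env.get(api_key_env)
    if not api_key:
        raise RuntimeError(f"set {api_key_env} before using --agent llm")
    return {"model": model, "base_url": api_base.rstrip("/"), "api_key": api_key}
```

Keeping the key behind an environment-variable name, rather than passing the key itself, means the secret never appears in shell history or in saved run configurations.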

PRC-Bench (zero-modification path)

Use MinerU's _origin.pdf directly so no loader changes are needed:

export DEEPSEEK_API_KEY=sk-...
docshell-r1 batch-prc \
  --gold-path "/workspace/ydy/PRC-Bench/PRC-Bench-B327 (2)/benchmark/test.json" \
  --mineru-root /workspace/ydy/MemReread/datas/prc_bench/mineru_raw \
  --runs-root  runs/prc_bench \
  --out        runs/prc_bench/predictions.jsonl \
  --agent llm --model deepseek-chat \
  --limit-papers 1 --limit-questions 3            # smoke

The output is a JSONL file whose rows match the format that PRC-Bench/eval.py --pred_path consumes: {id, part_idx, question, category, gen_answer}. Per-question audit traces are written under runs/prc_bench/episodes/<id>/q<NNN>/{trace.json, final.json}. For Claim_Verification questions, the agent is instructed to answer True or False, and a post-processor coerces common synonyms back to that vocabulary.
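
The row format and the True / False coercion can be sketched as follows. The synonym sets and the helper names `coerce_claim_answer` and `to_prediction_row` are hypothetical; the actual post-processor may recognize a different vocabulary. Only the five field names and the Claim_Verification category come from the description above.

```python
import json

# Hypothetical synonym lists; the real post-processor's vocabulary is not documented here.
TRUE_SYNONYMS = {"true", "yes", "correct", "supported"}
FALSE_SYNONYMS = {"false", "no", "incorrect", "refuted", "not supported"}


def coerce_claim_answer(answer: str) -> str:
    """Map common synonyms to 'True' / 'False'; leave anything else unchanged."""
    normalized = answer.strip().strip(".!").lower()
    if normalized in TRUE_SYNONYMS:
        return "True"
    if normalized in FALSE_SYNONYMS:
        return "False"
    return answer.strip()


def to_prediction_row(row_id, part_idx, question, category, answer) -> str:
    """Serialize one prediction as a single JSONL line."""
    if category == "Claim_Verification":
        answer = coerce_claim_answer(answer)
    row = {
        "id": row_id,
        "part_idx": part_idx,
        "question": question,
        "category": category,
        "gen_answer": answer,
    }
    return json.dumps(row, ensure_ascii=False)
```

Writing one `to_prediction_row(...)` result per line produces a file in the shape described above; answers outside the synonym lists pass through untouched so that the evaluator, not the post-processor, decides how to score them.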
