DocShell-R1 is a minimal, auditable prototype for a multimodal document coding agent. It converts a PDF into a repository-like executable workspace and solves questions through typed actions over files, OCR text, page images, crops, extracted table-like text, evidence notes, and scratch Python scripts.
This implementation intentionally does not depend on prior VRJ/HiPaC experiment documents. It is a clean v1 implementation of:
- workspace construction from a PDF
- repository-like page/region/text/notes/scratch layout
- typed action language
- stable tool loop
- training-free pilot agent
- trajectory export for SFT
- dense reward scaffolding for later SPO/RL
- lightweight evaluation metrics
cd /workspace/ydy/docshell_r1
source /workspace/ydy/miniconda3/etc/profile.d/conda.sh
conda activate /workspace/ydy/docshell_r1/.conda
pip install -e ".[dev]"
docshell-r1 build --pdf ../2505.22019v1_VRAG-RL.pdf --out runs/vrag_workspace
docshell-r1 ask --workspace runs/vrag_workspace --episode runs/episodes/q_title --question "What is the paper title?"
docshell-r1 export-sft --workspace runs/vrag_workspace \
--episode runs/episodes/q_title_sft \
--question "What is the paper title?" \
--answer "VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning" \
--out runs/sample_sft.jsonl
DocShell-R1 separates reusable document preprocessing from per-question agent state:
- Document workspace: question-agnostic PDF assets such as pages, text, regions, overview, and indexes.
- Episode workspace: question-specific state such as blackboard, evidence notes, trace, final answer, and scratch scripts.
For MMLongBench-style evaluation, build one document workspace per PDF and one episode workspace per question.
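The one-workspace-per-PDF, one-episode-per-question mapping can be sketched as follows. This is only an illustration of the split, not what `docshell-r1 build` does internally: the function names, the `workspaces/` and `episodes/` roots, and the subdirectory names are all assumptions loosely based on the asset and state lists above.

```python
from pathlib import Path

# Hypothetical subdirectory names, loosely following the lists above.
DOC_SUBDIRS = ("pages", "text", "regions", "indexes")
EPISODE_SUBDIRS = ("notes", "scratch")


def doc_workspace(runs_root: Path, pdf_path: Path) -> Path:
    """Create (or reuse) the question-agnostic workspace for one PDF."""
    ws = runs_root / "workspaces" / pdf_path.stem
    for sub in DOC_SUBDIRS:
        (ws / sub).mkdir(parents=True, exist_ok=True)
    return ws


def episode_workspace(runs_root: Path, pdf_path: Path, question_idx: int) -> Path:
    """Create the question-specific workspace for one question about that PDF."""
    ep = runs_root / "episodes" / pdf_path.stem / f"q{question_idx:03d}"
    for sub in EPISODE_SUBDIRS:
        (ep / sub).mkdir(parents=True, exist_ok=True)
    return ep
```

Calling `doc_workspace` twice for the same PDF returns the same directory, so preprocessing can be shared across questions, while each call to `episode_workspace` with a new index yields a fresh directory for per-question state.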
- MMLongBench-Doc multimodal experiment log records the DocShell-R2 VLM-native implementation, Claude/GPT gateway checks, visual workspace build, 25-question pilot, full 1082-question run, scores, artifacts, and next optimization targets.
- PRC-Bench experiment log records the earlier DocShell-R1 PRC-Bench adaptation and iterations.
Supported v1 actions:
- LIST_DIR(path)
- READ_FILE(path, start_line, end_line)
- GREP(pattern, path)
- WRITE_FILE(path, content)
- RUN_PYTHON_FILE(path)
- PATCH_FILE(path, old, new)
- SEARCH_TEXT(query, top_k)
- RETRIEVE_PAGE(query, top_k)
- READ_PAGE(page_id)
- LIST_REGIONS(page_id)
- READ_REGION(region_id)
- CROP(page_id, bbox_or_region_id)
- READ_CROP(crop_path)
- EXTRACT_TABLE(page_id_or_region_id)
- EXTRACT_CHART(page_id_or_region_id)
- MAKE_NOTE(content, evidence_refs)
- VERIFY_CLAIM(claim, evidence_refs)
- VERIFY_COMPLETENESS(question, evidence_refs)
- ABSTAIN(reason, checked_refs)
- PYTHON(code)
- VERIFY(answer, evidence_refs)
- FINAL(answer, evidence_refs, confidence, unanswerable)
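As an illustration of what "typed" can mean for this notation, here is a minimal parser sketch. It assumes action strings are written with Python-literal arguments (for example `READ_FILE("text/page_001.txt", 1, 40)`), which the source does not confirm; `Action`, `parse_action`, and the abbreviated `KNOWN_ACTIONS` set are hypothetical names, not part of DocShell-R1.

```python
import ast
from dataclasses import dataclass, field

# A few action names copied from the list above; the real agent supports all 22.
KNOWN_ACTIONS = {"LIST_DIR", "READ_FILE", "GREP", "SEARCH_TEXT", "FINAL"}


@dataclass
class Action:
    name: str
    args: list = field(default_factory=list)
    kwargs: dict = field(default_factory=dict)


def parse_action(text: str) -> Action:
    """Parse one action string such as 'GREP("title", "text/")'."""
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not a well-formed action: {text!r}") from exc
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"not an action call: {text!r}")
    if node.func.id not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {node.func.id}")
    # literal_eval only accepts literals, so arguments cannot execute code.
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {k.arg: ast.literal_eval(k.value) for k in node.keywords}
    return Action(node.func.id, args, kwargs)
```

Rejecting unknown names and non-literal arguments up front keeps malformed model output out of the tool layer, which is one plausible way to keep a tool loop stable.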
The training-free pilot agent uses these same tools, so every answer includes an auditable trace.
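One way such a trace could be produced is to route every action through a single dispatch loop that logs each step. The sketch below is an assumption about the general shape, not the project's actual loop: `run_episode`, the `(name, args)` plan format, and the trace field names are hypothetical.

```python
from typing import Callable

# Hypothetical tool registry type: action name -> Python callable.
Tool = Callable[..., object]


def run_episode(plan: list[tuple[str, tuple]], tools: dict[str, Tool]) -> list[dict]:
    """Execute actions in order and record every step in a trace."""
    trace = []
    for step, (name, args) in enumerate(plan):
        entry = {"step": step, "action": name, "args": list(args)}
        try:
            entry["observation"] = tools[name](*args)
            entry["ok"] = True
        except Exception as exc:  # keep failures in the trace instead of crashing
            entry["observation"] = f"{type(exc).__name__}: {exc}"
            entry["ok"] = False
        trace.append(entry)
        if name == "FINAL":
            break  # FINAL ends the episode; later actions are ignored
    return trace
```

Because failed or unknown actions are recorded rather than raised, the trace stays complete even when a step goes wrong.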
- --agent heuristic (default): the deterministic pilot in agent.py.
- --agent llm: drives the same 22 tools via a DeepSeek-Chat–compatible API. Set DEEPSEEK_API_KEY, then pass --agent llm --model deepseek-chat. Override --api-base / --api-key-env for other OpenAI-compatible providers.
For PRC-Bench batch runs, use MinerU's _origin.pdf directly so no loader changes are needed:
export DEEPSEEK_API_KEY=sk-...
docshell-r1 batch-prc \
--gold-path "/workspace/ydy/PRC-Bench/PRC-Bench-B327 (2)/benchmark/test.json" \
--mineru-root /workspace/ydy/MemReread/datas/prc_bench/mineru_raw \
--runs-root runs/prc_bench \
--out runs/prc_bench/predictions.jsonl \
--agent llm --model deepseek-chat \
--limit-papers 1 --limit-questions 3 # smoke test
Output is a JSONL whose rows match what PRC-Bench/eval.py --pred_path consumes:
{id, part_idx, question, category, gen_answer}. Per-question audit traces are
written under runs/prc_bench/episodes/<id>/q<NNN>/{trace.json, final.json}.
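Before handing the predictions file to the evaluator, a quick schema check along these lines may be useful. The five key names come from the row format above; the `check_predictions` helper itself is a hypothetical sketch and is not part of DocShell-R1 or PRC-Bench.

```python
import json
from pathlib import Path

# Keys taken from the row format described above.
REQUIRED_KEYS = {"id", "part_idx", "question", "category", "gen_answer"}


def check_predictions(path: Path) -> int:
    """Return the number of rows; raise ValueError on a row missing a key."""
    count = 0
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing {sorted(missing)}")
            count += 1
    return count
```

For example, `check_predictions(Path("runs/prc_bench/predictions.jsonl"))` would return the number of predicted rows after a batch run.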
For Claim_Verification questions the agent is instructed to answer True /
False, and a post-processor coerces common synonyms back to that vocabulary.
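A post-processor of this kind could look roughly like the sketch below. The synonym sets are assumptions chosen for illustration, since the source does not list which synonyms are handled, and `coerce_claim_answer` is a hypothetical name.

```python
from typing import Optional

# Assumed synonym tables; the actual lists are not given in the source.
_TRUE = {"true", "yes", "correct", "supported"}
_FALSE = {"false", "no", "incorrect", "refuted"}


def coerce_claim_answer(raw: str) -> Optional[str]:
    """Map a free-form verdict to 'True' or 'False', or None if unrecognized."""
    # Drop surrounding whitespace, then trailing punctuation and quotes.
    token = raw.strip().strip(".!\"'").lower()
    if token in _TRUE:
        return "True"
    if token in _FALSE:
        return "False"
    return None
```

Returning `None` for unrecognized text, rather than guessing, leaves the caller free to decide how to handle answers that fall outside the expected vocabulary.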