yedeyang/docshell_r2

DocShell-R1

DocShell-R1 is a minimal, auditable prototype for a multimodal document coding agent. It converts a PDF into a repository-like executable workspace and solves questions through typed actions over files, OCR text, page images, crops, extracted table-like text, evidence notes, and scratch Python scripts.

This implementation intentionally does not depend on prior VRJ/HiPaC experiment documents. It is a clean v1 implementation of:

  • workspace construction from a PDF
  • repository-like page/region/text/notes/scratch layout
  • typed action language
  • stable tool loop
  • training-free pilot agent
  • trajectory export for SFT
  • dense reward scaffolding for later SPO/RL
  • lightweight evaluation metrics

Quick Start

cd /workspace/ydy/docshell_r1
source /workspace/ydy/miniconda3/etc/profile.d/conda.sh
conda activate /workspace/ydy/docshell_r1/.conda
pip install -e ".[dev]"

docshell-r1 build --pdf ../2505.22019v1_VRAG-RL.pdf --out runs/vrag_workspace
docshell-r1 ask --workspace runs/vrag_workspace --episode runs/episodes/q_title --question "What is the paper title?"
docshell-r1 export-sft --workspace runs/vrag_workspace \
  --episode runs/episodes/q_title_sft \
  --question "What is the paper title?" \
  --answer "VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning" \
  --out runs/sample_sft.jsonl

Workspace Split

DocShell-R1 separates reusable document preprocessing from per-question agent state:

  • Document workspace: question-agnostic PDF assets such as pages, text, regions, overview, and indexes.
  • Episode workspace: question-specific state such as blackboard, evidence notes, trace, final answer, and scratch scripts.

For MMLongBench-style evaluation, build one document workspace per PDF and one episode workspace per question.
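
The split above can be sketched as follows. This is a minimal illustration, not the project's actual builder: the subdirectory names, the `question.txt` file, and the `workspaces/` and `episodes/` roots are hypothetical choices made for the example.

```python
from pathlib import Path

# Hypothetical subdirectory names, chosen only for illustration.
DOCUMENT_DIRS = ("pages", "text", "regions", "index")
EPISODE_DIRS = ("notes", "scratch")


def make_document_workspace(root: Path) -> Path:
    """Create the question-agnostic layout, once per PDF."""
    for name in DOCUMENT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def make_episode_workspace(root: Path, question: str) -> Path:
    """Create the question-specific layout, once per question."""
    for name in EPISODE_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / "question.txt").write_text(question, encoding="utf-8")
    return root


def plan_episodes(runs_root: Path, doc_id: str, questions: list[str]) -> list[Path]:
    """Build one document workspace, then one episode directory per question."""
    make_document_workspace(runs_root / "workspaces" / doc_id)
    episodes = []
    for i, question in enumerate(questions):
        episode_root = runs_root / "episodes" / doc_id / f"q{i:03d}"
        episodes.append(make_episode_workspace(episode_root, question))
    return episodes
```

Because the document workspace holds nothing question-specific, it can be built once and shared read-only by every episode for that PDF.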

Experiment Logs

  • MMLongBench-Doc multimodal experiment log: records the DocShell-R2 VLM-native implementation, the Claude/GPT gateway checks, the visual workspace build, the 25-question pilot, the full 1082-question run, scores, artifacts, and next optimization targets.
  • PRC-Bench experiment log: records the earlier DocShell-R1 PRC-Bench adaptation and its iterations.

Tool Language

Supported v1 actions:

  • LIST_DIR(path)
  • READ_FILE(path, start_line, end_line)
  • GREP(pattern, path)
  • WRITE_FILE(path, content)
  • RUN_PYTHON_FILE(path)
  • PATCH_FILE(path, old, new)
  • SEARCH_TEXT(query, top_k)
  • RETRIEVE_PAGE(query, top_k)
  • READ_PAGE(page_id)
  • LIST_REGIONS(page_id)
  • READ_REGION(region_id)
  • CROP(page_id, bbox_or_region_id)
  • READ_CROP(crop_path)
  • EXTRACT_TABLE(page_id_or_region_id)
  • EXTRACT_CHART(page_id_or_region_id)
  • MAKE_NOTE(content, evidence_refs)
  • VERIFY_CLAIM(claim, evidence_refs)
  • VERIFY_COMPLETENESS(question, evidence_refs)
  • ABSTAIN(reason, checked_refs)
  • PYTHON(code)
  • VERIFY(answer, evidence_refs)
  • FINAL(answer, evidence_refs, confidence, unanswerable)

The training-free pilot agent uses these same tools, so every answer includes an auditable trace.
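
To illustrate what "typed" means here, the sketch below parses one action string into a name plus named arguments and rejects unknown actions or wrong arity. It assumes actions are written like Python calls with literal arguments, which this README does not confirm. `Action`, `SIGNATURES`, and `parse_action` are hypothetical names, and only a subset of the 22 signatures is included.

```python
import ast
from dataclasses import dataclass

# Parameter names copied from the action list above (subset only).
SIGNATURES = {
    "LIST_DIR": ("path",),
    "READ_FILE": ("path", "start_line", "end_line"),
    "GREP": ("pattern", "path"),
    "SEARCH_TEXT": ("query", "top_k"),
    "FINAL": ("answer", "evidence_refs", "confidence", "unanswerable"),
}


@dataclass
class Action:
    name: str
    args: dict


def parse_action(text: str) -> Action:
    """Parse 'NAME(arg, ...)' with literal positional arguments."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected NAME(arg, ...)")
    name = node.func.id
    if name not in SIGNATURES:
        raise ValueError(f"unknown action: {name}")
    params = SIGNATURES[name]
    if node.keywords or len(node.args) != len(params):
        raise ValueError(f"{name} expects positional arguments {params}")
    # literal_eval accepts AST nodes and rejects anything that is not a literal.
    values = [ast.literal_eval(arg) for arg in node.args]
    return Action(name=name, args=dict(zip(params, values)))
```

Validating against a fixed signature table keeps the tool loop stable: a malformed model output fails at parse time instead of reaching a tool with the wrong arguments.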

Agents

  • --agent heuristic (default): the deterministic pilot in agent.py.
  • --agent llm: drives the same 22 tools via a DeepSeek-Chat–compatible API. Set DEEPSEEK_API_KEY, then pass --agent llm --model deepseek-chat. Override --api-base / --api-key-env for other OpenAI-compatible providers.
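
As a rough picture of what the `--api-base` and `--api-key-env` overrides imply, the sketch below builds (but does not send) a chat request for an OpenAI-compatible endpoint. It is an assumption about the general shape of such a call, not DocShell-R1's actual client code. `build_chat_request` is a hypothetical helper, and the `/chat/completions` path follows the common OpenAI-compatible convention.

```python
import json
import os
import urllib.request


def build_chat_request(api_base: str, api_key_env: str, model: str,
                       messages: list[dict]) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat endpoint."""
    # The key is read from the environment variable that api_key_env names.
    api_key = os.environ.get(api_key_env)
    if not api_key:
        raise RuntimeError(f"environment variable {api_key_env} is not set")
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the name of the environment variable rather than the key itself keeps secrets out of shell history and run logs.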

PRC-Bench (zero-modification path)

Use MinerU's _origin.pdf directly so no loader changes are needed:

export DEEPSEEK_API_KEY=sk-...
docshell-r1 batch-prc \
  --gold-path "/workspace/ydy/PRC-Bench/PRC-Bench-B327 (2)/benchmark/test.json" \
  --mineru-root /workspace/ydy/MemReread/datas/prc_bench/mineru_raw \
  --runs-root  runs/prc_bench \
  --out        runs/prc_bench/predictions.jsonl \
  --agent llm --model deepseek-chat \
  --limit-papers 1 --limit-questions 3            # smoke test

The output is a JSONL file whose rows match what PRC-Bench/eval.py --pred_path consumes: {id, part_idx, question, category, gen_answer}. Per-question audit traces are written under runs/prc_bench/episodes/<id>/q<NNN>/{trace.json, final.json}. For Claim_Verification questions, the agent is instructed to answer True or False, and a post-processor coerces common synonyms back to that vocabulary.
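
A minimal sketch of the row format and the True/False coercion described above. The five field names come from this README. The synonym sets, the field value types, and the helper names `coerce_claim_answer` and `prediction_row` are assumptions for illustration, and the project's real post-processor may differ.

```python
import json

# Assumed synonym sets; the real post-processor's vocabulary is not shown here.
TRUE_WORDS = {"true", "yes", "correct", "supported"}
FALSE_WORDS = {"false", "no", "incorrect", "refuted"}


def coerce_claim_answer(raw: str) -> str:
    """Map common synonyms to 'True' or 'False'; leave anything else as-is."""
    token = raw.strip().strip(".!").lower()
    if token in TRUE_WORDS:
        return "True"
    if token in FALSE_WORDS:
        return "False"
    return raw.strip()


def prediction_row(id_, part_idx, question: str, category: str, answer: str) -> str:
    """Serialize one prediction as a single JSONL line."""
    if category == "Claim_Verification":
        answer = coerce_claim_answer(answer)
    row = {
        "id": id_,
        "part_idx": part_idx,
        "question": question,
        "category": category,
        "gen_answer": answer,
    }
    return json.dumps(row, ensure_ascii=False)
```

Answers that match neither synonym set are passed through unchanged, so an unexpected reply stays visible in the predictions file instead of being silently forced to a label.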
