GitHub - StarTrail-org/PixelRAG: The end of web parsing. The beginning of scalable pixel-native search.

Official codebase for PixelRAG: Visually Grounded Retrieval-Augmented Generation with Screenshot Rendering

Yichuan Wang*, Zhifei Li*, Zirui Wang, Paul Teiletche, Lesheng Jin
Matei Zaharia†, Joseph E. Gonzalez†, Sewon Min†

_{* Equal contribution † Equal advising}
_{Work done at Berkeley SkyLab & BAIR & Berkeley NLP}

Search any document by how it looks, not just the text it contains.

What it is · Give Claude eyes · How it works · Pipelines

pip install pixelrag

The two core operations — render a page to screenshots, search a visual index:

# Render any page or document to screenshot tiles
pixelshot https://en.wikipedia.org/wiki/Python --output ./tiles

# Search a hosted index of 8.28M Wikipedia pages — no setup, runs against the live API
curl -X POST http://api.pixelrag.ai:30001/search \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'

Live, hosted endpoint — http://api.pixelrag.ai:30001 serves a pre-built index of 8.28M Wikipedia pages. No setup, no API key. It even takes an image as the query (visual search) — see the API reference →.

Or try it in the browser at pixelrag.ai, or run the demo notebook (renders + searches, with the images inline).

What it is

PixelRAG renders documents — web pages, PDFs, images — as screenshots and retrieves over the images directly. Visual structure that HTML parsing throws away — tables, charts, layout, infographics — stays intact, so the reader model can actually answer questions about it. Wikipedia's 8.28M articles ship as a pre-built index; the pipeline itself is general-purpose.

Give Claude eyes

The renderer also ships as a Claude Code plugin — the pixelbrowse skill. Instead of fetching raw HTML, Claude screenshots a page with pixelshot and reads the image, so it sees charts, diagrams, tables, and layout the way a person does.

Install it — no clone needed (pixelshot comes from pip install pixelrag):

pip install pixelrag                                # provides the pixelshot command
claude plugin marketplace add StarTrail-org/PixelRAG
claude plugin install pixelbrowse@pixelrag-plugins

Then just ask Claude to look at a page:

claude -p "screenshot https://news.ycombinator.com and summarize the top stories"
claude -p "screenshot https://arxiv.org/abs/2404.12387 and explain the key findings"

Or use the slash command in an interactive session: /screenshot https://example.com. No MCP server, no backend: the skill just calls pixelshot (Playwright/CDP) on your machine.

How it works

Text-based RAG parses the page to text chunks and loses the table — the reader can't find the answer. PixelRAG renders the page to screenshot tiles, retrieves the right tile, and the reader reads the number straight off the image.

Two pieces make this work: (1) rendering documents to images instead of parsing them to text, and (2) a Qwen3-VL-Embedding model, LoRA-fine-tuned on screenshot data, that embeds page images into a space where visual content is retrievable.

Pipelines

Capture is the standalone pixelshot command; the rest of the pipeline runs through the pixelrag umbrella — pixelrag <stage>. Install only the stages you need:

Command	What it does	Install
`pixelshot`	Document → image tiles (Playwright CDP, PDF)	`pip install pixelrag`
`pixelrag chunk` · `embed` · `build-index`	Tiles → vectors → FAISS index	`pip install 'pixelrag[embed]'`
`pixelrag index`	Orchestrates the full pipeline: source → ingest → embed → index	`pip install 'pixelrag[index]'`
`pixelrag serve`	FAISS search API (FastAPI, CPU or GPU)	`pip install 'pixelrag[serve]'`

render ←── index ──→ embed       serve (independent)       train → serve (HTTP)

train is a separate uv project with its own pinned env (torch==2.9.1+cu129, transformers==4.57.1, cuDNN 9.20) — install it from inside train/, not from the root.

Search a pre-built index

pip install 'pixelrag[serve]'

# Download a pre-built index from Hugging Face. The dataset repo holds four FAISS indexes
# (base/LoRA Wikipedia pixel, Wikipedia text, news pixel); grab just the base one (~217G) here.
huggingface-cli download StarTrail-org/pixelrag-faiss-indexes \
  --repo-type dataset --include "search_index_normed_v2/*" --local-dir ./index

# Serve, then query
pixelrag serve --index-dir ./index/search_index_normed_v2 --port 30001

curl -X POST http://localhost:30001/search \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'

Build an index from your own documents

pip install 'pixelrag[index]'

# Create pixelrag.yaml
cat > pixelrag.yaml << 'EOF'
source:
  type: local
  path: ./my_docs

embed:
  model: Qwen/Qwen3-VL-Embedding-2B
  device: cuda
  gpu_ids: [0]

output: ./my_index
EOF

# Build, then serve
pixelrag index build
pixelrag serve --index-dir ./my_index --port 30001

Render a page programmatically

from pixelrag_render import render_url

# render a single page to tiles — e.g. for an agent to read
tiles = render_url("https://en.wikipedia.org/wiki/Python", "./tiles")

Embed tools (standalone)

Each stage runs independently, without the orchestrator:

pip install 'pixelrag[embed]'

pixelrag chunk --tiles-dir ./tiles
pixelrag embed --shard-dir ./tiles --output-dir ./embeddings --gpu-ids 0,1
pixelrag build-index --embeddings-dir ./embeddings --output-dir ./index

Training

Fine-tuning lives in train/ — a separate uv project (wiki-screenshot-training) with its own pinned env. It LoRA-fine-tunes Qwen/Qwen3-VL-Embedding-2B for webpage retrieval; run it from inside train/ (cd train && uv sync). See train/README.md for the full recipe.

You don't need to retrain to use the model — the trained adapters are published at Chrisyichuan/wiki-screenshot-embedding-lora.

We also release the full training set (Chrisyichuan/screenshot-training-natural-filtered-v2), so you can adapt other backbones yourself — a larger Qwen, or any other embedding model.

Acknowledgments

Thanks to Rulin Shao for support.

Thanks also to Claude Code and OpenAI Codex for supporting open-source contributors with credits and plans, which we earned by working on LEANN.

This work is done by the Berkeley Sky Computing Lab, BAIR, and the Berkeley NLP Group.

License

Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.claude-plugin		.claude-plugin
.github/workflows		.github/workflows
assets		assets
chromium		chromium
demos		demos
deploy		deploy
docs		docs
embed/src/pixelrag_embed		embed/src/pixelrag_embed
eval		eval
history		history
index/src/pixelrag_index		index/src/pixelrag_index
plugin		plugin
render/src/pixelrag_render		render/src/pixelrag_render
serve/src/pixelrag_serve		serve/src/pixelrag_serve
skill		skill
src/pixelrag		src/pixelrag
status-site		status-site
tests		tests
train		train
web		web
.gitattributes		.gitattributes
.gitignore		.gitignore
.upptimerc.yml		.upptimerc.yml
LICENSE		LICENSE
README.md		README.md
RELEASING.md		RELEASING.md
chromium-screenshot-patches.diff		chromium-screenshot-patches.diff
package-lock.json		package-lock.json
package.json		package.json
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What it is

Give Claude eyes

How it works

Pipelines

Search a pre-built index

Build an index from your own documents

Render a page programmatically

Embed tools (standalone)

Training

Acknowledgments

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

What it is

Give Claude eyes

How it works

Pipelines

Search a pre-built index

Build an index from your own documents

Render a page programmatically

Embed tools (standalone)

Training

Acknowledgments

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages