Add evaluation: RAG vs native LLM file reading benchmark (500 PDFs) #133
adorosario wants to merge 1 commit into NirDiamant:main from
Conversation
Adds an empirical benchmark comparing native Claude Code file reading vs RAG retrieval across 500 PDFs. Includes interactive charts for scaling curves, cost comparison, and hallucination rates. Data from the open-source benchmark at github.com/adorosario/customgpt-rag-plugin-benchmarking (MIT licensed). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

A new Jupyter notebook for benchmarking native file reading versus RAG approaches has been added. The notebook installs dependencies, defines benchmark datasets across multiple document tiers, generates four comparative visualizations, and provides performance summaries with identified limitations.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.15.9) on evaluation/rag_vs_native_file_reading_benchmark.ipynb: Unexpected end of JSON input
Actionable comments posted: 3
🧹 Nitpick comments (1)
evaluation/rag_vs_native_file_reading_benchmark.ipynb (1)
223-224: Derive displayed metrics from df_native / rag_results instead of hardcoding. Hardcoded values (4.2×, $0.40, $0.13) can silently drift from the source table if data is updated.

♻️ Proposed refactor
```diff
+# Derive 500-doc metrics from source data
+native_500_time = float(df_native.loc[df_native['Documents'] == 500, 'Avg Wait (sec)'].iloc[0])
+rag_time = float(rag_results['Avg Wait (sec)'])
+speedup = native_500_time / rag_time
+
-ax.annotate(f'4.2× faster',
-            xy=(500, 36), xytext=(250, 50),
+ax.annotate(f'{speedup:.1f}× faster',
+            xy=(500, rag_time), xytext=(250, 50),
             arrowprops=dict(arrowstyle='->', color='#27ae60'),
             fontsize=11, color='#27ae60', fontweight='bold')
```

```diff
+# Derive cost values from source data
+native_500_cost = float(df_native.loc[df_native['Documents'] == 500, 'Cost per Query ($)'].iloc[0])
+rag_cost = float(rag_results['Cost per Query ($)'])
-ax.bar(['Native\n(500 docs)'], [0.40], color='#e74c3c', width=0.4, label='Native Claude Code')
-ax.bar(['With RAG\n(500 docs)'], [0.13], color='#27ae60', width=0.4, label='With RAG')
+ax.bar(['Native\n(500 docs)'], [native_500_cost], color='#e74c3c', width=0.4, label='Native Claude Code')
+ax.bar(['With RAG\n(500 docs)'], [rag_cost], color='#27ae60', width=0.4, label='With RAG')
```

Also applies to: 247-248
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 223 - 224, The plot currently hardcodes annotations (e.g., "4.2× faster", "$0.40", "$0.13") instead of deriving them from the dataframes; update the annotation code that calls ax.annotate to compute the displayed strings from df_native and rag_results (use the relevant aggregate values or computed ratios from those DataFrame rows/columns), format them (e.g., f"{ratio:.1f}×" and currency strings) and then pass the computed strings to ax.annotate so the labels always reflect the source data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb`:
- Around line 283-284: The chart currently uses a single midpoint value for
'Fabricates Answer (%)' (75) which hides the reported 50–100% uncertainty;
update the data and plotting calls to represent the interval explicitly (e.g.,
store the range as [50,100] or store center+error and use error bars) for the
keys 'Fabricates Answer (%)' and 'Correctly Refuses (%)', and change the
plotting logic to draw the interval (shaded band or errorbar) instead of a
single point; apply the same change to the second occurrence noted (the block
around the other data assignment) so both chart instances show the uncertainty
range rather than a midpoint.
- Line 7: This notebook markdown cell contains tracked link wrappers and a
tracking endpoint (e.g. any URL pointing to
europe-west1-rag-techniques-views-tracker.cloudfunctions.net) that call external
analytics on open; remove those wrappers and any tracking-pixel image calls and
replace them with the direct target URLs (e.g. the direct Amazon shortlink
https://amzn.to/4cvxqSw) or plain text links, ensuring the text/anchor (the book
promotion heading and the "Get the book on Amazon" link) remain but reference
only the direct destination without any tracking proxies or external analytics
calls.
- Around line 22-23: Decide on a single, consistent description for the "Native
file reading" mode and update both occurrences to match; specifically, either
keep the original phrasing "1. **Native file reading** — Claude Code searches
files on its own using its built-in tools (grep, cat, read)" or replace it with
an explicit statement that no built-in grep is used. Then edit the later note
that currently says "no explicit grep tools were given" (the text around lines
referencing the RAG comparison) so it matches the chosen behavior (e.g., "native
mode uses Claude Code's built-in tools (grep, cat, read) and does not use a
pre-indexed RAG layer" or "native mode does not use grep; it relies solely on
file reads without tooling"); ensure both the initial bullet and the later
explanatory sentence use the exact same wording to remove the contradiction.
---
Nitpick comments:
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb`:
- Around line 223-224: The plot currently hardcodes annotations (e.g., "4.2×
faster", "$0.40", "$0.13") instead of deriving them from the dataframes; update
the annotation code that calls ax.annotate to compute the displayed strings from
df_native and rag_results (use the relevant aggregate values or computed ratios
from those DataFrame rows/columns), format them (e.g., f"{ratio:.1f}×" and
currency strings) and then pass the computed strings to ax.annotate so the
labels always reflect the source data.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ea5ec042-07d9-457b-842e-53e5df4b0b5a
📒 Files selected for processing (1)
evaluation/rag_vs_native_file_reading_benchmark.ipynb
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## \ud83d\udcd6 [The RAG Techniques Book is HERE](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=evaluation--rag-vs-native-file-reading-benchmark&click=book-buy-amazon&target=https%3A%2F%2Famzn.to%2F4cvxqSw&text=)\n\n**The super extended version of this repository.** The book goes far beyond the notebooks: the **intuition** behind every technique, **side-by-side comparisons** showing when each approach wins (and when it quietly fails), and **illustrations** that make the tricky parts finally click.\n\n\u23f3 **Launch window only: $0.99.** The price goes up once the launch ends, and readers who grab it now lock in the lowest price it will ever have.\n\n### \ud83d\udc49 [Get the book on Amazon before the price changes](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=evaluation--rag-vs-native-file-reading-benchmark&click=book-buy-amazon&target=https%3A%2F%2Famzn.to%2F4cvxqSw&text=)\n\n---\n" |
Remove tracking endpoints from notebook content.
The tracked link wrappers and tracking-pixel image call external analytics endpoints on open/click, which introduces a privacy/compliance risk for readers of a public technical notebook.
Also applies to: 399-399
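If it helps the fix, here is a hypothetical cleanup sketch (not part of the PR) that rewrites the tracked links back to their direct targets. The notebook path and tracker host come from this review; the function names, regex, and overall approach are illustrative assumptions:

```python
import json
import re
from urllib.parse import urlparse, parse_qs

# Assumption: any link pointing at this host is a tracking wrapper with a 'target' query param
TRACKER_HOST = "europe-west1-rag-techniques-views-tracker.cloudfunctions.net"

def untrack(url: str) -> str:
    """Return the wrapped 'target' URL if this is a tracker link, else the URL unchanged."""
    parsed = urlparse(url)
    if parsed.netloc == TRACKER_HOST:
        return parse_qs(parsed.query).get("target", [url])[0]
    return url

notebook_path = "evaluation/rag_vs_native_file_reading_benchmark.ipynb"
with open(notebook_path) as f:
    nb = json.load(f)

# Rewrite markdown link destinations: [text](url) -> [text](direct url)
link_re = re.compile(r"\((https?://[^)]+)\)")
def repl(match: re.Match) -> str:
    return f"({untrack(match.group(1))})"

for cell in nb["cells"]:
    if cell["cell_type"] != "markdown":
        continue
    src = cell["source"]
    if isinstance(src, list):
        cell["source"] = [link_re.sub(repl, line) for line in src]
    else:
        cell["source"] = link_re.sub(repl, src)

with open(notebook_path, "w") as f:
    json.dump(nb, f, indent=1)
```

Tracking-pixel image calls would still need to be deleted by hand; the script only unwraps link destinations.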
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` at line 7, This
notebook markdown cell contains tracked link wrappers and a tracking endpoint
(e.g. any URL pointing to
europe-west1-rag-techniques-views-tracker.cloudfunctions.net) that call external
analytics on open; remove those wrappers and any tracking-pixel image calls and
replace them with the direct target URLs (e.g. the direct Amazon shortlink
https://amzn.to/4cvxqSw) or plain text links, ensuring the text/anchor (the book
promotion heading and the "Get the book on Amazon" link) remain but reference
only the direct destination without any tracking proxies or external analytics
calls.
| "1. **Native file reading** \u2014 Claude Code searches files on its own using its built-in tools (grep, cat, read)\n", | ||
| "2. **With a RAG layer** \u2014 documents are pre-indexed, and a retrieval layer fetches relevant chunks before the LLM answers\n", |
Resolve the grep-tool methodology contradiction.
Line 22 says native mode used built-in grep, while Lines 361–362 say no explicit grep tools were given. This conflict weakens the benchmark narrative and needs a single consistent description.
Also applies to: 361-362
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 22 - 23,
Decide on a single, consistent description for the "Native file reading" mode
and update both occurrences to match; specifically, either keep the original
phrasing "1. **Native file reading** — Claude Code searches files on its own
using its built-in tools (grep, cat, read)" or replace it with an explicit
statement that no built-in grep is used. Then edit the later note that currently
says "no explicit grep tools were given" (the text around lines referencing the
RAG comparison) so it matches the chosen behavior (e.g., "native mode uses
Claude Code's built-in tools (grep, cat, read) and does not use a pre-indexed
RAG layer" or "native mode does not use grep; it relies solely on file reads
without tooling"); ensure both the initial bullet and the later explanatory
sentence use the exact same wording to remove the contradiction.
| " 'Fabricates Answer (%)': [75, 0], # midpoint of 50-100% range for native\n", | ||
| " 'Correctly Refuses (%)': [25, 100],\n", |
Represent hallucination uncertainty explicitly in the chart.
Using a single midpoint (75%) for a reported 50–100% range can be misread as a measured point estimate. Prefer plotting the interval (or error bars) directly to avoid over-precision.
Also applies to: 311-313
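One way this could look with matplotlib error bars, as a minimal sketch: the 50-100% native range and the 0% RAG value are taken from the review discussion, while the labels, colors, and midpoint-plus-half-width encoding are assumptions.

```python
import matplotlib.pyplot as plt

# Reported fabrication ranges as (low, high) percentages, not point estimates
fabricates = {'Native (500 docs)': (50, 100), 'With RAG (500 docs)': (0, 0)}

labels = list(fabricates.keys())
centers = [(lo + hi) / 2 for lo, hi in fabricates.values()]      # bar height = interval midpoint
half_widths = [(hi - lo) / 2 for lo, hi in fabricates.values()]  # error bar = half the interval

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(labels, centers, yerr=half_widths, capsize=8, color=['#e74c3c', '#27ae60'])
ax.set_ylabel('Fabricates Answer (%)')
ax.set_ylim(0, 100)
ax.set_title('Fabrication rate shown as a 50-100% interval, not a point estimate')
plt.tight_layout()
plt.show()
```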
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 283 -
284, The chart currently uses a single midpoint value for 'Fabricates Answer
(%)' (75) which hides the reported 50–100% uncertainty; update the data and
plotting calls to represent the interval explicitly (e.g., store the range as
[50,100] or store center+error and use error bars) for the keys 'Fabricates
Answer (%)' and 'Correctly Refuses (%)', and change the plotting logic to draw
the interval (shaded band or errorbar) instead of a single point; apply the same
change to the second occurrence noted (the block around the other data
assignment) so both chart instances show the uncertainty range rather than a
midpoint.
NirDiamant left a comment
Thanks for this contribution - a benchmark comparing RAG vs native LLM file reading is valuable content for the repo.
A few things to check before merging:
- Does the notebook follow the existing format? (see any current notebook for the template - markdown headers, code cells with comments, results visualization)
- Are all dependencies listed at the top of the notebook?
- Is the dataset or a sample included, or are there clear instructions for reproducing?
Will review the code more closely soon.
What this adds
A new evaluation notebook (evaluation/rag_vs_native_file_reading_benchmark.ipynb) with an empirical benchmark comparing native LLM file reading vs RAG retrieval.

The benchmark
Claude Code (Sonnet 4.6) was given 500 corporate PDFs and asked 10 factual questions (5 needle-in-haystack, 5 pattern-matching) under two configurations: native file reading and with a RAG layer.
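For readers skimming the methodology, a rough sketch of what the per-configuration scoring loop looks like conceptually; all names and the substring-match scoring rule are illustrative, not the benchmark's actual code, which lives in the linked repo.

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    question: str
    answer: str
    latency_sec: float
    cost_usd: float

def score_run(results: list[QueryResult], ground_truth: dict[str, str]) -> dict:
    """Aggregate accuracy, average latency, and average cost for one configuration."""
    correct = sum(
        1 for r in results
        if ground_truth[r.question].lower() in r.answer.lower()
    )
    n = len(results)
    return {
        "accuracy": correct / n,
        "avg_latency_sec": sum(r.latency_sec for r in results) / n,
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
    }

# native_results / rag_results would each hold 10 QueryResult objects, produced by
# running the same question set against the 500-PDF corpus once with native file
# reading and once with the RAG layer enabled, then compared via score_run().
```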
Key findings:
What's in the notebook
Why this fits the repo
The RAG_Techniques community regularly asks "do I still need RAG with larger context windows?" This notebook answers that question with empirical data. It complements the existing evaluation notebooks (DeepEval, GroUSE) by measuring a different dimension: when does native file reading break down and retrieval become necessary?
Format
Dependencies: pandas, matplotlib (no API keys needed). Raw data, scoring logic, and reproduction scripts are MIT licensed at the benchmark repo.