Add evaluation: RAG vs native LLM file reading benchmark (500 PDFs) #133
adorosario wants to merge 1 commit into NirDiamant:main from
Conversation
Adds an empirical benchmark comparing native Claude Code file reading vs RAG retrieval across 500 PDFs. Includes interactive charts for scaling curves, cost comparison, and hallucination rates. Data from the open-source benchmark at github.com/adorosario/customgpt-rag-plugin-benchmarking (MIT licensed). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

A new Jupyter notebook for benchmarking native file reading versus RAG approaches has been added. The notebook installs dependencies, defines benchmark datasets across multiple document tiers, generates four comparative visualizations, and provides performance summaries with identified limitations.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.15.9) on evaluation/rag_vs_native_file_reading_benchmark.ipynb: Unexpected end of JSON input
Actionable comments posted: 3
🧹 Nitpick comments (1)
evaluation/rag_vs_native_file_reading_benchmark.ipynb (1)
223-224: Derive displayed metrics from df_native / rag_results instead of hardcoding. Hardcoded values (4.2×, $0.40, $0.13) can silently drift from the source table if data is updated.

♻️ Proposed refactor
```diff
+# Derive 500-doc metrics from source data
+native_500_time = float(df_native.loc[df_native['Documents'] == 500, 'Avg Wait (sec)'].iloc[0])
+rag_time = float(rag_results['Avg Wait (sec)'])
+speedup = native_500_time / rag_time
+
-ax.annotate(f'4.2× faster',
-            xy=(500, 36), xytext=(250, 50),
+ax.annotate(f'{speedup:.1f}× faster',
+            xy=(500, rag_time), xytext=(250, 50),
             arrowprops=dict(arrowstyle='->', color='#27ae60'),
             fontsize=11, color='#27ae60', fontweight='bold')
```

```diff
+# Derive cost values from source data
+native_500_cost = float(df_native.loc[df_native['Documents'] == 500, 'Cost per Query ($)'].iloc[0])
+rag_cost = float(rag_results['Cost per Query ($)'])
-ax.bar(['Native\n(500 docs)'], [0.40], color='#e74c3c', width=0.4, label='Native Claude Code')
-ax.bar(['With RAG\n(500 docs)'], [0.13], color='#27ae60', width=0.4, label='With RAG')
+ax.bar(['Native\n(500 docs)'], [native_500_cost], color='#e74c3c', width=0.4, label='Native Claude Code')
+ax.bar(['With RAG\n(500 docs)'], [rag_cost], color='#27ae60', width=0.4, label='With RAG')
```

Also applies to: 247-248
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 223 - 224, The plot currently hardcodes annotations (e.g., "4.2× faster", "$0.40", "$0.13") instead of deriving them from the dataframes; update the annotation code that calls ax.annotate to compute the displayed strings from df_native and rag_results (use the relevant aggregate values or computed ratios from those DataFrame rows/columns), format them (e.g., f"{ratio:.1f}×" and currency strings) and then pass the computed strings to ax.annotate so the labels always reflect the source data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb`:
- Around line 283-284: The chart currently uses a single midpoint value for
'Fabricates Answer (%)' (75) which hides the reported 50–100% uncertainty;
update the data and plotting calls to represent the interval explicitly (e.g.,
store the range as [50,100] or store center+error and use error bars) for the
keys 'Fabricates Answer (%)' and 'Correctly Refuses (%)', and change the
plotting logic to draw the interval (shaded band or errorbar) instead of a
single point; apply the same change to the second occurrence noted (the block
around the other data assignment) so both chart instances show the uncertainty
range rather than a midpoint.
- Line 7: This notebook markdown cell contains tracked link wrappers and a
tracking endpoint (e.g. any URL pointing to
europe-west1-rag-techniques-views-tracker.cloudfunctions.net) that call external
analytics on open; remove those wrappers and any tracking-pixel image calls and
replace them with the direct target URLs (e.g. the direct Amazon shortlink
https://amzn.to/4cvxqSw) or plain text links, ensuring the text/anchor (the book
promotion heading and the "Get the book on Amazon" link) remain but reference
only the direct destination without any tracking proxies or external analytics
calls.
- Around line 22-23: Decide on a single, consistent description for the "Native
file reading" mode and update both occurrences to match; specifically, either
keep the original phrasing "1. **Native file reading** — Claude Code searches
files on its own using its built-in tools (grep, cat, read)" or replace it with
an explicit statement that no built-in grep is used. Then edit the later note
that currently says "no explicit grep tools were given" (the text around lines
referencing the RAG comparison) so it matches the chosen behavior (e.g., "native
mode uses Claude Code's built-in tools (grep, cat, read) and does not use a
pre-indexed RAG layer" or "native mode does not use grep; it relies solely on
file reads without tooling"); ensure both the initial bullet and the later
explanatory sentence use the exact same wording to remove the contradiction.
---
Nitpick comments:
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb`:
- Around line 223-224: The plot currently hardcodes annotations (e.g., "4.2×
faster", "$0.40", "$0.13") instead of deriving them from the dataframes; update
the annotation code that calls ax.annotate to compute the displayed strings from
df_native and rag_results (use the relevant aggregate values or computed ratios
from those DataFrame rows/columns), format them (e.g., f"{ratio:.1f}×" and
currency strings) and then pass the computed strings to ax.annotate so the
labels always reflect the source data.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ea5ec042-07d9-457b-842e-53e5df4b0b5a
📒 Files selected for processing (1)
evaluation/rag_vs_native_file_reading_benchmark.ipynb
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## \ud83d\udcd6 [The RAG Techniques Book is HERE](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=evaluation--rag-vs-native-file-reading-benchmark&click=book-buy-amazon&target=https%3A%2F%2Famzn.to%2F4cvxqSw&text=)\n\n**The super extended version of this repository.** The book goes far beyond the notebooks: the **intuition** behind every technique, **side-by-side comparisons** showing when each approach wins (and when it quietly fails), and **illustrations** that make the tricky parts finally click.\n\n\u23f3 **Launch window only: $0.99.** The price goes up once the launch ends, and readers who grab it now lock in the lowest price it will ever have.\n\n### \ud83d\udc49 [Get the book on Amazon before the price changes](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=evaluation--rag-vs-native-file-reading-benchmark&click=book-buy-amazon&target=https%3A%2F%2Famzn.to%2F4cvxqSw&text=)\n\n---\n" |
Remove tracking endpoints from notebook content.
The tracked link wrappers and tracking-pixel image call external analytics endpoints on open/click, which introduces a privacy/compliance risk for readers of a public technical notebook.
Also applies to: 399-399
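If it helps the fix, here is a hypothetical cleanup sketch (not part of the PR) that rewrites the tracked links back to their direct targets. The notebook path and tracker host come from this review; the function names, regex, and overall approach are illustrative assumptions:

```python
import json
import re
from urllib.parse import urlparse, parse_qs

# Assumption: any link pointing at this host is a tracking wrapper with a 'target' query param
TRACKER_HOST = "europe-west1-rag-techniques-views-tracker.cloudfunctions.net"

def untrack(url: str) -> str:
    """Return the wrapped 'target' URL if this is a tracker link, else the URL unchanged."""
    parsed = urlparse(url)
    if parsed.netloc == TRACKER_HOST:
        return parse_qs(parsed.query).get("target", [url])[0]
    return url

notebook_path = "evaluation/rag_vs_native_file_reading_benchmark.ipynb"
with open(notebook_path) as f:
    nb = json.load(f)

# Rewrite markdown link destinations: [text](url) -> [text](direct url)
link_re = re.compile(r"\((https?://[^)]+)\)")
def repl(match: re.Match) -> str:
    return f"({untrack(match.group(1))})"

for cell in nb["cells"]:
    if cell["cell_type"] != "markdown":
        continue
    src = cell["source"]
    if isinstance(src, list):
        cell["source"] = [link_re.sub(repl, line) for line in src]
    else:
        cell["source"] = link_re.sub(repl, src)

with open(notebook_path, "w") as f:
    json.dump(nb, f, indent=1)
```

Tracking-pixel image calls would still need to be deleted by hand; the script only unwraps link destinations.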
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` at line 7, This
notebook markdown cell contains tracked link wrappers and a tracking endpoint
(e.g. any URL pointing to
europe-west1-rag-techniques-views-tracker.cloudfunctions.net) that call external
analytics on open; remove those wrappers and any tracking-pixel image calls and
replace them with the direct target URLs (e.g. the direct Amazon shortlink
https://amzn.to/4cvxqSw) or plain text links, ensuring the text/anchor (the book
promotion heading and the "Get the book on Amazon" link) remain but reference
only the direct destination without any tracking proxies or external analytics
calls.
| "1. **Native file reading** \u2014 Claude Code searches files on its own using its built-in tools (grep, cat, read)\n", | ||
| "2. **With a RAG layer** \u2014 documents are pre-indexed, and a retrieval layer fetches relevant chunks before the LLM answers\n", |
Resolve the grep-tool methodology contradiction.
Line 22 says native mode used built-in grep, while Lines 361–362 say no explicit grep tools were given. This conflict weakens the benchmark narrative and needs a single consistent description.
Also applies to: 361-362
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 22 - 23,
Decide on a single, consistent description for the "Native file reading" mode
and update both occurrences to match; specifically, either keep the original
phrasing "1. **Native file reading** — Claude Code searches files on its own
using its built-in tools (grep, cat, read)" or replace it with an explicit
statement that no built-in grep is used. Then edit the later note that currently
says "no explicit grep tools were given" (the text around lines referencing the
RAG comparison) so it matches the chosen behavior (e.g., "native mode uses
Claude Code's built-in tools (grep, cat, read) and does not use a pre-indexed
RAG layer" or "native mode does not use grep; it relies solely on file reads
without tooling"); ensure both the initial bullet and the later explanatory
sentence use the exact same wording to remove the contradiction.
| " 'Fabricates Answer (%)': [75, 0], # midpoint of 50-100% range for native\n", | ||
| " 'Correctly Refuses (%)': [25, 100],\n", |
Represent hallucination uncertainty explicitly in the chart.
Using a single midpoint (75%) for a reported 50–100% range can be misread as a measured point estimate. Prefer plotting the interval (or error bars) directly to avoid over-precision.
Also applies to: 311-313
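One way this could look with matplotlib error bars, as a minimal sketch: the 50-100% native range and the 0% RAG value are taken from the review discussion, while the labels, colors, and midpoint-plus-half-width encoding are assumptions.

```python
import matplotlib.pyplot as plt

# Reported fabrication ranges as (low, high) percentages, not point estimates
fabricates = {'Native (500 docs)': (50, 100), 'With RAG (500 docs)': (0, 0)}

labels = list(fabricates.keys())
centers = [(lo + hi) / 2 for lo, hi in fabricates.values()]      # bar height = interval midpoint
half_widths = [(hi - lo) / 2 for lo, hi in fabricates.values()]  # error bar = half the interval

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(labels, centers, yerr=half_widths, capsize=8, color=['#e74c3c', '#27ae60'])
ax.set_ylabel('Fabricates Answer (%)')
ax.set_ylim(0, 100)
ax.set_title('Fabrication rate shown as a 50-100% interval, not a point estimate')
plt.tight_layout()
plt.show()
```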
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 283 -
284, The chart currently uses a single midpoint value for 'Fabricates Answer
(%)' (75) which hides the reported 50–100% uncertainty; update the data and
plotting calls to represent the interval explicitly (e.g., store the range as
[50,100] or store center+error and use error bars) for the keys 'Fabricates
Answer (%)' and 'Correctly Refuses (%)', and change the plotting logic to draw
the interval (shaded band or errorbar) instead of a single point; apply the same
change to the second occurrence noted (the block around the other data
assignment) so both chart instances show the uncertainty range rather than a
midpoint.
NirDiamant left a comment
Thanks for this contribution - a benchmark comparing RAG vs native LLM file reading is valuable content for the repo.
A few things to check before merging:
- Does the notebook follow the existing format? (see any current notebook for the template - markdown headers, code cells with comments, results visualization)
- Are all dependencies listed at the top of the notebook?
- Is the dataset or a sample included, or are there clear instructions for reproducing?
Will review the code more closely soon.
What this adds
A new evaluation notebook (evaluation/rag_vs_native_file_reading_benchmark.ipynb) with an empirical benchmark comparing native LLM file reading vs RAG retrieval.

The benchmark
Claude Code (Sonnet 4.6) was given 500 corporate PDFs and asked 10 factual questions (5 needle-in-haystack, 5 pattern-matching) under two configurations: native file reading and with a RAG layer.
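For readers skimming the methodology, a rough sketch of what the per-configuration scoring loop looks like conceptually; all names and the substring-match scoring rule are illustrative, not the benchmark's actual code, which lives in the linked repo.

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    question: str
    answer: str
    latency_sec: float
    cost_usd: float

def score_run(results: list[QueryResult], ground_truth: dict[str, str]) -> dict:
    """Aggregate accuracy, average latency, and average cost for one configuration."""
    correct = sum(
        1 for r in results
        if ground_truth[r.question].lower() in r.answer.lower()
    )
    n = len(results)
    return {
        "accuracy": correct / n,
        "avg_latency_sec": sum(r.latency_sec for r in results) / n,
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
    }

# native_results / rag_results would each hold 10 QueryResult objects, produced by
# running the same question set against the 500-PDF corpus once with native file
# reading and once with the RAG layer enabled, then compared via score_run().
```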
Key findings:
What's in the notebook
Why this fits the repo
The RAG_Techniques community regularly asks "do I still need RAG with larger context windows?" This notebook answers that question with empirical data. It complements the existing evaluation notebooks (DeepEval, GroUSE) by measuring a different dimension: when does native file reading break down and retrieval become necessary?
Format
Dependencies: pandas, matplotlib (no API keys needed). Raw data, scoring logic, and reproduction scripts are MIT licensed at the benchmark repo.