
Add evaluation: RAG vs native LLM file reading benchmark (500 PDFs)#133

Open
adorosario wants to merge 1 commit into NirDiamant:main from adorosario:add-rag-vs-native-evaluation

Conversation

adorosario commented Apr 9, 2026

What this adds

A new evaluation notebook (evaluation/rag_vs_native_file_reading_benchmark.ipynb) with an empirical benchmark comparing native LLM file reading vs RAG retrieval.

The benchmark

Claude Code (Sonnet 4.6) was given 500 corporate PDFs and asked 10 factual questions (5 needle-in-haystack, 5 pattern-matching) under two configurations: native file reading and with a RAG layer.

Key findings:

| Metric | Native (500 docs) | With RAG (500 docs) |
| --- | --- | --- |
| Avg response time | 2 min 31 sec | 36 sec (4.2× faster) |
| Cost per query | $0.40 | $0.13 (3.2× cheaper) |
| Completion in 3 min | 39% | 100% |
| Hallucination on missing info | 50–100% | 0% (correctly refuses) |

What's in the notebook

  • Interactive charts: scaling cliff, response time comparison, cost comparison, hallucination rates
  • Analysis of the crossover point (~50-100 documents)
  • Honest limitations table (vendor benchmark, synthetic corpus, single model, etc.)
  • All data sourced from the open-source benchmark repo: adorosario/customgpt-rag-plugin-benchmarking

Why this fits the repo

The RAG_Techniques community regularly asks "do I still need RAG with larger context windows?" This notebook answers that question with empirical data. It complements the existing evaluation notebooks (DeepEval, GroUSE) by measuring a different dimension: when does native file reading break down and retrieval become necessary?

Format

  • Follows the repo's notebook structure (book promo cell, markdown explanations before each code cell, tracking pixel)
  • Only dependencies: pandas, matplotlib (no API keys needed)
  • Data is hardcoded from the published results — notebook runs without any external setup

Raw data, scoring logic, and reproduction scripts are MIT licensed at the benchmark repo.
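Since the notebook hardcodes the published results, the data setup can be sketched roughly as follows. This is a minimal sketch: the column names and DataFrame layout are illustrative assumptions, not the notebook's actual code, and the values are taken from the findings table above.

```python
import pandas as pd

# Published benchmark results, hardcoded (column names here are
# illustrative assumptions, not the notebook's actual schema).
results = pd.DataFrame({
    "Config": ["Native (500 docs)", "With RAG (500 docs)"],
    "Avg response time (sec)": [151, 36],   # 2 min 31 sec vs 36 sec
    "Cost per query ($)": [0.40, 0.13],
    "Completion in 3 min (%)": [39, 100],
})

# Deriving the speedup from the data (rather than hardcoding "4.2x")
# keeps the headline number in sync if the table is ever updated.
speedup = (results.loc[0, "Avg response time (sec)"]
           / results.loc[1, "Avg response time (sec)"])
print(f"RAG is {speedup:.1f}x faster at 500 docs")  # → RAG is 4.2x faster at 500 docs
```

Deriving the ratio at runtime is also what the automated review below suggests, so the annotation can never drift from the source table.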

Summary by CodeRabbit

  • New Features
    • Added an evaluation notebook with comprehensive benchmarks comparing RAG and native file reading approaches, including performance metrics, response-time analysis, cost comparisons, and behavior analysis across multiple document tiers.

Adds an empirical benchmark comparing native Claude Code file reading
vs RAG retrieval across 500 PDFs. Includes interactive charts for
scaling curves, cost comparison, and hallucination rates.

Data from the open-source benchmark at
github.com/adorosario/customgpt-rag-plugin-benchmarking (MIT licensed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai Bot commented Apr 9, 2026

📝 Walkthrough

A new Jupyter notebook for benchmarking native file reading versus RAG approaches has been added. The notebook installs dependencies, defines benchmark datasets across multiple document tiers, generates four comparative visualizations, and provides performance summaries with identified limitations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Benchmark Evaluation Notebook<br>`evaluation/rag_vs_native_file_reading_benchmark.ipynb` | New notebook containing benchmark datasets comparing native file reading and RAG approaches. Includes dependency installation, four visualizations (completion-rate scaling, response-time scaling, cost comparison, hallucination vs. refusal analysis), per-tier performance summaries, status categorization, crossover zone identification, and limitations/takeaways sections. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit hops through benchmark trees,
Comparing native reads with RAG with ease!
Charts and graphs spring to life so bright,
Document tiers dancing in the light—
Performance secrets now laid bare,
A thorough analysis crafted with care! 📊✨

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding a new evaluation notebook comparing RAG versus native LLM file reading with specific benchmark parameters (500 PDFs). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.15.9)
evaluation/rag_vs_native_file_reading_benchmark.ipynb

Unexpected end of JSON input



coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
evaluation/rag_vs_native_file_reading_benchmark.ipynb (1)

223-224: Derive displayed metrics from df_native/rag_results instead of hardcoding.

Hardcoded values (4.2×, $0.40, $0.13) can silently drift from the source table if data is updated.

♻️ Proposed refactor
+# Derive 500-doc metrics from source data
+native_500_time = float(df_native.loc[df_native['Documents'] == 500, 'Avg Wait (sec)'].iloc[0])
+rag_time = float(rag_results['Avg Wait (sec)'])
+speedup = native_500_time / rag_time
+
-ax.annotate(f'4.2× faster',
-            xy=(500, 36), xytext=(250, 50),
+ax.annotate(f'{speedup:.1f}× faster',
+            xy=(500, rag_time), xytext=(250, 50),
             arrowprops=dict(arrowstyle='->', color='#27ae60'),
             fontsize=11, color='#27ae60', fontweight='bold')
+# Derive cost values from source data
+native_500_cost = float(df_native.loc[df_native['Documents'] == 500, 'Cost per Query ($)'].iloc[0])
+rag_cost = float(rag_results['Cost per Query ($)'])
-ax.bar(['Native\n(500 docs)'], [0.40], color='#e74c3c', width=0.4, label='Native Claude Code')
-ax.bar(['With RAG\n(500 docs)'], [0.13], color='#27ae60', width=0.4, label='With RAG')
+ax.bar(['Native\n(500 docs)'], [native_500_cost], color='#e74c3c', width=0.4, label='Native Claude Code')
+ax.bar(['With RAG\n(500 docs)'], [rag_cost], color='#27ae60', width=0.4, label='With RAG')

Also applies to: 247-248

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 223 -
224, The plot currently hardcodes annotations (e.g., "4.2× faster", "$0.40",
"$0.13") instead of deriving them from the dataframes; update the annotation
code that calls ax.annotate to compute the displayed strings from df_native and
rag_results (use the relevant aggregate values or computed ratios from those
DataFrame rows/columns), format them (e.g., f"{ratio:.1f}×" and currency
strings) and then pass the computed strings to ax.annotate so the labels always
reflect the source data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ea5ec042-07d9-457b-842e-53e5df4b0b5a

📥 Commits

Reviewing files that changed from the base of the PR and between 788cd74 and 1e3bd14.

📒 Files selected for processing (1)
  • evaluation/rag_vs_native_file_reading_benchmark.ipynb

Comment on line +7

"cell_type": "markdown",
"metadata": {},
"source": [
"## \ud83d\udcd6 [The RAG Techniques Book is HERE](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=evaluation--rag-vs-native-file-reading-benchmark&click=book-buy-amazon&target=https%3A%2F%2Famzn.to%2F4cvxqSw&text=)\n\n**The super extended version of this repository.** The book goes far beyond the notebooks: the **intuition** behind every technique, **side-by-side comparisons** showing when each approach wins (and when it quietly fails), and **illustrations** that make the tricky parts finally click.\n\n\u23f3 **Launch window only: $0.99.** The price goes up once the launch ends, and readers who grab it now lock in the lowest price it will ever have.\n\n### \ud83d\udc49 [Get the book on Amazon before the price changes](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=evaluation--rag-vs-native-file-reading-benchmark&click=book-buy-amazon&target=https%3A%2F%2Famzn.to%2F4cvxqSw&text=)\n\n---\n"

⚠️ Potential issue | 🟠 Major

Remove tracking endpoints from notebook content.

The tracked link wrappers and tracking-pixel image call external analytics endpoints on open/click, which introduces a privacy/compliance risk for readers of a public technical notebook.

Also applies to: 399-399

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` at line 7, This
notebook markdown cell contains tracked link wrappers and a tracking endpoint
(e.g. any URL pointing to
europe-west1-rag-techniques-views-tracker.cloudfunctions.net) that call external
analytics on open; remove those wrappers and any tracking-pixel image calls and
replace them with the direct target URLs (e.g. the direct Amazon shortlink
https://amzn.to/4cvxqSw) or plain text links, ensuring the text/anchor (the book
promotion heading and the "Get the book on Amazon" link) remain but reference
only the direct destination without any tracking proxies or external analytics
calls.

Comment on lines +22 to +23
"1. **Native file reading** \u2014 Claude Code searches files on its own using its built-in tools (grep, cat, read)\n",
"2. **With a RAG layer** \u2014 documents are pre-indexed, and a retrieval layer fetches relevant chunks before the LLM answers\n",

⚠️ Potential issue | 🟠 Major

Resolve the grep-tool methodology contradiction.

Line 22 says native mode used built-in grep, while Lines 361–362 say no explicit grep tools were given. This conflict weakens the benchmark narrative and needs a single consistent description.

Also applies to: 361-362

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 22 - 23,
Decide on a single, consistent description for the "Native file reading" mode
and update both occurrences to match; specifically, either keep the original
phrasing "1. **Native file reading** — Claude Code searches files on its own
using its built-in tools (grep, cat, read)" or replace it with an explicit
statement that no built-in grep is used. Then edit the later note that currently
says "no explicit grep tools were given" (the text around lines referencing the
RAG comparison) so it matches the chosen behavior (e.g., "native mode uses
Claude Code's built-in tools (grep, cat, read) and does not use a pre-indexed
RAG layer" or "native mode does not use grep; it relies solely on file reads
without tooling"); ensure both the initial bullet and the later explanatory
sentence use the exact same wording to remove the contradiction.

Comment on lines +283 to +284
" 'Fabricates Answer (%)': [75, 0], # midpoint of 50-100% range for native\n",
" 'Correctly Refuses (%)': [25, 100],\n",

⚠️ Potential issue | 🟡 Minor

Represent hallucination uncertainty explicitly in the chart.

Using a single midpoint (75%) for a reported 50–100% range can be misread as a measured point estimate. Prefer plotting the interval (or error bars) directly to avoid over-precision.

Also applies to: 311-313
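One way to implement this suggestion is matplotlib's asymmetric error bars. The sketch below is an illustration under the reviewer's proposal, not the notebook's actual code; the labels, colors, and output filename are assumptions, while the 50–100% range and the 0% RAG value come from the review thread.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# The native fabrication rate was reported as a 50-100% range, not a
# point estimate, so plot the midpoint with asymmetric error bars
# spanning the full interval instead of a bare single value.
labels = ["Native\n(500 docs)", "With RAG\n(500 docs)"]
midpoints = [75, 0]          # midpoint of 50-100% for native; 0% for RAG
lower_err = [75 - 50, 0]     # whisker down to 50% (no range for RAG)
upper_err = [100 - 75, 0]    # whisker up to 100%

fig, ax = plt.subplots()
ax.bar(labels, midpoints, color=["#e74c3c", "#27ae60"], width=0.4)
ax.errorbar(labels, midpoints, yerr=[lower_err, upper_err],
            fmt="none", ecolor="black", capsize=8)
ax.set_ylabel("Fabricates answer (%)")
ax.set_ylim(0, 110)
fig.savefig("hallucination_range.png")
```

The whiskers make it visually explicit that 75% is a midpoint of an observed range rather than a measured point estimate.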

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluation/rag_vs_native_file_reading_benchmark.ipynb` around lines 283 -
284, The chart currently uses a single midpoint value for 'Fabricates Answer
(%)' (75) which hides the reported 50–100% uncertainty; update the data and
plotting calls to represent the interval explicitly (e.g., store the range as
[50,100] or store center+error and use error bars) for the keys 'Fabricates
Answer (%)' and 'Correctly Refuses (%)', and change the plotting logic to draw
the interval (shaded band or errorbar) instead of a single point; apply the same
change to the second occurrence noted (the block around the other data
assignment) so both chart instances show the uncertainty range rather than a
midpoint.

NirDiamant (Owner) left a comment

Thanks for this contribution - a benchmark comparing RAG vs native LLM file reading is valuable content for the repo.

A few things to check before merging:

  1. Does the notebook follow the existing format? (see any current notebook for the template - markdown headers, code cells with comments, results visualization)
  2. Are all dependencies listed at the top of the notebook?
  3. Is the dataset or a sample included, or are there clear instructions for reproducing?

Will review the code more closely soon.
