
Commit 35cc004

Rescope benchmark report to retrieval-only
- Drop the broken agent-pipeline section (775/776 cells failed with no_final_response on this branch); the agent fix is being landed separately in PR #1399 / issue #1381.
- Reframe scope and TL;DR around the retrieval probe.
- Replace committed-artifacts language and the runs/MANIFEST.md pointer with reproduction-from-CLI instructions. Run artifacts are no longer committed because (a) ~22 MB across four configs bloats clone size and (b) gold.json contains verbatim contract excerpts whose redistribution licensing is unsettled.
- Drop the unresolved 'TODO replace with link' placeholder.
- Remove the Agent row from the Configurations table.
1 parent 31fe44b commit 35cc004

1 file changed: docs/benchmarks/legalbench_rag_results.md (41 additions, 99 deletions)
@@ -4,8 +4,19 @@
 **Subsets**: all four (privacy_qa, contractnli, cuad, maud) — 194 tasks each
 **Sampling**: paper-faithful — `legalbenchrag/benchmark.py:46-58` `SORT_BY_DOCUMENT=True` (random key seeded by `test.snippets[0].file_path`), then truncate to first 194 per subset
 **Metrics**: paper-faithful — verbatim port of `legalbenchrag/run_benchmark.py:16-53` `QAResult.precision`/`.recall`, equivalence-tested against a vendored copy in `test_benchmarks.TestUpstreamEquivalence`
-**Last updated**: 2026-04-26
-**Run artifacts**: every number below resolves to one row in [`runs/MANIFEST.md`](./runs/MANIFEST.md)
+**Last updated**: 2026-04-28
+**Reproduction**: every number below is reproducible via the
+`python manage.py run_benchmark` CLI. Run artifacts are intentionally
+not committed to git — see [Reproduction](#reproduction) below.
+
+> **Scope**: this PR validates the **retrieval probe only**. An
+> end-to-end agent benchmark was attempted on this branch but uncovered
+> a production-pipeline bug (multi-tool-call → no final structured
+> output, `no_final_response` failure mode on 775 of 776 cells); that
+> work is being landed separately in PR #1399 / issue #1381. The probe
+> reads the same vector store the agent does, so retrieval numbers
+> remain valid; only the LLM-extraction pass and citation grounding
+> are deferred.

 ## TL;DR

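Two of the paper-faithful choices above are easier to see in code than in prose. The sampling rule reduces to "sort by a per-document random key, then truncate"; a minimal sketch of the idea, assuming each test exposes `snippets[0].file_path` as in the upstream schema (hypothetical helper, not the harness code):

```python
import random

def paper_faithful_sample(tests: list, n: int = 194) -> list:
    # SORT_BY_DOCUMENT=True: derive a stable pseudo-random sort key from
    # each test's first snippet's file path, so tasks from the same
    # document stay adjacent in the ordering, then truncate to n tasks.
    def doc_key(test) -> float:
        return random.Random(test.snippets[0].file_path).random()
    return sorted(tests, key=doc_key)[:n]
```

The `QAResult.precision`/`.recall` port scores character-span overlap between retrieved and gold spans; a sketch of the core idea (again a hypothetical helper; the vendored copy tracks spans per source file):

```python
def char_overlap_metrics(
    retrieved: list[tuple[int, int]],  # half-open (start, end) char spans
    gold: list[tuple[int, int]],
) -> tuple[float, float]:
    def chars(spans: list[tuple[int, int]]) -> set[int]:
        covered: set[int] = set()
        for start, end in spans:
            covered.update(range(start, end))
        return covered

    got, want = chars(retrieved), chars(gold)
    hit = len(got & want)
    precision = hit / len(got) if got else 0.0
    recall = hit / len(want) if want else 0.0
    return precision, recall
```
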
@@ -34,17 +45,8 @@
 > retrieval budget) we trade precision for recall on three of four
 > subsets, on the same operating-point trajectory the paper documents.

-The agent layer of the production pipeline is currently broken on this
-branch (only 1 of 776 cells succeeds end-to-end; pydantic-ai exits the
-loop after a multi-tool-call message without producing a final structured
-output — same `no_final_response` failure mode the [PR #1380 audit thread]
-flagged). The probe still works correctly. Probe and agent results live
-in their own clearly-labelled sections.
-
 ## What changed since PR #1380's first numbers

-[PR #1380 audit thread]: TODO replace with link
-
 The earlier version of this report claimed `78.4% on privacy_qa`,
 `100.0% on contractnli`, `66.2% macro`. Those numbers were wrong because:

@@ -243,7 +245,6 @@ All headline runs use `--retrieval-only --corpus-wide`:
 | Config A | `multi-qa-MiniLM-L6-cos-v1` (384d, microservice) | `paragraph`, `max_chars=None` | none |
 | Config B | `text-embedding-3-large` (3072d, OpenAI) | `sliding_window`, `window_size=500`, `overlap=0`, `respect_word_boundaries=False` | none |
 | Config C | `text-embedding-3-large` (3072d, OpenAI) | `paragraph`, `max_chars=6000` | none |
-| Agent | Config A's pipeline + `openai:gpt-4o-mini` extractor + grounding | (same as A) | none |

 `max_chars=6000` for Config C keeps individual paragraph embeddings under
 OpenAI's 8,192-token context limit (~6000 chars ≈ 1500 tokens);
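The ≈1,500-token estimate uses the common ~4-characters-per-token rule of thumb for English prose. A quick sanity check with `tiktoken` (illustrative only; `cl100k_base` is the encoding OpenAI documents for its v3 embedding models):

```python
import tiktoken

# Illustrative check of the ~4 chars/token heuristic behind max_chars=6000.
enc = tiktoken.get_encoding("cl100k_base")
sample = ("The Company shall indemnify and hold harmless the Purchaser "
          "from and against any and all losses arising hereunder. ") * 50
ratio = len(enc.encode(sample)) / len(sample)   # tokens per character
print(f"{ratio:.3f} tokens/char")               # roughly 0.25 for prose
print(f"6000 chars ~= {int(6000 * ratio)} tokens (limit: 8192)")
```
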
@@ -332,101 +333,42 @@ capture maud's clause-level gold spans well) but should be read as
 deep questions about merger-agreement clauses where gold answers are
 often discriminator phrases embedded in long boilerplate paragraphs.

-## Agent-pipeline results (production end-to-end)
-
-> **Scope**: This section measures the full OpenContracts production
-> pipeline (retrieval → iterative agent loop → LLM extraction → citation
-> grounding) on the same 776 paper-faithful task slice, using Config A
-> (MiniLM + paragraph) for retrieval and `openai:gpt-4o-mini` for
-> extraction at `extraction-concurrency=4`. The LegalBench-RAG paper has
-> no agent-loop equivalent, so **none of the metrics in this section
-> are comparable to the paper**. They characterise what production
-> OpenContracts actually surfaces to a user.
-
-### Headline
-
-| Metric | Value |
-|---|---:|
-| `extraction_success_rate` | **0.0013** (1 of 776 cells succeeded) |
-| `answer_token_f1` | 0.31 (over the 1 successful cell) |
-| `citation_char_recall` | 0.0000 |
-| `citation_char_precision` | 0.0000 |
-| `citation_span_overlaps_gold` | 0.0000 |
-| `probe_char_recall` (control) | 0.6339 (matches Config A retrieval-only exactly) |
-| total tokens consumed | 1,288,282 (1.23M in / 58K out) |
-| LLM requests | 795 |
-
-### What this means
-
-The probe column reproduces Config A's retrieval-only number exactly
-(0.6339), so the retrieval stack is fine. The agent integration is broken:
-each cell hits the same `no_final_response` failure mode the original
-PR #1380 audit thread flagged:
-
-> The agent issues a tool call, the message log ends there with no
-> tool-return / no synthesis. The pydantic-ai loop exited without
-> producing a structured answer. **Pipeline bug**, not a data signal.
-
-Sample failing cell from `report.json`:
-
-```text
-task: contractnli::0000
-prediction: ''
-error: Failed to extract requested data from document (no_final_response)
-messages=2, response_msgs=1, tool_calls_total=3,
-last_response_parts=['tool-call', 'tool-call', 'tool-call']
-```
-
-The agent emits 3 tool calls in one assistant message, then the loop
-terminates without producing a structured-output assistant message, and
-`doc_extract_query_task` records `no_final_response` and returns None.
-This happens on 775 of 776 cells.
-
-The original PR #1380 included a "prompt-tightening fix" in
-`pydantic_ai_agents.py` that was supposed to address this exact failure
-mode. Either:
-1. That fix didn't apply to this branch's pydantic-ai version,
-2. The fix regressed in a later commit, or
-3. gpt-4o-mini's behaviour changed enough that the prompt tightening no
-   longer prevents the multi-tool-call-then-stop pattern.
-
-Either way, **the production agent path is currently unusable on this
-branch with gpt-4o-mini**. The prior report's claim of `0.197 char_F1`
-and `0.242 answer_token_f1` for this exact config does not reproduce
-on a clean run today.
-
-This is now follow-up work tracked in PR #1380's audit thread; the
-report's headline retrieval claims do not depend on it.
-
 ## Reproduction

-Every run directory under [`runs/`](./runs/) contains:
-
-| File | Source |
-|---|---|
-| `report.json` | `BenchmarkReport.write` — per-task metrics + aggregates |
-| `report.csv` | same data flattened |
-| `config.json` | adapter description, top_k, model id, sampling parameters |
-| `gold.json` | per-datacell gold spans + answer text + tags |
-| `command.txt` | exact `manage.py run_benchmark` invocation that produced the artifacts |
-
-To re-execute any run:
+Run artifacts are NOT committed to git — they are large (~22 MB across
+the four configs above) and the LegalBench-RAG `gold.json` files contain
+verbatim contract excerpts whose redistribution licensing is unsettled.
+Every config above is reproducible from the local stack:

-1. Bring up the local stack: `docker compose -f local.yml up -d postgres redis vector-embedder django`.
+1. Bring up local services: `docker compose -f local.yml up -d postgres redis vector-embedder django`.
 2. Apply migrations: `python manage.py migrate`.
-3. Stage the LegalBench-RAG dataset under `/data/legalbenchrag/` (corpus + benchmarks subdirs, exactly as shipped by the upstream Dropbox link).
-4. Configure `PipelineSettings.default_embedder` + `parser_kwargs[TxtParser]` per the run's `config.json`.
-5. Run the command in `command.txt`.
+3. Stage the LegalBench-RAG dataset under `/data/legalbenchrag/`
+   (corpus + benchmarks subdirs, exactly as shipped by the upstream
+   Dropbox link).
+4. Configure `PipelineSettings.default_embedder` and
+   `parser_kwargs[TxtParser]` per the configuration you want to
+   reproduce (see the [Configurations](#configurations) section).
+5. Invoke the harness:
+   ```bash
+   python manage.py run_benchmark legalbench-rag \
+       --top-k 32 --paper-sampling --retrieval-only \
+       --run-dir <output-dir>
+   ```
+
+`<output-dir>` will receive `report.json`, `report.csv`, `config.json`,
+`gold.json`, and `command.txt` — the same artifact shape that earlier
+versions of this PR committed in-tree.

 The harness is deterministic given identical inputs and identical
-settings; `aggregates.probe_char_recall` should match the committed
-`report.json` to within rounding.
+settings; `aggregates.probe_char_recall` reproduces to within rounding.

 ## Open issues

-- **Agent integration broken** (above section) — pydantic-ai loop exits
-  on multi-tool-call response without producing structured output.
-  Blocks all production-pipeline numbers.
+- **End-to-end agent benchmark deferred** — pydantic-ai loop exits on
+  multi-tool-call response without producing structured output, leaving
+  the production extraction path unusable on this branch with
+  `gpt-4o-mini`. Tracked in PR #1399 / issue #1381. Once the agent
+  fix lands, an end-to-end column will be added back to this report.
 - **Precision gap at the paper's operating point** — Config B at k=32
   matches the paper's recall on cuad/maud but its precision is roughly
   flat (0.0240 macro vs paper's ~0.034). This is mostly the k=32 vs k=10
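To check the determinism claim above across two reproduced runs, compare the aggregate directly. A minimal sketch, assuming `report.json` nests aggregates under a top-level `aggregates` key (the dotted `aggregates.probe_char_recall` path suggests this; the exact layout is defined by `BenchmarkReport.write`):

```python
import json
from pathlib import Path

def probe_char_recall(run_dir: str) -> float:
    # Assumed layout: {"aggregates": {"probe_char_recall": ...}, ...}
    report = json.loads(Path(run_dir, "report.json").read_text())
    return report["aggregates"]["probe_char_recall"]

a = probe_char_recall("out/config_a_run1")   # placeholder output dirs
b = probe_char_recall("out/config_a_run2")
assert abs(a - b) < 5e-5, f"runs diverge beyond rounding: {a} vs {b}"
```
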
@@ -436,7 +378,7 @@ settings; `aggregates.probe_char_recall` should match the committed
 - **Config C precision is microscopic for the same reason** — 0.0104
   macro is what you get when you retrieve 30K-136K chars per query. Not
   a defect; just the cost of a bigger budget. PR #1354's reranker
-  framework hasn't yet produced a reranker that helps the agent loop
+  framework hasn't yet produced a reranker that helps in this regime
   (see #1378).
 - **Sampling matches PAPER but not source dataset** — our 194-per-subset
   slice is selected via the upstream codebase's
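For anyone picking up the deferred agent work in PR #1399 / issue #1381: the multi-tool-call-then-stop pattern described in the removed section is cheap to detect from a pydantic-ai message log. A sketch using pydantic-ai's public message types (the actual guard in `doc_extract_query_task` may differ):

```python
from pydantic_ai.messages import ModelMessage, ModelResponse, ToolCallPart

def ended_without_final_response(messages: list[ModelMessage]) -> bool:
    """True when the run's last model response contains only tool calls,
    i.e. the agent loop stopped before emitting a final structured answer."""
    responses = [m for m in messages if isinstance(m, ModelResponse)]
    return bool(responses) and all(
        isinstance(part, ToolCallPart) for part in responses[-1].parts
    )
```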
