- Drop the broken agent-pipeline section (775/776 cells failed with
no_final_response on this branch); the agent fix is being landed
separately in PR #1399 / issue #1381.
- Reframe scope and TL;DR around the retrieval probe.
- Replace committed-artifacts language and the runs/MANIFEST.md pointer
with reproduction-from-CLI instructions. Run artifacts are no
longer committed because (a) ~22 MB across four configs bloats clone
size and (b) gold.json contains verbatim contract excerpts whose
redistribution licensing is unsettled.
- Drop the unresolved 'TODO replace with link' placeholder.
- Remove the Agent row from the Configurations table.
Changed file: `docs/benchmarks/legalbench_rag_results.md` (41 additions, 99 deletions)
```diff
@@ -4,8 +4,19 @@
 **Subsets**: all four (privacy_qa, contractnli, cuad, maud) — 194 tasks each
 **Sampling**: paper-faithful — `legalbenchrag/benchmark.py:46-58` `SORT_BY_DOCUMENT=True` (random key seeded by `test.snippets[0].file_path`), then truncate to first 194 per subset
 **Metrics**: paper-faithful — verbatim port of `legalbenchrag/run_benchmark.py:16-53` `QAResult.precision`/`.recall`, equivalence-tested against a vendored copy in `test_benchmarks.TestUpstreamEquivalence`
-**Last updated**: 2026-04-26
-**Run artifacts**: every number below resolves to one row in [`runs/MANIFEST.md`](./runs/MANIFEST.md)
+**Last updated**: 2026-04-28
+**Reproduction**: every number below is reproducible via the
+`python manage.py run_benchmark` CLI. Run artifacts are intentionally
+not committed to git — see [Reproduction](#reproduction) below.
+
+> **Scope**: this PR validates the **retrieval probe only**. An
+> end-to-end agent benchmark was attempted on this branch but uncovered
+> a production-pipeline bug (multi-tool-call → no final structured
+> output, `no_final_response` failure mode on 775 of 776 cells); that
+> work is being landed separately in PR #1399 / issue #1381. The probe
+> reads the same vector store the agent does, so retrieval numbers
+> remain valid; only the LLM-extraction pass and citation grounding
+> are deferred.
 
 ## TL;DR
 
```
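The **Metrics** line in the hunk above describes character-overlap precision/recall over retrieved vs. gold spans. The following is a minimal illustrative sketch of that style of metric, not the upstream `legalbenchrag` code; the function and variable names here are invented, and spans within each list are assumed non-overlapping.

```python
def overlap(a, b):
    """Characters shared by two half-open [start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def span_precision_recall(retrieved, gold):
    """Character-level precision/recall over lists of (start, end) spans,
    assuming the spans within each list do not overlap each other."""
    hit = sum(overlap(r, g) for r in retrieved for g in gold)
    retrieved_chars = sum(end - start for start, end in retrieved)
    gold_chars = sum(end - start for start, end in gold)
    precision = hit / retrieved_chars if retrieved_chars else 0.0
    recall = hit / gold_chars if gold_chars else 0.0
    return precision, recall
```

Under a metric of this shape, retrieving a window twice as long as a 10-character gold span but covering only half of it scores precision 0.25 and recall 0.5, which is the precision-for-recall trade the TL;DR discusses.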
```diff
@@ -34,17 +45,8 @@
 > retrieval budget) we trade precision for recall on three of four
 > subsets, on the same operating-point trajectory the paper documents.
 
-The agent layer of the production pipeline is currently broken on this
-branch (only 1 of 776 cells succeeds end-to-end; pydantic-ai exits the
-loop after a multi-tool-call message without producing a final structured
-output — same `no_final_response` failure mode the [PR #1380 audit thread]
-flagged). The probe still works correctly. Probe and agent results live
-in their own clearly-labelled sections.
-
 ## What changed since PR #1380's first numbers
 
-[PR #1380 audit thread]: TODO replace with link
-
 The earlier version of this report claimed `78.4% on privacy_qa`,
 `100.0% on contractnli`, `66.2% macro`. Those numbers were wrong because:
 
```
```diff
@@ -243,7 +245,6 @@ All headline runs use `--retrieval-only --corpus-wide`:
 | Config A | `multi-qa-MiniLM-L6-cos-v1` (384d, microservice) | `paragraph`, `max_chars=None` | none |
```
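The sampling rule quoted in the first hunk (`SORT_BY_DOCUMENT=True`, a random key seeded by the first snippet's `file_path`, then truncation to 194 per subset) can be sketched as follows. This is an illustrative reconstruction, not the upstream `legalbenchrag/benchmark.py` code; the `tests` dict structure is an assumption.

```python
import random

def sample_subset(tests, n=194):
    """Order tests by a pseudo-random key seeded from the first snippet's
    file_path, then keep the first n. Tests sharing a document get the
    same seed, hence the same key, so Python's stable sort keeps them
    adjacent (the SORT_BY_DOCUMENT behaviour described above)."""
    def sort_key(test):
        return random.Random(test["snippets"][0]["file_path"]).random()
    return sorted(tests, key=sort_key)[:n]
```

Because the key depends only on `file_path`, the ordering (and therefore which 194 tests survive truncation) is deterministic across runs, which is what makes the reported numbers reproducible from the CLI.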