Commit 91fbef9

Merge pull request #1380 from Open-Source-Legal/pr-1239-clean
Add LegalBench-RAG benchmark harness + parity report (composite of #1239/#1353/#1354/#1376)
2 parents adf0e92 + a38c079 commit 91fbef9

85 files changed

Lines changed: 11175 additions & 416 deletions

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -190,3 +190,15 @@ scratch.py
 # Implementation plans (ephemeral)
 docs/plans/
 docs/superpowers/
+
+# Benchmark run outputs (per-experiment, ad-hoc — not committed)
+/benchmark_runs/
+/data/legalbenchrag/
+/data/legalbenchrag_data/
+/BENCHMARK_RESULTS.md
+/benchmark_results_aggregates.json
+
+# Headline benchmark runs ARE committed under docs/benchmarks/runs/ so
+# every number in docs/benchmarks/legalbench_rag_results.md has a
+# traceable artifact. Do not exclude that directory.
+!docs/benchmarks/runs/

CHANGELOG.md

Lines changed: 22 additions & 0 deletions
@@ -81,6 +81,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- **Pluggable text chunking strategies for `TxtParser`** (Issue #1348, alongside PR #1239): Introduced `opencontractserver/pipeline/parsers/text_chunkers.py` — a small registry-backed abstraction (`BaseTextChunker` + `TextChunk` + `get_chunker`) with three built-in strategies: `SentenceChunker` (spaCy `doc.sents`, preserves pre-#1348 behaviour and emits the existing `SENTENCE` label), `ParagraphChunker` (blank-line split with optional `min_chars` filter and `max_chars` oversize-paragraph fallback, emits `PARAGRAPH`), and `SlidingWindowChunker` (fixed-character window with configurable `overlap` and optional `respect_word_boundaries` snap, emits `WINDOW`). `TxtParser` now declares a `Settings` dataclass with a `chunkers: list[ChunkerSpec]` field (default `[{"name": "sentence"}]`) that can be overridden via `PipelineSettings` *or* per-call via a `chunkers=[...]` kwarg on `parse_document`; the parser iterates the configured strategies and emits one structural SPAN_LABEL annotation per chunk under each strategy's label, so stacked configurations (e.g. sentence + paragraph) index multiple retrieval granularities simultaneously. Motivates the benchmark work in #1239: the LegalBench-RAG `probe_recall_at_10` gap on `privacy_qa` (0.22 observed vs 0.5–0.8 paper floor) is the thesis for needing paragraph-granularity retrieval units, but this PR is strategy-neutral — which chunker wins for which subset is a follow-up optimisation to be driven by the benchmark harness itself. Regression coverage in `opencontractserver/tests/test_text_chunkers.py` (pure-Python, no Django DB) exercises offset/whitespace invariants, overlap arithmetic, word-boundary snapping, argument validation and registry lookup; `test_txt_ingestor_pipeline.py` gains two integration tests that parse the live fixture with a paragraph-only and a stacked paragraph+sliding_window recipe. Existing sentence-only ingestion path is unchanged.
+- **Global post-retrieval reranker for vector search** (Issue #1349): Adds an optional cross-encoder reranking stage that runs after first-stage vector / hybrid retrieval, so OpenContracts can close the gap between vanilla HNSW recall and the accuracy achievable with a cross-encoder scoring pass.
+  - New abstract base class `opencontractserver.pipeline.base.reranker.BaseReranker` wired into the existing `PipelineComponentBase` settings machinery: concrete subclasses declare a `Settings` dataclass (loaded from `PipelineSettings` at runtime) and implement `_rerank_impl(query, passages, **kwargs)`. A default `_arerank_impl` wraps the sync implementation via `sync_to_async` so every backend has a working async path without duplicating logic.
+  - Fault-tolerant helpers `safe_rerank` / `safe_arerank` swallow reranker failures and return `None` so retrieval degrades gracefully to the first-stage ordering — critical because a misconfigured reranker must never take down semantic search.
+  - Four shipped backends in `opencontractserver/pipeline/rerankers/`:
+    - `NoopReranker` — identity pass-through for tests and benchmark control conditions.
+    - `CrossEncoderReranker` — in-process `sentence_transformers.CrossEncoder` (default `BAAI/bge-reranker-v2-m3` per the issue); lazy model load cached by `(model_name, device)` so workers pay the ~300 MB cost once and reuse it on every query. `sentence-transformers` / `torch` are treated as optional dependencies; a missing install surfaces a clear `ImportError` only when this backend is actually selected.
+    - `MicroserviceReranker` — HTTP client that mirrors the shape of `MicroserviceEmbedder` (URL, optional API key, Cloud-Run IAM auth, retry-friendly timeouts). Operators can run any reranker model behind a `/rerank` endpoint and point OpenContracts at it via `RERANKER_MICROSERVICE_URL` (+ secret `RERANKER_MICROSERVICE_API_KEY`).
+    - `CohereReranker` — hosted Rerank API (`rerank-v3.5` by default) via the REST endpoint directly (no hard dep on the `cohere` SDK). API key stored in the encrypted `PipelineSettings.encrypted_secrets` bag under `cohere_api_key` (env var `COHERE_API_KEY` at migration time).
+  - New `ComponentType.RERANKER` enum value and `rerankers/` auto-discovery in `opencontractserver.pipeline.registry`; `PipelineComponentRegistry` now exposes `.rerankers` / `get_all_rerankers_cached()` alongside parsers, embedders, thumbnailers, and post-processors.
+  - `PipelineSettings.default_reranker` (CharField, max_length=512, `documents/models.py:852-980`) — empty string disables reranking; any value is a full class path resolved at runtime. Seeded from `DEFAULT_RERANKER` Django setting at migration time (`documents/migrations/0037_add_default_reranker_to_pipeline_settings.py`). Helpers `get_default_reranker_path()` / `get_default_reranker_class()` / `get_default_reranker_instance()` in `opencontractserver.pipeline.utils`, with a process-local instance cache (cross-encoder model weights are expensive) invalidated via `invalidate_reranker_cache()` on every settings update.
+  - `CoreAnnotationVectorStore` (`opencontractserver/llms/vector_stores/core_vector_stores.py:120-1041`) now accepts an optional `reranker` override + `rerank_oversample_factor` kwarg. Every search path — `search`, `async_search`, `hybrid_search`, `async_hybrid_search`, `global_search`, `async_global_search` — oversamples candidates by `RERANK_OVERSAMPLE_FACTOR` (default 3× the requested `top_k`, hard-capped by `RERANK_MAX_CANDIDATES = 128`) when a reranker is active and re-orders results through `_apply_rerank` / `_aapply_rerank` before returning the final `top_k`. All new plumbing is a no-op when `default_reranker` is empty, so zero behavior change for existing deployments.
+  - GraphQL surface: `PipelineComponentsType.rerankers`, `PipelineSettingsType.default_reranker`, and `UpdatePipelineSettingsMutation.default_reranker` (validated against the registry, invalidates the reranker instance cache on change).
+  - New constants in `opencontractserver.constants.search`: `RERANK_OVERSAMPLE_FACTOR`, `RERANK_MAX_CANDIDATES`, `RERANK_DEFAULT_TOP_K`. New `RERANKER_REQUEST_TIMEOUT_SECONDS` in `opencontractserver.constants.document_processing`.
+  - Tests in `opencontractserver/tests/test_reranker.py` cover the base-class contract (sorting, top_k trim, out-of-range indices, max-candidates, async fallback), `safe_rerank` / `safe_arerank` fault-tolerance, all three HTTP backends with mocked `requests.post`, pipeline utility resolution + instance caching, registry auto-discovery, and vector-store integration (oversample factor, reranker failure fallback, re-ordering effects).
 - **Mypy graduation: typed GraphQL resolvers, mutations, and filters** (Issue #1332): Raised return-annotation coverage in `config/graphql/` from ~4.8% at the start of #1331 to **91.5%** (421/460 function defs) and removed 22 modules from the `mypy.ini` baseline allow-list.
   - **Root-cause annotation fixes in `opencontractserver/utils/permissioning.py`**: `set_permissions_for_obj_to_user`, `user_has_permission_for_obj`, `get_users_permissions_for_obj`, and `get_permission_id_to_name_map_for_model` were previously annotated with `instance: type[django.db.models.Model]` (a class) despite every call site passing an instance — and with `user: type[User]` instead of the `User` runtime instance. These were annotation bugs (the code was correct, the annotations were inverted), which compounded: every mutation calling `set_permissions_for_obj_to_user(user, obj, ...)` was a single `[arg-type]` error each. Corrected to `instance: django.db.models.Model` / `user: UserModel` (forward-referenced via `TYPE_CHECKING` import of `opencontractserver.users.models.User`). Also added the missing `dict[int, str]` annotation on `this_model_permission_id_map` and removed the `user_instance=User` (class) default on `get_users_group_ids`, which would have exploded at runtime if any caller ever omitted the argument. Module graduated out of the baseline.
   - **Graduated from `mypy.ini` baseline** (22 modules): `config.graphql.{action_queries, agent_mutations, badge_mutations, base_types, conversation_mutations, conversation_types, corpus_types, document_queries, filters, ingestion_source_mutations, moderation_mutations, og_metadata_queries, pipeline_queries, security, serializers, slug_queries, smart_label_mutations, social_types, user_queries, user_types, voting_mutations}` and `opencontractserver.utils.permissioning`. Each had the underlying mypy errors fixed first (root-cause in `permissioning.py` cleared the `set_permissions_for_obj_to_user` cluster across every mutation file above).
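
Reviewer sketch: the oversample-then-rerank flow described in the reranker entry of this hunk can be illustrated as below. This is a minimal sketch, not the code in `core_vector_stores.py`. The function names `candidate_count`, `safe_rerank_sketch`, and `rerank_top_k` are hypothetical, and the `(index, score)` return shape of the scorer is an assumption. Only the constant values (3x oversampling, a cap of 128) and the "return `None` on failure, then fall back to first-stage order" behaviour are taken from the changelog text.

```python
from typing import Callable, List, Optional, Sequence, Tuple

RERANK_OVERSAMPLE_FACTOR = 3  # changelog: 3x the requested top_k
RERANK_MAX_CANDIDATES = 128   # changelog: hard cap on candidates

# Assumed scorer shape: returns (passage_index, score) pairs.
Scorer = Callable[[str, Sequence[str]], List[Tuple[int, float]]]


def candidate_count(top_k: int, factor: int = RERANK_OVERSAMPLE_FACTOR) -> int:
    """Number of first-stage hits to fetch when a reranker is active."""
    return min(top_k * factor, RERANK_MAX_CANDIDATES)


def safe_rerank_sketch(
    rerank: Scorer, query: str, passages: Sequence[str]
) -> Optional[List[Tuple[int, float]]]:
    """Return scored pairs, or None if the reranker raises."""
    try:
        return rerank(query, passages)
    except Exception:
        return None


def rerank_top_k(
    rerank: Scorer, query: str, passages: Sequence[str], top_k: int
) -> List[str]:
    """Re-order passages by score; keep first-stage order on failure."""
    ranked = safe_rerank_sketch(rerank, query, passages)
    if ranked is None:
        return list(passages[:top_k])
    # Drop out-of-range indices before sorting by descending score.
    valid = [(i, s) for i, s in ranked if 0 <= i < len(passages)]
    valid.sort(key=lambda pair: pair[1], reverse=True)
    return [passages[i] for i, _ in valid[:top_k]]
```

With this shape, a caller would fetch `candidate_count(top_k)` first-stage hits, pass them to `rerank_top_k`, and return the result unchanged in length whether or not the reranker succeeded.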
@@ -351,6 +366,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- **Benchmark harness for external RAG datasets** (new app `opencontractserver/benchmarks/`): Generate an OpenContracts corpus from a third-party benchmark (LegalBench-RAG today, pluggable for CUAD/MAUD/etc. via a small adapter interface), run the production extract-grid pipeline against the benchmark's queries with a configurable LLM, probe retrieval independently via `CoreAnnotationVectorStore`, and compute standard metrics (SQuAD-style exact match / token F1 for answers; character-span recall@k / precision@k / IoU for retrieval). Results are written as `report.json` / `report.csv` / `config.json` / `gold.json` under a run directory.
+  - Adapter interface and `LegalBenchRAGAdapter` at `opencontractserver/benchmarks/adapters/` (reads the authoritative ZeroEntropy schema — `{"tests": [{"query", "snippets": [{"file_path", "span": [start, end]}], "tags"}]}`)
+  - Loader, runner, evaluator, and report modules under `opencontractserver/benchmarks/`
+  - Django management command: `python manage.py run_benchmark --benchmark legalbench-rag --path /data/legalbench-rag --user admin --model openai:gpt-4o-mini --top-k 10`
+  - Micro fixture under `fixtures/benchmarks/legalbench_rag_micro/` for end-to-end tests without downloading the full dataset
+  - Test coverage: `opencontractserver/tests/test_benchmarks.py` (metric unit tests, adapter unit tests, loader materialization test, runner end-to-end test with mocked structured-response agent)
+- **`model_override` kwarg on `doc_extract_query_task`** (`opencontractserver/tasks/data_extract_tasks.py`): Optional, backward-compatible kwarg that lets callers override the hardcoded `openai:gpt-4o-mini` default for a single invocation. Consumed by the benchmark runner to sweep models without affecting production defaults; still defaults to `openai:gpt-4o-mini` when not supplied.
 - **Frontend unit tests for utils and hooks** (Issue #1267): Added 14 new `*.test.ts(x)` files covering previously-untested utilities and hooks to raise `frontend-unit` coverage on high-ROI pure functions:
 
   - **Utils**: `formatters.test.ts`, `arrayUtils.test.ts`, `colorUtils.test.ts`, `parseOutputType.test.ts`, `annotationGuards.test.ts`, `env.test.ts`, `extractUtils.test.ts`, `layout.test.ts`, `persistentVar.test.ts`, `routingLogger.test.ts`, `navigationCircuitBreaker.test.ts`, `performance.test.ts`, `jobNotificationCacheUpdates.test.ts`, `compactAnnotationJson.test.ts`.
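
Reviewer sketch: the benchmark entry in this hunk names SQuAD-style token F1 for answers and character-span recall for retrieval. The block below is a minimal, hedged illustration of those two metrics, not the evaluator module shipped in `opencontractserver/benchmarks/`. The function names are hypothetical, spans are assumed to be half-open `[start, end)` character offsets within a single document, and the normalisation follows the commonly used SQuAD recipe (lowercase, strip punctuation and English articles).

```python
import re
import string
from collections import Counter
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """SQuAD-style normalisation: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalisation."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def span_recall(
    retrieved: List[Tuple[int, int]], gold: List[Tuple[int, int]]
) -> float:
    """Fraction of gold characters covered by the retrieved spans."""
    gold_chars = set()
    for start, end in gold:
        gold_chars.update(range(start, end))
    if not gold_chars:
        return 0.0
    hit_chars = set()
    for start, end in retrieved:
        hit_chars.update(range(start, end))
    return len(gold_chars & hit_chars) / len(gold_chars)
```

Recall@k in this framing would be `span_recall` applied to the first `k` retrieved spans; precision and IoU follow from the same character sets by changing the denominator.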

config/graphql/pipeline_queries.py

Lines changed: 9 additions & 0 deletions
@@ -97,6 +97,9 @@ def resolve_pipeline_components(
             if settings_instance.default_embedder:
                 configured_components.add(settings_instance.default_embedder)
 
+            if settings_instance.default_reranker:
+                configured_components.add(settings_instance.default_reranker)
+
             if settings_instance.parser_kwargs:
                 configured_components.update(settings_instance.parser_kwargs.keys())
 
@@ -121,6 +124,7 @@ def filter_configured(
                 "post_processors": filter_configured(
                     components_data["post_processors"]
                 ),
+                "rerankers": filter_configured(components_data.get("rerankers", [])),
             }
 
         # Convert PipelineComponentDefinition objects to GraphQL types
@@ -183,6 +187,10 @@ def to_graphql_type(
                 to_graphql_type(d, "post_processor")
                 for d in components_data["post_processors"]
             ],
+            rerankers=[
+                to_graphql_type(d, "reranker")
+                for d in components_data.get("rerankers", [])
+            ],
         )
 
     # SUPPORTED MIME TYPES #####################################
@@ -254,6 +262,7 @@ def resolve_pipeline_settings(
             parser_kwargs=settings_instance.parser_kwargs or {},
             component_settings=settings_instance.component_settings or {},
             default_embedder=settings_instance.default_embedder or "",
+            default_reranker=settings_instance.default_reranker or "",
             components_with_secrets=components_with_secrets,
             tools_with_secrets=settings_instance.get_tools_with_secrets(),
             enabled_components=settings_instance.enabled_components or [],

config/graphql/pipeline_settings_mutations.py

Lines changed: 36 additions & 0 deletions
@@ -213,6 +213,13 @@ class Arguments:
             required=False,
             description="Default embedder class path when no MIME-specific embedder is found.",
         )
+        default_reranker = graphene.String(
+            required=False,
+            description=(
+                "Default post-retrieval reranker class path. Empty string "
+                "disables reranking (first-stage vector / hybrid search only)."
+            ),
+        )
         enabled_components = graphene.List(
             graphene.String,
             required=False,
@@ -237,6 +244,7 @@ def mutate(
         parser_kwargs=None,
         component_settings=None,
         default_embedder=None,
+        default_reranker=None,
         enabled_components=None,
     ) -> "UpdatePipelineSettingsMutation":
         """
@@ -393,6 +401,32 @@ def mutate(
                             )
                     settings_instance.default_embedder = default_embedder
 
+                # Validate default_reranker (empty string = disabled)
+                if default_reranker is not None:
+                    if default_reranker:
+                        error = validate_component_path(default_reranker)
+                        if error:
+                            return UpdatePipelineSettingsMutation(
+                                ok=False, message=error, pipeline_settings=None
+                            )
+                        if not registry.get_by_class_name(default_reranker):
+                            return UpdatePipelineSettingsMutation(
+                                ok=False,
+                                message=(
+                                    f"Default reranker '{default_reranker}' "
+                                    "not found in registry."
+                                ),
+                                pipeline_settings=None,
+                            )
+                    settings_instance.default_reranker = default_reranker
+                    # Drop cached reranker instance so the next retrieval picks
+                    # up the new configuration without a worker restart.
+                    from opencontractserver.pipeline.utils import (
+                        invalidate_reranker_cache,
+                    )
+
+                    invalidate_reranker_cache()
+
                 # Validate enabled_components
                 if enabled_components is not None:
                     if not isinstance(enabled_components, list):
@@ -487,6 +521,7 @@ def mutate(
                         ("parser_kwargs", parser_kwargs),
                         ("component_settings", component_settings),
                         ("default_embedder", default_embedder),
+                        ("default_reranker", default_reranker),
                         ("enabled_components", enabled_components),
                     ]
                     if val is not None
@@ -508,6 +543,7 @@ def mutate(
                     parser_kwargs=settings_instance.parser_kwargs or {},
                     component_settings=settings_instance.component_settings or {},
                     default_embedder=settings_instance.default_embedder or "",
+                    default_reranker=settings_instance.default_reranker or "",
                     enabled_components=settings_instance.enabled_components or [],
                     components_with_secrets=list(
                         settings_instance.get_secrets().keys()
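
Reviewer sketch: the mutation above calls `invalidate_reranker_cache()` after changing `default_reranker`, and the changelog describes a process-local instance cache behind it. The block below is a minimal illustration of that cache-and-invalidate pattern, assuming a plain module-level dict. It is not the code in `opencontractserver.pipeline.utils`: `get_reranker_instance`, `invalidate_cache`, and the `factory` callable are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, Optional

# Process-local cache: one instance per configured class path.
_instance_cache: Dict[str, Any] = {}


def get_reranker_instance(
    class_path: str, factory: Callable[[str], Any]
) -> Optional[Any]:
    """Return a cached instance, building it once per class path."""
    if not class_path:
        # Empty string means reranking is disabled.
        return None
    if class_path not in _instance_cache:
        _instance_cache[class_path] = factory(class_path)
    return _instance_cache[class_path]


def invalidate_cache() -> None:
    """Forget all cached instances so the next lookup rebuilds them."""
    _instance_cache.clear()
```

A cache like this is local to each worker process, so clearing it in the process that handles the mutation does not by itself refresh other workers; how the real helper deals with that is not shown in this diff.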

config/graphql/pipeline_types.py

Lines changed: 12 additions & 0 deletions
@@ -113,6 +113,10 @@ class PipelineComponentsType(graphene.ObjectType):
     post_processors = graphene.List(
         PipelineComponentType, description="List of available post-processors."
     )
+    rerankers = graphene.List(
+        PipelineComponentType,
+        description="List of available post-retrieval rerankers.",
+    )
 
 
 # ==============================================================================
@@ -203,6 +207,14 @@ class PipelineSettingsType(graphene.ObjectType):
         description="Default embedder class path when no MIME-specific embedder is found"
     )
 
+    # Default reranker (post-retrieval). Empty string means reranking disabled.
+    default_reranker = graphene.String(
+        description="Default post-retrieval reranker class path. Empty string "
+        "means reranking is disabled and first-stage retrieval "
+        "results are returned as-is.",
+        required=False,
+    )
+
     # Secrets indicator (actual secrets are never exposed via GraphQL)
     components_with_secrets = graphene.List(
         graphene.String,

config/settings/base.py

Lines changed: 1 addition & 0 deletions
@@ -158,6 +158,7 @@
     "opencontractserver.agents",
     "opencontractserver.worker_uploads",
     "opencontractserver.discovery",
+    "opencontractserver.benchmarks",
 ]
 
 # https://docs.djangoproject.com/en/dev/ref/settings/#installed-apps

config/settings/test.py

Lines changed: 9 additions & 1 deletion
@@ -130,7 +130,15 @@
 #
 # Integration tests that need to verify actual service connectivity should
 # explicitly instantiate the real embedder class (e.g., MicroserviceEmbedder).
-DEFAULT_EMBEDDER = "opencontractserver.pipeline.embedders.test_embedder.TestEmbedder"
+#
+# Intentionally env-overridable: benchmark runs via the test.yml compose
+# stack (see opencontractserver/benchmarks/) need to swap in a real embedder
+# at runtime without editing settings. Standard CI never sets DEFAULT_EMBEDDER,
+# so the default TestEmbedder keeps regular test runs hermetic.
+DEFAULT_EMBEDDER = env(
+    "DEFAULT_EMBEDDER",
+    default="opencontractserver.pipeline.embedders.test_embedder.TestEmbedder",
+)
 
 # Auth0 settings for tests
 # ------------------------------------------------------------------------------
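
Reviewer sketch: the change above makes `DEFAULT_EMBEDDER` fall back to `TestEmbedder` only when the environment variable is unset. The block below illustrates that lookup with the standard library only. It is a stand-in for the `env(...)` helper used in the settings file, whose exact behaviour (for example around empty strings) is not shown in this diff, and `resolve_default_embedder` is a hypothetical name.

```python
import os
from typing import Mapping

TEST_EMBEDDER = "opencontractserver.pipeline.embedders.test_embedder.TestEmbedder"


def resolve_default_embedder(environ: Mapping[str, str] = os.environ) -> str:
    """Use DEFAULT_EMBEDDER from the environment, else the hermetic default."""
    return environ.get("DEFAULT_EMBEDDER", TEST_EMBEDDER)
```

A benchmark run would set `DEFAULT_EMBEDDER` to a real embedder's class path in the compose environment, while CI leaves it unset and keeps the test embedder.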