Open-Source-Legal
diff --git a/.gitignore b/.gitignore
Lines changed: 12 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 22 additions & 0 deletions
diff --git a/config/graphql/pipeline_queries.py b/config/graphql/pipeline_queries.py
Lines changed: 9 additions & 0 deletions
diff --git a/config/graphql/pipeline_settings_mutations.py b/config/graphql/pipeline_settings_mutations.py
Lines changed: 36 additions & 0 deletions
diff --git a/config/graphql/pipeline_types.py b/config/graphql/pipeline_types.py
Lines changed: 12 additions & 0 deletions
diff --git a/config/settings/base.py b/config/settings/base.py
Lines changed: 1 addition & 0 deletions
diff --git a/config/settings/test.py b/config/settings/test.py
Lines changed: 9 additions & 1 deletion
diff --git a/docs/assets/images/screenshots/auto/admin--agent-config--create-modal.png b/docs/assets/images/screenshots/auto/admin--agent-config--create-modal.png
-924 Bytes
diff --git a/docs/assets/images/screenshots/auto/annotator--edit-label-modal--open.png b/docs/assets/images/screenshots/auto/annotator--edit-label-modal--open.png
713 Bytes
diff --git a/docs/assets/images/screenshots/auto/badges--criteria-config--with-type-selected.png b/docs/assets/images/screenshots/auto/badges--criteria-config--with-type-selected.png
-12 Bytes
@@ -190,3 +190,15 @@ scratch.py
 # Implementation plans (ephemeral)
 docs/plans/
 docs/superpowers/
+
+# Benchmark run outputs (per-experiment, ad-hoc — not committed)
+/benchmark_runs/
+/data/legalbenchrag/
+/data/legalbenchrag_data/
+/BENCHMARK_RESULTS.md
+/benchmark_results_aggregates.json
+
+# Headline benchmark runs ARE committed under docs/benchmarks/runs/ so
+# every number in docs/benchmarks/legalbench_rag_results.md has a
+# traceable artifact. Do not exclude that directory.
+!docs/benchmarks/runs/
@@ -81,6 +81,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- **Pluggable text chunking strategies for `TxtParser`** (Issue #1348, alongside PR #1239): Introduced `opencontractserver/pipeline/parsers/text_chunkers.py` — a small registry-backed abstraction (`BaseTextChunker` + `TextChunk` + `get_chunker`) with three built-in strategies: `SentenceChunker` (spaCy `doc.sents`, preserves pre-#1348 behaviour and emits the existing `SENTENCE` label), `ParagraphChunker` (blank-line split with optional `min_chars` filter and `max_chars` oversize-paragraph fallback, emits `PARAGRAPH`), and `SlidingWindowChunker` (fixed-character window with configurable `overlap` and optional `respect_word_boundaries` snap, emits `WINDOW`). `TxtParser` now declares a `Settings` dataclass with a `chunkers: list[ChunkerSpec]` field (default `[{"name": "sentence"}]`) that can be overridden via `PipelineSettings` *or* per-call via a `chunkers=[...]` kwarg on `parse_document`; the parser iterates the configured strategies and emits one structural SPAN_LABEL annotation per chunk under each strategy's label, so stacked configurations (e.g. sentence + paragraph) index multiple retrieval granularities simultaneously. Motivates the benchmark work in #1239: the LegalBench-RAG `probe_recall_at_10` gap on `privacy_qa` (0.22 observed vs 0.5–0.8 paper floor) is the thesis for needing paragraph-granularity retrieval units, but this PR is strategy-neutral — which chunker wins for which subset is a follow-up optimisation to be driven by the benchmark harness itself. Regression coverage in `opencontractserver/tests/test_text_chunkers.py` (pure-Python, no Django DB) exercises offset/whitespace invariants, overlap arithmetic, word-boundary snapping, argument validation and registry lookup; `test_txt_ingestor_pipeline.py` gains two integration tests that parse the live fixture with a paragraph-only and a stacked paragraph+sliding_window recipe. Existing sentence-only ingestion path is unchanged.
+- **Global post-retrieval reranker for vector search** (Issue #1349): Adds an optional cross-encoder reranking stage that runs after first-stage vector / hybrid retrieval, so OpenContracts can close the gap between vanilla HNSW recall and the accuracy achievable with a cross-encoder scoring pass.
+  - New abstract base class `opencontractserver.pipeline.base.reranker.BaseReranker` wired into the existing `PipelineComponentBase` settings machinery: concrete subclasses declare a `Settings` dataclass (loaded from `PipelineSettings` at runtime) and implement `_rerank_impl(query, passages, **kwargs)`. A default `_arerank_impl` wraps the sync implementation via `sync_to_async` so every backend has a working async path without duplicating logic.
+  - Fault-tolerant helpers `safe_rerank` / `safe_arerank` swallow reranker failures and return `None` so retrieval degrades gracefully to the first-stage ordering — critical because a misconfigured reranker must never take down semantic search.
+  - Four shipped backends in `opencontractserver/pipeline/rerankers/`:
+    - `NoopReranker` — identity pass-through for tests and benchmark control conditions.
+    - `CrossEncoderReranker` — in-process `sentence_transformers.CrossEncoder` (default `BAAI/bge-reranker-v2-m3` per the issue); lazy model load cached by `(model_name, device)` so workers pay the ~300 MB cost once and reuse it on every query. `sentence-transformers` / `torch` are treated as optional dependencies; a missing install surfaces a clear `ImportError` only when this backend is actually selected.
+    - `MicroserviceReranker` — HTTP client that mirrors the shape of `MicroserviceEmbedder` (URL, optional API key, Cloud-Run IAM auth, retry-friendly timeouts). Operators can run any reranker model behind a `/rerank` endpoint and point OpenContracts at it via `RERANKER_MICROSERVICE_URL` (+ secret `RERANKER_MICROSERVICE_API_KEY`).
+    - `CohereReranker` — hosted Rerank API (`rerank-v3.5` by default) via the REST endpoint directly (no hard dep on the `cohere` SDK). API key stored in the encrypted `PipelineSettings.encrypted_secrets` bag under `cohere_api_key` (env var `COHERE_API_KEY` at migration time).
+  - New `ComponentType.RERANKER` enum value and `rerankers/` auto-discovery in `opencontractserver.pipeline.registry`; `PipelineComponentRegistry` now exposes `.rerankers` / `get_all_rerankers_cached()` alongside parsers, embedders, thumbnailers, and post-processors.
+  - `PipelineSettings.default_reranker` (CharField, max_length=512, `documents/models.py:852-980`) — empty string disables reranking; any value is a full class path resolved at runtime. Seeded from `DEFAULT_RERANKER` Django setting at migration time (`documents/migrations/0037_add_default_reranker_to_pipeline_settings.py`). Helpers `get_default_reranker_path()` / `get_default_reranker_class()` / `get_default_reranker_instance()` in `opencontractserver.pipeline.utils`, with a process-local instance cache (cross-encoder model weights are expensive) invalidated via `invalidate_reranker_cache()` on every settings update.
+  - `CoreAnnotationVectorStore` (`opencontractserver/llms/vector_stores/core_vector_stores.py:120-1041`) now accepts an optional `reranker` override + `rerank_oversample_factor` kwarg. Every search path — `search`, `async_search`, `hybrid_search`, `async_hybrid_search`, `global_search`, `async_global_search` — oversamples candidates by `RERANK_OVERSAMPLE_FACTOR` (default 3× the requested `top_k`, hard-capped by `RERANK_MAX_CANDIDATES = 128`) when a reranker is active and re-orders results through `_apply_rerank` / `_aapply_rerank` before returning the final `top_k`. All new plumbing is a no-op when `default_reranker` is empty, so zero behavior change for existing deployments.
+  - GraphQL surface: `PipelineComponentsType.rerankers`, `PipelineSettingsType.default_reranker`, and `UpdatePipelineSettingsMutation.default_reranker` (validated against the registry, invalidates the reranker instance cache on change).
+  - New constants in `opencontractserver.constants.search`: `RERANK_OVERSAMPLE_FACTOR`, `RERANK_MAX_CANDIDATES`, `RERANK_DEFAULT_TOP_K`. New `RERANKER_REQUEST_TIMEOUT_SECONDS` in `opencontractserver.constants.document_processing`.
+  - Tests in `opencontractserver/tests/test_reranker.py` cover the base-class contract (sorting, top_k trim, out-of-range indices, max-candidates, async fallback), `safe_rerank` / `safe_arerank` fault-tolerance, all three HTTP backends with mocked `requests.post`, pipeline utility resolution + instance caching, registry auto-discovery, and vector-store integration (oversample factor, reranker failure fallback, re-ordering effects).
 - **Mypy graduation: typed GraphQL resolvers, mutations, and filters** (Issue #1332): Raised return-annotation coverage in `config/graphql/` from ~4.8% at the start of #1331 to **91.5%** (421/460 function defs) and removed 22 modules from the `mypy.ini` baseline allow-list.
   - **Root-cause annotation fixes in `opencontractserver/utils/permissioning.py`**: `set_permissions_for_obj_to_user`, `user_has_permission_for_obj`, `get_users_permissions_for_obj`, and `get_permission_id_to_name_map_for_model` were previously annotated with `instance: type[django.db.models.Model]` (a class) despite every call site passing an instance — and with `user: type[User]` instead of the `User` runtime instance. These were annotation bugs (the code was correct, the annotations were inverted), which compounded: every mutation calling `set_permissions_for_obj_to_user(user, obj, ...)` was a single `[arg-type]` error each. Corrected to `instance: django.db.models.Model` / `user: UserModel` (forward-referenced via `TYPE_CHECKING` import of `opencontractserver.users.models.User`). Also added the missing `dict[int, str]` annotation on `this_model_permission_id_map` and removed the `user_instance=User` (class) default on `get_users_group_ids`, which would have exploded at runtime if any caller ever omitted the argument. Module graduated out of the baseline.
   - **Graduated from `mypy.ini` baseline** (22 modules): `config.graphql.{action_queries, agent_mutations, badge_mutations, base_types, conversation_mutations, conversation_types, corpus_types, document_queries, filters, ingestion_source_mutations, moderation_mutations, og_metadata_queries, pipeline_queries, security, serializers, slug_queries, smart_label_mutations, social_types, user_queries, user_types, voting_mutations}` and `opencontractserver.utils.permissioning`. Each had the underlying mypy errors fixed first (root-cause in `permissioning.py` cleared the `set_permissions_for_obj_to_user` cluster across every mutation file above).
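Note on the reranker entry above: the oversample-then-rerank flow with graceful fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `core_vector_stores.py`. The two constants carry the values quoted in the changelog; `candidate_count`, `rerank_or_fallback`, and the `Scorer` type are hypothetical names, and the behaviour when `top_k` itself exceeds the cap is a guess. The changelog says `safe_rerank` returns `None` on failure and the caller then keeps the first-stage order; the sketch folds those two steps into one function.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence

# Values quoted in the changelog entry; every other name here is hypothetical.
RERANK_OVERSAMPLE_FACTOR = 3
RERANK_MAX_CANDIDATES = 128

# A scorer returns one relevance score per passage (higher is better).
Scorer = Callable[[str, Sequence[str]], Sequence[float]]


def candidate_count(top_k: int, reranker_active: bool) -> int:
    """Number of first-stage hits to fetch before the rerank pass."""
    if not reranker_active:
        return top_k  # no reranker configured: fetch exactly top_k
    oversampled = min(top_k * RERANK_OVERSAMPLE_FACTOR, RERANK_MAX_CANDIDATES)
    # Assumption: never fetch fewer than top_k, even if top_k exceeds the cap.
    return max(top_k, oversampled)


def rerank_or_fallback(
    query: str,
    passages: Sequence[str],
    top_k: int,
    scorer: Optional[Scorer],
) -> list[str]:
    """Reorder passages by score; keep first-stage order if scoring fails."""
    if scorer is None:
        return list(passages[:top_k])
    try:
        scores = scorer(query, passages)
        if len(scores) != len(passages):
            raise ValueError("scorer returned the wrong number of scores")
    except Exception:
        # A broken reranker must not break search: degrade to first-stage order.
        return list(passages[:top_k])
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in order[:top_k]]
```

With these values a request for `top_k=10` fetches 30 candidates, and a request for `top_k=50` is capped at 128.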
@@ -351,6 +366,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- **Benchmark harness for external RAG datasets** (new app `opencontractserver/benchmarks/`): Generate an OpenContracts corpus from a third-party benchmark (LegalBench-RAG today, pluggable for CUAD/MAUD/etc. via a small adapter interface), run the production extract-grid pipeline against the benchmark's queries with a configurable LLM, probe retrieval independently via `CoreAnnotationVectorStore`, and compute standard metrics (SQuAD-style exact match / token F1 for answers; character-span recall@k / precision@k / IoU for retrieval). Results are written as `report.json` / `report.csv` / `config.json` / `gold.json` under a run directory.
+  - Adapter interface and `LegalBenchRAGAdapter` at `opencontractserver/benchmarks/adapters/` (reads the authoritative ZeroEntropy schema — `{"tests": [{"query", "snippets": [{"file_path", "span": [start, end]}], "tags"}]}`)
+  - Loader, runner, evaluator, and report modules under `opencontractserver/benchmarks/`
+  - Django management command: `python manage.py run_benchmark --benchmark legalbench-rag --path /data/legalbench-rag --user admin --model openai:gpt-4o-mini --top-k 10`
+  - Micro fixture under `fixtures/benchmarks/legalbench_rag_micro/` for end-to-end tests without downloading the full dataset
+  - Test coverage: `opencontractserver/tests/test_benchmarks.py` (metric unit tests, adapter unit tests, loader materialization test, runner end-to-end test with mocked structured-response agent)
+- **`model_override` kwarg on `doc_extract_query_task`** (`opencontractserver/tasks/data_extract_tasks.py`): Optional, backward-compatible kwarg that lets callers override the hardcoded `openai:gpt-4o-mini` default for a single invocation. Consumed by the benchmark runner to sweep models without affecting production defaults; still defaults to `openai:gpt-4o-mini` when not supplied.
 - **Frontend unit tests for utils and hooks** (Issue #1267): Added 14 new `*.test.ts(x)` files covering previously-untested utilities and hooks to raise `frontend-unit` coverage on high-ROI pure functions:
 
   - **Utils**: `formatters.test.ts`, `arrayUtils.test.ts`, `colorUtils.test.ts`, `parseOutputType.test.ts`, `annotationGuards.test.ts`, `env.test.ts`, `extractUtils.test.ts`, `layout.test.ts`, `persistentVar.test.ts`, `routingLogger.test.ts`, `navigationCircuitBreaker.test.ts`, `performance.test.ts`, `jobNotificationCacheUpdates.test.ts`, `compactAnnotationJson.test.ts`.
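Note on the benchmark-harness entry in this hunk: the character-span retrieval metrics it names (recall@k, precision@k, IoU) could be computed along the following lines. This is a sketch, not the evaluator in `opencontractserver/benchmarks/`. It assumes half-open `(start, end)` character spans within a single document and overlap measured in characters; the harness may define the metrics differently. `span_metrics` and `_covered_chars` are hypothetical names.

```python
from __future__ import annotations

Span = tuple[int, int]  # half-open character range [start, end); an assumption


def _covered_chars(spans: list[Span]) -> set[int]:
    """Set of character offsets covered by at least one span."""
    covered: set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def span_metrics(gold: list[Span], retrieved: list[Span], k: int) -> dict[str, float]:
    """Character-level recall, precision and IoU over the top-k retrieved spans."""
    gold_chars = _covered_chars(gold)
    hit_chars = _covered_chars(retrieved[:k])
    overlap = len(gold_chars & hit_chars)
    union = len(gold_chars | hit_chars)
    return {
        "recall": overlap / len(gold_chars) if gold_chars else 0.0,
        "precision": overlap / len(hit_chars) if hit_chars else 0.0,
        "iou": overlap / union if union else 0.0,
    }
```

Sets of offsets keep the sketch short and make overlapping spans count once; an interval-merge version would be preferable for long documents.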
 
@@ -97,6 +97,9 @@ def resolve_pipeline_components(
             if settings_instance.default_embedder:
                 configured_components.add(settings_instance.default_embedder)
 
+            if settings_instance.default_reranker:
+                configured_components.add(settings_instance.default_reranker)
+
             if settings_instance.parser_kwargs:
                 configured_components.update(settings_instance.parser_kwargs.keys())
 
@@ -121,6 +124,7 @@ def filter_configured(
                 "post_processors": filter_configured(
                     components_data["post_processors"]
                 ),
+                "rerankers": filter_configured(components_data.get("rerankers", [])),
             }
 
         # Convert PipelineComponentDefinition objects to GraphQL types
@@ -183,6 +187,10 @@ def to_graphql_type(
                 to_graphql_type(d, "post_processor")
                 for d in components_data["post_processors"]
             ],
+            rerankers=[
+                to_graphql_type(d, "reranker")
+                for d in components_data.get("rerankers", [])
+            ],
         )
 
     # SUPPORTED MIME TYPES #####################################
@@ -254,6 +262,7 @@ def resolve_pipeline_settings(
             parser_kwargs=settings_instance.parser_kwargs or {},
             component_settings=settings_instance.component_settings or {},
             default_embedder=settings_instance.default_embedder or "",
+            default_reranker=settings_instance.default_reranker or "",
             components_with_secrets=components_with_secrets,
             tools_with_secrets=settings_instance.get_tools_with_secrets(),
             enabled_components=settings_instance.enabled_components or [],
 
@@ -213,6 +213,13 @@ class Arguments:
             required=False,
             description="Default embedder class path when no MIME-specific embedder is found.",
         )
+        default_reranker = graphene.String(
+            required=False,
+            description=(
+                "Default post-retrieval reranker class path. Empty string "
+                "disables reranking (first-stage vector / hybrid search only)."
+            ),
+        )
         enabled_components = graphene.List(
             graphene.String,
             required=False,
@@ -237,6 +244,7 @@ def mutate(
         parser_kwargs=None,
         component_settings=None,
         default_embedder=None,
+        default_reranker=None,
         enabled_components=None,
     ) -> "UpdatePipelineSettingsMutation":
         """
@@ -393,6 +401,32 @@ def mutate(
                         )
                 settings_instance.default_embedder = default_embedder
 
+            # Validate default_reranker (empty string = disabled)
+            if default_reranker is not None:
+                if default_reranker:
+                    error = validate_component_path(default_reranker)
+                    if error:
+                        return UpdatePipelineSettingsMutation(
+                            ok=False, message=error, pipeline_settings=None
+                        )
+                    if not registry.get_by_class_name(default_reranker):
+                        return UpdatePipelineSettingsMutation(
+                            ok=False,
+                            message=(
+                                f"Default reranker '{default_reranker}' "
+                                "not found in registry."
+                            ),
+                            pipeline_settings=None,
+                        )
+                settings_instance.default_reranker = default_reranker
+                # Drop cached reranker instance so the next retrieval picks
+                # up the new configuration without a worker restart.
+                from opencontractserver.pipeline.utils import (
+                    invalidate_reranker_cache,
+                )
+
+                invalidate_reranker_cache()
+
             # Validate enabled_components
             if enabled_components is not None:
                 if not isinstance(enabled_components, list):
@@ -487,6 +521,7 @@ def mutate(
                     ("parser_kwargs", parser_kwargs),
                     ("component_settings", component_settings),
                     ("default_embedder", default_embedder),
+                    ("default_reranker", default_reranker),
                     ("enabled_components", enabled_components),
                 ]
                 if val is not None
@@ -508,6 +543,7 @@ def mutate(
                     parser_kwargs=settings_instance.parser_kwargs or {},
                     component_settings=settings_instance.component_settings or {},
                     default_embedder=settings_instance.default_embedder or "",
+                    default_reranker=settings_instance.default_reranker or "",
                     enabled_components=settings_instance.enabled_components or [],
                     components_with_secrets=list(
                         settings_instance.get_secrets().keys()
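Note on the `default_reranker` handling in this file: the argument is effectively tri-state. `None` leaves the stored value alone, an empty string disables reranking, and a non-empty string must resolve in the registry. A standalone sketch of that decision follows. It uses a plain set as a stand-in for the real registry, and `resolve_default_reranker` and `KNOWN_RERANKERS` are hypothetical names, not part of this PR.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical stand-in for the component registry lookup.
KNOWN_RERANKERS = {"example.rerankers.NoopReranker"}


def resolve_default_reranker(
    current: str,
    requested: Optional[str],
    known: set[str] = KNOWN_RERANKERS,
) -> tuple[str, Optional[str]]:
    """Return (value_to_store, error_message_or_None)."""
    if requested is None:
        return current, None  # argument omitted: keep the stored value
    if requested == "":
        return "", None  # explicit empty string: reranking disabled
    if requested not in known:
        # Unknown class path: reject and keep the stored value unchanged.
        return current, f"Default reranker '{requested}' not found in registry."
    return requested, None
```

Distinguishing `None` from `""` is what lets a client clear the setting without the server mistaking the request for "field not supplied".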
 
@@ -113,6 +113,10 @@ class PipelineComponentsType(graphene.ObjectType):
     post_processors = graphene.List(
         PipelineComponentType, description="List of available post-processors."
     )
+    rerankers = graphene.List(
+        PipelineComponentType,
+        description="List of available post-retrieval rerankers.",
+    )
 
 
 # ==============================================================================
@@ -203,6 +207,14 @@ class PipelineSettingsType(graphene.ObjectType):
         description="Default embedder class path when no MIME-specific embedder is found"
     )
 
+    # Default reranker (post-retrieval). Empty string means reranking disabled.
+    default_reranker = graphene.String(
+        description="Default post-retrieval reranker class path. Empty string "
+        "means reranking is disabled and first-stage retrieval "
+        "results are returned as-is.",
+        required=False,
+    )
+
     # Secrets indicator (actual secrets are never exposed via GraphQL)
     components_with_secrets = graphene.List(
         graphene.String,
 
@@ -158,6 +158,7 @@
     "opencontractserver.agents",
     "opencontractserver.worker_uploads",
     "opencontractserver.discovery",
+    "opencontractserver.benchmarks",
 ]
 
 # https://docs.djangoproject.com/en/dev/ref/settings/#installed-apps
 
@@ -130,7 +130,15 @@
 #
 # Integration tests that need to verify actual service connectivity should
 # explicitly instantiate the real embedder class (e.g., MicroserviceEmbedder).
-DEFAULT_EMBEDDER = "opencontractserver.pipeline.embedders.test_embedder.TestEmbedder"
+#
+# Intentionally env-overridable: benchmark runs via the test.yml compose
+# stack (see opencontractserver/benchmarks/) need to swap in a real embedder
+# at runtime without editing settings. Standard CI never sets DEFAULT_EMBEDDER,
+# so the default TestEmbedder keeps regular test runs hermetic.
+DEFAULT_EMBEDDER = env(
+    "DEFAULT_EMBEDDER",
+    default="opencontractserver.pipeline.embedders.test_embedder.TestEmbedder",
+)
 
 # Auth0 settings for tests
 # ------------------------------------------------------------------------------
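Note on the `DEFAULT_EMBEDDER` change in this hunk: the intent is that an unset environment variable keeps the hermetic test embedder while a set one swaps in a real embedder. The same lookup, written against the standard library instead of the project's `env(...)` helper, might look like this. `default_embedder` is a hypothetical function, and the `environ` parameter exists only so the behaviour can be exercised without touching the real process environment.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

# Default class path taken from the diff above.
TEST_EMBEDDER = "opencontractserver.pipeline.embedders.test_embedder.TestEmbedder"


def default_embedder(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the embedder class path, honouring a DEFAULT_EMBEDDER override."""
    source = os.environ if environ is None else environ
    # Unset variable: fall back to the hermetic test embedder.
    return source.get("DEFAULT_EMBEDDER", TEST_EMBEDDER)
```

For example, `default_embedder({})` yields the `TestEmbedder` path, while passing `{"DEFAULT_EMBEDDER": "some.module.RealEmbedder"}` yields that (made-up) path instead.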