Open-Source-Legal
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 10 additions & 0 deletions
diff --git a/frontend/.prettierignore b/frontend/.prettierignore
Lines changed: 1 addition & 0 deletions
diff --git a/opencontractserver/documents/management/commands/validate_v3_migration.py b/opencontractserver/documents/management/commands/validate_v3_migration.py
Lines changed: 9 additions & 3 deletions
diff --git a/opencontractserver/llms/vector_stores/core_vector_stores.py b/opencontractserver/llms/vector_stores/core_vector_stores.py
Lines changed: 83 additions & 21 deletions
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Security
+
+- **Cross-corpus structural-annotation leak in `CoreAnnotationVectorStore`** (`opencontractserver/llms/vector_stores/core_vector_stores.py:296-326,371-413`): The corpus-wide retrieval path (`corpus_id` set, `document_id=None`) returned every structural annotation in the database regardless of corpus. Two interacting defects were involved:
+  1. `Q(structural=True)` in the corpus-only branch had **no corpus constraint** — parser-produced structural annotations have `Annotation.document_id = corpus_id = NULL`, so corpus membership is only knowable through `structural_set → Document.structural_annotation_set (reverse FK) → DocumentPath.corpus_id`, a join the previous code did not perform.
+  2. The `check_corpus_deletion` block (default `True`) added `Q(document_id__in=active_doc_ids)`, and `__in` lookups never match `NULL`, so structural annotations were silently dropped on the production-default path. Bypassing this filter with `check_corpus_deletion=False` exposed defect #1 directly.
+  - **Impact**: Multi-tenant deployments leaked structural annotations across tenant corpora — a real security boundary violation since the upfront IDOR check only validated the *requested* `corpus_id`, not the *returned* rows. Single-tenant deployments saw it as a corpus-scoping / search-quality bug (e.g. corpus-wide benchmark runs returned chunks from abandoned corpora). The standard `Annotation.objects.visible_to_user()` permission filter was bypassed entirely because the vector store builds its own filter chain rather than going through that manager method.
+  - **Fix** (corpus boundary + per-document visibility for the structural class): the corpus-only branch now requires `structural_set_id__in=<sets reachable from a document in this corpus that is visible to the user>`, joining through `Document.objects.visible_to_user(user).filter(path_records__corpus_id=...)`. The deletion-aware filter now accepts rows matching `document_id__in=active` OR structural rows whose set links to one of those active documents, so parser-produced structural annotations remain reachable on the default path. `CoreAnnotationVectorStore.global_search()` was already correct (it explicitly joins via `structural_set__documents__in=accessible_doc_ids`) and is unchanged.
+  - Regression coverage: `opencontractserver/tests/test_corpus_isolation_vector_store.py` — six tests covering cross-corpus leak, deletion-aware drop, orphan-set leak, document-scoped retrieval still returns structural rows, viewer-without-doc-permission excluded, creator still sees own row.
+- **Test-only**: `opencontractserver/tests/test_pydantic_ai_agents.py`, `opencontractserver/tests/test_structural_annotation_portability.py` — `Document.objects.create(...)` calls in `TransactionTestCase` setUp now pass `processing_started=timezone.now()` to short-circuit `process_doc_on_create_atomic`, which would otherwise eagerly chain a Celery PDF-ingest task that fails on the (file-less) test document and aborts the whole test class. Pre-existing failure, exposed cleanly when the regression suite was added.
+
 ### Added
 
 - **Mypy type-checking wired into pre-commit and CI** (Issue #1331): The existing `[mypy]` block in `setup.cfg` and the `mypy==1.20.1` / `django-stubs==6.0.2` / `djangorestframework-stubs==3.16.9` pins in `requirements/local.txt` were never actually enforced, so the investment was drifting (48 pre-existing `# type: ignore` markers, many modules at 0% annotation coverage). This PR turns on the gate without requiring the 7208 existing errors across 357 files to be fixed first — `mypy.ini` lists each of those files under its own `[mypy-<module>] ignore_errors = True` section, so new modules added outside the baseline **are** type-checked and CI / the hook fails on their errors.
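The NULL behaviour behind defect 2 in the Security entry can be reproduced outside Django. The following is a minimal sketch using Python's built-in `sqlite3`; the `annotation` table, its columns, and the IDs are illustrative assumptions, not the project's actual schema. It shows that an `IN` filter on `document_id` never matches rows where that column is NULL, and that an explicit OR clause on the structural set brings them back.

```python
import sqlite3

# Hypothetical, simplified schema: not the real Annotation model.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        document_id INTEGER,            -- NULL for parser-produced structural rows
        structural INTEGER NOT NULL,
        structural_set_id INTEGER
    );
    INSERT INTO annotation VALUES (1, 10, 0, NULL);    -- direct annotation on doc 10
    INSERT INTO annotation VALUES (2, NULL, 1, 100);   -- structural row, set 100
    INSERT INTO annotation VALUES (3, NULL, 1, 200);   -- structural row, unrelated set
    """
)


def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# `NULL IN (10)` evaluates to NULL, which a WHERE clause treats as not true,
# so both structural rows are dropped.
in_only = ids("SELECT id FROM annotation WHERE document_id IN (10) ORDER BY id")

# Adding an explicit OR on the structural set restores the row whose set is
# linked to an active document (set 100 here, by assumption).
with_or = ids(
    "SELECT id FROM annotation "
    "WHERE document_id IN (10) "
    "   OR (structural = 1 AND structural_set_id IN (100)) "
    "ORDER BY id"
)

print(in_only)
print(with_or)
```

The second query mirrors the shape of the deletion-aware fix: direct rows pass through `document_id`, structural rows pass through their set, and the unrelated set stays excluded.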
 
@@ -13,6 +13,7 @@ node_modules/
 # Generated files
 *.min.js
 *.min.css
+public/env-config.js
 
 # Playwright cache
 playwright/.cache/
 
@@ -9,6 +9,7 @@
 """
 
 import logging
+from typing import cast
 
 from django.core.management.base import BaseCommand
 from django.db.models import Count
@@ -298,9 +299,14 @@ def _check_structural_set_uniqueness(self, verbose):
         )
 
         if verbose:
-            for content_hash, count in duplicates.values_list(
-                "content_hash", "count"
-            )[:5]:
+            # django-stubs 6.0.3 mistypes `.values().annotate().values_list()`
+            # chains as the annotated model rather than a tuple iterable, so
+            # materialise with an explicit cast. Functionally identical.
+            top_duplicates = cast(
+                list[tuple[str, int]],
+                list(duplicates.values_list("content_hash", "count")[:5]),
+            )
+            for content_hash, count in top_duplicates:
                 self.stdout.write(
                     f"    - Hash {content_hash[:16]}...: {count} duplicates"
                 )
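The comment in this hunk calls the cast "functionally identical". A small sketch of why, using only the standard library: `typing.cast` returns its argument unchanged at runtime and performs no validation, so it only affects what the type checker sees. The sample rows below are made-up stand-ins for the `values_list("content_hash", "count")` output.

```python
from typing import cast

# Made-up stand-in for list(duplicates.values_list("content_hash", "count")[:5]).
rows = [("a3f9c2e1d4b5a697", 3), ("0b7d44aa19c8e2f0", 2)]

# cast() tells the type checker to treat `rows` as list[tuple[str, int]];
# at runtime it simply returns the same object.
typed_rows = cast(list[tuple[str, int]], rows)
same_object = typed_rows is rows

# cast() does not check anything: a wrong annotation passes silently.
not_really_an_int = cast(int, "still a string")

for content_hash, count in typed_rows:
    print(f"    - Hash {content_hash[:16]}...: {count} duplicates")
```

Because nothing is checked at runtime, the cast is only as trustworthy as the annotation written next to it.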
 
@@ -9,7 +9,7 @@
 from django.contrib.auth import get_user_model
 from django.db.models import Q, QuerySet
 
-from opencontractserver.annotations.models import Annotation
+from opencontractserver.annotations.models import Annotation, StructuralAnnotationSet
 from opencontractserver.constants.search import (
     FTS_CONFIG,
     HYBRID_SEARCH_OVERSAMPLE_FACTOR,
@@ -208,7 +208,7 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
         # enumeration attacks.
         # -------------------------------------------------------------------------
         from opencontractserver.corpuses.models import Corpus
-        from opencontractserver.documents.models import Document
+        from opencontractserver.documents.models import Document, DocumentPath
 
         user = None
         if self.user_id:
@@ -289,22 +289,47 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
 
         # Check for deleted documents in corpus
         if self.check_corpus_deletion and self.corpus_id and not self.document_id:
-            # Note: sync_to_async already imported at module level
-            from opencontractserver.documents.models import DocumentPath
-
-            # Get documents with active (non-deleted) paths in corpus
-            active_doc_ids = await sync_to_async(
-                lambda: list(
-                    DocumentPath.objects.filter(
-                        corpus_id=self.corpus_id, is_current=True, is_deleted=False
-                    ).values_list("document_id", flat=True)
+            # Lazy subquery — never round-trips through Python, so the
+            # generated SQL stays a single statement with a real subquery
+            # rather than a giant ``IN (val, val, ...)`` literal even for
+            # corpora with tens of thousands of documents.
+            active_doc_ids_qs = (
+                DocumentPath.objects.filter(
+                    corpus_id=self.corpus_id, is_current=True, is_deleted=False
+                )
+                .values("document_id")
+                .distinct()
+            )
+            # Trade-off: this ``EXISTS`` adds one extra round-trip on the
+            # happy path, but lets us short-circuit the entire vector search
+            # for empty/all-deleted corpora (returning
+            # ``Annotation.objects.none()`` spares a downstream HNSW probe
+            # and keeps the existing operational warning). For corpora with
+            # at least one active document the cost is a single boolean
+            # SELECT and is dwarfed by the main query. The debug log of the
+            # active-doc count that the materialised list used to provide is
+            # dropped, since ``EXISTS`` does not count rows.
+            has_active_docs = await sync_to_async(active_doc_ids_qs.exists)()
+
+            if has_active_docs:
+                # Two annotation shapes pass this filter:
+                #   1. Direct: Annotation.document_id is in the active set.
+                #   2. Structural: Annotation.document_id is NULL but the
+                #      structural_set links to one of those active documents.
+                #      ``document_id__in`` cannot match NULL on its own, so
+                #      structural annotations need an explicit OR clause —
+                #      otherwise every parser-produced structural row is
+                #      silently dropped on this corpus-wide path.
+                active_struct_set_ids = (
+                    StructuralAnnotationSet.objects.filter(
+                        documents__in=active_doc_ids_qs
+                    )
+                    .values("id")
+                    .distinct()
+                )
+                active_filters &= Q(document_id__in=active_doc_ids_qs) | Q(
+                    structural=True, structural_set_id__in=active_struct_set_ids
                 )
-            )()
-
-            if active_doc_ids:
-                # Ensure we only search documents with active paths
-                active_filters &= Q(document_id__in=active_doc_ids)
-                _logger.debug(f"Found {len(active_doc_ids)} active documents in corpus")
             else:
                 _logger.warning(f"No active documents found in corpus {self.corpus_id}")
                 return Annotation.objects.none()
@@ -361,11 +386,48 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
             # --- Corpus-only context (no document_id specified) ---
             _logger.debug(f"Corpus-only context: corpus_id={self.corpus_id}")
             # Annotations must be either:
-            # a) Structural (their Annotation.corpus_id might be null, included by nature)
-            # b) Non-structural AND directly linked to this corpus via Annotation.corpus_id.
-            active_filters &= Q(structural=True) | Q(
-                structural=False, corpus_id=self.corpus_id
+            # a) Structural — restricted to ``structural_set``s reachable from
+            #    a document in this corpus that is *visible to the requesting
+            #    user*. This collapses two checks into one filter:
+            #
+            #      • Corpus boundary (was missing): a parser-produced
+            #        structural annotation has ``document_id = corpus_id =
+            #        NULL``; its corpus membership is only knowable via
+            #        ``structural_set → Document.structural_annotation_set
+            #        (reverse FK) → DocumentPath.corpus_id``.
+            #      • Per-document permission (was bypassed): structural rows
+            #        previously matched ``Q(structural=True)`` unconditionally,
+            #        so a user with permission on the *corpus* could be served
+            #        rows whose underlying documents they have no permission
+            #        on.
+            #
+            #    Without this, the corpus-wide path leaked structural rows from
+            #    every other corpus in the database (and from inaccessible
+            #    documents within the same corpus).
+            # b) Non-structural — directly linked to this corpus via
+            #    ``Annotation.corpus_id``. Per-document visibility for these
+            #    rows is enforced by the visibility filter further below
+            #    plus the upfront IDOR check on ``corpus_id``.
+            # Both subqueries below stay lazy so the SQL planner sees a
+            # nested ``IN (SELECT ...)`` rather than a Python-materialised
+            # ``IN (val, val, ...)`` literal — important for corpora with
+            # tens of thousands of documents.
+            visible_corpus_doc_ids_qs = (
+                Document.objects.visible_to_user(user)
+                .filter(path_records__corpus_id=self.corpus_id)
+                .values("id")
+                .distinct()
+            )
+            visible_corpus_set_ids = (
+                StructuralAnnotationSet.objects.filter(
+                    documents__in=visible_corpus_doc_ids_qs
+                )
+                .values("id")
+                .distinct()
             )
+            active_filters &= Q(
+                structural=True, structural_set_id__in=visible_corpus_set_ids
+            ) | Q(structural=False, corpus_id=self.corpus_id)
 
         # ------------------------------------------------------------------ #
         # Apply accumulated document/corpus scope filters if any were added
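The corpus-boundary part of the corpus-only branch can be illustrated in plain SQL. This is a rough sketch with Python's built-in `sqlite3`, assuming a simplified, hypothetical schema (`document`, `document_path`, `annotation`) that only mimics the relationships named in the comments above; the per-user visibility filter is deliberately left out. It compares the old unconditional `structural = 1` filter with a nested `IN (SELECT ...)` restricted to sets reachable from documents in the requested corpus.

```python
import sqlite3

# Hypothetical, simplified schema: the real models have more fields and the
# real query also applies Document.objects.visible_to_user(user).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (id INTEGER PRIMARY KEY, structural_annotation_set_id INTEGER);
    CREATE TABLE document_path (document_id INTEGER, corpus_id INTEGER);
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        structural INTEGER NOT NULL,
        structural_set_id INTEGER,
        corpus_id INTEGER
    );
    INSERT INTO document VALUES (1, 100), (2, 200);
    INSERT INTO document_path VALUES (1, 1), (2, 2);
    INSERT INTO annotation VALUES
        (1, 1, 100, NULL),   -- structural, set used by a document in corpus 1
        (2, 1, 200, NULL),   -- structural, set used by a document in corpus 2
        (3, 0, NULL, 1),     -- non-structural, corpus 1
        (4, 0, NULL, 2);     -- non-structural, corpus 2
    """
)

# Old shape: every structural row matches, whatever corpus it belongs to.
OLD_FILTER = """
    SELECT id FROM annotation
    WHERE structural = 1
       OR (structural = 0 AND corpus_id = ?)
    ORDER BY id
"""

# New shape: structural rows must belong to a set reachable from a document
# that has a path in the requested corpus (nested subqueries, no Python lists).
NEW_FILTER = """
    SELECT id FROM annotation
    WHERE (structural = 1 AND structural_set_id IN (
              SELECT structural_annotation_set_id FROM document
              WHERE id IN (SELECT document_id FROM document_path WHERE corpus_id = ?)
          ))
       OR (structural = 0 AND corpus_id = ?)
    ORDER BY id
"""

corpus_id = 1
old_ids = [row[0] for row in conn.execute(OLD_FILTER, (corpus_id,))]
new_ids = [row[0] for row in conn.execute(NEW_FILTER, (corpus_id, corpus_id))]

print(old_ids)  # includes annotation 2, which belongs to corpus 2
print(new_ids)  # annotation 2 is gone
```

Keeping both levels as subqueries matches the intent stated in the code comments: the database sees one statement rather than a large literal `IN` list built in Python.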