diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 10 additions & 0 deletions
diff --git a/opencontractserver/llms/vector_stores/core_vector_stores.py b/opencontractserver/llms/vector_stores/core_vector_stores.py
Lines changed: 55 additions & 7 deletions
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Security
+
+- **Cross-corpus structural-annotation leak in `CoreAnnotationVectorStore`** (`opencontractserver/llms/vector_stores/core_vector_stores.py:296-326,371-413`): The corpus-wide retrieval path (`corpus_id` set, `document_id=None`) could return structural annotations from every corpus in the database. Two interacting defects were involved, one that leaked rows and one that masked the leak on the default path:
+  1. `Q(structural=True)` in the corpus-only branch had **no corpus constraint** — parser-produced structural annotations have `Annotation.document_id = corpus_id = NULL`, so corpus membership is only knowable through `structural_set → Document.structural_annotation_set (reverse FK) → DocumentPath.corpus_id`, a join the previous code did not perform.
+  2. The `check_corpus_deletion` block (default `True`) added `Q(document_id__in=active_doc_ids)`, and `__in` lookups never match `NULL`, so structural annotations were silently dropped on the production-default path. Bypassing this filter with `check_corpus_deletion=False` exposed defect #1 directly.
+  - **Impact**: Multi-tenant deployments leaked structural annotations across tenant corpora — a real security boundary violation since the upfront IDOR check only validated the *requested* `corpus_id`, not the *returned* rows. Single-tenant deployments saw it as a corpus-scoping / search-quality bug (e.g. corpus-wide benchmark runs returned chunks from abandoned corpora). The standard `Annotation.objects.visible_to_user()` permission filter was bypassed entirely because the vector store builds its own filter chain rather than going through that manager method.
+  - **Fix** (corpus boundary + per-document visibility for the structural class): the corpus-only branch now requires `structural_set_id__in=<sets reachable from a document in this corpus that is visible to the user>`, joining through `Document.objects.visible_to_user(user).filter(path_records__corpus_id=...)`. The deletion-aware filter accepts both `document_id__in=active` AND structural rows whose set links to one of those active documents, so parser-produced structural annotations remain reachable on the default path. `CoreAnnotationVectorStore.global_search()` was already correct (it explicitly joins via `structural_set__documents__in=accessible_doc_ids`) and is unchanged.
+  - Regression coverage: `opencontractserver/tests/test_corpus_isolation_vector_store.py` — six tests covering cross-corpus leak, deletion-aware drop, orphan-set leak, document-scoped retrieval still returns structural rows, viewer-without-doc-permission excluded, creator still sees own row.
+- **Test-only**: `opencontractserver/tests/test_pydantic_ai_agents.py`, `opencontractserver/tests/test_structural_annotation_portability.py` — `Document.objects.create(...)` calls in `TransactionTestCase` setUp now pass `processing_started=timezone.now()` to short-circuit `process_doc_on_create_atomic`, which would otherwise eagerly chain a Celery PDF-ingest task that fails on the (file-less) test document and aborts the whole test class. Pre-existing failure, exposed cleanly when the regression suite was added.
+
 ### Added
 
 - **Mypy type-checking wired into pre-commit and CI** (Issue #1331): The existing `[mypy]` block in `setup.cfg` and the `mypy==1.20.1` / `django-stubs==6.0.2` / `djangorestframework-stubs==3.16.9` pins in `requirements/local.txt` were never actually enforced, so the investment was drifting (48 pre-existing `# type: ignore` markers, many modules at 0% annotation coverage). This PR turns on the gate without requiring the 7208 existing errors across 357 files to be fixed first — `mypy.ini` lists each of those files under its own `[mypy-<module>] ignore_errors = True` section, so new modules added outside the baseline **are** type-checked and CI / the hook fails on their errors.
 
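
The corpus-boundary defect described in the Security entry can be reproduced outside Django. The following is a minimal sketch, not the project's code: it assumes a simplified three-table schema (`document`, `document_path`, `annotation`) whose column names are illustrative stand-ins for the real models, and it omits the per-user visibility check entirely. It compares the old corpus-only predicate with one that joins through the document's structural set.

```python
import sqlite3

# Hypothetical, simplified schema: the real models have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (id INTEGER PRIMARY KEY, structural_annotation_set_id INTEGER);
    CREATE TABLE document_path (document_id INTEGER, corpus_id INTEGER);
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        corpus_id INTEGER,          -- NULL for parser-produced structural rows
        structural INTEGER NOT NULL,
        structural_set_id INTEGER
    );
    -- Document 10 lives in corpus 1, document 20 in corpus 2.
    INSERT INTO document VALUES (10, 100), (20, 200);
    INSERT INTO document_path VALUES (10, 1), (20, 2);
    INSERT INTO annotation VALUES
        (1, 1, 0, NULL),     -- non-structural, corpus 1
        (2, NULL, 1, 100),   -- structural, reachable from corpus 1
        (3, NULL, 1, 200),   -- structural, reachable from corpus 2 only
        (4, 2, 0, NULL);     -- non-structural, corpus 2
    """
)


def annotation_ids(where_clause, params):
    """Return sorted annotation ids matching the given WHERE clause."""
    sql = f"SELECT id FROM annotation WHERE {where_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


# Old predicate: structural rows carry no corpus constraint at all.
old_ids = annotation_ids(
    "structural = 1 OR (structural = 0 AND corpus_id = ?)", (1,)
)

# Scoped predicate: structural rows must belong to a set that is linked to
# a document with a path in the requested corpus.
new_ids = annotation_ids(
    """
    (structural = 1 AND structural_set_id IN (
        SELECT d.structural_annotation_set_id
        FROM document d
        JOIN document_path p ON p.document_id = d.id
        WHERE p.corpus_id = ?
    ))
    OR (structural = 0 AND corpus_id = ?)
    """,
    (1, 1),
)

print(old_ids)  # [1, 2, 3]  (row 3 belongs to corpus 2: the leak)
print(new_ids)  # [1, 2]
```

In the actual change the subquery additionally starts from `Document.objects.visible_to_user(user)`, so the same filter also enforces per-document permissions; that part is not modelled here.
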
@@ -290,6 +290,7 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
         # Check for deleted documents in corpus
         if self.check_corpus_deletion and self.corpus_id and not self.document_id:
             # Note: sync_to_async already imported at module level
+            from opencontractserver.annotations.models import StructuralAnnotationSet
             from opencontractserver.documents.models import DocumentPath
 
             # Get documents with active (non-deleted) paths in corpus
@@ -302,8 +303,20 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
             )()
 
             if active_doc_ids:
-                # Ensure we only search documents with active paths
-                active_filters &= Q(document_id__in=active_doc_ids)
+                # Two annotation shapes pass this filter:
+                #   1. Direct: Annotation.document_id is in the active set.
+                #   2. Structural: Annotation.document_id is NULL but the
+                #      structural_set links to one of those active documents.
+                #      ``document_id__in`` cannot match NULL on its own, so
+                #      structural annotations need an explicit OR clause —
+                #      otherwise every parser-produced structural row is
+                #      silently dropped on this corpus-wide path.
+                active_struct_set_ids = StructuralAnnotationSet.objects.filter(
+                    documents__in=active_doc_ids
+                ).values("id")
+                active_filters &= Q(document_id__in=active_doc_ids) | Q(
+                    structural=True, structural_set_id__in=active_struct_set_ids
+                )
                 _logger.debug(f"Found {len(active_doc_ids)} active documents in corpus")
             else:
                 _logger.warning(f"No active documents found in corpus {self.corpus_id}")
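
The hunk above relies on the SQL rule that `IN` never matches `NULL`. As a hedged illustration only, the sketch below reproduces that behaviour with the standard-library `sqlite3` module; the table and column names are simplified assumptions rather than the real schema, and the list of active documents is hard-coded to a single id.

```python
import sqlite3

# Hypothetical, simplified schema used only to show the NULL / IN behaviour.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (id INTEGER PRIMARY KEY, structural_annotation_set_id INTEGER);
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        document_id INTEGER,        -- NULL for parser-produced structural rows
        structural INTEGER NOT NULL,
        structural_set_id INTEGER
    );
    INSERT INTO document VALUES (10, 100), (99, 200);
    INSERT INTO annotation VALUES
        (1, 10, 0, NULL),    -- direct annotation on active document 10
        (2, NULL, 1, 100),   -- structural, set linked to active document 10
        (3, NULL, 1, 200);   -- structural, set linked to inactive document 99
    """
)

ACTIVE_DOC_ID = 10  # stands in for the active_doc_ids list


def annotation_ids(where_clause, params):
    """Return sorted annotation ids matching the given WHERE clause."""
    sql = f"SELECT id FROM annotation WHERE {where_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


# Previous filter: rows with document_id NULL can never satisfy IN (...).
old_ids = annotation_ids("document_id IN (?)", (ACTIVE_DOC_ID,))

# Filter with the explicit OR clause for structural rows whose set is
# linked to an active document.
new_ids = annotation_ids(
    """
    document_id IN (?)
    OR (structural = 1 AND structural_set_id IN (
        SELECT structural_annotation_set_id FROM document WHERE id IN (?)
    ))
    """,
    (ACTIVE_DOC_ID, ACTIVE_DOC_ID),
)

print(old_ids)  # [1]     (structural row 2 is silently dropped)
print(new_ids)  # [1, 2]  (row 3 stays excluded: its document is inactive)
```

The OR clause restores structural rows without widening the result: a structural row is admitted only when its set is tied to a document that is itself in the active list.
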
@@ -361,11 +374,46 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
             # --- Corpus-only context (no document_id specified) ---
             _logger.debug(f"Corpus-only context: corpus_id={self.corpus_id}")
             # Annotations must be either:
-            # a) Structural (their Annotation.corpus_id might be null, included by nature)
-            # b) Non-structural AND directly linked to this corpus via Annotation.corpus_id.
-            active_filters &= Q(structural=True) | Q(
-                structural=False, corpus_id=self.corpus_id
-            )
+            # a) Structural — restricted to ``structural_set``s reachable from
+            #    a document in this corpus that is *visible to the requesting
+            #    user*. This collapses two checks into one filter:
+            #
+            #      • Corpus boundary (was missing): a parser-produced
+            #        structural annotation has ``document_id = corpus_id =
+            #        NULL``; its corpus membership is only knowable via
+            #        ``structural_set → Document.structural_annotation_set
+            #        (reverse FK) → DocumentPath.corpus_id``.
+            #      • Per-document permission (was bypassed): structural rows
+            #        previously matched ``Q(structural=True)`` unconditionally,
+            #        so a user with permission on the *corpus* could be served
+            #        rows whose underlying documents they have no permission
+            #        on.
+            #
+            #    Without this, the corpus-wide path leaked structural rows from
+            #    every other corpus in the database (and from inaccessible
+            #    documents within the same corpus).
+            # b) Non-structural — directly linked to this corpus via
+            #    ``Annotation.corpus_id``. Per-document visibility for these
+            #    rows is enforced by the visibility filter further below
+            #    plus the upfront IDOR check on ``corpus_id``.
+            # Document is imported earlier in this method (line 211); reusing
+            # the local binding avoids an F811 redefinition warning.
+            from opencontractserver.annotations.models import StructuralAnnotationSet
+
+            visible_corpus_doc_ids = await sync_to_async(
+                lambda: list(
+                    Document.objects.visible_to_user(user)
+                    .filter(path_records__corpus_id=self.corpus_id)
+                    .values_list("id", flat=True)
+                    .distinct()
+                )
+            )()
+            visible_corpus_set_ids = StructuralAnnotationSet.objects.filter(
+                documents__in=visible_corpus_doc_ids
+            ).values("id")
+            active_filters &= Q(
+                structural=True, structural_set_id__in=visible_corpus_set_ids
+            ) | Q(structural=False, corpus_id=self.corpus_id)
 
         # ------------------------------------------------------------------ #
         # Apply accumulated document/corpus scope filters if any were added