Skip to content

Commit c7f6732

Browse files
authored
Merge pull request #1376 from Open-Source-Legal/fix/cross-corpus-structural-annotation-leak
Scope structural annotations to the queried corpus in vector store
2 parents ff410db + 255f84a commit c7f6732

5 files changed

Lines changed: 486 additions & 8 deletions

File tree

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Security
11+
12+
- **Cross-corpus structural-annotation leak in `CoreAnnotationVectorStore`** (`opencontractserver/llms/vector_stores/core_vector_stores.py:296-326,371-413`): The corpus-wide retrieval path (`corpus_id` set, `document_id=None`) returned every structural annotation in the database regardless of corpus. Two collaborating defects caused the leak:
13+
1. `Q(structural=True)` in the corpus-only branch had **no corpus constraint** — parser-produced structural annotations have `Annotation.document_id = corpus_id = NULL`, so corpus membership is only knowable through `structural_set → Document.structural_annotation_set (reverse FK) → DocumentPath.corpus_id`, a join the previous code did not perform.
14+
2. The `check_corpus_deletion` block (default `True`) added `Q(document_id__in=active_doc_ids)`, and `__in` lookups never match `NULL`, so structural annotations were silently dropped on the production-default path. Bypassing this filter with `check_corpus_deletion=False` exposed defect #1 directly.
15+
- **Impact**: Multi-tenant deployments leaked structural annotations across tenant corpora — a real security boundary violation since the upfront IDOR check only validated the *requested* `corpus_id`, not the *returned* rows. Single-tenant deployments saw it as a corpus-scoping / search-quality bug (e.g. corpus-wide benchmark runs returned chunks from abandoned corpora). The standard `Annotation.objects.visible_to_user()` permission filter was bypassed entirely because the vector store builds its own filter chain rather than going through that manager method.
16+
- **Fix** (corpus boundary + per-document visibility for the structural class): the corpus-only branch now requires `structural_set_id__in=<sets reachable from a document in this corpus that is visible to the user>`, joining through `Document.objects.visible_to_user(user).filter(path_records__corpus_id=...)`. The deletion-aware filter accepts both `document_id__in=active` AND structural rows whose set links to one of those active documents, so parser-produced structural annotations remain reachable on the default path. `CoreAnnotationVectorStore.global_search()` was already correct (it explicitly joins via `structural_set__documents__in=accessible_doc_ids`) and is unchanged.
17+
- Regression coverage: `opencontractserver/tests/test_corpus_isolation_vector_store.py` — six tests covering cross-corpus leak, deletion-aware drop, orphan-set leak, document-scoped retrieval still returns structural rows, viewer-without-doc-permission excluded, creator still sees own row.
18+
- **Test-only**: `opencontractserver/tests/test_pydantic_ai_agents.py`, `opencontractserver/tests/test_structural_annotation_portability.py``Document.objects.create(...)` calls in `TransactionTestCase` setUp now pass `processing_started=timezone.now()` to short-circuit `process_doc_on_create_atomic`, which would otherwise eagerly chain a Celery PDF-ingest task that fails on the (file-less) test document and aborts the whole test class. Pre-existing failure, exposed cleanly when the regression suite was added.
19+
1020
### Added
1121

1222
- **Mypy type-checking wired into pre-commit and CI** (Issue #1331): The existing `[mypy]` block in `setup.cfg` and the `mypy==1.20.1` / `django-stubs==6.0.2` / `djangorestframework-stubs==3.16.9` pins in `requirements/local.txt` were never actually enforced, so the investment was drifting (48 pre-existing `# type: ignore` markers, many modules at 0% annotation coverage). This PR turns on the gate without requiring the 7208 existing errors across 357 files to be fixed first — `mypy.ini` lists each of those files under its own `[mypy-<module>] ignore_errors = True` section, so new modules added outside the baseline **are** type-checked and CI / the hook fails on their errors.

opencontractserver/llms/vector_stores/core_vector_stores.py

Lines changed: 55 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -290,6 +290,7 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
290290
# Check for deleted documents in corpus
291291
if self.check_corpus_deletion and self.corpus_id and not self.document_id:
292292
# Note: sync_to_async already imported at module level
293+
from opencontractserver.annotations.models import StructuralAnnotationSet
293294
from opencontractserver.documents.models import DocumentPath
294295

295296
# Get documents with active (non-deleted) paths in corpus
@@ -302,8 +303,20 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
302303
)()
303304

304305
if active_doc_ids:
305-
# Ensure we only search documents with active paths
306-
active_filters &= Q(document_id__in=active_doc_ids)
306+
# Two annotation shapes pass this filter:
307+
# 1. Direct: Annotation.document_id is in the active set.
308+
# 2. Structural: Annotation.document_id is NULL but the
309+
# structural_set links to one of those active documents.
310+
# ``document_id__in`` cannot match NULL on its own, so
311+
# structural annotations need an explicit OR clause —
312+
# otherwise every parser-produced structural row is
313+
# silently dropped on this corpus-wide path.
314+
active_struct_set_ids = StructuralAnnotationSet.objects.filter(
315+
documents__in=active_doc_ids
316+
).values("id")
317+
active_filters &= Q(document_id__in=active_doc_ids) | Q(
318+
structural=True, structural_set_id__in=active_struct_set_ids
319+
)
307320
_logger.debug(f"Found {len(active_doc_ids)} active documents in corpus")
308321
else:
309322
_logger.warning(f"No active documents found in corpus {self.corpus_id}")
@@ -361,11 +374,46 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
361374
# --- Corpus-only context (no document_id specified) ---
362375
_logger.debug(f"Corpus-only context: corpus_id={self.corpus_id}")
363376
# Annotations must be either:
364-
# a) Structural (their Annotation.corpus_id might be null, included by nature)
365-
# b) Non-structural AND directly linked to this corpus via Annotation.corpus_id.
366-
active_filters &= Q(structural=True) | Q(
367-
structural=False, corpus_id=self.corpus_id
368-
)
377+
# a) Structural — restricted to ``structural_set``s reachable from
378+
# a document in this corpus that is *visible to the requesting
379+
# user*. This collapses two checks into one filter:
380+
#
381+
# • Corpus boundary (was missing): a parser-produced
382+
# structural annotation has ``document_id = corpus_id =
383+
# NULL``; its corpus membership is only knowable via
384+
# ``structural_set → Document.structural_annotation_set
385+
# (reverse FK) → DocumentPath.corpus_id``.
386+
# • Per-document permission (was bypassed): structural rows
387+
# previously matched ``Q(structural=True)`` unconditionally,
388+
# so a user with permission on the *corpus* could be served
389+
# rows whose underlying documents they have no permission
390+
# on.
391+
#
392+
# Without this, the corpus-wide path leaked structural rows from
393+
# every other corpus in the database (and from inaccessible
394+
# documents within the same corpus).
395+
# b) Non-structural — directly linked to this corpus via
396+
# ``Annotation.corpus_id``. Per-document visibility for these
397+
# rows is enforced by the visibility filter further below
398+
# plus the upfront IDOR check on ``corpus_id``.
399+
# Document is imported earlier in this method (line 211); reusing
400+
# the local binding avoids an F811 redefinition warning.
401+
from opencontractserver.annotations.models import StructuralAnnotationSet
402+
403+
visible_corpus_doc_ids = await sync_to_async(
404+
lambda: list(
405+
Document.objects.visible_to_user(user)
406+
.filter(path_records__corpus_id=self.corpus_id)
407+
.values_list("id", flat=True)
408+
.distinct()
409+
)
410+
)()
411+
visible_corpus_set_ids = StructuralAnnotationSet.objects.filter(
412+
documents__in=visible_corpus_doc_ids
413+
).values("id")
414+
active_filters &= Q(
415+
structural=True, structural_set_id__in=visible_corpus_set_ids
416+
) | Q(structural=False, corpus_id=self.corpus_id)
369417

370418
# ------------------------------------------------------------------ #
371419
# Apply accumulated document/corpus scope filters if any were added

0 commit comments

Comments
 (0)