Skip to content

Commit 255f84a

Browse files
committed
Scope structural annotations to the queried corpus in vector store
CoreAnnotationVectorStore._build_base_queryset() had two interacting issues on the corpus-wide path (corpus_id set, document_id None): 1. The corpus-only branch used Q(structural=True) with no corpus join, so a parser-produced structural annotation (document_id = corpus_id = NULL, set via structural_set FK) from any corpus could match. The structural class was effectively unscoped. 2. The check_corpus_deletion block ANDed Q(document_id__in=active_doc_ids). __in cannot match NULL, so the same parser-produced rows were silently dropped on the production-default path. With the deletion guard on, structural results were missing entirely; with it off, results were pulled from every corpus. Net effect was a corpus-isolation gap for the structural class — observable as cross-corpus contamination in corpus-wide retrieval and as a search quality regression in single-tenant deployments with multiple corpora. Fix walks structural_set → Document.structural_annotation_set (reverse FK) → DocumentPath to derive the set of structural_set ids reachable from documents in the queried corpus that are visible to the requesting user, then constrains the structural branch to those ids. The deletion-aware filter additionally accepts structural rows whose set links to one of the active documents, so the default path no longer drops them. Visibility is now enforced for the structural branch the same way it is for non-structural rows: via Document.objects.visible_to_user(user). The upfront IDOR check on corpus_id is unchanged. Adds opencontractserver/tests/test_corpus_isolation_vector_store.py (six regression tests). CoreAnnotationVectorStore.global_search() was already correct and is not modified. Two unrelated TransactionTestCase setUp blocks (test_pydantic_ai_agents.py, test_structural_annotation_portability.py) now pass processing_started=timezone.now() to short-circuit process_doc_on_create_atomic, which would otherwise eagerly chain a Celery PDF-ingest task that fails on a file-less test document and aborts the test class. These were already broken on main; cleaned up so the new regression suite has a green baseline alongside.
1 parent ff410db commit 255f84a

5 files changed

Lines changed: 486 additions & 8 deletions

File tree

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Security
11+
12+
- **Cross-corpus structural-annotation leak in `CoreAnnotationVectorStore`** (`opencontractserver/llms/vector_stores/core_vector_stores.py:296-326,371-413`): The corpus-wide retrieval path (`corpus_id` set, `document_id=None`) returned every structural annotation in the database regardless of corpus. Two collaborating defects caused the leak:
13+
1. `Q(structural=True)` in the corpus-only branch had **no corpus constraint** — parser-produced structural annotations have `Annotation.document_id = corpus_id = NULL`, so corpus membership is only knowable through `structural_set → Document.structural_annotation_set (reverse FK) → DocumentPath.corpus_id`, a join the previous code did not perform.
14+
2. The `check_corpus_deletion` block (default `True`) added `Q(document_id__in=active_doc_ids)`, and `__in` lookups never match `NULL`, so structural annotations were silently dropped on the production-default path. Bypassing this filter with `check_corpus_deletion=False` exposed defect #1 directly.
15+
- **Impact**: Multi-tenant deployments leaked structural annotations across tenant corpora — a real security boundary violation since the upfront IDOR check only validated the *requested* `corpus_id`, not the *returned* rows. Single-tenant deployments saw it as a corpus-scoping / search-quality bug (e.g. corpus-wide benchmark runs returned chunks from abandoned corpora). The standard `Annotation.objects.visible_to_user()` permission filter was bypassed entirely because the vector store builds its own filter chain rather than going through that manager method.
16+
- **Fix** (corpus boundary + per-document visibility for the structural class): the corpus-only branch now requires `structural_set_id__in=<sets reachable from a document in this corpus that is visible to the user>`, joining through `Document.objects.visible_to_user(user).filter(path_records__corpus_id=...)`. The deletion-aware filter accepts both `document_id__in=active` AND structural rows whose set links to one of those active documents, so parser-produced structural annotations remain reachable on the default path. `CoreAnnotationVectorStore.global_search()` was already correct (it explicitly joins via `structural_set__documents__in=accessible_doc_ids`) and is unchanged.
17+
- Regression coverage: `opencontractserver/tests/test_corpus_isolation_vector_store.py` — six tests covering cross-corpus leak, deletion-aware drop, orphan-set leak, document-scoped retrieval still returns structural rows, viewer-without-doc-permission excluded, creator still sees own row.
18+
- **Test-only**: `opencontractserver/tests/test_pydantic_ai_agents.py`, `opencontractserver/tests/test_structural_annotation_portability.py``Document.objects.create(...)` calls in `TransactionTestCase` setUp now pass `processing_started=timezone.now()` to short-circuit `process_doc_on_create_atomic`, which would otherwise eagerly chain a Celery PDF-ingest task that fails on the (file-less) test document and aborts the whole test class. Pre-existing failure, exposed cleanly when the regression suite was added.
19+
1020
### Added
1121

1222
- **Mypy type-checking wired into pre-commit and CI** (Issue #1331): The existing `[mypy]` block in `setup.cfg` and the `mypy==1.20.1` / `django-stubs==6.0.2` / `djangorestframework-stubs==3.16.9` pins in `requirements/local.txt` were never actually enforced, so the investment was drifting (48 pre-existing `# type: ignore` markers, many modules at 0% annotation coverage). This PR turns on the gate without requiring the 7208 existing errors across 357 files to be fixed first — `mypy.ini` lists each of those files under its own `[mypy-<module>] ignore_errors = True` section, so new modules added outside the baseline **are** type-checked and CI / the hook fails on their errors.

opencontractserver/llms/vector_stores/core_vector_stores.py

Lines changed: 55 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -290,6 +290,7 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
290290
# Check for deleted documents in corpus
291291
if self.check_corpus_deletion and self.corpus_id and not self.document_id:
292292
# Note: sync_to_async already imported at module level
293+
from opencontractserver.annotations.models import StructuralAnnotationSet
293294
from opencontractserver.documents.models import DocumentPath
294295

295296
# Get documents with active (non-deleted) paths in corpus
@@ -302,8 +303,20 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
302303
)()
303304

304305
if active_doc_ids:
305-
# Ensure we only search documents with active paths
306-
active_filters &= Q(document_id__in=active_doc_ids)
306+
# Two annotation shapes pass this filter:
307+
# 1. Direct: Annotation.document_id is in the active set.
308+
# 2. Structural: Annotation.document_id is NULL but the
309+
# structural_set links to one of those active documents.
310+
# ``document_id__in`` cannot match NULL on its own, so
311+
# structural annotations need an explicit OR clause —
312+
# otherwise every parser-produced structural row is
313+
# silently dropped on this corpus-wide path.
314+
active_struct_set_ids = StructuralAnnotationSet.objects.filter(
315+
documents__in=active_doc_ids
316+
).values("id")
317+
active_filters &= Q(document_id__in=active_doc_ids) | Q(
318+
structural=True, structural_set_id__in=active_struct_set_ids
319+
)
307320
_logger.debug(f"Found {len(active_doc_ids)} active documents in corpus")
308321
else:
309322
_logger.warning(f"No active documents found in corpus {self.corpus_id}")
@@ -361,11 +374,46 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
361374
# --- Corpus-only context (no document_id specified) ---
362375
_logger.debug(f"Corpus-only context: corpus_id={self.corpus_id}")
363376
# Annotations must be either:
364-
# a) Structural (their Annotation.corpus_id might be null, included by nature)
365-
# b) Non-structural AND directly linked to this corpus via Annotation.corpus_id.
366-
active_filters &= Q(structural=True) | Q(
367-
structural=False, corpus_id=self.corpus_id
368-
)
377+
# a) Structural — restricted to ``structural_set``s reachable from
378+
# a document in this corpus that is *visible to the requesting
379+
# user*. This collapses two checks into one filter:
380+
#
381+
# • Corpus boundary (was missing): a parser-produced
382+
# structural annotation has ``document_id = corpus_id =
383+
# NULL``; its corpus membership is only knowable via
384+
# ``structural_set → Document.structural_annotation_set
385+
# (reverse FK) → DocumentPath.corpus_id``.
386+
# • Per-document permission (was bypassed): structural rows
387+
# previously matched ``Q(structural=True)`` unconditionally,
388+
# so a user with permission on the *corpus* could be served
389+
# rows whose underlying documents they have no permission
390+
# on.
391+
#
392+
# Without this, the corpus-wide path leaked structural rows from
393+
# every other corpus in the database (and from inaccessible
394+
# documents within the same corpus).
395+
# b) Non-structural — directly linked to this corpus via
396+
# ``Annotation.corpus_id``. Per-document visibility for these
397+
# rows is enforced by the visibility filter further below
398+
# plus the upfront IDOR check on ``corpus_id``.
399+
# Document is imported earlier in this method (line 211); reusing
400+
# the local binding avoids an F811 redefinition warning.
401+
from opencontractserver.annotations.models import StructuralAnnotationSet
402+
403+
visible_corpus_doc_ids = await sync_to_async(
404+
lambda: list(
405+
Document.objects.visible_to_user(user)
406+
.filter(path_records__corpus_id=self.corpus_id)
407+
.values_list("id", flat=True)
408+
.distinct()
409+
)
410+
)()
411+
visible_corpus_set_ids = StructuralAnnotationSet.objects.filter(
412+
documents__in=visible_corpus_doc_ids
413+
).values("id")
414+
active_filters &= Q(
415+
structural=True, structural_set_id__in=visible_corpus_set_ids
416+
) | Q(structural=False, corpus_id=self.corpus_id)
369417

370418
# ------------------------------------------------------------------ #
371419
# Apply accumulated document/corpus scope filters if any were added

0 commit comments

Comments
 (0)