Skip to content

Commit 85a6789

Browse files
committed
Merge remote-tracking branch 'origin/main' into claude/fix-issue-1360-rT6dl
2 parents e42bb27 + 49309d5 commit 85a6789

7 files changed

Lines changed: 638 additions & 25 deletions

File tree

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Security
11+
12+
- **Cross-corpus structural-annotation leak in `CoreAnnotationVectorStore`** (`opencontractserver/llms/vector_stores/core_vector_stores.py:296-326,371-413`): The corpus-wide retrieval path (`corpus_id` set, `document_id=None`) returned every structural annotation in the database regardless of corpus. Two collaborating defects caused the leak:
13+
1. `Q(structural=True)` in the corpus-only branch had **no corpus constraint** — parser-produced structural annotations have `Annotation.document_id = corpus_id = NULL`, so corpus membership is only knowable through `structural_set → Document.structural_annotation_set (reverse FK) → DocumentPath.corpus_id`, a join the previous code did not perform.
14+
2. The `check_corpus_deletion` block (default `True`) added `Q(document_id__in=active_doc_ids)`, and `__in` lookups never match `NULL`, so structural annotations were silently dropped on the production-default path. Bypassing this filter with `check_corpus_deletion=False` exposed defect #1 directly.
15+
- **Impact**: Multi-tenant deployments leaked structural annotations across tenant corpora — a real security boundary violation since the upfront IDOR check only validated the *requested* `corpus_id`, not the *returned* rows. Single-tenant deployments saw it as a corpus-scoping / search-quality bug (e.g. corpus-wide benchmark runs returned chunks from abandoned corpora). The standard `Annotation.objects.visible_to_user()` permission filter was bypassed entirely because the vector store builds its own filter chain rather than going through that manager method.
16+
- **Fix** (corpus boundary + per-document visibility for the structural class): the corpus-only branch now requires `structural_set_id__in=<sets reachable from a document in this corpus that is visible to the user>`, joining through `Document.objects.visible_to_user(user).filter(path_records__corpus_id=...)`. The deletion-aware filter accepts both `document_id__in=active` AND structural rows whose set links to one of those active documents, so parser-produced structural annotations remain reachable on the default path. `CoreAnnotationVectorStore.global_search()` was already correct (it explicitly joins via `structural_set__documents__in=accessible_doc_ids`) and is unchanged.
17+
- Regression coverage: `opencontractserver/tests/test_corpus_isolation_vector_store.py` — six tests covering cross-corpus leak, deletion-aware drop, orphan-set leak, document-scoped retrieval still returns structural rows, viewer-without-doc-permission excluded, creator still sees own row.
18+
- **Test-only**: `opencontractserver/tests/test_pydantic_ai_agents.py`, `opencontractserver/tests/test_structural_annotation_portability.py``Document.objects.create(...)` calls in `TransactionTestCase` setUp now pass `processing_started=timezone.now()` to short-circuit `process_doc_on_create_atomic`, which would otherwise eagerly chain a Celery PDF-ingest task that fails on the (file-less) test document and aborts the whole test class. Pre-existing failure, exposed cleanly when the regression suite was added.
19+
1020
### Added
1121

1222
- **Mypy type-checking wired into pre-commit and CI** (Issue #1331): The existing `[mypy]` block in `setup.cfg` and the `mypy==1.20.1` / `django-stubs==6.0.2` / `djangorestframework-stubs==3.16.9` pins in `requirements/local.txt` were never actually enforced, so the investment was drifting (48 pre-existing `# type: ignore` markers, many modules at 0% annotation coverage). This PR turns on the gate without requiring the 7208 existing errors across 357 files to be fixed first — `mypy.ini` lists each of those files under its own `[mypy-<module>] ignore_errors = True` section, so new modules added outside the baseline **are** type-checked and CI / the hook fails on their errors.

frontend/.prettierignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ node_modules/
1313
# Generated files
1414
*.min.js
1515
*.min.css
16+
public/env-config.js
1617

1718
# Playwright cache
1819
playwright/.cache/

opencontractserver/documents/management/commands/validate_v3_migration.py

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
"""
1010

1111
import logging
12+
from typing import cast
1213

1314
from django.core.management.base import BaseCommand
1415
from django.db.models import Count
@@ -298,9 +299,14 @@ def _check_structural_set_uniqueness(self, verbose):
298299
)
299300

300301
if verbose:
301-
for content_hash, count in duplicates.values_list(
302-
"content_hash", "count"
303-
)[:5]:
302+
# django-stubs 6.0.3 mistypes `.values().annotate().values_list()`
303+
# chains as the annotated model rather than a tuple iterable, so
304+
# materialise with an explicit cast. Functionally identical.
305+
top_duplicates = cast(
306+
list[tuple[str, int]],
307+
list(duplicates.values_list("content_hash", "count")[:5]),
308+
)
309+
for content_hash, count in top_duplicates:
304310
self.stdout.write(
305311
f" - Hash {content_hash[:16]}...: {count} duplicates"
306312
)

opencontractserver/llms/vector_stores/core_vector_stores.py

Lines changed: 83 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
from django.contrib.auth import get_user_model
1010
from django.db.models import Q, QuerySet
1111

12-
from opencontractserver.annotations.models import Annotation
12+
from opencontractserver.annotations.models import Annotation, StructuralAnnotationSet
1313
from opencontractserver.constants.search import (
1414
FTS_CONFIG,
1515
HYBRID_SEARCH_OVERSAMPLE_FACTOR,
@@ -208,7 +208,7 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
208208
# enumeration attacks.
209209
# -------------------------------------------------------------------------
210210
from opencontractserver.corpuses.models import Corpus
211-
from opencontractserver.documents.models import Document
211+
from opencontractserver.documents.models import Document, DocumentPath
212212

213213
user = None
214214
if self.user_id:
@@ -289,22 +289,47 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
289289

290290
# Check for deleted documents in corpus
291291
if self.check_corpus_deletion and self.corpus_id and not self.document_id:
292-
# Note: sync_to_async already imported at module level
293-
from opencontractserver.documents.models import DocumentPath
294-
295-
# Get documents with active (non-deleted) paths in corpus
296-
active_doc_ids = await sync_to_async(
297-
lambda: list(
298-
DocumentPath.objects.filter(
299-
corpus_id=self.corpus_id, is_current=True, is_deleted=False
300-
).values_list("document_id", flat=True)
292+
# Lazy subquery — never round-trips through Python, so the
293+
# generated SQL stays a single statement with a real subquery
294+
# rather than a giant ``IN (val, val, ...)`` literal even for
295+
# corpora with tens of thousands of documents.
296+
active_doc_ids_qs = (
297+
DocumentPath.objects.filter(
298+
corpus_id=self.corpus_id, is_current=True, is_deleted=False
299+
)
300+
.values("document_id")
301+
.distinct()
302+
)
303+
# Trade-off: this ``EXISTS`` adds one extra round-trip on the
304+
# happy path, but lets us short-circuit the entire vector search
305+
# for empty/all-deleted corpora (returning ``Annotation.none()``
306+
# spares a downstream HNSW probe and keeps the existing
307+
# operational warning). For corpora with at least one active
308+
# document the cost is a single boolean SELECT and is dwarfed
309+
# by the main query. Removing the check would also remove the
310+
# debug log of the active-doc count that the materialised list
311+
# used to provide.
312+
has_active_docs = await sync_to_async(active_doc_ids_qs.exists)()
313+
314+
if has_active_docs:
315+
# Two annotation shapes pass this filter:
316+
# 1. Direct: Annotation.document_id is in the active set.
317+
# 2. Structural: Annotation.document_id is NULL but the
318+
# structural_set links to one of those active documents.
319+
# ``document_id__in`` cannot match NULL on its own, so
320+
# structural annotations need an explicit OR clause —
321+
# otherwise every parser-produced structural row is
322+
# silently dropped on this corpus-wide path.
323+
active_struct_set_ids = (
324+
StructuralAnnotationSet.objects.filter(
325+
documents__in=active_doc_ids_qs
326+
)
327+
.values("id")
328+
.distinct()
329+
)
330+
active_filters &= Q(document_id__in=active_doc_ids_qs) | Q(
331+
structural=True, structural_set_id__in=active_struct_set_ids
301332
)
302-
)()
303-
304-
if active_doc_ids:
305-
# Ensure we only search documents with active paths
306-
active_filters &= Q(document_id__in=active_doc_ids)
307-
_logger.debug(f"Found {len(active_doc_ids)} active documents in corpus")
308333
else:
309334
_logger.warning(f"No active documents found in corpus {self.corpus_id}")
310335
return Annotation.objects.none()
@@ -361,11 +386,48 @@ async def _build_base_queryset(self) -> QuerySet[Annotation]:
361386
# --- Corpus-only context (no document_id specified) ---
362387
_logger.debug(f"Corpus-only context: corpus_id={self.corpus_id}")
363388
# Annotations must be either:
364-
# a) Structural (their Annotation.corpus_id might be null, included by nature)
365-
# b) Non-structural AND directly linked to this corpus via Annotation.corpus_id.
366-
active_filters &= Q(structural=True) | Q(
367-
structural=False, corpus_id=self.corpus_id
389+
# a) Structural — restricted to ``structural_set``s reachable from
390+
# a document in this corpus that is *visible to the requesting
391+
# user*. This collapses two checks into one filter:
392+
#
393+
# • Corpus boundary (was missing): a parser-produced
394+
# structural annotation has ``document_id = corpus_id =
395+
# NULL``; its corpus membership is only knowable via
396+
# ``structural_set → Document.structural_annotation_set
397+
# (reverse FK) → DocumentPath.corpus_id``.
398+
# • Per-document permission (was bypassed): structural rows
399+
# previously matched ``Q(structural=True)`` unconditionally,
400+
# so a user with permission on the *corpus* could be served
401+
# rows whose underlying documents they have no permission
402+
# on.
403+
#
404+
# Without this, the corpus-wide path leaked structural rows from
405+
# every other corpus in the database (and from inaccessible
406+
# documents within the same corpus).
407+
# b) Non-structural — directly linked to this corpus via
408+
# ``Annotation.corpus_id``. Per-document visibility for these
409+
# rows is enforced by the visibility filter further below
410+
# plus the upfront IDOR check on ``corpus_id``.
411+
# Both subqueries below stay lazy so the SQL planner sees a
412+
# nested ``IN (SELECT ...)`` rather than a Python-materialised
413+
# ``IN (val, val, ...)`` literal — important for corpora with
414+
# tens of thousands of documents.
415+
visible_corpus_doc_ids_qs = (
416+
Document.objects.visible_to_user(user)
417+
.filter(path_records__corpus_id=self.corpus_id)
418+
.values("id")
419+
.distinct()
420+
)
421+
visible_corpus_set_ids = (
422+
StructuralAnnotationSet.objects.filter(
423+
documents__in=visible_corpus_doc_ids_qs
424+
)
425+
.values("id")
426+
.distinct()
368427
)
428+
active_filters &= Q(
429+
structural=True, structural_set_id__in=visible_corpus_set_ids
430+
) | Q(structural=False, corpus_id=self.corpus_id)
369431

370432
# ------------------------------------------------------------------ #
371433
# Apply accumulated document/corpus scope filters if any were added

0 commit comments

Comments
 (0)