Skip to content

Commit 380c694

Browse files
authored
Merge pull request #1372 from Open-Source-Legal/claude/fix-issue-1357-p0Ivi
Enforce Embedding.embedder_path NOT NULL to match partial unique constraints
2 parents 071f38b + d503dda commit 380c694

3 files changed

Lines changed: 140 additions & 4 deletions

File tree

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
3737
- Regression coverage: `opencontractserver/tests/test_corpus_isolation_vector_store.py` — six tests covering cross-corpus leak, deletion-aware drop, orphan-set leak, document-scoped retrieval still returns structural rows, viewer-without-doc-permission excluded, creator still sees own row.
3838
- **Test-only**: `opencontractserver/tests/test_pydantic_ai_agents.py`, `opencontractserver/tests/test_structural_annotation_portability.py``Document.objects.create(...)` calls in `TransactionTestCase` setUp now pass `processing_started=timezone.now()` to short-circuit `process_doc_on_create_atomic`, which would otherwise eagerly chain a Celery PDF-ingest task that fails on the (file-less) test document and aborts the whole test class. Pre-existing failure, exposed cleanly when the regression suite was added.
3939

40+
### Fixed
41+
42+
- **`Embedding.embedder_path` could be NULL but was typed `str`** (Issue #1357, `opencontractserver/annotations/models.py:461-465`, `opencontractserver/annotations/models.py:584-585`, `opencontractserver/annotations/migrations/0068_enforce_embedder_path_not_null.py`): The Django field was declared `null=True, blank=True` while the Python annotation claimed `str`, causing a long-standing mypy `assignment` error and — more importantly — silently gutting the partial unique constraints added in migration 0059. Each `unique_embedding_per_{document,annotation,note,conversation,message}_embedder` constraint is conditioned on `<parent>__isnull=False` and keys on `(embedder_path, <parent>)`, so any row with `embedder_path IS NULL` bypassed duplicate prevention for its parent. Every production code path that creates an `Embedding` (`Embedding.objects.store_embedding()`, `HasEmbeddingMixin.add_embedding()`, `worker_uploads._store_embeddings()`) already supplies a concrete `embedder_path` or skips creation when empty, so enforcing non-null at the DB level matches actual behaviour rather than constraining it. New migration 0068 backfills any legacy NULL rows with `settings.DEFAULT_EMBEDDER` (deleting rows that would collide with an existing `(default_embedder_path, parent)` row under the partial unique constraint — they were previously unreachable via any query path since all call sites filter on a concrete embedder path), then `AlterField`s the column to `NOT NULL`. Removed the now-unreachable `or 'Unknown Model'` fallback in `Embedding.__str__`. Migration runs with `atomic = False` so the RunPython backfill commits before `AlterField` takes the `ACCESS EXCLUSIVE` lock to set `NOT NULL`, matching the pattern established by migration 0059.
43+
4044
### Added
4145

4246
- **Coverage: raise Corpus Chat & Agent Management component tests** (Issue #1276): added 36 new Playwright CT tests across the four lowest-ROI corpus components to drive coverage toward the ≥60% target. Breakdown:
Lines changed: 134 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,134 @@
1+
"""
2+
Backfill any legacy Embedding rows with NULL ``embedder_path`` and then
3+
tighten the column to ``NOT NULL``.
4+
5+
Context (issue #1357): ``Embedding.embedder_path`` was declared
6+
``null=True, blank=True`` on the Django field while its Python annotation
7+
claimed ``str``. The partial unique constraints added in migration 0059
8+
reference ``embedder_path`` with ``condition=Q(<parent>__isnull=False)``,
9+
meaning any row where ``embedder_path`` is NULL silently bypasses duplicate
10+
prevention. Every production creation path (store_embedding, add_embedding,
11+
worker_uploads) already supplies a concrete value, so we enforce the
12+
invariant at the DB level to match.
13+
14+
Backfill strategy:
15+
1. For each NULL-embedder_path row, set ``embedder_path`` to
16+
``settings.DEFAULT_EMBEDDER``.
17+
2. If that assignment would collide with an existing (embedder_path,
18+
parent) row under a partial unique constraint, delete the NULL row
19+
instead — it cannot be matched by any query (all call sites filter
20+
on a concrete ``embedder_path``) so it was effectively dead data.
21+
"""
22+
23+
import logging
24+
25+
from django.conf import settings
26+
from django.db import IntegrityError, migrations, models, transaction
27+
28+
logger = logging.getLogger(__name__)
29+
30+
31+
def backfill_null_embedder_paths(apps, schema_editor):
32+
Embedding = apps.get_model("annotations", "Embedding")
33+
34+
total = Embedding.objects.filter(embedder_path__isnull=True).count()
35+
if total == 0:
36+
logger.info("No Embedding rows with NULL embedder_path — nothing to backfill.")
37+
return
38+
39+
# Refuse to run if there's no default to backfill with — silently deleting
40+
# embedding rows because of a misconfigured env var would be irreversible.
41+
default_embedder_path = getattr(settings, "DEFAULT_EMBEDDER", "") or ""
42+
if not default_embedder_path:
43+
raise ValueError(
44+
f"settings.DEFAULT_EMBEDDER is empty but {total} Embedding row(s) "
45+
"have NULL embedder_path. Set DEFAULT_EMBEDDER (or manually clean "
46+
"up the NULL rows) before running this migration."
47+
)
48+
49+
backfilled = 0
50+
deleted = 0
51+
52+
# Keyset pagination: re-query each chunk for rows that still match the
53+
# NULL predicate AND have pk > the previous batch's max. Using
54+
# `.iterator(chunk_size=N)` here would be unsafe because we mutate or
55+
# delete every row we visit, and OFFSET-based chunking against a
56+
# shrinking result set would silently skip rows.
57+
chunk_size = 500
58+
last_pk = 0
59+
while True:
60+
batch = list(
61+
Embedding.objects.filter(
62+
embedder_path__isnull=True, pk__gt=last_pk
63+
).order_by("pk")[:chunk_size]
64+
)
65+
if not batch:
66+
break
67+
for emb in batch:
68+
emb.embedder_path = default_embedder_path
69+
try:
70+
with transaction.atomic():
71+
emb.save(update_fields=["embedder_path"])
72+
backfilled += 1
73+
except IntegrityError:
74+
# A (default_embedder_path, parent) row already exists and is
75+
# covered by the partial unique constraint. The legacy NULL row
76+
# cannot be queried (no call site filters on NULL), so dropping
77+
# it is a lossless cleanup.
78+
logger.info(
79+
"Dropping NULL-embedder_path Embedding id=%s: backfill to %r "
80+
"would duplicate an existing row under the partial unique "
81+
"constraint.",
82+
emb.pk,
83+
default_embedder_path,
84+
)
85+
emb.delete()
86+
deleted += 1
87+
last_pk = batch[-1].pk
88+
89+
if backfilled + deleted != total:
90+
logger.warning(
91+
"Embedding.embedder_path backfill: processed %s != initial NULL count %s "
92+
"(backfilled=%s, deleted=%s). Some rows may have been added/removed by "
93+
"concurrent traffic during the migration.",
94+
backfilled + deleted,
95+
total,
96+
backfilled,
97+
deleted,
98+
)
99+
logger.info(
100+
"Embedding.embedder_path backfill complete: backfilled=%s, deleted=%s, "
101+
"initial_null_count=%s.",
102+
backfilled,
103+
deleted,
104+
total,
105+
)
106+
107+
108+
def reverse_backfill(apps, schema_editor):
109+
"""No-op: we cannot restore rows that were deleted, and re-nulling
110+
backfilled rows would be indistinguishable from values that have always
111+
been ``settings.DEFAULT_EMBEDDER``."""
112+
113+
114+
class Migration(migrations.Migration):
115+
atomic = False
116+
117+
dependencies = [
118+
("annotations", "0067_merge_20260316_0312"),
119+
]
120+
121+
operations = [
122+
migrations.RunPython(backfill_null_embedder_paths, reverse_backfill),
123+
migrations.AlterField(
124+
model_name="embedding",
125+
name="embedder_path",
126+
field=models.CharField(
127+
help_text=(
128+
"Identifier for the embedding model or pipeline used "
129+
"(e.g. 'openai/text-embedding-ada-002')."
130+
),
131+
max_length=256,
132+
),
133+
),
134+
]

opencontractserver/annotations/models.py

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -463,11 +463,9 @@ class Embedding(BaseOCModel):
463463
help_text="References the ChatMessage that this embedding belongs to (if any).",
464464
)
465465

466-
# The name/path of the model used to generate this embedding
466+
# Required: NULL would silently bypass the partial unique constraints below.
467467
embedder_path: str = django.db.models.CharField(
468468
max_length=256,
469-
null=True,
470-
blank=True,
471469
help_text="Identifier for the embedding model or pipeline used (e.g. 'openai/text-embedding-ada-002').",
472470
)
473471

@@ -589,7 +587,7 @@ class Meta:
589587
verbose_name_plural = "Embeddings"
590588

591589
def __str__(self) -> str:
592-
return f"Embedding (ID={self.pk}) [{self.embedder_path or 'Unknown Model'}]"
590+
return f"Embedding (ID={self.pk}) [{self.embedder_path}]"
593591

594592

595593
class StructuralAnnotationSet(BaseOCModel):

0 commit comments

Comments
 (0)