Skip to content

Commit a9efa91

Browse files
committed
Enforce Embedding.embedder_path NOT NULL to match partial unique constraints
The field was declared null=True, blank=True while its Python annotation claimed str, surfacing as a mypy assignment mismatch in the #1331 baseline. More importantly, the unique_embedding_per_{document,annotation,note, conversation,message}_embedder partial unique constraints (migration 0059) key on (embedder_path, <parent>) and filter by <parent>__isnull=False — so any row with NULL embedder_path bypassed duplicate prevention for its parent entirely. Every production creation path already supplies a concrete embedder_path: Embedding.objects.store_embedding(), HasEmbeddingMixin.add_embedding(), worker_uploads._store_embeddings() (which skips when empty). Enforcing NOT NULL matches actual behaviour. Migration 0068 backfills legacy NULL rows with settings.DEFAULT_EMBEDDER, deleting rows whose backfill would collide with an existing (default_embedder_path, parent) row (those rows were already unreachable since no query path filters on NULL). Runs atomic=False so the RunPython commits before AlterField takes the ACCESS EXCLUSIVE lock to SET NOT NULL, matching the pattern from migration 0059. Also removed the now-unreachable `or 'Unknown Model'` fallback in Embedding.__str__. Closes #1357
1 parent d9709c4 commit a9efa91

3 files changed

Lines changed: 118 additions & 4 deletions

File tree

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Fixed
11+
12+
- **`Embedding.embedder_path` could be NULL but was typed `str`** (Issue #1357, `opencontractserver/annotations/models.py:461-465`, `opencontractserver/annotations/models.py:584-585`, `opencontractserver/annotations/migrations/0068_enforce_embedder_path_not_null.py`): The Django field was declared `null=True, blank=True` while the Python annotation claimed `str`, causing a long-standing mypy `assignment` error and — more importantly — silently gutting the partial unique constraints added in migration 0059. Each `unique_embedding_per_{document,annotation,note,conversation,message}_embedder` constraint is conditioned on `<parent>__isnull=False` and keys on `(embedder_path, <parent>)`, so any row with `embedder_path IS NULL` bypassed duplicate prevention for its parent. Every production code path that creates an `Embedding` (`Embedding.objects.store_embedding()`, `HasEmbeddingMixin.add_embedding()`, `worker_uploads._store_embeddings()`) already supplies a concrete `embedder_path` or skips creation when empty, so enforcing non-null at the DB level matches actual behaviour rather than constraining it. New migration 0068 backfills any legacy NULL rows with `settings.DEFAULT_EMBEDDER` (deleting rows that would collide with an existing `(default_embedder_path, parent)` row under the partial unique constraint — they were previously unreachable via any query path since all call sites filter on a concrete embedder path), then `AlterField`s the column to `NOT NULL`. Removed the now-unreachable `or 'Unknown Model'` fallback in `Embedding.__str__`. Migration runs with `atomic = False` so the RunPython backfill commits before `AlterField` takes the `ACCESS EXCLUSIVE` lock to set `NOT NULL`, matching the pattern established by migration 0059.
13+
1014
### Added
1115

1216
- **Mypy type-checking wired into pre-commit and CI** (Issue #1331): The existing `[mypy]` block in `setup.cfg` and the `mypy==1.20.1` / `django-stubs==6.0.2` / `djangorestframework-stubs==3.16.9` pins in `requirements/local.txt` were never actually enforced, so the investment was drifting (48 pre-existing `# type: ignore` markers, many modules at 0% annotation coverage). This PR turns on the gate without requiring the 7208 existing errors across 357 files to be fixed first — `mypy.ini` lists each of those files under its own `[mypy-<module>] ignore_errors = True` section, so new modules added outside the baseline **are** type-checked and CI / the hook fails on their errors.
Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
"""
2+
Backfill any legacy Embedding rows with NULL ``embedder_path`` and then
3+
tighten the column to ``NOT NULL``.
4+
5+
Context (issue #1357): ``Embedding.embedder_path`` was declared
6+
``null=True, blank=True`` on the Django field while its Python annotation
7+
claimed ``str``. The partial unique constraints added in migration 0059
8+
reference ``embedder_path`` with ``condition=Q(<parent>__isnull=False)``,
9+
meaning any row where ``embedder_path`` is NULL silently bypasses duplicate
10+
prevention. Every production creation path (store_embedding, add_embedding,
11+
worker_uploads) already supplies a concrete value, so we enforce the
12+
invariant at the DB level to match.
13+
14+
Backfill strategy:
15+
1. For each NULL-embedder_path row, set ``embedder_path`` to
16+
``settings.DEFAULT_EMBEDDER``.
17+
2. If that assignment would collide with an existing (embedder_path,
18+
parent) row under a partial unique constraint, delete the NULL row
19+
instead — it cannot be matched by any query (all call sites filter
20+
on a concrete ``embedder_path``) so it was effectively dead data.
21+
"""
22+
23+
import logging
24+
25+
from django.conf import settings
26+
from django.db import IntegrityError, migrations, models, transaction
27+
28+
logger = logging.getLogger(__name__)
29+
30+
31+
def backfill_null_embedder_paths(apps, schema_editor):
32+
Embedding = apps.get_model("annotations", "Embedding")
33+
34+
default_embedder_path = getattr(settings, "DEFAULT_EMBEDDER", "") or ""
35+
if not default_embedder_path:
36+
logger.warning(
37+
"settings.DEFAULT_EMBEDDER is empty; NULL embedder_path rows will "
38+
"be deleted because no sensible backfill value is available."
39+
)
40+
41+
null_rows = Embedding.objects.filter(embedder_path__isnull=True)
42+
total = null_rows.count()
43+
if total == 0:
44+
logger.info("No Embedding rows with NULL embedder_path — nothing to backfill.")
45+
return
46+
47+
backfilled = 0
48+
deleted = 0
49+
50+
# Use .iterator() to avoid loading the full set into memory on large tables.
51+
for emb in null_rows.iterator(chunk_size=500):
52+
if not default_embedder_path:
53+
emb.delete()
54+
deleted += 1
55+
continue
56+
57+
emb.embedder_path = default_embedder_path
58+
try:
59+
with transaction.atomic():
60+
emb.save(update_fields=["embedder_path"])
61+
backfilled += 1
62+
except IntegrityError:
63+
# A (default_embedder_path, parent) row already exists and is
64+
# covered by the partial unique constraint. The legacy NULL row
65+
# cannot be queried (no call site filters on NULL), so dropping
66+
# it is a lossless cleanup.
67+
logger.info(
68+
"Dropping NULL-embedder_path Embedding id=%s: backfill to %r "
69+
"would duplicate an existing row under the partial unique "
70+
"constraint.",
71+
emb.pk,
72+
default_embedder_path,
73+
)
74+
emb.delete()
75+
deleted += 1
76+
77+
logger.warning(
78+
"Embedding.embedder_path backfill complete: backfilled=%s, deleted=%s, "
79+
"total=%s.",
80+
backfilled,
81+
deleted,
82+
total,
83+
)
84+
85+
86+
def reverse_backfill(apps, schema_editor):
87+
"""No-op: we cannot restore rows that were deleted, and re-nulling
88+
backfilled rows would be indistinguishable from values that have always
89+
been ``settings.DEFAULT_EMBEDDER``."""
90+
91+
92+
class Migration(migrations.Migration):
93+
atomic = False
94+
95+
dependencies = [
96+
("annotations", "0067_merge_20260316_0312"),
97+
]
98+
99+
operations = [
100+
migrations.RunPython(backfill_null_embedder_paths, reverse_backfill),
101+
migrations.AlterField(
102+
model_name="embedding",
103+
name="embedder_path",
104+
field=models.CharField(
105+
help_text=(
106+
"Identifier for the embedding model or pipeline used "
107+
"(e.g. 'openai/text-embedding-ada-002')."
108+
),
109+
max_length=256,
110+
),
111+
),
112+
]

opencontractserver/annotations/models.py

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -458,11 +458,9 @@ class Embedding(BaseOCModel):
458458
help_text="References the ChatMessage that this embedding belongs to (if any).",
459459
)
460460

461-
# The name/path of the model used to generate this embedding
461+
# Required: NULL would silently bypass the partial unique constraints below.
462462
embedder_path: str = django.db.models.CharField(
463463
max_length=256,
464-
null=True,
465-
blank=True,
466464
help_text="Identifier for the embedding model or pipeline used (e.g. 'openai/text-embedding-ada-002').",
467465
)
468466

@@ -584,7 +582,7 @@ class Meta:
584582
verbose_name_plural = "Embeddings"
585583

586584
def __str__(self):
587-
return f"Embedding (ID={self.pk}) [{self.embedder_path or 'Unknown Model'}]"
585+
return f"Embedding (ID={self.pk}) [{self.embedder_path}]"
588586

589587

590588
class StructuralAnnotationSet(BaseOCModel):

0 commit comments

Comments
 (0)