diff --git a/CHANGELOG.md b/CHANGELOG.md
index d352ed735..2ed9b0108 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
+- **Extraction grounding follow-up #2** (Issue #1407, follow-up to PR #1397):
+  - **DB-level idempotency for the grounding `get_or_create`** (`opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py`, `opencontractserver/annotations/models.py`): added two partial `UniqueConstraint`s on `Annotation` that exactly match the application-level `get_or_create` lookup keys in `opencontractserver/utils/extraction_grounding.py` — `annotation_unique_token_label_grounding_key` keyed on `(document, corpus, annotation_label, page, raw_text, creator)` and `annotation_unique_span_label_grounding_key` keyed on `(document, corpus, annotation_label, raw_text, json, creator)`, each scoped via `condition=Q(structural=False, is_grounding_source=True)` plus its `annotation_type` (`TOKEN_LABEL` / `SPAN_LABEL` respectively). Without these, two concurrent Celery workers retrying the same datacell could both miss on the SELECT and both succeed on the CREATE, defeating PR #1397's `get_or_create`. The new `Annotation.is_grounding_source` boolean (default `False`, set `True` only inside the grounding pipeline) scopes the constraints so legitimate non-grounding flows producing the same key tuple — `add_annotations_from_exact_strings` finding multiple occurrences of the same word on a page, hierarchical annotation trees with duplicate `raw_text` at different parents — are not blocked. The migration adds the field, backfills `is_grounding_source=True` for any pre-existing `OC_EXTRACT_SOURCE` rows, deduplicates the now-flagged duplicates (keeping the lowest-pk row in each group and re-pointing `datacell.sources` and the other cross-referencing FK / M2M rows) before adding the constraints — same pattern as `0068_enforce_embedder_path_not_null.py`. 
`creator` is in the key so two distinct users manually creating identical grounding rows are not blocked; only literal duplicates (the realistic Celery-race target) are. + - **`creator_id` is now in the grounding `get_or_create` lookup keys** (not just `defaults`) so the new constraints back the application-level lookup — required for `IntegrityError` to fire on the loser of a race and for `get_or_create` to fall back to the `SELECT`. + - **Documented PostgreSQL-specific `json` lookup semantics** (`opencontractserver/utils/extraction_grounding.py` `_create_span_annotation` docstring): the `json={"start", "end"}` lookup key relies on `JSONField` mapping to PostgreSQL `jsonb` (structural equality, key-order independent). On SQLite `JSONField` is plain text and the comparison would be lexical. The project's runtime and test database are both PostgreSQL, so this is documented rather than guarded. + - **Documented dedup rationale** (`_create_grounding_annotations`): the end-of-function dedup-by-pk loop is needed because distinct extraction values can legitimately resolve to the same source span (e.g. `["Acme Holdings", "Acme Holdings Inc."]` both anchoring on the same token range, or an idempotent re-run returning the existing row twice). The comment now spells out that this is intentional rather than masking a bug in `align_text_to_document`. + - **Test cleanup** (`opencontractserver/tests/test_extraction_grounding.py`): replaced the redundant `assertIsNotNone(...) + assert ... is not None` pair with a single plain `assert` (the only one of the two that actually narrows for mypy / pyright), with an inline comment explaining the choice. 
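The race the new constraints close can be sketched without a database. The following is a pure-Python simulation of the `get_or_create` contract (SELECT, then INSERT, then fall back to SELECT on conflict), not Django or project code; `FakeTable`, `UniqueViolation`, and the key tuple are all illustrative stand-ins:

```python
import threading


class UniqueViolation(Exception):
    """Stands in for the database's IntegrityError."""


class FakeTable:
    """In-memory stand-in for a table guarded by a partial UniqueConstraint."""

    def __init__(self) -> None:
        self._rows: dict[tuple, int] = {}
        self._lock = threading.Lock()
        self._next_pk = 1

    def select(self, key: tuple):
        return self._rows.get(key)

    def insert(self, key: tuple) -> int:
        with self._lock:
            if key in self._rows:  # the unique constraint firing
                raise UniqueViolation(key)
            pk, self._next_pk = self._next_pk, self._next_pk + 1
            self._rows[key] = pk
            return pk


def get_or_create(table: FakeTable, key: tuple):
    """Mirrors Django's contract: SELECT, INSERT, re-SELECT on conflict."""
    pk = table.select(key)
    if pk is not None:
        return pk, False
    try:
        return table.insert(key), True
    except UniqueViolation:
        return table.select(key), False


table = FakeTable()
key = ("doc-1", "corpus-1", "OC_EXTRACT_SOURCE", 0, "Acme Holdings", "user-7")

# The race: both workers SELECT (and miss) before either INSERTs.
assert table.select(key) is None and table.select(key) is None
pk_a = table.insert(key)  # worker A's INSERT wins
try:
    table.insert(key)  # worker B's INSERT hits the constraint...
    pk_b = None
except UniqueViolation:
    pk_b = table.select(key)  # ...and its fallback SELECT recovers the row
assert pk_a == pk_b  # both workers converge on a single row
```

Without the constraint (delete the `if key in self._rows` check), both inserts succeed and the workers end up holding two different pks — exactly the duplicate-source failure the migration guards against.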
- **Shared-protocol contract drift surfaced by PR #1400 follow-up review** (Issue #1408): - **`PermissionedQueryManagerProtocol.visible_to_user` signature mismatch** (`opencontractserver/types/protocols.py:147`): the protocol declared `visible_to_user(self, user: Any = None)` but `PermissionManager.visible_to_user` and `UserFeedbackManager.visible_to_user` have no default. A caller holding a protocol-typed reference could call `.visible_to_user()` with no args and trigger a runtime `TypeError`, and mypy would also reject the concrete classes as structural matches. Dropped the `= None` default so the protocol pins the strictest contract; the docstring now explains that callers must pass an `AnonymousUser` when no authenticated principal is available. Lenient managers (e.g. `BaseVisibilityManager` with `user=None`) still satisfy the protocol — verified with both `isinstance` and `issubclass` checks. - **`ToolProtocol` `@property` descriptors against a `@dataclass`** (`opencontractserver/types/protocols.py:97`): `CoreTool` is a `@dataclass` exposing `name` / `description` / `parameters` / `requires_approval` as plain instance attributes. Mypy accepted dataclass fields against the property-shaped protocol, but the asymmetry was a footgun for future implementors who would reach for `@property` based on the protocol surface. Converted to plain attribute declarations matching the concrete class. diff --git a/opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py b/opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py new file mode 100644 index 000000000..9815bb009 --- /dev/null +++ b/opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py @@ -0,0 +1,349 @@ +""" +Add database-level unique constraints backing the application-level +``get_or_create`` calls in ``opencontractserver/utils/extraction_grounding.py``. 
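The `visible_to_user` default-argument hazard can be reduced to a minimal sketch. These classes are illustrative stand-ins, not the real `opencontractserver` managers, and note the caveat in the comments: `runtime_checkable` `isinstance()` only verifies method presence, while the signature contract itself is enforced statically by mypy / pyright:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class StrictProto(Protocol):
    """Illustrative stand-in for PermissionedQueryManagerProtocol."""

    def visible_to_user(self, user: Any) -> list: ...  # no "= None" default


class StrictManager:
    """Like PermissionManager / UserFeedbackManager: no default."""

    def visible_to_user(self, user: Any) -> list:
        return [user]


class LenientManager:
    """Like BaseVisibilityManager: a default is still protocol-compatible."""

    def visible_to_user(self, user: Any = None) -> list:
        return [user]


# runtime_checkable isinstance() only checks that the method exists; the
# argument-default contract is a static (mypy / pyright) concern.
assert isinstance(StrictManager(), StrictProto)
assert isinstance(LenientManager(), StrictProto)

# The hazard the fix removes: under the old protocol signature
# "user: Any = None", a protocol-typed caller could legally write
# .visible_to_user() and crash at runtime against the stricter class.
try:
    StrictManager().visible_to_user()  # type: ignore[call-arg]
    raised = False
except TypeError:
    raised = True
assert raised
```

This is why dropping the protocol's default pins the strictest contract: lenient implementations keep satisfying it, while the zero-argument call site becomes a static error instead of a latent runtime `TypeError`.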
+ +Context (issue #1407, follow-up to PR #1397): +The grounding pipeline already uses ``Annotation.objects.get_or_create()`` +to make Celery retries idempotent. Without a backing ``UniqueConstraint``, +two workers racing on the same datacell can both miss on the SELECT and +both succeed on the CREATE — duplicating the source annotation that the +``get_or_create`` was meant to prevent. In practice this is rare for a +single datacell (Celery typically retries sequentially), but the +constraints below promote idempotency from a best-effort behaviour to a +correctness invariant. + +The constraints are scoped to ``is_grounding_source=True`` so that +legitimate non-grounding flows producing the same key tuple — e.g. +``add_annotations_from_exact_strings`` finding multiple occurrences of +the same word on the same page, or hierarchical annotation trees with +duplicate ``raw_text`` at different parents — are not blocked. + +Two partial unique constraints (one per annotation_type) match the +respective ``get_or_create`` lookup keys exactly: + +* TOKEN_LABEL (PDF grounding) — keyed on + ``(document, corpus, annotation_label, page, raw_text, creator)`` since + the bounding-box ``json`` is deterministic for a fixed input and we + want retries to reuse the existing row even if PlasmaPDF reformats the + bounding-box layout. +* SPAN_LABEL (text/DOCX grounding) — keyed on + ``(document, corpus, annotation_label, raw_text, json, creator)`` + because the character offsets in ``json={"start", "end"}`` ARE the + identity for a span annotation. ``page`` is omitted (always 1 in this + path). + +``creator`` is included so two users manually creating identical +grounding rows are not blocked — only a single user creating the literal +same row is, which matches the realistic race-condition target (the +grounding pipeline always uses the datacell owner's ID). + +Migration steps (in order): +1. Add the ``is_grounding_source`` boolean field (default False). +2. 
Backfill: set ``is_grounding_source=True`` for any pre-existing + annotation tied to the ``OC_EXTRACT_SOURCE`` label and not structural, + so the constraints meaningfully cover historical grounding rows. +3. Dedupe surviving duplicates among the backfilled (now-flagged) rows, + re-pointing ``datacell.sources`` M2M FKs to the lowest-pk keeper — + same pattern as ``0068_enforce_embedder_path_not_null.py``. +4. Add the two partial UniqueConstraints. +""" + +import logging + +from django.db import migrations, models, transaction +from django.db.models import Count, Q + +from opencontractserver.constants.annotations import OC_EXTRACT_SOURCE_LABEL + +logger = logging.getLogger(__name__) + + +def backfill_is_grounding_source(apps, schema_editor): + """Flag pre-existing grounding annotations. + + Without this backfill, historical OC_EXTRACT_SOURCE rows would have + ``is_grounding_source=False`` and slip past the new constraints. + """ + Annotation = apps.get_model("annotations", "Annotation") + AnnotationLabel = apps.get_model("annotations", "AnnotationLabel") + + grounding_label_ids = list( + AnnotationLabel.objects.filter(text=OC_EXTRACT_SOURCE_LABEL).values_list( + "id", flat=True + ) + ) + if not grounding_label_ids: + return + + updated = Annotation.objects.filter( + annotation_label_id__in=grounding_label_ids, + structural=False, + ).update(is_grounding_source=True) + + if updated: + logger.info( + "Backfilled is_grounding_source=True on %s pre-existing " + "OC_EXTRACT_SOURCE annotation rows.", + updated, + ) + + +def reverse_backfill(apps, schema_editor): + """Reverse: clear is_grounding_source on the backfilled rows.""" + Annotation = apps.get_model("annotations", "Annotation") + Annotation.objects.filter(is_grounding_source=True).update( + is_grounding_source=False + ) + + +def _repoint_cross_references(apps, redundant_ids: list[int], keeper_id: int) -> None: + """Repoint every realistic cross-reference from ``redundant_ids`` to ``keeper_id``. 
+ + Survey of FK / M2M references to ``Annotation`` outside the grounding + pipeline (run ``grep -nE 'ForeignKey|ManyToManyField' ...`` to refresh): + + * ``extracts.Datacell.sources`` (M2M) — primary, always + * ``feedback.UserFeedback.commented_annotation`` (FK, SET_NULL) + * ``conversations.ChatMessage.source_annotations`` (M2M) + * ``conversations.ChatMessage.created_annotations`` (M2M) + * ``users.AssignmentTask.resulting_annotations`` (M2M) + * ``annotations.Relationship.source_annotations`` (M2M) + * ``annotations.Relationship.target_annotations`` (M2M) + * ``annotations.Embedding.annotation`` (FK, CASCADE) + — handled implicitly: deleting redundant annotations cascade-deletes + their embeddings, the keeper retains its own. + * ``annotations.Annotation.parent`` (FK, CASCADE) + — grounding annotations are leaf rows; in the unlikely event a child + FK exists, CASCADE preserves migration integrity at the cost of + data on a duplicate row that should not have had children. + + Every reference except the cascading ones is repointed to the keeper so + no cross-domain data is silently dropped when the redundant row is + deleted. 
+    """
+    Datacell = apps.get_model("extracts", "Datacell")
+    UserFeedback = apps.get_model("feedback", "UserFeedback")
+    ChatMessage = apps.get_model("conversations", "ChatMessage")
+    AssignmentTask = apps.get_model("users", "AssignmentTask")
+    Relationship = apps.get_model("annotations", "Relationship")
+
+    def _repoint_m2m(through, other_fk: str) -> None:
+        # Auto-created M2M through tables carry a unique constraint on
+        # (other_fk, annotation_id). When the other side already pairs
+        # with the keeper (the common case: racing workers attach both
+        # duplicates to the same datacell), a blind .update() would raise
+        # IntegrityError, so colliding rows are deleted instead of
+        # repointed.
+        paired = through.objects.filter(annotation_id=keeper_id).values_list(
+            other_fk, flat=True
+        )
+        through.objects.filter(
+            annotation_id__in=redundant_ids, **{f"{other_fk}__in": list(paired)}
+        ).delete()
+        through.objects.filter(annotation_id__in=redundant_ids).update(
+            annotation_id=keeper_id
+        )
+
+    _repoint_m2m(Datacell.sources.through, "datacell_id")
+    UserFeedback.objects.filter(commented_annotation_id__in=redundant_ids).update(
+        commented_annotation_id=keeper_id
+    )
+    _repoint_m2m(ChatMessage.source_annotations.through, "chatmessage_id")
+    _repoint_m2m(ChatMessage.created_annotations.through, "chatmessage_id")
+    _repoint_m2m(AssignmentTask.resulting_annotations.through, "assignmenttask_id")
+    _repoint_m2m(Relationship.source_annotations.through, "relationship_id")
+    _repoint_m2m(Relationship.target_annotations.through, "relationship_id")
+
+
+def _dedupe_token_label(apps):
+    """Collapse duplicate flagged TOKEN_LABEL grounding annotations.
+
+    Groups by ``(document, corpus, annotation_label, page, raw_text,
+    creator)`` and, for each group of >1, keeps the lowest-pk row and
+    repoints all realistic cross-references (see
+    ``_repoint_cross_references``) to it before deleting the rest. 
+ """ + Annotation = apps.get_model("annotations", "Annotation") + + duplicates = ( + Annotation.objects.filter( + structural=False, + annotation_type="TOKEN_LABEL", + is_grounding_source=True, + ) + .values( + "document_id", + "corpus_id", + "annotation_label_id", + "page", + "raw_text", + "creator_id", + ) + .annotate(n=Count("id")) + .filter(n__gt=1) + ) + + total_groups = 0 + total_collapsed = 0 + for group in duplicates: + ids = list( + Annotation.objects.filter( + structural=False, + annotation_type="TOKEN_LABEL", + is_grounding_source=True, + document_id=group["document_id"], + corpus_id=group["corpus_id"], + annotation_label_id=group["annotation_label_id"], + page=group["page"], + raw_text=group["raw_text"], + creator_id=group["creator_id"], + ) + .order_by("id") + .values_list("id", flat=True) + ) + if len(ids) <= 1: + continue + keeper, *redundant = ids + with transaction.atomic(): + _repoint_cross_references(apps, redundant, keeper) + Annotation.objects.filter(id__in=redundant).delete() + total_groups += 1 + total_collapsed += len(redundant) + + if total_groups: + logger.info( + "Collapsed %s duplicate grounding TOKEN_LABEL annotation rows " + "across %s groups before adding unique constraint.", + total_collapsed, + total_groups, + ) + + +def _dedupe_span_label(apps): + """Collapse duplicate flagged SPAN_LABEL grounding annotations. + + Groups by ``(document, corpus, annotation_label, raw_text, json, + creator)``. ``json`` equality on PostgreSQL JSONB is structural + (key-order independent), which matches the application-level + ``get_or_create`` lookup behaviour. 
+ """ + Annotation = apps.get_model("annotations", "Annotation") + + duplicates = ( + Annotation.objects.filter( + structural=False, + annotation_type="SPAN_LABEL", + is_grounding_source=True, + ) + .values( + "document_id", + "corpus_id", + "annotation_label_id", + "raw_text", + "json", + "creator_id", + ) + .annotate(n=Count("id")) + .filter(n__gt=1) + ) + + total_groups = 0 + total_collapsed = 0 + for group in duplicates: + ids = list( + Annotation.objects.filter( + structural=False, + annotation_type="SPAN_LABEL", + is_grounding_source=True, + document_id=group["document_id"], + corpus_id=group["corpus_id"], + annotation_label_id=group["annotation_label_id"], + raw_text=group["raw_text"], + json=group["json"], + creator_id=group["creator_id"], + ) + .order_by("id") + .values_list("id", flat=True) + ) + if len(ids) <= 1: + continue + keeper, *redundant = ids + with transaction.atomic(): + _repoint_cross_references(apps, redundant, keeper) + Annotation.objects.filter(id__in=redundant).delete() + total_groups += 1 + total_collapsed += len(redundant) + + if total_groups: + logger.info( + "Collapsed %s duplicate grounding SPAN_LABEL annotation rows " + "across %s groups before adding unique constraint.", + total_collapsed, + total_groups, + ) + + +def dedupe_grounding_targets(apps, schema_editor): + _dedupe_token_label(apps) + _dedupe_span_label(apps) + + +def reverse_dedupe(apps, schema_editor): + """No-op: dropping the constraint cannot resurrect deleted duplicates.""" + + +class Migration(migrations.Migration): + atomic = False + + dependencies = [ + ("annotations", "0068_enforce_embedder_path_not_null"), + # Dedup helper repoints cross-references on these apps' models. Pin + # to the latest migration of each so they exist when this runs. 
+ ("conversations", "0018_conversation_memory_curated"), + ("extracts", "0028_rename_placeholder_indexes"), + ("feedback", "0006_alter_userfeedback_backend_lock"), + ("users", "0026_alter_user_username_validator"), + ] + + operations = [ + migrations.AddField( + model_name="annotation", + name="is_grounding_source", + field=models.BooleanField(default=False), + ), + migrations.RunPython(backfill_is_grounding_source, reverse_backfill), + migrations.RunPython(dedupe_grounding_targets, reverse_dedupe), + migrations.AddConstraint( + model_name="annotation", + constraint=models.UniqueConstraint( + fields=[ + "document", + "corpus", + "annotation_label", + "page", + "raw_text", + "creator", + ], + condition=Q( + structural=False, + annotation_type="TOKEN_LABEL", + is_grounding_source=True, + ), + name="annotation_unique_token_label_grounding_key", + ), + ), + migrations.AddConstraint( + model_name="annotation", + constraint=models.UniqueConstraint( + fields=[ + "document", + "corpus", + "annotation_label", + "raw_text", + "json", + "creator", + ], + condition=Q( + structural=False, + annotation_type="SPAN_LABEL", + is_grounding_source=True, + ), + name="annotation_unique_span_label_grounding_key", + ), + ), + ] diff --git a/opencontractserver/annotations/models.py b/opencontractserver/annotations/models.py index 7f3876f81..4f0cdec1d 100644 --- a/opencontractserver/annotations/models.py +++ b/opencontractserver/annotations/models.py @@ -963,6 +963,15 @@ class Annotation(BaseOCModel, HasEmbeddingMixin): # Mark structural / layout annotations explicitly. structural = django.db.models.BooleanField(default=False) + # True only for annotations created by the extraction-grounding pipeline + # (``opencontractserver/utils/extraction_grounding.py``). Backs the + # partial UniqueConstraints below — the constraints scope to this flag + # so legitimate non-grounding flows that happen to produce the same + # ``(document, corpus, label, page, raw_text, creator)`` tuple (e.g. 
+ # multi-occurrence exact-string annotation tools, hierarchical + # annotation trees) are not blocked. + is_grounding_source = django.db.models.BooleanField(default=False) + # Content modalities present in this annotation (TEXT, IMAGE, etc.) content_modalities = ArrayField( django.db.models.CharField(max_length=20), @@ -1242,6 +1251,55 @@ class Meta: "Annotations in a structural_set must have structural=True" ), ), + # Backs the application-level ``get_or_create`` in the + # extraction grounding pipeline (see + # ``opencontractserver/utils/extraction_grounding.py``). Without + # this, two concurrent Celery workers retrying the same datacell + # could both miss on the SELECT and both succeed on the CREATE, + # producing duplicate source annotations. ``creator`` is in the + # key so two distinct users manually creating identical rows + # are not blocked. Scoped to ``is_grounding_source=True`` so + # legitimate non-grounding flows that happen to produce the + # same key tuple (e.g. ``add_annotations_from_exact_strings`` + # finding multiple occurrences of the same word on a page, + # hierarchical annotation trees) are not blocked. + django.db.models.UniqueConstraint( + fields=[ + "document", + "corpus", + "annotation_label", + "page", + "raw_text", + "creator", + ], + condition=django.db.models.Q( + structural=False, + annotation_type=TOKEN_LABEL, + is_grounding_source=True, + ), + name="annotation_unique_token_label_grounding_key", + ), + # Span counterpart: ``json={"start", "end"}`` is the identity + # of a span annotation, so it replaces ``page`` in the key. + # JSONB equality on PostgreSQL is structural (key-order + # independent), which matches the application-level lookup; + # SQLite stores JSON as text and would compare lexically. 
+ django.db.models.UniqueConstraint( + fields=[ + "document", + "corpus", + "annotation_label", + "raw_text", + "json", + "creator", + ], + condition=django.db.models.Q( + structural=False, + annotation_type=SPAN_LABEL, + is_grounding_source=True, + ), + name="annotation_unique_span_label_grounding_key", + ), ] diff --git a/opencontractserver/tests/test_extraction_grounding.py b/opencontractserver/tests/test_extraction_grounding.py index 9438bc0be..a2d389917 100644 --- a/opencontractserver/tests/test_extraction_grounding.py +++ b/opencontractserver/tests/test_extraction_grounding.py @@ -286,8 +286,7 @@ def test_ground_text_document(self): self.assertEqual(annot.document, self.document) self.assertEqual(annot.corpus, self.corpus) self.assertFalse(annot.structural) - self.assertIsNotNone(annot.annotation_label) - assert annot.annotation_label is not None # narrow for mypy + assert annot.annotation_label is not None # narrows for mypy self.assertEqual(annot.annotation_label.text, OC_EXTRACT_SOURCE_LABEL) # Verify span data @@ -511,8 +510,7 @@ def test_ground_pdf_creates_token_label_annotations(self): self.assertEqual(annot.document, self.document) self.assertEqual(annot.corpus, self.corpus) self.assertFalse(annot.structural) - self.assertIsNotNone(annot.annotation_label) - assert annot.annotation_label is not None # narrow for mypy + assert annot.annotation_label is not None # narrows for mypy self.assertEqual(annot.annotation_label.text, OC_EXTRACT_SOURCE_LABEL) # PlasmaPDF returns 0-indexed pages; valid range is # [0, len(pages) - 1]. 
The bug we're guarding against is the @@ -690,3 +688,73 @@ def test_ground_pdf_separate_corpora_create_separate_annotations(self): self.assertEqual(annot.corpus_id, self.corpus.id) for annot in second_corpus_annotations: self.assertEqual(annot.corpus_id, other_corpus.id) + + def test_db_constraint_blocks_concurrent_token_label_grounding_duplicates(self): + """The DB-level UniqueConstraint must reject a literal duplicate + of a grounding TOKEN_LABEL row, and ``get_or_create`` must + recover via SELECT — not propagate the IntegrityError. + + Simulates the celery race the constraint exists to defeat: one + worker has already inserted the grounding row, a second worker's + SELECT misses, and its INSERT must fail-and-fallback rather than + creating a sibling row. + """ + from django.db import IntegrityError, transaction + + from opencontractserver.annotations.models import ( + TOKEN_LABEL, + Annotation, + ) + from opencontractserver.constants.annotations import OC_EXTRACT_SOURCE_LABEL + from opencontractserver.utils.extraction_grounding import ( + ground_extraction_to_annotations, + ) + + # Run grounding once to seed an OC_EXTRACT_SOURCE row with the + # right (label, page, raw_text, creator, is_grounding_source=True) + # tuple — exactly what the constraint guards. + seeded = async_to_sync(ground_extraction_to_annotations)( + datacell=self.datacell, + document=self.document, + corpus=self.corpus, + user_id=self.user.id, + enable_fuzzy=False, + ) + self.assertGreater(len(seeded), 0) + seed = seeded[0] + assert seed.annotation_label is not None # narrows for mypy + self.assertEqual(seed.annotation_label.text, OC_EXTRACT_SOURCE_LABEL) + self.assertEqual(seed.annotation_type, TOKEN_LABEL) + self.assertTrue(seed.is_grounding_source) + + # A direct duplicate INSERT must fail at the database level. 
+ with self.assertRaises(IntegrityError): + with transaction.atomic(): + Annotation.objects.create( + document=seed.document, + corpus=seed.corpus, + annotation_label=seed.annotation_label, + annotation_type=TOKEN_LABEL, + page=seed.page, + raw_text=seed.raw_text, + creator_id=self.user.id, + is_grounding_source=True, + structural=False, + json={}, + ) + + # And re-running the grounding pipeline must NOT create a + # duplicate — get_or_create's fallback SELECT (after IntegrityError + # on a racing INSERT in production, or after the up-front SELECT + # hit here) returns the seed row. + count_before = Annotation.objects.filter(document=self.document).count() + retry = async_to_sync(ground_extraction_to_annotations)( + datacell=self.datacell, + document=self.document, + corpus=self.corpus, + user_id=self.user.id, + enable_fuzzy=False, + ) + count_after = Annotation.objects.filter(document=self.document).count() + self.assertEqual(count_after, count_before) + self.assertIn(seed.id, {a.id for a in retry}) diff --git a/opencontractserver/utils/extraction_grounding.py b/opencontractserver/utils/extraction_grounding.py index b1ff2a553..7b6e3bb98 100644 --- a/opencontractserver/utils/extraction_grounding.py +++ b/opencontractserver/utils/extraction_grounding.py @@ -251,12 +251,21 @@ def _create_grounding_annotations( exc_info=True, ) - # Deduplicate by primary key, preserving first-seen order: two - # alignment results for the same phrase on the same page produce a - # single ``get_or_create`` row, but each iteration appends it. - # Returning duplicates makes ``len(annotations)`` diverge from - # ``datacell.sources.count()`` and breaks idempotency invariants - # for downstream consumers. + # Deduplicate by primary key, preserving first-seen order. 
+ # + # Why this is needed (not a bug masked): ``align_text_to_document`` + # returns at most one ``AlignmentResult`` per query string, but + # multiple distinct extraction values can legitimately resolve to the + # same source span — e.g. a column extracted as both + # ``["Acme Holdings", "Acme Holdings Inc."]`` where each substring + # gets its own alignment but both anchor on the same token range, or + # an idempotent re-run where ``get_or_create`` returns the existing + # row twice for the same phrase. In both cases the underlying + # source annotation is the same row, and returning duplicates would + # make ``len(annotations)`` diverge from ``datacell.sources.count()`` + # (since ``M2M.add()`` is itself idempotent) and break the idempotency + # invariants that downstream consumers — including the idempotency + # tests in ``test_extraction_grounding.py`` — rely on. seen_pks: set[int] = set() deduped: list[Annotation] = [] for annot in annotations: @@ -325,6 +334,13 @@ def _create_pdf_annotation( # must point to an annotation whose ``corpus`` matches the extract's # corpus, otherwise ``MIN(document_permission, corpus_permission)`` # falls back to the wrong corpus's permissions. + # + # ``creator`` is in the lookup AND in the backing partial UniqueConstraint + # ``annotation_unique_token_label_grounding_key`` (see Annotation.Meta). + # The constraint promotes idempotency from a best-effort + # ``get_or_create`` to a correctness invariant: if two Celery workers + # race on the same datacell, the loser's CREATE raises IntegrityError + # and ``get_or_create`` falls back to a SELECT. 
annot, _ = Annotation.objects.get_or_create( document=document, corpus=corpus, @@ -332,9 +348,10 @@ def _create_pdf_annotation( page=page, annotation_type=TOKEN_LABEL, raw_text=oc_ann["rawText"], + creator_id=creator_id, + is_grounding_source=True, defaults={ "json": oc_ann["annotation_json"], - "creator_id": creator_id, "structural": False, }, ) @@ -360,15 +377,27 @@ def _create_span_annotation( serves as a placeholder and the actual location is encoded by the character offsets in ``json``. - Identity key uses ``json={"start": ..., "end": ...}``. The ``json`` - column is a Django ``JSONField``, which maps to PostgreSQL ``jsonb`` - — equality compares structurally, so key order does not affect - ``get_or_create``'s lookup. We still construct the dict in a stable - order to avoid surprises if the column type ever changes. + Identity key uses ``json={"start": ..., "end": ...}``. + + .. important:: + The ``json`` lookup is **PostgreSQL-specific**. Django maps + ``JSONField`` to PostgreSQL ``jsonb`` (structural equality — + ``{"start": 1, "end": 2}`` matches ``{"end": 2, "start": 1}``) + but to plain ``TEXT`` on SQLite, where comparison is lexical. + A future contributor running this pipeline on SQLite would see + silent ``get_or_create`` misses on dict-key reordering. The + project's runtime and test database are both PostgreSQL, so + this is documented rather than guarded. We still construct + the dict in a stable insertion order to keep behaviour + predictable across backends. ``corpus`` IS in the lookup so a multi-corpus document doesn't share a single annotation between unrelated corpora — see the parallel docstring on ``_create_pdf_annotation`` for the permission rationale. + + ``creator`` is in the lookup AND in the backing partial + UniqueConstraint ``annotation_unique_span_label_grounding_key`` so + racing Celery retries collapse to a single row. 
""" from opencontractserver.annotations.models import SPAN_LABEL, Annotation @@ -379,9 +408,10 @@ def _create_span_annotation( annotation_type=SPAN_LABEL, raw_text=result.matched_text, json={"start": result.char_start, "end": result.char_end}, + creator_id=creator_id, + is_grounding_source=True, defaults={ "page": 1, - "creator_id": creator_id, "structural": False, }, )
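The PostgreSQL-vs-SQLite `json` comparison semantics documented in `_create_span_annotation` can be illustrated without a database. This is a plain-`json` sketch, not project code: structural equality is what `jsonb` gives you, lexical equality is what a TEXT-backed column would do:

```python
import json

# Two representations of the same span identity, differing only in key order.
a = {"start": 10, "end": 42}
b = {"end": 42, "start": 10}

# Structural comparison (PostgreSQL jsonb behaviour): key order is ignored,
# so a get_or_create lookup on either dict finds the same row.
assert a == b

# Lexical comparison (a TEXT-backed JSONField, e.g. on SQLite): the
# serialized strings differ, so the lookup would silently miss.
assert json.dumps(a) != json.dumps(b)

# The mitigation noted in the docstring: a canonical serialization makes
# even a text comparison stable across dict-key reordering.
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```

This is why the docstring documents the `jsonb` dependency rather than guarding it: on the project's PostgreSQL runtime the structural comparison holds, and the stable insertion order is only a belt-and-braces measure.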