Closes #1407. Follow-up to #1397.
* Adds two partial UniqueConstraints on Annotation that exactly match
the application-level get_or_create lookup keys in the extraction
grounding pipeline so concurrent Celery retries can't bypass
idempotency. Migration 0069 deduplicates any existing rows before
adding the constraints.
* Moves creator_id into the get_or_create lookup so the new constraints
back the lookup; without that the race-loser's IntegrityError
wouldn't fire.
* Documents the PostgreSQL-specific json equality semantics on the
span path and the rationale for the end-of-function dedup loop.
* Replaces the redundant assertIsNotNone + assert pair with a single
assert in test_extraction_grounding.py (only the latter narrows for
mypy / pyright).
Closes #1407. Follow-up to #1397.
Summary
Addresses the four review concerns raised in #1407 against the extraction-grounding hardening in #1397.
1. Race condition: DB-level uniqueness backing
get_or_create(medium)opencontractserver/utils/extraction_grounding.pyalready usedAnnotation.objects.get_or_create()to make Celery retries idempotent, but without a backingUniqueConstrainttwo workers racing on the same datacell could both miss on the SELECT and both succeed on the CREATE.Added two partial
UniqueConstraints onAnnotationwhose fields exactly match the application-level lookup keys:TOKEN_LABEL(PDF)annotation_unique_token_label_grounding_key(document, corpus, annotation_label, page, raw_text, creator)SPAN_LABEL(text/DOCX)annotation_unique_span_label_grounding_key(document, corpus, annotation_label, raw_text, json, creator)Both scoped via
condition=Q(structural=False).creatoris in the key so two distinct users manually creating identical annotations are not blocked — only the realistic Celery-race target (the grounding pipeline always uses the datacell owner's ID) is.To make the constraints actually back the lookup,
creator_idis moved out ofdefaults={...}and into the lookup keys for both_create_pdf_annotationand_create_span_annotation.opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.pyfirst deduplicates any pre-existing rows (keeping the lowest-pk row in each group and re-pointingdatacell.sourcesM2M FKs to it) before adding the constraints — same pattern used in0068_enforce_embedder_path_not_null.py.2. PostgreSQL-specific
jsonlookup (low)The
_create_span_annotationdocstring now explicitly notes that thejson={"start", "end"}lookup relies onJSONFieldmapping to PostgreSQLjsonb(structural equality, key-order independent) and that on SQLite the comparison would be lexical. Both runtime and test databases are PostgreSQL, so this is documented rather than guarded.3. End-of-function dedup rationale (minor)
The comment now spells out why dedup is needed rather than treating it as a symptom: distinct extraction values can legitimately resolve to the same source span (e.g.
["Acme Holdings", "Acme Holdings Inc."]both anchoring on the same token range, or an idempotent re-run returning the existing row twice). The dedup is intentional, not masking a bug inalign_text_to_document.4. Test style nit
Replaced the redundant
assertIsNotNone(...) + assert ... is not Nonepair with a single plainassert— only the latter narrows for mypy / pyright. Inline comment explains the choice.Test plan
docker compose -f test.yml run django pytest opencontractserver/tests/test_extraction_grounding.py -n 4 --dist loadscope --create-db(existing idempotency tests now also exercise the DB-level constraint path)docker compose -f test.yml run django python manage.py migrate annotations 0069_grounding_annotation_unique_constraintsagainst a database seeded with deliberate duplicates to exercise the dedupRunPythonstepcreatorin the key keeps the blast radius narrow)Generated by Claude Code