Back grounding get_or_create with DB-level uniqueness by JSv4 · Pull Request #1418 · Open-Source-Legal/OpenContracts

JSv4 · 2026-04-30T14:24:11Z

Closes #1407. Follow-up to #1397.

Summary

Addresses the four review concerns raised in #1407 against the extraction-grounding hardening in #1397.

1. Race condition: DB-level uniqueness backing `get_or_create` (medium)

opencontractserver/utils/extraction_grounding.py already used Annotation.objects.get_or_create() to make Celery retries idempotent, but without a backing UniqueConstraint two workers racing on the same datacell could both miss on the SELECT and both succeed on the CREATE.

Added two partial UniqueConstraints on Annotation whose fields exactly match the application-level lookup keys:

Annotation type	Constraint name	Fields
`TOKEN_LABEL` (PDF)	`annotation_unique_token_label_grounding_key`	`(document, corpus, annotation_label, page, raw_text, creator)`
`SPAN_LABEL` (text/DOCX)	`annotation_unique_span_label_grounding_key`	`(document, corpus, annotation_label, raw_text, json, creator)`

Both scoped via condition=Q(structural=False). creator is in the key so two distinct users manually creating identical annotations are not blocked — only the realistic Celery-race target (the grounding pipeline always uses the datacell owner's ID) is.

To make the constraints actually back the lookup, creator_id is moved out of defaults={...} and into the lookup keys for both _create_pdf_annotation and _create_span_annotation.

opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py first deduplicates any pre-existing rows (keeping the lowest-pk row in each group and re-pointing datacell.sources M2M FKs to it) before adding the constraints — same pattern used in 0068_enforce_embedder_path_not_null.py.

2. PostgreSQL-specific `json` lookup (low)

The _create_span_annotation docstring now explicitly notes that the json={"start", "end"} lookup relies on JSONField mapping to PostgreSQL jsonb (structural equality, key-order independent) and that on SQLite the comparison would be lexical. Both runtime and test databases are PostgreSQL, so this is documented rather than guarded.

3. End-of-function dedup rationale (minor)

The comment now spells out why dedup is needed rather than treating it as a symptom: distinct extraction values can legitimately resolve to the same source span (e.g. ["Acme Holdings", "Acme Holdings Inc."] both anchoring on the same token range, or an idempotent re-run returning the existing row twice). The dedup is intentional, not masking a bug in align_text_to_document.

4. Test style nit

Replaced the redundant assertIsNotNone(...) + assert ... is not None pair with a single plain assert — only the latter narrows for mypy / pyright. Inline comment explains the choice.

Test plan

docker compose -f test.yml run django pytest opencontractserver/tests/test_extraction_grounding.py -n 4 --dist loadscope --create-db (existing idempotency tests now also exercise the DB-level constraint path)
docker compose -f test.yml run django python manage.py migrate annotations 0069_grounding_annotation_unique_constraints against a database seeded with deliberate duplicates to exercise the dedup RunPython step
CI green on the full backend test suite (verifies the new constraints don't regress unrelated annotation flows; creator in the key keeps the blast radius narrow)

Generated by Claude Code

Closes #1407. Follow-up to #1397. * Adds two partial UniqueConstraints on Annotation that exactly match the application-level get_or_create lookup keys in the extraction grounding pipeline so concurrent Celery retries can't bypass idempotency. Migration 0069 deduplicates any existing rows before adding the constraints. * Moves creator_id into the get_or_create lookup so the new constraints back the lookup; without that the race-loser's IntegrityError wouldn't fire. * Documents the PostgreSQL-specific json equality semantics on the span path and the rationale for the end-of-function dedup loop. * Replaces the redundant assertIsNotNone + assert pair with a single assert in test_extraction_grounding.py (only the latter narrows for mypy / pyright).

claude · 2026-04-30T14:26:22Z

Code Review — PR #1418: Back grounding `get_or_create` with DB-level uniqueness

Overall this is a solid correctness fix. The core change is small and well-targeted; the migration follows the established pattern from 0068; and the PR description + docstrings thoroughly explain the reasoning. A few items worth discussing:

✅ What's done well

Core fix is correct. Moving creator_id from defaults into the lookup key is the necessary step to make the constraint actually back get_or_create. Without it, the partial UniqueConstraint would fire on the loser of the race, but Django's retry SELECT would use a different predicate and miss, re-raising the IntegrityError instead of returning the existing row.
Migration safety. atomic = False is appropriate here — dedup + constraint-add can't share an outer transaction on a large table, and partial failure leaves data in a cleaner state (deduped without the constraint) that allows a safe re-run. The per-group transaction.atomic() inside the dedup loop is correct.
Blast radius is narrow. The condition=Q(structural=False, ...) partial constraint means structural / layout annotations (which can legitimately repeat across parser runs) are completely unaffected.
Dedup handles M2M re-pointing before deleting duplicates — same pattern as 0068, and the only real FK concern for annotations created exclusively by the grounding pipeline.

Issues

Medium — test coverage gap for the `IntegrityError` retry path

The existing idempotency tests call the grounding functions sequentially (same process, same transaction). They verify that a second call returns the same annotation PK, but they don't exercise the DB-level constraint path: two concurrent INSERTs where the loser gets an IntegrityError and Django's get_or_create retries with a SELECT.

The constraint is the new correctness invariant, but there's no test that fires it. A minimal test could insert the annotation directly, then call _create_pdf_annotation / _create_span_annotation again and assert that created=False is returned and the annotation count hasn't increased — which would prove that the constraint + retry round-trip works end-to-end.

def test_db_constraint_backs_get_or_create_token_label(self):
    # Pre-insert the row that the pipeline would create
    existing = Annotation.objects.create(
        document=self.document,
        corpus=self.corpus,
        annotation_label=self.label,
        page=0,
        annotation_type=TOKEN_LABEL,
        raw_text="some text",
        creator=self.user,
        json={...},
        structural=False,
    )
    count_before = Annotation.objects.count()
    annot, created = _create_pdf_annotation(...)
    self.assertFalse(created)
    self.assertEqual(annot.pk, existing.pk)
    self.assertEqual(Annotation.objects.count(), count_before)

Minor — dedup migration doesn't account for other Annotation FK references

The dedup loop re-points Datacell.sources (the M2M through table) before deleting duplicate annotations, which is correct for the grounding pipeline's usage. However, if any other part of the codebase holds FKs to Annotation (e.g. via annotationset, relationship models, or comment threads), those references to the deleted duplicates would either cascade-delete or raise a constraint error depending on on_delete settings.

This is low risk because the duplicates being targeted were created exclusively by the grounding pipeline (and thus only referenced by Datacell.sources), but it would be worth a quick grep to confirm no other FK column points at Annotation without being handled:

grep -r "ForeignKey.*Annotation\|OneToOneField.*Annotation" opencontractserver/ --include="*.py" | grep -v migrations

If there are other FK references, they should either be added to the dedup loop or confirmed to have on_delete=CASCADE (so they'd self-clean on delete).

Minor — test comment verbosity exceeds project conventions

CLAUDE.md says to default to no comments and only add one when the WHY is non-obvious. The 5-line block comment explaining assert vs assertIsNotNone is more explanation than the change warrants — the original paired assertion was already being replaced, so the comment is defending against a pattern that no longer exists in the file.

A single trailing comment (# plain assert narrows for mypy) or no comment at all would fit the project convention better. The existing code already has many examples of assert x is not None without explanation.

Nits

The migration module-level docstring is long but appropriate given the migration's complexity — this is an exception to the "no multi-line comments" rule because migration context genuinely needs to survive outside of git blame.
In models.py, the constraints use django.db.models.UniqueConstraint with the fully-qualified prefix rather than a bare UniqueConstraint — minor inconsistency with the style used elsewhere in the same Meta class, but not worth a separate PR.

Summary

The main correctness fix (moving creator_id into the lookup key) and the migration (dedup + partial UniqueConstraint) are both correct. The missing piece is a direct test that fires the IntegrityError path and confirms the Django retry round-trip works — without it, a regression in how the constraint interacts with get_or_create wouldn't be caught by the existing test suite. I'd recommend adding that test before merging.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Back grounding get_or_create with DB-level uniqueness#1418

Back grounding get_or_create with DB-level uniqueness#1418
JSv4 wants to merge 1 commit intomainfrom
claude/resolve-issue-1407-EbKyf

JSv4 commented Apr 30, 2026

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

JSv4 commented Apr 30, 2026

Summary

1. Race condition: DB-level uniqueness backing get_or_create (medium)

2. PostgreSQL-specific json lookup (low)

3. End-of-function dedup rationale (minor)

4. Test style nit

Test plan

Uh oh!

claude Bot commented Apr 30, 2026

Code Review — PR #1418: Back grounding get_or_create with DB-level uniqueness

✅ What's done well

Issues

Medium — test coverage gap for the IntegrityError retry path

Minor — dedup migration doesn't account for other Annotation FK references

Minor — test comment verbosity exceeds project conventions

Nits

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Race condition: DB-level uniqueness backing `get_or_create` (medium)

2. PostgreSQL-specific `json` lookup (low)

Code Review — PR #1418: Back grounding `get_or_create` with DB-level uniqueness

Medium — test coverage gap for the `IntegrityError` retry path