Skip to content

Back grounding get_or_create with DB-level uniqueness#1418

Open
JSv4 wants to merge 1 commit intomainfrom
claude/resolve-issue-1407-EbKyf
Open

Back grounding get_or_create with DB-level uniqueness#1418
JSv4 wants to merge 1 commit intomainfrom
claude/resolve-issue-1407-EbKyf

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented Apr 30, 2026

Closes #1407. Follow-up to #1397.

Summary

Addresses the four review concerns raised in #1407 against the extraction-grounding hardening in #1397.

1. Race condition: DB-level uniqueness backing get_or_create (medium)

opencontractserver/utils/extraction_grounding.py already used Annotation.objects.get_or_create() to make Celery retries idempotent, but without a backing UniqueConstraint two workers racing on the same datacell could both miss on the SELECT and both succeed on the CREATE.

Added two partial UniqueConstraints on Annotation whose fields exactly match the application-level lookup keys:

Annotation type Constraint name Fields
TOKEN_LABEL (PDF) annotation_unique_token_label_grounding_key (document, corpus, annotation_label, page, raw_text, creator)
SPAN_LABEL (text/DOCX) annotation_unique_span_label_grounding_key (document, corpus, annotation_label, raw_text, json, creator)

Both scoped via condition=Q(structural=False). creator is in the key so two distinct users manually creating identical annotations are not blocked — only the realistic Celery-race target (the grounding pipeline always uses the datacell owner's ID) is.

To make the constraints actually back the lookup, creator_id is moved out of defaults={...} and into the lookup keys for both _create_pdf_annotation and _create_span_annotation.

opencontractserver/annotations/migrations/0069_grounding_annotation_unique_constraints.py first deduplicates any pre-existing rows (keeping the lowest-pk row in each group and re-pointing datacell.sources M2M FKs to it) before adding the constraints — same pattern used in 0068_enforce_embedder_path_not_null.py.

2. PostgreSQL-specific json lookup (low)

The _create_span_annotation docstring now explicitly notes that the json={"start", "end"} lookup relies on JSONField mapping to PostgreSQL jsonb (structural equality, key-order independent) and that on SQLite the comparison would be lexical. Both runtime and test databases are PostgreSQL, so this is documented rather than guarded.

3. End-of-function dedup rationale (minor)

The comment now spells out why dedup is needed rather than treating it as a symptom: distinct extraction values can legitimately resolve to the same source span (e.g. ["Acme Holdings", "Acme Holdings Inc."] both anchoring on the same token range, or an idempotent re-run returning the existing row twice). The dedup is intentional, not masking a bug in align_text_to_document.

4. Test style nit

Replaced the redundant assertIsNotNone(...) + assert ... is not None pair with a single plain assert — only the latter narrows for mypy / pyright. Inline comment explains the choice.

Test plan

  • docker compose -f test.yml run django pytest opencontractserver/tests/test_extraction_grounding.py -n 4 --dist loadscope --create-db (existing idempotency tests now also exercise the DB-level constraint path)
  • docker compose -f test.yml run django python manage.py migrate annotations 0069_grounding_annotation_unique_constraints against a database seeded with deliberate duplicates to exercise the dedup RunPython step
  • CI green on the full backend test suite (verifies the new constraints don't regress unrelated annotation flows; creator in the key keeps the blast radius narrow)

Generated by Claude Code

Closes #1407. Follow-up to #1397.

* Adds two partial UniqueConstraints on Annotation that exactly match
  the application-level get_or_create lookup keys in the extraction
  grounding pipeline so concurrent Celery retries can't bypass
  idempotency. Migration 0069 deduplicates any existing rows before
  adding the constraints.
* Moves creator_id into the get_or_create lookup so the new constraints
  back the lookup; without that the race-loser's IntegrityError
  wouldn't fire.
* Documents the PostgreSQL-specific json equality semantics on the
  span path and the rationale for the end-of-function dedup loop.
* Replaces the redundant assertIsNotNone + assert pair with a single
  assert in test_extraction_grounding.py (only the latter narrows for
  mypy / pyright).
@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Code Review — PR #1418: Back grounding get_or_create with DB-level uniqueness

Overall this is a solid correctness fix. The core change is small and well-targeted; the migration follows the established pattern from 0068; and the PR description + docstrings thoroughly explain the reasoning. A few items worth discussing:


✅ What's done well

  • Core fix is correct. Moving creator_id from defaults into the lookup key is the necessary step to make the constraint actually back get_or_create. Without it, the partial UniqueConstraint would fire on the loser of the race, but Django's retry SELECT would use a different predicate and miss, re-raising the IntegrityError instead of returning the existing row.
  • Migration safety. atomic = False is appropriate here — dedup + constraint-add can't share an outer transaction on a large table, and partial failure leaves data in a cleaner state (deduped without the constraint) that allows a safe re-run. The per-group transaction.atomic() inside the dedup loop is correct.
  • Blast radius is narrow. The condition=Q(structural=False, ...) partial constraint means structural / layout annotations (which can legitimately repeat across parser runs) are completely unaffected.
  • Dedup handles M2M re-pointing before deleting duplicates — same pattern as 0068, and the only real FK concern for annotations created exclusively by the grounding pipeline.

Issues

Medium — test coverage gap for the IntegrityError retry path

The existing idempotency tests call the grounding functions sequentially (same process, same transaction). They verify that a second call returns the same annotation PK, but they don't exercise the DB-level constraint path: two concurrent INSERTs where the loser gets an IntegrityError and Django's get_or_create retries with a SELECT.

The constraint is the new correctness invariant, but there's no test that fires it. A minimal test could insert the annotation directly, then call _create_pdf_annotation / _create_span_annotation again and assert that created=False is returned and the annotation count hasn't increased — which would prove that the constraint + retry round-trip works end-to-end.

def test_db_constraint_backs_get_or_create_token_label(self):
    # Pre-insert the row that the pipeline would create
    existing = Annotation.objects.create(
        document=self.document,
        corpus=self.corpus,
        annotation_label=self.label,
        page=0,
        annotation_type=TOKEN_LABEL,
        raw_text="some text",
        creator=self.user,
        json={...},
        structural=False,
    )
    count_before = Annotation.objects.count()
    annot, created = _create_pdf_annotation(...)
    self.assertFalse(created)
    self.assertEqual(annot.pk, existing.pk)
    self.assertEqual(Annotation.objects.count(), count_before)

Minor — dedup migration doesn't account for other Annotation FK references

The dedup loop re-points Datacell.sources (the M2M through table) before deleting duplicate annotations, which is correct for the grounding pipeline's usage. However, if any other part of the codebase holds FKs to Annotation (e.g. via annotationset, relationship models, or comment threads), those references to the deleted duplicates would either cascade-delete or raise a constraint error depending on on_delete settings.

This is low risk because the duplicates being targeted were created exclusively by the grounding pipeline (and thus only referenced by Datacell.sources), but it would be worth a quick grep to confirm no other FK column points at Annotation without being handled:

grep -r "ForeignKey.*Annotation\|OneToOneField.*Annotation" opencontractserver/ --include="*.py" | grep -v migrations

If there are other FK references, they should either be added to the dedup loop or confirmed to have on_delete=CASCADE (so they'd self-clean on delete).

Minor — test comment verbosity exceeds project conventions

CLAUDE.md says to default to no comments and only add one when the WHY is non-obvious. The 5-line block comment explaining assert vs assertIsNotNone is more explanation than the change warrants — the original paired assertion was already being replaced, so the comment is defending against a pattern that no longer exists in the file.

A single trailing comment (# plain assert narrows for mypy) or no comment at all would fit the project convention better. The existing code already has many examples of assert x is not None without explanation.


Nits

  • The migration module-level docstring is long but appropriate given the migration's complexity — this is an exception to the "no multi-line comments" rule because migration context genuinely needs to survive outside of git blame.
  • In models.py, the constraints use django.db.models.UniqueConstraint with the fully-qualified prefix rather than a bare UniqueConstraint — minor inconsistency with the style used elsewhere in the same Meta class, but not worth a separate PR.

Summary

The main correctness fix (moving creator_id into the lookup key) and the migration (dedup + partial UniqueConstraint) are both correct. The missing piece is a direct test that fires the IntegrityError path and confirms the Django retry round-trip works — without it, a regression in how the constraint interacts with get_or_create wouldn't be caught by the existing test suite. I'd recommend adding that test before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve extraction grounding

2 participants