Commit 8612d97

Fix extraction grounding bugs: page fallback, idempotency, label-set scoping (#1246)
Resolves the follow-up review issues raised on the extraction-grounding pipeline added in #1245:

- `_create_pdf_annotation` now skips (raises inside the savepoint) when PlasmaPDF cannot determine a page, instead of silently saving on page 1 with a wrong bounding-box context.
- `ensure_label_and_labelset` is invoked inside the per-annotation `try`/`transaction.atomic()` block, so a label-set failure no longer aborts the whole batch.
- Annotation creation is now idempotent via `get_or_create` keyed on `(document, label, type, raw_text, …)`; Celery retries no longer duplicate `OC_EXTRACT_SOURCE` annotations.
- The DOCX MIME literal moved to `opencontractserver.constants.document_processing`.
- Public and helper functions in `extraction_grounding.py` gained `Document` / `Corpus` / `Datacell` / `Annotation` / `AnnotationLabel` type annotations.
- The span (text/DOCX) `page=1` placeholder is now documented in the docstring; `data_extract_tasks.py` gains an inline comment explaining why `corpus_id` (int) is passed to `ground_extraction_to_annotations`.
- Tests: the new `TestGroundingPipelinePDFIntegration` covers the TOKEN_LABEL/PDF path with a synthetic two-page PAWLS payload, the `page=None` skip behavior, and idempotency on retry. SPAN_LABEL idempotency is covered by `test_ground_text_document_is_idempotent`.
- CHANGELOG updated.

Closes #1246
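The retry-idempotency fix hinges on `get_or_create`'s upsert semantics: a second run with the same identifying fields returns the existing row instead of inserting a duplicate. A minimal plain-Python stand-in (the dict-backed store and the key fields below are illustrative, not the real Django `Annotation.objects.get_or_create()`):

```python
# Plain-Python stand-in for Django's Model.objects.get_or_create():
# a retry with the same key fields reuses the existing row instead
# of inserting a duplicate.
_rows: dict[tuple, dict] = {}

def get_or_create(**fields) -> tuple[dict, bool]:
    """Return (row, created), keyed on all identifying fields."""
    key = tuple(sorted(fields.items()))
    if key in _rows:
        return _rows[key], False
    _rows[key] = dict(fields)
    return _rows[key], True

# First run creates the annotation; the simulated Celery retry reuses it.
row1, created1 = get_or_create(
    document="doc-1", label="OC_EXTRACT_SOURCE", raw_text="Acme Holdings"
)
row2, created2 = get_or_create(
    document="doc-1", label="OC_EXTRACT_SOURCE", raw_text="Acme Holdings"
)
```

Because `datacell.sources` is a `ManyToManyField`, re-adding the same reused row is already a no-op, so a retry leaves both the annotation table and the link table unchanged.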
1 parent f40e91f commit 8612d97

5 files changed

Lines changed: 420 additions & 55 deletions


CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -9,6 +9,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
+  - **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding-box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
+  - **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
+  - **Bug — duplicate `OC_EXTRACT_SOURCE` annotations on Celery retry** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation` & `_create_span_annotation`): nothing prevented the grounding pipeline from creating fresh annotations and re-linking them via `datacell.sources.add(*annotations)` if `ground_extraction_to_annotations` ran twice on the same datacell (Celery retry after partial failure was the realistic trigger). Replaced the construct-then-`save()` flow with `Annotation.objects.get_or_create()` keyed on `(document, annotation_label, annotation_type, raw_text, …)` so retries reuse existing rows. `datacell.sources` is a `ManyToManyField`, so re-linking the same row is already a no-op once the row is shared.
+  - **Constant — extracted `DOCX_MIME_TYPE`** (`opencontractserver/constants/document_processing.py`): the long `application/vnd.openxmlformats-officedocument.wordprocessingml.document` literal previously lived inline in `_load_document_text_and_layer`. Per the project's no-magic-strings rule it now sits next to `MARKDOWN_MIME_TYPE` and is imported from one place.
+  - **Type annotations** (`opencontractserver/utils/extraction_grounding.py`): `Document`, `Corpus`, `Datacell`, `Annotation`, and `AnnotationLabel` parameters and return types added via a `TYPE_CHECKING` block on every public and helper function. No runtime change.
+  - **Documentation** — the `page=1` placeholder for SPAN_LABEL annotations (text/DOCX) is now documented in the function docstring, explaining that the `txt_extract_file` pipeline does not preserve a page-break map and the actual location lives in the character offsets in `json`.
+  - **Tests** — `opencontractserver/tests/test_extraction_grounding.py`:
+    - `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
+    - `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
 - **Merged `frontend` Codecov flag drops to ~33% on every commit where Frontend CI's CT job fails** (`frontend/package.json` `test:coverage:ct`): the script chained `playwright test ... && mkdir -p ... && nyc report ...`, so a failing CT run short-circuited before `nyc report` could turn the per-test JSON files in `.nyc_output` into an `lcov.info`. The downstream `Upload CT Coverage to Codecov` step (`if: success() || failure()`) then errored with "No coverage reports found" and `frontend-component` did not upload for that SHA. Codecov's server-side aggregation of the `frontend` flag was left with only `frontend-unit` (~23%) and `frontend-e2e` (~24%), pulling the merged number down to ~33% even though the previous commit was at ~67% — observed on six consecutive main commits 2026-04-26T01:02..02:58Z (`2d7033f8`..`be5bcfc8`) before recovering on `30298391`. Mirrored the existing `test:e2e:coverage` pattern (`; CT_EXIT=$?; nyc report ... || echo "No coverage data to report"; exit $CT_EXIT`) so `nyc report` runs regardless of test outcome and the lcov ships even on red CT runs. `frontend-component` will still report a slightly lower number when tests fail (failed tests register fewer hits), but it will report — keeping the merged `frontend` flag's denominator stable.
 - **`User.__init__` shared-state mutation re-introduced by branch merge** (`opencontractserver/users/models.py:172-180` removed): PR #1374 (commit `50ed6740`) deleted the `User.__init__` override that mutated `Field.validators[0]` on every instantiation, but a subsequent merge (`b68c1cb4 → 6d2cddbf`) resurrected the override along with its mypy-narrowing changes. The current main on commit `6d2cddbf` therefore reproduced the original `#1358` bug: `User(...)` rebound `username_field.validators[0]` and clobbered any third-party validator prepended to the list. Removed the `__init__` override entirely; the class-body declaration `validators=[UserUnicodeUsernameValidator()]` on the `username` field (still present from PR #1374) is the canonical and only declaration. Also dropped the now-unused `Field` import. Regression coverage from PR #1374 (`opencontractserver/tests/test_user_username_validator.py`) was already on main and is what surfaced the regression in CI.
 
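The two batch-scoping fixes above share one shape: each annotation gets its own guarded unit of work, so a failure for one item (including the deliberate `ValueError` raised for `page=None`) skips only that item. A simplified sketch of the loop, without Django's `transaction.atomic()` savepoints (the function and payload shapes here are illustrative):

```python
def create_grounding_annotations(results: list[dict]) -> tuple[list[dict], list[str]]:
    """Best-effort batch: one bad result is recorded and skipped;
    the rest of the batch is unaffected."""
    created: list[dict] = []
    errors: list[str] = []
    for result in results:
        try:
            # In the real pipeline this body runs inside
            # transaction.atomic(), so raising rolls back only
            # this item's savepoint, never the whole batch.
            if result.get("page") is None:
                raise ValueError("no page from PlasmaPDF; refusing page=1 fallback")
            created.append({"page": result["page"], "raw_text": result["raw_text"]})
        except ValueError as exc:
            errors.append(str(exc))
    return created, errors

ok, errs = create_grounding_annotations(
    [
        {"page": 1, "raw_text": "Acme Holdings"},
        {"page": None, "raw_text": "lost span"},
        {"page": 2, "raw_text": "Global Acquisitions"},
    ]
)
```

Moving `ensure_label_and_labelset` inside this per-item guard is what converts a label-set failure from an all-or-nothing abort into a single skipped annotation.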

opencontractserver/constants/document_processing.py

Lines changed: 5 additions & 0 deletions
@@ -10,6 +10,11 @@
 # ingestion and in the parser pipeline for type detection.
 MARKDOWN_MIME_TYPE = "text/markdown"
 
+# MIME type for Microsoft Word (DOCX) documents.
+DOCX_MIME_TYPE = (
+    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+)
+
 # File types that are stored as txt_extract_file (plain text, no parsing needed).
 # Shared between versioning.py and corpus models.py — single source of truth.
 TEXT_MIMETYPES = {"text/plain", MARKDOWN_MIME_TYPE, "application/txt"}

opencontractserver/tasks/data_extract_tasks.py

Lines changed: 5 additions & 0 deletions
@@ -320,6 +320,11 @@ def sync_add_sources(datacell, sources):
         # Auto-ground: find extracted text values in the source document
         # and create linked source annotations with PDF/text coordinates.
         try:
+            # Pass ``corpus_id`` (int) rather than the Corpus instance:
+            # only the ID is in scope here (the extract task receives
+            # ``corpus_id`` as input) and grounding's ``_resolve_corpus``
+            # handles the lookup safely if the corpus was deleted between
+            # task scheduling and execution.
             grounding_annotations = await ground_extraction_to_annotations(
                 datacell=datacell,
                 document=document,
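The comment above describes a ship-the-ID pattern: the task carries only `corpus_id`, and the grounding side re-resolves it at execution time, tolerating deletion in the window between scheduling and running. A generic sketch (`_resolve_corpus` is the helper named in the diff; the dict-backed lookup below is purely illustrative):

```python
from typing import Optional

# Illustrative stand-in for the database: corpus_id -> corpus row.
CORPORA: dict[int, dict] = {42: {"title": "PDF Grounding Corpus"}}

def resolve_corpus(corpus_id: int) -> Optional[dict]:
    """Return the corpus row, or None if it was deleted after the
    task was scheduled; callers then skip grounding gracefully
    instead of crashing on a stale reference."""
    return CORPORA.get(corpus_id)

live = resolve_corpus(42)
gone = resolve_corpus(999)  # deleted between scheduling and execution
```

Passing the ID instead of a model instance also avoids serializing ORM objects into the Celery payload.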

opencontractserver/tests/test_extraction_grounding.py

Lines changed: 288 additions & 0 deletions
@@ -5,12 +5,52 @@
 (integration with Django models).
 """
 
+import json
+
 from asgiref.sync import async_to_sync
 from django.test import SimpleTestCase, TestCase
 
 from opencontractserver.utils.extraction_grounding import extract_groundable_strings
 
 
+def _build_pawls_for_text(
+    pages_text: list[str], page_width: float = 612.0, page_height: float = 792.0
+) -> str:
+    """Build a v1 PAWLS JSON payload that embeds ``pages_text`` as tokens.
+
+    Each page's text is split on whitespace and laid out as a single row
+    of tokens with simple monotonically increasing x-coordinates. The
+    resulting JSON is suitable for ``build_translation_layer`` and lets
+    integration tests exercise the PDF grounding path without a real PDF.
+    """
+    pages: list[dict] = []
+    for page_index, text in enumerate(pages_text):
+        tokens: list[dict] = []
+        x_pos = 10.0
+        for word in text.split():
+            tokens.append(
+                {
+                    "x": x_pos,
+                    "y": 100.0,
+                    "width": float(len(word)) * 6.0,
+                    "height": 12.0,
+                    "text": word,
+                }
+            )
+            x_pos += float(len(word)) * 6.0 + 4.0
+        pages.append(
+            {
+                "page": {
+                    "width": page_width,
+                    "height": page_height,
+                    "index": page_index,
+                },
+                "tokens": tokens,
+            }
+        )
+    return json.dumps(pages)
+
+
 class TestExtractGroundableStrings(SimpleTestCase):
     """Unit tests for extract_groundable_strings() — no Django DB needed."""
 
@@ -327,3 +367,251 @@ def test_ground_no_matches_returns_empty(self):
         )
 
         self.assertEqual(len(annotations), 0)
+
+    def test_ground_text_document_is_idempotent(self):
+        """Running grounding twice should not create duplicate annotations.
+
+        Simulates a Celery retry after a partial failure. The second call
+        must reuse existing OC_EXTRACT_SOURCE annotations rather than
+        bloating ``datacell.sources`` with duplicates.
+        """
+        from opencontractserver.annotations.models import Annotation
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        first = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        self.assertGreater(len(first), 0)
+        first_count = Annotation.objects.filter(document=self.document).count()
+        first_ids = sorted(a.id for a in first)
+
+        second = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        second_count = Annotation.objects.filter(document=self.document).count()
+        second_ids = sorted(a.id for a in second)
+
+        self.assertEqual(
+            first_count,
+            second_count,
+            "Re-running grounding created duplicate annotations.",
+        )
+        self.assertEqual(
+            first_ids,
+            second_ids,
+            "Re-running grounding returned annotations with different IDs.",
+        )
+
+        self.datacell.refresh_from_db()
+        self.assertEqual(self.datacell.sources.count(), first_count)
+
+
+class TestGroundingPipelinePDFIntegration(TestCase):
+    """Integration tests for grounding against a PDF-shaped document.
+
+    Builds a synthetic multi-page PAWLS payload (no real PDF needed) and
+    exercises the TOKEN_LABEL path through PlasmaPDF's translation layer.
+    """
+
+    def setUp(self):
+        from django.contrib.auth import get_user_model
+        from django.core.files.base import ContentFile
+
+        from opencontractserver.corpuses.models import Corpus
+        from opencontractserver.documents.models import Document
+        from opencontractserver.extracts.models import (
+            Column,
+            Datacell,
+            Extract,
+            Fieldset,
+        )
+        from opencontractserver.notifications.models import Notification
+
+        User = get_user_model()
+        self.user = User.objects.create_user(
+            username="grounding_pdf_user", password="testpass"
+        )
+        Notification.objects.filter(recipient=self.user).delete()
+
+        self.corpus = Corpus.objects.create(
+            title="PDF Grounding Corpus", creator=self.user
+        )
+
+        # Two-page synthetic document; "Acme Holdings" is on page 0,
+        # "Global Acquisitions" on page 1.
+        self.pages_text = [
+            "ASSET PURCHASE AGREEMENT between Acme Holdings Inc and others",
+            "Global Acquisitions LLC shall serve as the Buyer of record",
+        ]
+        pawls_json = _build_pawls_for_text(self.pages_text)
+
+        self.document = Document.objects.create(
+            title="PDF Grounding Test",
+            creator=self.user,
+            file_type="application/pdf",
+        )
+        self.document.pawls_parse_file.save(
+            "test.pawls", ContentFile(pawls_json.encode())
+        )
+        self.corpus.add_document(document=self.document, user=self.user)
+
+        self.fieldset = Fieldset.objects.create(name="PDF Fieldset", creator=self.user)
+        self.column = Column.objects.create(
+            fieldset=self.fieldset,
+            name="Parties",
+            query="Extract parties",
+            output_type="str",
+            creator=self.user,
+        )
+        self.extract = Extract.objects.create(
+            name="PDF Extract",
+            corpus=self.corpus,
+            fieldset=self.fieldset,
+            creator=self.user,
+        )
+        self.datacell = Datacell.objects.create(
+            extract=self.extract,
+            column=self.column,
+            document=self.document,
+            creator=self.user,
+            data={"data": ["Acme Holdings", "Global Acquisitions"]},
+        )
+
+    def test_ground_pdf_creates_token_label_annotations(self):
+        """PDF grounding should create TOKEN_LABEL annotations with valid pages."""
+        from opencontractserver.annotations.models import TOKEN_LABEL
+        from opencontractserver.constants.annotations import OC_EXTRACT_SOURCE_LABEL
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        annotations = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+
+        self.assertGreater(len(annotations), 0)
+        for annot in annotations:
+            self.assertEqual(annot.annotation_type, TOKEN_LABEL)
+            self.assertEqual(annot.document, self.document)
+            self.assertEqual(annot.corpus, self.corpus)
+            self.assertFalse(annot.structural)
+            self.assertEqual(annot.annotation_label.text, OC_EXTRACT_SOURCE_LABEL)
+            # Page must be a positive integer; never the silent default of 1
+            # for a span that actually lives on page 2.
+            self.assertIsInstance(annot.page, int)
+            self.assertGreaterEqual(annot.page, 1)
+            self.assertLessEqual(annot.page, len(self.pages_text))
+            self.assertTrue(annot.raw_text)
+
+        # "Acme Holdings" is on page 1 (1-indexed) and
+        # "Global Acquisitions" on page 2 — confirm the per-page mapping
+        # actually works by checking we got at least one annotation off
+        # page 1.
+        pages_seen = {a.page for a in annotations}
+        self.assertGreater(
+            len(pages_seen),
+            1,
+            "Expected grounding to span multiple PDF pages.",
+        )
+
+        self.datacell.refresh_from_db()
+        self.assertEqual(self.datacell.sources.count(), len(annotations))
+
+    def test_ground_pdf_is_idempotent(self):
+        """Re-running PDF grounding must not duplicate TOKEN_LABEL annotations."""
+        from opencontractserver.annotations.models import Annotation
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        first = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        self.assertGreater(len(first), 0)
+        first_count = Annotation.objects.filter(document=self.document).count()
+        first_ids = sorted(a.id for a in first)
+
+        second = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        second_count = Annotation.objects.filter(document=self.document).count()
+        second_ids = sorted(a.id for a in second)
+
+        self.assertEqual(first_count, second_count)
+        self.assertEqual(first_ids, second_ids)
+
+        self.datacell.refresh_from_db()
+        self.assertEqual(self.datacell.sources.count(), first_count)
+
+    def test_ground_pdf_skips_when_page_is_none(self):
+        """If PlasmaPDF returns page=None, the annotation must be skipped.
+
+        Regression for the silent ``page=1`` fallback bug: a missing page
+        on a multi-page PDF should result in *no* annotation being saved
+        rather than a structurally incorrect one anchored to page 1.
+        """
+        from unittest.mock import patch
+
+        from opencontractserver.annotations.models import Annotation
+        from opencontractserver.constants.annotations import OC_EXTRACT_SOURCE_LABEL
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        def stub_create(self, span_annotation):
+            # Mimic PlasmaPDF's payload but force page=None so the grounding
+            # pipeline must take the skip-rather-than-fallback path.
+            return {
+                "page": None,
+                "rawText": span_annotation.span["text"],
+                "annotation_json": {},
+            }
+
+        with patch(
+            "plasmapdf.models.PdfDataLayer.PdfDataLayer."
+            "create_opencontract_annotation_from_span",
+            new=stub_create,
+        ):
+            annotations = async_to_sync(ground_extraction_to_annotations)(
+                datacell=self.datacell,
+                document=self.document,
+                corpus=self.corpus,
+                user_id=self.user.id,
+                enable_fuzzy=False,
+            )
+
+        self.assertEqual(
+            len(annotations),
+            0,
+            "Annotations with page=None must be skipped, not saved on page 1.",
+        )
+        # And nothing should have been persisted to the database either.
+        self.assertEqual(
+            Annotation.objects.filter(
+                document=self.document,
+                annotation_label__text=OC_EXTRACT_SOURCE_LABEL,
+            ).count(),
+            0,
+        )
