Open-Source-Legal
diff --git a/‎CHANGELOG.md‎
Lines changed: 11 additions & 0 deletions b/‎CHANGELOG.md‎
Lines changed: 11 additions & 0 deletions
diff --git a/‎opencontractserver/constants/document_processing.py‎
Lines changed: 5 additions & 0 deletions b/‎opencontractserver/constants/document_processing.py‎
Lines changed: 5 additions & 0 deletions
diff --git a/‎opencontractserver/tasks/data_extract_tasks.py‎
Lines changed: 5 additions & 0 deletions b/‎opencontractserver/tasks/data_extract_tasks.py‎
Lines changed: 5 additions & 0 deletions
@@ -23,6 +23,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
+  - **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
+  - **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
+  - **Bug — duplicate `OC_EXTRACT_SOURCE` annotations on Celery retry** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation` & `_create_span_annotation`): nothing prevented the grounding pipeline from creating fresh annotations and re-linking them via `datacell.sources.add(*annotations)` if `ground_extraction_to_annotations` ran twice on the same datacell (Celery retry after partial failure was the realistic trigger). Replaced the construct-then-`save()` flow with `Annotation.objects.get_or_create()` keyed on `(document, annotation_label, annotation_type, raw_text, …)` so retries reuse existing rows. `datacell.sources` is a `ManyToManyField`, so re-linking the same row is already a no-op once the row is shared.
+  - **Constant — extracted `DOCX_MIME_TYPE`** (`opencontractserver/constants/document_processing.py`): the long `application/vnd.openxmlformats-officedocument.wordprocessingml.document` literal previously lived inline in `_load_document_text_and_layer`. Per the project's no-magic-strings rule it now sits next to `MARKDOWN_MIME_TYPE` and is imported from one place.
+  - **Type annotations** (`opencontractserver/utils/extraction_grounding.py`): `Document`, `Corpus`, `Datacell`, `Annotation`, and `AnnotationLabel` parameters and return types added via a `TYPE_CHECKING` block on every public and helper function. No runtime change.
+  - **Documentation** — the `page=1` placeholder for SPAN_LABEL annotations (text/DOCX) is now documented in the function docstring, explaining that the `txt_extract_file` pipeline does not preserve a page-break map and the actual location lives in the character offsets in `json`.
+  - **Tests** — `opencontractserver/tests/test_extraction_grounding.py`:
+    - `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
+    - `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
+
 - **`CreateCorpusActionModal` opened with the wrong default agent instructions for document triggers** (Issue #1385, `frontend/src/components/corpuses/CreateCorpusActionModal.tsx:136-144,168-171`): the `inlineAgentInstructions` state was initialised with `DEFAULT_MODERATOR_INSTRUCTIONS` even though the default trigger is `add_document` (a document trigger). The trigger-change handler at line 611 swaps to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS`, but a user who created an inline agent on the default-selected trigger without first re-selecting the trigger would submit the moderator copy as the new agent's system instructions. Initialised both the `useState` default and `resetForm()` to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` so the pre-interaction value matches the default trigger. Updated `frontend/tests/CreateCorpusActionModal.ct.tsx` "inline-agent create: full happy path" mutation mock to expect `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` — the previous mock variable masked this bug because `MockedProvider` was matching the stale moderator default rather than the trigger-appropriate one.
 
 ### Changed
 
@@ -10,6 +10,11 @@
 # ingestion and in the parser pipeline for type detection.
 MARKDOWN_MIME_TYPE = "text/markdown"
 
+# MIME type for Microsoft Word (DOCX) documents.
+DOCX_MIME_TYPE = (
+    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+)
+
 # File types that are stored as txt_extract_file (plain text, no parsing needed).
 # Shared between versioning.py and corpus models.py — single source of truth.
 TEXT_MIMETYPES = {"text/plain", MARKDOWN_MIME_TYPE, "application/txt"}
 
@@ -320,6 +320,11 @@ def sync_add_sources(datacell, sources):
             # Auto-ground: find extracted text values in the source document
             # and create linked source annotations with PDF/text coordinates.
             try:
+                # Pass ``corpus_id`` (int) rather than the Corpus instance:
+                # only the ID is in scope here (the extract task receives
+                # ``corpus_id`` as input) and grounding's ``_resolve_corpus``
+                # handles the lookup safely if the corpus was deleted between
+                # task scheduling and execution.
                 grounding_annotations = await ground_extraction_to_annotations(
                     datacell=datacell,
                     document=document,