Commit 8612d97

Fix extraction grounding bugs: page fallback, idempotency, label-set scoping (#1246)
Resolves the follow-up review issues raised on the extraction-grounding pipeline added in #1245:

- `_create_pdf_annotation` now skips (raises inside the savepoint) when PlasmaPDF cannot determine a page, instead of silently saving on page 1 with a wrong bounding-box context.
- `ensure_label_and_labelset` is invoked inside the per-annotation `try`/`transaction.atomic()` block, so a label-set failure no longer aborts the whole batch.
- Annotation creation is now idempotent via `get_or_create` keyed on `(document, label, type, raw_text, …)`; Celery retries no longer duplicate `OC_EXTRACT_SOURCE` annotations.
- The DOCX MIME literal moved to `opencontractserver.constants.document_processing`.
- Public and helper functions in `extraction_grounding.py` gained `Document` / `Corpus` / `Datacell` / `Annotation` / `AnnotationLabel` type annotations.
- The span (text/DOCX) `page=1` placeholder is now documented in the docstring; `data_extract_tasks.py` gains an inline comment explaining why `corpus_id` (int) is passed to `ground_extraction_to_annotations`.
- Tests: the new `TestGroundingPipelinePDFIntegration` covers the TOKEN_LABEL/PDF path with a synthetic two-page PAWLS payload, the `page=None` skip behavior, and idempotency on retry. SPAN_LABEL idempotency is covered by `test_ground_text_document_is_idempotent`.
- CHANGELOG updated.

Closes #1246
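The retry-idempotency fix hinges on `get_or_create`'s upsert semantics: a second run with the same identifying fields returns the existing row instead of inserting a duplicate. A minimal plain-Python stand-in (the dict-backed store and the key fields below are illustrative, not the real Django `Annotation.objects.get_or_create()`):

```python
# Plain-Python stand-in for Django's Model.objects.get_or_create():
# a retry with the same key fields reuses the existing row instead
# of inserting a duplicate.
_rows: dict[tuple, dict] = {}

def get_or_create(**fields) -> tuple[dict, bool]:
    """Return (row, created), keyed on all identifying fields."""
    key = tuple(sorted(fields.items()))
    if key in _rows:
        return _rows[key], False
    _rows[key] = dict(fields)
    return _rows[key], True

# First run creates the annotation; the simulated Celery retry reuses it.
row1, created1 = get_or_create(
    document="doc-1", label="OC_EXTRACT_SOURCE", raw_text="Acme Holdings"
)
row2, created2 = get_or_create(
    document="doc-1", label="OC_EXTRACT_SOURCE", raw_text="Acme Holdings"
)
```

Because `datacell.sources` is a `ManyToManyField`, re-adding the same reused row is already a no-op, so a retry leaves both the annotation table and the link table unchanged.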
1 parent f40e91f commit 8612d97

5 files changed

Lines changed: 420 additions & 55 deletions


CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -9,6 +9,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
+  - **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding-box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
+  - **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
+  - **Bug — duplicate `OC_EXTRACT_SOURCE` annotations on Celery retry** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation` & `_create_span_annotation`): nothing prevented the grounding pipeline from creating fresh annotations and re-linking them via `datacell.sources.add(*annotations)` if `ground_extraction_to_annotations` ran twice on the same datacell (Celery retry after partial failure was the realistic trigger). Replaced the construct-then-`save()` flow with `Annotation.objects.get_or_create()` keyed on `(document, annotation_label, annotation_type, raw_text, …)` so retries reuse existing rows. `datacell.sources` is a `ManyToManyField`, so re-linking the same row is already a no-op once the row is shared.
+  - **Constant — extracted `DOCX_MIME_TYPE`** (`opencontractserver/constants/document_processing.py`): the long `application/vnd.openxmlformats-officedocument.wordprocessingml.document` literal previously lived inline in `_load_document_text_and_layer`. Per the project's no-magic-strings rule it now sits next to `MARKDOWN_MIME_TYPE` and is imported from one place.
+  - **Type annotations** (`opencontractserver/utils/extraction_grounding.py`): `Document`, `Corpus`, `Datacell`, `Annotation`, and `AnnotationLabel` parameters and return types added via a `TYPE_CHECKING` block on every public and helper function. No runtime change.
+  - **Documentation** — the `page=1` placeholder for SPAN_LABEL annotations (text/DOCX) is now documented in the function docstring, explaining that the `txt_extract_file` pipeline does not preserve a page-break map and the actual location lives in the character offsets in `json`.
+  - **Tests** — `opencontractserver/tests/test_extraction_grounding.py`:
+    - `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
+    - `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
 - **Merged `frontend` Codecov flag drops to ~33% on every commit where Frontend CI's CT job fails** (`frontend/package.json` `test:coverage:ct`): the script chained `playwright test ... && mkdir -p ... && nyc report ...`, so a failing CT run short-circuited before `nyc report` could turn the per-test JSON files in `.nyc_output` into an `lcov.info`. The downstream `Upload CT Coverage to Codecov` step (`if: success() || failure()`) then errored with "No coverage reports found" and `frontend-component` did not upload for that SHA. Codecov's server-side aggregation of the `frontend` flag was left with only `frontend-unit` (~23%) and `frontend-e2e` (~24%), pulling the merged number down to ~33% even though the previous commit was at ~67% — observed on six consecutive main commits 2026-04-26T01:02..02:58Z (`2d7033f8`..`be5bcfc8`) before recovering on `30298391`. Mirrored the existing `test:e2e:coverage` pattern (`; CT_EXIT=$?; nyc report ... || echo "No coverage data to report"; exit $CT_EXIT`) so `nyc report` runs regardless of test outcome and the lcov ships even on red CT runs. `frontend-component` will still report a slightly lower number when tests fail (failed tests register fewer hits), but it will report — keeping the merged `frontend` flag's denominator stable.
 - **`User.__init__` shared-state mutation re-introduced by branch merge** (`opencontractserver/users/models.py:172-180` removed): PR #1374 (commit `50ed6740`) deleted the `User.__init__` override that mutated `Field.validators[0]` on every instantiation, but a subsequent merge (`b68c1cb4 → 6d2cddbf`) resurrected the override along with its mypy-narrowing changes. The current main on commit `6d2cddbf` therefore reproduced the original `#1358` bug: `User(...)` rebound `username_field.validators[0]` and clobbered any third-party validator prepended to the list. Removed the `__init__` override entirely; the class-body declaration `validators=[UserUnicodeUsernameValidator()]` on the `username` field (still present from PR #1374) is the canonical and only declaration. Also dropped the now-unused `Field` import. Regression coverage from PR #1374 (`opencontractserver/tests/test_user_username_validator.py`) was already on main and is what surfaced the regression in CI.
 
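The two batch-scoping fixes above share one shape: each annotation gets its own guarded unit of work, so a failure for one item (including the deliberate `ValueError` raised for `page=None`) skips only that item. A simplified sketch of the loop, without Django's `transaction.atomic()` savepoints (the function and payload shapes here are illustrative):

```python
def create_grounding_annotations(results: list[dict]) -> tuple[list[dict], list[str]]:
    """Best-effort batch: one bad result is recorded and skipped;
    the rest of the batch is unaffected."""
    created: list[dict] = []
    errors: list[str] = []
    for result in results:
        try:
            # In the real pipeline this body runs inside
            # transaction.atomic(), so raising rolls back only
            # this item's savepoint, never the whole batch.
            if result.get("page") is None:
                raise ValueError("no page from PlasmaPDF; refusing page=1 fallback")
            created.append({"page": result["page"], "raw_text": result["raw_text"]})
        except ValueError as exc:
            errors.append(str(exc))
    return created, errors

ok, errs = create_grounding_annotations(
    [
        {"page": 1, "raw_text": "Acme Holdings"},
        {"page": None, "raw_text": "lost span"},
        {"page": 2, "raw_text": "Global Acquisitions"},
    ]
)
```

Moving `ensure_label_and_labelset` inside this per-item guard is what converts a label-set failure from an all-or-nothing abort into a single skipped annotation.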

opencontractserver/constants/document_processing.py

Lines changed: 5 additions & 0 deletions
@@ -10,6 +10,11 @@
 # ingestion and in the parser pipeline for type detection.
 MARKDOWN_MIME_TYPE = "text/markdown"
 
+# MIME type for Microsoft Word (DOCX) documents.
+DOCX_MIME_TYPE = (
+    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+)
+
 # File types that are stored as txt_extract_file (plain text, no parsing needed).
 # Shared between versioning.py and corpus models.py — single source of truth.
 TEXT_MIMETYPES = {"text/plain", MARKDOWN_MIME_TYPE, "application/txt"}

opencontractserver/tasks/data_extract_tasks.py

Lines changed: 5 additions & 0 deletions
@@ -320,6 +320,11 @@ def sync_add_sources(datacell, sources):
         # Auto-ground: find extracted text values in the source document
         # and create linked source annotations with PDF/text coordinates.
         try:
+            # Pass ``corpus_id`` (int) rather than the Corpus instance:
+            # only the ID is in scope here (the extract task receives
+            # ``corpus_id`` as input) and grounding's ``_resolve_corpus``
+            # handles the lookup safely if the corpus was deleted between
+            # task scheduling and execution.
             grounding_annotations = await ground_extraction_to_annotations(
                 datacell=datacell,
                 document=document,
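The comment above describes a ship-the-ID pattern: the task carries only `corpus_id`, and the grounding side re-resolves it at execution time, tolerating deletion in the window between scheduling and running. A generic sketch (`_resolve_corpus` is the helper named in the diff; the dict-backed lookup below is purely illustrative):

```python
from typing import Optional

# Illustrative stand-in for the database: corpus_id -> corpus row.
CORPORA: dict[int, dict] = {42: {"title": "PDF Grounding Corpus"}}

def resolve_corpus(corpus_id: int) -> Optional[dict]:
    """Return the corpus row, or None if it was deleted after the
    task was scheduled; callers then skip grounding gracefully
    instead of crashing on a stale reference."""
    return CORPORA.get(corpus_id)

live = resolve_corpus(42)
gone = resolve_corpus(999)  # deleted between scheduling and execution
```

Passing the ID instead of a model instance also avoids serializing ORM objects into the Celery payload.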

opencontractserver/tests/test_extraction_grounding.py

Lines changed: 288 additions & 0 deletions
@@ -5,12 +5,52 @@
 (integration with Django models).
 """
 
+import json
+
 from asgiref.sync import async_to_sync
 from django.test import SimpleTestCase, TestCase
 
 from opencontractserver.utils.extraction_grounding import extract_groundable_strings
 
 
+def _build_pawls_for_text(
+    pages_text: list[str], page_width: float = 612.0, page_height: float = 792.0
+) -> str:
+    """Build a v1 PAWLS JSON payload that embeds ``pages_text`` as tokens.
+
+    Each page's text is split on whitespace and laid out as a single row
+    of tokens with simple monotonically increasing x-coordinates. The
+    resulting JSON is suitable for ``build_translation_layer`` and lets
+    integration tests exercise the PDF grounding path without a real PDF.
+    """
+    pages: list[dict] = []
+    for page_index, text in enumerate(pages_text):
+        tokens: list[dict] = []
+        x_pos = 10.0
+        for word in text.split():
+            tokens.append(
+                {
+                    "x": x_pos,
+                    "y": 100.0,
+                    "width": float(len(word)) * 6.0,
+                    "height": 12.0,
+                    "text": word,
+                }
+            )
+            x_pos += float(len(word)) * 6.0 + 4.0
+        pages.append(
+            {
+                "page": {
+                    "width": page_width,
+                    "height": page_height,
+                    "index": page_index,
+                },
+                "tokens": tokens,
+            }
+        )
+    return json.dumps(pages)
+
+
 class TestExtractGroundableStrings(SimpleTestCase):
     """Unit tests for extract_groundable_strings() — no Django DB needed."""
 
@@ -327,3 +367,251 @@ def test_ground_no_matches_returns_empty(self):
         )
 
         self.assertEqual(len(annotations), 0)
+
+    def test_ground_text_document_is_idempotent(self):
+        """Running grounding twice should not create duplicate annotations.
+
+        Simulates a Celery retry after a partial failure. The second call
+        must reuse existing OC_EXTRACT_SOURCE annotations rather than
+        bloating ``datacell.sources`` with duplicates.
+        """
+        from opencontractserver.annotations.models import Annotation
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        first = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        self.assertGreater(len(first), 0)
+        first_count = Annotation.objects.filter(document=self.document).count()
+        first_ids = sorted(a.id for a in first)
+
+        second = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        second_count = Annotation.objects.filter(document=self.document).count()
+        second_ids = sorted(a.id for a in second)
+
+        self.assertEqual(
+            first_count,
+            second_count,
+            "Re-running grounding created duplicate annotations.",
+        )
+        self.assertEqual(
+            first_ids,
+            second_ids,
+            "Re-running grounding returned annotations with different IDs.",
+        )
+
+        self.datacell.refresh_from_db()
+        self.assertEqual(self.datacell.sources.count(), first_count)
+
+
+class TestGroundingPipelinePDFIntegration(TestCase):
+    """Integration tests for grounding against a PDF-shaped document.
+
+    Builds a synthetic multi-page PAWLS payload (no real PDF needed) and
+    exercises the TOKEN_LABEL path through PlasmaPDF's translation layer.
+    """
+
+    def setUp(self):
+        from django.contrib.auth import get_user_model
+        from django.core.files.base import ContentFile
+
+        from opencontractserver.corpuses.models import Corpus
+        from opencontractserver.documents.models import Document
+        from opencontractserver.extracts.models import (
+            Column,
+            Datacell,
+            Extract,
+            Fieldset,
+        )
+        from opencontractserver.notifications.models import Notification
+
+        User = get_user_model()
+        self.user = User.objects.create_user(
+            username="grounding_pdf_user", password="testpass"
+        )
+        Notification.objects.filter(recipient=self.user).delete()
+
+        self.corpus = Corpus.objects.create(
+            title="PDF Grounding Corpus", creator=self.user
+        )
+
+        # Two-page synthetic document; "Acme Holdings" is on page 0,
+        # "Global Acquisitions" on page 1.
+        self.pages_text = [
+            "ASSET PURCHASE AGREEMENT between Acme Holdings Inc and others",
+            "Global Acquisitions LLC shall serve as the Buyer of record",
+        ]
+        pawls_json = _build_pawls_for_text(self.pages_text)
+
+        self.document = Document.objects.create(
+            title="PDF Grounding Test",
+            creator=self.user,
+            file_type="application/pdf",
+        )
+        self.document.pawls_parse_file.save(
+            "test.pawls", ContentFile(pawls_json.encode())
+        )
+        self.corpus.add_document(document=self.document, user=self.user)
+
+        self.fieldset = Fieldset.objects.create(name="PDF Fieldset", creator=self.user)
+        self.column = Column.objects.create(
+            fieldset=self.fieldset,
+            name="Parties",
+            query="Extract parties",
+            output_type="str",
+            creator=self.user,
+        )
+        self.extract = Extract.objects.create(
+            name="PDF Extract",
+            corpus=self.corpus,
+            fieldset=self.fieldset,
+            creator=self.user,
+        )
+        self.datacell = Datacell.objects.create(
+            extract=self.extract,
+            column=self.column,
+            document=self.document,
+            creator=self.user,
+            data={"data": ["Acme Holdings", "Global Acquisitions"]},
+        )
+
+    def test_ground_pdf_creates_token_label_annotations(self):
+        """PDF grounding should create TOKEN_LABEL annotations with valid pages."""
+        from opencontractserver.annotations.models import TOKEN_LABEL
+        from opencontractserver.constants.annotations import OC_EXTRACT_SOURCE_LABEL
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        annotations = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+
+        self.assertGreater(len(annotations), 0)
+        for annot in annotations:
+            self.assertEqual(annot.annotation_type, TOKEN_LABEL)
+            self.assertEqual(annot.document, self.document)
+            self.assertEqual(annot.corpus, self.corpus)
+            self.assertFalse(annot.structural)
+            self.assertEqual(annot.annotation_label.text, OC_EXTRACT_SOURCE_LABEL)
+            # Page must be a positive integer; never the silent default of 1
+            # for a span that actually lives on page 2.
+            self.assertIsInstance(annot.page, int)
+            self.assertGreaterEqual(annot.page, 1)
+            self.assertLessEqual(annot.page, len(self.pages_text))
+            self.assertTrue(annot.raw_text)
+
+        # "Acme Holdings" is on page 1 (1-indexed) and
+        # "Global Acquisitions" on page 2 — confirm the per-page mapping
+        # actually works by checking we got at least one annotation off
+        # page 1.
+        pages_seen = {a.page for a in annotations}
+        self.assertGreater(
+            len(pages_seen),
+            1,
+            "Expected grounding to span multiple PDF pages.",
+        )
+
+        self.datacell.refresh_from_db()
+        self.assertEqual(self.datacell.sources.count(), len(annotations))
+
+    def test_ground_pdf_is_idempotent(self):
+        """Re-running PDF grounding must not duplicate TOKEN_LABEL annotations."""
+        from opencontractserver.annotations.models import Annotation
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        first = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        self.assertGreater(len(first), 0)
+        first_count = Annotation.objects.filter(document=self.document).count()
+        first_ids = sorted(a.id for a in first)
+
+        second = async_to_sync(ground_extraction_to_annotations)(
+            datacell=self.datacell,
+            document=self.document,
+            corpus=self.corpus,
+            user_id=self.user.id,
+            enable_fuzzy=False,
+        )
+        second_count = Annotation.objects.filter(document=self.document).count()
+        second_ids = sorted(a.id for a in second)
+
+        self.assertEqual(first_count, second_count)
+        self.assertEqual(first_ids, second_ids)
+
+        self.datacell.refresh_from_db()
+        self.assertEqual(self.datacell.sources.count(), first_count)
+
+    def test_ground_pdf_skips_when_page_is_none(self):
+        """If PlasmaPDF returns page=None, the annotation must be skipped.
+
+        Regression for the silent ``page=1`` fallback bug: a missing page
+        on a multi-page PDF should result in *no* annotation being saved
+        rather than a structurally incorrect one anchored to page 1.
+        """
+        from unittest.mock import patch
+
+        from opencontractserver.annotations.models import Annotation
+        from opencontractserver.constants.annotations import OC_EXTRACT_SOURCE_LABEL
+        from opencontractserver.utils.extraction_grounding import (
+            ground_extraction_to_annotations,
+        )
+
+        def stub_create(self, span_annotation):
+            # Mimic PlasmaPDF's payload but force page=None so the grounding
+            # pipeline must take the skip-rather-than-fallback path.
+            return {
+                "page": None,
+                "rawText": span_annotation.span["text"],
+                "annotation_json": {},
+            }
+
+        with patch(
+            "plasmapdf.models.PdfDataLayer.PdfDataLayer."
+            "create_opencontract_annotation_from_span",
+            new=stub_create,
+        ):
+            annotations = async_to_sync(ground_extraction_to_annotations)(
+                datacell=self.datacell,
+                document=self.document,
+                corpus=self.corpus,
+                user_id=self.user.id,
+                enable_fuzzy=False,
+            )
+
+        self.assertEqual(
+            len(annotations),
+            0,
+            "Annotations with page=None must be skipped, not saved on page 1.",
+        )
+        # And nothing should have been persisted to the database either.
+        self.assertEqual(
+            Annotation.objects.filter(
+                document=self.document,
+                annotation_label__text=OC_EXTRACT_SOURCE_LABEL,
+            ).count(),
+            0,
+        )
