Commit bcee7ff

Merge branch 'main' into claude/fix-issue-1359-KAz9F

Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
2 parents: 9ab190f + 95d95ee
6 files changed: 192 additions, 17 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -45,6 +45,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- **`RemoveLabelsFromLabelsetMutation` silently did nothing** (Issue #1359, `config/graphql/label_mutations.py:296`): The resolver referenced `labelset.documents.filter(pk__in=label_pks)`, but `LabelSet` has no `documents` relation — the M2M to labels is `annotation_labels` (see `opencontractserver/annotations/models.py:1284`). Every invocation therefore raised `AttributeError`, which was swallowed by the surrounding `except Exception as e:` block and returned as a generic `"Error removing label(s) from labelset: ..."` with `ok=False`. Because the frontend `REMOVE_ANNOTATION_LABELS_FROM_LABELSET` mutation was itself unused (#1244 swept it out), this bug went unnoticed in production for an unknown length of time; discovered while grading mypy errors for #1332 (`config/graphql/label_mutations.py:296: error: "LabelSet" has no attribute "documents" [attr-defined]`). Swapped `documents` → `annotation_labels`, removed the now-resolved error from `docs/typing/mypy_baseline.txt`, and added `opencontractserver/tests/test_label_mutations.py` with four regression cases covering: labels are actually removed from the M2M, IDs not in the labelset are silently ignored, a non-owner / non-public caller cannot mutate the labelset, and a public labelset remains editable (pinning the current `Q(creator=user) | Q(is_public=True)` resolver behaviour so any future permission hardening is explicit).
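The failure mechanics are easy to reproduce in plain Python, no Django required. A minimal sketch, using invented stand-in names (`FakeLabelSet`, `remove_labels_buggy`); the real resolver operates on Django M2M managers, not lists:

```python
class FakeLabelSet:
    """Stand-in for the real model: it has `annotation_labels` but,
    like LabelSet, no `documents` attribute."""

    def __init__(self, labels):
        self.annotation_labels = labels  # the relation that actually exists


def remove_labels_buggy(labelset, label_pks):
    """Mirrors the old resolver: wrong relation name + blanket except."""
    try:
        # AttributeError: FakeLabelSet has no attribute 'documents'
        kept = [pk for pk in labelset.documents if pk not in label_pks]
        labelset.documents = kept
        return {"ok": True, "message": "success"}
    except Exception as e:
        # The blanket handler converts the programming error into a
        # generic failure payload, so the bug never crashes loudly.
        return {"ok": False, "message": f"Error removing label(s) from labelset: {e}"}


def remove_labels_fixed(labelset, label_pks):
    """Mirrors the fix: reference the relation that actually exists."""
    labelset.annotation_labels = [
        pk for pk in labelset.annotation_labels if pk not in label_pks
    ]
    return {"ok": True, "message": "success"}


ls = FakeLabelSet(labels=[1, 2, 3])
print(remove_labels_buggy(ls, {2}))  # ok=False; AttributeError swallowed
print(remove_labels_fixed(ls, {2}))  # ok=True
print(ls.annotation_labels)  # [1, 3]
```

This is why the regression tests below pin behaviour at the mutation boundary: a blanket `except Exception` hides exactly this class of wrong-attribute bug.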
- **`package_annotated_docs` silently corrupted exports when a document failed to burn** (Issue #1356, `opencontractserver/tasks/export_tasks.py:150-212`, `opencontractserver/utils/etl.py:198-463`, `opencontractserver/tasks/doc_tasks.py:463-504`): `build_document_export()` returns `("", "", None, {}, {})` when a per-document export fails (e.g. the underlying file cannot be loaded). The V1 consumer `package_annotated_docs` had no guard for that placeholder — it ran `doc[1].encode("utf-8")` on the empty string (harmless, no crash), wrote an empty-named entry into the zip, and inserted `annotated_docs[""] = None` into the final `data.json`, so a single failed document silently poisoned the export. The V2 pipeline in `export_tasks_v2.py:126-128` already has this guard; V1 did not. Added `if not doc_name or doc_export is None: continue` (mirroring V2's check) and logged a warning identifying the skipped doc. Also tightened the return-type annotations on `build_document_export` and `burn_doc_annotations`: slots 0 and 1 are always `str` (never `None`; they are empty strings on the failure path), so the signature is now `tuple[str, str, OpenContractDocExport | None, dict[...], dict[...]]`. Corrected the `burned_docs` parameter annotation on `package_annotated_docs` from a single-element tuple-of-tuples to a variadic `tuple[tuple[...], ...]` — the runtime iteration is variadic, and the previous annotation was a red herring uncovered while typing for #1334. Regression test in `opencontractserver/tests/test_package_annotated_docs.py` covers both the mixed-success and all-failed scenarios, asserting the zip contains no empty-named entries and `annotated_docs` holds no `None` values.
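The guard itself is small enough to exercise standalone. A sketch with invented sample data; the tuple shapes mirror `build_document_export`'s success and failure returns:

```python
import base64
import io
import json
import zipfile

# Invented sample inputs mirroring build_document_export's return shapes.
burned_docs = (
    (
        "good.pdf",
        base64.b64encode(b"%PDF-1.4 fake body").decode("utf-8"),
        {"title": "Good Doc"},
        {},
        {},
    ),
    ("", "", None, {}, {}),  # placeholder from a failed per-doc export
)

annotated_docs = {}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for doc_name, b64_file, doc_export, _text_labels, _doc_labels in burned_docs:
        # The fix: skip failed placeholders instead of packaging them.
        if not doc_name or doc_export is None:
            continue
        zf.writestr(doc_name, base64.decodebytes(b64_file.encode("utf-8")))
        annotated_docs[doc_name] = doc_export
    zf.writestr("data.json", json.dumps({"annotated_docs": annotated_docs}))

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    print(zf.namelist())  # ['good.pdf', 'data.json']; no empty-named entry
```

Without the `continue`, the second tuple would write a zip entry named `""` and insert `annotated_docs[""] = None`, which is precisely the corruption described above.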
- **Frontend coverage badge stuck on "unknown"** (`README.md:12`, `.github/workflows/codecov-notify.yml`, `.github/workflows/frontend.yml`, `.github/workflows/frontend-e2e.yml`, `frontend/package.json`, `frontend/yarn.lock`): PR #1322 pointed the README badge at a new merged `frontend` flag fed by a cross-workflow `lcov-result-merger@5` step inside `codecov-notify.yml`. Every `frontend-merged-coverage` upload since has landed at Codecov with `state: error` / `totals: null`, so the badge rendered "unknown" even though `frontend-unit` (31%), `frontend-component` (61%), and `frontend-e2e` (24%) were all processing correctly. Two defects in the merged lcov confirmed by local repro of the CI merge step: (1) `lcov-result-merger@5` emits a stripped lcov containing only `SF:`, `DA:`, `BRDA:`, `end_of_record` — it drops `TN:`, `FN`, `FNDA`, `FNF`, `FNH`, `LF`, `LH`, `BRF`, `BRH`, so line-summary fields required by Codecov's parser are absent; (2) Vitest v8 emits `src/...` (relative to `frontend/`) while `vite-plugin-istanbul` + `nyc report` emit `/home/runner/work/OpenContracts/OpenContracts/frontend/src/...` (absolute), and the merger keys on the literal path string so the same file appears as two records with conflicting hit counts. `codecov-notify.yml` also ran without `actions/checkout`, which Codecov's action docs explicitly recommend against. Fix: stop merging client-side and let Codecov aggregate server-side, since that is what flags are for. Each per-suite upload now declares two flags — `frontend-unit,frontend`, `frontend-component,frontend`, `frontend-e2e,frontend` — so the `frontend` flag total is the union of the three uploads computed by Codecov. `codecov-notify.yml` is reduced to its original gate-and-notify role (no artifact downloads, no `lcov-result-merger`, no merged upload). Deleted the `frontend-{unit,ct,e2e}-lcov` artifact publishes in `frontend.yml` / `frontend-e2e.yml`, removed the `lcov-result-merger@^5.0.1` devDep and the `coverage:merge` script from `frontend/package.json`, and pruned the orphaned entries from `frontend/yarn.lock`. README badge URL unchanged (`flag=frontend`).
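Defect (2) can be illustrated in a few lines: a merger keyed on the literal `SF:` path string cannot tell that a relative and an absolute entry name the same file. Toy data below (`src/App.tsx` is an invented filename); this is not `lcov-result-merger`'s actual code:

```python
# Toy illustration only; not lcov-result-merger's implementation.
records = [
    # Vitest v8 reports paths relative to frontend/ ...
    ("src/App.tsx", {"1": 1, "2": 0}),
    # ... while nyc report emits absolute runner paths for the same file.
    (
        "/home/runner/work/OpenContracts/OpenContracts/frontend/src/App.tsx",
        {"1": 0, "2": 1},
    ),
]

merged: dict[str, dict[str, int]] = {}
for path, hits in records:
    # Keying on the literal path string: the two spellings never collide,
    # so the same file lands in the merged lcov twice with conflicting hits.
    merged.setdefault(path, {}).update(hits)

print(len(merged))  # 2 records for one physical file
```

Server-side aggregation sidesteps this entirely because Codecov normalizes paths per upload before combining flags.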
### Changed

mypy.ini

Lines changed: 3 additions & 0 deletions

@@ -925,6 +925,9 @@ ignore_errors = True
 [mypy-opencontractserver.tests.test_openai_embedder]
 ignore_errors = True

+[mypy-opencontractserver.tests.test_package_annotated_docs]
+ignore_errors = True
+
 [mypy-opencontractserver.tests.test_page_imaging_tool]
 ignore_errors = True

opencontractserver/tasks/doc_tasks.py

Lines changed: 6 additions & 2 deletions

@@ -467,8 +467,8 @@ def burn_doc_annotations(
     analysis_ids: list[int] | None = None,
     annotation_filter_mode: str = "CORPUS_LABELSET_ONLY",
 ) -> tuple[
-    str | None,
-    str | None,
+    str,
+    str,
     OpenContractDocExport | None,
     dict[str | int, AnnotationLabelPythonType],
     dict[str | int, AnnotationLabelPythonType],
@@ -485,6 +485,10 @@ def burn_doc_annotations(

     Returns a tuple containing all data needed for packaging:
     (filename, base64-encoded file, doc_export_data, text_labels, doc_labels)
+
+    On failure, ``filename`` and ``base64-encoded file`` are empty strings and
+    ``doc_export_data`` is ``None``. Downstream consumers (e.g.
+    ``package_annotated_docs``) must skip such entries.
     """
     from opencontractserver.types.enums import AnnotationFilterMode

opencontractserver/tasks/export_tasks.py

Lines changed: 22 additions & 13 deletions

@@ -150,12 +150,13 @@ def on_demand_post_processors(
 def package_annotated_docs(
     burned_docs: tuple[
         tuple[
-            str | None,
-            str | None,
+            str,
+            str,
             OpenContractDocExport | None,
             dict[str | int, AnnotationLabelPythonType],
             dict[str | int, AnnotationLabelPythonType],
-        ]
+        ],
+        ...,
     ],
     export_id: int,
     corpus_pk: int,
@@ -173,7 +174,7 @@ def package_annotated_docs(
     """
     logger.info(f"Package corpus for export {export_id}...")

-    annotated_docs = {}
+    annotated_docs: dict[str, OpenContractDocExport] = {}
     doc_labels: dict[str | int, AnnotationLabelPythonType] | None = None
     text_labels: dict[str | int, AnnotationLabelPythonType] | None = None

@@ -184,23 +185,31 @@ def package_annotated_docs(

     for doc in burned_docs:

-        # logger.info(f"Handling burned doc: {doc[0]}")
+        doc_name, base64_file, doc_export, doc_text_labels, doc_doc_labels = doc
+
+        # build_document_export returns ("", "", None, {}, {}) when the per-doc
+        # export failed (e.g. the underlying file could not be loaded). Skip
+        # those placeholders so we don't emit an empty-keyed / None-valued
+        # entry into the final zip and data.json.
+        if not doc_name or doc_export is None:
+            logger.warning(
+                f"Skipping failed burned doc in export {export_id}: "
+                f"doc_name={doc_name!r}, has_export={doc_export is not None}"
+            )
+            continue

         if not doc_labels:
-            doc_labels: dict[str | int, AnnotationLabelPythonType] = doc[4]
+            doc_labels = doc_doc_labels

         if not text_labels:
-            text_labels: dict[str | int, AnnotationLabelPythonType] = doc[3]
+            text_labels = doc_text_labels

-        base64_img_bytes = doc[1].encode("utf-8")
+        base64_img_bytes = base64_file.encode("utf-8")
         decoded_file_data = base64.decodebytes(base64_img_bytes)
-        # logger.info("Data decoded successfully")

-        zip_file.writestr(doc[0], decoded_file_data)
-        # logger.info("Pdf written successfully")
+        zip_file.writestr(doc_name, decoded_file_data)

-        annotated_docs[doc[0]] = doc[2]
-        # logger.info("doc json added to json")
+        annotated_docs[doc_name] = doc_export

     export_file_data: OpenContractsExportDataJsonPythonType = {
         "annotated_docs": annotated_docs,
opencontractserver/tests/test_package_annotated_docs.py (new file)

Lines changed: 152 additions & 0 deletions

@@ -0,0 +1,152 @@
"""Regression tests for ``package_annotated_docs`` skipping failed placeholders."""

from __future__ import annotations

import base64
import io
import json
import zipfile
from unittest.mock import patch

from django.contrib.auth import get_user_model
from django.test import TestCase

from opencontractserver.corpuses.models import Corpus
from opencontractserver.tasks.export_tasks import package_annotated_docs
from opencontractserver.users.models import UserExport

User = get_user_model()


def _make_label(pk: str, text: str) -> dict:
    return {
        "id": pk,
        "color": "#FF0000",
        "description": "",
        "icon": "tag",
        "text": text,
        "label_type": "TOKEN_LABEL",
    }


def _make_doc_export(title: str) -> dict:
    return {
        "doc_labels": [],
        "labelled_text": [],
        "title": title,
        "description": "",
        "content": "",
        "pawls_file_content": [],
        "page_count": 1,
        "file_type": "application/pdf",
    }


class PackageAnnotatedDocsSkipTestCase(TestCase):
    """Verifies package_annotated_docs skips failed burn_doc_annotations tuples."""

    def setUp(self) -> None:
        self.user = User.objects.create_user(username="bob", password="12345678")
        self.corpus = Corpus.objects.create(
            title="Test Corpus",
            description="For package_annotated_docs tests",
            creator=self.user,
        )
        self.export = UserExport.objects.create(
            name="test-export",
            creator=self.user,
        )

        # A tiny payload that survives base64 + zip round-trip.
        self.fake_pdf_bytes = b"%PDF-1.4\n% fake body\n"
        self.fake_pdf_b64 = base64.b64encode(self.fake_pdf_bytes).decode("utf-8")

        self.text_labels = {"1": _make_label("1", "Important")}
        self.doc_labels = {"2": _make_label("2", "Contract")}

    def _collect_finalize(self):
        """Return a side_effect that captures finalize_export's output_bytes."""
        captured: dict = {}

        def _capture(export_id, filename, output_bytes, corpus_title):
            # finalize_export does output_bytes.seek(0) before saving; mirror
            # that so the captured buffer is ready to read in-test.
            output_bytes.seek(0)
            captured["bytes"] = output_bytes.getvalue()
            captured["filename"] = filename

        return captured, _capture

    def test_skips_failed_placeholder_and_packages_successful_docs(self) -> None:
        """Mixed input: one good doc + one failed placeholder -> only good doc lands in zip."""
        good_doc = (
            "good.pdf",
            self.fake_pdf_b64,
            _make_doc_export("Good Doc"),
            self.text_labels,
            self.doc_labels,
        )
        # This is exactly what build_document_export returns on failure.
        failed_doc = ("", "", None, {}, {})

        captured, capture_fn = self._collect_finalize()
        with patch(
            "opencontractserver.tasks.export_tasks.finalize_export",
            side_effect=capture_fn,
        ), self.assertLogs(
            "opencontractserver.tasks.export_tasks", level="WARNING"
        ) as log_cm:
            package_annotated_docs(
                burned_docs=(good_doc, failed_doc),
                export_id=self.export.id,
                corpus_pk=self.corpus.id,
            )

        self.assertTrue(
            any("Skipping failed burned doc" in line for line in log_cm.output),
            f"Expected skip-warning log, got: {log_cm.output}",
        )
        self.assertIn("bytes", captured, "finalize_export was not called")
        zf = zipfile.ZipFile(io.BytesIO(captured["bytes"]))
        names = set(zf.namelist())

        # The failed placeholder's empty-string filename must NOT be in the zip.
        self.assertNotIn("", names)
        self.assertIn("good.pdf", names)
        self.assertIn("data.json", names)

        data = json.loads(zf.read("data.json").decode("utf-8"))
        self.assertIn("good.pdf", data["annotated_docs"])
        self.assertNotIn("", data["annotated_docs"])
        # No None values should leak into annotated_docs.
        self.assertTrue(
            all(v is not None for v in data["annotated_docs"].values()),
            "annotated_docs must not contain None values for failed exports",
        )
        # Labels from the successful doc still propagate.
        self.assertEqual(data["text_labels"], self.text_labels)
        self.assertEqual(data["doc_labels"], self.doc_labels)

    def test_all_failed_produces_empty_annotated_docs_without_crashing(self) -> None:
        """Every doc failed -> no crash, empty annotated_docs, no bogus keys."""
        failed_doc_a = ("", "", None, {}, {})
        failed_doc_b = ("", "", None, {}, {})

        captured, capture_fn = self._collect_finalize()
        with patch(
            "opencontractserver.tasks.export_tasks.finalize_export",
            side_effect=capture_fn,
        ):
            package_annotated_docs(
                burned_docs=(failed_doc_a, failed_doc_b),
                export_id=self.export.id,
                corpus_pk=self.corpus.id,
            )

        self.assertIn("bytes", captured, "finalize_export was not called")
        zf = zipfile.ZipFile(io.BytesIO(captured["bytes"]))
        # Only data.json, no empty-filename entry.
        self.assertEqual(set(zf.namelist()), {"data.json"})

        data = json.loads(zf.read("data.json").decode("utf-8"))
        self.assertEqual(data["annotated_docs"], {})
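The `_collect_finalize` helper in the tests above uses a generally useful pattern: patch a collaborator with a `side_effect` that records its call arguments instead of replicating its behavior. A standalone sketch, with invented stand-ins for `finalize_export` and the code under test:

```python
import io
import sys
from unittest.mock import patch


def finalize(export_id: int, output_bytes: io.BytesIO) -> None:
    """Stand-in for the real finalize_export collaborator."""
    raise RuntimeError("should be patched out in tests")


def produce_export(export_id: int) -> None:
    """Stand-in for the code under test; hands a buffer to finalize()."""
    buf = io.BytesIO(b"zip-bytes-here")
    finalize(export_id, buf)


captured: dict = {}


def _capture(export_id, output_bytes):
    # Mirror finalize's seek-before-read so the buffer is ready to inspect.
    output_bytes.seek(0)
    captured["bytes"] = output_bytes.getvalue()


# side_effect runs with the real call arguments, so we can snapshot them.
with patch.object(sys.modules[__name__], "finalize", side_effect=_capture):
    produce_export(42)

print(captured["bytes"])  # b'zip-bytes-here'
```

This keeps the test focused on the producer's output bytes rather than on filesystem or storage side effects.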

opencontractserver/utils/etl.py

Lines changed: 8 additions & 2 deletions

@@ -202,8 +202,8 @@ def build_document_export(
     analysis_ids: list[int] | None = None,
     annotation_filter_mode: AnnotationFilterMode = AnnotationFilterMode.CORPUS_LABELSET_ONLY,
 ) -> tuple[
-    str | None,
-    str | None,
+    str,
+    str,
     OpenContractDocExport | None,
     dict[str | int, AnnotationLabelPythonType],
     dict[str | int, AnnotationLabelPythonType],
@@ -217,6 +217,12 @@
         analysis_ids: Optional list of analysis PKs to include in annotation selection
         annotation_filter_mode: How to filter annotations - "CORPUS_LABELSET_ONLY" (default),
             "CORPUS_LABELSET_PLUS_ANALYSES", or "ANALYSES_ONLY"
+
+    Returns a 5-tuple ``(doc_name, base64_encoded_file, doc_annotation_json,
+    text_labels, doc_labels)``. On failure, ``doc_name`` and
+    ``base64_encoded_file`` are empty strings and ``doc_annotation_json`` is
+    ``None``; consumers should treat an empty ``doc_name`` OR a ``None``
+    ``doc_annotation_json`` as a signal to skip that document.
     """

     logger.info(f"burn_doc_annotations - label_lookups: {label_lookups}")
