Commit bcee7ff

Merge branch 'main' into claude/fix-issue-1359-KAz9F

Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
2 parents: 9ab190f + 95d95ee
6 files changed: 192 additions, 17 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -45,6 +45,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- **`RemoveLabelsFromLabelsetMutation` silently did nothing** (Issue #1359, `config/graphql/label_mutations.py:296`): The resolver referenced `labelset.documents.filter(pk__in=label_pks)`, but `LabelSet` has no `documents` relation — the M2M to labels is `annotation_labels` (see `opencontractserver/annotations/models.py:1284`). Every invocation therefore raised `AttributeError`, which was swallowed by the surrounding `except Exception as e:` block and returned as a generic `"Error removing label(s) from labelset: ..."` with `ok=False`. Because the frontend `REMOVE_ANNOTATION_LABELS_FROM_LABELSET` mutation was itself unused (#1244 swept it out), this bug went unnoticed in production for an unknown length of time; discovered while grading mypy errors for #1332 (`config/graphql/label_mutations.py:296: error: "LabelSet" has no attribute "documents" [attr-defined]`). Swapped `documents` → `annotation_labels`, removed the now-resolved error from `docs/typing/mypy_baseline.txt`, and added `opencontractserver/tests/test_label_mutations.py` with four regression cases covering: labels are actually removed from the M2M, IDs not in the labelset are silently ignored, a non-owner / non-public caller cannot mutate the labelset, and a public labelset remains editable (pinning the current `Q(creator=user) | Q(is_public=True)` resolver behaviour so any future permission hardening is explicit).
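The failure mechanics are easy to reproduce in plain Python, no Django required. A minimal sketch, using invented stand-in names (`FakeLabelSet`, `remove_labels_buggy`); the real resolver operates on Django M2M managers, not lists:

```python
class FakeLabelSet:
    """Stand-in for the real model: it has `annotation_labels` but,
    like LabelSet, no `documents` attribute."""

    def __init__(self, labels):
        self.annotation_labels = labels  # the relation that actually exists


def remove_labels_buggy(labelset, label_pks):
    """Mirrors the old resolver: wrong relation name + blanket except."""
    try:
        # AttributeError: FakeLabelSet has no attribute 'documents'
        kept = [pk for pk in labelset.documents if pk not in label_pks]
        labelset.documents = kept
        return {"ok": True, "message": "success"}
    except Exception as e:
        # The blanket handler converts the programming error into a
        # generic failure payload, so the bug never crashes loudly.
        return {"ok": False, "message": f"Error removing label(s) from labelset: {e}"}


def remove_labels_fixed(labelset, label_pks):
    """Mirrors the fix: reference the relation that actually exists."""
    labelset.annotation_labels = [
        pk for pk in labelset.annotation_labels if pk not in label_pks
    ]
    return {"ok": True, "message": "success"}


ls = FakeLabelSet(labels=[1, 2, 3])
print(remove_labels_buggy(ls, {2}))  # ok=False; AttributeError swallowed
print(remove_labels_fixed(ls, {2}))  # ok=True
print(ls.annotation_labels)  # [1, 3]
```

This is why the regression tests below pin behaviour at the mutation boundary: a blanket `except Exception` hides exactly this class of wrong-attribute bug.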
- **`package_annotated_docs` silently corrupted exports when a document failed to burn** (Issue #1356, `opencontractserver/tasks/export_tasks.py:150-212`, `opencontractserver/utils/etl.py:198-463`, `opencontractserver/tasks/doc_tasks.py:463-504`): `build_document_export()` returns `("", "", None, {}, {})` when a per-document export fails (e.g. the underlying file cannot be loaded). The V1 consumer `package_annotated_docs` had no guard for that placeholder — it ran `doc[1].encode("utf-8")` on the empty string (harmless, no crash), wrote an empty-named entry into the zip, and inserted `annotated_docs[""] = None` into the final `data.json`, so a single failed document silently poisoned the export. The V2 pipeline in `export_tasks_v2.py:126-128` already has this guard; V1 did not. Added `if not doc_name or doc_export is None: continue` (mirroring V2's check) and logged a warning identifying the skipped doc. Also tightened the return-type annotations on `build_document_export` and `burn_doc_annotations`: slots 0 and 1 are always `str` (never `None`; they are empty strings on the failure path), so the signature is now `tuple[str, str, OpenContractDocExport | None, dict[...], dict[...]]`. Corrected the `burned_docs` parameter annotation on `package_annotated_docs` from a single-element tuple-of-tuples to a variadic `tuple[tuple[...], ...]` — the runtime iteration is variadic, and the previous annotation was a red herring uncovered while typing for #1334. Regression test in `opencontractserver/tests/test_package_annotated_docs.py` covers both the mixed-success and all-failed scenarios, asserting the zip contains no empty-named entries and `annotated_docs` holds no `None` values.
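The guard itself is small enough to exercise standalone. A sketch with invented sample data; the tuple shapes mirror `build_document_export`'s success and failure returns:

```python
import base64
import io
import json
import zipfile

# Invented sample inputs mirroring build_document_export's return shapes.
burned_docs = (
    (
        "good.pdf",
        base64.b64encode(b"%PDF-1.4 fake body").decode("utf-8"),
        {"title": "Good Doc"},
        {},
        {},
    ),
    ("", "", None, {}, {}),  # placeholder from a failed per-doc export
)

annotated_docs = {}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for doc_name, b64_file, doc_export, _text_labels, _doc_labels in burned_docs:
        # The fix: skip failed placeholders instead of packaging them.
        if not doc_name or doc_export is None:
            continue
        zf.writestr(doc_name, base64.decodebytes(b64_file.encode("utf-8")))
        annotated_docs[doc_name] = doc_export
    zf.writestr("data.json", json.dumps({"annotated_docs": annotated_docs}))

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    print(zf.namelist())  # ['good.pdf', 'data.json']; no empty-named entry
```

Without the `continue`, the second tuple would write a zip entry named `""` and insert `annotated_docs[""] = None`, which is precisely the corruption described above.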
- **Frontend coverage badge stuck on "unknown"** (`README.md:12`, `.github/workflows/codecov-notify.yml`, `.github/workflows/frontend.yml`, `.github/workflows/frontend-e2e.yml`, `frontend/package.json`, `frontend/yarn.lock`): PR #1322 pointed the README badge at a new merged `frontend` flag fed by a cross-workflow `lcov-result-merger@5` step inside `codecov-notify.yml`. Every `frontend-merged-coverage` upload since has landed at Codecov with `state: error` / `totals: null`, so the badge rendered "unknown" even though `frontend-unit` (31%), `frontend-component` (61%), and `frontend-e2e` (24%) were all processing correctly. Two defects in the merged lcov confirmed by local repro of the CI merge step: (1) `lcov-result-merger@5` emits a stripped lcov containing only `SF:`, `DA:`, `BRDA:`, `end_of_record` — it drops `TN:`, `FN`, `FNDA`, `FNF`, `FNH`, `LF`, `LH`, `BRF`, `BRH`, so line-summary fields required by Codecov's parser are absent; (2) Vitest v8 emits `src/...` (relative to `frontend/`) while `vite-plugin-istanbul` + `nyc report` emit `/home/runner/work/OpenContracts/OpenContracts/frontend/src/...` (absolute), and the merger keys on the literal path string so the same file appears as two records with conflicting hit counts. `codecov-notify.yml` also ran without `actions/checkout`, which Codecov's action docs explicitly recommend against. Fix: stop merging client-side and let Codecov aggregate server-side, since that is what flags are for. Each per-suite upload now declares two flags — `frontend-unit,frontend`, `frontend-component,frontend`, `frontend-e2e,frontend` — so the `frontend` flag total is the union of the three uploads computed by Codecov. `codecov-notify.yml` is reduced to its original gate-and-notify role (no artifact downloads, no `lcov-result-merger`, no merged upload). Deleted the `frontend-{unit,ct,e2e}-lcov` artifact publishes in `frontend.yml` / `frontend-e2e.yml`, removed the `lcov-result-merger@^5.0.1` devDep and the `coverage:merge` script from `frontend/package.json`, and pruned the orphaned entries from `frontend/yarn.lock`. README badge URL unchanged (`flag=frontend`).
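Defect (2) can be illustrated in a few lines: a merger keyed on the literal `SF:` path string cannot tell that a relative and an absolute entry name the same file. Toy data below (`src/App.tsx` is an invented filename); this is not `lcov-result-merger`'s actual code:

```python
# Toy illustration only; not lcov-result-merger's implementation.
records = [
    # Vitest v8 reports paths relative to frontend/ ...
    ("src/App.tsx", {"1": 1, "2": 0}),
    # ... while nyc report emits absolute runner paths for the same file.
    (
        "/home/runner/work/OpenContracts/OpenContracts/frontend/src/App.tsx",
        {"1": 0, "2": 1},
    ),
]

merged: dict[str, dict[str, int]] = {}
for path, hits in records:
    # Keying on the literal path string: the two spellings never collide,
    # so the same file lands in the merged lcov twice with conflicting hits.
    merged.setdefault(path, {}).update(hits)

print(len(merged))  # 2 records for one physical file
```

Server-side aggregation sidesteps this entirely because Codecov normalizes paths per upload before combining flags.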
### Changed

mypy.ini

Lines changed: 3 additions & 0 deletions

@@ -925,6 +925,9 @@ ignore_errors = True
 [mypy-opencontractserver.tests.test_openai_embedder]
 ignore_errors = True

+[mypy-opencontractserver.tests.test_package_annotated_docs]
+ignore_errors = True
+
 [mypy-opencontractserver.tests.test_page_imaging_tool]
 ignore_errors = True

opencontractserver/tasks/doc_tasks.py

Lines changed: 6 additions & 2 deletions

@@ -467,8 +467,8 @@ def burn_doc_annotations(
     analysis_ids: list[int] | None = None,
     annotation_filter_mode: str = "CORPUS_LABELSET_ONLY",
 ) -> tuple[
-    str | None,
-    str | None,
+    str,
+    str,
     OpenContractDocExport | None,
     dict[str | int, AnnotationLabelPythonType],
     dict[str | int, AnnotationLabelPythonType],
@@ -485,6 +485,10 @@ def burn_doc_annotations(

     Returns a tuple containing all data needed for packaging:
     (filename, base64-encoded file, doc_export_data, text_labels, doc_labels)
+
+    On failure, ``filename`` and ``base64-encoded file`` are empty strings and
+    ``doc_export_data`` is ``None``. Downstream consumers (e.g.
+    ``package_annotated_docs``) must skip such entries.
     """
     from opencontractserver.types.enums import AnnotationFilterMode

opencontractserver/tasks/export_tasks.py

Lines changed: 22 additions & 13 deletions

@@ -150,12 +150,13 @@ def on_demand_post_processors(
 def package_annotated_docs(
     burned_docs: tuple[
         tuple[
-            str | None,
-            str | None,
+            str,
+            str,
             OpenContractDocExport | None,
             dict[str | int, AnnotationLabelPythonType],
             dict[str | int, AnnotationLabelPythonType],
-        ]
+        ],
+        ...,
     ],
     export_id: int,
     corpus_pk: int,
@@ -173,7 +174,7 @@ def package_annotated_docs(
     """
     logger.info(f"Package corpus for export {export_id}...")

-    annotated_docs = {}
+    annotated_docs: dict[str, OpenContractDocExport] = {}
     doc_labels: dict[str | int, AnnotationLabelPythonType] | None = None
     text_labels: dict[str | int, AnnotationLabelPythonType] | None = None

@@ -184,23 +185,31 @@ def package_annotated_docs(

     for doc in burned_docs:

-        # logger.info(f"Handling burned doc: {doc[0]}")
+        doc_name, base64_file, doc_export, doc_text_labels, doc_doc_labels = doc
+
+        # build_document_export returns ("", "", None, {}, {}) when the per-doc
+        # export failed (e.g. the underlying file could not be loaded). Skip
+        # those placeholders so we don't emit an empty-keyed / None-valued
+        # entry into the final zip and data.json.
+        if not doc_name or doc_export is None:
+            logger.warning(
+                f"Skipping failed burned doc in export {export_id}: "
+                f"doc_name={doc_name!r}, has_export={doc_export is not None}"
+            )
+            continue

         if not doc_labels:
-            doc_labels: dict[str | int, AnnotationLabelPythonType] = doc[4]
+            doc_labels = doc_doc_labels

         if not text_labels:
-            text_labels: dict[str | int, AnnotationLabelPythonType] = doc[3]
+            text_labels = doc_text_labels

-        base64_img_bytes = doc[1].encode("utf-8")
+        base64_img_bytes = base64_file.encode("utf-8")
         decoded_file_data = base64.decodebytes(base64_img_bytes)
-        # logger.info("Data decoded successfully")

-        zip_file.writestr(doc[0], decoded_file_data)
-        # logger.info("Pdf written successfully")
+        zip_file.writestr(doc_name, decoded_file_data)

-        annotated_docs[doc[0]] = doc[2]
-        # logger.info("doc json added to json")
+        annotated_docs[doc_name] = doc_export

     export_file_data: OpenContractsExportDataJsonPythonType = {
         "annotated_docs": annotated_docs,
opencontractserver/tests/test_package_annotated_docs.py (new file)

Lines changed: 152 additions & 0 deletions

@@ -0,0 +1,152 @@
"""Regression tests for ``package_annotated_docs`` skipping failed placeholders."""

from __future__ import annotations

import base64
import io
import json
import zipfile
from unittest.mock import patch

from django.contrib.auth import get_user_model
from django.test import TestCase

from opencontractserver.corpuses.models import Corpus
from opencontractserver.tasks.export_tasks import package_annotated_docs
from opencontractserver.users.models import UserExport

User = get_user_model()


def _make_label(pk: str, text: str) -> dict:
    return {
        "id": pk,
        "color": "#FF0000",
        "description": "",
        "icon": "tag",
        "text": text,
        "label_type": "TOKEN_LABEL",
    }


def _make_doc_export(title: str) -> dict:
    return {
        "doc_labels": [],
        "labelled_text": [],
        "title": title,
        "description": "",
        "content": "",
        "pawls_file_content": [],
        "page_count": 1,
        "file_type": "application/pdf",
    }


class PackageAnnotatedDocsSkipTestCase(TestCase):
    """Verifies package_annotated_docs skips failed burn_doc_annotations tuples."""

    def setUp(self) -> None:
        self.user = User.objects.create_user(username="bob", password="12345678")
        self.corpus = Corpus.objects.create(
            title="Test Corpus",
            description="For package_annotated_docs tests",
            creator=self.user,
        )
        self.export = UserExport.objects.create(
            name="test-export",
            creator=self.user,
        )

        # A tiny payload that survives base64 + zip round-trip.
        self.fake_pdf_bytes = b"%PDF-1.4\n% fake body\n"
        self.fake_pdf_b64 = base64.b64encode(self.fake_pdf_bytes).decode("utf-8")

        self.text_labels = {"1": _make_label("1", "Important")}
        self.doc_labels = {"2": _make_label("2", "Contract")}

    def _collect_finalize(self):
        """Return a side_effect that captures finalize_export's output_bytes."""
        captured: dict = {}

        def _capture(export_id, filename, output_bytes, corpus_title):
            # finalize_export does output_bytes.seek(0) before saving; mirror
            # that so the captured buffer is ready to read in-test.
            output_bytes.seek(0)
            captured["bytes"] = output_bytes.getvalue()
            captured["filename"] = filename

        return captured, _capture

    def test_skips_failed_placeholder_and_packages_successful_docs(self) -> None:
        """Mixed input: one good doc + one failed placeholder -> only good doc lands in zip."""
        good_doc = (
            "good.pdf",
            self.fake_pdf_b64,
            _make_doc_export("Good Doc"),
            self.text_labels,
            self.doc_labels,
        )
        # This is exactly what build_document_export returns on failure.
        failed_doc = ("", "", None, {}, {})

        captured, capture_fn = self._collect_finalize()
        with patch(
            "opencontractserver.tasks.export_tasks.finalize_export",
            side_effect=capture_fn,
        ), self.assertLogs(
            "opencontractserver.tasks.export_tasks", level="WARNING"
        ) as log_cm:
            package_annotated_docs(
                burned_docs=(good_doc, failed_doc),
                export_id=self.export.id,
                corpus_pk=self.corpus.id,
            )

        self.assertTrue(
            any("Skipping failed burned doc" in line for line in log_cm.output),
            f"Expected skip-warning log, got: {log_cm.output}",
        )
        self.assertIn("bytes", captured, "finalize_export was not called")
        zf = zipfile.ZipFile(io.BytesIO(captured["bytes"]))
        names = set(zf.namelist())

        # The failed placeholder's empty-string filename must NOT be in the zip.
        self.assertNotIn("", names)
        self.assertIn("good.pdf", names)
        self.assertIn("data.json", names)

        data = json.loads(zf.read("data.json").decode("utf-8"))
        self.assertIn("good.pdf", data["annotated_docs"])
        self.assertNotIn("", data["annotated_docs"])
        # No None values should leak into annotated_docs.
        self.assertTrue(
            all(v is not None for v in data["annotated_docs"].values()),
            "annotated_docs must not contain None values for failed exports",
        )
        # Labels from the successful doc still propagate.
        self.assertEqual(data["text_labels"], self.text_labels)
        self.assertEqual(data["doc_labels"], self.doc_labels)

    def test_all_failed_produces_empty_annotated_docs_without_crashing(self) -> None:
        """Every doc failed -> no crash, empty annotated_docs, no bogus keys."""
        failed_doc_a = ("", "", None, {}, {})
        failed_doc_b = ("", "", None, {}, {})

        captured, capture_fn = self._collect_finalize()
        with patch(
            "opencontractserver.tasks.export_tasks.finalize_export",
            side_effect=capture_fn,
        ):
            package_annotated_docs(
                burned_docs=(failed_doc_a, failed_doc_b),
                export_id=self.export.id,
                corpus_pk=self.corpus.id,
            )

        self.assertIn("bytes", captured, "finalize_export was not called")
        zf = zipfile.ZipFile(io.BytesIO(captured["bytes"]))
        # Only data.json, no empty-filename entry.
        self.assertEqual(set(zf.namelist()), {"data.json"})

        data = json.loads(zf.read("data.json").decode("utf-8"))
        self.assertEqual(data["annotated_docs"], {})
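The `_collect_finalize` helper in the tests above uses a generally useful pattern: patch a collaborator with a `side_effect` that records its call arguments instead of replicating its behavior. A standalone sketch, with invented stand-ins for `finalize_export` and the code under test:

```python
import io
import sys
from unittest.mock import patch


def finalize(export_id: int, output_bytes: io.BytesIO) -> None:
    """Stand-in for the real finalize_export collaborator."""
    raise RuntimeError("should be patched out in tests")


def produce_export(export_id: int) -> None:
    """Stand-in for the code under test; hands a buffer to finalize()."""
    buf = io.BytesIO(b"zip-bytes-here")
    finalize(export_id, buf)


captured: dict = {}


def _capture(export_id, output_bytes):
    # Mirror finalize's seek-before-read so the buffer is ready to inspect.
    output_bytes.seek(0)
    captured["bytes"] = output_bytes.getvalue()


# side_effect runs with the real call arguments, so we can snapshot them.
with patch.object(sys.modules[__name__], "finalize", side_effect=_capture):
    produce_export(42)

print(captured["bytes"])  # b'zip-bytes-here'
```

This keeps the test focused on the producer's output bytes rather than on filesystem or storage side effects.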

opencontractserver/utils/etl.py

Lines changed: 8 additions & 2 deletions

@@ -202,8 +202,8 @@ def build_document_export(
     analysis_ids: list[int] | None = None,
     annotation_filter_mode: AnnotationFilterMode = AnnotationFilterMode.CORPUS_LABELSET_ONLY,
 ) -> tuple[
-    str | None,
-    str | None,
+    str,
+    str,
     OpenContractDocExport | None,
     dict[str | int, AnnotationLabelPythonType],
     dict[str | int, AnnotationLabelPythonType],
@@ -217,6 +217,12 @@
         analysis_ids: Optional list of analysis PKs to include in annotation selection
         annotation_filter_mode: How to filter annotations - "CORPUS_LABELSET_ONLY" (default),
             "CORPUS_LABELSET_PLUS_ANALYSES", or "ANALYSES_ONLY"
+
+    Returns a 5-tuple ``(doc_name, base64_encoded_file, doc_annotation_json,
+    text_labels, doc_labels)``. On failure, ``doc_name`` and
+    ``base64_encoded_file`` are empty strings and ``doc_annotation_json`` is
+    ``None``; consumers should treat an empty ``doc_name`` OR a ``None``
+    ``doc_annotation_json`` as a signal to skip that document.
     """

     logger.info(f"burn_doc_annotations - label_lookups: {label_lookups}")
