
Commit 3426cff

Make Anthropic structured extraction reliable + classify None failures
doc_extract_query_task with anthropic:claude-sonnet-4-6 was failing ~85% of cells on the LegalBench-RAG benchmark with the canonical "extraction returned None" message. The captured Datacell.llm_call_log showed Claude's last assistant message was always text + tool_use parts and never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned None. The same prompts on gpt-4o-mini / gpt-4o produced a 0-3% failure rate.

Three coordinated changes:

1. Strengthen the structured-extraction system prompt (all three _build_structured_system_prompt overrides in pydantic_ai_agents.py) to explicitly tell the agent it MUST commit to the final structured response by calling the result tool after gathering information. The wording is universal — harmless for OpenAI, necessary for Claude.

2. Pass output_retries=3 to PydanticAIAgent for structured runs (up from pydantic-ai's default of 1) so the loop has room to retry the final-result tool call when the model fails to commit on the first pass. Add an _is_anthropic_model() helper and force temperature=0 for structured runs against Anthropic models when the caller did not pin a temperature — Claude is reluctant to commit at non-zero temperatures.

3. Classify why doc_extract_query_task got None back, instead of reporting a single ambiguous error. A new _classify_none_result() helper inspects the captured pydantic-ai message log and returns one of agent_committed_none (legitimate), no_final_response (integration failure — the canonical Anthropic mode), tool_loop_no_output (integration failure), or unknown. The Datacell.stacktrace records failure_mode=<classification> plus a human-readable message, so operators can grep failure_mode= to separate "data not present" outcomes from pipeline bugs.

Closes #1381
1 parent f40e91f commit 3426cff
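The temperature rule in change 2 reduces to a small precedence function: an explicit caller temperature always wins; otherwise structured runs against Anthropic models are pinned to 0 unless the config already pinned one. A hypothetical sketch (the helper below is illustrative, not code from this commit):

```python
from typing import Optional


def effective_temperature(
    model_name: str,
    caller_temperature: Optional[float],
    config_temperature: Optional[float],
) -> Optional[float]:
    """Illustrative precedence: caller override, then a forced 0 for
    Anthropic structured runs, then the configured default."""
    if caller_temperature is not None:
        # The caller pinned a temperature; never override it.
        return caller_temperature
    name = model_name.lower()
    if (name.startswith("anthropic:") or "claude" in name) and (
        config_temperature is None or config_temperature > 0
    ):
        # Claude is reluctant to commit to the result tool at non-zero
        # temperatures, so structured runs are pinned to 0.
        return 0.0
    return config_temperature


print(effective_temperature("anthropic:claude-sonnet-4-6", None, 0.7))  # 0.0
print(effective_temperature("anthropic:claude-sonnet-4-6", 0.5, None))  # 0.5
print(effective_temperature("gpt-4o-mini", None, 0.7))                  # 0.7
```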

4 files changed

Lines changed: 365 additions & 17 deletions


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Fixed

+- **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
+  - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent` line 491, `PydanticAIDocumentAgent` line 1971, `PydanticAICorpusAgent` line 2476) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. The wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` (line 1321) now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. A new `_is_anthropic_model()` helper detects the `anthropic:` prefix or a bare `claude` substring; when an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is already reluctant to commit; non-zero temperature pushes it toward more exploratory text).
+  - **`opencontractserver/tasks/data_extract_tasks.py`** — A new `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (the same tool call repeated ≥ 3× without a final response — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs.
+  - **`opencontractserver/tests/test_data_extract_failure_classification.py`** — A new `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, loop-then-commit precedence, mixed text + tool path) and `_is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None).
 - **Merged `frontend` Codecov flag drops to ~33% on every commit where Frontend CI's CT job fails** (`frontend/package.json` `test:coverage:ct`): the script chained `playwright test ... && mkdir -p ... && nyc report ...`, so a failing CT run short-circuited before `nyc report` could turn the per-test JSON files in `.nyc_output` into an `lcov.info`. The downstream `Upload CT Coverage to Codecov` step (`if: success() || failure()`) then errored with "No coverage reports found" and `frontend-component` did not upload for that SHA. Codecov's server-side aggregation of the `frontend` flag was left with only `frontend-unit` (~23%) and `frontend-e2e` (~24%), pulling the merged number down to ~33% even though the previous commit was at ~67% — observed on six consecutive main commits 2026-04-26T01:02..02:58Z (`2d7033f8`..`be5bcfc8`) before recovering on `30298391`. Mirrored the existing `test:e2e:coverage` pattern (`; CT_EXIT=$?; nyc report ... || echo "No coverage data to report"; exit $CT_EXIT`) so `nyc report` runs regardless of test outcome and the lcov ships even on red CT runs. `frontend-component` will still report a slightly lower number when tests fail (failed tests register fewer hits), but it will report — keeping the merged `frontend` flag's denominator stable.
 - **`User.__init__` shared-state mutation re-introduced by branch merge** (`opencontractserver/users/models.py:172-180` removed): PR #1374 (commit `50ed6740`) deleted the `User.__init__` override that mutated `Field.validators[0]` on every instantiation, but a subsequent merge (`b68c1cb4 → 6d2cddbf`) resurrected the override along with its mypy-narrowing changes. The current main on commit `6d2cddbf` therefore reproduced the original #1358 bug: `User(...)` rebound `username_field.validators[0]` and clobbered any third-party validator prepended to the list. Removed the `__init__` override entirely; the class-body declaration `validators=[UserUnicodeUsernameValidator()]` on the `username` field (still present from PR #1374) is the canonical and only declaration. Also dropped the now-unused `Field` import. Regression coverage from PR #1374 (`opencontractserver/tests/test_user_username_validator.py`) was already on main and is what surfaced the regression in CI.
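Because every failed cell now carries a machine-readable `failure_mode=` marker, triage reduces to a grep. A minimal sketch, assuming stacktraces have been dumped to a text file (the path and log contents below are hypothetical):

```shell
# Hypothetical dump of Datacell stacktraces; path and contents are illustrative.
cat > /tmp/datacell_stacktraces.txt <<'EOF'
failure_mode=no_final_response
failure_mode=agent_committed_none
failure_mode=no_final_response
failure_mode=tool_loop_no_output
EOF

# Count cells per failure mode: agent_committed_none is a legitimate
# "data not present" outcome; the other modes indicate pipeline bugs.
grep -o 'failure_mode=[a-z_]*' /tmp/datacell_stacktraces.txt | sort | uniq -c
```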

opencontractserver/llms/agents/pydantic_ai_agents.py

Lines changed: 75 additions & 12 deletions
@@ -103,6 +103,25 @@
 # Type variable for structured responses
 T = TypeVar("T")

+# Number of retries pydantic-ai will attempt when the model fails to produce
+# a valid structured response. Bumped above pydantic-ai's default of 1 because
+# Anthropic models routinely need an extra round-trip after tool calls before
+# they commit to the final result tool. See issue #1381.
+STRUCTURED_OUTPUT_RETRIES = 3
+
+
+def _is_anthropic_model(model_name: Optional[str]) -> bool:
+    """Return True if ``model_name`` looks like an Anthropic / Claude model.
+
+    Accepts both pydantic-ai-style ``"anthropic:..."`` prefixes and bare model
+    names containing ``"claude"``. Used to apply Anthropic-specific structured
+    extraction tweaks (lower temperature, etc.).
+    """
+    if not model_name:
+        return False
+    name = model_name.lower()
+    return name.startswith("anthropic:") or "claude" in name
+

 def _get_function_tools(agent: PydanticAIAgent) -> dict:
     """Return the function-tools dict from a pydantic-ai Agent.
@@ -496,13 +515,24 @@ def _build_structured_system_prompt(
         Subclasses may override this to include document or corpus context.
         The base implementation intentionally avoids any citation or
         conversational guidance to minimize iterations and enforce raw output.
+
+        The wording explicitly tells the agent to commit to the final
+        structured response after gathering information. Some models (notably
+        Anthropic's Claude family) tend to keep narrating or invoking tools
+        instead of producing the structured output unless told to stop. See
+        issue #1381.
         """
         return (
             "You are in data extraction mode.\n"
             "Use available tools to locate the requested information.\n"
+            "After gathering enough information from the tools, you MUST "
+            "produce the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
             "Return ONLY the raw value matching the target type. "
             "No explanations, no citations, no extra words.\n"
-            "If the information cannot be found using the tools, return null/None."
+            "If the information genuinely cannot be found after using the "
+            "tools, return null/None."
         )

     async def _chat_raw(
@@ -1311,11 +1341,27 @@ async def _structured_response_raw(
         try:
             # Build model settings with overrides
             model_settings = _prepare_pydantic_ai_model_settings(self.config)
+            if model_settings is None:
+                model_settings = {}
             if temperature is not None:
                 model_settings["temperature"] = temperature
             if max_tokens is not None:
                 model_settings["max_tokens"] = max_tokens

+            # Anthropic models tend to keep narrating / calling tools instead
+            # of committing to the structured output when given any wiggle
+            # room (issue #1381). Force temperature down to 0 unless the
+            # caller explicitly asked for something else.
+            effective_model = model or self.config.model_name
+            if _is_anthropic_model(effective_model) and temperature is None:
+                if self.config.temperature is None or self.config.temperature > 0:
+                    logger.info(
+                        "Forcing temperature=0 for structured extraction with "
+                        "Anthropic model %s (issue #1381).",
+                        effective_model,
+                    )
+                    model_settings["temperature"] = 0
+
             # Seed tools from the main agent so the structured run has the same capabilities
             seeded_tools_dict = _get_function_tools(self.pydantic_ai_agent)
             seeded_tools = list(seeded_tools_dict.values())
@@ -1346,12 +1392,15 @@
             logger.info(f"Structured system prompt: {structured_system_prompt}")

             structured_agent = PydanticAIAgent(
-                model=model or self.config.model_name,
+                model=effective_model,
                 instructions=structured_system_prompt,
                 output_type=target_type,
                 deps_type=PydanticAIDependencies,
                 tools=final_tools,
                 model_settings=model_settings,
+                # Give pydantic-ai room to retry the structured output when
+                # the model fails to commit on the first pass (issue #1381).
+                output_retries=STRUCTURED_OUTPUT_RETRIES,
             )

             # Include prior conversation context if available
@@ -1960,11 +2009,18 @@ def _build_structured_system_prompt(
             f"{UNTRUSTED_CONTENT_NOTICE}\n\n"
             f"You are a data extraction specialist for document {fenced_title} (ID: {document_id}).\n\n"
             "EXTRACTION PROTOCOL:\n"
-            "1. You have access to tools to analyze this document. Use them to find the requested information.\n"
-            "2. Use vector search, summary loaders, and note access as needed to locate data.\n"
-            "3. Return ONLY the raw extracted value matching the target type.\n"
-            "4. No explanations, no citations, no commentary – just the data.\n\n"
-            "If the information cannot be found using the tools, return null/None."
+            "1. You have access to tools to analyze this document. "
+            "Use them to find the requested information.\n"
+            "2. Use vector search, summary loaders, and note access as "
+            "needed to locate data.\n"
+            "3. After gathering enough information from the tools, you "
+            "MUST commit to the final structured response by calling the "
+            "result tool. Do not narrate further; do not keep invoking "
+            "tools indefinitely.\n"
+            "4. Return ONLY the raw extracted value matching the target type.\n"
+            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "If the information genuinely cannot be found after using the "
+            "tools, return null/None."
         )

     @classmethod
@@ -2465,11 +2521,18 @@ def _build_structured_system_prompt(
             f"{UNTRUSTED_CONTENT_NOTICE}\n\n"
             f"You are a data extraction specialist for corpus {fenced_title} (ID: {corpus_id}).\n\n"
             "EXTRACTION PROTOCOL:\n"
-            "1. You have access to tools to analyze this corpus. Use them to find the requested information.\n"
-            "2. Leverage vector search and document coordination tools as needed.\n"
-            "3. Return ONLY the raw extracted value matching the target type.\n"
-            "4. No explanations, no citations, no commentary – just the data.\n\n"
-            "If the information cannot be found using the tools, return null/None."
+            "1. You have access to tools to analyze this corpus. "
+            "Use them to find the requested information.\n"
+            "2. Leverage vector search and document coordination tools "
+            "as needed.\n"
+            "3. After gathering enough information from the tools, you "
+            "MUST commit to the final structured response by calling the "
+            "result tool. Do not narrate further; do not keep invoking "
+            "tools indefinitely.\n"
+            "4. Return ONLY the raw extracted value matching the target type.\n"
+            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "If the information genuinely cannot be found after using the "
+            "tools, return null/None."
         )

     @classmethod

opencontractserver/tasks/data_extract_tasks.py

Lines changed: 124 additions & 5 deletions
@@ -2,6 +2,8 @@
 import json
 import logging
 import os
+from collections import Counter
+from typing import Any, Optional

 from asgiref.sync import sync_to_async

@@ -16,6 +18,115 @@
 logger = logging.getLogger(__name__)


+# ---------------------------------------------------------------------------
+# Failure-mode classification for structured-extraction None results
+# ---------------------------------------------------------------------------
+# When pydantic-ai returns ``None`` from the structured extraction agent, we
+# previously reported a single error message that conflated three very
+# different outcomes:
+#
+# 1. ``agent_committed_none`` — the agent searched, decided the data was
+#    absent, and explicitly returned None. Legitimate signal: the document
+#    doesn't contain the requested information.
+# 2. ``no_final_response`` — the agent never produced a final structured
+#    response. The pydantic-ai loop exhausted without the model calling the
+#    result tool. Common with Anthropic models (issue #1381). This is an
+#    integration failure, not a statement about the document.
+# 3. ``tool_loop_no_output`` — the agent issued the same tool call multiple
+#    times without ever synthesising a final answer. Also an integration
+#    failure.
+#
+# Operators want to grep ``failure_mode=`` to separate legitimate "data not
+# present" outcomes from pipeline bugs.
+
+NONE_RESULT_AGENT_COMMITTED = "agent_committed_none"
+NONE_RESULT_NO_FINAL = "no_final_response"
+NONE_RESULT_TOOL_LOOP = "tool_loop_no_output"
+NONE_RESULT_UNKNOWN = "unknown"
+
+# Threshold for declaring a tool loop. If any single tool name + arguments
+# combination appears at least this many times in the captured message log
+# without a final structured response, classify as ``tool_loop_no_output``.
+_TOOL_LOOP_THRESHOLD = 3
+
+
+def _classify_none_result(messages: Optional[list[Any]]) -> str:
+    """Classify *why* a structured extraction returned ``None``.
+
+    Examines the captured pydantic-ai message history and returns one of the
+    ``NONE_RESULT_*`` constants. Designed to be defensive: any unexpected
+    shape falls back to :data:`NONE_RESULT_UNKNOWN` rather than raising.
+    """
+    if not messages:
+        return NONE_RESULT_UNKNOWN
+
+    try:
+        from pydantic_ai.messages import ModelResponse, ToolCallPart
+    except ImportError:  # pragma: no cover - pydantic_ai is a hard dep
+        return NONE_RESULT_UNKNOWN
+
+    # Scan the log for ``ModelResponse`` parts. A "final structured response"
+    # is a ToolCallPart whose tool_name starts with ``final_result`` —
+    # pydantic-ai routes structured outputs through this synthetic tool.
+    saw_final_result = False
+    tool_call_signatures: list[tuple[str, str]] = []
+
+    for msg in messages:
+        if not isinstance(msg, ModelResponse):
+            continue
+        for part in getattr(msg, "parts", []) or []:
+            if isinstance(part, ToolCallPart):
+                tool_name = getattr(part, "tool_name", "") or ""
+                if tool_name.startswith("final_result"):
+                    saw_final_result = True
+                else:
+                    args_repr = repr(getattr(part, "args", None))
+                    tool_call_signatures.append((tool_name, args_repr))
+
+    if saw_final_result:
+        # Pydantic-ai received a final_result call but the structured output
+        # was None. That means the agent explicitly committed to the absence
+        # of data — legitimate.
+        return NONE_RESULT_AGENT_COMMITTED
+
+    # No final_result anywhere. Look for tool-call repetition.
+    if tool_call_signatures:
+        most_common = Counter(tool_call_signatures).most_common(1)
+        if most_common and most_common[0][1] >= _TOOL_LOOP_THRESHOLD:
+            return NONE_RESULT_TOOL_LOOP
+
+    return NONE_RESULT_NO_FINAL
+
+
+def _failure_message_for_classification(classification: str) -> str:
+    """Human-readable failure message for a ``NONE_RESULT_*`` classification."""
+    if classification == NONE_RESULT_AGENT_COMMITTED:
+        return (
+            "The extraction agent committed to a None result — the requested "
+            "information was not found in the document."
+        )
+    if classification == NONE_RESULT_NO_FINAL:
+        return (
+            "The extraction agent never produced a final structured response. "
+            "This is an integration failure (the model exhausted its tool-use "
+            "budget without committing to the result tool), not a statement "
+            "about the document. See issue #1381."
+        )
+    if classification == NONE_RESULT_TOOL_LOOP:
+        return (
+            "The extraction agent looped on the same tool call without "
+            "producing a final structured response. This is an integration "
+            "failure, not a statement about the document. See issue #1381."
+        )
+    return (
+        "The extraction returned None and the cause could not be classified. "
+        "See ``llm_call_log`` for the full message history."
+    )
+
+
 @sync_to_async
 def get_annotation_label_text(annotation):
     """
@@ -340,15 +451,23 @@ def sync_add_sources(datacell, sources):
         )

     else:
-        # Extraction returned None
-        logger.warning(f"✗ Extraction returned None for cell {cell_id}")
+        # Extraction returned None — classify *why* so operators can
+        # distinguish legitimate "data not present" outcomes from
+        # pipeline bugs (issue #1381).
+        classification = _classify_none_result(
+            messages if "messages" in locals() else None
+        )
+        failure_message = _failure_message_for_classification(classification)
         logger.warning(
-            "  This likely means the requested information is not present in the document"
+            f"✗ Extraction returned None for cell {cell_id} "
+            f"(failure_mode={classification})"
         )
+        logger.warning(f"  {failure_message}")
         await sync_mark_failed(
             datacell,
-            "Failed to extract requested data from document",
-            "The extraction returned None - the requested information may not be present in the document.",
+            f"Failed to extract requested data from document "
+            f"(failure_mode={classification})",
+            f"failure_mode={classification}\n\n{failure_message}",
             llm_log,
         )
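The classifier's decision tree can be exercised outside the task module with lightweight stand-ins for pydantic-ai's message types. The real helper imports `ModelResponse` and `ToolCallPart` from `pydantic_ai.messages`; the dataclasses below are illustrative substitutes only:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCallPart:
    """Stand-in for pydantic_ai.messages.ToolCallPart."""
    tool_name: str
    args: Any = None


@dataclass
class ModelResponse:
    """Stand-in for pydantic_ai.messages.ModelResponse."""
    parts: list = field(default_factory=list)


_TOOL_LOOP_THRESHOLD = 3


def classify_none_result(messages) -> str:
    """Mirror of the decision tree in _classify_none_result (see diff above)."""
    if not messages:
        return "unknown"
    saw_final = False
    signatures: list[tuple[str, str]] = []
    for msg in messages:
        if not isinstance(msg, ModelResponse):
            continue
        for part in msg.parts or []:
            if isinstance(part, ToolCallPart):
                if part.tool_name.startswith("final_result"):
                    saw_final = True  # the agent committed via the result tool
                else:
                    signatures.append((part.tool_name, repr(part.args)))
    if saw_final:
        return "agent_committed_none"
    if signatures and Counter(signatures).most_common(1)[0][1] >= _TOOL_LOOP_THRESHOLD:
        return "tool_loop_no_output"
    return "no_final_response"


# The canonical Anthropic failure mode: repeated tool calls, no final_result.
looping = [ModelResponse(parts=[ToolCallPart("vector_search", {"q": "term"})])] * 3
print(classify_none_result(looping))    # tool_loop_no_output
committed = [ModelResponse(parts=[ToolCallPart("final_result_MyType")])]
print(classify_none_result(committed))  # agent_committed_none
print(classify_none_result([]))         # unknown
```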