
Commit 3426cff

Make Anthropic structured extraction reliable + classify None failures
doc_extract_query_task with anthropic:claude-sonnet-4-6 was failing ~85% of cells on the LegalBench-RAG benchmark with the canonical "extraction returned None" message. The captured Datacell.llm_call_log showed Claude's last assistant message was always text + tool_use parts and never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned None. The same prompts on gpt-4o-mini / gpt-4o produced a 0-3% failure rate.

Three coordinated changes:

1. Strengthen the structured-extraction system prompt (all three _build_structured_system_prompt overrides in pydantic_ai_agents.py) to explicitly tell the agent it MUST commit to the final structured response by calling the result tool after gathering information. The wording is universal — harmless for OpenAI, necessary for Claude.

2. Pass output_retries=3 to PydanticAIAgent for structured runs (up from pydantic-ai's default of 1) so the loop has room to retry the final-result tool call when the model fails to commit on the first pass. Add an _is_anthropic_model() helper and force temperature=0 for structured runs against Anthropic models when the caller did not pin a temperature — Claude is reluctant to commit at non-zero temperatures.

3. Classify why doc_extract_query_task got None back, instead of reporting a single ambiguous error. A new _classify_none_result() helper inspects the captured pydantic-ai message log and returns one of agent_committed_none (legitimate), no_final_response (integration failure — the canonical Anthropic mode), tool_loop_no_output (integration failure), or unknown. The Datacell.stacktrace records failure_mode=<classification> plus a human-readable message, so operators can grep failure_mode= to separate "data not present" outcomes from pipeline bugs.

Closes #1381
1 parent f40e91f commit 3426cff
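The temperature rule in change 2 reduces to a small precedence function: an explicit caller temperature always wins; otherwise structured runs against Anthropic models are pinned to 0 unless the config already pinned one. A hypothetical sketch (the helper below is illustrative, not code from this commit):

```python
from typing import Optional


def effective_temperature(
    model_name: str,
    caller_temperature: Optional[float],
    config_temperature: Optional[float],
) -> Optional[float]:
    """Illustrative precedence: caller override, then a forced 0 for
    Anthropic structured runs, then the configured default."""
    if caller_temperature is not None:
        # The caller pinned a temperature; never override it.
        return caller_temperature
    name = model_name.lower()
    if (name.startswith("anthropic:") or "claude" in name) and (
        config_temperature is None or config_temperature > 0
    ):
        # Claude is reluctant to commit to the result tool at non-zero
        # temperatures, so structured runs are pinned to 0.
        return 0.0
    return config_temperature


print(effective_temperature("anthropic:claude-sonnet-4-6", None, 0.7))  # 0.0
print(effective_temperature("anthropic:claude-sonnet-4-6", 0.5, None))  # 0.5
print(effective_temperature("gpt-4o-mini", None, 0.7))                  # 0.7
```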

4 files changed

Lines changed: 365 additions & 17 deletions


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Fixed

+- **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
+  - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent` line 491, `PydanticAIDocumentAgent` line 1971, `PydanticAICorpusAgent` line 2476) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. The wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` (line 1321) now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. A new `_is_anthropic_model()` helper detects the `anthropic:` prefix or a bare `claude` substring; when an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is already reluctant to commit; non-zero temperature pushes it toward more exploratory text).
+  - **`opencontractserver/tasks/data_extract_tasks.py`** — A new `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (the same tool call repeated ≥ 3× without a final response — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs.
+  - **`opencontractserver/tests/test_data_extract_failure_classification.py`** — A new `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, loop-then-commit precedence, mixed text + tool path) and `_is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None).
 - **Merged `frontend` Codecov flag drops to ~33% on every commit where Frontend CI's CT job fails** (`frontend/package.json` `test:coverage:ct`): the script chained `playwright test ... && mkdir -p ... && nyc report ...`, so a failing CT run short-circuited before `nyc report` could turn the per-test JSON files in `.nyc_output` into an `lcov.info`. The downstream `Upload CT Coverage to Codecov` step (`if: success() || failure()`) then errored with "No coverage reports found" and `frontend-component` did not upload for that SHA. Codecov's server-side aggregation of the `frontend` flag was left with only `frontend-unit` (~23%) and `frontend-e2e` (~24%), pulling the merged number down to ~33% even though the previous commit was at ~67% — observed on six consecutive main commits 2026-04-26T01:02..02:58Z (`2d7033f8`..`be5bcfc8`) before recovering on `30298391`. Mirrored the existing `test:e2e:coverage` pattern (`; CT_EXIT=$?; nyc report ... || echo "No coverage data to report"; exit $CT_EXIT`) so `nyc report` runs regardless of test outcome and the lcov ships even on red CT runs. `frontend-component` will still report a slightly lower number when tests fail (failed tests register fewer hits), but it will report — keeping the merged `frontend` flag's denominator stable.
 - **`User.__init__` shared-state mutation re-introduced by branch merge** (`opencontractserver/users/models.py:172-180` removed): PR #1374 (commit `50ed6740`) deleted the `User.__init__` override that mutated `Field.validators[0]` on every instantiation, but a subsequent merge (`b68c1cb4 → 6d2cddbf`) resurrected the override along with its mypy-narrowing changes. The current main on commit `6d2cddbf` therefore reproduced the original #1358 bug: `User(...)` rebound `username_field.validators[0]` and clobbered any third-party validator prepended to the list. Removed the `__init__` override entirely; the class-body declaration `validators=[UserUnicodeUsernameValidator()]` on the `username` field (still present from PR #1374) is the canonical and only declaration. Also dropped the now-unused `Field` import. Regression coverage from PR #1374 (`opencontractserver/tests/test_user_username_validator.py`) was already on main and is what surfaced the regression in CI.
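Because every failed cell now carries a machine-readable `failure_mode=` marker, triage reduces to a grep. A minimal sketch, assuming stacktraces have been dumped to a text file (the path and log contents below are hypothetical):

```shell
# Hypothetical dump of Datacell stacktraces; path and contents are illustrative.
cat > /tmp/datacell_stacktraces.txt <<'EOF'
failure_mode=no_final_response
failure_mode=agent_committed_none
failure_mode=no_final_response
failure_mode=tool_loop_no_output
EOF

# Count cells per failure mode: agent_committed_none is a legitimate
# "data not present" outcome; the other modes indicate pipeline bugs.
grep -o 'failure_mode=[a-z_]*' /tmp/datacell_stacktraces.txt | sort | uniq -c
```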

opencontractserver/llms/agents/pydantic_ai_agents.py

Lines changed: 75 additions & 12 deletions
@@ -103,6 +103,25 @@
 # Type variable for structured responses
 T = TypeVar("T")

+# Number of retries pydantic-ai will attempt when the model fails to produce
+# a valid structured response. Bumped above pydantic-ai's default of 1 because
+# Anthropic models routinely need an extra round-trip after tool calls before
+# they commit to the final result tool. See issue #1381.
+STRUCTURED_OUTPUT_RETRIES = 3
+
+
+def _is_anthropic_model(model_name: Optional[str]) -> bool:
+    """Return True if ``model_name`` looks like an Anthropic / Claude model.
+
+    Accepts both pydantic-ai-style ``"anthropic:..."`` prefixes and bare model
+    names containing ``"claude"``. Used to apply Anthropic-specific structured
+    extraction tweaks (lower temperature, etc.).
+    """
+    if not model_name:
+        return False
+    name = model_name.lower()
+    return name.startswith("anthropic:") or "claude" in name
+

 def _get_function_tools(agent: PydanticAIAgent) -> dict:
     """Return the function-tools dict from a pydantic-ai Agent.
@@ -496,13 +515,24 @@ def _build_structured_system_prompt(
         Subclasses may override this to include document or corpus context.
         The base implementation intentionally avoids any citation or
         conversational guidance to minimize iterations and enforce raw output.
+
+        The wording explicitly tells the agent to commit to the final
+        structured response after gathering information. Some models (notably
+        Anthropic's Claude family) tend to keep narrating or invoking tools
+        instead of producing the structured output unless told to stop. See
+        issue #1381.
         """
         return (
             "You are in data extraction mode.\n"
             "Use available tools to locate the requested information.\n"
+            "After gathering enough information from the tools, you MUST "
+            "produce the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
             "Return ONLY the raw value matching the target type. "
             "No explanations, no citations, no extra words.\n"
-            "If the information cannot be found using the tools, return null/None."
+            "If the information genuinely cannot be found after using the "
+            "tools, return null/None."
         )

     async def _chat_raw(
@@ -1311,11 +1341,27 @@ async def _structured_response_raw(
         try:
             # Build model settings with overrides
             model_settings = _prepare_pydantic_ai_model_settings(self.config)
+            if model_settings is None:
+                model_settings = {}
             if temperature is not None:
                 model_settings["temperature"] = temperature
             if max_tokens is not None:
                 model_settings["max_tokens"] = max_tokens

+            # Anthropic models tend to keep narrating / calling tools instead
+            # of committing to the structured output when given any wiggle
+            # room (issue #1381). Force temperature down to 0 unless the
+            # caller explicitly asked for something else.
+            effective_model = model or self.config.model_name
+            if _is_anthropic_model(effective_model) and temperature is None:
+                if self.config.temperature is None or self.config.temperature > 0:
+                    logger.info(
+                        "Forcing temperature=0 for structured extraction with "
+                        "Anthropic model %s (issue #1381).",
+                        effective_model,
+                    )
+                    model_settings["temperature"] = 0
+
             # Seed tools from the main agent so the structured run has the same capabilities
             seeded_tools_dict = _get_function_tools(self.pydantic_ai_agent)
             seeded_tools = list(seeded_tools_dict.values())
@@ -1346,12 +1392,15 @@
             logger.info(f"Structured system prompt: {structured_system_prompt}")

             structured_agent = PydanticAIAgent(
-                model=model or self.config.model_name,
+                model=effective_model,
                 instructions=structured_system_prompt,
                 output_type=target_type,
                 deps_type=PydanticAIDependencies,
                 tools=final_tools,
                 model_settings=model_settings,
+                # Give pydantic-ai room to retry the structured output when
+                # the model fails to commit on the first pass (issue #1381).
+                output_retries=STRUCTURED_OUTPUT_RETRIES,
             )

             # Include prior conversation context if available
@@ -1960,11 +2009,18 @@ def _build_structured_system_prompt(
             f"{UNTRUSTED_CONTENT_NOTICE}\n\n"
             f"You are a data extraction specialist for document {fenced_title} (ID: {document_id}).\n\n"
             "EXTRACTION PROTOCOL:\n"
-            "1. You have access to tools to analyze this document. Use them to find the requested information.\n"
-            "2. Use vector search, summary loaders, and note access as needed to locate data.\n"
-            "3. Return ONLY the raw extracted value matching the target type.\n"
-            "4. No explanations, no citations, no commentary – just the data.\n\n"
-            "If the information cannot be found using the tools, return null/None."
+            "1. You have access to tools to analyze this document. "
+            "Use them to find the requested information.\n"
+            "2. Use vector search, summary loaders, and note access as "
+            "needed to locate data.\n"
+            "3. After gathering enough information from the tools, you "
+            "MUST commit to the final structured response by calling the "
+            "result tool. Do not narrate further; do not keep invoking "
+            "tools indefinitely.\n"
+            "4. Return ONLY the raw extracted value matching the target type.\n"
+            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "If the information genuinely cannot be found after using the "
+            "tools, return null/None."
         )

     @classmethod
@@ -2465,11 +2521,18 @@ def _build_structured_system_prompt(
             f"{UNTRUSTED_CONTENT_NOTICE}\n\n"
             f"You are a data extraction specialist for corpus {fenced_title} (ID: {corpus_id}).\n\n"
             "EXTRACTION PROTOCOL:\n"
-            "1. You have access to tools to analyze this corpus. Use them to find the requested information.\n"
-            "2. Leverage vector search and document coordination tools as needed.\n"
-            "3. Return ONLY the raw extracted value matching the target type.\n"
-            "4. No explanations, no citations, no commentary – just the data.\n\n"
-            "If the information cannot be found using the tools, return null/None."
+            "1. You have access to tools to analyze this corpus. "
+            "Use them to find the requested information.\n"
+            "2. Leverage vector search and document coordination tools "
+            "as needed.\n"
+            "3. After gathering enough information from the tools, you "
+            "MUST commit to the final structured response by calling the "
+            "result tool. Do not narrate further; do not keep invoking "
+            "tools indefinitely.\n"
+            "4. Return ONLY the raw extracted value matching the target type.\n"
+            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "If the information genuinely cannot be found after using the "
+            "tools, return null/None."
         )

     @classmethod

opencontractserver/tasks/data_extract_tasks.py

Lines changed: 124 additions & 5 deletions
@@ -2,6 +2,8 @@
 import json
 import logging
 import os
+from collections import Counter
+from typing import Any, Optional

 from asgiref.sync import sync_to_async

@@ -16,6 +18,115 @@
 logger = logging.getLogger(__name__)


+# ---------------------------------------------------------------------------
+# Failure-mode classification for structured-extraction None results
+# ---------------------------------------------------------------------------
+# When pydantic-ai returns ``None`` from the structured extraction agent, we
+# previously reported a single error message that conflated three very
+# different outcomes:
+#
+# 1. ``agent_committed_none`` — the agent searched, decided the data was
+#    absent, and explicitly returned None. Legitimate signal: the document
+#    doesn't contain the requested information.
+# 2. ``no_final_response`` — the agent never produced a final structured
+#    response. The pydantic-ai loop exhausted without the model calling the
+#    result tool. Common with Anthropic models (issue #1381). This is an
+#    integration failure, not a statement about the document.
+# 3. ``tool_loop_no_output`` — the agent issued the same tool call multiple
+#    times without ever synthesising a final answer. Also an integration
+#    failure.
+#
+# Operators want to grep ``failure_mode=`` to separate legitimate "data not
+# present" outcomes from pipeline bugs.
+
+NONE_RESULT_AGENT_COMMITTED = "agent_committed_none"
+NONE_RESULT_NO_FINAL = "no_final_response"
+NONE_RESULT_TOOL_LOOP = "tool_loop_no_output"
+NONE_RESULT_UNKNOWN = "unknown"
+
+# Threshold for declaring a tool loop. If any single tool name + arguments
+# combination appears at least this many times in the captured message log
+# without a final structured response, classify as ``tool_loop_no_output``.
+_TOOL_LOOP_THRESHOLD = 3
+
+
+def _classify_none_result(messages: Optional[list[Any]]) -> str:
+    """Classify *why* a structured extraction returned ``None``.
+
+    Examines the captured pydantic-ai message history and returns one of the
+    ``NONE_RESULT_*`` constants. Designed to be defensive: any unexpected
+    shape falls back to :data:`NONE_RESULT_UNKNOWN` rather than raising.
+    """
+    if not messages:
+        return NONE_RESULT_UNKNOWN
+
+    try:
+        from pydantic_ai.messages import ModelResponse, ToolCallPart
+    except ImportError:  # pragma: no cover - pydantic_ai is a hard dep
+        return NONE_RESULT_UNKNOWN
+
+    # Scan the log for ``ModelResponse`` parts. A "final structured response"
+    # is a ToolCallPart whose tool_name starts with ``final_result`` —
+    # pydantic-ai routes structured outputs through this synthetic tool.
+    saw_final_result = False
+    tool_call_signatures: list[tuple[str, str]] = []
+
+    for msg in messages:
+        if not isinstance(msg, ModelResponse):
+            continue
+        for part in getattr(msg, "parts", []) or []:
+            if isinstance(part, ToolCallPart):
+                tool_name = getattr(part, "tool_name", "") or ""
+                if tool_name.startswith("final_result"):
+                    saw_final_result = True
+                else:
+                    args_repr = repr(getattr(part, "args", None))
+                    tool_call_signatures.append((tool_name, args_repr))
+
+    if saw_final_result:
+        # Pydantic-ai received a final_result call but the structured output
+        # was None. That means the agent explicitly committed to the absence
+        # of data — legitimate.
+        return NONE_RESULT_AGENT_COMMITTED
+
+    # No final_result anywhere. Look for tool-call repetition.
+    if tool_call_signatures:
+        most_common = Counter(tool_call_signatures).most_common(1)
+        if most_common and most_common[0][1] >= _TOOL_LOOP_THRESHOLD:
+            return NONE_RESULT_TOOL_LOOP
+
+    return NONE_RESULT_NO_FINAL
+
+
+def _failure_message_for_classification(classification: str) -> str:
+    """Human-readable failure message for a ``NONE_RESULT_*`` classification."""
+    if classification == NONE_RESULT_AGENT_COMMITTED:
+        return (
+            "The extraction agent committed to a None result — the requested "
+            "information was not found in the document."
+        )
+    if classification == NONE_RESULT_NO_FINAL:
+        return (
+            "The extraction agent never produced a final structured response. "
+            "This is an integration failure (the model exhausted its tool-use "
+            "budget without committing to the result tool), not a statement "
+            "about the document. See issue #1381."
+        )
+    if classification == NONE_RESULT_TOOL_LOOP:
+        return (
+            "The extraction agent looped on the same tool call without "
+            "producing a final structured response. This is an integration "
+            "failure, not a statement about the document. See issue #1381."
+        )
+    return (
+        "The extraction returned None and the cause could not be classified. "
+        "See ``llm_call_log`` for the full message history."
+    )
+
+
 @sync_to_async
 def get_annotation_label_text(annotation):
     """
@@ -340,15 +451,23 @@ def sync_add_sources(datacell, sources):
         )

     else:
-        # Extraction returned None
-        logger.warning(f"✗ Extraction returned None for cell {cell_id}")
+        # Extraction returned None — classify *why* so operators can
+        # distinguish legitimate "data not present" outcomes from
+        # pipeline bugs (issue #1381).
+        classification = _classify_none_result(
+            messages if "messages" in locals() else None
+        )
+        failure_message = _failure_message_for_classification(classification)
         logger.warning(
-            "  This likely means the requested information is not present in the document"
+            f"✗ Extraction returned None for cell {cell_id} "
+            f"(failure_mode={classification})"
         )
+        logger.warning(f"  {failure_message}")
         await sync_mark_failed(
             datacell,
-            "Failed to extract requested data from document",
-            "The extraction returned None - the requested information may not be present in the document.",
+            f"Failed to extract requested data from document "
+            f"(failure_mode={classification})",
+            f"failure_mode={classification}\n\n{failure_message}",
             llm_log,
         )
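The classifier's decision tree can be exercised outside the task module with lightweight stand-ins for pydantic-ai's message types. The real helper imports `ModelResponse` and `ToolCallPart` from `pydantic_ai.messages`; the dataclasses below are illustrative substitutes only:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCallPart:
    """Stand-in for pydantic_ai.messages.ToolCallPart."""
    tool_name: str
    args: Any = None


@dataclass
class ModelResponse:
    """Stand-in for pydantic_ai.messages.ModelResponse."""
    parts: list = field(default_factory=list)


_TOOL_LOOP_THRESHOLD = 3


def classify_none_result(messages) -> str:
    """Mirror of the decision tree in _classify_none_result (see diff above)."""
    if not messages:
        return "unknown"
    saw_final = False
    signatures: list[tuple[str, str]] = []
    for msg in messages:
        if not isinstance(msg, ModelResponse):
            continue
        for part in msg.parts or []:
            if isinstance(part, ToolCallPart):
                if part.tool_name.startswith("final_result"):
                    saw_final = True  # the agent committed via the result tool
                else:
                    signatures.append((part.tool_name, repr(part.args)))
    if saw_final:
        return "agent_committed_none"
    if signatures and Counter(signatures).most_common(1)[0][1] >= _TOOL_LOOP_THRESHOLD:
        return "tool_loop_no_output"
    return "no_final_response"


# The canonical Anthropic failure mode: repeated tool calls, no final_result.
looping = [ModelResponse(parts=[ToolCallPart("vector_search", {"q": "term"})])] * 3
print(classify_none_result(looping))    # tool_loop_no_output
committed = [ModelResponse(parts=[ToolCallPart("final_result_MyType")])]
print(classify_none_result(committed))  # agent_committed_none
print(classify_none_result([]))         # unknown
```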