Skip to content

Commit b9704b7

Browse files
committed
Close Anthropic temperature-coupling footgun + lift patch coverage
Hoist is_anthropic_model into constants/llm.py so doc_extract_query_task can decide whether to pass temperature=None and let the agent-layer guard apply temperature=0 automatically. New _resolve_extract_temperature helper makes the model-family -> temperature coupling unit-testable and prevents a silent regression of #1381 if EXTRACT_DEFAULT_MODEL is ever flipped to a Claude model. Adds Anthropic temperature-override tests on _structured_response_raw, smoke tests for the strengthened _build_structured_system_prompt overrides, and defensive-path tests for _classify_none_result.
1 parent cf43bbd commit b9704b7

6 files changed

Lines changed: 509 additions & 45 deletions

File tree

CHANGELOG.md

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -39,9 +39,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
3939
### Fixed
4040

4141
- **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
42-
- **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent` line 491, `PydanticAIDocumentAgent` line 1971, `PydanticAICorpusAgent` line 2476) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` (line 1321) now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. New `_is_anthropic_model()` helper detects `anthropic:` prefix or bare `claude` substring; when an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is already reluctant to commit; non-zero temperature pushes it toward more exploratory text).
43-
- **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs.
44-
- **`opencontractserver/tests/test_data_extract_failure_classification.py`** — New `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, loop-then-commit precedence, mixed text + tool path) and `_is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None).
42+
- **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text).
43+
- **`opencontractserver/constants/llm.py`** — Hosts `is_anthropic_model()` (`anthropic:` prefix or bare `claude` substring detector). Lives next to `EXTRACT_DEFAULT_MODEL` / `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
44+
- **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `EXTRACT_DEFAULT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
45+
- **`opencontractserver/tests/test_data_extract_failure_classification.py`**`SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
46+
- **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
4547
- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
4648
- **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
4749
- **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.

opencontractserver/constants/llm.py

Lines changed: 25 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,8 @@
88
instead of chasing literals across modules.
99
"""
1010

11+
from typing import Optional
12+
1113
# Retry budget passed to ``PydanticAIAgent`` for structured extraction.
1214
# pydantic-ai's default is 1; Claude/Anthropic models routinely fail to
1315
# call ``final_result`` on the first turn for sparse documents and we
@@ -36,11 +38,28 @@
3638
NONE_RESULT_UNKNOWN = "unknown"
3739

3840
# Default model for ``doc_extract_query_task``. Co-located with
39-
# ``EXTRACT_DEFAULT_TEMPERATURE`` so the model/temperature relationship is
40-
# visible in one place: ``EXTRACT_DEFAULT_TEMPERATURE`` is safe ONLY
41-
# because this model is OpenAI. Switching to a Claude model here without
42-
# also dropping the temperature override silently regresses the issue
43-
# #1381 reliability fix — see ``_is_anthropic_model`` and
44-
# ``_structured_response_raw`` in ``pydantic_ai_agents.py``.
41+
# ``EXTRACT_DEFAULT_TEMPERATURE`` and the ``is_anthropic_model`` helper
42+
# below so the model/family/temperature relationship lives in one place.
43+
# Call sites must pass ``temperature=None`` when the model family is
44+
# Anthropic so the structured-extraction guard in
45+
# ``_structured_response_raw`` can apply ``temperature=0`` automatically
46+
# (issue #1381).
4547
EXTRACT_DEFAULT_MODEL = "openai:gpt-4o-mini"
4648
EXTRACT_DEFAULT_TEMPERATURE = 0.3
49+
50+
51+
def is_anthropic_model(model_name: Optional[str]) -> bool:
52+
"""Return True if ``model_name`` looks like an Anthropic / Claude model.
53+
54+
Accepts both pydantic-ai-style ``"anthropic:..."`` prefixes and bare
55+
model names containing ``"claude"``. Lives in ``constants/llm.py``
56+
rather than in an agent module because call sites outside the agents
57+
layer (notably ``data_extract_tasks.doc_extract_query_task``) need to
58+
decide whether to pass ``temperature=None`` so the Anthropic guard in
59+
``_structured_response_raw`` activates. Pure stateless string check —
60+
no imports beyond ``typing``.
61+
"""
62+
if not model_name:
63+
return False
64+
name = model_name.lower()
65+
return name.startswith("anthropic:") or "claude" in name

opencontractserver/llms/agents/pydantic_ai_agents.py

Lines changed: 5 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -31,7 +31,10 @@
3131
from pydantic_graph import End
3232

3333
from opencontractserver.constants.context_guardrails import COMPACTION_SUMMARY_PREFIX
34-
from opencontractserver.constants.llm import STRUCTURED_OUTPUT_RETRIES
34+
from opencontractserver.constants.llm import (
35+
STRUCTURED_OUTPUT_RETRIES,
36+
is_anthropic_model,
37+
)
3538
from opencontractserver.conversations.models import Conversation
3639
from opencontractserver.corpuses.models import Corpus
3740
from opencontractserver.documents.models import Document
@@ -105,19 +108,6 @@
105108
T = TypeVar("T")
106109

107110

108-
def _is_anthropic_model(model_name: Optional[str]) -> bool:
109-
"""Return True if ``model_name`` looks like an Anthropic / Claude model.
110-
111-
Accepts both pydantic-ai-style ``"anthropic:..."`` prefixes and bare model
112-
names containing ``"claude"``. Used to apply Anthropic-specific structured
113-
extraction tweaks (lower temperature, etc.).
114-
"""
115-
if not model_name:
116-
return False
117-
name = model_name.lower()
118-
return name.startswith("anthropic:") or "claude" in name
119-
120-
121111
def _get_function_tools(agent: PydanticAIAgent) -> dict:
122112
"""Return the function-tools dict from a pydantic-ai Agent.
123113
@@ -1357,7 +1347,7 @@ async def _structured_response_raw(
13571347
# temperature pin OR an explicit config.temperature).
13581348
effective_model = model or self.config.model_name
13591349
if (
1360-
_is_anthropic_model(effective_model)
1350+
is_anthropic_model(effective_model)
13611351
and temperature is None
13621352
and self.config.temperature is None
13631353
):

opencontractserver/tasks/data_extract_tasks.py

Lines changed: 22 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
NONE_RESULT_TOOL_LOOP,
1818
NONE_RESULT_UNKNOWN,
1919
TOOL_LOOP_THRESHOLD,
20+
is_anthropic_model,
2021
)
2122
from opencontractserver.extracts.models import Datacell
2223
from opencontractserver.shared.decorators import celery_task_with_async_to_sync
@@ -114,6 +115,20 @@ def _classify_none_result(messages: Optional[list[Any]]) -> str:
114115
return NONE_RESULT_NO_FINAL
115116

116117

118+
def _resolve_extract_temperature(model_name: Optional[str]) -> Optional[float]:
119+
"""Pick the temperature to pass to ``get_structured_response_from_document``.
120+
121+
Returns ``None`` when ``model_name`` is an Anthropic / Claude model so the
122+
Anthropic guard in ``_structured_response_raw`` can apply ``temperature=0``
123+
automatically (issue #1381). Returns :data:`EXTRACT_DEFAULT_TEMPERATURE`
124+
otherwise. Pulled out as a helper so the model-family→temperature
125+
coupling is unit-testable without standing up the full extract task.
126+
"""
127+
if is_anthropic_model(model_name):
128+
return None
129+
return EXTRACT_DEFAULT_TEMPERATURE
130+
131+
117132
def _failure_message_for_classification(classification: str) -> str:
118133
"""Human-readable failure message for a ``NONE_RESULT_*`` classification."""
119134
if classification == NONE_RESULT_AGENT_COMMITTED:
@@ -349,26 +364,22 @@ def sync_add_sources(datacell, sources):
349364
# Capture LLM messages for debugging
350365
messages: Optional[list[Any]] = None
351366

367+
# Gate the explicit temperature pin on the model family so the
368+
# Anthropic ``temperature=0`` override in
369+
# ``_structured_response_raw`` activates automatically when
370+
# ``EXTRACT_DEFAULT_MODEL`` is a Claude model (issue #1381).
371+
extract_temperature = _resolve_extract_temperature(EXTRACT_DEFAULT_MODEL)
372+
352373
try:
353374
# Wrap the agent call in the context manager to capture messages
354375
with capture_run_messages() as messages:
355-
# Create a temporary agent and extract
356-
# ``EXTRACT_DEFAULT_TEMPERATURE`` (=0.3) is safe ONLY while
357-
# ``EXTRACT_DEFAULT_MODEL`` is OpenAI; passing it to a
358-
# Claude model would silently regress the issue #1381
359-
# reliability fix (Anthropic structured runs need T=0).
360-
# Both constants live next to each other in
361-
# ``constants/llm.py`` so this coupling stays visible.
362-
# TODO(#1381 follow-up): when the model becomes
363-
# column-configurable, gate the temperature on the model
364-
# family in the call site rather than the constants file.
365376
result = await agents.get_structured_response_from_document(
366377
document=document.id,
367378
corpus=corpus_id,
368379
prompt=prompt,
369380
target_type=output_type,
370381
framework=AgentFramework.PYDANTIC_AI,
371-
temperature=EXTRACT_DEFAULT_TEMPERATURE,
382+
temperature=extract_temperature,
372383
similarity_top_k=similarity_top_k,
373384
model=EXTRACT_DEFAULT_MODEL,
374385
user_id=datacell.creator.id,

0 commit comments

Comments
 (0)