Close Anthropic temperature-coupling footgun + lift patch coverage

JSv4 · JSv4 · commit b9704b7f5e89 · 2026-04-28T23:07:47.000-05:00
Hoist is_anthropic_model into constants/llm.py so doc_extract_query_task can decide whether to pass temperature=None and let the agent-layer guard apply temperature=0 automatically. New _resolve_extract_temperature helper makes the model-family -> temperature coupling unit-testable and prevents a silent regression of #1381 if EXTRACT_DEFAULT_MODEL is ever flipped to a Claude model. Adds Anthropic temperature-override tests on _structured_response_raw, smoke tests for the strengthened _build_structured_system_prompt overrides, and defensive-path tests for _classify_none_result.
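
The coupling described above can be sketched in isolation. This is a minimal, standalone approximation rather than the project's code: `is_anthropic_model` and the 0.3 default mirror what the diff below adds, while `resolve_extract_temperature` is a de-underscored copy of the new helper and `effective_temperature` is a hypothetical stand-in for the guard inside `_structured_response_raw`. The fallback order in `effective_temperature` (function-level pin first, then the config value) is an assumption made for illustration.

```python
from typing import Optional

# Value taken from constants/llm.py as shown in this commit.
EXTRACT_DEFAULT_TEMPERATURE = 0.3


def is_anthropic_model(model_name: Optional[str]) -> bool:
    """True for an ``anthropic:`` prefix or any name containing ``claude``."""
    if not model_name:
        return False
    name = model_name.lower()
    return name.startswith("anthropic:") or "claude" in name


def resolve_extract_temperature(model_name: Optional[str]) -> Optional[float]:
    """Return None for Anthropic models so the downstream guard can fire."""
    if is_anthropic_model(model_name):
        return None
    return EXTRACT_DEFAULT_TEMPERATURE


def effective_temperature(
    model_name: Optional[str],
    call_temperature: Optional[float],
    config_temperature: Optional[float],
) -> Optional[float]:
    """Hypothetical stand-in for the guard in ``_structured_response_raw``.

    Forces 0 only when the model is Anthropic and nobody pinned a value.
    The fallback order below is an assumption, not the project's code.
    """
    if (
        is_anthropic_model(model_name)
        and call_temperature is None
        and config_temperature is None
    ):
        return 0.0
    if call_temperature is not None:
        return call_temperature
    return config_temperature
```

Under these assumptions, passing the old unconditional pin to a Claude model (`effective_temperature("claude-3-opus", 0.3, None)`) yields 0.3 and skips the guard, whereas routing through `resolve_extract_temperature` first yields `None` and lets the guard settle on 0. The model names used here are illustrative only.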
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -39,9 +39,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 
 - **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
-  - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent` line 491, `PydanticAIDocumentAgent` line 1971, `PydanticAICorpusAgent` line 2476) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` (line 1321) now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. New `_is_anthropic_model()` helper detects `anthropic:` prefix or bare `claude` substring; when an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is already reluctant to commit; non-zero temperature pushes it toward more exploratory text).
-  - **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs.
-  - **`opencontractserver/tests/test_data_extract_failure_classification.py`** — New `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, loop-then-commit precedence, mixed text + tool path) and `_is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None).
+  - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text).
+  - **`opencontractserver/constants/llm.py`** — Hosts `is_anthropic_model()` (`anthropic:` prefix or bare `claude` substring detector). Lives next to `EXTRACT_DEFAULT_MODEL` / `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
+  - **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `EXTRACT_DEFAULT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
+  - **`opencontractserver/tests/test_data_extract_failure_classification.py`** — `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
+  - **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
 - **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
   - **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
   - **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
diff --git a/opencontractserver/constants/llm.py b/opencontractserver/constants/llm.py
@@ -8,6 +8,8 @@
 instead of chasing literals across modules.
 """
 
+from typing import Optional
+
 # Retry budget passed to ``PydanticAIAgent`` for structured extraction.
 # pydantic-ai's default is 1; Claude/Anthropic models routinely fail to
 # call ``final_result`` on the first turn for sparse documents and we
@@ -36,11 +38,28 @@
 NONE_RESULT_UNKNOWN = "unknown"
 
 # Default model for ``doc_extract_query_task``.  Co-located with
-# ``EXTRACT_DEFAULT_TEMPERATURE`` so the model/temperature relationship is
-# visible in one place: ``EXTRACT_DEFAULT_TEMPERATURE`` is safe ONLY
-# because this model is OpenAI.  Switching to a Claude model here without
-# also dropping the temperature override silently regresses the issue
-# #1381 reliability fix — see ``_is_anthropic_model`` and
-# ``_structured_response_raw`` in ``pydantic_ai_agents.py``.
+# ``EXTRACT_DEFAULT_TEMPERATURE`` and the ``is_anthropic_model`` helper
+# below so the model/family/temperature relationship lives in one place.
+# Call sites must pass ``temperature=None`` when the model family is
+# Anthropic so the structured-extraction guard in
+# ``_structured_response_raw`` can apply ``temperature=0`` automatically
+# (issue #1381).
 EXTRACT_DEFAULT_MODEL = "openai:gpt-4o-mini"
 EXTRACT_DEFAULT_TEMPERATURE = 0.3
+
+
+def is_anthropic_model(model_name: Optional[str]) -> bool:
+    """Return True if ``model_name`` looks like an Anthropic / Claude model.
+
+    Accepts both pydantic-ai-style ``"anthropic:..."`` prefixes and bare
+    model names containing ``"claude"``.  Lives in ``constants/llm.py``
+    rather than in an agent module because call sites outside the agents
+    layer (notably ``data_extract_tasks.doc_extract_query_task``) need to
+    decide whether to pass ``temperature=None`` so the Anthropic guard in
+    ``_structured_response_raw`` activates.  Pure stateless string check —
+    no imports beyond ``typing``.
+    """
+    if not model_name:
+        return False
+    name = model_name.lower()
+    return name.startswith("anthropic:") or "claude" in name
diff --git a/opencontractserver/llms/agents/pydantic_ai_agents.py b/opencontractserver/llms/agents/pydantic_ai_agents.py
@@ -31,7 +31,10 @@
 from pydantic_graph import End
 
 from opencontractserver.constants.context_guardrails import COMPACTION_SUMMARY_PREFIX
-from opencontractserver.constants.llm import STRUCTURED_OUTPUT_RETRIES
+from opencontractserver.constants.llm import (
+    STRUCTURED_OUTPUT_RETRIES,
+    is_anthropic_model,
+)
 from opencontractserver.conversations.models import Conversation
 from opencontractserver.corpuses.models import Corpus
 from opencontractserver.documents.models import Document
@@ -105,19 +108,6 @@
 T = TypeVar("T")
 
 
-def _is_anthropic_model(model_name: Optional[str]) -> bool:
-    """Return True if ``model_name`` looks like an Anthropic / Claude model.
-
-    Accepts both pydantic-ai-style ``"anthropic:..."`` prefixes and bare model
-    names containing ``"claude"``. Used to apply Anthropic-specific structured
-    extraction tweaks (lower temperature, etc.).
-    """
-    if not model_name:
-        return False
-    name = model_name.lower()
-    return name.startswith("anthropic:") or "claude" in name
-
-
 def _get_function_tools(agent: PydanticAIAgent) -> dict:
     """Return the function-tools dict from a pydantic-ai Agent.
 
@@ -1357,7 +1347,7 @@ async def _structured_response_raw(
             # temperature pin OR an explicit config.temperature).
             effective_model = model or self.config.model_name
             if (
-                _is_anthropic_model(effective_model)
+                is_anthropic_model(effective_model)
                 and temperature is None
                 and self.config.temperature is None
             ):
diff --git a/opencontractserver/tasks/data_extract_tasks.py b/opencontractserver/tasks/data_extract_tasks.py
@@ -17,6 +17,7 @@
     NONE_RESULT_TOOL_LOOP,
     NONE_RESULT_UNKNOWN,
     TOOL_LOOP_THRESHOLD,
+    is_anthropic_model,
 )
 from opencontractserver.extracts.models import Datacell
 from opencontractserver.shared.decorators import celery_task_with_async_to_sync
@@ -114,6 +115,20 @@ def _classify_none_result(messages: Optional[list[Any]]) -> str:
     return NONE_RESULT_NO_FINAL
 
 
+def _resolve_extract_temperature(model_name: Optional[str]) -> Optional[float]:
+    """Pick the temperature to pass to ``get_structured_response_from_document``.
+
+    Returns ``None`` when ``model_name`` is an Anthropic / Claude model so the
+    Anthropic guard in ``_structured_response_raw`` can apply ``temperature=0``
+    automatically (issue #1381). Returns :data:`EXTRACT_DEFAULT_TEMPERATURE`
+    otherwise. Pulled out as a helper so the model-family→temperature
+    coupling is unit-testable without standing up the full extract task.
+    """
+    if is_anthropic_model(model_name):
+        return None
+    return EXTRACT_DEFAULT_TEMPERATURE
+
+
 def _failure_message_for_classification(classification: str) -> str:
     """Human-readable failure message for a ``NONE_RESULT_*`` classification."""
     if classification == NONE_RESULT_AGENT_COMMITTED:
@@ -349,26 +364,22 @@ def sync_add_sources(datacell, sources):
         # Capture LLM messages for debugging
         messages: Optional[list[Any]] = None
 
+        # Gate the explicit temperature pin on the model family so the
+        # Anthropic ``temperature=0`` override in
+        # ``_structured_response_raw`` activates automatically when
+        # ``EXTRACT_DEFAULT_MODEL`` is a Claude model (issue #1381).
+        extract_temperature = _resolve_extract_temperature(EXTRACT_DEFAULT_MODEL)
+
         try:
             # Wrap the agent call in the context manager to capture messages
             with capture_run_messages() as messages:
-                # Create a temporary agent and extract
-                # ``EXTRACT_DEFAULT_TEMPERATURE`` (=0.3) is safe ONLY while
-                # ``EXTRACT_DEFAULT_MODEL`` is OpenAI; passing it to a
-                # Claude model would silently regress the issue #1381
-                # reliability fix (Anthropic structured runs need T=0).
-                # Both constants live next to each other in
-                # ``constants/llm.py`` so this coupling stays visible.
-                # TODO(#1381 follow-up): when the model becomes
-                # column-configurable, gate the temperature on the model
-                # family in the call site rather than the constants file.
                 result = await agents.get_structured_response_from_document(
                     document=document.id,
                     corpus=corpus_id,
                     prompt=prompt,
                     target_type=output_type,
                     framework=AgentFramework.PYDANTIC_AI,
-                    temperature=EXTRACT_DEFAULT_TEMPERATURE,
+                    temperature=extract_temperature,
                     similarity_top_k=similarity_top_k,
                     model=EXTRACT_DEFAULT_MODEL,
                     user_id=datacell.creator.id,
diff --git a/opencontractserver/tests/test_data_extract_failure_classification.py b/opencontractserver/tests/test_data_extract_failure_classification.py
diff --git a/opencontractserver/tests/test_pydantic_ai_agents.py b/opencontractserver/tests/test_pydantic_ai_agents.py