Address PR #1399 review: threshold margin, docs, NO_FINAL wording, doc-agent test

JSv4 · JSv4 · commit 48acc2492695 · 2026-04-29T08:41:32.000-05:00
Resolves the demonstrably actionable items from the latest code review:

* TOOL_LOOP_THRESHOLD bumped from 3 to 4 to keep a strict margin over STRUCTURED_OUTPUT_RETRIES. The previous comment said "Keep >=" but the values were equal. pydantic-ai's output_retries only governs final_result retries today, not regular tool calls, so the earlier 3==3 was not a runtime bug, but the margin makes the guard defensible if pydantic-ai ever extends retry coverage. Tests now derive the boundary from the constant via ``range(TOOL_LOOP_THRESHOLD)`` so future tweaks can't silently invalidate the boundary cases.

* ``_failure_message_for_classification`` for ``no_final_response`` no longer reads "the model exhausted its tool-use budget". That phrase implies tool use happened, but the same classification fires when the model only emitted narrative text with no tools at all. Reworded to cover both sub-cases (text-only / budget-exhausted).

* Removed misleading ``# pragma: no cover`` on the deliberate ``__str__`` raiser in ``test_unhashable_args_fall_back_to_repr``. ``json.dumps(..., default=str)`` does invoke ``str(bad)`` → ``__str__`` at runtime, so the line is genuinely covered; the pragma was contradicting its own "exercised via default=str" comment. Replaced with a clarifying comment.

* CHANGELOG entry adds an explicit cost note: bumping ``output_retries`` from 1 to 3 means a previously-failing Anthropic cell can now incur up to 3 result-tool round trips. That's the correct tradeoff (the prior baseline returned None and produced no answer), but operators tracking Anthropic billing should anticipate the shift. Also corrected the constant reference (DEFAULT_EXTRACT_MODEL, not the now-removed EXTRACT_DEFAULT_MODEL) and updated the TOOL_LOOP_THRESHOLD wording to reflect the new value.

* New ``test_document_agent_inherits_anthropic_temperature_guard`` in ``test_pydantic_ai_agents.py`` covers the inheritance path the reviewer flagged: ``PydanticAIDocumentAgent._structured_response_raw`` must also force ``temperature=0`` for Anthropic models. The base-class test already covered the implementation, but a future override on a subclass could silently regress without an explicit document-agent assertion.

* Updated ``test_core_agent_base_structured_prompt_commits_to_result`` to assert on the merged prompt's actual wording: the old "genuinely cannot be found" phrase was replaced during the origin/main merge with "Only return null/None after multiple search attempts". Now asserts on ``SEARCH PROTOCOL`` and ``multiple search attempts``.

Skipped: review issue #2 (two-step Anthropic temperature mechanism) is already addressed by the merge. ``extract_model`` and ``extract_temperature = _resolve_extract_temperature(extract_model)`` now sit adjacent in ``doc_extract_query_task`` with an explicit "in lock-step with whichever model family will actually run" comment, which is what the reviewer asked for.
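For readers who want the threshold margin and the ``repr`` fallback described above in concrete form, here is a minimal, self-contained sketch. It is not the project's ``_classify_none_result``: the names ``classify`` and ``args_key`` are hypothetical, the input is simplified to a flat list of ``(tool_name, args)`` pairs rather than pydantic-ai message objects, repeats are assumed to be counted in total rather than consecutively, and the ``unknown`` classification is omitted.

```python
import json
from collections import Counter

# Values taken from the diff; everything else below is illustrative.
STRUCTURED_OUTPUT_RETRIES = 3
TOOL_LOOP_THRESHOLD = 4


def args_key(args):
    """Turn tool-call args into a hashable key for repeat counting.

    Mirrors the idea in the test comment: ``default=str`` coerces odd
    values, and if that coercion itself raises, fall back to ``repr``.
    """
    try:
        return json.dumps(args, sort_keys=True, default=str)
    except (TypeError, ValueError):
        return repr(args)


def classify(tool_calls):
    """Classify a run that returned None (hypothetical simplified form).

    ``tool_calls`` is a flat, ordered list of ``(tool_name, args)`` pairs.
    """
    # A committed final result wins over loop detection.
    if any(name.startswith("final_result") for name, _ in tool_calls):
        return "agent_committed_none"

    # Assumption: identical calls are counted across the whole run.
    counts = Counter((name, args_key(args)) for name, args in tool_calls)
    if counts and max(counts.values()) >= TOOL_LOOP_THRESHOLD:
        return "tool_loop_no_output"

    # Covers both text-only runs and runs that used tools but never committed.
    return "no_final_response"
```

Because the boundary cases are built from ``TOOL_LOOP_THRESHOLD`` rather than literal list lengths, a sketch like this keeps passing the same checks if the constant is changed again, which is the property the updated tests rely on.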
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -39,9 +39,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 
 - **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
-  - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text).
-  - **`opencontractserver/constants/llm.py`** — Hosts `is_anthropic_model()` (`anthropic:` prefix or bare `claude` substring detector). Lives next to `EXTRACT_DEFAULT_MODEL` / `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
-  - **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `EXTRACT_DEFAULT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
+  - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text). **Cost note**: bumping `output_retries` from pydantic-ai's default of 1 to 3 means a cell that fails the first commit attempt can incur up to 3 result-tool round trips. In the previous behaviour those cells silently returned `None` instead of retrying, so the worst-case per-cell LLM cost on Anthropic models can roughly triple compared to the broken baseline. This is the correct tradeoff — the previous baseline produced no answer — but operators should anticipate the cost shift in billing for Anthropic-driven extractions.
+  - **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below). `is_anthropic_model()` lives next to `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
+  - **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ `TOOL_LOOP_THRESHOLD`× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `DEFAULT_EXTRACT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
   - **`opencontractserver/tests/test_data_extract_failure_classification.py`** — `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
   - **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
 - **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
diff --git a/opencontractserver/constants/llm.py b/opencontractserver/constants/llm.py
@@ -7,9 +7,14 @@
 STRUCTURED_OUTPUT_RETRIES = 3
 
 # Same-call repetition count that ``_classify_none_result`` treats as a
-# pipeline bug (post-mortem heuristic, not a pydantic-ai input). Keep
-# >= STRUCTURED_OUTPUT_RETRIES so legitimate retries don't trip it.
-TOOL_LOOP_THRESHOLD = 3
+# pipeline bug (post-mortem heuristic, not a pydantic-ai input). Kept
+# strictly greater than ``STRUCTURED_OUTPUT_RETRIES`` so a worst-case
+# legitimate ``final_result`` retry budget can never trip the loop
+# detector even though pydantic-ai's ``output_retries`` only governs
+# ``final_result`` retries today (regular tool calls are not retried by
+# the same budget). The margin is defensive: if pydantic-ai ever extends
+# retry coverage to other tools, the threshold still has headroom.
+TOOL_LOOP_THRESHOLD = 4
 
 # Vocabulary written to ``Datacell.stacktrace`` as ``failure_mode=...``.
 # Operators grep these; changing the strings breaks downstream dashboards.
diff --git a/opencontractserver/tasks/data_extract_tasks.py b/opencontractserver/tasks/data_extract_tasks.py
@@ -134,10 +134,11 @@ def _failure_message_for_classification(classification: str) -> str:
     elif classification == NONE_RESULT_NO_FINAL:
         return (
             "The extraction agent never produced a final structured response. "
-            "This is an integration failure (the model exhausted its tool-use "
-            "budget without committing to the result tool), not a statement "
-            "about the document. Check ``llm_call_log`` for the raw message "
-            "history."
+            "This is an integration failure — either the model only emitted "
+            "narrative text without ever calling the result tool, or it "
+            "exhausted its tool-use budget without committing — not a "
+            "statement about the document. Check ``llm_call_log`` for the "
+            "raw message history."
         )
     elif classification == NONE_RESULT_TOOL_LOOP:
         return (
diff --git a/opencontractserver/tests/test_data_extract_failure_classification.py b/opencontractserver/tests/test_data_extract_failure_classification.py
@@ -18,6 +18,7 @@
     NONE_RESULT_NO_FINAL,
     NONE_RESULT_TOOL_LOOP,
     NONE_RESULT_UNKNOWN,
+    TOOL_LOOP_THRESHOLD,
 )
 from opencontractserver.tasks.data_extract_tasks import (
     _classify_none_result,
@@ -84,33 +85,23 @@ def test_tool_calls_without_final_is_no_final(self) -> None:
     def test_repeated_tool_call_classifies_as_tool_loop(self) -> None:
         """Same tool call repeated >= threshold without final ⇒ tool_loop."""
         repeated = _tool_call("similarity_search", {"query": "the same thing"})
-        messages = [
-            _make_response(repeated),
-            _make_response(repeated),
-            _make_response(repeated),
-        ]
+        messages = [_make_response(repeated) for _ in range(TOOL_LOOP_THRESHOLD)]
         self.assertEqual(_classify_none_result(messages), NONE_RESULT_TOOL_LOOP)
 
     def test_repeats_below_threshold_are_not_tool_loop(self) -> None:
-        """Two repeats (threshold - 1) ⇒ no_final_response, not tool_loop.
+        """``TOOL_LOOP_THRESHOLD - 1`` repeats ⇒ no_final_response, not tool_loop.
 
         Pins the boundary so a future tweak of ``TOOL_LOOP_THRESHOLD``
         forces this test to be updated explicitly.
         """
         repeated = _tool_call("similarity_search", {"query": "same"})
-        messages = [
-            _make_response(repeated),
-            _make_response(repeated),
-        ]
+        messages = [_make_response(repeated) for _ in range(TOOL_LOOP_THRESHOLD - 1)]
         self.assertEqual(_classify_none_result(messages), NONE_RESULT_NO_FINAL)
 
     def test_loop_then_final_is_committed_not_loop(self) -> None:
         """If the agent eventually commits, that wins over loop detection."""
         repeated = _tool_call("similarity_search", {"query": "loop"})
-        messages = [
-            _make_response(repeated),
-            _make_response(repeated),
-            _make_response(repeated),
+        messages = [_make_response(repeated) for _ in range(TOOL_LOOP_THRESHOLD)] + [
             _make_response(_tool_call("final_result", {"value": None})),
         ]
         self.assertEqual(_classify_none_result(messages), NONE_RESULT_AGENT_COMMITTED)
@@ -188,7 +179,11 @@ class _BadArgs:
             def __repr__(self) -> str:
                 return "<bad>"
 
-            def __str__(self) -> str:  # pragma: no cover - exercised via default=str
+            def __str__(self) -> str:
+                # Invoked by ``json.dumps(..., default=str)`` when it tries
+                # to coerce ``_BadArgs``; deliberately raises so the
+                # classifier's ``except (TypeError, ValueError)`` branch
+                # takes over.
                 raise TypeError("nope")
 
         # Two repeats below threshold ⇒ no_final_response; the test is that
diff --git a/opencontractserver/tests/test_pydantic_ai_agents.py b/opencontractserver/tests/test_pydantic_ai_agents.py
@@ -1517,12 +1517,17 @@ class DummyOutput(BaseModel):
 
         prompt = agent._build_structured_system_prompt(DummyOutput, "user query")
 
-        # Universal phrasing for both Anthropic and OpenAI
+        # Universal phrasing for both Anthropic and OpenAI: agent must
+        # commit to the result tool after gathering data (issue #1381).
         self.assertIn("MUST", prompt)
         self.assertIn("result tool", prompt)
-        # Negative wording must mention "genuinely" so the model only
-        # bails out when the data is actually absent.
-        self.assertIn("genuinely cannot be found", prompt)
+        # SEARCH PROTOCOL nudge: don't bail after a single failed query.
+        self.assertIn("SEARCH PROTOCOL", prompt)
+        self.assertIn(
+            "multiple search attempts",
+            prompt,
+            "Negative case must require multiple attempts before committing to None",
+        )
 
     @patch("opencontractserver.llms.agents.pydantic_ai_agents.PydanticAIAgent")
     async def test_structured_response_openai_skips_anthropic_override(
@@ -1554,3 +1559,80 @@ class DummyOutput(BaseModel):
             structured_call.kwargs.get("model_settings"),
             "OpenAI structured runs without pins must pass model_settings=None",
         )
+
+    @patch("opencontractserver.llms.agents.pydantic_ai_agents.PydanticAIAgent")
+    async def test_document_agent_inherits_anthropic_temperature_guard(
+        self, mock_pyd_ai_cls: MagicMock
+    ) -> None:
+        """``PydanticAIDocumentAgent`` inherits ``_structured_response_raw``
+        from ``PydanticAICoreAgent``, so the Anthropic ``temperature=0``
+        guard must fire when extraction runs on a document agent too.
+
+        The base-class test covers the implementation itself; this test
+        explicitly exercises the inheritance path so a future override
+        on a subclass cannot silently regress the fix.
+        """
+        from opencontractserver.constants.llm import STRUCTURED_OUTPUT_RETRIES
+        from opencontractserver.llms.agents.core_agents import (
+            AgentConfig,
+            CoreConversationManager,
+            DocumentAgentContext,
+        )
+        from opencontractserver.llms.agents.pydantic_ai_agents import _HistoryResult
+
+        # Wire up the structured-agent mock the same way the core helper does.
+        structured_run_result = MagicMock()
+        structured_run_result.output = "extracted-value"
+        structured_agent_mock = MagicMock()
+        structured_agent_mock.run = AsyncMock(return_value=structured_run_result)
+        mock_pyd_ai_cls.return_value = structured_agent_mock
+
+        config = AgentConfig(
+            user_id=self.user.id,
+            model_name="anthropic:claude-sonnet-4-6",
+            temperature=None,
+        )
+        ctx = MagicMock(spec=DocumentAgentContext)
+        ctx.document = self.doc1
+        ctx.config = config
+        conv_mgr = MagicMock(spec=CoreConversationManager)
+        conv_mgr.conversation = None
+        conv_mgr.config = config
+
+        prebuilt_agent = MagicMock()
+        prebuilt_agent._function_tools = {}
+
+        agent = PydanticAIDocumentAgent(
+            context=ctx,
+            conversation_manager=conv_mgr,
+            pydantic_ai_agent=prebuilt_agent,
+            agent_deps=MagicMock(),
+        )
+        agent._get_message_history = AsyncMock(
+            return_value=_HistoryResult(messages=None)
+        )
+
+        class DummyOutput(BaseModel):
+            value: str
+
+        await agent._structured_response_raw(
+            prompt="any",
+            target_type=DummyOutput,
+            model="anthropic:claude-sonnet-4-6",
+            temperature=None,
+        )
+
+        structured_call = self._structured_agent_call(mock_pyd_ai_cls)
+        self.assertIsNotNone(
+            structured_call,
+            "Document agent must reach the inherited _structured_response_raw",
+        )
+        self.assertEqual(
+            structured_call.kwargs["model_settings"]["temperature"],
+            0,
+            "Document agents must also force Anthropic temperature=0 (issue #1381)",
+        )
+        self.assertEqual(
+            structured_call.kwargs["output_retries"],
+            STRUCTURED_OUTPUT_RETRIES,
+        )
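
The temperature guard that the test above exercises can be sketched in isolation. This is a hedged illustration, not the code in ``pydantic_ai_agents.py`` or ``data_extract_tasks.py``: ``effective_model_settings`` is a hypothetical helper, the case-insensitive match is an assumption, and so is the precedence of a function-level pin over a ``config.temperature`` pin. Only the ``anthropic:`` prefix / ``claude`` substring rule, the 0.3 default, and the "force 0 when nothing is pinned" behaviour come from the CHANGELOG text.

```python
from typing import Optional

EXTRACT_DEFAULT_TEMPERATURE = 0.3  # value quoted in the CHANGELOG entry


def is_anthropic_model(model_name: Optional[str]) -> bool:
    """``anthropic:`` prefix or bare ``claude`` substring (lowercasing assumed)."""
    if not model_name:
        return False
    name = model_name.lower()
    return name.startswith("anthropic:") or "claude" in name


def resolve_extract_temperature(model_name: Optional[str]) -> Optional[float]:
    """Return None for Anthropic so the downstream guard can force 0."""
    if is_anthropic_model(model_name):
        return None
    return EXTRACT_DEFAULT_TEMPERATURE


def effective_model_settings(
    model_name: Optional[str],
    call_temperature: Optional[float] = None,
    config_temperature: Optional[float] = None,
) -> Optional[dict]:
    """Hypothetical stand-in for the guard inside the structured runner."""
    # Explicit pins are respected (assumed order: call-level, then config).
    if call_temperature is not None:
        return {"temperature": call_temperature}
    if config_temperature is not None:
        return {"temperature": config_temperature}
    # No pin: Anthropic models are forced to temperature=0.
    if is_anthropic_model(model_name):
        return {"temperature": 0}
    # No pin, non-Anthropic model: leave settings untouched.
    return None
```

Passing ``None`` from the task layer for Anthropic models, instead of the 0.3 default, is what lets the unpinned branch fire; passing 0.3 unconditionally would count as a pin and bypass it.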