Address review: fix loop-detection test threshold, correct CHANGELOG, trim history comment

JSv4 · JSv4 · commit 5559621b3ffe · 2026-04-29T21:22:35.000-05:00
- test_json_string_args_are_normalised_for_loop_detection: feed
  TOOL_LOOP_THRESHOLD alternating dict/str calls so the loop detector
  actually fires (the prior 3-call form was below the threshold of 4
  and silently failed CI)
- CHANGELOG: is_anthropic_model() lives in utils/llm.py, not
  constants/llm.py
- test_data_extract_helpers.py: drop the PR/draft-history sentences
  per CLAUDE.md, keep the pointer to the canonical test file
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -40,7 +40,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
   - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text). **Cost note**: bumping `output_retries` from pydantic-ai's default of 1 to 3 means a cell that fails the first commit attempt can incur up to 3 result-tool round trips. In the previous behaviour those cells silently returned `None` instead of retrying, so the worst-case per-cell LLM cost on Anthropic models can roughly triple compared to the broken baseline. This is the correct tradeoff — the previous baseline produced no answer — but operators should anticipate the cost shift in billing for Anthropic-driven extractions.
-  - **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below). `is_anthropic_model()` lives next to `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
+  - **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, `EXTRACT_DEFAULT_TEMPERATURE`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below).
+  - **`opencontractserver/utils/llm.py`** — New `is_anthropic_model()` helper so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
   - **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ `TOOL_LOOP_THRESHOLD`× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `DEFAULT_EXTRACT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
   - **`opencontractserver/tests/test_data_extract_failure_classification.py`** — `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
   - **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
diff --git a/opencontractserver/tests/test_data_extract_failure_classification.py b/opencontractserver/tests/test_data_extract_failure_classification.py
@@ -150,14 +150,17 @@ def test_json_string_args_are_normalised_for_loop_detection(self) -> None:
         Pydantic-AI emits ``ArgsJson`` (str) and ``ArgsDict`` (dict)
         interchangeably across model providers; conflating them in the
         Counter keeps loop detection accurate (issue #1381).
+
+        Alternating dict/str args ``TOOL_LOOP_THRESHOLD`` times must trip
+        the loop detector — without normalisation each variant would only
+        count to half the threshold and the loop would be missed.
         """
         dict_call = _tool_call("similarity_search", {"query": "same"})
         # Same logical args, but as a JSON string.
         str_call = _tool_call("similarity_search", '{"query": "same"}')
         messages = [
-            _make_response(dict_call),
-            _make_response(str_call),
-            _make_response(dict_call),
+            _make_response(dict_call if i % 2 == 0 else str_call)
+            for i in range(TOOL_LOOP_THRESHOLD)
         ]
         self.assertEqual(_classify_none_result(messages), NONE_RESULT_TOOL_LOOP)
 
diff --git a/opencontractserver/tests/test_data_extract_helpers.py b/opencontractserver/tests/test_data_extract_helpers.py
@@ -165,13 +165,9 @@ def test_empty_input_is_a_noop(self) -> None:
         self.assertEqual(datacell.sources.count(), 0)
 
 
-# NOTE: failure-mode classification is covered by
-# ``test_data_extract_failure_classification.py`` (issue #1381 PR), which
-# exercises the canonical ``_classify_none_result`` returning the
-# ``NONE_RESULT_*`` constants. The earlier draft of this file shipped a
-# parallel duck-typed classifier with a tuple-return shape; that variant
-# was superseded during the merge and its tests were dropped to keep a
-# single source of truth for failure-mode semantics.
+# Failure-mode classification (`_classify_none_result` and the
+# `NONE_RESULT_*` constants) is covered by
+# ``test_data_extract_failure_classification.py``.
 
 
 class CrossEncoderRerankerTests(TestCase):