Skip to content

Commit 5559621

Browse files
committed
Address review: fix loop-detection test threshold, correct CHANGELOG, trim history comment
- test_json_string_args_are_normalised_for_loop_detection: feed TOOL_LOOP_THRESHOLD alternating dict/str calls so the loop detector actually fires (the prior 3-call form was below the threshold of 4 and silently failed CI) - CHANGELOG: is_anthropic_model() lives in utils/llm.py, not constants/llm.py - test_data_extract_helpers.py: drop the PR/draft-history sentences per CLAUDE.md, keep the pointer to the canonical test file
1 parent 48acc24 commit 5559621

3 files changed

Lines changed: 11 additions & 11 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4040

4141
- **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
4242
- **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text). **Cost note**: bumping `output_retries` from pydantic-ai's default of 1 to 3 means a cell that fails the first commit attempt can incur up to 3 result-tool round trips. In the previous behaviour those cells silently returned `None` instead of retrying, so the worst-case per-cell LLM cost on Anthropic models can roughly triple compared to the broken baseline. This is the correct tradeoff — the previous baseline produced no answer — but operators should anticipate the cost shift in billing for Anthropic-driven extractions.
43-
- **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below). `is_anthropic_model()` lives next to `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
43+
- **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, `EXTRACT_DEFAULT_TEMPERATURE`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below).
44+
- **`opencontractserver/utils/llm.py`** — New `is_anthropic_model()` helper so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
4445
- **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ `TOOL_LOOP_THRESHOLD`× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `DEFAULT_EXTRACT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
4546
- **`opencontractserver/tests/test_data_extract_failure_classification.py`**`SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
4647
- **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.

opencontractserver/tests/test_data_extract_failure_classification.py

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -150,14 +150,17 @@ def test_json_string_args_are_normalised_for_loop_detection(self) -> None:
150150
Pydantic-AI emits ``ArgsJson`` (str) and ``ArgsDict`` (dict)
151151
interchangeably across model providers; conflating them in the
152152
Counter keeps loop detection accurate (issue #1381).
153+
154+
Alternating dict/str args ``TOOL_LOOP_THRESHOLD`` times must trip
155+
the loop detector — without normalisation each variant would only
156+
count to half the threshold and the loop would be missed.
153157
"""
154158
dict_call = _tool_call("similarity_search", {"query": "same"})
155159
# Same logical args, but as a JSON string.
156160
str_call = _tool_call("similarity_search", '{"query": "same"}')
157161
messages = [
158-
_make_response(dict_call),
159-
_make_response(str_call),
160-
_make_response(dict_call),
162+
_make_response(dict_call if i % 2 == 0 else str_call)
163+
for i in range(TOOL_LOOP_THRESHOLD)
161164
]
162165
self.assertEqual(_classify_none_result(messages), NONE_RESULT_TOOL_LOOP)
163166

opencontractserver/tests/test_data_extract_helpers.py

Lines changed: 3 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -165,13 +165,9 @@ def test_empty_input_is_a_noop(self) -> None:
165165
self.assertEqual(datacell.sources.count(), 0)
166166

167167

168-
# NOTE: failure-mode classification is covered by
169-
# ``test_data_extract_failure_classification.py`` (issue #1381 PR), which
170-
# exercises the canonical ``_classify_none_result`` returning the
171-
# ``NONE_RESULT_*`` constants. The earlier draft of this file shipped a
172-
# parallel duck-typed classifier with a tuple-return shape; that variant
173-
# was superseded during the merge and its tests were dropped to keep a
174-
# single source of truth for failure-mode semantics.
168+
# Failure-mode classification (`_classify_none_result` and the
169+
# `NONE_RESULT_*` constants) is covered by
170+
# ``test_data_extract_failure_classification.py``.
175171

176172

177173
class CrossEncoderRerankerTests(TestCase):

0 commit comments

Comments
 (0)