Skip to content

Commit 48acc24

Browse files
committed
Address PR #1399 review: threshold margin, docs, NO_FINAL wording, doc-agent test
Resolves the demonstrably actionable items from the latest code review: * TOOL_LOOP_THRESHOLD bumped from 3 to 4 to keep a strict margin over STRUCTURED_OUTPUT_RETRIES. The previous comment said "Keep >=" but the values were equal — pydantic-ai's output_retries only governs final_result retries today, not regular tool calls, so the earlier 3==3 was not a runtime bug, but the margin makes the guard defensible if pydantic-ai ever extends retry coverage. Tests now derive the boundary from the constant via ``range(TOOL_LOOP_THRESHOLD)`` so future tweaks can't silently invalidate the boundary cases. * ``_failure_message_for_classification`` for ``no_final_response`` no longer reads "the model exhausted its tool-use budget" — that phrase implies tool use happened, but the same classification fires when the model only emitted narrative text with no tools at all. Reworded to cover both sub-cases (text-only / budget-exhausted). * Removed misleading ``# pragma: no cover`` on the deliberate ``__str__`` raiser in ``test_unhashable_args_fall_back_to_repr``. ``json.dumps(..., default=str)`` does invoke ``str(bad)`` → ``__str__`` at runtime, so the line is genuinely covered; the pragma was contradicting its own "exercised via default=str" comment. Replaced with a clarifying comment. * CHANGELOG entry adds an explicit cost note: bumping ``output_retries`` from 1 to 3 means a previously-failing Anthropic cell can now incur up to 3 result-tool round trips. That's the correct tradeoff (the prior baseline returned None and produced no answer), but operators tracking Anthropic billing should anticipate the shift. Also corrected the constant reference (DEFAULT_EXTRACT_MODEL, not the now-removed EXTRACT_DEFAULT_MODEL) and updated the TOOL_LOOP_THRESHOLD wording to reflect the new value. * New ``test_document_agent_inherits_anthropic_temperature_guard`` in ``test_pydantic_ai_agents.py`` covers the inheritance path the reviewer flagged: ``PydanticAIDocumentAgent._structured_response_raw`` must also force ``temperature=0`` for Anthropic models. The base-class test already covered the implementation, but a future override on a subclass could silently regress without an explicit document-agent assertion. * Updated ``test_core_agent_base_structured_prompt_commits_to_result`` to assert on the merged prompt's actual wording — the old "genuinely cannot be found" phrase was replaced during the origin/main merge with "Only return null/None after multiple search attempts". Now asserts on ``SEARCH PROTOCOL`` and ``multiple search attempts``. Skipped: review issue #2 (two-step Anthropic temperature mechanism) is already addressed by the merge — ``extract_model`` and ``extract_temperature = _resolve_extract_temperature(extract_model)`` now sit adjacent in ``doc_extract_query_task`` with an explicit "in lock-step with whichever model family will actually run" comment, which is what the reviewer asked for.
1 parent 2c20d67 commit 48acc24

5 files changed

Lines changed: 112 additions & 29 deletions

File tree

CHANGELOG.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -39,9 +39,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
3939
### Fixed
4040

4141
- **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
42-
- **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text).
43-
- **`opencontractserver/constants/llm.py`** — Hosts `is_anthropic_model()` (`anthropic:` prefix or bare `claude` substring detector). Lives next to `EXTRACT_DEFAULT_MODEL` / `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
44-
- **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `EXTRACT_DEFAULT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
42+
- **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text). **Cost note**: bumping `output_retries` from pydantic-ai's default of 1 to 3 means a cell that fails the first commit attempt can incur up to 3 result-tool round trips. In the previous behaviour those cells silently returned `None` instead of retrying, so the worst-case per-cell LLM cost on Anthropic models can roughly triple compared to the broken baseline. This is the correct tradeoff — the previous baseline produced no answer — but operators should anticipate the cost shift in billing for Anthropic-driven extractions.
43+
- **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below). `is_anthropic_model()` lives next to `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
44+
- **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ `TOOL_LOOP_THRESHOLD`× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `DEFAULT_EXTRACT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
4545
- **`opencontractserver/tests/test_data_extract_failure_classification.py`**`SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
4646
- **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
4747
- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):

opencontractserver/constants/llm.py

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -7,9 +7,14 @@
77
STRUCTURED_OUTPUT_RETRIES = 3
88

99
# Same-call repetition count that ``_classify_none_result`` treats as a
10-
# pipeline bug (post-mortem heuristic, not a pydantic-ai input). Keep
11-
# >= STRUCTURED_OUTPUT_RETRIES so legitimate retries don't trip it.
12-
TOOL_LOOP_THRESHOLD = 3
10+
# pipeline bug (post-mortem heuristic, not a pydantic-ai input). Kept
11+
# strictly greater than ``STRUCTURED_OUTPUT_RETRIES`` so a worst-case
12+
# legitimate ``final_result`` retry budget can never trip the loop
13+
# detector even though pydantic-ai's ``output_retries`` only governs
14+
# ``final_result`` retries today (regular tool calls are not retried by
15+
# the same budget). The margin is defensive: if pydantic-ai ever extends
16+
# retry coverage to other tools, the threshold still has headroom.
17+
TOOL_LOOP_THRESHOLD = 4
1318

1419
# Vocabulary written to ``Datacell.stacktrace`` as ``failure_mode=...``.
1520
# Operators grep these; changing the strings breaks downstream dashboards.

opencontractserver/tasks/data_extract_tasks.py

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -134,10 +134,11 @@ def _failure_message_for_classification(classification: str) -> str:
134134
elif classification == NONE_RESULT_NO_FINAL:
135135
return (
136136
"The extraction agent never produced a final structured response. "
137-
"This is an integration failure (the model exhausted its tool-use "
138-
"budget without committing to the result tool), not a statement "
139-
"about the document. Check ``llm_call_log`` for the raw message "
140-
"history."
137+
"This is an integration failure — either the model only emitted "
138+
"narrative text without ever calling the result tool, or it "
139+
"exhausted its tool-use budget without committing — not a "
140+
"statement about the document. Check ``llm_call_log`` for the "
141+
"raw message history."
141142
)
142143
elif classification == NONE_RESULT_TOOL_LOOP:
143144
return (

opencontractserver/tests/test_data_extract_failure_classification.py

Lines changed: 10 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@
1818
NONE_RESULT_NO_FINAL,
1919
NONE_RESULT_TOOL_LOOP,
2020
NONE_RESULT_UNKNOWN,
21+
TOOL_LOOP_THRESHOLD,
2122
)
2223
from opencontractserver.tasks.data_extract_tasks import (
2324
_classify_none_result,
@@ -84,33 +85,23 @@ def test_tool_calls_without_final_is_no_final(self) -> None:
8485
def test_repeated_tool_call_classifies_as_tool_loop(self) -> None:
8586
"""Same tool call repeated >= threshold without final ⇒ tool_loop."""
8687
repeated = _tool_call("similarity_search", {"query": "the same thing"})
87-
messages = [
88-
_make_response(repeated),
89-
_make_response(repeated),
90-
_make_response(repeated),
91-
]
88+
messages = [_make_response(repeated) for _ in range(TOOL_LOOP_THRESHOLD)]
9289
self.assertEqual(_classify_none_result(messages), NONE_RESULT_TOOL_LOOP)
9390

9491
def test_repeats_below_threshold_are_not_tool_loop(self) -> None:
95-
"""Two repeats (threshold - 1) ⇒ no_final_response, not tool_loop.
92+
"""``TOOL_LOOP_THRESHOLD - 1`` repeats ⇒ no_final_response, not tool_loop.
9693
9794
Pins the boundary so a future tweak of ``TOOL_LOOP_THRESHOLD``
9895
forces this test to be updated explicitly.
9996
"""
10097
repeated = _tool_call("similarity_search", {"query": "same"})
101-
messages = [
102-
_make_response(repeated),
103-
_make_response(repeated),
104-
]
98+
messages = [_make_response(repeated) for _ in range(TOOL_LOOP_THRESHOLD - 1)]
10599
self.assertEqual(_classify_none_result(messages), NONE_RESULT_NO_FINAL)
106100

107101
def test_loop_then_final_is_committed_not_loop(self) -> None:
108102
"""If the agent eventually commits, that wins over loop detection."""
109103
repeated = _tool_call("similarity_search", {"query": "loop"})
110-
messages = [
111-
_make_response(repeated),
112-
_make_response(repeated),
113-
_make_response(repeated),
104+
messages = [_make_response(repeated) for _ in range(TOOL_LOOP_THRESHOLD)] + [
114105
_make_response(_tool_call("final_result", {"value": None})),
115106
]
116107
self.assertEqual(_classify_none_result(messages), NONE_RESULT_AGENT_COMMITTED)
@@ -188,7 +179,11 @@ class _BadArgs:
188179
def __repr__(self) -> str:
189180
return "<bad>"
190181

191-
def __str__(self) -> str: # pragma: no cover - exercised via default=str
182+
def __str__(self) -> str:
183+
# Invoked by ``json.dumps(..., default=str)`` when it tries
184+
# to coerce ``_BadArgs``; deliberately raises so the
185+
# classifier's ``except (TypeError, ValueError)`` branch
186+
# takes over.
192187
raise TypeError("nope")
193188

194189
# Two repeats below threshold ⇒ no_final_response; the test is that

opencontractserver/tests/test_pydantic_ai_agents.py

Lines changed: 86 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1517,12 +1517,17 @@ class DummyOutput(BaseModel):
15171517

15181518
prompt = agent._build_structured_system_prompt(DummyOutput, "user query")
15191519

1520-
# Universal phrasing for both Anthropic and OpenAI
1520+
# Universal phrasing for both Anthropic and OpenAI: agent must
1521+
# commit to the result tool after gathering data (issue #1381).
15211522
self.assertIn("MUST", prompt)
15221523
self.assertIn("result tool", prompt)
1523-
# Negative wording must mention "genuinely" so the model only
1524-
# bails out when the data is actually absent.
1525-
self.assertIn("genuinely cannot be found", prompt)
1524+
# SEARCH PROTOCOL nudge: don't bail after a single failed query.
1525+
self.assertIn("SEARCH PROTOCOL", prompt)
1526+
self.assertIn(
1527+
"multiple search attempts",
1528+
prompt,
1529+
"Negative case must require multiple attempts before committing to None",
1530+
)
15261531

15271532
@patch("opencontractserver.llms.agents.pydantic_ai_agents.PydanticAIAgent")
15281533
async def test_structured_response_openai_skips_anthropic_override(
@@ -1554,3 +1559,80 @@ class DummyOutput(BaseModel):
15541559
structured_call.kwargs.get("model_settings"),
15551560
"OpenAI structured runs without pins must pass model_settings=None",
15561561
)
1562+
1563+
@patch("opencontractserver.llms.agents.pydantic_ai_agents.PydanticAIAgent")
1564+
async def test_document_agent_inherits_anthropic_temperature_guard(
1565+
self, mock_pyd_ai_cls: MagicMock
1566+
) -> None:
1567+
"""``PydanticAIDocumentAgent`` inherits ``_structured_response_raw``
1568+
from ``PydanticAICoreAgent``, so the Anthropic ``temperature=0``
1569+
guard must fire when extraction runs on a document agent too.
1570+
1571+
The base-class test covers the implementation itself; this test
1572+
explicitly exercises the inheritance path so a future override
1573+
on a subclass cannot silently regress the fix.
1574+
"""
1575+
from opencontractserver.constants.llm import STRUCTURED_OUTPUT_RETRIES
1576+
from opencontractserver.llms.agents.core_agents import (
1577+
AgentConfig,
1578+
CoreConversationManager,
1579+
DocumentAgentContext,
1580+
)
1581+
from opencontractserver.llms.agents.pydantic_ai_agents import _HistoryResult
1582+
1583+
# Wire up the structured-agent mock the same way the core helper does.
1584+
structured_run_result = MagicMock()
1585+
structured_run_result.output = "extracted-value"
1586+
structured_agent_mock = MagicMock()
1587+
structured_agent_mock.run = AsyncMock(return_value=structured_run_result)
1588+
mock_pyd_ai_cls.return_value = structured_agent_mock
1589+
1590+
config = AgentConfig(
1591+
user_id=self.user.id,
1592+
model_name="anthropic:claude-sonnet-4-6",
1593+
temperature=None,
1594+
)
1595+
ctx = MagicMock(spec=DocumentAgentContext)
1596+
ctx.document = self.doc1
1597+
ctx.config = config
1598+
conv_mgr = MagicMock(spec=CoreConversationManager)
1599+
conv_mgr.conversation = None
1600+
conv_mgr.config = config
1601+
1602+
prebuilt_agent = MagicMock()
1603+
prebuilt_agent._function_tools = {}
1604+
1605+
agent = PydanticAIDocumentAgent(
1606+
context=ctx,
1607+
conversation_manager=conv_mgr,
1608+
pydantic_ai_agent=prebuilt_agent,
1609+
agent_deps=MagicMock(),
1610+
)
1611+
agent._get_message_history = AsyncMock(
1612+
return_value=_HistoryResult(messages=None)
1613+
)
1614+
1615+
class DummyOutput(BaseModel):
1616+
value: str
1617+
1618+
await agent._structured_response_raw(
1619+
prompt="any",
1620+
target_type=DummyOutput,
1621+
model="anthropic:claude-sonnet-4-6",
1622+
temperature=None,
1623+
)
1624+
1625+
structured_call = self._structured_agent_call(mock_pyd_ai_cls)
1626+
self.assertIsNotNone(
1627+
structured_call,
1628+
"Document agent must reach the inherited _structured_response_raw",
1629+
)
1630+
self.assertEqual(
1631+
structured_call.kwargs["model_settings"]["temperature"],
1632+
0,
1633+
"Document agents must also force Anthropic temperature=0 (issue #1381)",
1634+
)
1635+
self.assertEqual(
1636+
structured_call.kwargs["output_retries"],
1637+
STRUCTURED_OUTPUT_RETRIES,
1638+
)

0 commit comments

Comments
 (0)