Commit 1e67b38

Merge pull request #1399 from Open-Source-Legal/claude/resolve-issue-1381-qQ4LA
Make Anthropic structured extraction reliable + classify None failures
2 parents 9531e84 + 5559621 commit 1e67b38

8 files changed

Lines changed: 989 additions & 230 deletions

CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -38,6 +38,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
- **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text). **Cost note**: bumping `output_retries` from pydantic-ai's default of 1 to 3 means a cell that fails the first commit attempt can incur up to 3 result-tool round trips. In the previous behaviour those cells silently returned `None` instead of retrying, so the worst-case per-cell LLM cost on Anthropic models can roughly triple compared to the broken baseline. This is the correct tradeoff — the previous baseline produced no answer — but operators should anticipate the cost shift in billing for Anthropic-driven extractions.
- **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, `EXTRACT_DEFAULT_TEMPERATURE`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below).
- **`opencontractserver/utils/llm.py`** — New `is_anthropic_model()` helper so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
- **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ `TOOL_LOOP_THRESHOLD`× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `DEFAULT_EXTRACT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
- **`opencontractserver/tests/test_data_extract_failure_classification.py`** — New `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
- **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
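The classification heuristic described above lends itself to a compact sketch. This is a dependency-free illustration, not the shipped helper: the real `_classify_none_result` walks pydantic-ai message objects and also emits `unknown` for logs it cannot interpret, whereas this sketch takes pre-extracted `(tool_name, serialized_args)` pairs. The commit-over-loop precedence follows the "loop-then-commit precedence" test listed above.

```python
from collections import Counter

# Vocabulary and threshold mirror opencontractserver/constants/llm.py.
NONE_RESULT_AGENT_COMMITTED = "agent_committed_none"
NONE_RESULT_NO_FINAL = "no_final_response"
NONE_RESULT_TOOL_LOOP = "tool_loop_no_output"
TOOL_LOOP_THRESHOLD = 4


def classify_none_result(tool_calls: list[tuple[str, str]]) -> str:
    """Classify why a structured extraction run returned None.

    ``tool_calls`` is a simplified stand-in for the captured pydantic-ai
    message log: (tool_name, serialized_args) pairs in call order.
    """
    # A final_result* call means the agent committed to None on purpose,
    # even if it looped on tools beforehand (loop-then-commit precedence).
    if any(name.startswith("final_result") for name, _ in tool_calls):
        return NONE_RESULT_AGENT_COMMITTED
    # The same (tool, args) pair repeated past the threshold looks like a loop.
    if tool_calls and max(Counter(tool_calls).values()) >= TOOL_LOOP_THRESHOLD:
        return NONE_RESULT_TOOL_LOOP
    # Otherwise the agent never produced a final structured response at all.
    return NONE_RESULT_NO_FINAL
```

Note how the threshold interacts with the retry budget: three legitimate `final_result` retries (the `STRUCTURED_OUTPUT_RETRIES` budget) stay below `TOOL_LOOP_THRESHOLD = 4`, so they can never trip the loop detector.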
- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
- **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
- **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
@@ -48,7 +55,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Tests** — `opencontractserver/tests/test_extraction_grounding.py`:
- `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
- `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
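The savepoint-per-annotation pattern behind both grounding fixes can be sketched without Django. In the real code each loop iteration runs inside a `transaction.atomic()` savepoint and the label-set lookup happens inside it; here a plain list plus an index checkpoint stand in for the database and the savepoint, and `resolve_page` is a hypothetical stand-in for the PlasmaPDF page lookup.

```python
def ground_annotations(results, resolve_page, saved):
    """Best-effort grounding: each result is handled in isolation so one
    failure skips only that annotation, never the whole batch.

    ``saved`` plays the role of the database; the shipped code wraps each
    iteration in Django's ``transaction.atomic()`` savepoint and performs
    the label-set lookup inside that savepoint too.
    """
    failures = []
    for result in results:
        checkpoint = len(saved)  # stand-in for entering a savepoint
        try:
            page = resolve_page(result)
            if page is None:
                # Old behaviour: warn and save on page 1 anyway.
                # Fixed behaviour: raise, so the savepoint rolls back.
                raise ValueError(f"could not resolve page for {result!r}")
            saved.append((result, page))
        except ValueError as exc:
            del saved[checkpoint:]  # roll back to the savepoint
            failures.append((result, str(exc)))
    return failures
```

The key property is that a failed result contributes to `failures` but leaves `saved` untouched, which is exactly the "skipped instead of saved on page 1" behaviour the tests above verify.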
- **`CreateCorpusActionModal` opened with the wrong default agent instructions for document triggers** (Issue #1385, `frontend/src/components/corpuses/CreateCorpusActionModal.tsx:136-144,168-171`): the `inlineAgentInstructions` state was initialised with `DEFAULT_MODERATOR_INSTRUCTIONS` even though the default trigger is `add_document` (a document trigger). The trigger-change handler at line 611 swaps to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS`, but a user who created an inline agent on the default-selected trigger without first re-selecting the trigger would submit the moderator copy as the new agent's system instructions. Initialised both the `useState` default and `resetForm()` to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` so the pre-interaction value matches the default trigger. Updated `frontend/tests/CreateCorpusActionModal.ct.tsx` "inline-agent create: full happy path" mutation mock to expect `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` — the previous mock variable masked this bug because `MockedProvider` was matching the stale moderator default rather than the trigger-appropriate one.

### Changed

opencontractserver/constants/llm.py

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
"""LLM / agent integration constants (issue #1381)."""

# pydantic-ai default is 1; Anthropic models often need retries to commit
# to ``final_result``. Bumping this requires re-checking ``TOOL_LOOP_THRESHOLD``
# below — a legitimate retried tool call could be mis-classified as a loop
# if the threshold is lower than the retry budget.
STRUCTURED_OUTPUT_RETRIES = 3

# Same-call repetition count that ``_classify_none_result`` treats as a
# pipeline bug (post-mortem heuristic, not a pydantic-ai input). Kept
# strictly greater than ``STRUCTURED_OUTPUT_RETRIES`` so a worst-case
# legitimate ``final_result`` retry budget can never trip the loop
# detector even though pydantic-ai's ``output_retries`` only governs
# ``final_result`` retries today (regular tool calls are not retried by
# the same budget). The margin is defensive: if pydantic-ai ever extends
# retry coverage to other tools, the threshold still has headroom.
TOOL_LOOP_THRESHOLD = 4

# Vocabulary written to ``Datacell.stacktrace`` as ``failure_mode=...``.
# Operators grep these; changing the strings breaks downstream dashboards.
NONE_RESULT_AGENT_COMMITTED = "agent_committed_none"
NONE_RESULT_NO_FINAL = "no_final_response"
NONE_RESULT_TOOL_LOOP = "tool_loop_no_output"
NONE_RESULT_UNKNOWN = "unknown"

EXTRACT_DEFAULT_TEMPERATURE = 0.3

opencontractserver/llms/agents/pydantic_ai_agents.py

Lines changed: 61 additions & 7 deletions
@@ -33,6 +33,7 @@
 from pydantic_graph import End

 from opencontractserver.constants.context_guardrails import COMPACTION_SUMMARY_PREFIX
+from opencontractserver.constants.llm import STRUCTURED_OUTPUT_RETRIES
 from opencontractserver.conversations.models import Conversation
 from opencontractserver.corpuses.models import Corpus
 from opencontractserver.documents.models import Document
@@ -90,6 +91,7 @@
     PydanticAIAnnotationVectorStore,
 )
 from opencontractserver.utils.embeddings import aget_embedder
+from opencontractserver.utils.llm import is_anthropic_model
 from opencontractserver.utils.prompt_sanitization import (
     UNTRUSTED_CONTENT_NOTICE,
     fence_user_content,
@@ -536,10 +538,20 @@ def _build_structured_system_prompt(
         Subclasses may override this to include document or corpus context.
         The base implementation intentionally avoids any citation or
         conversational guidance to minimize iterations and enforce raw output.
+
+        The wording explicitly tells the agent to commit to the final
+        structured response after gathering information. Some models (notably
+        Anthropic's Claude family) tend to keep narrating or invoking tools
+        instead of producing the structured output unless told to stop. See
+        issue #1381.
         """
         return (
             "You are in data extraction mode.\n"
             "Use available tools to locate the requested information.\n"
+            "After gathering enough information from the tools, you MUST "
+            "produce the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
             "Return ONLY the raw value matching the target type. "
             "No explanations, no citations, no extra words.\n\n"
             "SEARCH PROTOCOL:\n"
@@ -1358,13 +1370,40 @@ async def _structured_response_raw(
             )

         try:
-            # Build model settings with overrides
+            # Build model settings with overrides.
+            # ``_prepare_pydantic_ai_model_settings`` returns ``None`` when
+            # both temperature and max_tokens are unset on ``self.config``
+            # (the helper signals "no settings to pass" rather than
+            # returning an empty dict). We need a mutable dict here so
+            # the function-level ``temperature`` / ``max_tokens`` overrides
+            # — and the Anthropic temperature-0 nudge below — have
+            # somewhere to land.
             model_settings = _prepare_pydantic_ai_model_settings(self.config)
+            if model_settings is None:
+                model_settings = {}
             if temperature is not None:
                 model_settings["temperature"] = temperature
             if max_tokens is not None:
                 model_settings["max_tokens"] = max_tokens

+            # Anthropic models tend to keep narrating / calling tools instead
+            # of committing to the structured output when given any wiggle
+            # room (issue #1381). Force temperature down to 0 unless the
+            # caller explicitly asked for something else (function-level
+            # temperature pin OR an explicit config.temperature).
+            effective_model = model or self.config.model_name
+            if (
+                is_anthropic_model(effective_model)
+                and temperature is None
+                and self.config.temperature is None
+            ):
+                logger.info(
+                    "Forcing temperature=0 for structured extraction with "
+                    "Anthropic model %s (issue #1381).",
+                    effective_model,
+                )
+                model_settings["temperature"] = 0
+
             # Seed tools from the main agent so the structured run has the same capabilities
             seeded_tools_dict = _get_function_tools(self.pydantic_ai_agent)
             seeded_tools = list(seeded_tools_dict.values())
@@ -1394,13 +1433,20 @@ async def _structured_response_raw(

             logger.info(f"Structured system prompt: {structured_system_prompt}")

+            # Preserve the pre-issue-#1381 behaviour of passing
+            # ``model_settings=None`` to ``PydanticAIAgent`` when nothing
+            # ended up being set, so non-Anthropic structured runs without
+            # caller pins are bit-identical to before.
             structured_agent = PydanticAIAgent(
-                model=model or self.config.model_name,
+                model=effective_model,
                 instructions=structured_system_prompt,
                 output_type=target_type,
                 deps_type=PydanticAIDependencies,
                 tools=final_tools,
-                model_settings=model_settings,
+                model_settings=model_settings or None,
+                # Give pydantic-ai room to retry the structured output when
+                # the model fails to commit on the first pass (issue #1381).
+                output_retries=STRUCTURED_OUTPUT_RETRIES,
             )

             # Include prior conversation context if available
@@ -1995,8 +2041,12 @@ def _build_structured_system_prompt(
             "search for likely answer phrasings). A single failed search is NOT "
             "sufficient evidence that the information is missing — most legal "
             "documents need multiple targeted queries to surface a relevant span.\n"
-            "4. Return ONLY the raw extracted value matching the target type.\n"
-            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "4. After gathering enough information from the tools, you MUST "
+            "commit to the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
+            "5. Return ONLY the raw extracted value matching the target type.\n"
+            "6. No explanations, no citations, no commentary – just the data.\n\n"
             "Only return null/None after multiple search attempts have all "
             "failed to find relevant content."
         )
@@ -2509,8 +2559,12 @@ def _build_structured_system_prompt(
             "search for likely answer phrasings). A single failed search is NOT "
             "sufficient evidence that the information is missing — most legal "
             "documents need multiple targeted queries to surface a relevant span.\n"
-            "4. Return ONLY the raw extracted value matching the target type.\n"
-            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "4. After gathering enough information from the tools, you MUST "
+            "commit to the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
+            "5. Return ONLY the raw extracted value matching the target type.\n"
+            "6. No explanations, no citations, no commentary – just the data.\n\n"
             "Only return null/None after multiple search attempts have all "
             "failed to find relevant content."
        )
