Close Anthropic temperature-coupling footgun + lift patch coverage

Hoist is_anthropic_model into constants/llm.py so doc_extract_query_task
can decide whether to pass temperature=None and let the agent-layer
guard apply temperature=0 automatically. New _resolve_extract_temperature
helper makes the model-family -> temperature coupling unit-testable and
prevents a silent regression of #1381 if EXTRACT_DEFAULT_MODEL is ever
flipped to a Claude model. Adds Anthropic temperature-override tests
on _structured_response_raw, smoke tests for the strengthened
_build_structured_system_prompt overrides, and defensive-path tests
for _classify_none_result.
CHANGELOG.md: 5 additions & 3 deletions
@@ -39,9 +39,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
  ### Fixed

  - **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
- - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent` line 491, `PydanticAIDocumentAgent` line 1971, `PydanticAICorpusAgent` line 2476) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` (line 1321) now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. New `_is_anthropic_model()` helper detects `anthropic:` prefix or bare `claude` substring; when an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is already reluctant to commit; non-zero temperature pushes it toward more exploratory text).
- - **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs.
- - **`opencontractserver/tests/test_data_extract_failure_classification.py`** — New `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, loop-then-commit precedence, mixed text + tool path) and `_is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None).
+ - **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text).
+ - **`opencontractserver/constants/llm.py`** — Hosts `is_anthropic_model()` (`anthropic:` prefix or bare `claude` substring detector). Lives next to `EXTRACT_DEFAULT_MODEL` / `EXTRACT_DEFAULT_TEMPERATURE` so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
+ - **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ 3× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `EXTRACT_DEFAULT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
+ - **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
  - **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
  - **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
  - **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
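
Returning to the `_classify_none_result(messages)` entry in the hunk above, the classification it describes could look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's implementation: messages are modelled as plain dicts with a hypothetical `tool_calls` list rather than real pydantic-ai message objects, empty input is assumed to map to `unknown`, and "repeated" is assumed to mean the same tool name and arguments appearing at least three times anywhere in the log.

```python
import json
from collections import Counter
from typing import Any

LOOP_THRESHOLD = 3  # "same tool call repeated >= 3x" per the changelog entry


def classify_none_result(messages: list[dict[str, Any]]) -> str:
    """Classify why a structured extraction produced None.

    Each message is a hypothetical dict such as
    {"tool_calls": [{"name": "search", "args": {"q": "term"}}]}.
    """
    if not messages:
        # Assumption: nothing was captured, so nothing can be concluded.
        return "unknown"

    # Flatten every tool call into a hashable (name, canonical-args) pair.
    calls = []
    for message in messages:
        for call in message.get("tool_calls", []):
            args = json.dumps(call.get("args", {}), sort_keys=True)
            calls.append((call["name"], args))

    # A final_result* call means the agent committed; this takes
    # precedence over any earlier loop.
    if any(name.startswith("final_result") for name, _ in calls):
        return "agent_committed_none"

    # The same call repeated without ever committing.
    if calls and max(Counter(calls).values()) >= LOOP_THRESHOLD:
        return "tool_loop_no_output"

    # Tool calls or text, but no final_result* anywhere.
    return "no_final_response"
```

A caller could then prefix the stored stack trace with `f"failure_mode={classify_none_result(messages)}"`, which is what makes the `grep failure_mode=` workflow mentioned in the entry possible.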