Commit 1e67b38

Merge pull request #1399 from Open-Source-Legal/claude/resolve-issue-1381-qQ4LA
Make Anthropic structured extraction reliable + classify None failures
2 parents 9531e84 + 5559621 commit 1e67b38

8 files changed

Lines changed: 989 additions & 230 deletions

CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -38,6 +38,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- **Anthropic models silently fail in `doc_extract_query_task`** (Issue #1381): When `doc_extract_query_task` was run with an Anthropic / Claude model, ~85% of cells failed with the canonical "extraction returned None — the requested information may not be present" message even though the document contained the answer. Inspecting `Datacell.llm_call_log` for failed cells showed Claude's last assistant message was always `text` + `tool_use` parts, never a final structured response — pydantic-ai's structured-response runner treated this as no result and returned `None`. The error message conflated three distinct outcomes ("agent committed to None", "agent never produced a final structured response", "agent looped on the same tool call") under one ambiguous string. Three coordinated changes:
- **`opencontractserver/llms/agents/pydantic_ai_agents.py`** — All three `_build_structured_system_prompt` overrides (`PydanticAICoreAgent`, `PydanticAIDocumentAgent`, `PydanticAICorpusAgent`) now explicitly tell the agent that **after gathering enough information from the tools, it MUST commit to the final structured response by calling the result tool**. Wording is universal — harmless for OpenAI and necessary for Claude. `_structured_response_raw` now passes `output_retries=STRUCTURED_OUTPUT_RETRIES` (=3) to `PydanticAIAgent` so pydantic-ai retries the final-result tool call when the model fails to commit on the first pass. When an Anthropic model is used and the caller did not pin a temperature, structured runs force `temperature=0` (Claude is reluctant to commit; non-zero temperature pushes it toward more exploratory text). **Cost note**: bumping `output_retries` from pydantic-ai's default of 1 to 3 means a cell that fails the first commit attempt can incur up to 3 result-tool round trips. In the previous behaviour those cells silently returned `None` instead of retrying, so the worst-case per-cell LLM cost on Anthropic models can roughly triple compared to the broken baseline. This is the correct tradeoff — the previous baseline produced no answer — but operators should anticipate the cost shift in billing for Anthropic-driven extractions.
- **`opencontractserver/constants/llm.py`** — Hosts `STRUCTURED_OUTPUT_RETRIES`, `TOOL_LOOP_THRESHOLD`, `EXTRACT_DEFAULT_TEMPERATURE`, and the `NONE_RESULT_*` vocabulary used by the classifier (see below).
- **`opencontractserver/utils/llm.py`** — New `is_anthropic_model()` helper so call sites outside the agents layer (notably `data_extract_tasks.doc_extract_query_task`) can decide whether to pass `temperature=None` and let the Anthropic guard activate.
- **`opencontractserver/tasks/data_extract_tasks.py`** — New `_classify_none_result(messages)` helper inspects the captured pydantic-ai message log and returns one of `agent_committed_none` (a `final_result*` tool call appears — legitimate "data not present"), `no_final_response` (no `final_result*` anywhere — pipeline integration failure), `tool_loop_no_output` (same tool call repeated ≥ `TOOL_LOOP_THRESHOLD`× without final — pipeline bug), or `unknown`. The `Datacell.stacktrace` now records `failure_mode=<classification>` plus a human-readable message (the integration-failure variants reference issue #1381) so operators can `grep failure_mode=` to separate legitimate "not present" outcomes from pipeline bugs. New `_resolve_extract_temperature(model_name)` helper picks the temperature passed to the structured runner: returns `None` for Anthropic models so `_structured_response_raw`'s `temperature=0` override fires automatically, and `EXTRACT_DEFAULT_TEMPERATURE` (0.3) otherwise. This closes the latent footgun where flipping `DEFAULT_EXTRACT_MODEL` to a Claude model would have silently bypassed the reliability fix because `temperature=EXTRACT_DEFAULT_TEMPERATURE` was passed unconditionally.
- **`opencontractserver/tests/test_data_extract_failure_classification.py`** — New `SimpleTestCase` suite covering the classifier (empty input, `final_result` detection, `final_result_<TypeName>` suffix variants, no-tool-calls path, tool-call-without-final path, repeated-call loop detection, threshold-minus-one boundary, loop-then-commit precedence, mixed text + tool path, non-`ModelResponse` skip, JSON-string `args` normalisation, malformed JSON `args` defensive path, unhashable `args` `repr` fallback), `is_anthropic_model` (prefix, bare-name, OpenAI / `gpt-4` / `o1` rejection, empty/None), and `_resolve_extract_temperature` (Anthropic→None, OpenAI→default, unknown→default, current-default sanity check).
- **`opencontractserver/tests/test_pydantic_ai_agents.py`** — New `_structured_response_raw` tests pinning the Anthropic temperature override (forces 0 when caller passes `temperature=None`, respects function-level pin, respects `config.temperature` pin, leaves OpenAI runs untouched), and three smoke tests for the strengthened `_build_structured_system_prompt` overrides covering the document, corpus, and core base agents.
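The classification heuristic described above lends itself to a compact sketch. This is a dependency-free illustration, not the shipped helper: the real `_classify_none_result` walks pydantic-ai message objects and also emits `unknown` for logs it cannot interpret, whereas this sketch takes pre-extracted `(tool_name, serialized_args)` pairs. The commit-over-loop precedence follows the "loop-then-commit precedence" test listed above.

```python
from collections import Counter

# Vocabulary and threshold mirror opencontractserver/constants/llm.py.
NONE_RESULT_AGENT_COMMITTED = "agent_committed_none"
NONE_RESULT_NO_FINAL = "no_final_response"
NONE_RESULT_TOOL_LOOP = "tool_loop_no_output"
TOOL_LOOP_THRESHOLD = 4


def classify_none_result(tool_calls: list[tuple[str, str]]) -> str:
    """Classify why a structured extraction run returned None.

    ``tool_calls`` is a simplified stand-in for the captured pydantic-ai
    message log: (tool_name, serialized_args) pairs in call order.
    """
    # A final_result* call means the agent committed to None on purpose,
    # even if it looped on tools beforehand (loop-then-commit precedence).
    if any(name.startswith("final_result") for name, _ in tool_calls):
        return NONE_RESULT_AGENT_COMMITTED
    # The same (tool, args) pair repeated past the threshold looks like a loop.
    if tool_calls and max(Counter(tool_calls).values()) >= TOOL_LOOP_THRESHOLD:
        return NONE_RESULT_TOOL_LOOP
    # Otherwise the agent never produced a final structured response at all.
    return NONE_RESULT_NO_FINAL
```

Note how the threshold interacts with the retry budget: three legitimate `final_result` retries (the `STRUCTURED_OUTPUT_RETRIES` budget) stay below `TOOL_LOOP_THRESHOLD = 4`, so they can never trip the loop detector.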
- **Extraction grounding follow-up** (Issue #1246, follow-up to original #1245 grounding pipeline):
- **Bug — silent `page=1` fallback corrupted multi-page PDF grounding** (`opencontractserver/utils/extraction_grounding.py`, `_create_pdf_annotation`): when PlasmaPDF could not determine a page for a span, the previous code logged a warning and saved the annotation on page 1 anyway. For multi-page PDFs this produced a structurally incorrect annotation pinned to the wrong page (and therefore the wrong bounding box context), so users clicking through to the source landed on a different page than the one containing the extracted text. Fixed: `_create_pdf_annotation` now raises `ValueError` inside its `transaction.atomic()` savepoint, the savepoint rolls back, and the outer per-result `try/except` in `_create_grounding_annotations` logs it as a failed grounding attempt. Best-effort grounding is preserved (other annotations in the batch are unaffected) but no annotation is ever saved with a wrong page.
- **Bug — label-set lookup outside the per-annotation guard caused all-or-nothing failure** (`opencontractserver/utils/extraction_grounding.py`, `_create_grounding_annotations`): `corpus.ensure_label_and_labelset(...)` was invoked once before the per-annotation `try/with transaction.atomic()` loop. A failure to materialise the label-set (e.g. a transient DB error or a pre-existing constraint conflict) propagated out, was caught by the outer `try/except` in `data_extract_tasks.py`, and silently dropped *all* groundings for the datacell. Moved the call inside the savepoint so a label-lookup failure only skips the affected annotation.
@@ -48,7 +55,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Tests** — `opencontractserver/tests/test_extraction_grounding.py`:
- `TestGroundingPipelinePDFIntegration` (new class): builds a synthetic two-page PAWLS payload (no real PDF binary needed), runs grounding through `build_translation_layer`, and verifies (a) annotations land on the correct page, (b) re-running grounding is idempotent, and (c) when PlasmaPDF returns `page=None` the annotation is **skipped** instead of being saved on page 1.
- `test_ground_text_document_is_idempotent`: regression for the duplicate-annotation bug on the SPAN_LABEL path.
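The savepoint-per-annotation pattern behind both grounding fixes can be sketched without Django. In the real code each loop iteration runs inside a `transaction.atomic()` savepoint and the label-set lookup happens inside it; here a plain list plus an index checkpoint stand in for the database and the savepoint, and `resolve_page` is a hypothetical stand-in for the PlasmaPDF page lookup.

```python
def ground_annotations(results, resolve_page, saved):
    """Best-effort grounding: each result is handled in isolation so one
    failure skips only that annotation, never the whole batch.

    ``saved`` plays the role of the database; the shipped code wraps each
    iteration in Django's ``transaction.atomic()`` savepoint and performs
    the label-set lookup inside that savepoint too.
    """
    failures = []
    for result in results:
        checkpoint = len(saved)  # stand-in for entering a savepoint
        try:
            page = resolve_page(result)
            if page is None:
                # Old behaviour: warn and save on page 1 anyway.
                # Fixed behaviour: raise, so the savepoint rolls back.
                raise ValueError(f"could not resolve page for {result!r}")
            saved.append((result, page))
        except ValueError as exc:
            del saved[checkpoint:]  # roll back to the savepoint
            failures.append((result, str(exc)))
    return failures
```

The key property is that a failed result contributes to `failures` but leaves `saved` untouched, which is exactly the "skipped instead of saved on page 1" behaviour the tests above verify.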
- **`CreateCorpusActionModal` opened with the wrong default agent instructions for document triggers** (Issue #1385, `frontend/src/components/corpuses/CreateCorpusActionModal.tsx:136-144,168-171`): the `inlineAgentInstructions` state was initialised with `DEFAULT_MODERATOR_INSTRUCTIONS` even though the default trigger is `add_document` (a document trigger). The trigger-change handler at line 611 swaps to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS`, but a user who created an inline agent on the default-selected trigger without first re-selecting the trigger would submit the moderator copy as the new agent's system instructions. Initialised both the `useState` default and `resetForm()` to `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` so the pre-interaction value matches the default trigger. Updated `frontend/tests/CreateCorpusActionModal.ct.tsx` "inline-agent create: full happy path" mutation mock to expect `DEFAULT_DOCUMENT_AGENT_INSTRUCTIONS` — the previous mock variable masked this bug because `MockedProvider` was matching the stale moderator default rather than the trigger-appropriate one.

### Changed

opencontractserver/constants/llm.py

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
"""LLM / agent integration constants (issue #1381)."""

# pydantic-ai default is 1; Anthropic models often need retries to commit
# to ``final_result``. Bumping this requires re-checking ``TOOL_LOOP_THRESHOLD``
# below — a legitimate retried tool call could be mis-classified as a loop
# if the threshold is lower than the retry budget.
STRUCTURED_OUTPUT_RETRIES = 3

# Same-call repetition count that ``_classify_none_result`` treats as a
# pipeline bug (post-mortem heuristic, not a pydantic-ai input). Kept
# strictly greater than ``STRUCTURED_OUTPUT_RETRIES`` so a worst-case
# legitimate ``final_result`` retry budget can never trip the loop
# detector even though pydantic-ai's ``output_retries`` only governs
# ``final_result`` retries today (regular tool calls are not retried by
# the same budget). The margin is defensive: if pydantic-ai ever extends
# retry coverage to other tools, the threshold still has headroom.
TOOL_LOOP_THRESHOLD = 4

# Vocabulary written to ``Datacell.stacktrace`` as ``failure_mode=...``.
# Operators grep these; changing the strings breaks downstream dashboards.
NONE_RESULT_AGENT_COMMITTED = "agent_committed_none"
NONE_RESULT_NO_FINAL = "no_final_response"
NONE_RESULT_TOOL_LOOP = "tool_loop_no_output"
NONE_RESULT_UNKNOWN = "unknown"

EXTRACT_DEFAULT_TEMPERATURE = 0.3

opencontractserver/llms/agents/pydantic_ai_agents.py

Lines changed: 61 additions & 7 deletions
@@ -33,6 +33,7 @@
 from pydantic_graph import End

 from opencontractserver.constants.context_guardrails import COMPACTION_SUMMARY_PREFIX
+from opencontractserver.constants.llm import STRUCTURED_OUTPUT_RETRIES
 from opencontractserver.conversations.models import Conversation
 from opencontractserver.corpuses.models import Corpus
 from opencontractserver.documents.models import Document
@@ -90,6 +91,7 @@
     PydanticAIAnnotationVectorStore,
 )
 from opencontractserver.utils.embeddings import aget_embedder
+from opencontractserver.utils.llm import is_anthropic_model
 from opencontractserver.utils.prompt_sanitization import (
     UNTRUSTED_CONTENT_NOTICE,
     fence_user_content,
@@ -536,10 +538,20 @@ def _build_structured_system_prompt(
         Subclasses may override this to include document or corpus context.
         The base implementation intentionally avoids any citation or
         conversational guidance to minimize iterations and enforce raw output.
+
+        The wording explicitly tells the agent to commit to the final
+        structured response after gathering information. Some models (notably
+        Anthropic's Claude family) tend to keep narrating or invoking tools
+        instead of producing the structured output unless told to stop. See
+        issue #1381.
         """
         return (
             "You are in data extraction mode.\n"
             "Use available tools to locate the requested information.\n"
+            "After gathering enough information from the tools, you MUST "
+            "produce the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
             "Return ONLY the raw value matching the target type. "
             "No explanations, no citations, no extra words.\n\n"
             "SEARCH PROTOCOL:\n"
@@ -1358,13 +1370,40 @@ async def _structured_response_raw(
             )

         try:
-            # Build model settings with overrides
+            # Build model settings with overrides.
+            # ``_prepare_pydantic_ai_model_settings`` returns ``None`` when
+            # both temperature and max_tokens are unset on ``self.config``
+            # (the helper signals "no settings to pass" rather than
+            # returning an empty dict). We need a mutable dict here so
+            # the function-level ``temperature`` / ``max_tokens`` overrides
+            # — and the Anthropic temperature-0 nudge below — have
+            # somewhere to land.
             model_settings = _prepare_pydantic_ai_model_settings(self.config)
+            if model_settings is None:
+                model_settings = {}
             if temperature is not None:
                 model_settings["temperature"] = temperature
             if max_tokens is not None:
                 model_settings["max_tokens"] = max_tokens

+            # Anthropic models tend to keep narrating / calling tools instead
+            # of committing to the structured output when given any wiggle
+            # room (issue #1381). Force temperature down to 0 unless the
+            # caller explicitly asked for something else (function-level
+            # temperature pin OR an explicit config.temperature).
+            effective_model = model or self.config.model_name
+            if (
+                is_anthropic_model(effective_model)
+                and temperature is None
+                and self.config.temperature is None
+            ):
+                logger.info(
+                    "Forcing temperature=0 for structured extraction with "
+                    "Anthropic model %s (issue #1381).",
+                    effective_model,
+                )
+                model_settings["temperature"] = 0
+
             # Seed tools from the main agent so the structured run has the same capabilities
             seeded_tools_dict = _get_function_tools(self.pydantic_ai_agent)
             seeded_tools = list(seeded_tools_dict.values())
@@ -1394,13 +1433,20 @@ async def _structured_response_raw(

             logger.info(f"Structured system prompt: {structured_system_prompt}")

+            # Preserve the pre-issue-#1381 behaviour of passing
+            # ``model_settings=None`` to ``PydanticAIAgent`` when nothing
+            # ended up being set, so non-Anthropic structured runs without
+            # caller pins are bit-identical to before.
             structured_agent = PydanticAIAgent(
-                model=model or self.config.model_name,
+                model=effective_model,
                 instructions=structured_system_prompt,
                 output_type=target_type,
                 deps_type=PydanticAIDependencies,
                 tools=final_tools,
-                model_settings=model_settings,
+                model_settings=model_settings or None,
+                # Give pydantic-ai room to retry the structured output when
+                # the model fails to commit on the first pass (issue #1381).
+                output_retries=STRUCTURED_OUTPUT_RETRIES,
             )

             # Include prior conversation context if available
@@ -1995,8 +2041,12 @@ def _build_structured_system_prompt(
             "search for likely answer phrasings). A single failed search is NOT "
             "sufficient evidence that the information is missing — most legal "
             "documents need multiple targeted queries to surface a relevant span.\n"
-            "4. Return ONLY the raw extracted value matching the target type.\n"
-            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "4. After gathering enough information from the tools, you MUST "
+            "commit to the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
+            "5. Return ONLY the raw extracted value matching the target type.\n"
+            "6. No explanations, no citations, no commentary – just the data.\n\n"
             "Only return null/None after multiple search attempts have all "
             "failed to find relevant content."
         )
@@ -2509,8 +2559,12 @@ def _build_structured_system_prompt(
             "search for likely answer phrasings). A single failed search is NOT "
             "sufficient evidence that the information is missing — most legal "
             "documents need multiple targeted queries to surface a relevant span.\n"
-            "4. Return ONLY the raw extracted value matching the target type.\n"
-            "5. No explanations, no citations, no commentary – just the data.\n\n"
+            "4. After gathering enough information from the tools, you MUST "
+            "commit to the final structured response by calling the result "
+            "tool. Do not narrate further; do not keep invoking tools "
+            "indefinitely.\n"
+            "5. Return ONLY the raw extracted value matching the target type.\n"
+            "6. No explanations, no citations, no commentary – just the data.\n\n"
             "Only return null/None after multiple search attempts have all "
             "failed to find relevant content."
        )
