feat(mcp): surface run_id in run-backed tool results for citation (DRC-3532)#1418
Conversation
…C-3532)
The cloud-backend MCP tools (row_count_diff, profile_diff, value_diff, query,
query_diff, top_k_diff, histogram_diff) routed through _tool_run_backed returned
only run["result"], dropping run_id. The recce-cloud summary agent therefore
never saw the run_id and could not emit deterministic {{run:<run_id>}} citation
markers, forcing fragile server-side fuzzy prose matching.
Merge run_id into the result dict (additive; existing result fields preserved).
Only added when the response carries a run_id and the result is a dict, so
run-less or non-dict responses are untouched (never synthesize a run_id).
Cross-repo (DRC-3532): the recce-cloud summary agent prompt is updated to emit
the markers; server-side marker replacement already exists in recce_task_func.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
Codecov Report❌ Patch coverage is
... and 4 files with indirect coverage changes 🚀 New features to boost your workflow:
|
…s state Local-mode ad-hoc diff tools (row_count_diff, profile_diff, query, query_diff, value_diff) now route through _tool_run_backed_local, which uses the same submit_run machinery as the recce server's run_router to persist Runs to default_context().runs. The tool result carries run_id alongside the diff output, matching the CloudBackend._tool_run_backed shape (commit 98670f9). When default_context() is None (cloud SSE path already handled by CloudBackend), the method falls back to direct task execution without run_id — identical to pre-DRC-3634 behaviour, no double execution. The MCP server remains the sole state exporter; runs created mid-session via _tool_run_backed_local are included in the exported payload because they land in the same in-process RecceContext.runs list that export_state() serialises. Tests: 12 new tests in TestLocalModeRunBacked covering run_id presence, run persistence in context.runs, run type mapping (query/query_base), and the isolation boundary (value_diff_detail stays outside run-backed scope). 175 MCP tests green (135 server + 40 cloud backend); ruff clean. Signed-off-by: Kent <iamcxa@gmail.com>
…ding warehouses
profile_diff/profile filtered the requested `columns` with a case-sensitive
membership test (`column.name in selected_columns`). get_columns() returns
physical catalog names, which case-folding warehouses (e.g. Snowflake) store
UPPERCASE, while the Recce Cloud summary agent supplies lowercase
manifest-convention names. The exact-case filter dropped every column, so the
per-column profiling loop never ran and the tool returned a completely empty
result ({"base":{"columns":[],"data":[]},"current":{"columns":[],"data":[]}}).
This made the cloud summary non-deterministic: lowercase column names yielded an
empty profile that the downstream trust gate suppressed.
Fix: lowercase both sides of the membership test before filtering, mirroring the
case-insensitive normalisation in valuediff's _build_column_case_lookup. The SQL
path is unchanged — it still profiles via the physical column.name — so lowercase
input now resolves to the same physical (uppercase) columns and returns the same
non-empty profile as uppercase input. No-op on already-lowercase duckdb models and
on quoted/case-sensitive identifiers; applied to both ProfileDiffTask and
ProfileTask.
Adds regression tests using an UPPERCASE duckdb CSV header (duckdb preserves
header casing, reproducing Snowflake physical-name casing end-to-end with no
mocking): asserts lowercase and uppercase column requests return identical
non-empty profiles, plus an adapter-safety no-op test on lowercase physical names.
DRC-3674
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
…sertions test_tool_row_count_diff and test_tool_query_with_base_flag asserted exact equality with the mock result, but run-backed local mode (DRC-3532, _tool_run_backed_local) now merges a run_id key into the tool result. Compare on the result fields ignoring run_id; the dedicated TestLocalModeRunBacked tests cover run_id surfacing. Fixes the Test DBT Versions CI failure on PR #1418. 135 test_mcp_server tests green. Signed-off-by: Kent <iamcxa@gmail.com>
kentwelcome
left a comment
There was a problem hiding this comment.
Claude Code Review: 1 blocker (silent error swallowing in local run-backed tools), 2 issues. See review comment.
Code Review: PR #1418SHA Blockers
Issues
Notes
Limits
|
…DRC-3634) Address PR #1418 review (Claude Code Review, NO-GO). Blocker -- _tool_run_backed_local swallowed failed local runs. submit_run.fn records run.status=FAILED / run.error then returns None on every exception except DuckDBExternalAccessBlocked, so `await asyncio.wrap_future(future)` resolves cleanly and a FAILED run (result=None) collapsed into a success-shaped {"run_id": ...}. A failing query / query_diff / profile_diff / value_diff / row_count_diff (bad SQL, missing model, permission denied) thus returned an empty success -- losing the error and its _classify_db_error classification, and letting a citation agent report "no diff" for a run that actually errored. Add a `run.status == FAILED` guard that re-raises RecceException so call_tool restores isError=True, while the already-persisted FAILED run stays in context.runs for citation. Regression tests cover the raise (message preserved) and the persistence. Issue 1 -- the 5 run-backed tool descriptions now document the additive top-level run_id field (MCP tool description = LLM agent contract). Signed-off-by: Kent <iamcxa@gmail.com>
|
@kentwelcome All review points addressed in Blocker — failed local runs swallowed. Added a Issue 1 — undocumented Notes — happy-path-only tests. Added two Issue 2 — PR description. Updated the description to document the Re: Limits — could not run the suite. Ran it locally with |
wcchang1115
left a comment
There was a problem hiding this comment.
Code Review: PR #1418 (incremental)
SHA 53f05c96 · Verdict NO-GO · Incremental (2e8dd51d → HEAD)
Scope: branch was rebased since the anchor; PR-authored work since the last review touches recce/mcp_server.py + the two MCP test files only.
Blockers
None.
Issues
-
recce/mcp_server.py— Rationale comments duplicate the_tool_run_backed_localdocstring (mcp_server.py:606) instead of living there once::1734,1741,1753,1760,1767— the same 2-line# DRC-3634: route through _tool_run_backed_local …comment is copy-pasted verbatim at all 5 run-backed call sites. The callawait self._tool_run_backed_local(...)is self-documenting; editing the rationale means updating 6 places or leaving 5 stale.:663— the# Surface run_id … matching CloudBackend shape (commit 98670f9e)comment restates the docstring and narrates the obvious next line.
Fix: delete all 6; let the docstring be the single source of truth. Keep the distinct base→run_type mapping comment in
_tool_query. Removes ~12 lines, no technical content lost.
Notes
recce/mcp_server.py:51—VIEW_ALL_MAX_NODEScomment claims the cap makes the result "cannot exceed" the 25k-token SDK cap, but only node count is bounded, not edges. The claim is technically false. Fix is wording only — e.g. "node-count bounded; edge count assumed to scale with nodes (holds for sparse dbt DAGs)". Do not add edge capping: disproportionate change, the trigger (~44k edges on 300 nodes) doesn't occur in dbt DAGs, and dropping edges would orphan nodes into false roots — worse than the current honest output.
Verification: pytest tests/test_mcp_server.py tests/test_mcp_cloud_backend.py → 180 passed.
Issue 1 is the only NO-GO driver and is follow-up-acceptable (zero runtime risk; pure maintainability).
iamcxa
left a comment
There was a problem hiding this comment.
@iamcxa Code review for PR #1418 at 53f05c96.
Verdict: NO-GO as-is. CI is green, but I found two concrete response-shape / selection regressions that should be fixed before merge.
Findings
recce/mcp_server.pyline 202, and local mirror at line 665 — top-levelrun_idcan overwrite real row-count data.
Both cloud and local paths merge citation metadata with {**result, "run_id": ...}. That is not strictly additive for row_count_diff, whose response is a model-name-keyed dict. If a project has a dbt model named run_id, its row-count result is replaced by the UUID string, so the MCP response silently drops actual diff data while claiming existing result fields are preserved.
Suggested fix: add a collision regression test and choose a non-colliding shape for citation metadata, or otherwise preserve an existing run_id result key instead of overwriting it.
recce/tasks/profile.pyline 193, andProfileTaskline 301 — case-insensitive matching broadens exact selections on case-sensitive schemas.
The new filter lowercases every physical column name before matching. That fixes Snowflake-style uppercase physical names, but it regresses quoted/case-sensitive schemas with distinct columns such as name and Name: requesting name now profiles both columns. The new test comment says this is a no-op on quoted/case-sensitive identifiers, but the implementation does not preserve exact-case semantics.
Suggested fix: resolve exact matches first, then fall back to case-insensitive matching only when the lowercase key maps unambiguously to one physical column. Add a regression test with case-distinct physical columns.
Advisory test hardening
- Add cloud coverage for a non-dict result, for example
{"run_id": "run-xyz", "result": [{"ok": true}]}, and assert the list is returned unchanged. - Parameterize the local out-of-scope boundary test for
value_diff_detail,top_k_diff, andhistogram_diff, since tool descriptions only promiserun_idfor the five rerouted local tools. - Consider one handler-level failed-run test that invokes the MCP
call_toolhandler and asserts the response isisErrorwhile the FAILED Run remains persisted.
Verification
- Freshness: live PR head was rechecked immediately before posting and remained
53f05c96de57fa41ae145911dc5c554ee05fb048. - Diff scope: 5 files, 475 additions / 56 deletions.
- GitHub checks: all reported checks are SUCCESS or intentionally SKIPPED.
- Local static check:
git diff --check origin/main...HEADpassed. - I did not run local pytest because this read-only checkout has no
.venv, and the repo instructions require venv activation before Python commands.
…m (DRC-3532) Address PR #1418 incremental review (wcchang1115, NO-GO driver + note). - Delete the 2-line "DRC-3634: route through _tool_run_backed_local ..." comment copy-pasted verbatim at all 5 run-backed call sites and the "Surface run_id ... matching CloudBackend shape" comment at the local merge site. The `await self._tool_run_backed_local(...)` call is self-documenting and the rationale lives once in the method docstring; the duplicates meant editing 6 places or leaving 5 stale. The distinct base->run_type mapping comment in _tool_query is kept. - Reword the VIEW_ALL_MAX_NODES comment: it claimed the node cap makes the serialized result "cannot exceed" the 25k-token SDK cap, but only the node count is bounded, not edges. State honestly that edge count is assumed to scale with nodes (holds for sparse dbt DAGs); no edge capping added. Signed-off-by: Kent <iamcxa@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tests (DRC-3532) Address PR #1418 incremental review (response-shape + test hardening). - run_id collision: row_count_diff returns a model-name-keyed dict, so {**result, "run_id": ...} would overwrite the row-count data of a dbt model literally named "run_id" while claiming existing fields are preserved. Guard both the cloud (_tool_run_backed) and local (_tool_run_backed_local) merge sites with `"run_id" not in result` so a real result key is never clobbered by the citation UUID. Pathological (no real dbt model is named run_id), but the guard is free and keeps the additive contract honest. - Tests: - local + cloud collision regression (a model named "run_id" keeps its data) - cloud non-dict result is returned unchanged (isinstance guard) - parameterize the out-of-scope boundary test over value_diff_detail, top_k_diff, histogram_diff (descriptions promise run_id only for the five rerouted local tools) Signed-off-by: Kent <iamcxa@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@wcchang1115 Round-2 review addressed across two commits ( From your review:
From the incremental self-review at
Verification: One advisory left open: a handler-level failed-run test (invoke |
…ls (DRC-3532) Close the last advisory from the PR #1418 incremental review: exercise the full call_tool handler path (not just _tool_*-level) for a failing run-backed local tool. Asserts the registered handler returns isError=True with the original message surfaced (so _classify_db_error / the agent can see it), and the FAILED Run is still persisted to context.runs for citation. Reuses TestCallToolHandler._invoke_call_tool. Signed-off-by: Kent <iamcxa@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Added the handler-level failed-run test in It drives the full |
Code Review: PR #1418SHA Re-review of the 3 commits since the BlockersNone. IssuesNone blocking in the incremental scope. Resolution of prior findings
Verification
|
…sertions test_tool_row_count_diff and test_tool_query_with_base_flag asserted exact equality with the mock result, but run-backed local mode (DRC-3532, _tool_run_backed_local) now merges a run_id key into the tool result. Compare on the result fields ignoring run_id; the dedicated TestLocalModeRunBacked tests cover run_id surfacing. Fixes the Test DBT Versions CI failure on PR #1418. 135 test_mcp_server tests green. Signed-off-by: Kent <iamcxa@gmail.com>
…DRC-3634) Address PR #1418 review (Claude Code Review, NO-GO). Blocker -- _tool_run_backed_local swallowed failed local runs. submit_run.fn records run.status=FAILED / run.error then returns None on every exception except DuckDBExternalAccessBlocked, so `await asyncio.wrap_future(future)` resolves cleanly and a FAILED run (result=None) collapsed into a success-shaped {"run_id": ...}. A failing query / query_diff / profile_diff / value_diff / row_count_diff (bad SQL, missing model, permission denied) thus returned an empty success -- losing the error and its _classify_db_error classification, and letting a citation agent report "no diff" for a run that actually errored. Add a `run.status == FAILED` guard that re-raises RecceException so call_tool restores isError=True, while the already-persisted FAILED run stays in context.runs for citation. Regression tests cover the raise (message preserved) and the persistence. Issue 1 -- the 5 run-backed tool descriptions now document the additive top-level run_id field (MCP tool description = LLM agent contract). Signed-off-by: Kent <iamcxa@gmail.com>
…m (DRC-3532) Address PR #1418 incremental review (wcchang1115, NO-GO driver + note). - Delete the 2-line "DRC-3634: route through _tool_run_backed_local ..." comment copy-pasted verbatim at all 5 run-backed call sites and the "Surface run_id ... matching CloudBackend shape" comment at the local merge site. The `await self._tool_run_backed_local(...)` call is self-documenting and the rationale lives once in the method docstring; the duplicates meant editing 6 places or leaving 5 stale. The distinct base->run_type mapping comment in _tool_query is kept. - Reword the VIEW_ALL_MAX_NODES comment: it claimed the node cap makes the serialized result "cannot exceed" the 25k-token SDK cap, but only the node count is bounded, not edges. State honestly that edge count is assumed to scale with nodes (holds for sparse dbt DAGs); no edge capping added. Signed-off-by: Kent <iamcxa@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tests (DRC-3532) Address PR #1418 incremental review (response-shape + test hardening). - run_id collision: row_count_diff returns a model-name-keyed dict, so {**result, "run_id": ...} would overwrite the row-count data of a dbt model literally named "run_id" while claiming existing fields are preserved. Guard both the cloud (_tool_run_backed) and local (_tool_run_backed_local) merge sites with `"run_id" not in result` so a real result key is never clobbered by the citation UUID. Pathological (no real dbt model is named run_id), but the guard is free and keeps the additive contract honest. - Tests: - local + cloud collision regression (a model named "run_id" keeps its data) - cloud non-dict result is returned unchanged (isinstance guard) - parameterize the out-of-scope boundary test over value_diff_detail, top_k_diff, histogram_diff (descriptions promise run_id only for the five rerouted local tools) Signed-off-by: Kent <iamcxa@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes DRC-3532.
Why (cross-repo, DRC-3532)
The recce-cloud summary agent wants to deep-link each factual statement in its PR
summary to the exact run it executed, via deterministic
{{run:<run_id>}}inlinemarkers (server-side marker replacement already exists in recce-cloud-infra's
recce_task_func). But the agent never received the run_id: the cloud-backendMCP tools that back the agent's analysis (row_count_diff, profile_diff,
value_diff, query, query_diff, top_k_diff, histogram_diff) route through
_tool_run_backed, which returned onlyrun["result"]and droppedrun_id.Without run_id the agent cannot cite runs deterministically, forcing fragile
server-side fuzzy prose matching (low/unstable coverage, occasional wrong links).
What
Cloud-backend path (
_tool_run_backed)_tool_run_backednow mergesrun_idinto the result dict (additive — existingresult fields preserved). Only added when the response actually carries a run_id
and the result is a dict, so run-less or non-dict responses are untouched (the
agent must never be handed, or synthesize, a run_id it was not given).
Local-mode path (
_tool_run_backed_local, DRC-3634)The larger half of the mcp_server.py diff: in local mode,
row_count_diff,query,query_diff,profile_diff, andvalue_diffare rerouted through a new_tool_run_backed_local. When an in-processRecceContextis available(
default_context()), it executes via the samesubmit_runmachinery as therecce server's run_router, so the Run is persisted to
default_context().runs(and therefore included in the S3-exported state the infra comment-builder reads)
and the returned dict carries a top-level
run_id. Whendefault_context()isNone (the cloud-backend SSE path), it falls back to plain in-executor execution
with no
run_id— identical to pre-DRC-3634 behaviour.Error semantics:
submit_run.fnrecordsrun.status=FAILED/run.errorthenreturns None on any non-
DuckDBExternalAccessBlockedexception, so the awaitedfuture resolves cleanly.
_tool_run_backed_localinspectsrun.statusandre-raises
RecceExceptionon a FAILED run, socall_toolrestoresisError=Trueand_classify_db_errorcan classify the failure — instead ofcollapsing a failed run into a success-shaped
{"run_id": ...}. The persistedFAILED Run stays in
context.runsfor citation. The 5 run-backed tooldescriptions document the additive top-level
run_idfield (MCP tooldescription = LLM agent contract).
Test
tests/test_mcp_cloud_backend.py:tests/test_mcp_server.py(TestLocalModeRunBacked):tools (e.g.
value_diff_detail) do not_classify_db_error), NOT a success-shaped{"run_id": ...}context.runsfor citation137 local mcp_server tests + 40 cloud-backend tests green.
Paired change
recce-cloud-infra: the summary agent prompt emits
{{run:<run_id>}}markersusing this run_id; fuzzy linkify is demoted to a legacy fallback. Tracked under
DRC-3532. Durable structured-citation design is DRC-3634.
🤖 Generated with Claude Code
Consumed by (cross-repo)
recce-cloud-infra PR DataRecce/recce-cloud-infra#1427 (Summary v2: trust architecture — value-grounding + run citations) depends on this PR. The instance-launcher image bakes the recce build from this branch; #1427's run-citation deep links (
[value](…/runs/<id>)) andprofile_diff-grounded row counts require the changes here. Suggested merge order: this PR → recce release/image → #1427.Also on this branch —
profile_difflowercase-column fix (DRC-3674 B)Commit
889a01b7(fix(profile): resolve lowercase columns to physical names on case-folding warehouses):ProfileDiffTask/ProfileTaskfiltered requested columns with a case-sensitivecolumn.name in selected_columnstest. On Snowflake (physical names UPPERCASE) the agent's lowercase column names matched nothing → the filter dropped every column → empty profile → the cloud summary suppressed the whole analysis. Fix lowercases both sides of the membership test (no-op on duckdb; SQL path unchanged). Regression test reproduces the empty-profile case with an uppercase-header duckdb table. This unblocks #1427's CLV/profile grounding (which was intermittently empty in UAT depending on the LLM's column casing).