docs: dv-connect troubleshooting for auth re-prompt & first-query hang (#63) #69

Open
aadharshkannan wants to merge 1 commit into microsoft:main from aadharshkannan:skillopt/dv-connect-auth-troubleshooting


dv-connect: troubleshooting for "auth keeps re-prompting" and "first query hangs" (issue #63)

Closes the documentation gap behind #63. This PR adds a focused Troubleshooting section to dv-connect/SKILL.md for two symptoms the skill previously did not cover, and softens an over-confident claim that contradicted the reported behavior.

The problem (from #63)

A user running the Dataverse plugin in GitHub Copilot reports:

  1. "GitHub Copilot keeps asking for reauthentication every session" — even though pac auth list instantly shows a valid profile for the same environment.
  2. "Simple query doesn't give a result even after 2 minutes" — it hangs, then returns nothing.

The previous dv-connect/SKILL.md actively reinforced the wrong mental model: at MCP setup it stated the browser sign-in "only happens once; the token is cached for future sessions," and it had no troubleshooting entry for either symptom.

Root cause

There are three separate authentication surfaces, each with its own token cache: the PAC CLI (pac auth), the MCP proxy (@microsoft/dataverse mcp), and the Python SDK path (scripts/auth.py, azure-identity). A valid pac auth list only proves the first. The two reported symptoms come from the other two surfaces:

  • Symptom B (query hangs ~2 min) is the most clear-cut. When the agent answers a query by running a Python/Web API script, scripts/auth.py builds an interactive DeviceCodeCredential whenever there is no saved authentication record and no service principal. In a non-interactive Copilot session, nobody can complete the device-code prompt, so the call blocks until the execution timeout (~2 min) and produces no result.
  • Symptom A (reauth every session) must be diagnosed at the surface that actually prompts (MCP proxy or Python SDK) — re-running pac auth does nothing because the PAC profile is already valid.
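The branching behind Symptom B can be sketched in a few lines. This is a minimal illustration of the decision described above, not the actual code in scripts/auth.py: the function name `pick_auth_path`, its parameters, and the returned labels are all hypothetical, and the environment variable names are taken from the CLIENT_ID/CLIENT_SECRET pair this PR mentions.

```python
from pathlib import Path


def pick_auth_path(env, record_path, interactive):
    """Return a label for the auth path a script would take.

    Hypothetical helper that mirrors the description above:
    service principal first, then a saved authentication record,
    otherwise an interactive device-code prompt.
    """
    # A service principal needs no human, so it is safe for unattended runs.
    if env.get("CLIENT_ID") and env.get("CLIENT_SECRET"):
        return "service_principal"
    # A saved authentication record allows a silent token refresh.
    if Path(record_path).exists():
        return "silent_refresh"
    # No record and no service principal: only a device-code prompt is left.
    if interactive:
        return "device_code_prompt"
    # Nobody can complete the prompt, so the call blocks until timeout.
    return "would_hang"


if __name__ == "__main__":
    import os
    import sys

    # Hypothetical record location, used only for this demonstration.
    record = Path.home() / ".dv-connect-example" / "auth_record.json"
    print(pick_auth_path(os.environ, record, sys.stdin.isatty()))
```

Under these assumptions, a non-interactive session with no record and no service principal lands on the last branch, which matches the roughly 2 minute hang reported in #63.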

The fix

A new Troubleshooting: auth keeps re-prompting, or the first query hangs section that:

  • States the three-auth-surface model up front and explicitly tells the agent not to "fix" this by re-running pac auth create/select when the PAC profile is valid.
  • Symptom A → diagnose at the prompting surface: claude mcp list should show ✓ Connected; re-register the MCP server; run --validate against the GA /api/mcp endpoint; verify tenant admin consent + environment allowlist; stabilize the proxy's cached auth.
  • Symptom B → name the non-interactive device-code trap; first-line fix: warm the token cache once by running python scripts/auth.py interactively so the AuthenticationRecord persists and later runs refresh silently. Prefer the MCP server for queries; use a service principal (CLIENT_ID/CLIENT_SECRET) for unattended/CI.
  • Symptom C → sign-in loops that persist after login = corrupt token cache; clear the relevant cache (not the PAC profile), re-auth once, re-verify consent/allowlist.
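For readers unfamiliar with the "warm the cache once" remedy under Symptom B, the following is a rough sketch of how an AuthenticationRecord can be persisted and reused with azure-identity. It is an assumption about the general pattern, not a copy of scripts/auth.py: the record path, the cache name, and the helper names (`load_record_text`, `save_record_text`, `build_credential`, `warm_cache`) are hypothetical, and the tenant, client ID, and scope must be supplied by the caller.

```python
from pathlib import Path

# Hypothetical values; scripts/auth.py may use different ones.
RECORD_PATH = Path.home() / ".dv-connect-example" / "auth_record.json"
CACHE_NAME = "dv-connect-example"


def load_record_text(path=RECORD_PATH):
    """Return the serialized record, or None if the cache was never warmed."""
    path = Path(path)
    return path.read_text(encoding="utf-8") if path.exists() else None


def save_record_text(text, path=RECORD_PATH):
    """Write the serialized record, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


def build_credential(tenant_id, client_id, path=RECORD_PATH, allow_prompt=True):
    """Build a DeviceCodeCredential that reuses a saved record when present."""
    # Imported lazily so the file helpers work without azure-identity installed.
    from azure.identity import (
        AuthenticationRecord,
        DeviceCodeCredential,
        TokenCachePersistenceOptions,
    )

    saved = load_record_text(path)
    record = AuthenticationRecord.deserialize(saved) if saved else None
    return DeviceCodeCredential(
        client_id=client_id,
        tenant_id=tenant_id,
        authentication_record=record,
        cache_persistence_options=TokenCachePersistenceOptions(name=CACHE_NAME),
        # With prompting disabled, get_token raises instead of blocking
        # on a device-code prompt that nobody can complete.
        disable_automatic_authentication=not allow_prompt,
    )


def warm_cache(tenant_id, client_id, scope, path=RECORD_PATH):
    """Run once in an interactive terminal: sign in, then persist the record."""
    credential = build_credential(tenant_id, client_id, path, allow_prompt=True)
    record = credential.authenticate(scopes=[scope])
    save_record_text(record.serialize(), path)
```

In this sketch, a later unattended run would call `build_credential(..., allow_prompt=False)` so that a missing or expired cache surfaces as an immediate error rather than a silent hang. Persistent caching depends on a platform secret store being available, so behavior may differ between machines.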

It also softens the "only happens once" MCP note to point at this section.

How SkillOpt derived and validated this

This change was found and validated with SkillOpt, our skill-evaluation harness. SkillOpt materializes the full Dataverse plugin, swaps in a candidate SKILL.md for the target skill, runs the GitHub Copilot CLI agent against held-out probe prompts, and has an LLM judge score the response against a set of semantic claims. The eval set (7 probes) ships in PR #66 (evals/skillopt/dv_connect_auth.jsonl) and is reproducible.

Goldilocks baseline on the current production dv-connect/SKILL.md:

| Skill | hard | soft | n |
| --- | --- | --- | --- |
| Production dv-connect (baseline) | 0.857 (6/7) | 0.905 | 7 |
| With this PR | 1.000 (7/7) | 0.961 | 7 |

The single baseline failure was the non-interactive device-code item. The judge's verdict on the production skill names the exact gap:

SEM FAIL P1 [warm/persist the token cache]: "The response does recommend using a configured MCP server, but it does not mention warming the token cache or saving an AuthenticationRecord for silent refreshes."

Without the documented guidance, the strong target model jumps straight to a heavyweight service-principal rewrite and omits the lightweight first-line fix (warm the cache once).

A/B stability: the recovered item was run 6× against the pristine original and 6× against this PR's skill (same harness, same judge):

| Skill | hard pass-rate | soft avg |
| --- | --- | --- |
| Original dv-connect/SKILL.md | 3/6 (0.50) | 0.83 |
| This PR | 6/6 (1.00) | 1.00 |

The distributions do not overlap: every original run scored soft ≤ 0.87 and only half passed hard; every PR run scored a perfect 1.00. The production skill leaves the model at coin-flip reliability on this remediation; with the guidance documented, the model gave the correct remediation in all six runs.
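As a quick arithmetic check, the hard pass-rates quoted in this description follow directly from the pass counts. The sketch below only recomputes those ratios; the helper name `hard_pass_rate` is hypothetical and is not part of SkillOpt, and the soft scores are not recomputed because the per-run values are not listed here.

```python
def hard_pass_rate(passed, total):
    """Fraction of runs (or probes) that passed the hard check."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


# Pass counts as reported in this PR description.
reported = [
    ("Production dv-connect (baseline), 7 probes", 6, 7),
    ("With this PR, 7 probes", 7, 7),
    ("Original SKILL.md, recovered item x6", 3, 6),
    ("This PR, recovered item x6", 6, 6),
]

for label, passed, total in reported:
    print(f"{label}: {passed}/{total} = {hard_pass_rate(passed, total):.3f}")
```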

Scope / hedging

This is a documentation-only change to one SKILL.md. It frames the global-binary MCP registration as a remediation option for persistent reauth, not as a replacement for the default npx flow, and ties each remedy to an observable symptom rather than asserting a single absolute cause. The companion eval set is in PR #66.

@aadharshkannan aadharshkannan requested a review from a team June 2, 2026 02:32
…sue microsoft#63)

Addresses two failure modes reported in microsoft#63 that the skill did not cover:
repeated sign-in every session despite valid pac auth, and a simple query
that hangs ~2 min then returns nothing.

Adds a Troubleshooting section to dv-connect/SKILL.md that:
- Clarifies the three separate auth surfaces (PAC CLI, MCP proxy, Python
  SDK) each have their own token cache, so a valid pac profile does not
  prove the MCP/Python paths are authenticated -> do not 'fix' by re-running
  pac auth.
- Symptom A (reauth every session): diagnose at the prompting surface
  (claude mcp list, re-register, --validate GA /api/mcp, verify consent +
  allowlist); stabilize the proxy's cached auth.
- Symptom B (first query hangs ~2 min): the non-interactive device-code
  trap -- auth.py's interactive DeviceCodeCredential blocks a non-interactive
  turn until timeout. First-line fix: warm the token cache once
  (python scripts/auth.py) so the AuthenticationRecord persists and later
  runs refresh silently; prefer MCP; use a service principal for CI.
- Symptom C (sign-in loops after completing login): corrupt token cache;
  clear the relevant cache (not the PAC profile), re-auth, re-verify.

Also softens the over-confident 'this only happens once' note to point at
the new section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aadharshkannan aadharshkannan force-pushed the skillopt/dv-connect-auth-troubleshooting branch from d6361e6 to ff87568 Compare June 2, 2026 02:54
@aadharshkannan aadharshkannan changed the title from "dv-connect: troubleshoot auth re-prompt + first-query hang (issue #63)" to "docs: dv-connect troubleshooting for auth re-prompt & first-query hang (#63)" Jun 2, 2026