
Voice, AI Skills, and Plan Mode for the IDE assistants #11

Open
bigbansal wants to merge 5 commits into main from voice-skills-planmode


@bigbansal
Contributor

Summary

Adds three chat-surface features to both CodeSetu IDE assistants (apps/vscode, apps/jetbrains) at feature parity, behind opt-in toggles and settings:

  • Plan Mode — chat toggle that asks the model for a numbered plan + clarifying questions instead of code edits. "Approve & Run" sends APPROVED — proceed with implementation and exits Plan Mode. Single source of truth in skills/plan-mode/SKILL.md, runtime constants in TS and Kotlin.
  • AI Skills runtime — deterministic router (pinned + slash + keyword, capped at 1 auto-routed) with a slash-command palette in the composer. Ships 4 new built-ins (/explain, /refactor, /test, /indic) alongside the new plan-mode skill. Workspace .codesetu/skills/*.md continue to load always-on — no regression. codesetu.skills.autoRoute lets users disable keyword auto-routing.
  • Voice (STT + TTS, 5 backends) — new @codesetu/core/speech package with browser, local, sarvam, openai-compatible, huggingface. Mic button + TTS toggle in the shared chat template. Browser/local run entirely in the webview; server backends post audio over the host bridge. CSP tightened to media-src 'self' blob: + an allowlisted connect-src derived from configured speech endpoints. New CodeSetu: Setup Speech Provider command (VSCode) and a separate codesetu.speech.apiKey OS-secret slot on both IDEs (Sarvam Saaras/Bulbul keys differ from chat keys).
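
The router's ordering (pinned, then slash, then at most one keyword match) can be sketched roughly as below. This is a minimal illustration, not the code in `@codesetu/core`: the `Skill` and `RouteInput` shapes, the `routeSkills` name, the keyword lists, and the choice to skip keyword routing when a slash command is present are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real router in @codesetu/core may differ.
interface Skill {
  slug: string;       // e.g. "refactor"
  slash: string;      // e.g. "/refactor"
  keywords: string[]; // lower-case trigger phrases (illustrative only)
}

interface RouteInput {
  message: string;
  pinned: string[];   // slugs that are always active, e.g. "plan-mode"
  autoRoute: boolean; // mirrors the codesetu.skills.autoRoute setting
}

const MAX_AUTO_ROUTED = 1;

// Slash-to-slug mapping taken from the PR text; keywords are made up.
const BUILT_INS: Skill[] = [
  { slug: "explain-code", slash: "/explain", keywords: ["explain"] },
  { slug: "refactor", slash: "/refactor", keywords: ["refactor"] },
  { slug: "write-tests", slash: "/test", keywords: ["write tests", "unit test"] },
  { slug: "indic-comments", slash: "/indic", keywords: ["hindi", "tamil", "bengali"] },
];

function routeSkills(input: RouteInput, skills: Skill[] = BUILT_INS): string[] {
  // 1. Pinned skills always come first.
  const active: string[] = [...input.pinned];
  const add = (slug: string): void => {
    if (!active.includes(slug)) active.push(slug);
  };

  // 2. An explicit slash command at the start of the message.
  const text = input.message.trim().toLowerCase();
  const firstToken = text.split(/\s+/)[0] ?? "";
  const slashed = skills.find((s) => s.slash === firstToken);
  if (slashed) add(slashed.slug);

  // 3. Keyword auto-routing, capped, and (by assumption) skipped when
  //    the user already chose a skill with a slash command.
  if (input.autoRoute && !slashed) {
    let routed = 0;
    for (const skill of skills) {
      if (routed >= MAX_AUTO_ROUTED) break;
      if (active.includes(skill.slug)) continue;
      if (skill.keywords.some((k) => text.includes(k))) {
        add(skill.slug);
        routed += 1;
      }
    }
  }
  return active;
}
```

With these assumptions, `routeSkills({ message: "please refactor this function for readability", pinned: [], autoRoute: true })` returns `["refactor"]`, and the same call with `autoRoute: false` returns `[]`. Iterating the skill list in a fixed order is what keeps the result deterministic when several keywords match.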

Three commits, each independently shippable:

  • 737df99 feat(plan-mode)
  • 1c94175 feat(skills)
  • 94e5e91 feat(voice)

Outstanding before voice is fully usable in JetBrains

JCEF mic permission spike: getUserMedia is blocked by default in JBCef. The UI surfaces a clear error when this happens. The fix is to add --enable-media-stream to JBCefApp.getInstance() args and a CefPermissionHandler that auto-approves audio. Browser-side speechSynthesis (TTS read-aloud) works in JCEF without any extra flags.

VSCode voice has no such spike — the VSCode webview's getUserMedia works as soon as the user grants permission.

What's intentionally NOT in this PR

  • The api-client products (apps/vscode-apiclient, apps/jetbrains-apiclient, packages/api-client-core) are untouched.
  • Inline completions and the right-click editor actions (Explain/Refactor/etc.) are untouched. Voice/skills/plan-mode are chat-surface features only.
  • Tool-call dry-run plan mode (Claude-Code-style "approve every tool call") is designed for but not built: the prompt structure and the skills-as-pinned-fragments pattern are in place, so it can layer on once a tool-execution loop exists.

Test gate

All green on the branch tip:

  • @codesetu/core: build · lint · 54 tests across 6 files (9 new for the skills router, 14 new for speech)
  • apps/vscode: esbuild bundle (937.7 KB) · ESLint clean
  • apps/jetbrains: compileKotlin + JUnit suite pass (no new tests, settings/prompt changes covered by existing model/payload tests)

Test plan

  • VSCode Plan Mode — F5 the Extension Dev Host, open the chat, toggle Plan Mode in the + menu, send a small request. Confirm the assistant produces a numbered checklist with no code edits. Click "Approve & Run" → confirm the next turn implements and Plan Mode is off.
  • VSCode Skills slash palette — type / in the composer. Confirm the palette opens with 5 entries (plan/explain/refactor/test/indic). Arrow keys + Enter inserts the command + space. Send /explain with code selected; verify the response is structured per the explain-code skill (check Output channel for the system prompt if needed).
  • VSCode Skills auto-route — send "please refactor this function for readability" without slash; confirm the refactor skill activates. Set codesetu.skills.autoRoute=false and confirm it no longer auto-activates.
  • VSCode voice — browser — click the mic, allow permission, speak. Verify the transcript appears in the input. Toggle the speaker icon and send a message; verify the response is read aloud via speechSynthesis.
  • VSCode voice — server — run CodeSetu: Setup Speech Provider, pick sarvam (or openai-compatible with a Whisper endpoint), supply a key. Repeat the mic + TTS flow; verify the host log shows Speech.transcribe via <provider>.
  • JetBrains Plan Mode: run ./gradlew :runIde, then repeat the Plan Mode flow above.
  • JetBrains slash palette — repeat the /explain + auto-route checks.
  • JetBrains voice — TTS only — enable TTS toggle, confirm speechSynthesis reads responses aloud in JCEF.
  • JetBrains voice — STT: expected to fail until the JCEF mic permission spike lands; verify the failure message points the user to --enable-media-stream.
  • CSP regression check — confirm chat still renders, model/skills/slash all work, no console CSP violations in either webview.

🤖 Generated with Claude Code

bigbansal and others added 3 commits May 28, 2026 21:19
Plan Mode is a new chat-surface mode that asks the assistant to produce
a numbered plan and clarifying questions before any implementation. The
mode is pinned via a single skill (`skills/plan-mode/SKILL.md`) injected
into the system prompt as a `pinnedSkills` entry, designed so the Phase
2 skills runtime can layer on top using the same injection point.

Both IDEs get the same UX: a "Plan Mode" toggle in the composer menu, a
"Plan" pill on the composer when active, and an "Approve & Run" button
that appears after a plan-mode assistant turn — it sends "APPROVED —
proceed with implementation" and exits Plan Mode for the rest of the
session.
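
One way the approval hand-off could look, as a rough sketch: the canonical phrase comes from the text above, but the matching rule (whole-message APPROVED or RUN, or a message that starts with APPROVED) and the `pinnedSkillsForTurn` helper are assumptions for illustration, and the signature of the real approval check is not shown in this PR.

```typescript
// Canonical phrase sent by the "Approve & Run" button (from the PR text).
const APPROVAL_PHRASE = "APPROVED — proceed with implementation";

// Assumption: approval means the message is exactly APPROVED or RUN
// (case-insensitive), or starts with "APPROVED " like the canonical phrase.
function isPlanModeApproval(message: string): boolean {
  const normalized = message.trim().toUpperCase();
  return (
    normalized === "APPROVED" ||
    normalized === "RUN" ||
    normalized.startsWith("APPROVED ")
  );
}

// Hypothetical helper: which skills to pin into the system prompt this turn.
// On an approval turn the plan-mode skill is dropped so the model implements
// the plan instead of producing another one.
function pinnedSkillsForTurn(planModeOn: boolean, message: string): string[] {
  if (!planModeOn || isPlanModeApproval(message)) return [];
  return ["plan-mode"];
}
```

Under these assumptions, `pinnedSkillsForTurn(true, "add a cache layer")` returns `["plan-mode"]`, while `pinnedSkillsForTurn(true, APPROVAL_PHRASE)` returns `[]`. Requiring RUN to be the whole message avoids treating an ordinary request such as "run the tests" as approval.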

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds a deterministic skills router (pinned + slash + keyword, capped at
1 auto-routed) shared by both IDE assistants, plus a slash-command
palette in the chat composer that opens on `/`. Ships four built-in
skills alongside the existing plan-mode skill:

  /explain  -> explain-code
  /refactor -> refactor
  /test     -> write-tests
  /indic    -> indic-comments (Hindi/Tamil/Bengali/...)

The router lives in `@codesetu/core` and is mirrored in Kotlin for the
JetBrains plugin. Workspace `.codesetu/skills/*.md` continue to load
always-on as today — no regression. Auto-routing can be disabled via
the new `codesetu.skills.autoRoute` setting (VSCode) or the matching
checkbox in JetBrains settings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ains UI

Adds a SpeechProvider package (`@codesetu/core/speech`) with five backends:

  browser           Browser SpeechRecognition + speechSynthesis (no key)
  local             Same as browser; refuses any server fallback (air-gapped)
  sarvam            Sarvam Saaras (STT) + Bulbul (TTS)
  openai-compatible /v1/audio/transcriptions + /v1/audio/speech
  huggingface       Hugging Face Inference Router (Whisper-large-v3)

The chat webview gets a mic button (idle / listening / transcribing
states with a red pulse) and a TTS toggle in the composer toolbar.
Browser/local paths run entirely in the webview; server paths post
audio bytes to the host over the existing message bridge, which calls
the SpeechProvider and pushes the transcription or synthesized audio
back to the webview. CSP gains a tight `media-src 'self' blob:` and a
`connect-src` allowlist derived from the configured speech endpoints.
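
Deriving the `connect-src` allowlist from configured endpoints might look something like the sketch below. The `speechCspDirectives` name, the inclusion of `'self'` in `connect-src`, and the restriction to http/https origins are assumptions for the example; only the `media-src 'self' blob:` value is taken from the text above.

```typescript
// Hypothetical helper: turn configured speech endpoint URLs into the two
// CSP directives described above. Not the actual CodeSetu implementation.
function speechCspDirectives(endpoints: string[]): string[] {
  const origins = new Set<string>();
  for (const endpoint of endpoints) {
    try {
      const url = new URL(endpoint);
      // Assumption: only http(s) endpoints are allowlisted, by origin.
      if (url.protocol === "https:" || url.protocol === "http:") {
        origins.add(url.origin);
      }
    } catch {
      // Malformed URLs are skipped rather than widening the policy.
    }
  }
  // Sorting keeps the generated policy stable across runs.
  const connect = ["'self'", ...Array.from(origins).sort()].join(" ");
  return ["media-src 'self' blob:", `connect-src ${connect}`];
}
```

For example, passing `["https://stt.example.com/v1/audio/transcriptions", "not a url"]` yields `media-src 'self' blob:` and `connect-src 'self' https://stt.example.com`. Allowlisting by origin rather than full path keeps the policy short while still excluding hosts that are not configured.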

VSCode ships a new `CodeSetu: Setup Speech Provider` command, a
separate `codesetu.speech.apiKey` secret slot, and `codesetu.speech.*`
settings (sttProvider, ttsProvider, language, ttsEnabled,
sttBaseUrl/sttModel, ttsBaseUrl/ttsModel).

JetBrains mirrors the UI, settings, secret store (`CodeSetuSpeechApiKeyStore`),
and host bridge (`CodeSetuSpeechClient` over JDK HttpClient). The mic
button shows a clear error if JCEF blocks `getUserMedia` — server STT
on JetBrains still depends on a pending JCEF `--enable-media-stream`
spike before mic input works end-to-end. TTS via browser speechSynthesis
works in JCEF without additional flags.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bigbansal changed the base branch from codex-ide-feature-foundation to main May 28, 2026 16:46
bigbansal and others added 2 commits May 28, 2026 22:36
JetBrains plugin now registers a JBCefAppRequiredArgumentsProvider that
adds the two CEF flags required for the chat webview's mic button to
work:

  --enable-features=WebRTC,MediaStream,AudioServiceOutOfProcess
  --use-fake-ui-for-media-stream

Without the first, getUserMedia throws NotSupportedError before the
user is ever prompted. The second auto-approves the in-page permission
request (the OS still gates physical mic access). The trade-off
applies to every JCEF webview in the IDE process and is documented in
apps/jetbrains/README.md under "Voice in JetBrains".

Also updates the top-level README, both app READMEs, and CHANGELOG so
reviewers and testers know what to look for during the smoke test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Round of cleanups requested after PR review:

* Strip TTS end-to-end — toggle, settings (ttsProvider, ttsEnabled,
  ttsBaseUrl, ttsModel), host bridges, Sarvam/OpenAI/HF synthesize, the
  webview speakViaBrowser/speakViaServer/maybeSpeakAssistant paths, and
  the related tests. Voice is now STT-only.
* Drop the "local" speech provider — it was misleading (routed to the
  same WebSpeech path as "browser", not actually on-device).
* Sarvam STT default model bumped to saarika:v2; response parsing kept
  loose so a Sarvam-side rename doesn't break us silently.
* JetBrains default STT provider switched to sarvam (browser
  SpeechRecognition does not work in JCEF — no Google cloud-speech keys
  in the embedded Chromium build).
* Mic UX: pointerdown >250ms = push-to-talk (release stops), short press
  = tap-to-toggle, spacebar in an empty/focused composer also push-to-
  talks, Esc stops an active capture. Same in both webviews.
* Wire isPlanModeApproval into both responders — typing APPROVED / RUN
  (or clicking Approve & Run, which sends the canonical phrase) drops
  plan-mode pinning for that turn so the model implements instead of
  re-planning. Kills the previously-dead helper.
* Slash menu and composer (+) menu are now mutually exclusive.
* Editor actions (Explain Selection etc.) inherit the user's current
  Plan Mode pick via a uiState message from the webview to the host.
* JetBrains: persist the chat Plan Mode toggle across panel reloads via
  CodeSetuSettingsState.chatPlanModeOn, templated into chat.html on
  render.
* JetBrains: new Tools → CodeSetu → Setup Speech Provider action that
  mirrors the VSCode wizard.
* Bump both apps to 0.3.0 and add JetBrains change-notes entry.
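
The mic gesture rule from the list above (a press longer than 250ms is push-to-talk, a shorter press is tap-to-toggle) could be expressed as a pair of pure functions, roughly as follows. The function names and the idea of passing in pointerdown/pointerup timestamps are assumptions for illustration; the webview code itself is not shown in this PR.

```typescript
// Threshold from the commit message: presses longer than this are holds.
const HOLD_THRESHOLD_MS = 250;

type MicGesture = "push-to-talk" | "tap-to-toggle";

// Classify a press from its pointerdown and pointerup timestamps (ms).
function classifyPress(downAt: number, upAt: number): MicGesture {
  return upAt - downAt > HOLD_THRESHOLD_MS ? "push-to-talk" : "tap-to-toggle";
}

// Hypothetical helper: should capture still be running after release?
// Push-to-talk always stops on release; a tap flips the previous state.
function listeningAfterRelease(
  wasListeningBeforePress: boolean,
  downAt: number,
  upAt: number,
): boolean {
  if (classifyPress(downAt, upAt) === "push-to-talk") return false;
  return !wasListeningBeforePress;
}
```

Here `classifyPress(0, 400)` is `"push-to-talk"` and `classifyPress(0, 100)` is `"tap-to-toggle"`. Keeping the rule free of DOM access means the same logic can be unit-tested once and reused by both webviews.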

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>