Windows desktop support + LiteLLM provider + optional text refinement#1
Open
alarmz wants to merge 15 commits into
Open
Windows desktop support + LiteLLM provider + optional text refinement#1alarmz wants to merge 15 commits into
alarmz wants to merge 15 commits into
Conversation
The macOS-style modifier glyphs (⌘ Command, ⌥ Option) are jarring on Windows. Add a top-level _modifierLabel helper that returns Win/Ctrl/Alt/Shift on Windows and the macOS labels elsewhere, and route both the hotkey display and the recorder overlay through it. Also hide the "輔助使用權限" permission tile on Windows since SendInput does not require it (the native channel already returns true unconditionally on Windows; previously the tile was always-green-but-misleading). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZeroType already exposed a free-text "custom endpoint" for the OpenAI
provider, which technically pointed at any OpenAI-compatible proxy. But
the endpoint was hidden in an "advanced" expansion and the model list
was static, so using a LiteLLM proxy meant the user had to hand-type
whatever model alias their proxy exposed.
This commit adds a first-class `litellm` provider:
- assets/config/providers.json: new `litellm` entry with empty static
model list (filled dynamically at runtime).
- SpeechRecognitionService:
- new `case 'litellm'` that requires the user-supplied base URL and
routes the request through the existing OpenAI multipart handler,
appending /v1/audio/transcriptions to the base.
- new `fetchAvailableModels(baseUrl, apiKey)` that GETs /v1/models on
the proxy and returns id+name records for the picker.
- ModelConfigRepository: `getCachedModels` / `saveCachedModels` keyed by
providerId, JSON-encoded into SharedPreferences.
- DynamicModelsController (riverpod family by providerId, keepAlive):
loads cached list on build; `refresh(baseUrl, apiKey)` hits /v1/models
and persists.
- model_config_page UI: when the selected provider is `litellm`, render
a required Proxy-Base-URL input inline (with a hint that /v1 is added
automatically), and replace the static model dropdown with
`_LiteLLMModelPicker` — a dropdown sourced from the dynamic
controller plus a refresh icon button. Errors and "not yet fetched"
states surface inline. Other providers retain the existing static
dropdown and "advanced" custom endpoint UX.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Flutter window is created with `titleBarStyle: TitleBarStyle.hidden`, which suppresses the native Windows title bar. Without a substitute drag region this leaves the window completely immovable — fatal on multi-monitor setups where the user can't reach controls that opened off-screen. Replace the static "Zero Type" centred Container with a `_TitleBar`: - The centre region is wrapped in `DragToMoveArea` from window_manager so the user can grab the title to move/throw the window between displays. - Three trailing 46x44 buttons (minimize / maximize-or-restore / close) drive `windowManager` directly. The close button paints the standard Windows red on hover with a white icon; the others use a subtle surface-tinted hover. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user reported pressing Alt+Space showed only the literal string "錯誤"
in the floating pill, with no further information — making it impossible to
tell whether the proxy URL was wrong, the API key was bad, the model
returned 4xx, or something timed out. Three changes:
1. recording_overlay: when status is error, render the actual
`state.errorMessage` (`錯誤:<message>`) instead of the fixed "錯誤"
string. The label is wrapped in a 480-px max Flexible/ConstrainedBox so
long server bodies wrap and ellipsis at 3 lines instead of overflowing
the pill.
2. SpeechRecognitionService: wrap the OpenAI/LiteLLM POST and the LiteLLM
`/v1/models` GET in DioException-aware try/catch via a new
`_wrapDioError` helper that:
* logs full details (type, status, message, truncated body) via
`AppLogger`
* rethrows an Exception whose message includes the HTTP status, the
Dio error type, and a 400-char-truncated response body so the user
sees something actionable rather than a stack trace
The pre-existing Gemini handler is migrated to the same helper.
3. Add `AppLogger` (`lib/core/services/app_logger.dart`): an append-only
logger that writes to `%TEMP%\zero_type.log` (rotated at 1 MiB) and
mirrors to stdout. The controller's hotkey/start/stop/error paths use it
instead of `print`, and a new `_failStartup(msg)` helper unifies the
early-validation paths so they actually mutate state to error+message
(not just call the macOS-only native overlay channel) — previously on
Windows these branches showed nothing at all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pressing Alt+Space on Windows immediately threw: PlatformException(Record, null, The data specified for the media type is invalid, inconsistent, or not supported by this object., null) Root cause: the Windows Media Foundation AAC encoder accepts only 44100 or 48000 Hz sample rates (per Microsoft's AAC encoder spec). The previous hard-coded 16000 Hz — chosen because Whisper is internally 16 kHz, so the mac path avoided server-side resampling — fails on Windows with MF_E_INVALIDMEDIATYPE before any audio is captured. Branch on Platform.isWindows: 44100 Hz on Windows, 16000 Hz on macOS (unchanged). Bitrate stays 128 kbps so the upload size is identical; Whisper resamples 44.1 → 16 kHz on the server with no perceptible difference for speech. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user pointed ZeroType at a LiteLLM proxy and selected
`gemini-2.5-flash-lite`, which the proxy returned in its `/v1/models`
listing. The transcription request to `/v1/audio/transcriptions` failed
with a server-side 500 and the LiteLLM body:
litellm.APIConnectionError: Unmapped provider passed in. Unable to get
the response.
LiteLLM's `/v1/audio/transcriptions` only knows how to dispatch to
whisper-style backends (OpenAI Whisper, Deepgram, Azure Speech, …); it
has no mapping for chat/multimodal models. Modern multimodal LLMs
(Gemini, GPT-4o-audio, Claude with audio) accept audio via the chat
completions endpoint with an `input_audio` content part — and LiteLLM
already bridges that shape to each backend's native audio API.
Switch the LiteLLM provider to:
- `model.contains('whisper')` → /v1/audio/transcriptions (existing path)
- everything else → /v1/chat/completions with input_audio
The new `_transcribeWithChatCompletions` reads the recorded m4a, base64-
encodes it, posts a single user message containing the prompt and the
audio (format detected from the file extension), and parses
`choices[0].message.content` as the transcript. Token usage is read from
`usage.prompt_tokens` / `usage.completion_tokens` so the history page's
cost tracking still works.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…toggle)
Adds option C from the design conversation: take the raw transcript from the
speech provider and run it through a separate chat LLM to clean up filler
words, fix self-corrections, normalise punctuation, etc. The point is to let
users pick "cheap+fast model for transcription, smart model for polishing"
— e.g. Whisper transcribe → Claude refine.
Three independent axes (per the user's chosen design):
(a) Refinement provider/model/key/endpoint kept entirely separate from the
speech provider, with its own SharedPreferences namespace.
(b) Refinement prompt is a separate editable prompt (`TextRefinement.prompt`
asset, custom override at `<appSupport>/TextRefinement_Custom.prompt`).
(c) Off by default; activated by a toggle in the Settings page so a typo
in the refinement config can never silently double the user's bill.
Implementation:
- `assets/prompts/TextRefinement.prompt` — sensible default refinement
instructions (preserve meaning, drop fillers, fix self-corrections, format
bullet lists, restore literal punctuation).
- `PromptRepository`: refactored shared file/asset/prefs helpers and added
refinement get/save/reset/default methods. `PromptController` adds a
`RefinementPromptController` mirroring the speech one. `PromptPage`
becomes a `TabBar` with two `PromptEditor` instances.
- `ModelConfigRepository`: refinement-namespaced provider/model/apiKey/
endpoint/cachedModels methods (separate prefs keys so speech and
refinement can point at different LiteLLM proxies if desired).
- `model_config_controller`: new `RefinementProviderController` and
`DynamicRefinementModelsController` (parallel to the speech versions).
- `model_config_page`: new "文字優化(可選)" collapsible section reusing
`_ApiKeyInput`/`_LiteLLMEndpointInput`/`_ModelDropdown`. The
`_LiteLLMModelPicker` gained an `isRefinement` flag so it watches and
refreshes the refinement model cache when used in this section.
- `SettingsState`: new `refinementEnabled` field; `SettingsController`
loads/saves it via `AppConstants.isRefinementEnabledKey` (which was
already declared but unused). New tile in `settings_page` between
"開機啟動" and "歷史記錄保留時間".
- `SpeechRecognitionService.refine(rawText, …)`: text-only chat-completion
call. Routes openai/litellm via `/v1/chat/completions` (LiteLLM works
with any chat backend), gemini via `models/{id}:generateContent`. Reuses
the existing `_wrapDioError` helper so any failure produces a useful
log line and a readable exception.
- `ZeroTypeController._maybeRefine(rawText)`: called between transcribe
and clipboard/paste. Returns `null` (silent fallback to raw transcript)
if the toggle is off, the refinement provider isn't fully configured,
or the call throws — refinement should never block the user from
getting their text out. Briefly shows "優化中" in the overlay during the
call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Actions workflow at `.github/workflows/build-windows.yml`:
- Triggers on push to master/main, on `v*` tags, on PRs, and via
`workflow_dispatch`. Concurrency group cancels superseded runs.
- Runs on `windows-latest`, sets up Flutter 3.41.9 stable with cache,
enables windows desktop, fetches deps, runs build_runner codegen, then
`flutter build windows --release`.
- Reads version from pubspec.yaml and passes it into Inno Setup as
`MyAppVersion`.
- Compiles the installer (Inno Setup 6 is preinstalled on the GHA
windows-latest image; falls back to chocolatey if absent).
- Always uploads the installer .exe as a workflow artifact.
- On a `v*` tag push, additionally publishes a GitHub Release with the
installer attached and auto-generated release notes.
Installer at `installer/zerotype.iss`:
- Bundles the entire Release output and installs to {autopf}\ZeroType
(admin) or LocalAppData\Programs\ZeroType (standard user).
- `PrivilegesRequiredOverridesAllowed=dialog commandline` lets the user
decline UAC and continue as a standard user. Welcome page
(`InfoBeforeFile=install-mode-info.txt`) explains the trade-off in
Traditional Chinese + English: admin gets system-wide install plus
auto-configured microphone permission and optional launch-at-startup;
standard user gets per-user install with no system writes — the user
configures mic and startup themselves.
- Microphone consent: on admin installs, writes
HKCU\…\CapabilityAccessManager\ConsentStore\microphone\NonPackaged\
<encoded-exe-path> with Value="Allow" so the user doesn't have to dig
into Settings → Privacy → Microphone the first time. The exe path is
computed at install time via `[Code]` GetMicConsentSubkey() — Windows
encodes the path with `#` instead of `\`, so install-location changes
are handled correctly. Skipped via `Check: IsAdminInstallMode` in
standard-user mode (per the user's "admin = auto, standard = manual"
contract).
- Launch-at-startup: optional task, also gated behind
`Check: IsAdminInstallMode`. Writes to HKCU\…\Run with the install
path; cleaned up on uninstall.
- Tracks the install with a stable AppId GUID so future versions upgrade
in place rather than installing side-by-side.
`.gitignore`: ignore `installer/Output/` (the local build output of
ISCC; CI generates these fresh per run).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LiteLLM proxy returned a 400 from OpenAI's /v1/chat/completions when we
sent an m4a-format input_audio:
Invalid value: 'm4a'. Supported values are: 'wav' and 'mp3'.
OpenAI's chat-completions audio API accepts only wav and mp3, so any
non-whisper OpenAI model (gpt-4o-audio, gpt-5.5, …) rejected our recordings
even though Gemini-style models had been accepting m4a fine.
Switch the Windows recorder to AudioEncoder.wav (PCM16, 16 kHz mono).
Trade-off:
* Files grow from ~1 MB/min (AAC@128k) to ~1.9 MB/min (PCM16@16k) —
fine for local LiteLLM proxies and acceptable for direct cloud
uploads.
* Format works for *every* backend tested: OpenAI input_audio,
Gemini multimodal generateContent, Whisper /v1/audio/transcriptions,
and LiteLLM-bridged routes for all of the above.
* Sample rate drops back to 16 kHz (Whisper's native rate); the AAC
44.1 kHz workaround was only needed because Windows MF's AAC encoder
rejected 16 kHz, and PCM has no such restriction.
macOS unchanged (AAC m4a continues to work end-to-end on the existing
mac pipeline; switching it would be churn for no benefit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The GHA windows-latest Inno Setup install ships only the official language files; ChineseTraditional lives in Languages\Unofficial\ and isn't installed by default, so ISCC failed with: Couldn't open include file "compiler:Languages\ChineseTraditional.isl": The system cannot find the file specified. Drop the entry — the wizard chrome will render in English, but the substantive page (InfoBeforeFile=install-mode-info.txt) is still bilingual TC + English so the install-mode choice is unambiguous to Chinese-speaking users. If we want the wizard chrome itself in Chinese later, we can vendor the unofficial .isl into installer/ and reference it via a relative path; not bundling it now to avoid baking a third-party file we'd have to keep in sync.
When the user picked a text-only OpenAI model (gpt-5.5) on the LiteLLM proxy, the OpenAI backend returned the verbose: Invalid 'messages[0]'. Content blocks are expected to be either text or image_url type. …wrapped in HTTP boilerplate by _wrapDioError. The actual cause — 'this model is text-only, you need a multimodal/audio one' — was buried. Detect that specific OpenAI error pattern (image_url + 'Content blocks are expected') in the LiteLLM chat path and rewrite into a Traditional Chinese message that names the user-selected model and lists working alternatives: • gemini-2.5-flash-lite / gemini-3-flash-preview (multimodal) • claude-haiku-4-5 / claude-sonnet-* (multimodal) • gpt-4o-audio-preview / gpt-4o-mini-audio-preview • whisper-1 (transcription endpoint, auto-routed) All four families above were verified end-to-end against the user's LiteLLM proxy at v1.1.0.
The v1.1.0 tag run failed at the 'Create GitHub Release' step with HTTP 403: 'Resource not accessible by integration'. The default GITHUB_TOKEN on a fork only has read access to repository contents; uploading a release requires the elevated permission to be requested explicitly via `permissions: contents: write`.
- Replace placeholder your-username/zerotype URLs with alarmz/ZeroType - Add Windows install section using ZeroTypeSetup-x.y.z.exe with the admin-vs-standard mode explanation that mirrors the installer's bilingual welcome page - Document LiteLLM provider end-to-end: base URL format, the dynamic /v1/models picker, and the model-type → endpoint routing table (whisper-* → /v1/audio/transcriptions, multimodal → chat completions with input_audio, text-only → unsupported) - Mention the new optional text-refinement feature (independent provider/prompt + Settings toggle) - Update default-hotkey lines and paste-simulation lines to mention both ⌘V on macOS and Ctrl+V on Windows - Add a v1.1.0 release-notes block; demote v1.0.2 from 'current' - Note that Windows accessibility permission is NOT required (uses SendInput, no keyboard automation consent needed)
User reported transcripts arriving with the system prompt's internal
sections appended:
<transcript>
self_correction:
- 【後者為準】無遺漏的修正訊號。
- …
language: …
dictionary: …
Or, alternatively, fully wrapped in the prompt's YAML format with the
real answer hidden inside an `output:` block:
self_correction: …
output: |
我錄了兩次,第二次會出現很多說明…
Root cause: the SpeechToText.prompt is structured as YAML — `instructions:`,
`examples:` (with `reasoning:` and `output:` per example), `self_correction:`,
`language:`, `dictionary:`. Gemini 2.5 Flash *does* respect the convention
because its thinking-mode channel separates reasoning from final answer; but
gemini-2.5-flash-lite has no such channel, so it imitates the YAML form
and emits the structure verbatim alongside its answer.
Add `_stripPromptStructureEcho()` in SpeechRecognitionService applied to
the chat-completions content. The sanitizer:
- Detects top-level YAML keys (`self_correction|reasoning|language|
dictionary|examples|instructions|name|description`) at line start.
- If an explicit `output:` block exists (Pattern B), extracts and
un-indents its YAML literal value.
- Otherwise (Pattern A — transcript first, structure trailing), cuts
everything from the first structural key onward.
The log now shows both lengths so the effect is observable:
[LiteLLM-chat] success length=72 (raw=487) tokens: in=2825 out=312
…meaning the sanitizer trimmed 415 chars of structural noise.
The upstream prompt is left untouched — this is purely a defensive
client-side fix, so the upstream maintainer can keep their prompt format
without anyone needing to coordinate.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
從 v1.0.2 fork,補上 Windows 平台支援並擴充幾個自己用得到的功能。整 fork 已測過 end-to-end(錄音 → 轉錄 → 自動貼上)並發過 v1.1.0 release。如果你覺得 PR 太大或某些部分不想收,我可以切成多個 PR、或挑你要的部分另外整理。
主要變更
🪟 Windows 全平台支援
windows/runner/channel_handler.cpp:原生 Windows runner 實作 paste(SendInput Ctrl+V)、麥克風權限 stub(Windows 不需 Accessibility)、overlay 與 control channellib/core/services/recording_service.dart:Windows 用 WAV(16 kHz mono PCM,相容 OpenAI/Whisper/Gemini),macOS 維持 AAC m4alib/shared/widgets/recording_overlay.dart:Windows 用 Flutter widget overlay(macOS 仍走 NSPanel)lib/shared/widgets/main_shell.dart:標題列改成可拖曳 + 最小化/最大化/關閉按鈕(補上隱藏 native title bar 後缺的視窗控制)settings_page.dart:modifier 標籤 platform-aware(Windows 顯示 Win/Ctrl/Alt/Shift),accessibility tile 在 Windows 隱藏🔌 LiteLLM Proxy 整合
assets/config/providers.json加litellmproviderSpeechRecognitionService新增fetchAvailableModels()抓/v1/modelswhisper-*→/v1/audio/transcriptions,其他 →/v1/chat/completions多模態 input_audio✨ 文字優化(可選)
prompts/TextRefinement.prompt新 default📦 Windows installer + GitHub Actions CI
.github/workflows/build-windows.yml:每 push 都 build Windows release + Inno Setup installer artifact;push tagv*自動發 GitHub Releaseinstaller/zerotype.iss:Inno Setup 安裝程式HKCU\…\CapabilityAccessManager\ConsentStore\microphone\NonPackaged\<encoded-exe-path>)+ 開機啟動%LOCALAPPDATA%\Programs\ZeroType,不碰系統設定🔎 Logging
lib/core/services/app_logger.dart:append-only logger 寫%TEMP%\zero_type.log,1 MiB rolling,同時 mirror stdout_wrapDioError處理📝 README
設計決策
/v1/audio/transcriptions和 chat completions 的input_audio只接受 wav/mp3;macOS 路徑保持 m4a 沒動以免 regression/v1/models而非寫死:LiteLLM proxy 的 model alias 由使用者自己配置,不可能 hard-code向後相容
macOS 原本的路徑完全沒動:
app_constants.dart中早已宣告但未使用的 key,沒新增 key 名)測試狀況
如何取捨
如果整 PR 太大,建議優先合:
歡迎直接 cherry-pick,或告訴我重新切 PR。
🤖 Most of the engineering work was done in collaboration with Claude (Anthropic).