Windows desktop support + LiteLLM provider + optional text refinement by alarmz · Pull Request #1 · nick1ee/ZeroType

alarmz · 2026-05-01T09:24:32Z

從 v1.0.2 fork，補上 Windows 平台支援並擴充幾個自己用得到的功能。整 fork 已測過 end-to-end（錄音 → 轉錄 → 自動貼上）並發過 v1.1.0 release。如果你覺得 PR 太大或某些部分不想收，我可以切成多個 PR、或挑你要的部分另外整理。

主要變更

🪟 Windows 全平台支援

windows/runner/channel_handler.cpp：原生 Windows runner 實作 paste（SendInput Ctrl+V）、麥克風權限 stub（Windows 不需 Accessibility）、overlay 與 control channel
lib/core/services/recording_service.dart：Windows 用 WAV（16 kHz mono PCM，相容 OpenAI/Whisper/Gemini），macOS 維持 AAC m4a
lib/shared/widgets/recording_overlay.dart：Windows 用 Flutter widget overlay（macOS 仍走 NSPanel）
lib/shared/widgets/main_shell.dart：標題列改成可拖曳 + 最小化/最大化/關閉按鈕（補上隱藏 native title bar 後缺的視窗控制）
settings_page.dart：modifier 標籤 platform-aware（Windows 顯示 Win/Ctrl/Alt/Shift），accessibility tile 在 Windows 隱藏

🔌 LiteLLM Proxy 整合

assets/config/providers.json 加 litellm provider
SpeechRecognitionService 新增 fetchAvailableModels() 抓 /v1/models
路由：whisper-* → /v1/audio/transcriptions，其他 → /v1/chat/completions 多模態 input_audio
偵測到 OpenAI text-only 模型錯誤時改寫為清晰中文提示
repository / controller / page 全套擴充支援動態 model 抓取與快取（separate prefs key namespace）

✨ 文字優化（可選）

prompts/TextRefinement.prompt 新 default
整套 refinement 配置與 transcription 完全獨立（可 Whisper 轉錄 + Claude 優化）
設定頁 toggle 控制，預設關閉，失敗自動 fallback 用原始轉錄文字
涉及 PromptRepository / ModelConfigRepository 擴充、新 RefinementProviderController + DynamicRefinementModelsController + RefinementPromptController；UI 在「模型」頁加 collapsible section、「提示詞」改成 TabBar

📦 Windows installer + GitHub Actions CI

.github/workflows/build-windows.yml：每 push 都 build Windows release + Inno Setup installer artifact；push tag v* 自動發 GitHub Release
installer/zerotype.iss：Inno Setup 安裝程式
- 第一頁中英對照解釋「admin = 自動配置 / 一般 = 你自己設定」
- admin 模式自動寫入 Windows 麥克風權限白名單（HKCU\…\CapabilityAccessManager\ConsentStore\microphone\NonPackaged\<encoded-exe-path>）+ 開機啟動
- 一般模式裝到 %LOCALAPPDATA%\Programs\ZeroType，不碰系統設定

🔎 Logging

lib/core/services/app_logger.dart：append-only logger 寫 %TEMP%\zero_type.log，1 MiB rolling，同時 mirror stdout
浮動圓條 error 狀態顯示完整錯誤訊息（含 HTTP status + response body）而非固定「錯誤」二字
DioException 統一透過 _wrapDioError 處理

📝 README

全面更新：Windows 安裝段落、LiteLLM 設定教學（含模型路由表）、文字優化說明、v1.1.0 release notes

設計決策

Refinement 預設關閉：避免使用者不小心讓帳單翻倍；設定不完整時靜默 skip
WAV 而非 AAC：OpenAI /v1/audio/transcriptions 和 chat completions 的 input_audio 只接受 wav/mp3；macOS 路徑保持 m4a 沒動以免 regression
動態 /v1/models 而非寫死：LiteLLM proxy 的 model alias 由使用者自己配置，不可能 hard-code
Installer admin/standard 雙模式：不強制 admin，但用 admin 旗標當 explicit signal 讓使用者選擇是否要 auto-config

向後相容

macOS 原本的路徑完全沒動：

AppDelegate.swift 與 native overlay 未改
AAC m4a 錄音格式維持
既有 prompt assets 與 prefs key 完全保留（refinement 用了 app_constants.dart 中早已宣告但未使用的 key，沒新增 key 名）

測試狀況

✅ Windows 11 Pro 25H2 end-to-end 驗過 LiteLLM proxy → 多個 model（gemini-2.5-flash-lite / gemini-3-flash-preview / claude-haiku-4-5）→ 文字貼上
✅ Inno Setup installer 雙模式測過
✅ CI release pipeline 跑通並發過 v1.1.0
❌ macOS 沒回歸測（fork 上沒有 mac 環境），但 mac 路徑完全未動

如何取捨

如果整 PR 太大，建議優先合：

Windows 平台支援（核心：channel_handler.cpp + recording WAV + main_shell drag bar）
Logging（小、有用、沒爭議）
其他模組（LiteLLM / refinement / installer）視意願取捨

歡迎直接 cherry-pick，或告訴我重新切 PR。

🤖 Most of the engineering work was done in collaboration with Claude (Anthropic).

The macOS-style modifier glyphs (⌘ Command, ⌥ Option) are jarring on Windows. Add a top-level _modifierLabel helper that returns Win/Ctrl/Alt/Shift on Windows and the macOS labels elsewhere, and route both the hotkey display and the recorder overlay through it. Also hide the "輔助使用權限" permission tile on Windows since SendInput does not require it (the native channel already returns true unconditionally on Windows; previously the tile was always-green-but-misleading). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ZeroType already exposed a free-text "custom endpoint" for the OpenAI provider, which technically pointed at any OpenAI-compatible proxy. But the endpoint was hidden in an "advanced" expansion and the model list was static, so using a LiteLLM proxy meant the user had to hand-type whatever model alias their proxy exposed. This commit adds a first-class `litellm` provider: - assets/config/providers.json: new `litellm` entry with empty static model list (filled dynamically at runtime). - SpeechRecognitionService: - new `case 'litellm'` that requires the user-supplied base URL and routes the request through the existing OpenAI multipart handler, appending /v1/audio/transcriptions to the base. - new `fetchAvailableModels(baseUrl, apiKey)` that GETs /v1/models on the proxy and returns id+name records for the picker. - ModelConfigRepository: `getCachedModels` / `saveCachedModels` keyed by providerId, JSON-encoded into SharedPreferences. - DynamicModelsController (riverpod family by providerId, keepAlive): loads cached list on build; `refresh(baseUrl, apiKey)` hits /v1/models and persists. - model_config_page UI: when the selected provider is `litellm`, render a required Proxy-Base-URL input inline (with a hint that /v1 is added automatically), and replace the static model dropdown with `_LiteLLMModelPicker` — a dropdown sourced from the dynamic controller plus a refresh icon button. Errors and "not yet fetched" states surface inline. Other providers retain the existing static dropdown and "advanced" custom endpoint UX. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Flutter window is created with `titleBarStyle: TitleBarStyle.hidden`, which suppresses the native Windows title bar. Without a substitute drag region this leaves the window completely immovable — fatal on multi-monitor setups where the user can't reach controls that opened off-screen. Replace the static "Zero Type" centred Container with a `_TitleBar`: - The centre region is wrapped in `DragToMoveArea` from window_manager so the user can grab the title to move/throw the window between displays. - Three trailing 46x44 buttons (minimize / maximize-or-restore / close) drive `windowManager` directly. The close button paints the standard Windows red on hover with a white icon; the others use a subtle surface-tinted hover. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The user reported pressing Alt+Space showed only the literal string "錯誤" in the floating pill, with no further information — making it impossible to tell whether the proxy URL was wrong, the API key was bad, the model returned 4xx, or something timed out. Three changes: 1. recording_overlay: when status is error, render the actual `state.errorMessage` (`錯誤：<message>`) instead of the fixed "錯誤" string. The label is wrapped in a 480-px max Flexible/ConstrainedBox so long server bodies wrap and ellipsis at 3 lines instead of overflowing the pill. 2. SpeechRecognitionService: wrap the OpenAI/LiteLLM POST and the LiteLLM `/v1/models` GET in DioException-aware try/catch via a new `_wrapDioError` helper that: * logs full details (type, status, message, truncated body) via `AppLogger` * rethrows an Exception whose message includes the HTTP status, the Dio error type, and a 400-char-truncated response body so the user sees something actionable rather than a stack trace The pre-existing Gemini handler is migrated to the same helper. 3. Add `AppLogger` (`lib/core/services/app_logger.dart`): an append-only logger that writes to `%TEMP%\zero_type.log` (rotated at 1 MiB) and mirrors to stdout. The controller's hotkey/start/stop/error paths use it instead of `print`, and a new `_failStartup(msg)` helper unifies the early-validation paths so they actually mutate state to error+message (not just call the macOS-only native overlay channel) — previously on Windows these branches showed nothing at all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pressing Alt+Space on Windows immediately threw: PlatformException(Record, null, The data specified for the media type is invalid, inconsistent, or not supported by this object., null) Root cause: the Windows Media Foundation AAC encoder accepts only 44100 or 48000 Hz sample rates (per Microsoft's AAC encoder spec). The previous hard-coded 16000 Hz — chosen because Whisper is internally 16 kHz, so the mac path avoided server-side resampling — fails on Windows with MF_E_INVALIDMEDIATYPE before any audio is captured. Branch on Platform.isWindows: 44100 Hz on Windows, 16000 Hz on macOS (unchanged). Bitrate stays 128 kbps so the upload size is identical; Whisper resamples 44.1 → 16 kHz on the server with no perceptible difference for speech. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The user pointed ZeroType at a LiteLLM proxy and selected `gemini-2.5-flash-lite`, which the proxy returned in its `/v1/models` listing. The transcription request to `/v1/audio/transcriptions` failed with a server-side 500 and the LiteLLM body: litellm.APIConnectionError: Unmapped provider passed in. Unable to get the response. LiteLLM's `/v1/audio/transcriptions` only knows how to dispatch to whisper-style backends (OpenAI Whisper, Deepgram, Azure Speech, …); it has no mapping for chat/multimodal models. Modern multimodal LLMs (Gemini, GPT-4o-audio, Claude with audio) accept audio via the chat completions endpoint with an `input_audio` content part — and LiteLLM already bridges that shape to each backend's native audio API. Switch the LiteLLM provider to: - `model.contains('whisper')` → /v1/audio/transcriptions (existing path) - everything else → /v1/chat/completions with input_audio The new `_transcribeWithChatCompletions` reads the recorded m4a, base64- encodes it, posts a single user message containing the prompt and the audio (format detected from the file extension), and parses `choices[0].message.content` as the transcript. Token usage is read from `usage.prompt_tokens` / `usage.completion_tokens` so the history page's cost tracking still works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…toggle) Adds option C from the design conversation: take the raw transcript from the speech provider and run it through a separate chat LLM to clean up filler words, fix self-corrections, normalise punctuation, etc. The point is to let users pick "cheap+fast model for transcription, smart model for polishing" — e.g. Whisper transcribe → Claude refine. Three independent axes (per the user's chosen design): (a) Refinement provider/model/key/endpoint kept entirely separate from the speech provider, with its own SharedPreferences namespace. (b) Refinement prompt is a separate editable prompt (`TextRefinement.prompt` asset, custom override at `<appSupport>/TextRefinement_Custom.prompt`). (c) Off by default; activated by a toggle in the Settings page so a typo in the refinement config can never silently double the user's bill. Implementation: - `assets/prompts/TextRefinement.prompt` — sensible default refinement instructions (preserve meaning, drop fillers, fix self-corrections, format bullet lists, restore literal punctuation). - `PromptRepository`: refactored shared file/asset/prefs helpers and added refinement get/save/reset/default methods. `PromptController` adds a `RefinementPromptController` mirroring the speech one. `PromptPage` becomes a `TabBar` with two `PromptEditor` instances. - `ModelConfigRepository`: refinement-namespaced provider/model/apiKey/ endpoint/cachedModels methods (separate prefs keys so speech and refinement can point at different LiteLLM proxies if desired). - `model_config_controller`: new `RefinementProviderController` and `DynamicRefinementModelsController` (parallel to the speech versions). - `model_config_page`: new "文字優化（可選）" collapsible section reusing `_ApiKeyInput`/`_LiteLLMEndpointInput`/`_ModelDropdown`. The `_LiteLLMModelPicker` gained an `isRefinement` flag so it watches and refreshes the refinement model cache when used in this section. - `SettingsState`: new `refinementEnabled` field; `SettingsController` loads/saves it via `AppConstants.isRefinementEnabledKey` (which was already declared but unused). New tile in `settings_page` between "開機啟動" and "歷史記錄保留時間". - `SpeechRecognitionService.refine(rawText, …)`: text-only chat-completion call. Routes openai/litellm via `/v1/chat/completions` (LiteLLM works with any chat backend), gemini via `models/{id}:generateContent`. Reuses the existing `_wrapDioError` helper so any failure produces a useful log line and a readable exception. - `ZeroTypeController._maybeRefine(rawText)`: called between transcribe and clipboard/paste. Returns `null` (silent fallback to raw transcript) if the toggle is off, the refinement provider isn't fully configured, or the call throws — refinement should never block the user from getting their text out. Briefly shows "優化中" in the overlay during the call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GitHub Actions workflow at `.github/workflows/build-windows.yml`: - Triggers on push to master/main, on `v*` tags, on PRs, and via `workflow_dispatch`. Concurrency group cancels superseded runs. - Runs on `windows-latest`, sets up Flutter 3.41.9 stable with cache, enables windows desktop, fetches deps, runs build_runner codegen, then `flutter build windows --release`. - Reads version from pubspec.yaml and passes it into Inno Setup as `MyAppVersion`. - Compiles the installer (Inno Setup 6 is preinstalled on the GHA windows-latest image; falls back to chocolatey if absent). - Always uploads the installer .exe as a workflow artifact. - On a `v*` tag push, additionally publishes a GitHub Release with the installer attached and auto-generated release notes. Installer at `installer/zerotype.iss`: - Bundles the entire Release output and installs to {autopf}\ZeroType (admin) or LocalAppData\Programs\ZeroType (standard user). - `PrivilegesRequiredOverridesAllowed=dialog commandline` lets the user decline UAC and continue as a standard user. Welcome page (`InfoBeforeFile=install-mode-info.txt`) explains the trade-off in Traditional Chinese + English: admin gets system-wide install plus auto-configured microphone permission and optional launch-at-startup; standard user gets per-user install with no system writes — the user configures mic and startup themselves. - Microphone consent: on admin installs, writes HKCU\…\CapabilityAccessManager\ConsentStore\microphone\NonPackaged\ <encoded-exe-path> with Value="Allow" so the user doesn't have to dig into Settings → Privacy → Microphone the first time. The exe path is computed at install time via `[Code]` GetMicConsentSubkey() — Windows encodes the path with `#` instead of `\`, so install-location changes are handled correctly. Skipped via `Check: IsAdminInstallMode` in standard-user mode (per the user's "admin = auto, standard = manual" contract). - Launch-at-startup: optional task, also gated behind `Check: IsAdminInstallMode`. Writes to HKCU\…\Run with the install path; cleaned up on uninstall. - Tracks the install with a stable AppId GUID so future versions upgrade in place rather than installing side-by-side. `.gitignore`: ignore `installer/Output/` (the local build output of ISCC; CI generates these fresh per run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LiteLLM proxy returned a 400 from OpenAI's /v1/chat/completions when we sent an m4a-format input_audio: Invalid value: 'm4a'. Supported values are: 'wav' and 'mp3'. OpenAI's chat-completions audio API accepts only wav and mp3, so any non-whisper OpenAI model (gpt-4o-audio, gpt-5.5, …) rejected our recordings even though Gemini-style models had been accepting m4a fine. Switch the Windows recorder to AudioEncoder.wav (PCM16, 16 kHz mono). Trade-off: * Files grow from ~1 MB/min (AAC@128k) to ~1.9 MB/min (PCM16@16k) — fine for local LiteLLM proxies and acceptable for direct cloud uploads. * Format works for *every* backend tested: OpenAI input_audio, Gemini multimodal generateContent, Whisper /v1/audio/transcriptions, and LiteLLM-bridged routes for all of the above. * Sample rate drops back to 16 kHz (Whisper's native rate); the AAC 44.1 kHz workaround was only needed because Windows MF's AAC encoder rejected 16 kHz, and PCM has no such restriction. macOS unchanged (AAC m4a continues to work end-to-end on the existing mac pipeline; switching it would be churn for no benefit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The GHA windows-latest Inno Setup install ships only the official language files; ChineseTraditional lives in Languages\Unofficial\ and isn't installed by default, so ISCC failed with: Couldn't open include file "compiler:Languages\ChineseTraditional.isl": The system cannot find the file specified. Drop the entry — the wizard chrome will render in English, but the substantive page (InfoBeforeFile=install-mode-info.txt) is still bilingual TC + English so the install-mode choice is unambiguous to Chinese-speaking users. If we want the wizard chrome itself in Chinese later, we can vendor the unofficial .isl into installer/ and reference it via a relative path; not bundling it now to avoid baking a third-party file we'd have to keep in sync.

When the user picked a text-only OpenAI model (gpt-5.5) on the LiteLLM proxy, the OpenAI backend returned the verbose: Invalid 'messages[0]'. Content blocks are expected to be either text or image_url type. …wrapped in HTTP boilerplate by _wrapDioError. The actual cause — 'this model is text-only, you need a multimodal/audio one' — was buried. Detect that specific OpenAI error pattern (image_url + 'Content blocks are expected') in the LiteLLM chat path and rewrite into a Traditional Chinese message that names the user-selected model and lists working alternatives: • gemini-2.5-flash-lite / gemini-3-flash-preview (multimodal) • claude-haiku-4-5 / claude-sonnet-* (multimodal) • gpt-4o-audio-preview / gpt-4o-mini-audio-preview • whisper-1 (transcription endpoint, auto-routed) All four families above were verified end-to-end against the user's LiteLLM proxy at v1.1.0.

The v1.1.0 tag run failed at the 'Create GitHub Release' step with HTTP 403: 'Resource not accessible by integration'. The default GITHUB_TOKEN on a fork only has read access to repository contents; uploading a release requires the elevated permission to be requested explicitly via `permissions: contents: write`.

- Replace placeholder your-username/zerotype URLs with alarmz/ZeroType - Add Windows install section using ZeroTypeSetup-x.y.z.exe with the admin-vs-standard mode explanation that mirrors the installer's bilingual welcome page - Document LiteLLM provider end-to-end: base URL format, the dynamic /v1/models picker, and the model-type → endpoint routing table (whisper-* → /v1/audio/transcriptions, multimodal → chat completions with input_audio, text-only → unsupported) - Mention the new optional text-refinement feature (independent provider/prompt + Settings toggle) - Update default-hotkey lines and paste-simulation lines to mention both ⌘V on macOS and Ctrl+V on Windows - Add a v1.1.0 release-notes block; demote v1.0.2 from 'current' - Note that Windows accessibility permission is NOT required (uses SendInput, no keyboard automation consent needed)

User reported transcripts arriving with the system prompt's internal sections appended: <transcript> self_correction: - 【後者為準】無遺漏的修正訊號。 - … language: … dictionary: … Or, alternatively, fully wrapped in the prompt's YAML format with the real answer hidden inside an `output:` block: self_correction: … output: | 我錄了兩次，第二次會出現很多說明… Root cause: the SpeechToText.prompt is structured as YAML — `instructions:`, `examples:` (with `reasoning:` and `output:` per example), `self_correction:`, `language:`, `dictionary:`. Gemini 2.5 Flash *does* respect the convention because its thinking-mode channel separates reasoning from final answer; but gemini-2.5-flash-lite has no such channel, so it imitates the YAML form and emits the structure verbatim alongside its answer. Add `_stripPromptStructureEcho()` in SpeechRecognitionService applied to the chat-completions content. The sanitizer: - Detects top-level YAML keys (`self_correction|reasoning|language| dictionary|examples|instructions|name|description`) at line start. - If an explicit `output:` block exists (Pattern B), extracts and un-indents its YAML literal value. - Otherwise (Pattern A — transcript first, structure trailing), cuts everything from the first structural key onward. The log now shows both lengths so the effect is observable: [LiteLLM-chat] success length=72 (raw=487) tokens: in=2825 out=312 …meaning the sanitizer trimmed 415 chars of structural noise. The upstream prompt is left untouched — this is purely a defensive client-side fix, so the upstream maintainer can keep their prompt format without anyone needing to coordinate.

alarmz and others added 15 commits May 1, 2026 14:44

Bump version to 1.1.0 for Windows release

aee819b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windows desktop support + LiteLLM provider + optional text refinement#1

Windows desktop support + LiteLLM provider + optional text refinement#1
alarmz wants to merge 15 commits into
nick1ee:masterfrom
alarmz:master

alarmz commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alarmz commented May 1, 2026

主要變更

🪟 Windows 全平台支援

🔌 LiteLLM Proxy 整合

✨ 文字優化（可選）

📦 Windows installer + GitHub Actions CI

🔎 Logging

📝 README

設計決策

向後相容

測試狀況

如何取捨

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant