Skip to content

Windows desktop support + LiteLLM provider + optional text refinement#1

Open
alarmz wants to merge 15 commits into
nick1ee:masterfrom
alarmz:master
Open

Windows desktop support + LiteLLM provider + optional text refinement#1
alarmz wants to merge 15 commits into
nick1ee:masterfrom
alarmz:master

Conversation

@alarmz

@alarmz alarmz commented May 1, 2026

Copy link
Copy Markdown

從 v1.0.2 fork,補上 Windows 平台支援並擴充幾個自己用得到的功能。整 fork 已測過 end-to-end(錄音 → 轉錄 → 自動貼上)並發過 v1.1.0 release。如果你覺得 PR 太大或某些部分不想收,我可以切成多個 PR、或挑你要的部分另外整理。

主要變更

🪟 Windows 全平台支援

  • windows/runner/channel_handler.cpp:原生 Windows runner 實作 paste(SendInput Ctrl+V)、麥克風權限 stub(Windows 不需 Accessibility)、overlay 與 control channel
  • lib/core/services/recording_service.dart:Windows 用 WAV(16 kHz mono PCM,相容 OpenAI/Whisper/Gemini),macOS 維持 AAC m4a
  • lib/shared/widgets/recording_overlay.dart:Windows 用 Flutter widget overlay(macOS 仍走 NSPanel)
  • lib/shared/widgets/main_shell.dart:標題列改成可拖曳 + 最小化/最大化/關閉按鈕(補上隱藏 native title bar 後缺的視窗控制)
  • settings_page.dart:modifier 標籤 platform-aware(Windows 顯示 Win/Ctrl/Alt/Shift),accessibility tile 在 Windows 隱藏

🔌 LiteLLM Proxy 整合

  • assets/config/providers.jsonlitellm provider
  • SpeechRecognitionService 新增 fetchAvailableModels()/v1/models
  • 路由:whisper-*/v1/audio/transcriptions,其他 → /v1/chat/completions 多模態 input_audio
  • 偵測到 OpenAI text-only 模型錯誤時改寫為清晰中文提示
  • repository / controller / page 全套擴充支援動態 model 抓取與快取(separate prefs key namespace)

✨ 文字優化(可選)

  • prompts/TextRefinement.prompt 新 default
  • 整套 refinement 配置與 transcription 完全獨立(可 Whisper 轉錄 + Claude 優化)
  • 設定頁 toggle 控制,預設關閉,失敗自動 fallback 用原始轉錄文字
  • 涉及 PromptRepository / ModelConfigRepository 擴充、新 RefinementProviderController + DynamicRefinementModelsController + RefinementPromptController;UI 在「模型」頁加 collapsible section、「提示詞」改成 TabBar

📦 Windows installer + GitHub Actions CI

  • .github/workflows/build-windows.yml:每 push 都 build Windows release + Inno Setup installer artifact;push tag v* 自動發 GitHub Release
  • installer/zerotype.iss:Inno Setup 安裝程式
    • 第一頁中英對照解釋「admin = 自動配置 / 一般 = 你自己設定」
    • admin 模式自動寫入 Windows 麥克風權限白名單(HKCU\…\CapabilityAccessManager\ConsentStore\microphone\NonPackaged\<encoded-exe-path>)+ 開機啟動
    • 一般模式裝到 %LOCALAPPDATA%\Programs\ZeroType,不碰系統設定

🔎 Logging

  • lib/core/services/app_logger.dart:append-only logger 寫 %TEMP%\zero_type.log,1 MiB rolling,同時 mirror stdout
  • 浮動圓條 error 狀態顯示完整錯誤訊息(含 HTTP status + response body)而非固定「錯誤」二字
  • DioException 統一透過 _wrapDioError 處理

📝 README

  • 全面更新:Windows 安裝段落、LiteLLM 設定教學(含模型路由表)、文字優化說明、v1.1.0 release notes

設計決策

  • Refinement 預設關閉:避免使用者不小心讓帳單翻倍;設定不完整時靜默 skip
  • WAV 而非 AAC:OpenAI /v1/audio/transcriptions 和 chat completions 的 input_audio 只接受 wav/mp3;macOS 路徑保持 m4a 沒動以免 regression
  • 動態 /v1/models 而非寫死:LiteLLM proxy 的 model alias 由使用者自己配置,不可能 hard-code
  • Installer admin/standard 雙模式:不強制 admin,但用 admin 旗標當 explicit signal 讓使用者選擇是否要 auto-config

向後相容

macOS 原本的路徑完全沒動:

  • AppDelegate.swift 與 native overlay 未改
  • AAC m4a 錄音格式維持
  • 既有 prompt assets 與 prefs key 完全保留(refinement 用了 app_constants.dart 中早已宣告但未使用的 key,沒新增 key 名)

測試狀況

  • ✅ Windows 11 Pro 25H2 end-to-end 驗過 LiteLLM proxy → 多個 model(gemini-2.5-flash-lite / gemini-3-flash-preview / claude-haiku-4-5)→ 文字貼上
  • ✅ Inno Setup installer 雙模式測過
  • ✅ CI release pipeline 跑通並發過 v1.1.0
  • ❌ macOS 沒回歸測(fork 上沒有 mac 環境),但 mac 路徑完全未動

如何取捨

如果整 PR 太大,建議優先合:

  1. Windows 平台支援(核心:channel_handler.cpp + recording WAV + main_shell drag bar)
  2. Logging(小、有用、沒爭議)
  3. 其他模組(LiteLLM / refinement / installer)視意願取捨

歡迎直接 cherry-pick,或告訴我重新切 PR。


🤖 Most of the engineering work was done in collaboration with Claude (Anthropic).

alarmz and others added 15 commits May 1, 2026 14:44
The macOS-style modifier glyphs (⌘ Command, ⌥ Option) are jarring on
Windows. Add a top-level _modifierLabel helper that returns Win/Ctrl/Alt/Shift
on Windows and the macOS labels elsewhere, and route both the hotkey display
and the recorder overlay through it.

Also hide the "輔助使用權限" permission tile on Windows since SendInput does
not require it (the native channel already returns true unconditionally on
Windows; previously the tile was always-green-but-misleading).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZeroType already exposed a free-text "custom endpoint" for the OpenAI
provider, which technically pointed at any OpenAI-compatible proxy. But
the endpoint was hidden in an "advanced" expansion and the model list
was static, so using a LiteLLM proxy meant the user had to hand-type
whatever model alias their proxy exposed.

This commit adds a first-class `litellm` provider:

- assets/config/providers.json: new `litellm` entry with empty static
  model list (filled dynamically at runtime).

- SpeechRecognitionService:
  - new `case 'litellm'` that requires the user-supplied base URL and
    routes the request through the existing OpenAI multipart handler,
    appending /v1/audio/transcriptions to the base.
  - new `fetchAvailableModels(baseUrl, apiKey)` that GETs /v1/models on
    the proxy and returns id+name records for the picker.

- ModelConfigRepository: `getCachedModels` / `saveCachedModels` keyed by
  providerId, JSON-encoded into SharedPreferences.

- DynamicModelsController (riverpod family by providerId, keepAlive):
  loads cached list on build; `refresh(baseUrl, apiKey)` hits /v1/models
  and persists.

- model_config_page UI: when the selected provider is `litellm`, render
  a required Proxy-Base-URL input inline (with a hint that /v1 is added
  automatically), and replace the static model dropdown with
  `_LiteLLMModelPicker` — a dropdown sourced from the dynamic
  controller plus a refresh icon button. Errors and "not yet fetched"
  states surface inline. Other providers retain the existing static
  dropdown and "advanced" custom endpoint UX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Flutter window is created with `titleBarStyle: TitleBarStyle.hidden`,
which suppresses the native Windows title bar. Without a substitute drag
region this leaves the window completely immovable — fatal on multi-monitor
setups where the user can't reach controls that opened off-screen.

Replace the static "Zero Type" centred Container with a `_TitleBar`:

- The centre region is wrapped in `DragToMoveArea` from window_manager so
  the user can grab the title to move/throw the window between displays.
- Three trailing 46x44 buttons (minimize / maximize-or-restore / close)
  drive `windowManager` directly. The close button paints the standard
  Windows red on hover with a white icon; the others use a subtle
  surface-tinted hover.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user reported pressing Alt+Space showed only the literal string "錯誤"
in the floating pill, with no further information — making it impossible to
tell whether the proxy URL was wrong, the API key was bad, the model
returned 4xx, or something timed out. Three changes:

1. recording_overlay: when status is error, render the actual
   `state.errorMessage` (`錯誤:<message>`) instead of the fixed "錯誤"
   string. The label is wrapped in a 480-px max Flexible/ConstrainedBox so
   long server bodies wrap and ellipsis at 3 lines instead of overflowing
   the pill.

2. SpeechRecognitionService: wrap the OpenAI/LiteLLM POST and the LiteLLM
   `/v1/models` GET in DioException-aware try/catch via a new
   `_wrapDioError` helper that:
     * logs full details (type, status, message, truncated body) via
       `AppLogger`
     * rethrows an Exception whose message includes the HTTP status, the
       Dio error type, and a 400-char-truncated response body so the user
       sees something actionable rather than a stack trace
   The pre-existing Gemini handler is migrated to the same helper.

3. Add `AppLogger` (`lib/core/services/app_logger.dart`): an append-only
   logger that writes to `%TEMP%\zero_type.log` (rotated at 1 MiB) and
   mirrors to stdout. The controller's hotkey/start/stop/error paths use it
   instead of `print`, and a new `_failStartup(msg)` helper unifies the
   early-validation paths so they actually mutate state to error+message
   (not just call the macOS-only native overlay channel) — previously on
   Windows these branches showed nothing at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pressing Alt+Space on Windows immediately threw:
  PlatformException(Record, null, The data specified for the media type is
  invalid, inconsistent, or not supported by this object., null)

Root cause: the Windows Media Foundation AAC encoder accepts only 44100 or
48000 Hz sample rates (per Microsoft's AAC encoder spec). The previous
hard-coded 16000 Hz — chosen because Whisper is internally 16 kHz, so the
mac path avoided server-side resampling — fails on Windows with
MF_E_INVALIDMEDIATYPE before any audio is captured.

Branch on Platform.isWindows: 44100 Hz on Windows, 16000 Hz on macOS
(unchanged). Bitrate stays 128 kbps so the upload size is identical;
Whisper resamples 44.1 → 16 kHz on the server with no perceptible
difference for speech.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user pointed ZeroType at a LiteLLM proxy and selected
`gemini-2.5-flash-lite`, which the proxy returned in its `/v1/models`
listing. The transcription request to `/v1/audio/transcriptions` failed
with a server-side 500 and the LiteLLM body:
  litellm.APIConnectionError: Unmapped provider passed in. Unable to get
  the response.

LiteLLM's `/v1/audio/transcriptions` only knows how to dispatch to
whisper-style backends (OpenAI Whisper, Deepgram, Azure Speech, …); it
has no mapping for chat/multimodal models. Modern multimodal LLMs
(Gemini, GPT-4o-audio, Claude with audio) accept audio via the chat
completions endpoint with an `input_audio` content part — and LiteLLM
already bridges that shape to each backend's native audio API.

Switch the LiteLLM provider to:
  - `model.contains('whisper')` → /v1/audio/transcriptions (existing path)
  - everything else            → /v1/chat/completions with input_audio

The new `_transcribeWithChatCompletions` reads the recorded m4a, base64-
encodes it, posts a single user message containing the prompt and the
audio (format detected from the file extension), and parses
`choices[0].message.content` as the transcript. Token usage is read from
`usage.prompt_tokens` / `usage.completion_tokens` so the history page's
cost tracking still works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…toggle)

Adds option C from the design conversation: take the raw transcript from the
speech provider and run it through a separate chat LLM to clean up filler
words, fix self-corrections, normalise punctuation, etc. The point is to let
users pick "cheap+fast model for transcription, smart model for polishing"
— e.g. Whisper transcribe → Claude refine.

Three independent axes (per the user's chosen design):
  (a) Refinement provider/model/key/endpoint kept entirely separate from the
      speech provider, with its own SharedPreferences namespace.
  (b) Refinement prompt is a separate editable prompt (`TextRefinement.prompt`
      asset, custom override at `<appSupport>/TextRefinement_Custom.prompt`).
  (c) Off by default; activated by a toggle in the Settings page so a typo
      in the refinement config can never silently double the user's bill.

Implementation:

- `assets/prompts/TextRefinement.prompt` — sensible default refinement
  instructions (preserve meaning, drop fillers, fix self-corrections, format
  bullet lists, restore literal punctuation).

- `PromptRepository`: refactored shared file/asset/prefs helpers and added
  refinement get/save/reset/default methods. `PromptController` adds a
  `RefinementPromptController` mirroring the speech one. `PromptPage`
  becomes a `TabBar` with two `PromptEditor` instances.

- `ModelConfigRepository`: refinement-namespaced provider/model/apiKey/
  endpoint/cachedModels methods (separate prefs keys so speech and
  refinement can point at different LiteLLM proxies if desired).

- `model_config_controller`: new `RefinementProviderController` and
  `DynamicRefinementModelsController` (parallel to the speech versions).

- `model_config_page`: new "文字優化(可選)" collapsible section reusing
  `_ApiKeyInput`/`_LiteLLMEndpointInput`/`_ModelDropdown`. The
  `_LiteLLMModelPicker` gained an `isRefinement` flag so it watches and
  refreshes the refinement model cache when used in this section.

- `SettingsState`: new `refinementEnabled` field; `SettingsController`
  loads/saves it via `AppConstants.isRefinementEnabledKey` (which was
  already declared but unused). New tile in `settings_page` between
  "開機啟動" and "歷史記錄保留時間".

- `SpeechRecognitionService.refine(rawText, …)`: text-only chat-completion
  call. Routes openai/litellm via `/v1/chat/completions` (LiteLLM works
  with any chat backend), gemini via `models/{id}:generateContent`. Reuses
  the existing `_wrapDioError` helper so any failure produces a useful
  log line and a readable exception.

- `ZeroTypeController._maybeRefine(rawText)`: called between transcribe
  and clipboard/paste. Returns `null` (silent fallback to raw transcript)
  if the toggle is off, the refinement provider isn't fully configured,
  or the call throws — refinement should never block the user from
  getting their text out. Briefly shows "優化中" in the overlay during the
  call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Actions workflow at `.github/workflows/build-windows.yml`:
- Triggers on push to master/main, on `v*` tags, on PRs, and via
  `workflow_dispatch`. Concurrency group cancels superseded runs.
- Runs on `windows-latest`, sets up Flutter 3.41.9 stable with cache,
  enables windows desktop, fetches deps, runs build_runner codegen, then
  `flutter build windows --release`.
- Reads version from pubspec.yaml and passes it into Inno Setup as
  `MyAppVersion`.
- Compiles the installer (Inno Setup 6 is preinstalled on the GHA
  windows-latest image; falls back to chocolatey if absent).
- Always uploads the installer .exe as a workflow artifact.
- On a `v*` tag push, additionally publishes a GitHub Release with the
  installer attached and auto-generated release notes.

Installer at `installer/zerotype.iss`:
- Bundles the entire Release output and installs to {autopf}\ZeroType
  (admin) or LocalAppData\Programs\ZeroType (standard user).
- `PrivilegesRequiredOverridesAllowed=dialog commandline` lets the user
  decline UAC and continue as a standard user. Welcome page
  (`InfoBeforeFile=install-mode-info.txt`) explains the trade-off in
  Traditional Chinese + English: admin gets system-wide install plus
  auto-configured microphone permission and optional launch-at-startup;
  standard user gets per-user install with no system writes — the user
  configures mic and startup themselves.
- Microphone consent: on admin installs, writes
  HKCU\…\CapabilityAccessManager\ConsentStore\microphone\NonPackaged\
  <encoded-exe-path> with Value="Allow" so the user doesn't have to dig
  into Settings → Privacy → Microphone the first time. The exe path is
  computed at install time via `[Code]` GetMicConsentSubkey() — Windows
  encodes the path with `#` instead of `\`, so install-location changes
  are handled correctly. Skipped via `Check: IsAdminInstallMode` in
  standard-user mode (per the user's "admin = auto, standard = manual"
  contract).
- Launch-at-startup: optional task, also gated behind
  `Check: IsAdminInstallMode`. Writes to HKCU\…\Run with the install
  path; cleaned up on uninstall.
- Tracks the install with a stable AppId GUID so future versions upgrade
  in place rather than installing side-by-side.

`.gitignore`: ignore `installer/Output/` (the local build output of
ISCC; CI generates these fresh per run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LiteLLM proxy returned a 400 from OpenAI's /v1/chat/completions when we
sent an m4a-format input_audio:
  Invalid value: 'm4a'. Supported values are: 'wav' and 'mp3'.

OpenAI's chat-completions audio API accepts only wav and mp3, so any
non-whisper OpenAI model (gpt-4o-audio, gpt-5.5, …) rejected our recordings
even though Gemini-style models had been accepting m4a fine.

Switch the Windows recorder to AudioEncoder.wav (PCM16, 16 kHz mono).
Trade-off:
  * Files grow from ~1 MB/min (AAC@128k) to ~1.9 MB/min (PCM16@16k) —
    fine for local LiteLLM proxies and acceptable for direct cloud
    uploads.
  * Format works for *every* backend tested: OpenAI input_audio,
    Gemini multimodal generateContent, Whisper /v1/audio/transcriptions,
    and LiteLLM-bridged routes for all of the above.
  * Sample rate drops back to 16 kHz (Whisper's native rate); the AAC
    44.1 kHz workaround was only needed because Windows MF's AAC encoder
    rejected 16 kHz, and PCM has no such restriction.

macOS unchanged (AAC m4a continues to work end-to-end on the existing
mac pipeline; switching it would be churn for no benefit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The GHA windows-latest Inno Setup install ships only the official
language files; ChineseTraditional lives in Languages\Unofficial\
and isn't installed by default, so ISCC failed with:
  Couldn't open include file "compiler:Languages\ChineseTraditional.isl":
  The system cannot find the file specified.

Drop the entry — the wizard chrome will render in English, but the
substantive page (InfoBeforeFile=install-mode-info.txt) is still
bilingual TC + English so the install-mode choice is unambiguous to
Chinese-speaking users.

If we want the wizard chrome itself in Chinese later, we can vendor
the unofficial .isl into installer/ and reference it via a relative
path; not bundling it now to avoid baking a third-party file we'd
have to keep in sync.
When the user picked a text-only OpenAI model (gpt-5.5) on the LiteLLM
proxy, the OpenAI backend returned the verbose:

  Invalid 'messages[0]'. Content blocks are expected to be either text
  or image_url type.

…wrapped in HTTP boilerplate by _wrapDioError. The actual cause — 'this
model is text-only, you need a multimodal/audio one' — was buried.

Detect that specific OpenAI error pattern (image_url + 'Content blocks
are expected') in the LiteLLM chat path and rewrite into a Traditional
Chinese message that names the user-selected model and lists working
alternatives:
  • gemini-2.5-flash-lite / gemini-3-flash-preview (multimodal)
  • claude-haiku-4-5 / claude-sonnet-* (multimodal)
  • gpt-4o-audio-preview / gpt-4o-mini-audio-preview
  • whisper-1 (transcription endpoint, auto-routed)

All four families above were verified end-to-end against the user's
LiteLLM proxy at v1.1.0.
The v1.1.0 tag run failed at the 'Create GitHub Release' step with HTTP
403: 'Resource not accessible by integration'. The default GITHUB_TOKEN
on a fork only has read access to repository contents; uploading a
release requires the elevated permission to be requested explicitly via
`permissions: contents: write`.
- Replace placeholder your-username/zerotype URLs with alarmz/ZeroType
- Add Windows install section using ZeroTypeSetup-x.y.z.exe with the
  admin-vs-standard mode explanation that mirrors the installer's
  bilingual welcome page
- Document LiteLLM provider end-to-end: base URL format, the dynamic
  /v1/models picker, and the model-type → endpoint routing table
  (whisper-* → /v1/audio/transcriptions, multimodal → chat completions
  with input_audio, text-only → unsupported)
- Mention the new optional text-refinement feature (independent
  provider/prompt + Settings toggle)
- Update default-hotkey lines and paste-simulation lines to mention
  both ⌘V on macOS and Ctrl+V on Windows
- Add a v1.1.0 release-notes block; demote v1.0.2 from 'current'
- Note that Windows accessibility permission is NOT required (uses
  SendInput, no keyboard automation consent needed)
User reported transcripts arriving with the system prompt's internal
sections appended:
  <transcript>
  self_correction:
    - 【後者為準】無遺漏的修正訊號。
    - …
  language: …
  dictionary: …

Or, alternatively, fully wrapped in the prompt's YAML format with the
real answer hidden inside an `output:` block:
  self_correction: …
  output: |
    我錄了兩次,第二次會出現很多說明…

Root cause: the SpeechToText.prompt is structured as YAML — `instructions:`,
`examples:` (with `reasoning:` and `output:` per example), `self_correction:`,
`language:`, `dictionary:`. Gemini 2.5 Flash *does* respect the convention
because its thinking-mode channel separates reasoning from final answer; but
gemini-2.5-flash-lite has no such channel, so it imitates the YAML form
and emits the structure verbatim alongside its answer.

Add `_stripPromptStructureEcho()` in SpeechRecognitionService applied to
the chat-completions content. The sanitizer:
- Detects top-level YAML keys (`self_correction|reasoning|language|
  dictionary|examples|instructions|name|description`) at line start.
- If an explicit `output:` block exists (Pattern B), extracts and
  un-indents its YAML literal value.
- Otherwise (Pattern A — transcript first, structure trailing), cuts
  everything from the first structural key onward.

The log now shows both lengths so the effect is observable:
  [LiteLLM-chat] success length=72 (raw=487) tokens: in=2825 out=312
…meaning the sanitizer trimmed 415 chars of structural noise.

The upstream prompt is left untouched — this is purely a defensive
client-side fix, so the upstream maintainer can keep their prompt format
without anyone needing to coordinate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant