feat(whisper): expose language + prompt config (multilingual opt-in) #8063

YonganZhang wants to merge 1 commit into AstrBotDevs:master
Conversation
## Problem

`ProviderOpenAIWhisperAPI` does not pass `language` / `prompt` to `client.audio.transcriptions.create()`. Whisper's auto-detect can misclassify Chinese / Japanese / Korean / etc., hurting transcription accuracy. Users have no way to provide a language hint or prompt through the existing provider config.

## Solution

Expose two optional config fields, both defaulting to `""`, which preserves the current auto-detect behavior and is fully backwards compatible:

- `language`: Whisper language hint, e.g. `"zh"` / `"ja"` / `"ko"`
- `prompt`: free-text guidance, e.g. domain vocabulary or phrasing (see https://platform.openai.com/docs/guides/speech-to-text/prompting)

Also sets `temperature=0` for deterministic output.

## Backwards compatibility

Default `""` → `NOT_GIVEN` → Whisper auto-detect, identical to current behavior. Existing users see no change; the new parameters are pure opt-in.

## Test

Local run on Chinese voice samples (PolyU lab): WER measurably better with `language="zh"` plus a short Chinese prompt vs. auto-detect.

## Diff size

8 effective lines: 5 in init, 3 in the transcription call.
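The opt-in fallback described above can be sketched as follows. `NOT_GIVEN` here is a local stand-in for the openai SDK's omission sentinel (`openai.NOT_GIVEN`), and `build_transcription_kwargs` is a hypothetical helper used only to illustrate the config handling:

```python
# Sketch of the "" -> NOT_GIVEN -> auto-detect fallback, assuming a plain
# provider_config dict as in the PR. NOT_GIVEN stands in for openai.NOT_GIVEN.
NOT_GIVEN = object()

def build_transcription_kwargs(provider_config: dict) -> dict:
    """Mirror the PR's config handling: "" means keep Whisper auto-detect."""
    language = provider_config.get("language", "")
    prompt = provider_config.get("prompt", "")
    return {
        "model": "whisper-1",
        "language": language or NOT_GIVEN,  # "" -> parameter omitted
        "prompt": prompt or NOT_GIVEN,
        "temperature": 0,  # deterministic output
    }
```

With `{"language": "zh"}` the hint is forwarded; with an empty config both parameters resolve to the sentinel and the call behaves exactly as before.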
Hey, I've left some high-level feedback:

- Hard-coding `temperature=0` in the transcription call removes flexibility; consider making this configurable via `provider_config` with a default of 0 to keep current behavior while allowing overrides.
- Since `language` and `prompt` are now provider-level fields, it may be useful to allow per-call overrides in `get_text` (e.g., optional parameters) so callers can vary hints without redefining the provider.
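A minimal sketch of both suggestions, a configurable temperature plus per-call overrides. The class name and method signature are assumptions for illustration, not the provider's real API:

```python
# Hypothetical provider sketch: temperature comes from provider_config
# (default 0), and per-call hints override the provider-level fields.
NOT_GIVEN = object()  # stand-in for openai.NOT_GIVEN

class WhisperProviderSketch:
    def __init__(self, provider_config: dict):
        self.language = provider_config.get("language", "")
        self.prompt = provider_config.get("prompt", "")
        # Configurable, defaulting to 0 to keep the PR's behavior.
        self.temperature = provider_config.get("temperature", 0)

    def transcription_kwargs(self, language=None, prompt=None) -> dict:
        # Per-call hints win; None falls back to provider-level config.
        lang = self.language if language is None else language
        prm = self.prompt if prompt is None else prompt
        return {
            "language": lang or NOT_GIVEN,
            "prompt": prm or NOT_GIVEN,
            "temperature": self.temperature,
        }
```

A caller configured with `{"language": "zh"}` can still transcribe a Japanese clip via `transcription_kwargs(language="ja")` without redefining the provider.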
Code Review
This pull request introduces support for optional language and prompt parameters in the Whisper API source to improve transcription accuracy. Feedback includes adding unit tests for the new functionality, sanitizing configuration inputs with strip(), and making the hardcoded temperature parameter configurable. Additionally, a potential resource leak was identified where the audio file handle is not explicitly closed, which could cause issues on certain operating systems.
```python
# Optional language hint + prompt to guide Whisper transcription.
# Default empty = let Whisper auto-detect (preserves existing behavior).
# Users can configure these for higher accuracy on non-English speech.
self.language = provider_config.get("language", "")
self.prompt = provider_config.get("prompt", "")
```
According to the general rules, new functionality should be accompanied by corresponding unit tests. Please add tests to verify that the `language` and `prompt` parameters are correctly extracted from the configuration and passed to the transcription API.
References
- New functionality, such as handling attachments, should be accompanied by corresponding unit tests.
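Such a test could look like the sketch below. It uses `unittest.mock`; since the real `ProviderOpenAIWhisperAPI` import path is not shown in this review, a minimal stand-in class mirrors the relevant behavior:

```python
# Sketch of a unit test for the new config fields. WhisperProviderStub is a
# hypothetical stand-in for ProviderOpenAIWhisperAPI's config handling.
from unittest.mock import MagicMock

class WhisperProviderStub:
    def __init__(self, provider_config: dict, client):
        self.language = provider_config.get("language", "")
        self.prompt = provider_config.get("prompt", "")
        self.client = client

    def get_text(self, audio_bytes: bytes):
        return self.client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio_bytes),
            language=self.language,
            prompt=self.prompt,
            temperature=0,
        )

def test_language_and_prompt_forwarded():
    client = MagicMock()
    provider = WhisperProviderStub({"language": "zh", "prompt": "术语"}, client)
    provider.get_text(b"\x00")
    kwargs = client.audio.transcriptions.create.call_args.kwargs
    assert kwargs["language"] == "zh"
    assert kwargs["prompt"] == "术语"
    assert kwargs["temperature"] == 0
```

Mocking the client keeps the test offline while still asserting that the configured values reach `transcriptions.create`.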
```python
self.language = provider_config.get("language", "")
self.prompt = provider_config.get("prompt", "")
```
It is recommended to `strip()` the `language` and `prompt` values to prevent issues caused by accidental leading or trailing whitespace in the configuration. Additionally, using `or ""` ensures that the code handles cases where the configuration value might be explicitly set to `null`.
```diff
-self.language = provider_config.get("language", "")
-self.prompt = provider_config.get("prompt", "")
+self.language = str(provider_config.get("language") or "").strip()
+self.prompt = str(provider_config.get("prompt") or "").strip()
```
```python
language=self.language or NOT_GIVEN,
prompt=self.prompt or NOT_GIVEN,
temperature=0,
```
While adding these parameters, note that the `open(audio_url, "rb")` call on line 126 (context) creates a file handle that is not explicitly closed. This can lead to resource leaks and may cause the `os.remove(audio_url)` call on line 135 to fail on Windows. Consider refactoring this block to use a context manager for the file handle.
```python
file=("audio.wav", open(audio_url, "rb")),
language=self.language or NOT_GIVEN,
prompt=self.prompt or NOT_GIVEN,
temperature=0,
```
Summary by Sourcery
Expose configurable language and prompt options for the Whisper API provider to improve non-English transcription accuracy while preserving existing auto-detection behavior.
New Features:

- Add optional `language` and `prompt` fields to the Whisper provider config, defaulting to `""` to keep auto-detect.

Enhancements:

- Pass `temperature=0` in the transcription call for deterministic output.