diff --git a/api-reference/server/services/stt/whisper.mdx b/api-reference/server/services/stt/whisper.mdx
index cb0efe14..11aee386 100644
--- a/api-reference/server/services/stt/whisper.mdx
+++ b/api-reference/server/services/stt/whisper.mdx
@@ -54,6 +54,11 @@ uv add "pipecat-ai[whisper]"
 uv add "pipecat-ai[mlx-whisper]"
 ```
 
+
+ MLX Whisper requires macOS on Apple Silicon (arm64). It will not work on other
+ platforms, including Intel Macs.
+
+
 ## Prerequisites
 
 ### Local Model Setup
@@ -246,5 +251,7 @@ stt = WhisperSTTServiceMLX(
 - **First run downloads**: If the selected model hasn't been downloaded previously, the first run will download it from the Hugging Face model hub. This may take significant time depending on model size.
 - **Segmented transcription**: Both `WhisperSTTService` and `WhisperSTTServiceMLX` extend `SegmentedSTTService`, meaning they process complete audio segments after VAD detects the user has stopped speaking.
 - **No-speech filtering**: The `no_speech_prob` threshold helps filter out hallucinations. Increase it to be more permissive, decrease it to filter more aggressively.
+- **MLX platform requirement**: `WhisperSTTServiceMLX` requires macOS on Apple Silicon (arm64). On other platforms (including Intel Macs), use `WhisperSTTService` instead.
 - **MLX quantization**: The `LARGE_V3_TURBO_Q4` model provides reduced memory usage with minimal quality loss on Apple Silicon.
+- **Model enums**: `Model` and `MLXModel` are `StrEnum` types, meaning enum members can be compared directly to strings (e.g., `Model.TINY == "tiny"`). Both enum members and plain strings work when setting the model.
 - **Language support**: Whisper supports 99+ languages. Use the `Language` enum for type-safe language selection. The default is `Language.EN` (English). Set `language=None` in settings to enable automatic language detection, which will transcribe whatever language the user speaks.
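
The two notes added in this diff (the MLX platform requirement and the `StrEnum` behaviour) could be illustrated roughly as follows. This is only a sketch and is not part of the change: `supports_mlx_whisper`, `pick_service_name`, and `DemoModel` are hypothetical names invented for illustration, `DemoModel` is a stand-in and not pipecat's `Model` enum, and the check assumes a native (non-Rosetta) Python where `platform.machine()` reports `arm64` on Apple Silicon.

```python
import platform
from enum import Enum

try:
    from enum import StrEnum  # available in Python 3.11+
except ImportError:
    class StrEnum(str, Enum):
        """Minimal fallback for older Pythons: members compare equal to strings."""


def supports_mlx_whisper(system=None, machine=None) -> bool:
    """Return True only for macOS on Apple Silicon (arm64).

    Arguments default to the current host; they are overridable so the
    check can be exercised for other platforms.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # macOS reports "Darwin"; Intel Macs report "x86_64", so they are excluded.
    return system == "Darwin" and machine == "arm64"


def pick_service_name(system=None, machine=None) -> str:
    """Hypothetical helper: name the service class the docs recommend."""
    if supports_mlx_whisper(system, machine):
        return "WhisperSTTServiceMLX"
    return "WhisperSTTService"


class DemoModel(StrEnum):
    """Stand-in enum showing StrEnum-style string comparison."""
    TINY = "tiny"


if __name__ == "__main__":
    print(pick_service_name())
    print(DemoModel.TINY == "tiny")
```

Returning a class name rather than importing the class keeps the sketch runnable without the `mlx-whisper` extra installed; real code would import the chosen service lazily inside the matching branch.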