docs(asr): document parallel chunk processing in LongTranscription.md #626
Adds a Parallel Chunk Processing section covering ASRConfig.parallelChunkConcurrency (default 4, clamped to >= 1), the worker-pool / ThrowingTaskGroup dispatch in ChunkProcessor, AsrManager.makeWorkerClone(), and the tuning notes from the parallelization benchmarks (2.2-2.8x on M3, ~19-31 MiB extra resident memory). Also extends the Current Paths table, the Relevant Code list (AsrManager and ChunkProcessor entries), and the focused-test list to include ASRConfigTests. The parallel path is orthogonal to mel-context / no-mel / dual-decode arbitration and applies to all stateless chunked batch TDT routes. Streaming and sliding-window paths are explicitly called out as unaffected because they need persistent decoder/encoder state.
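The dispatch pattern described above (a clamped concurrency value, a pool of worker clones, a ThrowingTaskGroup, and an availableWorkers list used for backpressure) can be sketched roughly as follows. This is a minimal illustration, not the ChunkProcessor implementation: the `Worker` type, its `transcribe` method, and `processChunks` are hypothetical stand-ins, and only the `TaskResult { index, tokens, workerIndex }` shape is taken from the PR description.

```swift
// Hypothetical stand-in for a per-worker decoder clone (not the FluidAudio type).
struct Worker: Sendable {
    let id: Int
    func transcribe(_ chunk: [Float]) async throws -> [Int] {
        // Placeholder "decode": emits the chunk length as a single token.
        return [chunk.count]
    }
}

// Result shape named in the PR description.
struct TaskResult: Sendable {
    let index: Int
    let tokens: [Int]
    let workerIndex: Int
}

// Assumed helper: runs at most `requested` chunks at once and returns
// per-chunk tokens in the original chunk order.
func processChunks(_ chunks: [[Float]], concurrency requested: Int) async throws -> [[Int]] {
    let concurrency = max(1, requested)  // clamp to >= 1
    let workers = (0..<concurrency).map { Worker(id: $0) }

    return try await withThrowingTaskGroup(of: TaskResult.self, returning: [[Int]].self) { group in
        var availableWorkers = Array(workers.indices)  // backpressure list
        var results = [[Int]?](repeating: nil, count: chunks.count)
        var next = 0

        // Start one task per free worker.
        while next < chunks.count, let w = availableWorkers.popLast() {
            let index = next
            let chunk = chunks[index]
            let worker = workers[w]
            group.addTask {
                let tokens = try await worker.transcribe(chunk)
                return TaskResult(index: index, tokens: tokens, workerIndex: w)
            }
            next += 1
        }

        // Each finished task frees its worker, which picks up the next chunk.
        while let done = try await group.next() {
            results[done.index] = done.tokens
            availableWorkers.append(done.workerIndex)
            if next < chunks.count, let w = availableWorkers.popLast() {
                let index = next
                let chunk = chunks[index]
                let worker = workers[w]
                group.addTask {
                    let tokens = try await worker.transcribe(chunk)
                    return TaskResult(index: index, tokens: tokens, workerIndex: w)
                }
                next += 1
            }
        }
        return results.map { $0 ?? [] }
    }
}
```

Storing results by `index` keeps the merge order independent of completion order, and passing a concurrency of 1 reduces the sketch to serial processing.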
…, merge, streaming

Adds the long-transcription details that previously lived only in ChunkProcessor comments, source-side constants, and the issue #594 / #507 / #594-followup commit history:
- Chunk Geometry section with concrete values from ASRConstants (encoder window, encoder frame, mel hop, visible chunk, overlap, stride, minimum seam overlap) and a note on why visible windows are smaller than the 240k-sample encoder window.
- Boundary Search section explaining regular fixed-stride vs the silence-aligned (±4s) + valley fallback (±0.5s) + speech-tail-compression guard used on the v3 no-mel path, including the adaptive thresholds (0.05x and 0.35x medianScore).
- Warmup Prefix vs Mel Context section clarifying that mel context (80ms) is decoder-skipped while warmup prefix (0-7 frames) is decoded from frame 0 with emitted tokens suppressed, and documenting the shouldUseWarmupPrefix quiet-lookahead gate (>=200ms below rms 0.003).
- Why This Helps rewrite that traces the actual issue history: PR #264 (mel-context prepend for English blank-boundary failures), issue #594 (v3 multilingual English-prior drift), the persisted-decoder-state attempt in eb9c19f and why it was superseded once PR #507 parallelized chunk decoding, and the sentence-final BLANK-trap that the per-chunk SOS reset now masks.
- Overlap Merge section documenting the merge ladder used by ChunkProcessor.mergeChunks: disjoint shortcut, contiguous time-tolerant SequenceMatcher (>= half overlap), LCS fallback, midpoint fallback; emphasizing that the merger is token-id-keyed and never re-decodes.
- Streaming Threshold section covering ASRConfig.streamingEnabled and streamingThreshold, including how they compose with parallelChunkConcurrency for memory-constrained environments.
- Relevant Code list extended with ASRConstants, the new ASRConfig.streaming* fields, the boundary/merge helper functions in ChunkProcessor.swift, and the TokenDeduplication SequenceMatcher.

Pure documentation, no source changes.
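The chunk geometry values mentioned in this PR (240k-sample encoder window, 1280-sample / 80 ms encoder frame, 2.0 s overlap, visible chunk ≈ 14.96 s) can be cross-checked with a little arithmetic. The sketch below is only one reading that happens to reproduce the ≈ 14.96 s figure: it assumes a 16 kHz sample rate (implied by 1280 samples per 80 ms), assumes the visible chunk is the window rounded down to whole encoder frames, and assumes the stride is the visible chunk minus the overlap. The constant names are illustrative and the real derivation in ASRConstants may differ.

```swift
// Assumed: 16 kHz audio, inferred from 1280 samples per 80 ms encoder frame.
let sampleRate = 16_000
let maxModelSamples = 240_000        // encoder window
let samplesPerEncoderFrame = 1_280   // 80 ms per encoder frame
let overlapSeconds = 2.0

// Assumption: the visible chunk is the window floored to whole encoder frames.
let visibleFrames = maxModelSamples / samplesPerEncoderFrame      // integer division
let visibleSamples = visibleFrames * samplesPerEncoderFrame
let visibleSeconds = Double(visibleSamples) / Double(sampleRate)

// Assumption: a frame-aligned stride is the visible chunk minus the overlap.
let overlapFrames = Int(overlapSeconds * Double(sampleRate)) / samplesPerEncoderFrame
let strideFrames = visibleFrames - overlapFrames
let strideSeconds = Double(strideFrames * samplesPerEncoderFrame) / Double(sampleRate)

print(visibleFrames, visibleSeconds, strideFrames, strideSeconds)
```

Under these assumptions the visible chunk comes out at 187 frames (14.96 s) and the stride at 162 frames (12.96 s); the stride figure is derived here and is not stated in the PR.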
PocketTTS Smoke Test ✅
Runtime: 0m36s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU, so audio quality and performance may differ from Apple Silicon.
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 5m23s
Note: CI VM lacks physical GPU, so CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 5m47s • 05/18/2026, 11:59 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
Speaker Diarization Benchmark Results
Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 40.1s diarization time • Test runtime: 2m 37s • 05/18/2026, 11:41 PM EST
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 1m9s • 05/18/2026, 11:46 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 3m 54s • 2026-05-19T03:50:48.371Z
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 114.3s processing • Test runtime: 1m 56s • 05/18/2026, 11:55 PM EST
Summary
Brings `Documentation/ASR/LongTranscription.md` up to date with the parallel chunked Parakeet batch transcription work (PR #507, commit `fcd80f10`) and with the surrounding long-transcription history that was only documented in source comments and commit messages.

Added sections
- Chunk Geometry: … `ASRConstants` (encoder window 240k samples, encoder frame 1280 samples / 80 ms, mel hop 160 samples, visible chunk ≈ 14.96 s, 2.0 s overlap, frame-aligned stride, 6-frame minimum seam overlap) and why the visible window is smaller than `maxModelSamples`.
- Boundary Search: … `0.05× medianScore` / `0.35× medianScore` thresholds.
- Warmup Prefix vs Mel Context: … `shouldUseWarmupPrefix` quiet-lookahead gate (≥ 200 ms of `rms < 0.003`).
- Why This Helps: … `parakeet-tdt-0.6b-v3-coreml` multilingual audio (English-prior drift at every seam). `eb9c19f7` tried persisting `TdtDecoderState` and extending the audio prefix to 2.0 s, but that was incompatible with the parallel chunk dispatch in `fcd80f10` (PR #507). … `melChunkContext = false` with silence-aligned starts.
- Parallel Chunk Processing: … `ASRConfig.parallelChunkConcurrency` (default `4`, clamped `>= 1`), worker-pool construction via `AsrManager.makeWorkerClone()`, `ThrowingTaskGroup` dispatch with the `availableWorkers` backpressure list, ordered merge via `TaskResult { index, tokens, workerIndex }`, and a callout that `StreamingAsrManager` / `SlidingWindowAsrManager` are intentionally unaffected. Tuning notes cover the 2.2–2.8× wall-clock speedup on M3 with a 1-hour file, ~19–31 MiB extra resident memory, and `parallelChunkConcurrency = 1` as the closest-to-serial / lowest-memory configuration.
- Overlap Merge: … `mergeChunks` ladder: disjoint shortcut → contiguous time-tolerant `SequenceMatcher` (must cover ≥ half the overlap) → LCS fallback → midpoint fallback. Emphasizes that the merger is token-id-keyed with an `overlapSeconds / 2` time tolerance and never re-runs the decoder.
- Streaming Threshold: … `ASRConfig.streamingEnabled` / `streamingThreshold` (480k samples ≈ 30 s) and how they compose with `parallelChunkConcurrency` for memory-constrained environments.

Updated sections
- Current Paths table: … `Parallel chunk workers` row.
- Relevant Code list: … `ASRConstants` geometry constants; `ASRConfig.streaming*`; `AsrManager.makeWorkerClone()` / `parallelChunkConcurrency`; `ChunkProcessor` boundary helpers (`regularChunkStarts`, `silenceAlignedChunkStarts`, `bestBoundaryCandidate`, `shouldUseWarmupPrefix`, `wouldCompressSpeechTail`); merge helpers (`mergeChunks`, `mergeUsingMatches`, `mergeByMidpoint`); `makeWorkerPool` and static `transcribeChunk(...)`; `TokenDeduplication/SequenceMatcher.swift`.
- Focused-test list: … `ASRConfigTests` (covers `parallelChunkConcurrency` default, clamp, override).

Pure documentation change; no source files touched.
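For readers unfamiliar with the quiet-lookahead gate mentioned for `shouldUseWarmupPrefix` (≥ 200 ms of `rms < 0.003`), a check of that kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the ChunkProcessor code: the function name `hasQuietLookahead`, the 16 kHz default sample rate, and the choice to measure a single RMS over the first 200 ms of the lookahead are all hypothetical.

```swift
// Hypothetical helper (not the FluidAudio API): returns true when the first
// `minQuietSeconds` of `samples` have an RMS below `rmsThreshold`.
func hasQuietLookahead(
    _ samples: [Float],
    sampleRate: Double = 16_000,      // assumed sample rate
    minQuietSeconds: Double = 0.2,    // 200 ms, from the PR description
    rmsThreshold: Float = 0.003       // from the PR description
) -> Bool {
    let needed = Int(minQuietSeconds * sampleRate)
    // Not enough lookahead audio to make a decision: treat as not quiet.
    guard needed > 0, samples.count >= needed else { return false }

    // Root mean square over the lookahead window.
    let window = samples[0..<needed]
    let meanSquare = window.reduce(Float(0)) { $0 + $1 * $1 } / Float(needed)
    return meanSquare.squareRoot() < rmsThreshold
}
```

A real gate may instead use a sliding window or per-frame energies; the point here is only how the two thresholds from the description interact.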
Commits
- f1e8694 docs(asr): document parallel chunk processing in LongTranscription.md
- b013476 docs(asr): expand LongTranscription.md with geometry, boundary search, merge, streaming

Test plan
- … `Documentation/ASR/LongTranscription.md` and visually confirm the section ordering and tables.
- … `main`:
  - `ASRConstants.maxModelSamples`, `samplesPerEncoderFrame`, `melHopSize`, `secondsPerEncoderFrame`
  - `ASRConfig.parallelChunkConcurrency`, `melChunkContext`, `dualDecodeArbitration`, `streamingEnabled`, `streamingThreshold`
  - `AsrManager.makeWorkerClone()`, `AsrManager.parallelChunkConcurrency`
  - `ChunkProcessor.chunkLayout`, `regularChunkStarts`, `silenceAlignedChunkStarts`, `bestBoundaryCandidate`, `shouldUseWarmupPrefix`, `wouldCompressSpeechTail`, `mergeChunks`, `mergeUsingMatches`, `mergeByMidpoint`, `makeWorkerPool`, `transcribeChunk`
  - `TokenDeduplication/SequenceMatcher.swift` exposing `findContiguousMatches` and `findLongestCommonSubsequence`
- … `swift test --filter ASRConfigTests` and `swift test --filter ChunkProcessorTests`.