What changed
Fix RNNT state splitting so that per-sample decoder state no longer keeps views into the full batched LSTM states.
Align ONNX RNNT decoding with the torch path by reading max_symbols_per_step from config, with a default of 10.
Return token frames from the ONNX ASR decode path for timestamp support.
Add focused regression tests for state copying, symbol-limit config, token frames, and SpecScaler no-mutation behavior.
Include small stability fixes for long positional encodings, checkpoint loading, VAD pipeline caching, downloads, flash-attention unpadding, and decoder device lookup.
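To make the symbol-limit and token-frame items concrete, here is a minimal sketch of a greedy RNNT-style loop. It is not the code from this PR: `greedy_decode`, `step_fn`, `blank_id`, and the dict-shaped `config` are hypothetical names chosen for illustration. Only the `max_symbols_per_step` key and its default of 10 come from the description above.

```python
def greedy_decode(num_frames, step_fn, config, blank_id=0):
    """Toy greedy RNNT-style loop (illustrative only).

    step_fn(t, last_token, state) -> (token, new_state) is a hypothetical
    stand-in for one decoder + joint step on encoder frame t.
    """
    # Read the per-frame symbol limit from config instead of hard-coding it.
    max_symbols = config.get("max_symbols_per_step", 10)

    tokens, token_frames = [], []
    state = None
    for t in range(num_frames):
        emitted = 0
        while emitted < max_symbols:
            last = tokens[-1] if tokens else blank_id
            token, new_state = step_fn(t, last, state)
            if token == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(token)
            token_frames.append(t)  # frame index, usable for timestamps
            state = new_state       # advance state only on non-blank output
            emitted += 1
    return tokens, token_frames


def always_emit(t, last_token, state):
    """Degenerate step function that never predicts blank."""
    return 1, state


# With a model that never emits blank, the limit alone decides the length.
default_tokens, default_frames = greedy_decode(2, always_emit, {})
limited_tokens, limited_frames = greedy_decode(
    2, always_emit, {"max_symbols_per_step": 3}
)
```

The degenerate `always_emit` step is the worst case for the limit: with two frames, the default yields 20 tokens and a limit of 3 yields 6, which shows how two decoders with different limits can diverge on the same weights.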
Why
The RNNT split helpers returned views. Keeping those views in dec_state could retain full [L, B, H] LSTM buffers for each sample and cause memory growth on long audio or larger batches.
The ONNX decoder also had a hard-coded per-frame symbol limit of 3, while the torch decoder defaults to 10, which could produce different transcripts on the same weights.
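The retention effect can be reproduced with plain NumPy. The sketch below is an illustration, not the helpers from this PR: `split_state_views` and `split_state_copies` are hypothetical names, and the shapes are arbitrary. It shows that a per-sample slice keeps a reference to the whole batched array, while a copy owns only its own data.

```python
import numpy as np


def split_state_views(state):
    """Split a [L, B, H] array into B slices of shape [L, 1, H] (views)."""
    return [state[:, b:b + 1, :] for b in range(state.shape[1])]


def split_state_copies(state):
    """Same split, but each slice is copied so it owns its own buffer."""
    return [state[:, b:b + 1, :].copy() for b in range(state.shape[1])]


L, B, H = 2, 4, 8  # arbitrary illustrative sizes
batched = np.zeros((L, B, H), dtype=np.float32)

views = split_state_views(batched)
copies = split_state_copies(batched)

# A view references the full batched array through .base, so holding on to
# one per-sample view keeps the entire [L, B, H] buffer alive.
view_keeps_batch = views[0].base is batched

# A copy has no base; once `batched` goes out of scope, only the
# [L, 1, H] slice stays in memory.
copy_owns_data = copies[0].base is None

view_shares_memory = np.shares_memory(views[0], batched)
copy_shares_memory = np.shares_memory(copies[0], batched)
```

Copying costs one small allocation per sample per split, which is usually negligible next to retaining a full batched buffer for every sample in flight.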
ONNXRuntime mode:
We use CUDA ONNXRuntime in the Triton/Python backend, not CPU-only ONNXRuntime.
Approximate amount of data processed:
The issue was observed under concurrent long-audio transcription load. One soak run processed 133 transcription requests in about 10 minutes at request concurrency 4.
Typical audio lengths:
Typical files were long-form audio/video files, approximately 10-20 minutes per request. One smoke file was about 17 minutes long.
Observed memory leak type / size:
It looked like CPU RSS growth / Python object retention in the RNNT ONNX/Triton path, not CUDA VRAM growth. The retained objects appear to be numpy/tensor views of batched RNNT decoder LSTM states.