First of all, thank you for maintaining TensorRT-Edge-LLM and for the recent Qwen3 ASR/TTS/Omni work. We are building an on-device, low-latency voice application on Jetson, and this library has been very useful for that work.
Our target deployment is:
Qwen3-ASR on Jetson
Qwen3-TTS 0.6B on Jetson
local real-time conversational applications
During deployment, Qwen3-ASR worked correctly, but Qwen3-TTS did not. The TTS pipeline could export/build/run and produce a .wav file, but the speech content was wrong or contained obvious artifacts. After debugging, we found several runtime / engine-contract issues. I am opening this issue to share a minimal repro and the fixes we used so the official implementation can be corrected.
For our own application we are also adding streaming support and some training/adaptation capabilities on top of this project. Because our use case requires low latency, we have a separate high-performance branch with streaming TTS adaptations. With the corrected runtime contracts, we have already run two voice applications on an 8GB Jetson Orin Nano, with Qwen3-ASR and Qwen3-TTS 0.6B loaded at the same time. The performance is very promising. If maintainers are interested, we can also share that high-performance / streaming branch separately. This is why we would like to contribute the correctness findings back upstream.
This issue itself is intentionally non-streaming and minimal, so it is easier to reproduce and review.
Related issues checked
One existing issue mainly discusses whether the 0.6B model has an MTP module. The maintainer clarified there that Qwen3-TTS has a CodePredictor, which is unrelated to Qwen3.5 speculative-decoding MTP. This report is different: the TTS pipeline can produce WAV files, but the generated speech content is incorrect unless the Talker / CodePredictor / Code2Wav contracts are handled correctly.
Another is a general Qwen3-Omni support question. This report is a concrete Qwen3-TTS 0.6B runtime correctness issue.
Environment
Repository: NVIDIA/TensorRT-Edge-LLM
Baseline: origin/main / v0.7.0-era code
Device used for repro: Jetson Orin NX / SM87
Additional deployment validation: Jetson Orin Nano 8GB
TensorRT: 10.3
CUDA: 12.6
Model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
Test mode: non-streaming examples/omni/qwen3_tts_inference
What fails
The failing case is not simply a crash. The current/generic path can complete successfully and write a valid WAV container, but the speech content is wrong.
Examples from our validation: for input text such as 你好,今天天气真不错。, ASR often recognized only 嗯, 你, or other partial text.
In one stateless Code2Wav variant, the vocoder also hit a CUDA illegal memory access.
So checking only process exit status or whether a WAV file exists is not enough for Qwen3-TTS validation. The output needs semantic validation, for example by ASR or an equivalent correctness test.
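A semantic check of this kind can be sketched as a character error rate (CER) between the input text and the ASR transcript, after stripping punctuation. This is only an illustration: the function names and the 0.1 threshold are assumptions, not part of TensorRT-Edge-LLM or of our validation scripts.

```python
import unicodedata


def normalize(text: str) -> str:
    """Keep only spoken content: drop punctuation, separators, and control characters."""
    kept = []
    for ch in unicodedata.normalize("NFKC", text):
        # Unicode categories P* (punctuation), Z* (separators), C* (control) are skipped.
        if unicodedata.category(ch)[0] in ("P", "Z", "C"):
            continue
        kept.append(ch.lower())
    return "".join(kept)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def is_semantically_correct(reference: str, hypothesis: str, max_cer: float = 0.1) -> bool:
    # max_cer is an assumed threshold; tune it for the ASR model in use.
    return char_error_rate(reference, hypothesis) <= max_cer
```

With this check, a transcript that differs from the input only in trailing punctuation passes, while a fragment such as 嗯 fails even though the WAV file itself is valid.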
ASR-verified successful output
With the corrected minimal path, we generated:
Input:
您好,今天天气真不错。
Output:
Generated frames: 49
Output samples: 94080
Duration: 3.92s at 24 kHz
ASR device 1: 您好,今天天气真不错
ASR device 2: 您好,今天天气真不错
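The reported numbers are internally consistent, which can be checked with a few lines of arithmetic. The values come from the output above; the per-frame sample count and frame rate are derived from them, not taken from the model documentation.

```python
SAMPLE_RATE_HZ = 24_000      # from "Duration: 3.92s at 24 kHz"
generated_frames = 49        # from "Generated frames: 49"
output_samples = 94_080      # from "Output samples: 94080"

# Each codec frame should decode to a whole number of audio samples.
samples_per_frame = output_samples // generated_frames
assert samples_per_frame * generated_frames == output_samples

duration_s = output_samples / SAMPLE_RATE_HZ
frame_rate_hz = SAMPLE_RATE_HZ / samples_per_frame

print(samples_per_frame, duration_s, frame_rate_hz)
```

A mismatch here (for example, a sample count that is not a multiple of the frame count) would point at dropped or truncated vocoder chunks before any ASR check is run.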
We can attach the generated WAV:
qwen3-tts-minimal-stateful-slim.wav
We also generated an English sample that can be attached:
qwen3-tts-minimal-stateful-en.wav
Findings / suspected root causes
1. Talker prefill contract
The Qwen3-TTS explicit Talker engine may expose an inputs_embeds profile with max sequence length 1, while its KV cache supports a longer sequence. The runtime must distinguish:
max input sequence length for the current enqueue
max KV sequence length / cache capacity
If the whole prompt prefill is sent at once to a single-token input profile, it fails. If the runtime falls back to a generic prefill layout, the generated codec stream can be semantically wrong. Iterative prefill fixed this for our explicit-KV Talker path.
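The distinction between the two limits can be sketched as a small planning helper: the prompt is split into slices that each fit the input profile, while the total is checked against the KV capacity. The function and parameter names are hypothetical and do not correspond to the TensorRT-Edge-LLM runtime API.

```python
from typing import List, Tuple


def plan_prefill_steps(
    prompt_len: int,
    max_input_len: int,
    kv_capacity: int,
    kv_used: int = 0,
) -> List[Tuple[int, int]]:
    """Return (start, end) prompt slices, one per enqueue.

    max_input_len: largest sequence length the inputs_embeds profile accepts.
    kv_capacity:   total number of positions the KV cache can hold.
    """
    if max_input_len < 1:
        raise ValueError("max_input_len must be at least 1")
    if kv_used + prompt_len > kv_capacity:
        raise ValueError("prompt does not fit in the KV cache")

    steps = []
    start = 0
    while start < prompt_len:
        end = min(start + max_input_len, prompt_len)
        steps.append((start, end))
        start = end
    return steps
```

For an engine whose input profile has max sequence length 1, a 5-token prompt becomes five single-token enqueues, each appending one position to the KV cache.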
2. CodePredictor contract
Qwen3-TTS uses a CodePredictor to generate residual codec codebooks. This should not be treated as Qwen3.5 speculative-decoding MTP, but the runtime still needs a native Qwen3-TTS CodePredictor path that matches its engine contract and residual-code generation semantics.
The working path uses the native Qwen3-TTS CodePredictor engine and the expected residual-code generation contract.
3. Code2Wav state contract
The vocoder path that produced correct audio is stateful. The engine exposes codes / waveform plus recurrent state bindings named like:
<name>_in
<name>_out
Ignoring those state tensors, or using a stateless Code2Wav runner for this engine contract, can still produce a WAV but with incorrect content, or can fail at runtime.
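One way to detect this contract is to pair bindings by their `_in` / `_out` suffix when inspecting the engine's tensor names. The sketch below works on a plain list of names; `conv_state` is a made-up example name, and how the names are obtained from the engine is left out.

```python
from typing import Dict, Iterable, List, Tuple


def pair_state_bindings(
    tensor_names: Iterable[str],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Pair '<name>_in' with '<name>_out'.

    Returns (pairs, unpaired_inputs), where pairs maps the base name to its
    (input, output) binding names and unpaired_inputs lists '_in' tensors
    that have no matching '_out'.
    """
    names = set(tensor_names)
    pairs: Dict[str, Tuple[str, str]] = {}
    unpaired: List[str] = []
    for name in sorted(names):
        if not name.endswith("_in"):
            continue
        base = name[: -len("_in")]
        out_name = base + "_out"
        if out_name in names:
            pairs[base] = (name, out_name)
        else:
            unpaired.append(name)
    return pairs, unpaired
```

A non-empty `pairs` result suggests the engine is stateful and that each `_out` tensor has to be fed back as the matching `_in` tensor on the next chunk.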
The stateful Code2Wav engine profile we validated has:
So runtime chunks must not exceed 16 codec frames for that engine.
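The chunk limit and the state carry-over can be sketched together. Here `step_fn` stands in for one engine execution and is a placeholder, not the actual `StatefulCode2WavRunner` interface.

```python
from typing import Callable, List, Sequence, Tuple


def split_frames(num_frames: int, max_chunk_frames: int = 16) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into consecutive (start, end) chunks of bounded size."""
    return [
        (start, min(start + max_chunk_frames, num_frames))
        for start in range(0, num_frames, max_chunk_frames)
    ]


def run_chunked(
    codes: Sequence,
    step_fn: Callable,
    initial_state,
    max_chunk_frames: int = 16,
):
    """Run step_fn chunk by chunk, feeding each returned state into the next call."""
    state = initial_state
    waveform = []
    for start, end in split_frames(len(codes), max_chunk_frames):
        samples, state = step_fn(codes[start:end], state)
        waveform.extend(samples)
    return waveform, state
```

For the 49-frame example in this report, this yields three chunks of 16 frames and a final chunk of 1 frame.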
Minimal implementation used to verify
We reduced the successful path to a non-streaming example:
No worker.
No streaming protocol.
No async queue.
No service framework.
A small StatefulCode2WavRunner that mirrors the existing Code2WavRunner::generateWaveform() API.
qwen3_tts_inference selects the stateful runner when code2wav_stateful.engine exists under --code2wavEngineDir; otherwise it keeps the existing stateless fallback.
The full local validation stack also includes the Talker / CodePredictor / export / Jetson build fixes needed to run from origin/main.
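The runner selection described above amounts to a file-existence check. The example itself is C++; the following Python sketch only illustrates the decision logic, using the engine file names mentioned in this issue.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def select_code2wav_engine(engine_dir: str) -> Tuple[str, Path]:
    """Prefer the stateful engine when it exists, otherwise fall back to stateless."""
    directory = Path(engine_dir)
    stateful = directory / "code2wav_stateful.engine"
    if stateful.is_file():
        return "stateful", stateful
    return "stateless", directory / "code2wav.engine"


# Small demonstration with a temporary directory standing in for --code2wavEngineDir.
with tempfile.TemporaryDirectory() as tmp:
    kind_before, _ = select_code2wav_engine(tmp)
    (Path(tmp) / "code2wav_stateful.engine").touch()
    kind_after, path_after = select_code2wav_engine(tmp)
```

Keeping the stateless path as the fallback means existing artifact directories continue to behave as before.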
Branches / patches
We can provide both a minimal correctness repro branch and a separate high-performance branch.
Minimal correctness repro
qwen3tts-minimal-stateful-repro
Purpose:
Minimal non-streaming correctness repro.
Closest to the official example flow.
Intended as the main reference for this issue.
Patch generated from origin/main..qwen3tts-minimal-stateful-repro:
git format-patch origin/main..qwen3tts-minimal-stateful-repro --stdout > qwen3-tts-minimal-repro.patch
git am qwen3-tts-minimal-repro.patch
Patch stack used in our local repro:
b9c57c8 Freeze Qwen3-TTS direct BF16 reference runtime
b09f644 Fix CuTe DSL CUDA 12.x compatibility for Talker MLP kernels
d4ccf66 Fix CuTe DSL shim propagation to static library consumers
2077555 Fix tokenizer loading for Qwen3-TTS models
1eebd83 Make quantization import optional in llm_export.py
111a684 Make modelopt import lazy in model_utils.py
8d21f47 Add tensor dtype accessor for Qwen3-TTS validation
32446c1 fix: Add minimal stateful Qwen3-TTS Code2Wav repro
The last commit is only the stateful Code2Wav runner + minimal non-streaming example integration. The earlier commits are the Talker / CodePredictor / export / build fixes needed to make the full repro run from origin/main.
There is also a W8A16/high-performance experimental branch:
qwen3-tts-highperf-runtime-w8a16
That one is more experimental and broader in scope. I would recommend reviewing qwen3tts-minimal-stateful-repro first for correctness, and only looking at the high-performance branches if the streaming/runtime-service direction is useful.
Reproduction outline
1. Apply patch
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git checkout origin/main
git am /path/to/qwen3-tts-minimal-repro.patch
git submodule update --init --recursive
2. Export Qwen3-TTS 0.6B ONNX
Export on an x86 host. Qwen3-TTS currently requires qwen-tts, so we used a dedicated Python environment.
Expected export layout:
The CodePredictor directory must include:
3. Build TensorRT-Edge-LLM on Orin NX / TRT 10.3 / SM87
4. Build engines / prepare artifacts
The runtime repro needs this artifact layout:
Example preparation commands:
Important note: building only the current stateless code2wav.engine path is enough to reproduce the bug, but not enough to reproduce the ASR-correct output. The ASR-correct path requires the stateful Code2Wav engine contract above.
The most useful thing to check is whether the exported/built Qwen3-TTS engines match these runtime contracts. I can attach the exact ASR-verified WAVs and the minimal patch used to route qwen3_tts_inference through the stateful vocoder path.
5. Run non-streaming TTS
Expected result:
The important correctness check is semantic: the generated WAV should be
recognized as the input text by ASR, rather than only verifying that a WAV file
was written successfully.
Validation artifact links
The WAV files and JSON outputs are also available from this artifact branch: