[Bug] Qwen3-TTS 0.6B TTS output is incorrect on Jetson unless Talker / CodePredictor / Code2Wav contracts are fixed #87

@suharvest

Context

First of all, thank you for maintaining TensorRT-Edge-LLM and for the recent Qwen3 ASR/TTS/Omni work. We are building an on-device, low-latency voice application on Jetson, and this library has been very useful for that work.

Our target deployment is:

  • Qwen3-ASR on Jetson
  • Qwen3-TTS 0.6B on Jetson
  • local real-time conversational applications

During deployment, Qwen3-ASR worked correctly, but Qwen3-TTS did not. The TTS pipeline could export/build/run and produce a .wav file, but the speech content was wrong or contained obvious artifacts. After debugging, we found several runtime / engine-contract issues. I am opening this issue to share a minimal repro and the fixes we used so the official implementation can be corrected.

For our own application we are also adding streaming support and some training/adaptation capabilities on top of this project. Because our use case requires low latency, we maintain a separate high-performance branch with streaming TTS adaptations. With the corrected runtime contracts, we have already run two voice applications on an 8GB Jetson Orin Nano with Qwen3-ASR and Qwen3-TTS 0.6B loaded at the same time, and the performance is very promising. Since those results depend on the correctness fixes, we would like to contribute the findings back upstream. If maintainers are interested, we can also share the high-performance / streaming branch separately.

This issue itself is intentionally non-streaming and minimal, so it is easier to reproduce and review.

Related issues checked

  • support qwen3-tts-0.6B #86
    • Similar model: Qwen/Qwen3-TTS-12Hz-0.6B-Base.
    • That issue mainly discusses whether the 0.6B model has an MTP module.
    • The maintainer clarified that Qwen3-TTS has a CodePredictor, which is unrelated to Qwen3.5 speculative-decoding MTP.
    • This report is different: the TTS pipeline can produce WAV files, but the generated speech content is incorrect unless the Talker / CodePredictor / Code2Wav contracts are handled correctly.
  • streaming output for qwen3-tts #61
    • About streaming output for Qwen3-TTS. This report is a non-streaming correctness repro.
  • [Question] Qwen3-Omni Full Support on Thor/Orin (v0.6.0) #70
    • General Qwen3-Omni support question. This report is a concrete Qwen3-TTS 0.6B runtime correctness issue.

Environment

  • Repository: NVIDIA/TensorRT-Edge-LLM
  • Baseline: origin/main / v0.7.0-era code
  • Device used for repro: Jetson Orin NX / SM87
  • Additional deployment validation: Jetson Orin Nano 8GB
  • TensorRT: 10.3
  • CUDA: 12.6
  • Model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
  • Test mode: non-streaming examples/omni/qwen3_tts_inference

What fails

The failing case is not simply a crash. The current/generic path can complete successfully and write a valid WAV container, but the speech content is wrong.

Examples from our validation:

  • Non-high-performance runtime + native CodePredictor + stateless Code2Wav:
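    • Input: 你好,今天天气真不错。 ("Hello, the weather is really nice today.")
    • ASR result on one device: 有你好 (roughly "have, hello"; only the greeting survives)
    • ASR result on another device: 你好,你好 ("Hello, hello")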
  • Other minimal stateless/projection experiments:
    • ASR often recognized only isolated fragments or other partial text.
  • In one stateless Code2Wav variant, the vocoder also hit CUDA illegal memory access.

So checking only process exit status or whether a WAV file exists is not enough for Qwen3-TTS validation. The output needs semantic validation, for example by ASR or an equivalent correctness test.
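One way to automate that check is to normalize both strings and compare them. The sketch below is a minimal illustration in Python and is not part of TensorRT-Edge-LLM: the function names and the 0.9 threshold are assumptions, and the recognized text is expected to come from whichever ASR system is available.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Fold width and case, then drop punctuation (P*), symbols (S*),
    separators (Z*) and control characters (C*)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S", "Z", "C"))
    )


def similarity(expected: str, recognized: str) -> float:
    """Similarity in [0, 1] between the normalized strings."""
    return SequenceMatcher(None, normalize(expected), normalize(recognized)).ratio()


def is_semantically_correct(expected: str, recognized: str,
                            threshold: float = 0.9) -> bool:
    """Hypothetical pass/fail rule; the threshold is an arbitrary choice."""
    return similarity(expected, recognized) >= threshold
```

With this rule, a transcript that differs from the input only in punctuation passes, while the partial transcripts listed above fail.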

ASR-verified successful output

With the corrected minimal path, we generated:

Input ("Hello, the weather is really nice today."):

您好,今天天气真不错。

Output:

  • Generated frames: 49
  • Output samples: 94080
  • Duration: 3.92s at 24 kHz
  • ASR device 1: 您好,今天天气真不错
  • ASR device 2: 您好,今天天气真不错
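The reported figures are mutually consistent, which is a cheap sanity check before running ASR. In the short sketch below, the samples-per-frame value is derived from the numbers above rather than read from the model configuration, so treat it as an inference.

```python
SAMPLE_RATE_HZ = 24_000   # output sample rate reported above
frames = 49               # generated codec frames
samples = 94_080          # output samples

# Each codec frame should expand to a whole number of audio samples.
assert samples % frames == 0
samples_per_frame = samples // frames               # derived, not from config

duration_s = samples / SAMPLE_RATE_HZ               # total audio duration
frame_rate_hz = SAMPLE_RATE_HZ / samples_per_frame  # codec frames per second

print(samples_per_frame, duration_s, frame_rate_hz)
```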

We can attach the generated WAV:

qwen3-tts-minimal-stateful-slim.wav

We also generated an English sample that can be attached:

qwen3-tts-minimal-stateful-en.wav

Findings / suspected root causes

1. Talker prefill contract

The Qwen3-TTS explicit Talker engine may expose an inputs_embeds profile with max sequence length 1, while its KV cache supports a longer sequence. The runtime must distinguish:

  • max input sequence length for the current enqueue
  • max KV sequence length / cache capacity

If the whole prompt prefill is sent at once to a single-token input profile, it fails. If the runtime falls back to a generic prefill layout, the generated codec stream can be semantically wrong. Iterative prefill fixed this for our explicit-KV Talker path.
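The distinction between the two limits can be sketched as a planning step that splits the prompt into enqueues no longer than the input profile allows, while checking the total against the KV capacity. This is an illustrative Python sketch with hypothetical names, not the runtime's actual C++ code.

```python
from typing import List, Tuple


def plan_prefill(prompt_len: int, max_input_len: int,
                 kv_capacity: int) -> List[Tuple[int, int]]:
    """Return (kv_offset, num_tokens) for each enqueue of the prefill.

    max_input_len: largest sequence length accepted by one enqueue.
    kv_capacity:   total number of positions the KV cache can hold.
    """
    if prompt_len <= 0 or max_input_len <= 0:
        raise ValueError("prompt_len and max_input_len must be positive")
    if prompt_len > kv_capacity:
        raise ValueError("prompt does not fit in the KV cache")

    steps = []
    offset = 0
    while offset < prompt_len:
        num_tokens = min(max_input_len, prompt_len - offset)
        steps.append((offset, num_tokens))
        offset += num_tokens
    return steps
```

With a single-token input profile, `plan_prefill(5, 1, 128)` yields five one-token enqueues at KV offsets 0 through 4. With the `--maxInputLen 32` and `--maxKVCacheCapacity 128` values used later in this report, a 40-token prompt would be split into two enqueues.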

2. CodePredictor contract

Qwen3-TTS uses a CodePredictor to generate residual codec codebooks. It should not be treated as Qwen3.5 speculative-decoding MTP; the runtime needs a native Qwen3-TTS CodePredictor path that matches the engine contract and the residual-code generation semantics. The working path uses that native engine and contract.

3. Code2Wav state contract

The vocoder path that produced correct audio is stateful. The engine exposes codes / waveform plus recurrent state bindings named like:

<name>_in
<name>_out

Ignoring those state tensors, or using a stateless Code2Wav runner for this engine contract, can still produce a WAV but with incorrect content, or can fail at runtime.
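A runner can discover the recurrent state by pairing tensor names on the `_in` / `_out` suffixes and rejecting any engine where a state lacks its counterpart. The Python sketch below only illustrates that pairing rule; the helper name is hypothetical, and `conv_cache` in the usage note is a made-up state name, since the real binding names are not listed here.

```python
from typing import Dict, Iterable, Tuple


def pair_state_bindings(tensor_names: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each state name to its (input binding, output binding) pair."""
    names = list(tensor_names)
    ins = {n[: -len("_in")] for n in names if n.endswith("_in")}
    outs = {n[: -len("_out")] for n in names if n.endswith("_out")}

    # Every state must appear on both sides, otherwise it cannot be carried.
    unpaired = ins ^ outs
    if unpaired:
        raise ValueError(f"unpaired state bindings: {sorted(unpaired)}")

    return {name: (f"{name}_in", f"{name}_out") for name in sorted(ins)}
```

For example, `pair_state_bindings(["codes", "waveform", "conv_cache_in", "conv_cache_out"])` returns a single pair for `conv_cache` and leaves `codes` and `waveform` as ordinary I/O. After each enqueue, the contents of every `_out` tensor become the `_in` contents of the next one.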

The stateful Code2Wav engine profile we validated has:

codes min/opt/max = [1, 16, 1] / [1, 16, 4] / [1, 16, 16]

The last dimension is the number of codec frames, so runtime chunks must not exceed 16 codec frames for that engine.
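To illustrate why both the chunk limit and the state matter, the sketch below splits a frame sequence into chunks of at most 16 frames and runs them through a toy causal filter that stands in for one engine enqueue. The filter is purely an assumption for demonstration and has nothing to do with the real vocoder; the point is only that dropping the carried state changes the output at every chunk boundary.

```python
MAX_CHUNK_FRAMES = 16  # max of the last dimension in the validated profile


def split_frames(num_frames, max_chunk=MAX_CHUNK_FRAMES):
    """Return [start, end) ranges covering num_frames in chunks <= max_chunk."""
    return [(s, min(s + max_chunk, num_frames))
            for s in range(0, num_frames, max_chunk)]


def toy_decoder(chunk, state):
    """Toy stand-in for one enqueue: y[t] = x[t] + 0.5 * x[t-1].

    `state` is the last input of the previous chunk (the recurrent state).
    """
    out = []
    prev = state
    for x in chunk:
        out.append(x + 0.5 * prev)
        prev = x
    return out, prev


def decode(frames, carry_state):
    """Decode chunk by chunk, optionally resetting the state every chunk."""
    state = 0.0
    wave = []
    for start, end in split_frames(len(frames)):
        out, new_state = toy_decoder(frames[start:end], state)
        wave.extend(out)
        state = new_state if carry_state else 0.0
    return wave
```

For the 49-frame example in this report, `split_frames(49)` produces three full chunks and one single-frame chunk. `decode(frames, carry_state=True)` matches a single pass over all frames, whereas `carry_state=False` does not.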

Minimal implementation used to verify

We reduced the successful path to a non-streaming example:

  • No worker.
  • No streaming protocol.
  • No async queue.
  • No service framework.
  • A small StatefulCode2WavRunner that mirrors the existing Code2WavRunner::generateWaveform() API.
  • qwen3_tts_inference selects the stateful runner when code2wav_stateful.engine exists under --code2wavEngineDir; otherwise it keeps the existing stateless fallback.
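The selection rule in the last bullet amounts to a file-existence check. The actual change lives in `examples/omni/qwen3_tts_inference.cpp`; the Python sketch below only mirrors the described logic, and the function name and return values are hypothetical.

```python
from pathlib import Path


def select_code2wav_runner(code2wav_engine_dir) -> str:
    """Pick the vocoder runner for the directory passed as --code2wavEngineDir."""
    stateful_engine = Path(code2wav_engine_dir) / "code2wav_stateful.engine"
    # Prefer the stateful runner when its engine is present; otherwise keep
    # the existing stateless fallback.
    return "stateful" if stateful_engine.is_file() else "stateless"
```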

Final stateful runner integration diff size:

cpp/multimodal/statefulCode2WavRunner.cpp | 518 ++++++++++++++++++++++
cpp/multimodal/statefulCode2WavRunner.h   | 122 +++++
examples/omni/qwen3_tts_inference.cpp     |  32 +-

The full local validation stack also includes the Talker / CodePredictor / export / Jetson build fixes needed to run from origin/main.

Branches / patches

We can provide both a minimal correctness repro branch and a separate high-performance branch.

Minimal correctness repro

qwen3tts-minimal-stateful-repro

Purpose:

  • Minimal non-streaming correctness repro.
  • Closest to the official example flow.
  • Intended as the main reference for this issue.

Patch generated from origin/main..qwen3tts-minimal-stateful-repro:

git format-patch origin/main..qwen3tts-minimal-stateful-repro --stdout > qwen3-tts-minimal-repro.patch
git am qwen3-tts-minimal-repro.patch

Patch stack used in our local repro:

b9c57c8 Freeze Qwen3-TTS direct BF16 reference runtime
b09f644 Fix CuTe DSL CUDA 12.x compatibility for Talker MLP kernels
d4ccf66 Fix CuTe DSL shim propagation to static library consumers
2077555 Fix tokenizer loading for Qwen3-TTS models
1eebd83 Make quantization import optional in llm_export.py
111a684 Make modelopt import lazy in model_utils.py
8d21f47 Add tensor dtype accessor for Qwen3-TTS validation
32446c1 fix: Add minimal stateful Qwen3-TTS Code2Wav repro

The last commit is only the stateful Code2Wav runner + minimal non-streaming example integration. The earlier commits are the Talker / CodePredictor / export / build fixes needed to make the full repro run from origin/main.

If using my fork, the branch URL will be:

https://github.com/suharvest/TensorRT-Edge-LLM/tree/qwen3tts-minimal-stateful-repro

High-performance / streaming reference

highperf/runtime-service

Purpose:

  • Low-latency runtime-service path for our product/application needs.
  • Includes streaming-oriented Qwen3 ASR/TTS work and service integration.
  • Useful as an additional reference if maintainers are interested in streaming behavior or performance tradeoffs.
  • Not intended as the minimal fix for this issue.

If using my fork, the branch URL will be:

https://github.com/suharvest/TensorRT-Edge-LLM/tree/highperf/runtime-service

There is also a W8A16/high-performance experimental branch:

qwen3-tts-highperf-runtime-w8a16

That one is more experimental and broader in scope. I would recommend reviewing qwen3tts-minimal-stateful-repro first for correctness, and only looking at the high-performance branches if the streaming/runtime-service direction is useful.

Reproduction outline

1. Apply patch

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git checkout origin/main
git am /path/to/qwen3-tts-minimal-repro.patch
git submodule update --init --recursive

2. Export Qwen3-TTS 0.6B ONNX

Export on an x86 host. Qwen3-TTS currently requires qwen-tts, so we used a dedicated Python environment.

cd ~/TensorRT-Edge-LLM
python3 -m venv venv-qwen3-tts
source venv-qwen3-tts/bin/activate
pip install .
pip install qwen-tts

export TTS_MODEL=Qwen3-TTS-12Hz-0.6B-Base
export MODEL_DIR=Qwen/$TTS_MODEL
export ONNX_OUTPUT_DIR=$HOME/tensorrt-edgellm-workspace/$TTS_MODEL/onnx
export TTS_CHAT_TEMPLATE=./tensorrt_edgellm/chat_templates/templates/qwen3tts.json

tensorrt-edgellm-export-llm \
  --model_dir $MODEL_DIR \
  --output_dir $ONNX_OUTPUT_DIR \
  --chat_template $TTS_CHAT_TEMPLATE \
  --export_models talker,code_predictor

tensorrt-edgellm-export-audio \
  --model_dir $MODEL_DIR \
  --output_dir $ONNX_OUTPUT_DIR \
  --export_models tokenizer_decoder

Expected export layout:

$ONNX_OUTPUT_DIR/
├── llm/
│   ├── talker/
│   └── code_predictor/
└── audio/
    └── tokenizer_decoder/

The CodePredictor directory must include:

codec_embeddings.safetensors
lm_heads.safetensors
small_to_mtp_projection.safetensors

3. Build TensorRT-Edge-LLM on Orin NX / TRT 10.3 / SM87

cd ~/TensorRT-Edge-LLM
mkdir -p build_trt103_sm87_aarch64
cd build_trt103_sm87_aarch64

PATH=/usr/local/cuda/bin:$PATH cmake .. \
  -DTRT_PACKAGE_DIR=/usr \
  -DAARCH64_BUILD=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCUDA_DIR=/usr/local/cuda \
  -DCUDA_CTK_VERSION=12.6 \
  -DCMAKE_CUDA_ARCHITECTURES=87 \
  -DENABLE_CUTE_DSL=gemm \
  -DCMAKE_BUILD_TYPE=Release

PATH=/usr/local/cuda/bin:$PATH cmake --build . \
  --target edgellmCore qwen3_tts_inference NvInfer_edgellm_plugin -j2

4. Build engines / prepare artifacts

The runtime repro needs this artifact layout:

$ENG/
├── llm/
│   ├── llm.engine
│   ├── config.json
│   ├── text_embedding.safetensors
│   ├── text_projection.safetensors
│   ├── embedding.safetensors
│   └── tokenizer / chat-template files
├── talker_explicit_kv/
│   └── talker_decode_bf16_dual.engine
├── code_predictor/
│   └── cp_dir/
│       ├── qwen3_tts_cp.engine
│       ├── cp_embed_fp32.bin
│       ├── codec_embeddings.safetensors
│       ├── lm_heads.safetensors
│       ├── small_to_mtp_projection.safetensors
│       └── config.json
└── code2wav_stateful/
    ├── code2wav_stateful.engine
    └── config.json
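Before running inference, it can help to verify that every file in this layout is present. The helper below is a small Python sketch, not something shipped with the repository; it checks only the explicitly named files and skips the unspecified tokenizer / chat-template files.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the artifact layout above.
REQUIRED_ARTIFACTS = [
    "llm/llm.engine",
    "llm/config.json",
    "llm/text_embedding.safetensors",
    "llm/text_projection.safetensors",
    "llm/embedding.safetensors",
    "talker_explicit_kv/talker_decode_bf16_dual.engine",
    "code_predictor/cp_dir/qwen3_tts_cp.engine",
    "code_predictor/cp_dir/cp_embed_fp32.bin",
    "code_predictor/cp_dir/codec_embeddings.safetensors",
    "code_predictor/cp_dir/lm_heads.safetensors",
    "code_predictor/cp_dir/small_to_mtp_projection.safetensors",
    "code_predictor/cp_dir/config.json",
    "code2wav_stateful/code2wav_stateful.engine",
    "code2wav_stateful/config.json",
]


def missing_artifacts(eng_root) -> List[str]:
    """Return the required relative paths that are not regular files."""
    root = Path(eng_root)
    return [rel for rel in REQUIRED_ARTIFACTS if not (root / rel).is_file()]
```

Calling `missing_artifacts(os.environ["ENG"])` (after `import os`) should return an empty list once all engines and weights are in place.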

Example preparation commands:

cd ~/TensorRT-Edge-LLM

export TTS_MODEL=Qwen3-TTS-12Hz-0.6B-Base
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export ONNX=$WORKSPACE_DIR/$TTS_MODEL/onnx
export ENG=$WORKSPACE_DIR/$TTS_MODEL/engines

export EDGELLM_PLUGIN_PATH=$PWD/build_trt103_sm87_aarch64/libNvInfer_edgellm_plugin.so
export LD_LIBRARY_PATH=$PWD/build_trt103_sm87_aarch64:/usr/lib/aarch64-linux-gnu:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

./build_trt103_sm87_aarch64/examples/llm/llm_build \
  --onnxDir $ONNX/llm/talker \
  --engineDir $ENG/llm \
  --maxInputLen 32 \
  --maxKVCacheCapacity 128 \
  --maxBatchSize 1

mkdir -p $ENG/code_predictor/cp_dir
cp $ONNX/llm/code_predictor/codec_embeddings.safetensors $ENG/code_predictor/cp_dir/
cp $ONNX/llm/code_predictor/lm_heads.safetensors $ENG/code_predictor/cp_dir/
cp $ONNX/llm/code_predictor/small_to_mtp_projection.safetensors $ENG/code_predictor/cp_dir/
cp $ONNX/llm/code_predictor/config.json $ENG/code_predictor/cp_dir/

mkdir -p $ENG/code2wav_stateful
cp $ONNX/audio/tokenizer_decoder/config.json $ENG/code2wav_stateful/

Important note: building only the current stateless code2wav.engine path is enough to reproduce the bug, but not enough to reproduce the ASR-correct output. The ASR-correct path requires the stateful Code2Wav engine contract above.

The most useful thing to check is whether the exported/built Qwen3-TTS engines match these runtime contracts. I can attach the exact ASR-verified WAVs and the minimal patch used to route qwen3_tts_inference through the stateful vocoder path.

5. Run non-streaming TTS

cd ~/TensorRT-Edge-LLM

export EDGELLM_PLUGIN_PATH=$PWD/build_trt103_sm87_aarch64/libNvInfer_edgellm_plugin.so
export LD_LIBRARY_PATH=$PWD/build_trt103_sm87_aarch64:/usr/lib/aarch64-linux-gnu:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

export TTS_MODEL=Qwen3-TTS-12Hz-0.6B-Base
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export ENG=$WORKSPACE_DIR/$TTS_MODEL/engines
export OUT=/tmp/qwen3_tts_minimal_stateful
mkdir -p $OUT

cat >/tmp/qwen3_tts_input.json <<'JSON'
{
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": "您好,今天天气真不错。"
        }
      ]
    }
  ]
}
JSON

./build_trt103_sm87_aarch64/examples/omni/qwen3_tts_inference \
  --talkerEngineDir $ENG/llm \
  --codePredictorEngineDir $ENG/code_predictor/cp_dir \
  --code2wavEngineDir $ENG/code2wav_stateful \
  --tokenizerDir $ENG/llm \
  --inputFile /tmp/qwen3_tts_input.json \
  --outputFile $OUT/output.json \
  --outputAudioDir $OUT \
  --dumpOutput

Expected result:

$OUT/audio_req0.wav

The important correctness check is semantic: the generated WAV should be
recognized as the input text by ASR, rather than only verifying that a WAV file
was written successfully.
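A container-level check is still useful as a first gate before ASR. The sketch below uses Python's standard `wave` module to read the header of the generated file; the function name is hypothetical, and passing this check says nothing about whether the speech content is right.

```python
import wave


def describe_wav(path):
    """Read basic properties from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        num_samples = wav.getnframes()  # sample frames per channel
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "num_samples": num_samples,
            "duration_s": num_samples / rate,
        }
```

For the ASR-verified sample described earlier, the expected values would be a 24 kHz sample rate, 94080 samples and a duration of 3.92 s. The transcript comparison remains the actual correctness criterion.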

Validation artifact links

The WAV files and JSON outputs are also available from this artifact branch:
