
Commit 4ddb996

Author: root
Commit message: update results
1 parent c7686fa commit 4ddb996

4 files changed: 156 additions & 10 deletions


Lines changed: 87 additions & 0 deletions
## Accelerating CosyVoice3 with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

### Quick Start

Launch the service directly with Docker Compose:

```sh
docker compose -f docker-compose.cosyvoice3.yml up
```

### Build the Docker Image

To build the image from scratch:

```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

### Run a Docker Container

```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```

### Understanding `run_cosyvoice3.sh`

The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:

```sh
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```

- `<start_stage>`: The stage to start from.
- `<stop_stage>`: The stage to stop after.

**Stages:**

- **Stage -1**: Clones the `CosyVoice` repository.
- **Stage 0**: Downloads the `Fun-CosyVoice3-0.5B-2512` model and its HuggingFace LLM checkpoint.
- **Stage 1**: Converts the HuggingFace LLM checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- **Stage 2**: Creates the Triton model repository, including configurations for `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- **Stage 3**: Launches the Triton Inference Server for the Token2Wav module and uses `trtllm-serve` to deploy the CosyVoice3 LLM.
- **Stage 4**: Runs the gRPC benchmark client for performance testing.
- **Stage 5**: Runs the offline TTS inference benchmark.
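The stage bounds are applied with a simple Kaldi-style gate: a stage `N` runs only when `<start_stage> <= N <= <stop_stage>`. A minimal runnable sketch of that pattern (stage numbers only, with the real stage bodies elided):

```sh
#!/bin/bash
# Sketch of the stage gating used by run_cosyvoice3.sh.
stage=${1:-0}       # <start_stage>, defaulting to 0 for this sketch
stop_stage=${2:-3}  # <stop_stage>, defaulting to 3 for this sketch
ran=""
for n in -1 0 1 2 3 4 5; do
  if [ "$stage" -le "$n" ] && [ "$stop_stage" -ge "$n" ]; then
    ran="$ran $n"   # a real stage body (clone, download, build...) goes here
  fi
done
echo "stages run:$ran"
```

With the defaults above this prints the stages 0 through 3, which is why `bash run_cosyvoice3.sh 0 3` prepares models and launches the server, while `bash run_cosyvoice3.sh 4 4` executes only the benchmark stage.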
### Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:

```sh
# This command runs stages 0, 1, 2, and 3
bash run_cosyvoice3.sh 0 3
```

### Benchmark with Client-Server Mode

To benchmark the running Triton server, run stage 4:

```sh
bash run_cosyvoice3.sh 4 4

# You can customize parameters such as the number of tasks inside the script.
```

The following results were obtained by decoding on a single L20 GPU.

#### Streaming TTS (Concurrent Tasks = 4)

**First Chunk Latency**

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
| ---------------- | ------------ | -------------------- | -------------------- | -------------------- | -------------------- |
| 4                | 750.42       | 740.31               | 941.05               | 977.55               | 1002.37              |
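Percentile latencies like those in the table are typically computed from the raw per-request timings with the nearest-rank method; a small sketch over made-up measurements (`latencies.txt` is hypothetical, created here only for illustration):

```sh
# Hypothetical first-chunk latencies in ms, one measurement per line
printf '%s\n' 700 720 740 760 940 980 1000 1005 > latencies.txt

# Nearest-rank percentile: the value at position ceil(p * N) of the sorted list
p90=$(sort -n latencies.txt | awk -v p=0.90 \
  '{ v[NR] = $1 } END { idx = int(p * NR + 0.999999); print v[idx] }')
echo "p90 first-chunk latency: ${p90} ms"
```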
### Benchmark with Offline Inference Mode

To benchmark offline inference, run stage 5:

```sh
bash run_cosyvoice3.sh 5 5
```

#### Offline TTS (CosyVoice3 0.5B LLM + Token2Wav with TensorRT)

| Backend | Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF    |
|---------|------------|--------------|--------------------|-------------------|--------|
| TRTLLM  | 1          | 13.21        | 5.72               | 19.48             | 0.1091 |
| TRTLLM  | 2          | 8.46         | 6.02               | 14.91             | 0.0822 |
| TRTLLM  | 4          | 5.07         | 5.95               | 11.43             | 0.0630 |
| TRTLLM  | 8          | 2.98         | 6.11               | 9.53              | 0.0562 |
| TRTLLM  | 16         | 2.12         | 6.27               | 8.83              | 0.0501 |
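RTF (real-time factor) in the table above is, by the usual definition, synthesis wall time divided by the duration of the generated audio, so values well below 1 mean faster than real time. A minimal sketch with hypothetical numbers (19.48 s of compute for an assumed 180 s of generated audio; the audio duration is made up, not taken from the benchmark):

```sh
# RTF = processing time / generated audio duration (lower is better)
pipeline_time=19.48   # seconds of compute (hypothetical)
audio_seconds=180     # seconds of generated audio (hypothetical)
rtf=$(awk -v t="$pipeline_time" -v d="$audio_seconds" 'BEGIN { printf "%.4f", t / d }')
echo "RTF: $rtf"
```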

runtime/triton_trtllm/README.md

Lines changed: 37 additions & 0 deletions
# Accelerating CosyVoice with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

This repository provides three acceleration solutions for CosyVoice, each targeting a different model version and Token2Wav architecture. All solutions use TensorRT-LLM for LLM acceleration and NVIDIA Triton Inference Server for serving.

## Solutions

### [CosyVoice3](README.Cosyvoice3.md)

Acceleration solution for [Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512), the latest CosyVoice model. The pipeline includes `audio_tokenizer`, `speaker_embedding`, `token2wav`, and `vocoder` modules managed by Triton, with the LLM served via `trtllm-serve`.

### [CosyVoice2 + UNet Token2Wav](README.Cosyvoice2.Unet.md)

The baseline acceleration solution for CosyVoice2, using the original UNet-based flow-matching Token2Wav module.

### [CosyVoice2 + DiT Token2Wav](README.Cosyvoice2.DiT.md)

Replaces the UNet Token2Wav with a DiT-based Token2Wav module from [Step-Audio2](https://github.com/stepfun-ai/Step-Audio-2). Supports disaggregated deployment, where the LLM and Token2Wav run on separate GPUs for better resource utilization under high concurrency.

## Quick Start

Each solution can be launched with a single Docker Compose command:

```sh
# CosyVoice3
docker compose -f docker-compose.cosyvoice3.yml up

# CosyVoice2 + UNet Token2Wav
docker compose -f docker-compose.cosyvoice2.unet.yml up

# CosyVoice2 + DiT Token2Wav
docker compose -f docker-compose.cosyvoice2.dit.yml up
```
Lines changed: 20 additions & 0 deletions
```yaml
services:
  tts:
    image: soar97/triton-cosyvoice:25.06
    shm_size: '1gb'
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    environment:
      - PYTHONIOENCODING=utf-8
      - MODEL_ID=${MODEL_ID}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    command: >
      /bin/bash -c "cd /workspace && git clone https://github.com/FunAudioLLM/CosyVoice.git && cd CosyVoice && git submodule update --init --recursive && cd runtime/triton_trtllm && bash run_cosyvoice3.sh 0 3"
```
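Compose substitutes `${MODEL_ID}` from the host environment at launch time. A hypothetical pre-flight check before bringing the stack up (only the variable name comes from the compose file; the launcher script itself is illustrative):

```sh
# Fail fast with a clear message when MODEL_ID is not exported on the host,
# since docker compose would otherwise substitute an empty string.
if [ -z "${MODEL_ID:-}" ]; then
  msg="MODEL_ID is unset; export it before 'docker compose up'"
else
  msg="launching with MODEL_ID=${MODEL_ID}"
fi
echo "$msg"
# docker compose -f docker-compose.cosyvoice3.yml up
```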

runtime/triton_trtllm/run_cosyvoice3.sh

Lines changed: 12 additions & 10 deletions
```diff
@@ -1,8 +1,7 @@
 #!/bin/bash
 # Copyright (c) 2026 NVIDIA (authors: Yuekai Zhang)
 export CUDA_VISIBLE_DEVICES=0
-# cosyvoice_path=/workspace/CosyVoice
-cosyvoice_path=/workspace_yuekai/tts/CosyVoice
+cosyvoice_path=/workspace/CosyVoice

 export PYTHONPATH=${cosyvoice_path}:$PYTHONPATH
 export PYTHONPATH=${cosyvoice_path}/third_party/Matcha-TTS:$PYTHONPATH
@@ -24,7 +23,6 @@ bls_instance_num=10
 if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then

   echo "Cloning CosyVoice"
-  pip3 install --upgrade x_transformers s3tokenizer
   git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git $cosyvoice_path
   cd $cosyvoice_path
   git submodule update --init --recursive
@@ -33,6 +31,10 @@ fi

 if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
   echo "Downloading CosyVoice3 Checkpoints"
+  # upgrade s3tokenizer unless the installed version is already 0.3.0
+  if [ "$(pip3 show s3tokenizer | awk '/^Version:/ {print $2}')" != "0.3.0" ]; then
+    pip3 install --upgrade x_transformers s3tokenizer
+  fi
   huggingface-cli download --local-dir $huggingface_llm_local_dir yuekai/Fun-CosyVoice3-0.5B-2512-LLM-HF
   huggingface-cli download --local-dir $cosyvoice3_official_model_dir yuekai/Fun-CosyVoice3-0.5B-2512-FP16-ONNX
   huggingface-cli download --local-dir $cosyvoice3_official_model_dir FunAudioLLM/Fun-CosyVoice3-0.5B-2512
@@ -76,7 +78,7 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
   LLM_TOKENIZER_DIR=$huggingface_llm_local_dir
   BLS_INSTANCE_NUM=$bls_instance_num
   TRITON_MAX_BATCH_SIZE=1
-  DECOUPLED_MODE=True
+  DECOUPLED_MODE=True # False for offline TTS

   python3 scripts/fill_template.py -i ${model_repo}/cosyvoice3/config.pbtxt model_dir:${MODEL_DIR},bls_instance_num:${BLS_INSTANCE_NUM},llm_tokenizer_dir:${LLM_TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS}
   python3 scripts/fill_template.py -i ${model_repo}/token2wav/config.pbtxt model_dir:${MODEL_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS}
@@ -111,17 +113,17 @@ if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
 fi

 if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
-  echo "stage 10: Python script CosyVoice3 TTS (LLM + CosyVoice3 Token2Wav) inference"
+  echo "stage 5: Python script CosyVoice3 TTS (LLM + CosyVoice3 Token2Wav) inference"

   datasets=(wenetspeech4tts) # wenetspeech4tts
-  backend=trtllm-serve # hf, trtllm, vllm, trtllm-serve
+  backend=trtllm # hf, trtllm, vllm, trtllm-serve

-  batch_sizes=(1)
+  batch_sizes=(16 8 4 2 1)
   token2wav_batch_size=1 # Only support 1 for now

   for batch_size in ${batch_sizes[@]}; do
     for dataset in ${datasets[@]}; do
-      output_dir=./cosyvoice3_${dataset}_${backend}_llm_batch_size_${batch_size}_token2wav_batch_size_${token2wav_batch_size}_streaming_trt
+      output_dir=./cosyvoice3_${dataset}_${backend}_llm_batch_size_${batch_size}_token2wav_batch_size_${token2wav_batch_size}_offline_tts_trt
       CUDA_VISIBLE_DEVICES=0 \
       python3 infer_cosyvoice3.py \
         --output-dir $output_dir \
@@ -130,8 +132,8 @@ if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
         --backend $backend \
         --batch-size $batch_size --token2wav-batch-size $token2wav_batch_size \
         --engine-dir $trt_engines_dir \
-        --enable-trt --streaming\
-        --epoch 1 \
+        --enable-trt \
+        --epoch 3 \
         --split-name ${dataset} || exit 1
     done
   done
```
