```sh
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
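Once the container is running, a quick optional check that it can see the GPUs (using the container name from the `--name` flag above):

```sh
# Should list the GPUs passed through by --gpus all.
docker exec -it cosyvoice-server nvidia-smi
```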
### Understanding `run_cosyvoice3.sh`

The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:
```sh
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```

- `<start_stage>`: The stage to start from.
- `<stop_stage>`: The stage to stop after.
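For example, to rerun only the engine build and model-repository steps (stages 1 and 2, described below):

```sh
# Rebuild the TensorRT-LLM engines and recreate the Triton model repository.
bash run_cosyvoice3.sh 1 2
```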
**Stages:**

- **Stage -1**: Clones the `CosyVoice` repository.
- **Stage 0**: Downloads the `Fun-CosyVoice3-0.5B-2512` model and its HuggingFace LLM checkpoint.
- **Stage 1**: Converts the LLM's HuggingFace checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- **Stage 2**: Creates the Triton model repository, including configurations for `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- **Stage 3**: Launches the Triton Inference Server for the Token2Wav module and deploys the CosyVoice3 LLM with `trtllm-serve`.
- **Stage 4**: Runs the gRPC benchmark client for performance testing.
- **Stage 5**: Runs the offline TTS inference benchmark.
### Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# This command runs stages 0, 1, 2, and 3
bash run_cosyvoice3.sh 0 3
```
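Before benchmarking, you can confirm the Triton server is ready. A minimal check, assuming Triton's default HTTP port 8000 (the script may bind a different port):

```sh
# Returns HTTP 200 once all models are loaded and the server is ready.
curl -sf localhost:8000/v2/health/ready && echo "Triton is ready"
```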
### Benchmark with Client-Server Mode

To benchmark the running Triton server, run stage 4:
```sh
bash run_cosyvoice3.sh 4 4

# You can customize parameters such as the number of tasks inside the script.
```
The following results were obtained by decoding on a single L20 GPU.
# Accelerating CosyVoice with NVIDIA Triton Inference Server and TensorRT-LLM
Contributed by Yuekai Zhang (NVIDIA).
This repository provides three acceleration solutions for CosyVoice, each targeting a different model version and Token2Wav architecture. All solutions use TensorRT-LLM for LLM acceleration and NVIDIA Triton Inference Server for serving.
## Solutions
### [CosyVoice3](README.Cosyvoice3.md)
Acceleration solution for [Fun-CosyVoice3-0.5B-2512](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512), the latest CosyVoice model. The pipeline includes `audio_tokenizer`, `speaker_embedding`, `token2wav`, and `vocoder` modules managed by Triton, with the LLM served via `trtllm-serve`.
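Because the LLM sits behind `trtllm-serve`, it exposes an OpenAI-compatible HTTP API and can be probed separately from the Triton-managed modules. A hedged example; `LLM_PORT` is a placeholder for whatever port the launch script assigns:

```sh
# List the models hosted by the trtllm-serve endpoint (port is an assumption).
curl -s localhost:${LLM_PORT}/v1/models
```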
### CosyVoice2 + UNet Token2Wav

The baseline acceleration solution for CosyVoice2, using the original UNet-based flow-matching Token2Wav module.
### [CosyVoice2 + DiT Token2Wav](README.Cosyvoice2.DiT.md)
Replaces the UNet Token2Wav with a DiT-based Token2Wav module from [Step-Audio2](https://github.com/stepfun-ai/Step-Audio-2). Supports disaggregated deployment where the LLM and Token2Wav run on separate GPUs for better resource utilization under high concurrency.
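To illustrate the disaggregated layout (a sketch only, not the repository's exact launch commands; model paths and the port are placeholders), the two services can be pinned to separate GPUs with `CUDA_VISIBLE_DEVICES`:

```sh
# Hypothetical split: LLM on GPU 0, Token2Wav modules on GPU 1.
CUDA_VISIBLE_DEVICES=0 trtllm-serve /path/to/llm_checkpoint --port 9000 &
CUDA_VISIBLE_DEVICES=1 tritonserver --model-repository=/path/to/model_repo &
```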
## Quick Start
Each solution can be launched with a single Docker Compose command:
```sh
# CosyVoice3
docker compose -f docker-compose.cosyvoice3.yml up

# CosyVoice2 + UNet Token2Wav
docker compose -f docker-compose.cosyvoice2.unet.yml up

# CosyVoice2 + DiT Token2Wav
docker compose -f docker-compose.cosyvoice2.dit.yml up
```
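Each command runs the stack in the foreground; Compose's standard `-d` flag runs it detached instead:

```sh
# Example: start the CosyVoice3 stack in the background.
docker compose -f docker-compose.cosyvoice3.yml up -d
```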