
Commit 41ce0e2

Bihan and Andrey Cheptsov authored
Add Dynamo docs (#3877)
* Add Dynamo docs
* Minor Update
* [Docs] NVIDIA Dynamo docs minor edits

---------

Co-authored-by: Bihan Rana
Co-authored-by: Andrey Cheptsov <andrey.cheptsov@github.com>
1 parent f328774 commit 41ce0e2

4 files changed

Lines changed: 281 additions & 8 deletions

mkdocs.yml

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -321,6 +321,7 @@ nav:
321321
- NCCL/RCCL tests: docs/examples/clusters/nccl-rccl-tests.md
322322
- Inference:
323323
- SGLang: docs/examples/inference/sglang.md
324+
- Dynamo: docs/examples/inference/dynamo.md
324325
- vLLM: docs/examples/inference/vllm.md
325326
- NIM: docs/examples/inference/nim.md
326327
- TensorRT-LLM: docs/examples/inference/trtllm.md

mkdocs/docs/concepts/services.md

Lines changed: 110 additions & 5 deletions
@@ -342,13 +342,13 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
342342

343343
<!-- NOTE: this section is referenced from the CLI, keep the URL unchanged -->
344344

345-
Since 0.20.17, `dstack` supports serving a model using PD disaggregation. To use it, configure three replica groups: one for [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html), one for prefill workers, and one for decode workers.
345+
Since 0.20.17, `dstack` supports serving a model using Prefill-Decode disaggregation. To use it, configure three replica groups: one for the router, one for prefill workers, and one for decode workers.
346346

347-
> Currently, Prefill-Decode disaggregation is supported only for SGLang.
347+
`dstack` integrates with two routers for PD disaggregation: [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html) and [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo).
348348

349349
Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
350350

351-
=== "NVIDIA"
351+
=== "SMG"
352352

353353
<div editor-title="pd.dstack.yml">
354354

@@ -372,10 +372,10 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
372372
--port 8000 \
373373
--pd-disaggregation \
374374
--prefill-policy cache_aware
375-
router:
376-
type: sglang
377375
resources:
378376
cpu: 4
377+
router:
378+
type: sglang
379379

380380
- count: 1..4
381381
scaling:
@@ -418,6 +418,111 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
418418

419419
</div>
420420

421+
> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
422+
423+
=== "Dynamo"
424+
425+
<div editor-title="pd.dstack.yml">
426+
```yaml
type: service
name: dynamo-pd

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    docker: true
    commands:
      - apt-get update
      - apt-get install -y python3-dev python3-venv
      - python3 -m venv ~/dyn-venv
      - source ~/dyn-venv/bin/activate
      - pip install -U pip
      - pip install "ai-dynamo[sglang]==1.1.1"
      - git clone https://github.com/ai-dynamo/dynamo.git
      # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend.
      - docker compose -f dynamo/deploy/docker-compose.yml up -d
      - |
        python3 -m dynamo.frontend \
          --http-host 0.0.0.0 --http-port 8000 \
          --discovery-backend etcd --router-mode kv \
          --kv-cache-block-size 64
    resources:
      cpu: 4
    router:
      type: dynamo

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    python: "3.12"
    nvcc: true
    commands:
      # dstack injects DSTACK_ROUTER_INTERNAL_IP after the router replica
      # is provisioned. Compose the etcd/NATS endpoints from it.
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      # Set to enable /health endpoint required by dstack probes.
      - export DYN_SYSTEM_PORT="8000"
      # Wait until the router's etcd and NATS ports are actually accepting connections.
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode prefill --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    python: "3.12"
    nvcc: true
    commands:
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      - export DYN_SYSTEM_PORT="8000"
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode decode --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
```
521+
522+
</div>
523+
524+
> With the `dynamo` router, you can use SGLang, vLLM, and TensorRT-LLM prefill and decode workers.
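
The worker replicas in the Dynamo example derive their etcd and NATS endpoints from `DSTACK_ROUTER_INTERNAL_IP`. As a minimal sketch, assuming you want the commands to fail early when that variable is missing, the two `export` lines could be wrapped in a small helper. The function name `export_router_endpoints` is hypothetical and not part of `dstack` or Dynamo; the ports are the ones used in the configuration above.

```shell
# Hypothetical helper (not part of dstack or Dynamo): compose the etcd and NATS
# endpoints from the router IP, and fail if no IP is available.
export_router_endpoints() {
  # Use the first argument if given, otherwise the variable injected by dstack.
  ip=${1:-${DSTACK_ROUTER_INTERNAL_IP:-}}
  if [ -z "$ip" ]; then
    echo "DSTACK_ROUTER_INTERNAL_IP is not set" >&2
    return 1
  fi
  export ETCD_ENDPOINTS="http://$ip:2379"
  export NATS_SERVER="nats://$ip:4222"
}
```

Calling `export_router_endpoints` with no arguments inside a worker's `commands` would have the same effect as the two `export` lines, except that the run stops with a clear message if the router IP has not been injected.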
525+
421526
!!! info "Cluster"
422527
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
423528

mkdocs/docs/examples/inference/dynamo.md

Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
1+
---
2+
title: Dynamo
3+
description: Deploying zai-org/GLM-4.5-Air-FP8 using NVIDIA Dynamo
4+
---
5+
6+
# Dynamo
7+
8+
This example shows how to deploy `zai-org/GLM-4.5-Air-FP8` using
9+
[NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) and `dstack`.
10+
11+
12+
## Apply a configuration
13+
14+
Here's an example of a service that deploys `zai-org/GLM-4.5-Air-FP8` using
15+
Dynamo with PD disaggregation.
16+
17+
<div editor-title="service.dstack.yml">
18+
```yaml
type: service
name: dynamo-pd

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    docker: true
    commands:
      - apt-get update
      - apt-get install -y python3-dev python3-venv
      - python3 -m venv ~/dyn-venv
      - source ~/dyn-venv/bin/activate
      - pip install -U pip
      - pip install "ai-dynamo[sglang]==1.1.1"
      - git clone https://github.com/ai-dynamo/dynamo.git
      # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend.
      - docker compose -f dynamo/deploy/docker-compose.yml up -d
      - |
        python3 -m dynamo.frontend \
          --http-host 0.0.0.0 --http-port 8000 \
          --discovery-backend etcd --router-mode kv \
          --kv-cache-block-size 64
    resources:
      cpu: 4
    router:
      type: dynamo

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    python: "3.12"
    nvcc: true
    commands:
      # dstack injects DSTACK_ROUTER_INTERNAL_IP after the router replica
      # is provisioned. Compose the etcd/NATS endpoints from it.
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      # Set to enable /health endpoint required by dstack probes.
      - export DYN_SYSTEM_PORT="8000"
      # Wait until the router's etcd and NATS ports are actually accepting connections.
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode prefill --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    python: "3.12"
    nvcc: true
    commands:
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      - export DYN_SYSTEM_PORT="8000"
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode decode --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
```
113+
114+
</div>
115+
116+
> With the `dynamo` router, you can use SGLang, vLLM, and TensorRT-LLM prefill and decode workers.
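
The `until` loop in the worker commands waits indefinitely for the router's etcd and NATS ports. As a hedged sketch, assuming you would rather give up after a fixed number of attempts, the same `/dev/tcp` check can be wrapped in a bounded helper. The name `wait_for_port` and its retry and delay defaults are assumptions for illustration only; the `/dev/tcp` redirection itself requires Bash, as in the configuration above.

```shell
# Hypothetical helper: wait until host:port accepts TCP connections.
# Arguments: host, port, max attempts (default 100), delay in seconds (default 3).
# Returns 0 once a connection succeeds, 1 if every attempt fails.
wait_for_port() {
  host=$1
  port=$2
  retries=${3:-100}
  delay=${4:-3}
  i=0
  while [ "$i" -lt "$retries" ]; do
    # Same probe as in the configuration: open a TCP connection via /dev/tcp.
    if (echo > "/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up waiting for $host:$port" >&2
  return 1
}
```

In a worker's `commands`, a call such as `wait_for_port "$DSTACK_ROUTER_INTERNAL_IP" 2379 && wait_for_port "$DSTACK_ROUTER_INTERNAL_IP" 4222` would then replace the unbounded loop and fail the replica if the router never becomes reachable.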
117+
118+
Save the configuration as `service.dstack.yml`, then use the
119+
[`dstack apply`](../../reference/cli/dstack/apply.md) command.
120+
121+
<div class="termy">
122+
123+
```shell
124+
$ dstack apply -f service.dstack.yml
125+
```
126+
127+
</div>
128+
129+
If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
130+
131+
<div class="termy">
132+
133+
```shell
134+
curl http://127.0.0.1:3000/proxy/services/main/dynamo-pd/v1/chat/completions \
135+
-X POST \
136+
-H 'Authorization: Bearer &lt;user token&gt;' \
137+
-H 'Content-Type: application/json' \
138+
-d '{
139+
"model": "zai-org/GLM-4.5-Air-FP8",
140+
"messages": [
141+
{
142+
"role": "user",
143+
"content": "What is prefill-decode disaggregation?"
144+
}
145+
],
146+
"max_tokens": 1024
147+
}'
148+
```
149+
150+
</div>
151+
152+
> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://dynamo-pd.<gateway domain>/`.
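
The two endpoint patterns above differ only in how the base URL is built. A minimal sketch of both, using hypothetical helper names (`service_proxy_url` and `service_gateway_url` are not `dstack` commands) and `example.com` as a placeholder gateway domain:

```shell
# Hypothetical helpers that only format the URL patterns described above.

# No gateway: <dstack server URL>/proxy/services/<project name>/<run name>/
service_proxy_url() {
  server=${1%/}   # drop a single trailing slash from the server URL, if present
  printf '%s/proxy/services/%s/%s/\n' "$server" "$2" "$3"
}

# With a gateway: https://<run name>.<gateway domain>/
service_gateway_url() {
  printf 'https://%s.%s/\n' "$1" "$2"
}

service_proxy_url http://127.0.0.1:3000 main dynamo-pd
service_gateway_url dynamo-pd example.com
```

Appending `v1/chat/completions` to either base URL gives the request path used in the `curl` example.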
153+
154+
## Configuration options
155+
156+
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
157+
158+
!!! info "Cluster"
159+
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
160+
161+
While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
162+
163+
## What's next?
164+
165+
1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)
166+
2. Browse the [NVIDIA Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo) and the [SGLang](./sglang.md) example

mkdocs/docs/examples/inference/sglang.md

Lines changed: 4 additions & 3 deletions
@@ -92,7 +92,6 @@ Here's an example of a service that deploys
9292
The AMD example keeps the deployment close to the upstream Qwen and SGLang
9393
guidance: a pinned ROCm image, tensor parallelism across all four GPUs, and the
9494
standard `qwen3` reasoning parser without extra ROCm-specific tuning flags.
95-
The first startup on MI300X can take longer while SGLang compiles ROCm kernels.
9695

9796
Save one of the configurations above as `service.dstack.yml`, then use the
9897
[`dstack apply`](../../reference/cli/dstack/apply.md) command.
@@ -164,10 +163,10 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/
164163
--port 8000 \
165164
--pd-disaggregation \
166165
--prefill-policy cache_aware
167-
router:
168-
type: sglang
169166
resources:
170167
cpu: 4
168+
router:
169+
type: sglang
171170

172171
- count: 1..4
173172
scaling:
@@ -212,6 +211,8 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/
212211

213212
</div>
214213

214+
> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
215+
215216
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
216217

217218
!!! info "Cluster"
