This cookbook provides step-by-step instructions for different SGLang deployment scenarios, from single-node testing to high-throughput distributed serving.
Before running any of the examples below, you must run the initialization script to install dependencies and the SGLang package.
# Initialize SGLang environment
source ./init_sglang_git.sh

For most development and simple testing, running SGLang on a single node is sufficient.
cd sglang && uv run python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000

This section outlines the steps to enable Prefill-Decode (PD) disaggregation with the Mooncake transfer backend and configure metrics.
To enable Grafana/Prometheus integration, add the --enable-metrics flag to your worker instances.
CUDA_VISIBLE_DEVICES=0 uv run python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--port 8001 \
--host 0.0.0.0 \
--enable-metrics

CUDA_VISIBLE_DEVICES=1 uv run python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--port 8002 \
--host 0.0.0.0 \
--enable-metrics

The router exposes metrics on port 29000 by default.
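
Before launching the router, it can help to confirm that the prefill and decode workers started above are accepting connections, since model loading can take a while. The following is a minimal sketch using only the Python standard library. The helper names `wait_for_port` and `check_workers` and the timeout values are illustrative assumptions, not part of SGLang. An open port only shows that the HTTP server is bound; it does not prove the model is fully loaded.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


def check_workers(ports: dict[str, int], host: str = "localhost",
                  timeout_s: float = 300.0) -> dict[str, bool]:
    """Wait for each named worker port in turn and report which ones came up."""
    return {name: wait_for_port(host, port, timeout_s) for name, port in ports.items()}
```

With the ports used in this cookbook, `check_workers({"prefill": 8001, "decode": 8002})` returns a dictionary showing whether each worker became reachable within the timeout.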
uv run python3 -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://localhost:8001 \
--decode http://localhost:8002 \
--host 0.0.0.0 \
--port 8000

Test the inference API through the router:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 16
}'

Verify that metrics are accessible:
| Component | URL | Note |
|---|---|---|
| Router | http://localhost:29000/metrics | Enabled by default |
| Prefill | http://localhost:8001/metrics | Requires --enable-metrics |
| Decode | http://localhost:8002/metrics | Requires --enable-metrics |
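
The check against the three endpoints in the table can also be scripted. The sketch below fetches each URL and counts the distinct metric names in the Prometheus text output, reporting `-1` for a component that cannot be reached. The function names (`metric_names`, `fetch_metric_names`, `check_all`) and the 5-second timeout are assumptions made for illustration. The parser handles only the basic `name{labels} value` line shape.

```python
import re
import urllib.request

# URLs taken from the table above.
METRICS_URLS = {
    "router": "http://localhost:29000/metrics",
    "prefill": "http://localhost:8001/metrics",
    "decode": "http://localhost:8002/metrics",
}

# A metric name is the leading identifier on a sample line.
_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _NAME_RE.match(line)
        if match:
            names.add(match.group(0))
    return names


def fetch_metric_names(url: str, timeout_s: float = 5.0) -> set[str]:
    """Download one /metrics page and return the metric names it contains."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return metric_names(resp.read().decode("utf-8", errors="replace"))


def check_all(urls: dict[str, str]) -> dict[str, int]:
    """Map component -> number of distinct metric names, or -1 if unreachable."""
    result = {}
    for component, url in urls.items():
        try:
            result[component] = len(fetch_metric_names(url))
        except OSError:  # urllib.error.URLError is a subclass of OSError
            result[component] = -1
    return result
```

Calling `check_all(METRICS_URLS)` after the workers and router are up should give a positive count for each component. A `-1` for a worker usually means the port is wrong or the server is not running.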
Add this to your prometheus.yml to scrape all three components:
scrape_configs:
  - job_name: 'sglang_pd'
    static_configs:
      - targets:
          - 'localhost:29000' # Router
          - 'localhost:8001' # Prefill
          - 'localhost:8002' # Decode
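
If more prefill or decode workers are added later, the target list grows with them. As a rough sketch, the fragment above could be generated from a mapping of addresses to labels. The helper `render_scrape_config` is hypothetical and simply reproduces the layout shown above. It does not validate addresses or merge with an existing `prometheus.yml`.

```python
def render_scrape_config(targets: dict[str, str], job_name: str = "sglang_pd") -> str:
    """Render a scrape_configs fragment.

    targets maps 'host:port' to a short label that is emitted as a YAML comment.
    """
    lines = [
        "scrape_configs:",
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        "      - targets:",
    ]
    for address, label in targets.items():
        lines.append(f"          - '{address}' # {label}")
    return "\n".join(lines) + "\n"


# The three components used in this cookbook.
DEFAULT_TARGETS = {
    "localhost:29000": "Router",
    "localhost:8001": "Prefill",
    "localhost:8002": "Decode",
}
```

`print(render_scrape_config(DEFAULT_TARGETS))` emits the same fragment as shown above. Adding another entry to the mapping adds another target line.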