diff --git a/mkdocs.yml b/mkdocs.yml index 5de85a37d..debf86167 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -171,12 +171,14 @@ plugins: "examples/distributed-training/trl.md": "docs/examples/training/trl.md" "examples/distributed-training/axolotl.md": "docs/examples/training/axolotl.md" "examples/distributed-training/ray-ragen.md": "docs/examples/training/ray-ragen.md" + "examples/distributed-training/miles.md": "docs/examples/training/miles.md" # Single-node and distributed training merged under Training "docs/examples/single-node-training/trl.md": "docs/examples/training/trl.md" "docs/examples/single-node-training/axolotl.md": "docs/examples/training/axolotl.md" "docs/examples/distributed-training/trl.md": "docs/examples/training/trl.md" "docs/examples/distributed-training/axolotl.md": "docs/examples/training/axolotl.md" "docs/examples/distributed-training/ray-ragen.md": "docs/examples/training/ray-ragen.md" + "docs/examples/distributed-training/miles.md": "docs/examples/training/miles.md" "examples/clusters/aws.md": "docs/examples/clusters/aws.md" "examples/clusters/gcp.md": "docs/examples/clusters/gcp.md" "examples/clusters/lambda.md": "docs/examples/clusters/lambda.md" @@ -312,6 +314,7 @@ nav: - TRL: docs/examples/training/trl.md - Axolotl: docs/examples/training/axolotl.md - Ray+RAGEN: docs/examples/training/ray-ragen.md + - Miles: docs/examples/training/miles.md - Clusters: - AWS: docs/examples/clusters/aws.md - GCP: docs/examples/clusters/gcp.md diff --git a/mkdocs/docs/concepts/tasks.md b/mkdocs/docs/concepts/tasks.md index 43eb8e80c..d3a3515b8 100644 --- a/mkdocs/docs/concepts/tasks.md +++ b/mkdocs/docs/concepts/tasks.md @@ -151,7 +151,7 @@ Jobs on each node communicate using their private IP addresses. Use `DSTACK_MAST `dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks. !!! info "Examples" - See the training examples for [TRL](../examples/training/trl.md#distributed-training), [Axolotl](../examples/training/axolotl.md#distributed-training), and [Ray+RAGEN](../examples/training/ray-ragen.md). + See the training examples for [TRL](../examples/training/trl.md#distributed-training), [Axolotl](../examples/training/axolotl.md#distributed-training), [Ray+RAGEN](../examples/training/ray-ragen.md), and [Miles](../examples/training/miles.md). See the cluster examples for [AWS](../examples/clusters/aws.md), [GCP](../examples/clusters/gcp.md), [Lambda](../examples/clusters/lambda.md), [Crusoe](../examples/clusters/crusoe.md), [Nebius](../examples/clusters/nebius.md), and [NCCL/RCCL tests](../examples/clusters/nccl-rccl-tests.md). diff --git a/mkdocs/docs/examples.md b/mkdocs/docs/examples.md index 59203cdf8..ad8bf4839 100644 --- a/mkdocs/docs/examples.md +++ b/mkdocs/docs/examples.md @@ -49,6 +49,17 @@ hide: Fine-tune an agent on multiple nodes with RAGEN, verl, and Ray.

+ + +

+ Miles +

+ +

+ RL-fine-tune Qwen2.5-32B with Miles. +

+
diff --git a/mkdocs/docs/examples/training/miles.md b/mkdocs/docs/examples/training/miles.md new file mode 100644 index 000000000..59451d150 --- /dev/null +++ b/mkdocs/docs/examples/training/miles.md @@ -0,0 +1,283 @@ +--- +title: Miles +description: RL post-training Qwen2.5-32B with Miles, SGLang, Megatron-LM, and Ray across two 8xH100 nodes +--- + +# Miles + +This example shows how to use `dstack` and [Miles](https://github.com/radixark/miles) +for reinforcement learning (RL) post-training of a 32B language model with +[GRPO](https://arxiv.org/abs/2402.03300) across a multi-node cluster. +Miles integrates [SGLang](https://github.com/sgl-project/sglang) for +high-throughput rollouts, [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) +for training, and [Ray](https://docs.ray.io/en/latest/) to coordinate the +trainer and rollout actors across nodes. + +Here we fine-tune `Qwen/Qwen2.5-32B-Instruct` on the +[GSM8K](https://huggingface.co/datasets/openai/gsm8k) dataset. + +!!! info "Prerequisites" + Multi-node tasks require a [fleet](../../concepts/fleets.md) with + `placement` set to [`cluster`](../../concepts/fleets.md#cluster-placement). + +## Run a Ray cluster + +### Define a configuration + +The [task](../../concepts/tasks.md) below starts Ray on two nodes and prepares +each node by downloading the model and dataset, then converting the checkpoint +to Megatron's `torch_dist` format. + +
+ +```yaml +type: task +name: miles-qwen32b-h100 +nodes: 2 +image: radixark/miles:sglang-miles-v0.5.12 +env: + - WANDB_API_KEY + - PYTHONPATH=/root/Megatron-LM + - NCCL_DEBUG=INFO + - MODEL_ID=Qwen/Qwen2.5-32B-Instruct +commands: + # 1. Download the model and dataset. + - pip install -U "huggingface_hub[cli]" + - hf download "$MODEL_ID" --local-dir "/root/$(basename "$MODEL_ID")" + - hf download --repo-type dataset openai/gsm8k --local-dir /root/gsm8k + # 2. Convert the Hugging Face checkpoint to Megatron torch_dist. + - | + MODEL_NAME="$(basename "$MODEL_ID")" + cd /root/miles && python tools/convert_hf_to_torch_dist.py \ + --swiglu \ + --num-layers 64 \ + --hidden-size 5120 \ + --ffn-hidden-size 27648 \ + --num-attention-heads 40 \ + --use-rotary-position-embeddings \ + --disable-bias-linear \ + --add-qkv-bias \ + --normalization RMSNorm \ + --norm-epsilon 1e-5 \ + --rotary-base 1000000 \ + --group-query-attention \ + --num-query-groups 8 \ + --vocab-size 152064 \ + --untie-embeddings-and-output-weights \ + --hf-checkpoint "/root/$MODEL_NAME" \ + --save "/root/${MODEL_NAME}_torch_dist" + # 3. Start Ray. + - | + if [ $DSTACK_NODE_RANK = 0 ]; then + ray start --head --port=6379 + else + ray start --address=$DSTACK_MASTER_NODE_IP:6379 + fi +ports: + - 8265 +resources: + gpu: H100:8 + shm_size: 32GB + disk: 1000GB.. +volumes: + - /checkpoints:/checkpoints +``` + +
+ +### Run the configuration + +Run the task with [`dstack apply`](../../reference/cli/dstack/apply.md). By +default, `dstack apply` forwards the Ray dashboard port to `localhost:8265`. + +
+ +```shell +$ export WANDB_API_KEY=... +$ dstack apply -f miles-qwen32b-h100.dstack.yml +``` + +
+ +While `dstack apply` is attached, you can submit Ray jobs through +`localhost:8265`. If you detach or run from another machine, use +[`dstack attach`](../../reference/cli/dstack/attach.md) to re-attach and make +the dashboard port accessible on `localhost`. + +> To run on a single node, remove `nodes` or set it to `1`, then submit the job +> with `NUM_NODES=1`. In this case, `placement: cluster` is not required. + +## Submit Ray jobs + +Install `ray` locally before submitting jobs: + +
+ +```shell +$ pip install ray +``` + +
+ +The submit script below runs the Miles training job on the Ray cluster. The +model is sharded across all 8 GPUs per node with tensor parallelism, and SGLang +uses the same 8 GPUs per node for rollout. + +
+ +```bash +#!/bin/bash +set -euo pipefail + +export RAY_ADDRESS=http://localhost:8265 + +: "${NUM_NODES:?NUM_NODES is not set}" +: "${GPUS_PER_NODE:?GPUS_PER_NODE is not set}" + +MODEL_ID="Qwen/Qwen2.5-32B-Instruct" +MODEL_NAME="$(basename "$MODEL_ID")" +HF_CHECKPOINT="/root/$MODEL_NAME" +REF_LOAD="/root/${MODEL_NAME}_torch_dist" +PROMPT_DATA="/root/gsm8k/main/train-00000-of-00001.parquet" +EVAL_PROMPT_DATA="/root/gsm8k/main/test-00000-of-00001.parquet" +INPUT_KEY="question" +LABEL_KEY="answer" +EVAL_DATASET_NAME="gsm8k" +CHECKPOINT_DIR="/checkpoints/${MODEL_NAME}-${EVAL_DATASET_NAME}" +SAVE_INTERVAL=10 +WANDB_PROJECT="dstack-miles-RL" +WANDB_GROUP="${MODEL_NAME}-gsm8k-${NUM_NODES}node-${GPUS_PER_NODE}gpu" +WANDB_NAME="rollout-$(date +%Y%m%d-%H%M%S)" +ROLLOUT_GPUS_PER_ENGINE=8 + +CMD='cd /root/miles && python3 train.py \ + --actor-num-nodes '"$NUM_NODES"' \ + --actor-num-gpus-per-node '"$GPUS_PER_NODE"' \ + --num-gpus-per-node '"$GPUS_PER_NODE"' \ + --rollout-num-gpus-per-engine '"$ROLLOUT_GPUS_PER_ENGINE"' \ + --sglang-server-concurrency 128 \ + --colocate \ + --calculate-per-token-loss \ + --use-miles-router \ + --swiglu \ + --num-layers 64 \ + --hidden-size 5120 \ + --ffn-hidden-size 27648 \ + --num-attention-heads 40 \ + --use-rotary-position-embeddings \ + --disable-bias-linear \ + --add-qkv-bias \ + --normalization RMSNorm \ + --norm-epsilon 1e-5 \ + --rotary-base 1000000 \ + --group-query-attention \ + --num-query-groups 8 \ + --vocab-size 152064 \ + --untie-embeddings-and-output-weights \ + --hf-checkpoint '"$HF_CHECKPOINT"' \ + --ref-load '"$REF_LOAD"' \ + --prompt-data '"$PROMPT_DATA"' \ + --input-key '"$INPUT_KEY"' \ + --label-key '"$LABEL_KEY"' \ + --apply-chat-template \ + --rollout-shuffle \ + --rm-type math \ + --num-rollout 20 \ + --rollout-batch-size 8 \ + --n-samples-per-prompt 8 \ + --rollout-max-response-len 512 \ + --rollout-temperature 1 \ + --global-batch-size 64 \ + --eval-interval 5 \ + --eval-prompt-data '"$EVAL_DATASET_NAME"' '"$EVAL_PROMPT_DATA"' \ + --n-samples-per-eval-prompt 1 \ + --eval-max-response-len 512 \ + --eval-top-k 1 \ + --tensor-model-parallel-size 8 \ + --sequence-parallel \ + --pipeline-model-parallel-size 1 \ + --context-parallel-size 1 \ + --expert-model-parallel-size 1 \ + --expert-tensor-parallel-size 1 \ + --use-dynamic-batch-size \ + --max-tokens-per-gpu 9216 \ + --advantage-estimator grpo \ + --use-kl-loss \ + --kl-loss-coef 0.00 \ + --kl-loss-type low_var_kl \ + --kl-coef 0.00 \ + --entropy-coef 0.00 \ + --eps-clip 0.2 \ + --eps-clip-high 0.28 \ + --optimizer adam \ + --lr 1e-6 \ + --lr-decay-style constant \ + --weight-decay 0.1 \ + --adam-beta1 0.9 \ + --adam-beta2 0.98 \ + --sglang-mem-fraction-static 0.7 \ + --use-wandb \ + --wandb-host https://wandb.ai/ \ + --wandb-project '"$WANDB_PROJECT"' \ + --wandb-group '"$WANDB_GROUP"' \ + --wandb-exp-name '"$WANDB_NAME"' \ + --attention-dropout 0.0 \ + --hidden-dropout 0.0 \ + --accumulate-allreduce-grads-in-fp32 \ + --attention-softmax-in-fp32 \ + --attention-backend flash \ + --save '"$CHECKPOINT_DIR"' \ + --save-interval '"$SAVE_INTERVAL"'' + +# GLOO_SOCKET_IFNAME=eth0 is required for multi-node Gloo process group init. +# Without it, Gloo resolves to a loopback address (127.0.1.1) instead of the +# inter-node interface, causing `init_gloo_group()` to timeout. +RUNTIME_ENV_JSON=$(cat < + +Submit the job with the same cluster shape as the task: + +
+ +```shell +$ NUM_NODES=2 GPUS_PER_NODE=8 bash submit-miles-train.sh +``` + +
+ +!!! info "Training parameters" + 1. `--tensor-model-parallel-size 8` shards the 32B model across all 8 GPUs + per node. + 2. `--rollout-num-gpus-per-engine 8` starts SGLang with TP-8 on each node. + 3. `--sglang-server-concurrency` sets how many requests SGLang processes + concurrently. + 4. `--max-tokens-per-gpu 9216` sets the per-GPU token budget. Lower this if + Megatron OOMs during training. + 5. `--sglang-mem-fraction-static 0.7` sets the SGLang KV cache memory + fraction. Lower this if Megatron OOMs at startup. + +Using Ray via `dstack` gives you access to the Ray ecosystem while benefiting +from `dstack`'s provisioning capabilities. + +!!! info "What's next" + 1. Read about [distributed tasks](../../concepts/tasks.md#distributed-tasks) + and [fleets](../../concepts/fleets.md) + 2. See the [SGLang inference](../inference/sglang.md) example + 3. Browse Miles' [examples](https://github.com/radixark/miles/tree/main/examples)