Add unified DeepSpeed finetune demo with MMLU/GSM8K benchmarks #1001

Merged: delock merged 1 commit into deepspeedai:master from delock:gma/add-deepspeed-finetune-demo on May 20, 2026

delock commented May 20, 2026

Summary

This PR adds a standalone finetuning example under training/deepspeed_finetune_demo/ that demonstrates DeepSpeed's philosophy: switch between training features by changing the config file, with no code changes.
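
As a rough illustration of the config-only idea, the sketch below derives three variants from one base config by changing nothing but the `zero_optimization` section. The config keys are standard DeepSpeed config keys, but the values, the `make_variant` helper, and the variant names are assumptions for illustration and are not taken from the files in configs/.

```python
import copy
import json

# Hypothetical base config; values are illustrative, not the PR's settings.
BASE_CONFIG = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 2},
}


def make_variant(base, zero_overrides):
    """Return a deep copy of `base` with `zero_optimization` keys overridden."""
    cfg = copy.deepcopy(base)
    cfg["zero_optimization"].update(zero_overrides)
    return cfg


# Each variant differs only in its config; the training code stays the same.
variants = {
    "z2": make_variant(BASE_CONFIG, {}),
    "z3": make_variant(BASE_CONFIG, {"stage": 3}),
    "z2_offload": make_variant(
        BASE_CONFIG, {"offload_optimizer": {"device": "cpu"}}
    ),
}

for name, cfg in variants.items():
    print(name, json.dumps(cfg["zero_optimization"], sort_keys=True))
```

A training script can then hand the selected dict (or the path to the equivalent JSON file) to `deepspeed.initialize(..., config=...)` without any other change.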

The example is extracted and extended from DeepSpeed-ZenFlow/finetuning.

Key Features

  • Dataset registry with automatic format detection (Alpaca, Magicoder, MMLU, etc.) and sample_rate support for downsampling large datasets
  • MMLU and GSM8K benchmark evaluation scripts (vLLM-based generation + accuracy scoring)
  • Auto-detection of flash_attn availability (falls back gracefully if not installed)
  • DistributedSampler for correct multi-GPU data sharding
  • Checkpoint conversion (convert_ds_to_hf.py) from DeepSpeed format to HuggingFace, with AutoEP (expert parallelism) support
  • 9 DeepSpeed config variants: ZeRO-2/3, Offload, ZenFlow, SuperOffload, Muon optimizer, AutoTP
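
Two of the features listed above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the code in finetune_llama.py: `pick_attn_implementation` and `shard_indices` are hypothetical helpers, the "sdpa" fallback is an assumed choice, and the sharding function only mimics what `torch.utils.data.DistributedSampler` does with `shuffle=False` and `drop_last=False`.

```python
import importlib.util


def pick_attn_implementation(package="flash_attn"):
    """Choose an attention backend based on whether `package` is importable.

    The returned strings are valid `attn_implementation` values for Hugging
    Face Transformers; falling back to "sdpa" is an assumption.
    """
    if importlib.util.find_spec(package) is not None:
        return "flash_attention_2"
    return "sdpa"


def shard_indices(num_samples, rank, world_size):
    """Return the sample indices one rank would see, without shuffling.

    Indices are padded by wrapping around so every rank gets the same count,
    then split with a stride of `world_size`.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    per_rank = -(-num_samples // world_size)  # ceiling division
    total = per_rank * world_size
    indices = list(range(num_samples))
    while len(indices) < total:
        indices += indices[: total - len(indices)]
    return indices[rank:total:world_size]


print(pick_attn_implementation())
for rank in range(4):
    print(rank, shard_indices(10, rank, 4))
```

With 10 samples and 4 ranks, every rank receives 3 indices and the two padded slots reuse indices 0 and 1, so each sample appears at least once per epoch and no rank runs short.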

Tested Configurations

Verified on Qwen2.5-0.5B with 2x RTX 4090 (AutoDL):

| Benchmark | Baseline | After 10-step finetune |
|-----------|----------|------------------------|
| GSM8K     | 28.43%   |                        |
| MMLU      | 33.12%   | 38.01% (+4.89%)        |

Full pipeline validated: train → convert checkpoint → vLLM eval.
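
For readers unfamiliar with how a GSM8K accuracy figure is typically produced, here is a hedged sketch of the scoring step. It assumes the common convention of comparing the last number in the generated text against the reference value after the `####` marker in the GSM8K answer field; the scripts under evaluate/gsm8k/ may use a different extraction rule, and all function names here are hypothetical.

```python
import re

# Matches integers and decimals, optionally with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in `text` as a float, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reference_value(answer_field):
    """Extract the value after '####' from a GSM8K-style answer field."""
    return float(answer_field.split("####")[-1].strip().replace(",", ""))


def accuracy(generations, answer_fields):
    """Percentage of generations whose last number equals the reference."""
    correct = sum(
        last_number(gen) == reference_value(ref)
        for gen, ref in zip(generations, answer_fields)
    )
    return 100.0 * correct / len(answer_fields)


# Toy inputs invented for illustration; these are not benchmark outputs.
generations = [
    "She sells 9 eggs at $2 each, so she makes $18 per day.",
    "The answer is 5.",
    "I am not sure.",
]
answer_fields = ["9 * 2 = 18\n#### 18", "#### 3", "#### 7"]
print(f"{accuracy(generations, answer_fields):.2f}%")
```

Comparing floats rather than strings keeps "18" and "18.0" equivalent; a generation with no number at all simply counts as wrong.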

File Structure

training/deepspeed_finetune_demo/
├── README.md                    # Documentation
├── finetune_llama.py            # Main training script
├── finetune.sh / benchmark.sh / profile.sh
├── run_and_evaluate.sh          # End-to-end: train + convert + eval
├── convert_ds_to_hf.py          # DeepSpeed → HuggingFace checkpoint conversion
├── requirements.txt
├── configs/                     # 9 DeepSpeed config variants
│   ├── z2_config.json           # ZeRO Stage 2 with AdamW
│   ├── z3_config.json           # ZeRO Stage 3 with AdamW
│   ├── zo_config.json           # ZeRO Offload, stage 2
│   ├── ...
│   └── z2_muon.json             # ZeRO 2 with Muon optimizer
└── evaluate/
    ├── mmlu/                    # MMLU eval (gen + score)
    └── gsm8k/                   # GSM8K eval (gen + score)

Dataset Support

| Dataset                  | Format                  | Use Case                          |
|--------------------------|-------------------------|-----------------------------------|
| tatsu-lab/alpaca         | Alpaca                  | General instruction tuning        |
| sahil2801/CodeAlpaca-20k | Alpaca                  | Code instruction tuning           |
| meta-math/MetaMathQA     | Alpaca (sample_rate=0.1) | Math reasoning (GSM8K downstream) |
| cais/mmlu                | MMLU MCQ                | Knowledge (MMLU downstream)       |
| Any HF dataset           | Auto-detect             | Magicoder or Alpaca format        |
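
The registry, auto-detection, and sample_rate ideas in the table above could look roughly like the sketch below. This is an assumed structure, not the PR's DATASET_REGISTRY: the dict layout, the column names used for detection, and the `detect_format` and `downsample` helpers are all hypothetical.

```python
import random

# Hypothetical registry layout; only the dataset names, formats, and the
# 0.1 sample rate come from the table above.
DATASET_REGISTRY = {
    "tatsu-lab/alpaca": {"format": "alpaca", "sample_rate": 1.0},
    "meta-math/MetaMathQA": {"format": "alpaca", "sample_rate": 0.1},
    "cais/mmlu": {"format": "mmlu", "sample_rate": 1.0},
}


def detect_format(columns):
    """Guess the format of an unregistered dataset from its column names.

    The column names checked here are assumptions for illustration.
    """
    cols = set(columns)
    if {"instruction", "output"} <= cols:
        return "alpaca"
    if {"problem", "solution"} <= cols:
        return "magicoder"
    raise ValueError(f"cannot detect format from columns: {sorted(cols)}")


def downsample(rows, sample_rate, seed=0):
    """Keep roughly `sample_rate` of `rows`, reproducibly and in order."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    if not rows or sample_rate == 1:
        return list(rows)
    keep_count = max(1, int(len(rows) * sample_rate))
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(rows)), keep_count))
    return [rows[i] for i in keep]


rows = list(range(1000))
rate = DATASET_REGISTRY["meta-math/MetaMathQA"]["sample_rate"]
print(len(downsample(rows, rate)))
print(detect_format(["instruction", "input", "output"]))
```

Seeding the sampler keeps the retained subset identical across runs, which matters when comparing config variants on the same data.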

delock requested a review from tjruwase as a code owner May 20, 2026 06:53
…AutoEP configs

- Add unified finetune script (finetune_llama.py) with DATASET_REGISTRY
  supporting Alpaca, CodeAlpaca, Magicoder, MetaMathQA, MMLU, MBPP datasets
- Add sample_rate mechanism for dataset downsampling (MetaMathQA: 0.1)
- Add MMLU and GSM8K evaluation pipelines (vllm-based generation + scoring)
- Add Moonlight AutoEP ZeRO-2 configs (AdamW and Muon)
- Add end-to-end run_and_evaluate.sh supporting MBPP/MMLU/GSM8K benchmarks
- Add DeepSpeed checkpoint to HF model conversion with AutoEP/MoE support
- Update README with dataset registry details, benchmark usage, and configs

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
delock force-pushed the gma/add-deepspeed-finetune-demo branch from a0ae7bc to 1171f89 May 20, 2026 09:46
delock merged commit 6eb2582 into deepspeedai:master May 20, 2026
2 checks passed