Add unified DeepSpeed finetune demo with MMLU/GSM8K benchmarks by delock · Pull Request #1001 · deepspeedai/DeepSpeedExamples

delock · 2026-05-20T06:53:33Z

Summary

This PR adds a new standalone finetuning example under training/deepspeed_finetune_demo/ that demonstrates DeepSpeed's philosophy: switch between training features by changing the config file, with no code changes.

The example is extracted and extended from DeepSpeed-ZenFlow/finetuning.

Key Features

Dataset registry with automatic format detection (Alpaca, Magicoder, MMLU, etc.) and sample_rate support for downsampling large datasets
MMLU and GSM8K benchmark evaluation scripts (vLLM-based generation + accuracy scoring)
Auto-detection of flash_attn availability (falls back gracefully if not installed)
DistributedSampler for correct multi-GPU data sharding
Checkpoint conversion (convert_ds_to_hf.py) from DeepSpeed format to HuggingFace, with AutoEP (expert parallelism) support
9 DeepSpeed config variants: ZeRO-2/3, Offload, ZenFlow, SuperOffload, Muon optimizer, AutoTP
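The flash_attn auto-detection listed above can be sketched as follows. This is a minimal illustration and not the code from finetune_llama.py: the helper name `pick_attn_implementation` and the choice of `"sdpa"` as the fallback are assumptions.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is installed, else "sdpa".

    Hypothetical helper; the fallback value is an assumption.
    """
    # find_spec only locates the package; it does not import it.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_impl = pick_attn_implementation()
print(f"attention implementation: {attn_impl}")
```

The returned string would typically be passed to Hugging Face Transformers as `from_pretrained(..., attn_implementation=attn_impl)`.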

Tested Configurations

Verified on Qwen2.5-0.5B with 2x RTX 4090 (AutoDL):

Benchmark	Baseline	After 10-step finetune
GSM8K	28.43%	—
MMLU	33.12%	38.01% (+4.89 percentage points)

Full pipeline validated: train → convert checkpoint → vLLM eval.
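For readers unfamiliar with how a GSM8K accuracy figure such as the one above is typically produced, here is a rough sketch of the scoring step. It is not the script under evaluate/gsm8k/: the function names are hypothetical, and the rule "take the last number in the generation" is an assumption. The only dataset fact relied on is that GSM8K reference answers end with `#### <answer>`.

```python
import re
from typing import List, Optional

# Matches integers and decimals, allowing thousands separators (e.g. 1,200).
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in `text` as a float, or None if there is none."""
    # Prefer whatever follows the "####" marker when it is present.
    if "####" in text:
        text = text.split("####")[-1]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose final number equals the reference answer."""
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_number(pred), extract_final_number(ref)
        if p is not None and r is not None and p == r:
            correct += 1
    return correct / len(references)


preds = ["So she earns 18 dollars. #### 18", "The total is 1,200.", "I am not sure."]
refs = ["16 + 2 = 18 #### 18", "#### 1200", "#### 5"]
print(f"accuracy: {gsm8k_accuracy(preds, refs):.2%}")
```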

File Structure

training/deepspeed_finetune_demo/
├── README.md                    # Documentation
├── finetune_llama.py            # Main training script
├── finetune.sh / benchmark.sh / profile.sh
├── run_and_evaluate.sh          # End-to-end: train + convert + eval
├── convert_ds_to_hf.py          # DeepSpeed → HuggingFace checkpoint conversion
├── requirements.txt
├── configs/                     # 9 DeepSpeed config variants
│   ├── z2_config.json           # ZeRO Stage 2 with AdamW
│   ├── z3_config.json           # ZeRO Stage 3 with AdamW
│   ├── zo_config.json           # ZeRO Offload, stage 2
│   ├── ...
│   └── z2_muon.json             # ZeRO 2 with Muon optimizer
└── evaluate/
    ├── mmlu/                    # MMLU eval (gen + score)
    └── gsm8k/                   # GSM8K eval (gen + score)
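To illustrate the "change the config, not the code" idea behind the configs/ directory, the sketch below builds two DeepSpeed config dictionaries that differ only in their `zero_optimization` section. The keys are standard DeepSpeed config keys, but all values are illustrative assumptions and are not the contents of the PR's z2_config.json or zo_config.json.

```python
import copy
import json

# Illustrative ZeRO Stage 2 + AdamW config (values are assumptions).
z2_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5, "weight_decay": 0.01}},
    "zero_optimization": {"stage": 2},
}

# A ZeRO-Offload variant: same training setup, optimizer state moved to CPU.
zo_config = copy.deepcopy(z2_config)
zo_config["zero_optimization"]["offload_optimizer"] = {
    "device": "cpu",
    "pin_memory": True,
}

print(json.dumps(zo_config, indent=2))
```

Because only the JSON differs, the same training script can be launched with either file.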

Dataset Support

Dataset	Format	Use Case
`tatsu-lab/alpaca`	Alpaca	General instruction tuning
`sahil2801/CodeAlpaca-20k`	Alpaca	Code instruction tuning
`meta-math/MetaMathQA`	Alpaca (sample_rate=0.1)	Math reasoning (GSM8K downstream)
`cais/mmlu`	MMLU MCQ	Knowledge (MMLU downstream)
Any HF dataset	Auto-detect	Magicoder or Alpaca format
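The table above could map onto a registry along these lines. This is a sketch under stated assumptions, not the PR's implementation: `DatasetSpec`, `detect_format`, and `downsample` are hypothetical names, and the column names used for detection (`instruction`/`output` for Alpaca, `problem`/`solution` for Magicoder) are assumptions about the dataset schemas.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class DatasetSpec:
    fmt: Optional[str] = None   # None means "auto-detect from columns"
    sample_rate: float = 1.0    # fraction of rows to keep


# Hypothetical registry mirroring part of the table above.
DATASET_REGISTRY = {
    "tatsu-lab/alpaca": DatasetSpec(fmt="alpaca"),
    "sahil2801/CodeAlpaca-20k": DatasetSpec(fmt="alpaca"),
    "meta-math/MetaMathQA": DatasetSpec(fmt="alpaca", sample_rate=0.1),
}


def detect_format(columns: Iterable[str]) -> str:
    """Guess the dataset format from its column names (assumed schemas)."""
    cols = set(columns)
    if {"instruction", "output"} <= cols:
        return "alpaca"
    if {"problem", "solution"} <= cols:
        return "magicoder"
    raise ValueError(f"cannot detect format from columns: {sorted(cols)}")


def downsample(rows: List, sample_rate: float, seed: int = 0) -> List:
    """Keep roughly `sample_rate` of the rows, chosen reproducibly."""
    if not rows or sample_rate >= 1.0:
        return list(rows)
    k = max(1, round(len(rows) * sample_rate))
    return random.Random(seed).sample(rows, k)


spec = DATASET_REGISTRY["meta-math/MetaMathQA"]
subset = downsample(list(range(100)), spec.sample_rate)
print(len(subset), detect_format(["instruction", "input", "output"]))
```

A fixed seed keeps the downsampled subset identical across ranks and runs, which matters when several GPUs must see a consistent dataset.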

…AutoEP configs

- Add unified finetune script (finetune_llama.py) with DATASET_REGISTRY supporting Alpaca, CodeAlpaca, Magicoder, MetaMathQA, MMLU, MBPP datasets
- Add sample_rate mechanism for dataset downsampling (MetaMathQA: 0.1)
- Add MMLU and GSM8K evaluation pipelines (vllm-based generation + scoring)
- Add Moonlight AutoEP ZeRO-2 configs (AdamW and Muon)
- Add end-to-end run_and_evaluate.sh supporting MBPP/MMLU/GSM8K benchmarks
- Add DeepSpeed checkpoint to HF model conversion with AutoEP/MoE support
- Update README with dataset registry details, benchmark usage, and configs

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>

delock requested a review from tjruwase as a code owner May 20, 2026 06:53

delock force-pushed the gma/add-deepspeed-finetune-demo branch from a0ae7bc to 1171f89 Compare May 20, 2026 09:46

delock merged commit 6eb2582 into deepspeedai:master May 20, 2026
2 checks passed
