
MegaMOE adaptation for SM90 #24

Closed
qiushixiaoyu wants to merge 4 commits into
sgl-project:dev from
qiushixiaoyu:main

@qiushixiaoyu

Add MegaMOE support for SM90.
Use the following commands to test:

python tests/test_mega_moe_sm90.py --layers 1 2 3 --num-processes 8 --fail-fast
python tests/test_mega_moe_sm90.py --layers 4 --num-processes 8 --fail-fast
python tests/test_mega_moe_sm90.py --layers 5 --num-correctness-tests 16 --num-processes 8
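The three invocations above could also be driven from a small wrapper, which is handy when re-running them after a rebase. This is only a sketch: the wrapper itself, `build_commands`, and `run_all` are hypothetical names that are not part of this PR, and it assumes it is launched from the repository root on a machine with 8 GPUs. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
import sys

# Test script path taken from the commands above.
TEST_SCRIPT = "tests/test_mega_moe_sm90.py"

# Arguments for each run, copied verbatim from the PR description.
RUNS = [
    ["--layers", "1", "2", "3", "--num-processes", "8", "--fail-fast"],
    ["--layers", "4", "--num-processes", "8", "--fail-fast"],
    ["--layers", "5", "--num-correctness-tests", "16", "--num-processes", "8"],
]


def build_commands(python=sys.executable):
    """Return one argv list per test run (hypothetical helper)."""
    return [[python, TEST_SCRIPT, *args] for args in RUNS]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call `run_all(dry_run=False)` to actually execute the tests.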

Co-authored with AI

@Fridge003
Collaborator

Hi @qiushixiaoyu, could you please rebase your PR onto the dev branch? That is the branch we build wheels from.

@yiakwy-xpu-ml-framework-team

Hi @qiushixiaoyu, do you have a performance report for the MegaMOE SM90 implementation?

I also noticed some modifications to the SM100 code. May I know what kind of machine you are testing on?

@qiushixiaoyu
Author

qiushixiaoyu commented May 19, 2026

> Hi @qiushixiaoyu, do you have a performance report for the MegaMOE SM90 implementation?
>
> I also noticed some modifications to the SM100 code. May I know what kind of machine you are testing on?

I tested on H20.


export PYTHONPATH=/workspace/DeepGEMM:/workspace/DeepEP:${PYTHONPATH:-}
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/tvm_ffi/lib:${LD_LIBRARY_PATH:-}
python3 tests/test_mega_moe_hopper.py \
  --num-processes 8 \
  --num-max-tokens-per-rank \
  --num-tokens \
  --hidden 4096 \
  --intermediate-hidden 2048 \
  --num-experts 256 \
  --num-topk 6 \
  --num-bench-tests 5 \
  --num-warmup 2 \
  --num-repeat 5 \
  --l2-flush-gb 0 \
  --run-baseline

| Batch | Fused avg (us) | Baseline avg (us) | Baseline / Fused | Fused TFLOPS | Baseline TFLOPS | Fused HBM (GB/s) | Baseline HBM (GB/s) | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 183.4 | 327.6 | 1.787 | 1.6 | 1.0 | 755.1 | 422.8 | ok |
| 2 | 263.0 | 380.4 | 1.446 | 2.1 | 1.5 | 1005.5 | 695.6 | ok |
| 4 | 406.1 | 497.4 | 1.225 | 3.0 | 2.4 | 1070.5 | 873.6 | ok |
| 8 | 497.1 | 546.1 | 1.099 | 4.8 | 4.5 | 1293.1 | 1177.2 | ok |
| 16 | 566.0 | 641.2 | 1.133 | 8.4 | 7.4 | 1376.8 | 1214.6 | ok |
| 32 | 576.0 | 651.0 | 1.130 | 16.8 | 14.8 | 1404.6 | 1242.4 | ok |
| 64 | 592.5 | 653.2 | 1.103 | 32.8 | 29.6 | 1371.9 | 1242.5 | ok |
| 128 | 597.9 | 680.1 | 1.138 | 64.9 | 56.9 | 1370.9 | 1202.9 | ok |
| 512 | 1144.0 | 1220.9 | 1.067 | 135.9 | 126.6 | 752.1 | 702.0 | ok |
| 1024 | 1989.5 | 2189.1 | 1.100 | 156.0 | 141.1 | 458.8 | 415.0 | ok |
| 4096 | 6949.8 | 6913.9 | 0.995 | 179.0 | 179.0 | 176.0 | 176.0 | ok |
| 8192 | 13514.9 | 13343.6 | 0.987 | 184.2 | 185.4 | 121.2 | 122.2 | ok |
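The "Baseline / Fused" column reads as the baseline latency divided by the fused latency, so values above 1 mean the fused kernel is faster and values below 1 (batch 4096 and 8192) mean it is slightly slower. A minimal sketch that recomputes the ratio for a few rows copied from the results above. The helper names `speedup` and `mismatched_batches` and the tolerance are illustrative assumptions, not part of the PR's benchmark script.

```python
# (batch, fused avg us, baseline avg us, reported Baseline / Fused),
# copied from the benchmark results above.
ROWS = [
    (1, 183.4, 327.6, 1.787),
    (16, 566.0, 641.2, 1.133),
    (512, 1144.0, 1220.9, 1.067),
    (4096, 6949.8, 6913.9, 0.995),
    (8192, 13514.9, 13343.6, 0.987),
]


def speedup(fused_us: float, baseline_us: float) -> float:
    """Baseline latency over fused latency; > 1 means the fused path is faster."""
    return baseline_us / fused_us


def mismatched_batches(rows, tol=2e-3):
    """Return batch sizes whose recomputed ratio differs from the reported one.

    The tolerance is an assumption that allows for the latencies having been
    rounded to 0.1 us before being pasted into the table.
    """
    return [
        batch
        for batch, fused, baseline, reported in rows
        if abs(speedup(fused, baseline) - reported) > tol
    ]


if __name__ == "__main__":
    for batch, fused, baseline, reported in ROWS:
        ratio = speedup(fused, baseline)
        verdict = "fused faster" if ratio > 1 else "baseline faster"
        print(f"batch={batch}: recomputed={ratio:.3f} reported={reported} ({verdict})")
```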

@qiushixiaoyu
Author

> Hi @qiushixiaoyu, could you please rebase your PR onto the dev branch? That is the branch we build wheels from.

Done

@Fridge003 Fridge003 changed the base branch from dev-0426 to dev May 19, 2026 09:35
@Fridge003
Collaborator

Moved to #36

@Fridge003 Fridge003 closed this May 19, 2026
