Sm90 mega moe on sgl dev #36

Open
qiushixiaoyu wants to merge 1 commit into sgl-project:dev from qiushixiaoyu:sm90-mega-moe-on-sgl-dev

qiushixiaoyu commented May 19, 2026

DeepSeekV4Flash (8x H20 GPUs)

| batch/rank | fused (us) | baseline (us) | speedup | MoE TFLOPS/rank | baseline TFLOPS/rank |
|---|---|---|---|---|---|
| 1 | 191.9 | 503.4 | 2.62x | 1.6 | 0.5 |
| 2 | 279.0 | 621.6 | 2.23x | 2.1 | 1.0 |
| 4 | 422.4 | 834.1 | 1.97x | 2.8 | 1.4 |
| 8 | 517.0 | 956.9 | 1.85x | 4.6 | 2.4 |
| 16 | 596.6 | 1082.9 | 1.82x | 8.1 | 4.2 |
| 32 | 600.6 | 1095.2 | 1.82x | 16.1 | 9.0 |
| 64 | 611.2 | 1081.4 | 1.77x | 31.6 | 17.9 |
| 128 | 621.5 | 1078.0 | 1.73x | 62.2 | 35.8 |
| 256 | 637.0 | 1186.6 | 1.86x | 121.9 | 65.1 |
| 512 | 867.4 | 1240.8 | 1.43x | 179.2 | 124.6 |
| 1024 | 1634.8 | 2102.6 | 1.29x | 190.2 | 146.9 |
| 2048 | 2854.1 | 3576.5 | 1.25x | 218.1 | 173.0 |
| 4096 | 5202.8 | 6372.5 | 1.22x | 239.5 | 194.1 |
| 8192 | 9802.5 | 11952.6 | 1.22x | 254.2 | 207.1 |

DeepSeekV4Pro (4x H20 GPUs)

| batch/rank | fused (us) | baseline (us) | speedup | MoE TFLOPS/rank | baseline TFLOPS/rank |
|---|---|---|---|---|---|
| 1 | 502.2 | 909.8 | 1.81x | 1.8 | 0.8 |
| 2 | 752.1 | 1298.0 | 1.73x | 2.1 | 1.4 |
| 4 | 1131.2 | 1863.0 | 1.65x | 2.9 | 1.8 |
| 8 | 1807.8 | 2719.6 | 1.50x | 3.8 | 2.2 |
| 16 | 2076.6 | 3160.8 | 1.52x | 6.0 | 4.0 |
| 32 | 2168.8 | 3296.2 | 1.52x | 11.6 | 7.6 |
| 64 | 2132.2 | 3307.0 | 1.55x | 23.9 | 15.4 |
| 128 | 2163.0 | 3346.0 | 1.55x | 47.0 | 30.5 |
| 256 | 2182.6 | 3378.6 | 1.55x | 93.0 | 60.1 |
| 512 | 3012.0 | 3467.5 | 1.15x | 135.0 | 117.0 |
| 1024 | 4828.9 | 5408.6 | 1.12x | 168.6 | 150.0 |
| 2048 | 7861.6 | 8752.8 | 1.11x | 207.4 | 185.6 |
| 4096 | 13651.0 | 15083.1 | 1.10x | 238.9 | 215.2 |
| 8192 | 25604.1 | 28346.9 | 1.11x | 254.8 | 229.0 |
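
For readers who want to sanity-check the numbers: the speedup column appears to be the baseline latency divided by the fused latency. The PR does not state the formula, so that is an assumption, and the helper names below (`speedup`, `rows_consistent`) are hypothetical. A minimal Python sketch that re-derives the ratio for a few rows copied from the two tables:

```python
# Sketch only: assumes speedup = baseline_us / fused_us (not stated in the PR).
# Each row is (batch/rank, fused us, baseline us, reported speedup), copied
# from the tables above.
ROWS = {
    "DeepSeekV4Flash": [
        (1, 191.9, 503.4, 2.62),
        (256, 637.0, 1186.6, 1.86),
        (8192, 9802.5, 11952.6, 1.22),
    ],
    "DeepSeekV4Pro": [
        (1, 502.2, 909.8, 1.81),
        (256, 2182.6, 3378.6, 1.55),
        (8192, 25604.1, 28346.9, 1.11),
    ],
}


def speedup(fused_us: float, baseline_us: float) -> float:
    """Ratio of baseline latency to fused latency."""
    return baseline_us / fused_us


def rows_consistent(rows, tol: float = 0.01) -> bool:
    """True if every reported speedup matches the recomputed ratio within tol."""
    return all(
        abs(speedup(fused, base) - reported) < tol
        for _, fused, base, reported in rows
    )


if __name__ == "__main__":
    for model, rows in ROWS.items():
        print(model, "consistent:", rows_consistent(rows))
```

Under that assumption, the sampled rows agree with the reported values to within rounding.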

@Fridge003 (Collaborator)

@qiushixiaoyu Can we upstream this change to the original DeepGEMM, so that we can use it more conveniently in the future?

qiushixiaoyu (Author) commented May 29, 2026

> @qiushixiaoyu Can we upstream this change to the original DeepGEMM, so that we can use it more conveniently in the future?

@Fridge003 OK. I have already opened a PR (323) against the upstream repository, but it contains some changes from sgl-project/DeepGEMM. I'll revise the patch and strip out those project-specific modifications. That said, I've heard it is unlikely to be accepted into deepseek-ai/DeepGEMM.

Comment thread tests/test_mega_moe.py Outdated
explicitly_destroy=True,
allow_multiple_reduction=False,
- gpu_timeout_secs=10, cpu_timeout_secs=30
+ num_gpu_timeout_secs=10, num_cpu_timeout_secs=30
Collaborator

Why was this changed?

Comment thread tests/test_mega_moe_sm90.py Outdated
@@ -0,0 +1,527 @@
"""Layered tests for the SM90 (Hopper) MegaMoE kernel.
Collaborator

We don't need this test, since test_mega_moe_hopper.py already exists.

Comment thread deep_gemm/utils/layout.py Outdated
def _get_C():
    from .. import _C
    return _C
import deep_gemm
Collaborator

No need to modify these two lines.

return config;
}

// ============================================================================
Collaborator

Can we create a new file for the SM90 heuristics?

Comment thread csrc/jit/handle.hpp Outdated
// Define `DG_JIT_FORCE_LEGACY_LOAD` to force the older `cuModuleLoad` path
// (useful when building against a newer CUDA SDK but running with an older
// driver that lacks the `cuLibrary*` symbols).
#if CUDA_VERSION >= 12040 && !defined(DG_JIT_FORCE_LEGACY_LOAD)
Collaborator

Do we need this change?

Comment thread csrc/jit/compiler.hpp Outdated
flags += " --ptxas-options=--verbose,--warn-on-local-memory-usage";
if (get_env("DG_JIT_WITH_LINEINFO", 0))
    flags += " -Xcompiler -rdynamic -lineinfo";
// NOTES: `--device-debug` (-G) emits full device DWARF so that cuda-gdb
Collaborator

Revert the changes in this file.

Comment thread csrc/apis/mega.hpp Outdated

#include <functional>
- // #include <pybind11/functional.h>
+ #include <pybind11/functional.h>
Collaborator

Why uncomment this? It will cause a conflict.

Comment thread csrc/apis/mega.hpp
#include <functional>
- // #include <pybind11/functional.h>
+ #include <pybind11/functional.h>

Collaborator

Can we create a new file for the SM90 FP8 MegaMoE definition and leave this file unchanged?

// For 2-CTA clusters, neighbour SMs share the same m_block_idx with adjacent
// n_block_idx; the asserts below guarantee that pairing is always possible.
// SM90 / single-CTA paths set kClusterSize = 1 and do not need this.
DG_STATIC_ASSERT(kClusterSize == 1 or kClusterSize == 2, "Invalid cluster size");
Collaborator

It would be better to guard the assert so it only applies to multi-CTA clusters:

    if constexpr (kClusterSize > 1) {
        DG_STATIC_ASSERT(...);
    }

Comment thread deep_gemm/mega/__init__.py Outdated
@@ -1,3 +1,4 @@
import os
Collaborator

Don't modify this file; modify https://github.com/sgl-project/DeepGEMM/blob/dev/sgl_deep_gemm/__init__.py instead.

Please compile and test your kernels by following the instructions at https://github.com/sgl-project/DeepGEMM/blob/dev/sgl_deep_gemm/README.md, since that is how we build and release the wheel.

Author

I'll go through the review comments one by one and address them. I also noticed that some performance optimizations were missing from this branch, so I've just synced them over. Without those changes, the performance gains would not be reproducible.

yz-tang commented Jun 2, 2026

When --run-low-latency-baseline is enabled, will there be a performance degradation?

qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch 4 times, most recently from 067fc03 to 78772d1 (June 4, 2026 11:58)
qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch from 78772d1 to fce68b3 (June 5, 2026 04:38)
qiushixiaoyu (Author)

> When --run-low-latency-baseline is enabled, will there be a performance degradation?

I don’t think so. This is only for comparing performance against the low-latency baseline. While testing, I found that the performance with small batch sizes is not very stable. I’m still investigating it.

3 participants