Sm90 mega moe on sgl dev by qiushixiaoyu · Pull Request #36 · sgl-project/DeepGEMM

qiushixiaoyu · 2026-05-19T09:05:01Z

DeepSeekV4Flash (8× H20 GPUs)

batch/rank	fused us	baseline us	speedup	MoE TFLOPS/rank	baseline TFLOPS/rank
1	191.9	503.4	2.62x	1.6	0.5
2	279.0	621.6	2.23x	2.1	1.0
4	422.4	834.1	1.97x	2.8	1.4
8	517.0	956.9	1.85x	4.6	2.4
16	596.6	1082.9	1.82x	8.1	4.2
32	600.6	1095.2	1.82x	16.1	9.0
64	611.2	1081.4	1.77x	31.6	17.9
128	621.5	1078.0	1.73x	62.2	35.8
256	637.0	1186.6	1.86x	121.9	65.1
512	867.4	1240.8	1.43x	179.2	124.6
1024	1634.8	2102.6	1.29x	190.2	146.9
2048	2854.1	3576.5	1.25x	218.1	173.0
4096	5202.8	6372.5	1.22x	239.5	194.1
8192	9802.5	11952.6	1.22x	254.2	207.1

DeepSeekV4Pro (4× H20 GPUs)

batch/rank	fused us	baseline us	speedup	MoE TFLOPS/rank	baseline TFLOPS/rank
1	502.2	909.8	1.81x	1.8	0.8
2	752.1	1298.0	1.73x	2.1	1.4
4	1131.2	1863.0	1.65x	2.9	1.8
8	1807.8	2719.6	1.50x	3.8	2.2
16	2076.6	3160.8	1.52x	6.0	4.0
32	2168.8	3296.2	1.52x	11.6	7.6
64	2132.2	3307.0	1.55x	23.9	15.4
128	2163.0	3346.0	1.55x	47.0	30.5
256	2182.6	3378.6	1.55x	93.0	60.1
512	3012.0	3467.5	1.15x	135.0	117.0
1024	4828.9	5408.6	1.12x	168.6	150.0
2048	7861.6	8752.8	1.11x	207.4	185.6
4096	13651.0	15083.1	1.10x	238.9	215.2
8192	25604.1	28346.9	1.11x	254.8	229.0
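The speedup column in both tables appears to be the baseline latency divided by the fused latency. The sketch below recomputes it for a few rows copied from the tables; the function names and the choice of rows are illustrative assumptions, not part of the PR's benchmark script.

```python
# Selected rows copied from the two tables above:
# (batch/rank, fused us, baseline us, reported speedup).
FLASH_ROWS = [
    (1, 191.9, 503.4, 2.62),
    (512, 867.4, 1240.8, 1.43),
    (8192, 9802.5, 11952.6, 1.22),
]
PRO_ROWS = [
    (1, 502.2, 909.8, 1.81),
    (512, 3012.0, 3467.5, 1.15),
    (8192, 25604.1, 28346.9, 1.11),
]


def speedup(fused_us: float, baseline_us: float) -> float:
    """Assumed definition: baseline latency divided by fused latency."""
    return baseline_us / fused_us


def recompute(rows):
    """Return (batch, recomputed speedup rounded to 2 places, reported speedup)."""
    return [
        (batch, round(speedup(fused, base), 2), reported)
        for batch, fused, base, reported in rows
    ]


if __name__ == "__main__":
    for name, rows in (("Flash", FLASH_ROWS), ("Pro", PRO_ROWS)):
        for batch, computed, reported in recompute(rows):
            print(f"{name} batch/rank={batch}: computed {computed:.2f}x, reported {reported:.2f}x")
```

Under that assumed definition, the recomputed values for these rows agree with the reported column to two decimal places.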

Fridge003 · 2026-05-29T06:52:14Z

@qiushixiaoyu Can we upstream this change to the original DeepGEMM so that we can use it more conveniently in the future?

qiushixiaoyu · 2026-05-29T07:20:15Z

> @qiushixiaoyu Can we upstream this change to the original DeepGEMM so that we can use it more conveniently in the future?

@Fridge003
OK. I have already opened a PR (323) against the upstream repository, but it included some changes specific to sgl-project/DeepGEMM. I'll revise the patch and strip out those project-specific modifications. That said, I've heard it is unlikely to be accepted into deepseek-ai/DeepGEMM.

Fridge003 · 2026-05-29T06:43:58Z

        explicitly_destroy=True,
        allow_multiple_reduction=False,
-        gpu_timeout_secs=10, cpu_timeout_secs=30
+        num_gpu_timeout_secs=10, num_cpu_timeout_secs=30


Why make this change?

Fridge003 · 2026-05-29T06:44:23Z

@@ -0,0 +1,527 @@
+"""Layered tests for the SM90 (Hopper) MegaMoE kernel.


We don't need this test, since test_mega_moe_hopper.py already exists.

Fridge003 · 2026-05-29T06:45:31Z

 def _get_C():
-    from .. import _C
-    return _C
+    import deep_gemm


No need to modify these two lines.

Fridge003 · 2026-05-29T08:48:09Z

    return config;
 }

+// ============================================================================


Can we create a new file for the SM90 heuristics?

Fridge003 · 2026-05-29T08:48:43Z

+// Define `DG_JIT_FORCE_LEGACY_LOAD` to force the older `cuModuleLoad` path
+// (useful when building against a newer CUDA SDK but running with an older
+// driver that lacks the `cuLibrary*` symbols).
+#if CUDA_VERSION >= 12040 && !defined(DG_JIT_FORCE_LEGACY_LOAD)


Do we need this change?
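For context on the quoted guard: `CUDA_VERSION` encodes the toolkit version as major * 1000 + minor * 10, so the threshold `12040` corresponds to CUDA 12.4. The sketch below decodes that integer and mirrors the guard's condition. The function names are hypothetical, and the reading that the guarded branch is the `cuLibrary*` path is inferred from the diff comment rather than from the full source.

```python
from typing import Tuple


def decode_cuda_version(value: int) -> Tuple[int, int]:
    """Split a CUDA_VERSION-style integer (major * 1000 + minor * 10) into (major, minor)."""
    major = value // 1000
    minor = (value % 1000) // 10
    return major, minor


def takes_guarded_branch(cuda_version: int, force_legacy: bool = False) -> bool:
    """Mirror of `CUDA_VERSION >= 12040 && !defined(DG_JIT_FORCE_LEGACY_LOAD)`.

    `force_legacy` stands in for the DG_JIT_FORCE_LEGACY_LOAD define (illustrative only).
    """
    return cuda_version >= 12040 and not force_legacy


if __name__ == "__main__":
    for version in (12030, 12040, 12080):
        major, minor = decode_cuda_version(version)
        print(f"CUDA {major}.{minor}: guarded branch = {takes_guarded_branch(version)}")
```

With the define set, the condition is false regardless of the SDK version, which matches the stated intent of forcing the older `cuModuleLoad` path.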

Fridge003 · 2026-05-29T08:49:00Z

            flags += " --ptxas-options=--verbose,--warn-on-local-memory-usage";
        if (get_env("DG_JIT_WITH_LINEINFO", 0))
            flags += " -Xcompiler -rdynamic -lineinfo";
+        // NOTES: `--device-debug` (-G) emits full device DWARF so that cuda-gdb


Please revert the changes in this file.

Fridge003 · 2026-05-29T08:49:37Z


 #include <functional>
-// #include <pybind11/functional.h>
+#include <pybind11/functional.h>


Why uncomment this? It will cause a conflict.

Fridge003 · 2026-05-29T08:50:52Z

 #include <functional>
-// #include <pybind11/functional.h>
+#include <pybind11/functional.h>



Can we create a new file for the SM90 FP8 MegaMoE definition and leave this file unchanged?

Fridge003 · 2026-05-29T08:52:43Z

+    // For 2-CTA clusters, neighbour SMs share the same m_block_idx with adjacent
+    // n_block_idx; the asserts below guarantee that pairing is always possible.
+    // SM90 / single-CTA paths set kClusterSize = 1 and do not need this.
+    DG_STATIC_ASSERT(kClusterSize == 1 or kClusterSize == 2, "Invalid cluster size");


It would be better to apply the assertion only when a cluster is actually used:

if kClusterSize > 1: DG_STATIC_ASSERT(...)

Fridge003 · 2026-05-29T09:00:32Z

@@ -1,3 +1,4 @@
+import os


Don't modify this file; modify https://github.com/sgl-project/DeepGEMM/blob/dev/sgl_deep_gemm/__init__.py instead.

Please compile and test your kernels following these instructions: https://github.com/sgl-project/DeepGEMM/blob/dev/sgl_deep_gemm/README.md, since this is how we build and release the wheel.

I’ll go through the review comments one by one and address them accordingly. I also noticed that some performance optimization changes were missing from this branch, so I’ve just synced them over. Without those changes, the performance gains would not be reproducible.

yz-tang · 2026-06-02T06:55:55Z

When --run-low-latency-baseline is enabled, will there be a performance degradation?

qiushixiaoyu · 2026-06-05T09:18:11Z

> When --run-low-latency-baseline is enabled, will there be a performance degradation?

I don’t think so. This is only for comparing performance against the low-latency baseline. While testing, I found that the performance with small batch sizes is not very stable. I’m still investigating it.

This was referenced May 19, 2026

MegaMOE adaptation for SM90 #24

Closed

DeepSeek V4 Roadmap sgl-project/sglang#23602

Open

Fridge003 requested changes May 29, 2026

qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch 4 times, most recently from 067fc03 to 78772d1, on June 4, 2026 11:58

Add SM90 MegaMoE support with TVM FFI bindings

fce68b3

qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch from 78772d1 to fce68b3 on June 5, 2026 04:38

3 participants