Sm90 mega moe on sgl dev #36
@qiushixiaoyu Can we upstream this change to the original DeepGEMM, so that we can use it more conveniently in the future?
@Fridge003 |
| explicitly_destroy=True, | ||
| allow_multiple_reduction=False, | ||
| gpu_timeout_secs=10, cpu_timeout_secs=30 | ||
| num_gpu_timeout_secs=10, num_cpu_timeout_secs=30 |
| @@ -0,0 +1,527 @@ | |||
| """Layered tests for the SM90 (Hopper) MegaMoE kernel. | |||
We don't need this test, since test_mega_moe_hopper.py already exists.
| def _get_C(): | ||
| from .. import _C | ||
| return _C | ||
| import deep_gemm |
No need to modify these two lines.
| return config; | ||
| } | ||
|
|
||
| // ============================================================================ |
Can we create a new file for the SM90 heuristics?
| // Define `DG_JIT_FORCE_LEGACY_LOAD` to force the older `cuModuleLoad` path | ||
| // (useful when building against a newer CUDA SDK but running with an older | ||
| // driver that lacks the `cuLibrary*` symbols). | ||
| #if CUDA_VERSION >= 12040 && !defined(DG_JIT_FORCE_LEGACY_LOAD) |
| flags += " --ptxas-options=--verbose,--warn-on-local-memory-usage"; | ||
| if (get_env("DG_JIT_WITH_LINEINFO", 0)) | ||
| flags += " -Xcompiler -rdynamic -lineinfo"; | ||
| // NOTES: `--device-debug` (-G) emits full device DWARF so that cuda-gdb |
Please revert the changes in this file.
|
|
||
| #include <functional> | ||
| // #include <pybind11/functional.h> | ||
| #include <pybind11/functional.h> |
Why uncomment this? It will cause a conflict.
| #include <functional> | ||
| // #include <pybind11/functional.h> | ||
| #include <pybind11/functional.h> | ||
|
|
Can we create a new file for the SM90 FP8 MegaMoE definition and leave this file unchanged?
| // For 2-CTA clusters, neighbour SMs share the same m_block_idx with adjacent | ||
| // n_block_idx; the asserts below guarantee that pairing is always possible. | ||
| // SM90 / single-CTA paths set kClusterSize = 1 and do not need this. | ||
| DG_STATIC_ASSERT(kClusterSize == 1 or kClusterSize == 2, "Invalid cluster size"); |
It would be better to apply the pairing asserts only when the cluster has more than one CTA:
    if kClusterSize > 1:
        DG_STATIC_ASSERT(...)
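A minimal sketch of what that guard could look like, assuming C++17 `if constexpr` is usable in the kernel and using plain `static_assert` in place of `DG_STATIC_ASSERT`. The pairing asserts themselves are not shown in the diff, so `kNumNBlocks`, `num_cluster_groups`, and the evenness check below are hypothetical stand-ins, not the actual DeepGEMM code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the kernel's compile-time configuration checks.
// `kClusterSize` and the "1 or 2" assert come from the diff; `kNumNBlocks`
// and the evenness rule are assumptions made up for illustration.
template <uint32_t kClusterSize, uint32_t kNumNBlocks>
constexpr uint32_t num_cluster_groups() {
    static_assert(kClusterSize == 1 || kClusterSize == 2, "Invalid cluster size");

    if constexpr (kClusterSize > 1) {
        // This branch is discarded for single-CTA instantiations, so the
        // pairing requirement is only enforced for 2-CTA clusters.
        static_assert(kNumNBlocks % 2 == 0,
                      "2-CTA clusters need an even number of N blocks");
    }
    return kNumNBlocks / kClusterSize;
}

// Single-CTA configuration: an odd block count still compiles, because the
// pairing assert sits in the discarded branch.
static_assert(num_cluster_groups<1, 3>() == 3, "single-CTA case");
// 2-CTA configuration: the pairing assert is evaluated and passes.
static_assert(num_cluster_groups<2, 4>() == 2, "2-CTA case");
```

With this shape, instantiating `num_cluster_groups<2, 3>()` would fail to compile, while the single-CTA path never evaluates the pairing requirement.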
| @@ -1,3 +1,4 @@ | |||
| import os | |||
Don't modify this file. Modify https://github.com/sgl-project/DeepGEMM/blob/dev/sgl_deep_gemm/__init__.py instead.
Please compile and test your kernels following the instructions at https://github.com/sgl-project/DeepGEMM/blob/dev/sgl_deep_gemm/README.md, since that is how we build and release the wheel.
I’ll go through the review comments one by one and address them accordingly. I also noticed that some performance optimization changes were missing from this branch, so I’ve just synced them over. Without those changes, the performance gains would not be reproducible.
|
When --run-low-latency-baseline is enabled, will there be a performance degradation?
067fc03 to 78772d1
78772d1 to fce68b3
I don’t think so. This flag is only for comparing performance against the low-latency baseline. While testing, I found that performance with small batch sizes is not very stable. I’m still investigating it.
DeepSeekV4Flash (8× H20 GPUs)
DeepSeekV4Pro (4× H20 GPUs)