Fix various issues in DeepGEMM tests by b8zhong · Pull Request #39 · sgl-project/DeepGEMM

b8zhong · 2026-06-02T20:43:28Z

m_grouped_bf16_gemm_nn_contiguous was not correctly exported to tvm-ffi.
test_mega_moe_l1_fp4_accuracy.py had a test bug: it reused a buffer.
After fixing the dlpack issue in test_mega_moe_l1_sentinel.py, there is a TMA error (unrelated, and it seems to be preexisting). Here is the error (after checking out "Add DeepGEMM prerelease wheel tests", sglang#27075):

Running on SM103:

bash /sgl-workspace/sglang/scripts/test_sgl_deep_gemm.sh /sgl-workspace/DeepGEMM
...
----- RUN test_mega_moe_l1_sentinel.py  -----
=== A0.2.1 sentinel — y rel-RMSE (FP4 vs FP8 acts) ===
  y_fp8 RMS:        4252.2241
  y_rmse:           4955.0830
  rel-RMSE:         1.1653
  target:           ≤ 0.50 (A3 chain noise floor)
  verdict:          FAIL

  y_fp8 [0, :8]:  [672.0, 2272.0, -2112.0, -1536.0, 96.0, 576.0, -2000.0, -504.0]
  y_fp4 [0, :8]:  [1408.0, 1856.0, 2320.0, 408.0, 676.0, 454.0, 568.0, 158.0]
W0604 21:25:04.905000 1151790 torch/multiprocessing/spawn.py:165] Terminating process 1151908 via signal SIGTERM
Traceback (most recent call last):
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 195, in <module>
    torch.multiprocessing.spawn(test, args=(num_processes, args), nprocs=num_processes)
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 211, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
    fn(i, *args)
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 174, in test
    assert rel_rmse <= 0.5, \
           ^^^^^^^^^^^^^^^
AssertionError: A0.2.1 layout regression: y rel-RMSE 1.1653 > 0.5

----- FAIL test_mega_moe_l1_sentinel.py (exit 1) -----
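
The numbers in this log are consistent with rel-RMSE being the RMSE of the FP4 output against the FP8 output, divided by the RMS of the FP8 output. Below is a minimal pure-Python sketch under that assumption; the helper names `rms` and `rel_rmse` are illustrative and not taken from the test file.

```python
import math

def rms(values):
    """Root mean square of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def rel_rmse(reference, candidate):
    """RMSE of (candidate - reference), normalized by the RMS of the reference."""
    errors = [c - r for r, c in zip(reference, candidate)]
    return rms(errors) / rms(reference)

# Aggregate values copied from the log above.
y_fp8_rms = 4252.2241
y_rmse = 4955.0830
logged_ratio = y_rmse / y_fp8_rms  # about 1.1653, above the 0.50 target

# Tiny illustrative inputs (not data from the test).
identical = rel_rmse([3.0, 4.0], [3.0, 4.0])  # 0.0
doubled = rel_rmse([3.0, 4.0], [6.0, 8.0])    # 1.0: the error is as large as the signal
```

A ratio above 1 means the error is larger than the reference signal itself, which suggests a layout or data mismatch more than ordinary quantization noise.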

When changing the world size to 8:

cd /sgl-workspace/DeepGEMM/tests && python3 test_mega_moe_l1_sentinel.py --num-processes 8
...
Making TMA desc: global memory: 1024 4, shared memory: 128 1, outer stride: 1024, swizzle: 0 (base: 0), elem size: 4, pointer: 134754319020544
        false,
        false
    >);
};

Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
Generated kernel code:
// Includes' hash value: 57cc5f7f1281d6eaeb32ee19a6626e35

Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
#include <deep_gemm/impls/sm100_fp8_fp4_mega_moe.cuh>

using namespace deep_gemm;

static void __instantiate_kernel() {
    auto ptr = reinterpret_cast<void*>(&sm100_fp8_fp4_mega_moe_impl<
        8448,
        1024, 512,
        8, 2,
        1,
        192, 128, 128,
        32,
        256, 128,
        67968,
        1087488,
        6,
        128, 128, 256,
        148, 8,
        0x1.4p+3f,
        true,
        false,
        false,
        false
    >);
};

Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 139069136320752
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 131574317725936
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 123893439805680
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 136289856272624
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 123507865827568
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 126239465027824
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 134754707454192
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 123291137751280
W0604 21:30:39.941000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158925 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158926 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158928 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158929 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158930 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158931 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158932 via signal SIGTERM
Traceback (most recent call last):
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 195, in <module>
    torch.multiprocessing.spawn(test, args=(num_processes, args), nprocs=num_processes)
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 211, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
    fn(i, *args)
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 147, in test
    y_fp4 = make_buffer_and_run(use_fp4_acts=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 137, in make_buffer_and_run
    _ = run_once()
        ^^^^^^^^^^
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 129, in run_once
    deep_gemm.fp8_fp4_mega_moe(
  File "/usr/local/lib/python3.12/dist-packages/deep_gemm/mega/__init__.py", line 120, in fp8_fp4_mega_moe
    _C.fp8_fp4_mega_moe(
  File "python/tvm_ffi/cython/function.pxi", line 929, in tvm_ffi.core.Function.__call__
RuntimeError: CUDA driver error (/sgl-workspace/DeepGEMM/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143): 1 (CUDA_ERROR_INVALID_VALUE, invalid argument)

Remaining tests:

==============================================================
 Summary: 8 passed, 1 failed, 3 skipped
   PASS  test_bf16.py
   PASS  test_einsum.py
   PASS  test_fp8_fp4.py
   PASS  test_hyperconnection.py
   PASS  test_layout.py
   PASS  test_attention.py
   PASS  test_mega_moe_l1_fp4_accuracy.py
   PASS  test_mega_moe_pre_dispatch.py
   SKIP  test_lazy_init.py (tvm_ffi eager CUDA init (upstream))
   SKIP  test_mega_moe.py (deep_ep with ElasticBuffer not installed)
   SKIP  test_sanitizer.py (fp8_fp4_mqa_logits register spill (known))
   FAIL  test_mega_moe_l1_sentinel.py

…est fix
- Bind m_grouped_bf16_gemm_nn_contiguous and k_grouped_bf16_gemm_tn_contiguous in the tvm-ffi API (+ Python wrappers); the kernels already existed, only the pybind module exposed them.
- Expose SymmBuffer input views as torch tensors via torch.from_dlpack so callers can index/copy into them.
- test_mega_moe_l1_fp4_accuracy: allocate the symm buffer per mode with DG_USE_FP4_ACTS set before allocation and feed matching FP4 data (fixes FP4 routing / recv-stats).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
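
The per-mode allocation fix can be illustrated with a small sketch. It assumes, as the commit message implies, that the buffer picks up `DG_USE_FP4_ACTS` at allocation time. `FakeSymmBuffer`, `env_flag`, and `run_mode` are hypothetical stand-ins, not DeepGEMM APIs.

```python
import os
from contextlib import contextmanager

@contextmanager
def env_flag(name, value):
    """Temporarily set an environment variable, restoring the old value on exit."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old

class FakeSymmBuffer:
    """Hypothetical stand-in: captures the activation mode once, at allocation."""
    def __init__(self):
        self.use_fp4_acts = os.environ.get("DG_USE_FP4_ACTS", "0") == "1"

def run_mode(use_fp4_acts):
    # Set the flag before allocating so the buffer matches the mode under test.
    with env_flag("DG_USE_FP4_ACTS", "1" if use_fp4_acts else "0"):
        buf = FakeSymmBuffer()  # fresh buffer per mode, never reused across modes
        return buf.use_fp4_acts

results = {mode: run_mode(mode) for mode in (False, True)}
```

Reusing one buffer across both modes would leave it in whichever mode was active when it was first allocated, which is the kind of mismatch the test fix avoids.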

Fridge003 · 2026-06-06T00:23:53Z

        k_uint_idx = kb // 4
        byte_idx = kb % 4
        # `sf_bytes_int32` is M-major: index [token, k_uint] gives the int32
        # word at that token's k_uint slot.


I'm thinking about migrating the tests for sgl-deep-gemm to sgl_deep_gemm/tests, so that the tests under the root folder don't cause conflicts on the next rebase.
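
The diff context quoted above splits an index `kb` into a word index (`kb // 4`) and a byte index (`kb % 4`), which is the usual way to address one byte inside packed 32-bit words. Below is a minimal sketch of that lookup on plain Python lists. It assumes byte 0 is the least-significant byte of each word, which the quoted snippet does not confirm; `sf_byte` and `pack4` are hypothetical helper names.

```python
def pack4(b0, b1, b2, b3):
    """Pack four bytes into one 32-bit word, with b0 as the least-significant byte."""
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def sf_byte(sf_words, token, kb):
    """Return the byte for index `kb` of `token` from M-major packed words."""
    k_uint_idx = kb // 4   # which 32-bit word holds this byte
    byte_idx = kb % 4      # which byte inside that word
    word = sf_words[token][k_uint_idx]
    # Assumption: little-endian byte order within the word.
    # Masking with 0xFF also handles negative (signed int32) values correctly.
    return (word >> (8 * byte_idx)) & 0xFF

# One token, two words holding the bytes 1..8.
words = [[pack4(1, 2, 3, 4), pack4(5, 6, 7, 8)]]
fourth = sf_byte(words, 0, 3)  # last byte of the first word
sixth = sf_byte(words, 0, 5)   # second byte of the second word
```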

Fridge003 reviewed Jun 4, 2026

Comment thread deep_gemm/mega/__init__.py

Fridge003 reviewed Jun 6, 2026

b8zhong wants to merge 1 commit into sgl-project:dev from bzhng-development:feat/deepgemm-tests
