Fix various issues in DeepGEMM tests #39

Open
b8zhong wants to merge 1 commit into sgl-project:dev from bzhng-development:feat/deepgemm-tests

b8zhong commented Jun 2, 2026

  • m_grouped_bf16_gemm_nn_contiguous was not correctly exported to tvm-ffi.
  • test_mega_moe_l1_fp4_accuracy.py had a test bug: it reused the symm buffer across modes.
  • After fixing the dlpack issue in test_mega_moe_l1_sentinel.py, a TMA error appears (unrelated, and it seems to be preexisting). Here is the error (after checking out "Add DeepGEMM prerelease wheel tests", sglang#27075):

Running on SM103:

bash /sgl-workspace/sglang/scripts/test_sgl_deep_gemm.sh /sgl-workspace/DeepGEMM
...
----- RUN test_mega_moe_l1_sentinel.py  -----
=== A0.2.1 sentinel — y rel-RMSE (FP4 vs FP8 acts) ===
  y_fp8 RMS:        4252.2241
  y_rmse:           4955.0830
  rel-RMSE:         1.1653
  target:           ≤ 0.50 (A3 chain noise floor)
  verdict:          FAIL

  y_fp8 [0, :8]:  [672.0, 2272.0, -2112.0, -1536.0, 96.0, 576.0, -2000.0, -504.0]
  y_fp4 [0, :8]:  [1408.0, 1856.0, 2320.0, 408.0, 676.0, 454.0, 568.0, 158.0]
W0604 21:25:04.905000 1151790 torch/multiprocessing/spawn.py:165] Terminating process 1151908 via signal SIGTERM
Traceback (most recent call last):
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 195, in <module>
    torch.multiprocessing.spawn(test, args=(num_processes, args), nprocs=num_processes)
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 211, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
    fn(i, *args)
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 174, in test
    assert rel_rmse <= 0.5, \
           ^^^^^^^^^^^^^^^
AssertionError: A0.2.1 layout regression: y rel-RMSE 1.1653 > 0.5

----- FAIL test_mega_moe_l1_sentinel.py (exit 1) -----
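
The three numbers in the log above are consistent with rel-RMSE being y_rmse divided by the RMS of the FP8 reference (4955.0830 / 4252.2241 ≈ 1.1653). Below is a minimal sketch of that metric. The definition is an assumption inferred from the log labels, not taken from the test source, and `rel_rmse` is a hypothetical helper name.

```python
import math


def rel_rmse(ref, test):
    """RMSE of (test - ref), normalized by the RMS of ref.

    ASSUMPTION: this mirrors the 'rel-RMSE' line in the log; the real test
    may compute it differently (e.g. on torch tensors).
    """
    if not ref or len(ref) != len(test):
        raise ValueError("ref and test must be non-empty and the same length")
    n = len(ref)
    rmse = math.sqrt(sum((t - r) ** 2 for r, t in zip(ref, test)) / n)
    ref_rms = math.sqrt(sum(r * r for r in ref) / n)
    return rmse / ref_rms


# Reproduce the logged ratio from the two logged aggregates.
logged = 4955.0830 / 4252.2241
print(f"{logged:.4f}")  # prints 1.1653
```

Under this definition, a value above 1 means the error is larger than the reference signal itself, which fits the assertion message calling this a layout regression rather than quantization noise.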

When changing the world size to 8:

cd /sgl-workspace/DeepGEMM/tests && python3 test_mega_moe_l1_sentinel.py --num-processes 8
...
Making TMA desc: global memory: 1024 4, shared memory: 128 1, outer stride: 1024, swizzle: 0 (base: 0), elem size: 4, pointer: 134754319020544
        false,
        false
    >);
};

Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
Generated kernel code:
// Includes' hash value: 57cc5f7f1281d6eaeb32ee19a6626e35

Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
#include <deep_gemm/impls/sm100_fp8_fp4_mega_moe.cuh>

using namespace deep_gemm;

static void __instantiate_kernel() {
    auto ptr = reinterpret_cast<void*>(&sm100_fp8_fp4_mega_moe_impl<
        8448,
        1024, 512,
        8, 2,
        1,
        192, 128, 128,
        32,
        256, 128,
        67968,
        1087488,
        6,
        128, 128, 256,
        148, 8,
        0x1.4p+3f,
        true,
        false,
        false,
        false
    >);
};

Launch kernel with {148, 1} x 512, shared memory: 204068 bytes, cluster: 2, pdl: 0, stream: 0
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 139069136320752
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 131574317725936
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 123893439805680
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 136289856272624
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 123507865827568
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 126239465027824
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 134754707454192
Making TMA desc: global memory: 1024 67968, shared memory: 128 96, outer stride: 512, swizzle: 128 (base: 0), elem size: 1, pointer: 123291137751280
W0604 21:30:39.941000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158925 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158926 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158928 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158929 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158930 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158931 via signal SIGTERM
W0604 21:30:39.942000 1158816 torch/multiprocessing/spawn.py:165] Terminating process 1158932 via signal SIGTERM
Traceback (most recent call last):
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 195, in <module>
    torch.multiprocessing.spawn(test, args=(num_processes, args), nprocs=num_processes)
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 211, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
    fn(i, *args)
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 147, in test
    y_fp4 = make_buffer_and_run(use_fp4_acts=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 137, in make_buffer_and_run
    _ = run_once()
        ^^^^^^^^^^
  File "/sgl-workspace/DeepGEMM/tests/test_mega_moe_l1_sentinel.py", line 129, in run_once
    deep_gemm.fp8_fp4_mega_moe(
  File "/usr/local/lib/python3.12/dist-packages/deep_gemm/mega/__init__.py", line 120, in fp8_fp4_mega_moe
    _C.fp8_fp4_mega_moe(
  File "python/tvm_ffi/cython/function.pxi", line 929, in tvm_ffi.core.Function.__call__
RuntimeError: CUDA driver error (/sgl-workspace/DeepGEMM/csrc/apis/../jit_kernels/impls/runtime_utils.hpp:143): 1 (CUDA_ERROR_INVALID_VALUE, invalid argument)
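
When triaging a CUDA_ERROR_INVALID_VALUE like the one above, it can help to rule out the basic alignment requirements first. The sketch below parses the "Making TMA desc" lines from this log and checks two constraints documented for `cuTensorMapEncodeTiled`: a 16-byte-aligned global address and a global stride that is a multiple of 16 bytes. It is a diagnostic aid only. The helper names are hypothetical, the sketch assumes the logged outer stride is in elements (so bytes = stride × elem size), and nothing here establishes that descriptor encoding is what failed at runtime_utils.hpp:143.

```python
import re
from typing import List, NamedTuple, Optional

# Matches the debug line format seen in the log above.
TMA_DESC_RE = re.compile(
    r"Making TMA desc: global memory: (\d+) (\d+), "
    r"shared memory: (\d+) (\d+), outer stride: (\d+), "
    r"swizzle: (\d+) \(base: (\d+)\), elem size: (\d+), pointer: (\d+)"
)


class TmaDesc(NamedTuple):
    gmem_inner: int
    gmem_outer: int
    smem_inner: int
    smem_outer: int
    outer_stride: int
    swizzle: int
    swizzle_base: int
    elem_size: int
    pointer: int


def parse_tma_desc(line: str) -> Optional[TmaDesc]:
    """Return the parsed descriptor fields, or None if the line does not match."""
    m = TMA_DESC_RE.search(line)
    return TmaDesc(*map(int, m.groups())) if m else None


def alignment_problems(desc: TmaDesc) -> List[str]:
    """Check the two basic alignment rules; an empty list means both hold."""
    problems = []
    if desc.pointer % 16 != 0:
        problems.append("global address is not 16-byte aligned")
    # ASSUMPTION: outer stride is logged in elements, not bytes.
    if (desc.outer_stride * desc.elem_size) % 16 != 0:
        problems.append("outer stride is not a multiple of 16 bytes")
    return problems


line = ("Making TMA desc: global memory: 1024 67968, shared memory: 128 96, "
        "outer stride: 512, swizzle: 128 (base: 0), elem size: 1, "
        "pointer: 139069136320752")
desc = parse_tma_desc(line)
print(desc, alignment_problems(desc))
```

For the descriptors logged here both checks pass, so the invalid argument would have to come from something else.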

Remaining tests:

==============================================================
 Summary: 8 passed, 1 failed, 3 skipped
   PASS  test_bf16.py
   PASS  test_einsum.py
   PASS  test_fp8_fp4.py
   PASS  test_hyperconnection.py
   PASS  test_layout.py
   PASS  test_attention.py
   PASS  test_mega_moe_l1_fp4_accuracy.py
   PASS  test_mega_moe_pre_dispatch.py
   SKIP  test_lazy_init.py (tvm_ffi eager CUDA init (upstream))
   SKIP  test_mega_moe.py (deep_ep with ElasticBuffer not installed)
   SKIP  test_sanitizer.py (fp8_fp4_mqa_logits register spill (known))
   FAIL  test_mega_moe_l1_sentinel.py

…est fix

- Bind m_grouped_bf16_gemm_nn_contiguous and k_grouped_bf16_gemm_tn_contiguous in the tvm-ffi API (+ Python wrappers); the kernels already existed, but only the pybind module exposed them.
- Expose SymmBuffer input views as torch tensors via torch.from_dlpack so callers can index/copy into them.
- test_mega_moe_l1_fp4_accuracy: allocate the symm buffer per mode with DG_USE_FP4_ACTS set before allocation and feed matching FP4 data (fixes FP4 routing / recv-stats).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
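
The per-mode allocation described in the commit message can be sketched as follows: set DG_USE_FP4_ACTS before each allocation and restore it afterwards, so every buffer is created under the mode it will be used with. This is only an illustration. `env_flag` and `allocate_buffer` are hypothetical stand-ins (the real test allocates a symm buffer through deep_gemm), and the "0"/"1" values are an assumption about how the flag is read.

```python
import os
from contextlib import contextmanager


@contextmanager
def env_flag(name, value):
    """Temporarily set an environment variable, restoring the old state on exit."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old


def allocate_buffer():
    # HYPOTHETICAL stand-in for the real symm-buffer allocation: it reads the
    # flag at allocation time, which is why the flag must be set beforehand.
    use_fp4 = os.environ.get("DG_USE_FP4_ACTS", "0") == "1"
    return {"use_fp4_acts": use_fp4}


# One fresh buffer per mode instead of reusing a single buffer for both.
buffers = {}
for mode in ("0", "1"):
    with env_flag("DG_USE_FP4_ACTS", mode):
        buffers[mode] = allocate_buffer()

print(buffers)
```

The context manager keeps the flag from leaking into later allocations, which is the kind of cross-mode reuse the fix is meant to avoid.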
Comment thread deep_gemm/mega/__init__.py
k_uint_idx = kb // 4
byte_idx = kb % 4
# `sf_bytes_int32` is M-major: index [token, k_uint] gives the int32
# word at that token's k_uint slot.
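
To make the indexing in this snippet concrete, here is a small sketch of reading one scale-factor byte out of the packed int32 words. It assumes four bytes per word with byte 0 in the least significant position; that byte order is an assumption and is not confirmed by the diff. `read_sf_byte` is a hypothetical helper and plain nested lists stand in for the tensor.

```python
def read_sf_byte(sf_words, token, kb):
    """Return the byte for K-block `kb` of `token` from M-major int32 words.

    ASSUMPTION: byte_idx 0 is the least significant byte of the word.
    """
    k_uint_idx = kb // 4   # which int32 word holds this K-block
    byte_idx = kb % 4      # which byte inside that word
    # Mask to 32 bits so negative int32 values are treated as unsigned.
    word = sf_words[token][k_uint_idx] & 0xFFFFFFFF
    return (word >> (8 * byte_idx)) & 0xFF


# One token, two words: bytes 1..4 packed in the first, 5..8 in the second.
sf_words = [[0x04030201, 0x08070605]]
print([read_sf_byte(sf_words, 0, kb) for kb in range(8)])
# prints [1, 2, 3, 4, 5, 6, 7, 8]
```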
Collaborator

I'm thinking about migrating the sgl-deep-gemm tests to sgl_deep_gemm/tests, so that the tests under the root folder don't cause conflicts on the next rebase.
