[feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch by yyj6666667 · Pull Request #2015 · kvcache-ai/ktransformers

yyj6666667 · 2026-05-18T11:01:41Z

Add AVX2 MXFP4 MoE kernel (mxfp4-moe.hpp) with 4-token M-blocking, enabling MXFP4 MoE on non-AMX CPUs
Wire AVX2MXFP4_MOE binding in ext_bindings.cpp
Support TP_MOE down_proj slicing and multi-pool per-expert loading

Test Results (DeepSeek V4 Flash × 1×RTX 5090)

MMLU 100-subset

Build	GPU Experts	Chunked Prefill	mem-fraction-static	Score
AVX2	6	2048	0.80	90%

Throughput

Input tokens	TTFT (s)	Prefill (tok/s)	Output tokens	Decode (tok/s)
512	7.419	60.5	144	18.83
1 K	10.842	90.9	146	19.13
2 K	3.496	574.7	149	19.28
4 K	4.970	805.3	148	18.89

Changed Files

kt-kernel/operators/avx2/mxfp4-moe.hpp — new AVX2 MXFP4 MoE kernel
kt-kernel/python/utils/amx.py — _select_mxfp4_backend() dispatch
kt-kernel/ext_bindings.cpp — AVX2MXFP4_MOE binding
kt-kernel/examples/test_fp4_moe_avx2.py — integration test

- Add AVX2 MXFP4 MoE kernel (mxfp4-moe.hpp) with 4-token M-blocking - Add AMX N-tail fallback in fp4-moe.hpp for non-aligned expert sizes - Add AMX tile MXFP4 backend selection (_select_mxfp4_backend in amx.py) - Wire AVX2MXFP4_MOE binding in ext_bindings.cpp - Support TP_MOE down_proj slicing and multi-pool per-expert loading - Add test_fp4_moe_avx2.py integration test

…lignment, dynamic expert update - Track aligned_alloc pointers in AVX2_MOE_BASE::owned_aligned_allocs_ and free them in the destructor (fixes BufferB backing memory leak on destroy). - Track per-TP down_buf allocations in TP_MOE::tp_owned_down_bufs_ with nullptr checks and size rounding to alignment boundary. - Add nibble-alignment runtime check for per_tp_interm in MXFP4 TP K-split. - Add write_weight_scale_to_buffer override to TP_MOE<AVX2_MXFP4_MOE_TP>, enabling dynamic expert update with kt-threadpool-count>=2. - Guard against ZeroDivisionError in test_fp4_moe_avx2.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…2BF16 fallback Merge kvcache-ai/ktransformers main (PR kvcache-ai#2006) into feat/avx2-mxfp4-moe. Upstream adds ActivationBF16, DequantizedWeight, and mxfp4_dot_bf16() abstractions providing a non-AVX512BF16 fallback path (FP32 LUT). Resolve conflict in GemmKernel224MXFP4SmallKGroup inner loops by taking upstream's refactored code; keep our GemmKernel224MXFP4 AMX tile struct and HAVE_AMX dispatch. Update AMX N-tail to use the new abstractions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…-buffer path The per-expert path validates that intermediate_size is even (required for nibble-aligned FP4 addressing), but the flat-buffer path was missing this check — an odd value would silently truncate /2 divisions, corrupting memcpy sizes and offsets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements the AVX2 MXFP4 MoE operator for DeepSeek V4 inference, providing a fallback path for CPUs without AMX support, and includes necessary bindings, backend selection logic, and a validation test script. Review feedback identifies potential memory leaks in AVX2_MOE_BASE and AVX2_MXFP4_MOE_TP where manual allocations (aligned_alloc and new[]) are not protected by RAII against exceptions. Additionally, the AMX kernel implementation lacks validation or handling for cases where the inner dimension k is not a multiple of the group size, which could lead to incorrect results or alignment issues.

gemini-code-assist · 2026-05-18T11:08:10Z

+          std::aligned_alloc(64, (buffer_b_required_size(config_.intermediate_size, config_.hidden_size) + 63) & ~63ULL);
+      if (!gate_bb_ptr) throw std::runtime_error("aligned_alloc failed for gate BufferB");
+      owned_aligned_allocs_.push_back(gate_bb_ptr);


If owned_aligned_allocs_.push_back(gate_bb_ptr) throws an exception (e.g., due to memory exhaustion), the memory allocated by std::aligned_alloc will be leaked as it is not yet tracked by the vector and won't be freed in the destructor. Consider using a temporary RAII wrapper or performing the push_back immediately after allocation.

gemini-code-assist · 2026-05-18T11:08:10Z

+        tpc.gate_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2];
+        tpc.up_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2];
+        tpc.down_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2];
+        tpc.gate_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count];
+        tpc.up_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count];
+        tpc.down_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count];


These raw memory allocations using new[] are prone to leaks. If any allocation fails (throwing std::bad_alloc), previously allocated buffers in this block will leak. Furthermore, if DO_TPS_LOAD_WEIGHTS (line 745) throws, all these buffers will leak as the manual delete[] calls at lines 748-753 are skipped. It is recommended to use std::unique_ptr<uint8_t[]> or a similar RAII container to manage these temporary buffers.

gemini-code-assist · 2026-05-18T11:08:10Z

+  static void mat_mul_kgroup_impl(int m, int n, int k, int k_group_size, BufferA* ba, BufferB* bb, BufferC* bc, int ith,
+                                  int nth) {
+    assert(k_group_size == TILE_K && "GemmKernel224MXFP4 requires k_group_size == 32");


The kernel implementation assumes that the inner dimension k is a multiple of k_group_size (32). If k % 32 != 0, the loops in amx_block and the N-tail fallback will ignore the remaining elements, leading to incorrect results. Additionally, pointer alignment for __m512bh at line 551 depends on k being a multiple of 32. Consider adding a check or handling the k-tail.

C1-C4: Fix incorrect TP offset calculations in load_weights() - Per-expert mode used per_tp_interm instead of full_interm for offsets - This caused segfault when TP > 1 due to invalid pointer arithmetic H1-H3: Add safety checks - H1: Validate source weight pointers are not null - H2: Check lid index is within bounds - H3: Check BufferB.b is not null in gemm_mxfp4 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Reverts the incorrect offset calculation changes from previous commit. The original per_tp_interm-based offsets were correct: - gate/up weights are N-split (along intermediate dim) - Each TP partition handles per_tp_interm rows - Offset = i * per_tp_interm * hidden / 2 (not full_interm) Keeps H1-H3 safety checks: - H1: Validate source weight pointers are not null - H2: Check lid index is within bounds - H3: Check BufferB.b is not null in gemm_mxfp4 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously, AVX2 MXFP4 MoE per-expert mode directly pointed BufferB.b into mmap'd safetensor data. This caused use-after-free when Python layer releases the mmap after load_weights() returns. Now AVX2 copies weights into owned buffers via memcpy/from_raw_mat(), matching AMX behavior. This decouples the MoE weights from mmap lifecycle. Changes: - buffer_b_required_size_impl: always allocate full buffer (weights + scales) - make_buffer_b_impl: always create full BufferB with owned storage - Single-TP per-expert: use from_raw_mat() instead of direct pointer - TP_MOE per-expert: add gate/up owned buffers with memcpy - Destructor: free gate/up buffers alongside down Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…phire Rapids CPUs (kvcache-ai#2018)" This reverts commit f1e2b82.

The AMX tile path cannot amortize tile load/store overhead with k_group_size=32: every group forces a tile_zero → tile_dpbf16ps → tile_stored → scale cycle, and the VNNI packing uses scalar transpose. The existing AVX-512 SmallKGroup path (register-resident dpbf16_ps + inline scale fmadd) is strictly better for this workload. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yyj6666667 and others added 4 commits May 17, 2026 13:25

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

yyj6666667 and others added 6 commits May 19, 2026 16:08

Merge remote-tracking branch 'upstream/main' into feat/avx2-mxfp4-moe

9216015

Revert "[fix] Add runtime AMX BF16 check to prevent SIGILL on pre-Sap…

ec447d8

…phire Rapids CPUs (kvcache-ai#2018)" This reverts commit f1e2b82.

yyj6666667 changed the title ~~[feat](kt-kernel): AVX2 MXFP4 MoE + AMX tile MXFP4 dispatch~~ [feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch May 30, 2026

yyj6666667 merged commit ef6c47f into kvcache-ai:main May 30, 2026
7 of 9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch#2015

[feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch#2015
yyj6666667 merged 10 commits into
kvcache-ai:mainfrom
yyj6666667:feat/avx2-mxfp4-moe

yyj6666667 commented May 18, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yyj6666667 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results (DeepSeek V4 Flash × 1×RTX 5090)

MMLU 100-subset

Throughput

Changed Files

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yyj6666667 commented May 18, 2026 •

edited

Loading