[feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch #2015

Merged
yyj6666667 merged 10 commits into kvcache-ai:main from yyj6666667:feat/avx2-mxfp4-moe on May 30, 2026

@yyj6666667 (Collaborator) commented May 18, 2026

  • Add AVX2 MXFP4 MoE kernel (mxfp4-moe.hpp) with 4-token M-blocking, enabling MXFP4 MoE on non-AMX CPUs
  • Wire AVX2MXFP4_MOE binding in ext_bindings.cpp
  • Support TP_MOE down_proj slicing and multi-pool per-expert loading
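
For readers who want a scalar baseline to compare a SIMD MXFP4 kernel against, here is a minimal reference dot product. It is only a sketch under stated assumptions, not the code in mxfp4-moe.hpp: it assumes the standard FP4 E2M1 value table, two codes per byte with the even element in the low nibble, and one scale per group of 32 elements. The scale is a plain `float` here for simplicity (the PR stores scales as `ggml_bf16_t`), and the names `decode_e2m1` and `mxfp4_dot_ref` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// FP4 E2M1 magnitudes indexed by the low 3 bits; bit 3 is the sign.
constexpr float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

inline float decode_e2m1(std::uint8_t code) {
    float mag = kE2M1[code & 0x7];
    return (code & 0x8) ? -mag : mag;
}

// Reference dot product of one packed FP4 weight row with float activations.
// Assumed layout (hypothetical): element 2*j sits in the low nibble of byte j,
// element 2*j+1 in the high nibble; scales[g] covers elements
// [g*group, (g+1)*group).
inline float mxfp4_dot_ref(const std::uint8_t* packed, const float* scales,
                           const float* x, std::size_t k,
                           std::size_t group = 32) {
    float acc = 0.0f;
    for (std::size_t g = 0; g * group < k; ++g) {
        float partial = 0.0f;
        for (std::size_t i = g * group; i < (g + 1) * group && i < k; ++i) {
            std::uint8_t byte = packed[i / 2];
            std::uint8_t code = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
            partial += decode_e2m1(code) * x[i];
        }
        acc += scales[g] * partial;  // apply the scale once per group
    }
    return acc;
}
```

Applying the scale once per group rather than per element keeps the inner loop to a table lookup and a multiply-add, which is the part a vectorized kernel would typically target.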

Test Results (DeepSeek V4 Flash × 1×RTX 5090)

MMLU 100-subset

| Build | GPU Experts | Chunked Prefill | mem-fraction-static | Score |
|-------|-------------|-----------------|---------------------|-------|
| AVX2  | 6           | 2048            | 0.80                | 90%   |

Throughput

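| Input tokens | TTFT (s) | Prefill (tok/s) | Output tokens | Decode (tok/s) |
|--------------|----------|-----------------|---------------|----------------|
| 512          | 7.419    | 60.5            | 144           | 18.83          |
| 1K           | 10.842   | 90.9            | 146           | 19.13          |
| 2K           | 3.496    | 574.7           | 149           | 19.28          |
| 4K           | 4.970    | 805.3           | 148           | 18.89          |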

Changed Files

  • kt-kernel/operators/avx2/mxfp4-moe.hpp — new AVX2 MXFP4 MoE kernel
  • kt-kernel/python/utils/amx.py — _select_mxfp4_backend() dispatch
  • kt-kernel/ext_bindings.cpp — AVX2MXFP4_MOE binding
  • kt-kernel/examples/test_fp4_moe_avx2.py — integration test

yyj6666667 and others added 4 commits May 17, 2026 13:25
- Add AVX2 MXFP4 MoE kernel (mxfp4-moe.hpp) with 4-token M-blocking
- Add AMX N-tail fallback in fp4-moe.hpp for non-aligned expert sizes
- Add AMX tile MXFP4 backend selection (_select_mxfp4_backend in amx.py)
- Wire AVX2MXFP4_MOE binding in ext_bindings.cpp
- Support TP_MOE down_proj slicing and multi-pool per-expert loading
- Add test_fp4_moe_avx2.py integration test
…lignment, dynamic expert update

- Track aligned_alloc pointers in AVX2_MOE_BASE::owned_aligned_allocs_ and
  free them in the destructor (fixes BufferB backing memory leak on destroy).
- Track per-TP down_buf allocations in TP_MOE::tp_owned_down_bufs_ with
  nullptr checks and size rounding to alignment boundary.
- Add nibble-alignment runtime check for per_tp_interm in MXFP4 TP K-split.
- Add write_weight_scale_to_buffer override to TP_MOE<AVX2_MXFP4_MOE_TP>,
  enabling dynamic expert update with kt-threadpool-count>=2.
- Guard against ZeroDivisionError in test_fp4_moe_avx2.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…2BF16 fallback

Merge kvcache-ai/ktransformers main (PR kvcache-ai#2006) into feat/avx2-mxfp4-moe.
Upstream adds ActivationBF16, DequantizedWeight, and mxfp4_dot_bf16()
abstractions providing a non-AVX512BF16 fallback path (FP32 LUT).
Resolve conflict in GemmKernel224MXFP4SmallKGroup inner loops by taking
upstream's refactored code; keep our GemmKernel224MXFP4 AMX tile struct
and HAVE_AMX dispatch. Update AMX N-tail to use the new abstractions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-buffer path

The per-expert path validates that intermediate_size is even (required for
nibble-aligned FP4 addressing), but the flat-buffer path was missing this
check — an odd value would silently truncate /2 divisions, corrupting
memcpy sizes and offsets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
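
To make the truncation concrete: with two FP4 values per byte, a row of 7 elements needs 4 bytes, but `7 / 2` evaluates to 3, so every size and offset derived from it is one nibble short. The check described above could look roughly like the following sketch; `fp4_row_bytes` is a hypothetical helper, not a function from the PR.

```cpp
#include <cstddef>
#include <stdexcept>

// Bytes in one packed FP4 row of `k` elements (two elements per byte).
// Rejects odd `k` instead of letting integer division drop the last nibble.
inline std::size_t fp4_row_bytes(std::size_t k) {
    if (k % 2 != 0) {
        throw std::invalid_argument(
            "FP4 row length must be even for nibble-aligned addressing");
    }
    return k / 2;
}
```

Routing every memcpy size and offset through one such helper means the per-expert and flat-buffer paths cannot disagree about whether the check was applied.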

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements the AVX2 MXFP4 MoE operator for DeepSeek V4 inference, providing a fallback path for CPUs without AMX support, and includes necessary bindings, backend selection logic, and a validation test script. Review feedback identifies potential memory leaks in AVX2_MOE_BASE and AVX2_MXFP4_MOE_TP where manual allocations (aligned_alloc and new[]) are not protected by RAII against exceptions. Additionally, the AMX kernel implementation lacks validation or handling for cases where the inner dimension k is not a multiple of the group size, which could lead to incorrect results or alignment issues.

Comment on lines +123 to +125
std::aligned_alloc(64, (buffer_b_required_size(config_.intermediate_size, config_.hidden_size) + 63) & ~63ULL);
if (!gate_bb_ptr) throw std::runtime_error("aligned_alloc failed for gate BufferB");
owned_aligned_allocs_.push_back(gate_bb_ptr);

medium

If owned_aligned_allocs_.push_back(gate_bb_ptr) throws an exception (e.g., due to memory exhaustion), the memory allocated by std::aligned_alloc will be leaked as it is not yet tracked by the vector and won't be freed in the destructor. Consider using a temporary RAII wrapper or performing the push_back immediately after allocation.
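
One way to close that window is to give the block an owner before anything else can throw. The sketch below assumes a C++17 toolchain that provides `std::aligned_alloc`; `AlignedPtr`, `make_aligned`, and `AlignedPool` are hypothetical names, not types from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Deleter so unique_ptr releases aligned_alloc memory with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};
using AlignedPtr = std::unique_ptr<void, FreeDeleter>;

// Round `size` up to a multiple of `alignment` (a power of two);
// std::aligned_alloc requires the size to be such a multiple.
inline std::size_t round_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

inline AlignedPtr make_aligned(std::size_t size, std::size_t alignment = 64) {
    void* p = std::aligned_alloc(alignment, round_up(size, alignment));
    if (!p) throw std::bad_alloc();
    return AlignedPtr(p);
}

// Hypothetical owner that tracks blocks the way owned_aligned_allocs_ does,
// but stores owning pointers so no manual destructor loop is needed.
class AlignedPool {
public:
    void* allocate(std::size_t size) {
        AlignedPtr p = make_aligned(size);  // owned from the first instant
        void* raw = p.get();
        owned_.push_back(std::move(p));     // if this throws, `p` frees the block
        return raw;
    }
    std::size_t count() const { return owned_.size(); }

private:
    std::vector<AlignedPtr> owned_;
};
```

If `push_back` fails while growing the vector, the local `p` still owns the block and releases it during stack unwinding, so nothing leaks.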

Comment on lines +688 to +693
tpc.gate_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2];
tpc.up_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2];
tpc.down_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2];
tpc.gate_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count];
tpc.up_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count];
tpc.down_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count];

medium

These raw memory allocations using new[] are prone to leaks. If any allocation fails (throwing std::bad_alloc), previously allocated buffers in this block will leak. Furthermore, if DO_TPS_LOAD_WEIGHTS (line 745) throws, all these buffers will leak as the manual delete[] calls at lines 748-753 are skipped. It is recommended to use std::unique_ptr<uint8_t[]> or a similar RAII container to manage these temporary buffers.
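
A minimal sketch of that suggestion follows. `TempExpertBuffers` is a hypothetical stand-in for the temporary per-TP config, and `std::uint16_t` stands in for `ggml_bf16_t` so the example compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical holder for the six temporary buffers. Members are initialized
// in declaration order; if a later allocation throws std::bad_alloc, the
// members already constructed are destroyed and their arrays released.
struct TempExpertBuffers {
    std::unique_ptr<std::uint8_t[]> gate_proj, up_proj, down_proj;
    std::unique_ptr<std::uint16_t[]> gate_scale, up_scale, down_scale;

    TempExpertBuffers(std::size_t expert_num, std::size_t weight_elems,
                      std::size_t scale_elems)
        : gate_proj(std::make_unique<std::uint8_t[]>(expert_num * weight_elems / 2)),
          up_proj(std::make_unique<std::uint8_t[]>(expert_num * weight_elems / 2)),
          down_proj(std::make_unique<std::uint8_t[]>(expert_num * weight_elems / 2)),
          gate_scale(std::make_unique<std::uint16_t[]>(expert_num * scale_elems)),
          up_scale(std::make_unique<std::uint16_t[]>(expert_num * scale_elems)),
          down_scale(std::make_unique<std::uint16_t[]>(expert_num * scale_elems)) {}
};
```

Because the buffers are released by the destructor, an exception thrown later in the load path no longer skips the cleanup. Note that `std::make_unique<T[]>` zero-initializes the array, which costs a pass over memory that raw `new[]` avoids.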

Comment thread kt-kernel/operators/amx/fp4-moe.hpp Outdated
Comment on lines +533 to +535
static void mat_mul_kgroup_impl(int m, int n, int k, int k_group_size, BufferA* ba, BufferB* bb, BufferC* bc, int ith,
int nth) {
assert(k_group_size == TILE_K && "GemmKernel224MXFP4 requires k_group_size == 32");

medium

The kernel implementation assumes that the inner dimension k is a multiple of k_group_size (32). If k % 32 != 0, the loops in amx_block and the N-tail fallback will ignore the remaining elements, leading to incorrect results. Additionally, pointer alignment for __m512bh at line 551 depends on k being a multiple of 32. Consider adding a check or handling the k-tail.
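
A runtime check along the lines the comment suggests might look like this sketch. Unlike `assert`, it stays active in release builds compiled with `NDEBUG`. The names `validate_k` and `split_k` are hypothetical, and the group size of 32 is taken from the assertion message quoted above.

```cpp
#include <stdexcept>
#include <string>

constexpr int kGroupSize = 32;

// Reject shapes the kernel cannot handle rather than silently skipping
// the trailing k % 32 elements.
inline void validate_k(int k, int k_group_size) {
    if (k_group_size != kGroupSize) {
        throw std::invalid_argument("k_group_size must be 32");
    }
    if (k <= 0 || k % k_group_size != 0) {
        throw std::invalid_argument("k=" + std::to_string(k) +
                                    " is not a positive multiple of " +
                                    std::to_string(k_group_size));
    }
}

// Alternative for a kernel that does handle the tail: whole groups plus
// the number of leftover elements.
struct KSplit {
    int full_groups;
    int tail;
};

inline KSplit split_k(int k, int k_group_size) {
    return {k / k_group_size, k % k_group_size};
}
```

For example, `split_k(100, 32)` yields three full groups and a tail of four elements, which is exactly the part the comment says the current loops would ignore.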

yyj6666667 and others added 6 commits May 19, 2026 16:08
C1-C4: Fix incorrect TP offset calculations in load_weights()
- Per-expert mode used per_tp_interm instead of full_interm for offsets
- This caused segfault when TP > 1 due to invalid pointer arithmetic

H1-H3: Add safety checks
- H1: Validate source weight pointers are not null
- H2: Check lid index is within bounds
- H3: Check BufferB.b is not null in gemm_mxfp4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverts the incorrect offset calculation changes from previous commit.
The original per_tp_interm-based offsets were correct:
- gate/up weights are N-split (along intermediate dim)
- Each TP partition handles per_tp_interm rows
- Offset = i * per_tp_interm * hidden / 2 (not full_interm)

Keeps H1-H3 safety checks:
- H1: Validate source weight pointers are not null
- H2: Check lid index is within bounds
- H3: Check BufferB.b is not null in gemm_mxfp4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
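
The offset formula in this commit message can be checked with a small worked example. The sketch assumes a row-major `[full_interm, hidden]` gate/up matrix packed two FP4 elements per byte; `gate_up_tp_offset_bytes` is a hypothetical helper, not a function from the PR.

```cpp
#include <cstddef>

// Byte offset of TP partition `tp` inside one expert's packed FP4 gate/up
// matrix. Each partition owns `per_tp_interm` consecutive rows (N-split),
// and each row holds `hidden` elements at two elements per byte.
inline std::size_t gate_up_tp_offset_bytes(std::size_t tp,
                                           std::size_t per_tp_interm,
                                           std::size_t hidden) {
    return tp * per_tp_interm * hidden / 2;
}
```

With `full_interm = 2048`, two TP partitions (`per_tp_interm = 1024`), and `hidden = 4096`, partition 1 starts at byte 2,097,152, halfway through the 4,194,304-byte matrix. Substituting `full_interm` for `per_tp_interm` would put partition 1 at byte 4,194,304, which is already past the end of the expert's data.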
Previously, AVX2 MXFP4 MoE per-expert mode directly pointed BufferB.b
into mmap'd safetensor data. This caused use-after-free when Python
layer releases the mmap after load_weights() returns.

Now AVX2 copies weights into owned buffers via memcpy/from_raw_mat(),
matching AMX behavior. This decouples the MoE weights from mmap lifecycle.

Changes:
- buffer_b_required_size_impl: always allocate full buffer (weights + scales)
- make_buffer_b_impl: always create full BufferB with owned storage
- Single-TP per-expert: use from_raw_mat() instead of direct pointer
- TP_MOE per-expert: add gate/up owned buffers with memcpy
- Destructor: free gate/up buffers alongside down

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
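
The ownership change described in this commit can be illustrated with a small sketch. `OwnedWeights` is a hypothetical class, and it uses `std::vector` rather than the aligned buffers and `from_raw_mat()` path the commit mentions. The point is only that the bytes are copied during load, so the source mapping may be released afterwards.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical weight holder that copies its bytes at load time instead of
// keeping a pointer into memory owned by someone else (e.g. an mmap).
class OwnedWeights {
public:
    void load(const std::uint8_t* src, std::size_t bytes) {
        if (src == nullptr) {
            throw std::invalid_argument("null source weight pointer");
        }
        data_.assign(src, src + bytes);  // deep copy; src may go away afterwards
    }
    const std::uint8_t* data() const { return data_.data(); }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<std::uint8_t> data_;
};
```

The cost is one extra copy and a second resident set of weights during loading; in exchange, the kernel's pointers stay valid for the lifetime of the object regardless of what the caller does with the source.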
The AMX tile path cannot amortize tile load/store overhead with
k_group_size=32: every group forces a tile_zero → tile_dpbf16ps →
tile_stored → scale cycle, and the VNNI packing uses scalar transpose.
The existing AVX-512 SmallKGroup path (register-resident dpbf16_ps +
inline scale fmadd) is strictly better for this workload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yyj6666667 changed the title from "[feat](kt-kernel): AVX2 MXFP4 MoE + AMX tile MXFP4 dispatch" to "[feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch" on May 30, 2026
@yyj6666667 yyj6666667 merged commit ef6c47f into kvcache-ai:main May 30, 2026
7 of 9 checks passed