[feat](kt-kernel): AVX2 MXFP4 MoE MXFP4 dispatch#2015
Conversation
- Add AVX2 MXFP4 MoE kernel (mxfp4-moe.hpp) with 4-token M-blocking - Add AMX N-tail fallback in fp4-moe.hpp for non-aligned expert sizes - Add AMX tile MXFP4 backend selection (_select_mxfp4_backend in amx.py) - Wire AVX2MXFP4_MOE binding in ext_bindings.cpp - Support TP_MOE down_proj slicing and multi-pool per-expert loading - Add test_fp4_moe_avx2.py integration test
…lignment, dynamic expert update - Track aligned_alloc pointers in AVX2_MOE_BASE::owned_aligned_allocs_ and free them in the destructor (fixes BufferB backing memory leak on destroy). - Track per-TP down_buf allocations in TP_MOE::tp_owned_down_bufs_ with nullptr checks and size rounding to alignment boundary. - Add nibble-alignment runtime check for per_tp_interm in MXFP4 TP K-split. - Add write_weight_scale_to_buffer override to TP_MOE<AVX2_MXFP4_MOE_TP>, enabling dynamic expert update with kt-threadpool-count>=2. - Guard against ZeroDivisionError in test_fp4_moe_avx2.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…2BF16 fallback Merge kvcache-ai/ktransformers main (PR kvcache-ai#2006) into feat/avx2-mxfp4-moe. Upstream adds ActivationBF16, DequantizedWeight, and mxfp4_dot_bf16() abstractions providing a non-AVX512BF16 fallback path (FP32 LUT). Resolve conflict in GemmKernel224MXFP4SmallKGroup inner loops by taking upstream's refactored code; keep our GemmKernel224MXFP4 AMX tile struct and HAVE_AMX dispatch. Update AMX N-tail to use the new abstractions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-buffer path The per-expert path validates that intermediate_size is even (required for nibble-aligned FP4 addressing), but the flat-buffer path was missing this check — an odd value would silently truncate /2 divisions, corrupting memcpy sizes and offsets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements the AVX2 MXFP4 MoE operator for DeepSeek V4 inference, providing a fallback path for CPUs without AMX support, and includes necessary bindings, backend selection logic, and a validation test script. Review feedback identifies potential memory leaks in AVX2_MOE_BASE and AVX2_MXFP4_MOE_TP where manual allocations (aligned_alloc and new[]) are not protected by RAII against exceptions. Additionally, the AMX kernel implementation lacks validation or handling for cases where the inner dimension k is not a multiple of the group size, which could lead to incorrect results or alignment issues.
| std::aligned_alloc(64, (buffer_b_required_size(config_.intermediate_size, config_.hidden_size) + 63) & ~63ULL); | ||
| if (!gate_bb_ptr) throw std::runtime_error("aligned_alloc failed for gate BufferB"); | ||
| owned_aligned_allocs_.push_back(gate_bb_ptr); |
There was a problem hiding this comment.
If owned_aligned_allocs_.push_back(gate_bb_ptr) throws an exception (e.g., due to memory exhaustion), the memory allocated by std::aligned_alloc will be leaked as it is not yet tracked by the vector and won't be freed in the destructor. Consider using a temporary RAII wrapper or performing the push_back immediately after allocation.
| tpc.gate_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2]; | ||
| tpc.up_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2]; | ||
| tpc.down_proj = new uint8_t[(tpc.expert_num * weight_elem_count) / 2]; | ||
| tpc.gate_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count]; | ||
| tpc.up_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count]; | ||
| tpc.down_scale = new ggml_bf16_t[tpc.expert_num * scales_elem_count]; |
There was a problem hiding this comment.
These raw memory allocations using new[] are prone to leaks. If any allocation fails (throwing std::bad_alloc), previously allocated buffers in this block will leak. Furthermore, if DO_TPS_LOAD_WEIGHTS (line 745) throws, all these buffers will leak as the manual delete[] calls at lines 748-753 are skipped. It is recommended to use std::unique_ptr<uint8_t[]> or a similar RAII container to manage these temporary buffers.
| static void mat_mul_kgroup_impl(int m, int n, int k, int k_group_size, BufferA* ba, BufferB* bb, BufferC* bc, int ith, | ||
| int nth) { | ||
| assert(k_group_size == TILE_K && "GemmKernel224MXFP4 requires k_group_size == 32"); |
There was a problem hiding this comment.
The kernel implementation assumes that the inner dimension k is a multiple of k_group_size (32). If k % 32 != 0, the loops in amx_block and the N-tail fallback will ignore the remaining elements, leading to incorrect results. Additionally, pointer alignment for __m512bh at line 551 depends on k being a multiple of 32. Consider adding a check or handling the k-tail.
C1-C4: Fix incorrect TP offset calculations in load_weights() - Per-expert mode used per_tp_interm instead of full_interm for offsets - This caused segfault when TP > 1 due to invalid pointer arithmetic H1-H3: Add safety checks - H1: Validate source weight pointers are not null - H2: Check lid index is within bounds - H3: Check BufferB.b is not null in gemm_mxfp4 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverts the incorrect offset calculation changes from previous commit. The original per_tp_interm-based offsets were correct: - gate/up weights are N-split (along intermediate dim) - Each TP partition handles per_tp_interm rows - Offset = i * per_tp_interm * hidden / 2 (not full_interm) Keeps H1-H3 safety checks: - H1: Validate source weight pointers are not null - H2: Check lid index is within bounds - H3: Check BufferB.b is not null in gemm_mxfp4 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, AVX2 MXFP4 MoE per-expert mode directly pointed BufferB.b into mmap'd safetensor data. This caused use-after-free when Python layer releases the mmap after load_weights() returns. Now AVX2 copies weights into owned buffers via memcpy/from_raw_mat(), matching AMX behavior. This decouples the MoE weights from mmap lifecycle. Changes: - buffer_b_required_size_impl: always allocate full buffer (weights + scales) - make_buffer_b_impl: always create full BufferB with owned storage - Single-TP per-expert: use from_raw_mat() instead of direct pointer - TP_MOE per-expert: add gate/up owned buffers with memcpy - Destructor: free gate/up buffers alongside down Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…phire Rapids CPUs (kvcache-ai#2018)" This reverts commit f1e2b82.
The AMX tile path cannot amortize tile load/store overhead with k_group_size=32: every group forces a tile_zero → tile_dpbf16ps → tile_stored → scale cycle, and the VNNI packing uses scalar transpose. The existing AVX-512 SmallKGroup path (register-resident dpbf16_ps + inline scale fmadd) is strictly better for this workload. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mxfp4-moe.hpp) with 4-token M-blocking, enabling MXFP4 MoE on non-AMX CPUsAVX2MXFP4_MOEbinding inext_bindings.cppdown_projslicing and multi-pool per-expert loadingTest Results (DeepSeek V4 Flash × 1×RTX 5090)
MMLU 100-subset
Throughput
Changed Files
kt-kernel/operators/avx2/mxfp4-moe.hpp— new AVX2 MXFP4 MoE kernelkt-kernel/python/utils/amx.py—_select_mxfp4_backend()dispatchkt-kernel/ext_bindings.cpp— AVX2MXFP4_MOE bindingkt-kernel/examples/test_fp4_moe_avx2.py— integration test