x86_64 CPU parity: AVX2+FMA quantized matmul kernels and kernel-machine routing #89
…n lanes
Parallel matmul cutover:
- Remove ggml-inherited ith/nth thread-partition fields from all kernel op events; tensor views are now the only slice descriptor, with partition policy living solely at the orchestration fork site.
- Add 8-lane view-sliced fork/join matmul routes for prefill (GEMM) and decode (GEMV) behind explicit guards; lane kernels and thread pool are constructed once at backend init, dispatch stays allocation-free.
- Fix thread-pool scheduler fork/join lost-wakeup (Dekker race and destroy-during-release) in the join latch.
Decode wavefront:
- New text/generator/decode_wavefront component (co_sm lane orchestration) with lifecycle tests, focused bench, and eval tool.
Cross-engine comparison lanes:
- parallel_matmul bench gains a ggml reference lane (warm 8-thread ggml threadpool, plain GGUF-native blocks both sides). Evidence at dim 2048: EMEL wins prefill GEMM 0.843x; ggml leads GEMV (q8_0 2.7x, q6_k 2.5x, q4_k 4.2x, f32 9.6x) on per-kernel arithmetic, not orchestration.
- generation bench gains EMEL_BENCH_REFERENCE_THREADS for matched-core end-to-end compares; maintained publication rows stay at 1 thread.
- LFM2.5-230M-Q8_0 fixture wired end to end: fixture registry, workload manifests, preserve_thinking ChatML formatter contract (both EMEL and reference resolvers), 230M strict model contract.
- LFM2 attention-layer layout is now metadata-driven from per-layer lfm2.attention.head_count_kv instead of the hardcoded 1.2B block list, with lifecycle tests for the pattern layout and contradiction rejection.
Matched-thread evidence (230M, 8 threads both sides): steady-state decode within ~1.15x of llama.cpp; end-to-end gap concentrated in a ~443 ms EMEL first-token path (top follow-up), full decomposition in coroutine-plan.md.
Also: coverage lane fixes for gcovr 8.6 (merge-mode-functions separate, atomic profile updates, negative-hits tolerance); benchmark and lint snapshot refreshes pre-approved for this changeset.
Known open items: changed-line coverage for the new parallel decode action structs is below gate (needs lifecycle tests driving those routes); strict LFM2 1.2B x86 lane still blocked on the optimized plain-Q4 kernel.
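The view-sliced fork/join idea in the cutover notes can be sketched as follows. This is a minimal illustration only, assuming a hypothetical `row_view` descriptor, a hypothetical `slice_rows` policy function, and plain `std::thread` workers in place of the pre-constructed pool the commit describes; none of these names come from the EMEL sources.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical slice descriptor: a contiguous range of output rows.
struct row_view {
  std::size_t begin;
  std::size_t end;  // one past the last row
};

// Partition policy lives only here, at the fork site. Lane kernels receive a
// view and never see a lane index or lane count.
inline std::vector<row_view> slice_rows(std::size_t rows, std::size_t lanes) {
  std::vector<row_view> views;
  if (lanes == 0) return views;
  const std::size_t base = rows / lanes;
  const std::size_t extra = rows % lanes;  // first `extra` lanes get one more row
  std::size_t begin = 0;
  for (std::size_t lane = 0; lane < lanes; ++lane) {
    const std::size_t len = base + (lane < extra ? 1 : 0);
    views.push_back({begin, begin + len});
    begin += len;
  }
  return views;
}

// Lane kernel: y[r] = dot(W[r, :], x) for the rows inside the view only.
inline void gemv_rows(const float* w, const float* x, float* y,
                      std::size_t cols, row_view view) {
  for (std::size_t r = view.begin; r < view.end; ++r) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) acc += w[r * cols + c] * x[c];
    y[r] = acc;
  }
}

// Fork one worker per view, then join. Views are disjoint, so lanes never
// write the same output element. (Spawning threads per call is a
// simplification; it is not allocation-free.)
inline void gemv_fork_join(const float* w, const float* x, float* y,
                           std::size_t rows, std::size_t cols,
                           std::size_t lanes) {
  std::vector<std::thread> workers;
  for (const row_view view : slice_rows(rows, lanes)) {
    workers.emplace_back(gemv_rows, w, x, y, cols, view);
  }
  for (std::thread& worker : workers) worker.join();
}
```

Because the lane kernel takes only a view, the same kernel serves the serial path (one view covering all rows) and the parallel path without any thread-partition fields on the op itself.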
…Q8_0) and FMA F32 GEMM
- Port reference AVX2 arithmetic for Q4_K x Q8_K mul_mat with guard-routed
sm dispatch and optimized/shared counters
- Add AVX2+FMA row-dot kernels and execute paths for legacy quants
Q4_0/Q4_1/Q5_0/Q8_0 x Q8_0
- Add FMA variant of the blocked F32 GEMM, selected by explicit guard when
fma_available; AVX2-only path remains as fallback
- Fix latent scalar gap: run_mul_mat validated q4_0/q4_1 but had no compute
branch (requests reported done with untouched dst); add scalar branches
- clang-format src/emel/model/data.{cpp,hpp} (unformatted additions from
parent branch commit; lint lane had been skipped on hosts without
clang-format)
Gates: kernel parity ok, bench snapshot ok, changed-line coverage 97.6%,
lint clean. Paritychecker generation-baseline failures are pre-existing
(missing snapshots/parity/generation_lfm2_5_230m_q8_0_*).
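For readers unfamiliar with the legacy block formats, a scalar Q4_0 x Q8_0 row dot (the kind of compute branch the scalar-gap fix adds) might look like the sketch below. It assumes the GGUF-style Q4_0 layout (32 values per block, low nibbles holding elements 0..15, high nibbles holding elements 16..31, offset by 8) and, as a simplification, stores the per-block scale as `float` rather than fp16. The struct and function names are hypothetical, not the EMEL ones.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t k_block = 32;  // values per quantized block

// Simplified blocks: real blocks store the scale `d` as fp16.
struct block_q4_0_f32 {
  float d;
  std::uint8_t qs[k_block / 2];  // two 4-bit values per byte
};

struct block_q8_0_f32 {
  float d;
  std::int8_t qs[k_block];
};

// Dot product of one quantized weight row with one quantized activation row.
inline float dot_q4_0_q8_0_scalar(const block_q4_0_f32* w,
                                  const block_q8_0_f32* a,
                                  std::size_t block_count) {
  float sum = 0.0f;
  for (std::size_t b = 0; b < block_count; ++b) {
    std::int32_t isum = 0;
    for (std::size_t j = 0; j < k_block / 2; ++j) {
      // Unpack both nibbles and remove the +8 storage offset.
      const int lo = static_cast<int>(w[b].qs[j] & 0x0F) - 8;
      const int hi = static_cast<int>(w[b].qs[j] >> 4) - 8;
      isum += lo * a[b].qs[j] + hi * a[b].qs[j + k_block / 2];
    }
    // Integer block sum scaled once by both block scales.
    sum += w[b].d * a[b].d * static_cast<float>(isum);
  }
  return sum;
}
```

A SIMD row-dot computes the same integer block sum with wider lanes; keeping a scalar version like this around gives the parity checks a reference to compare against.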
…tor AVX2+FMA kernel
- Embeddings, whisper encoder/decoder, and sortformer contexts own an emel::kernel::sm (kernel::any) and dispatch op_mul_mat via process_event; per-domain #if-aarch64/scalar dispatch duplicates deleted. Kernel machine guards now do arch/ISA routing for all these domains.
- Add shared emel::kernel::detect_host_kind() helper in any.hpp
- Fix dense src0 views carrying quantized byte-stride (nb[0]) that kernel validation would reject
- Add execute_avx2_fma_mul_mat_f32_vector_unchecked: dedicated matrix x vector (n==1) AVX2+FMA kernel with guard-routed sm row; the blocked GEMM degenerates to scalar tail for n==1
- x86 sortformer AMI bench: 20.07s (pre-refactor scalar) -> 2.24s, output checksum unchanged vs baseline
Known outstanding (documented, needs user decision):
- bench lane: snapshots/bench/benchmarks.txt baselines were recorded on the aarch64 host (contains kernel/aarch64 entries); x86 cannot meet ARM absolute-ns baselines for diarization_sortformer
- coverage lane: whisper decoder forward-pass lines are a pre-existing coverage hole now gating because the refactor touched them
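The guard-routed selection described above, including why n==1 gets its own route, can be illustrated with a small sketch. The enum, the `host_caps` struct, and the column-block width passed to `columns_in_scalar_tail` are all assumptions made for illustration; the real routing lives in the kernel state machine's guards.

```cpp
#include <cstddef>

// Hypothetical route labels for an F32 mul_mat request.
enum class f32_route { avx2_fma_vector, avx2_fma_gemm, avx2_gemm, scalar };

// Hypothetical host capability snapshot, filled once at init.
struct host_caps {
  bool avx2;
  bool fma;
};

// Each branch mirrors one explicit guard: the first that holds picks the row.
constexpr f32_route select_f32_route(host_caps caps, std::size_t n) {
  if (caps.avx2 && caps.fma) {
    return n == 1 ? f32_route::avx2_fma_vector : f32_route::avx2_fma_gemm;
  }
  if (caps.avx2) return f32_route::avx2_gemm;  // AVX2-only fallback
  return f32_route::scalar;
}

// Columns a column-blocked GEMM leaves to its scalar tail. For n == 1 and any
// block width above 1, the single column is entirely tail work.
constexpr std::size_t columns_in_scalar_tail(std::size_t n,
                                             std::size_t block_cols) {
  return block_cols == 0 ? n : n % block_cols;
}
```

With a block width of 4, for example, `columns_in_scalar_tail(1, 4)` is 1: no column reaches the vectorized body, which is the motivation for a dedicated matrix x vector route.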
…matrix matmul path
- Encoder/decoder lifecycle tests now run q4_0, q4_1, and q8_0+f32-aux weight variants end-to-end through the machines with the tiny fixture, exercising compute_decoder_cross_cache, run_decoder_layer_sequence, the logits paths, and the kernel-machine dispatch wiring per variant
- Root cause: the changed-line coverage gate keys records by line number, so never-executed template instantiations shadowed covered q8_0 records
- Add embeddings matmul_f32_matrix multi-column test and pointwise lane-rejection test
- sortformer_bench: thread the kernel machine into the stage-profile compute_speaker_probabilities call (one-time static construction)
Coverage gate: changed-line 83/83 (100%), branches 55.8%, exit 0. Speech shard 50/50 cases green.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 4912bab.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4912bab5bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ont lanes inline
- CMake: compiler flag acceptance alone allowed -mavx2/-mfma/-mf16c into every consumer TU on x86_64 hosts that cannot execute AVX2, bypassing the runtime CPUID guards (SIGILL). Add a check_cxx_source_runs cpuid/xgetbv probe (no libgcc __cpu_model dependency, works under zig c++) so host-tuned codegen is only enabled where it runs.
- decode_wavefront: a rejected try_submit (queue full or caller is a pool worker) discarded the lane; it now runs inline per the documented backpressure contract used by the parallel matmul helper.
Addresses PR #89 review findings.
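A cpuid/xgetbv probe of the kind the CMake fix describes could be sketched as below. This is not the probe from the PR: the `x86_features` struct and `probe_x86_features` name are hypothetical, and the code assumes a GCC- or Clang-compatible compiler providing `<cpuid.h>`. On other targets it reports no features. The bit positions are the documented ones: CPUID leaf 1 ECX bits 12 (FMA), 27 (OSXSAVE), 28 (AVX), 29 (F16C); CPUID leaf 7 subleaf 0 EBX bit 5 (AVX2); XCR0 bits 1 and 2 for OS-saved XMM and YMM state.

```cpp
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAS_CPUID 1
#else
#define SKETCH_HAS_CPUID 0
#endif

// Hypothetical result type for the probe.
struct x86_features {
  bool avx2;
  bool fma;
  bool f16c;
};

inline x86_features probe_x86_features() {
  x86_features f{false, false, false};
#if SKETCH_HAS_CPUID
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return f;
  const bool fma = ((ecx >> 12) & 1u) != 0;
  const bool osxsave = ((ecx >> 27) & 1u) != 0;
  const bool avx = ((ecx >> 28) & 1u) != 0;
  const bool f16c = ((ecx >> 29) & 1u) != 0;
  // xgetbv faults unless OSXSAVE is set, so check that before executing it.
  if (!osxsave || !avx) return f;
  std::uint32_t xcr0_lo = 0, xcr0_hi = 0;
  __asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
  // The OS must save both XMM (bit 1) and YMM (bit 2) state.
  if ((xcr0_lo & 0x6u) != 0x6u) return f;
  f.fma = fma;
  f.f16c = f16c;
  if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
    f.avx2 = ((ebx >> 5) & 1u) != 0;
  }
#endif
  return f;
}
```

A build-time run probe would wrap this in a `main` that returns 0 only when every feature the flags require is present, so the flags are enabled only on hosts that can actually execute the resulting code.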
Reviewed commit: fd3c26b5e0
The negated operand is always a quantize_row_q8_0_strided activation block (clamped to [-127,127]); weight lanes may hold -128 but are only abs'd, where the u8 reinterpretation is exact. Addresses PR #89 review finding.
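The reasoning about -128 can be checked with a scalar model of the abs/sign trick that is commonly paired with unsigned-by-signed multiply-add instructions. The helper names below are hypothetical and the code models 8-bit wraparound explicitly; it is a sketch of the arithmetic, not the kernel itself.

```cpp
#include <cstdint>

// Two's-complement negate confined to 8 bits, as a packed 8-bit sign
// operation would do it (no widening). Note: -(-128) wraps back to -128.
inline std::int8_t wrap_negate_i8(std::int8_t v) {
  return static_cast<std::int8_t>(
      static_cast<std::uint8_t>(0u - static_cast<std::uint8_t>(v)));
}

// 8-bit abs, then reinterpret the bits as unsigned. For -128 the 8-bit abs
// leaves the bit pattern 0x80, which reads as 128 when unsigned: exact.
inline std::uint8_t abs_as_u8(std::int8_t v) {
  return v < 0 ? static_cast<std::uint8_t>(0u - static_cast<std::uint8_t>(v))
               : static_cast<std::uint8_t>(v);
}

// Transfer the sign of `w` onto `a`: negate if w < 0, zero if w == 0.
inline std::int8_t apply_sign_i8(std::int8_t a, std::int8_t w) {
  if (w == 0) return 0;
  return w < 0 ? wrap_negate_i8(a) : a;
}

// w * a rewritten as |w| (unsigned) times sign-adjusted a (signed).
inline int product_via_abs_sign(std::int8_t w, std::int8_t a) {
  return static_cast<int>(abs_as_u8(w)) * static_cast<int>(apply_sign_i8(a, w));
}
```

With this model, a weight of -128 against an activation of 127 gives the exact product, while a hypothetical activation of -128 under a negative weight would not, which is why the clamp of the activation to [-127,127] matters.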
Reviewed commit: adbe964bd3
guard_valid_compute_reserved accepted any well-formed compute_reserved as long as some reservation existed, so a stale or mismatched request could execute one tensor lifecycle against another reservation's topology. The guard now requires the request manifest's tensor bindings (pointer and count) to match the reservation's; per-phase manifests over the same binding set remain accepted. Regression test added. Addresses PR #89 review finding.
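The tightened guard can be pictured with the following sketch. The `manifest` and `reservation` structs and their fields are hypothetical stand-ins; only the rule itself (binding pointer and count must match the reservation, phase may differ) comes from the commit message.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the real request and reservation types.
struct tensor_binding {
  int tensor_id;
};

struct manifest {
  const tensor_binding* bindings;
  std::size_t binding_count;
  int phase;  // per-phase manifests may share one binding set
};

struct reservation {
  const tensor_binding* bindings;
  std::size_t binding_count;
  bool active;
};

// Old behavior (as described): any active reservation was enough.
inline bool guard_any_reservation(const manifest&, const reservation& res) {
  return res.active;
}

// New behavior (sketch): the request must name the same binding set, by
// pointer and count, as the reservation it will execute against.
inline bool guard_matching_reservation(const manifest& req,
                                       const reservation& res) {
  return res.active && req.bindings != nullptr &&
         req.bindings == res.bindings &&
         req.binding_count == res.binding_count;
}
```

Comparing by pointer and count keeps the guard constant-time and allocation-free while still rejecting a request built against a different topology.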
Reviewed commit: 82dbe475c7
guard_parallel_dispatch scheduled every compatible lane on the pool without checking graph actor distinctness; two lanes sharing one emel::graph::sm would process_event concurrently and break the RTC single-writer contract. Parallel dispatch now requires pairwise-distinct graph refs (statically bounded scan); duplicates fall to the serial path, same algorithm and output. Regression test asserts the serial route (no pool scheduling) and correct completion for shared-actor lanes. Addresses PR #89 review finding (P1).
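A statically bounded pairwise-distinct scan of the kind described could look like this. The `lane` and `graph_actor` types and the bound of 8 lanes are assumptions for illustration; the real guard operates on emel::graph::sm references.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t k_max_lanes = 8;  // assumed static bound

// Hypothetical stand-ins: each lane refers to the graph actor it will drive.
struct graph_actor {
  int id;
};

struct lane {
  graph_actor* graph;
};

// True only if no two of the first `count` lanes share a graph actor.
// At most k_max_lanes * (k_max_lanes - 1) / 2 comparisons, no allocation.
inline bool lanes_have_distinct_graphs(
    const std::array<lane, k_max_lanes>& lanes, std::size_t count) {
  if (count > k_max_lanes) return false;
  for (std::size_t i = 0; i < count; ++i) {
    for (std::size_t j = i + 1; j < count; ++j) {
      if (lanes[i].graph == lanes[j].graph) return false;
    }
  }
  return true;
}
```

When the scan returns false the caller takes the serial route, so two lanes never call into the same run-to-completion actor concurrently.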
@codex review
Reviewed commit: 4850c046c1

Resubmission of #87 (reverted in #88 to go through proper review).
Summary
emel::kernel::any) so sm guards do all arch/ISA routing; per-domain #if aarch64 / scalar dispatch duplicates deleted
Performance (Ryzen 9 5950X)
Gates
snapshots/bench/benchmarks.txt was recorded on the aarch64 host and is unreachable on x86 (needs per-arch baselines, a separate decision); snapshots/parity/generation_lfm2_5_230m_q8_0_* baselines were never committed (pre-existing gap)
Note
Low Risk
This diff is planning, generated architecture diagrams, and local Codex path retargeting, with no runtime C++ changes in the shown patch, so merge risk to production behavior is low unless unstaged kernel commits ship separately.
Overview
Closes v1.27 Ryzen AVX2/FMA Kernel Support in planning artifacts: PROJECT, STATE, ROADMAP, MILESTONES, and RETROSPECTIVE now record v1.27 as shipped (2026-06-25), no active milestone, and v1.26 moved to archived milestone docs including a passed v1.26-MILESTONE-AUDIT and full v1.26-ROADMAP under .planning/milestones/. New v1.27 audit, requirements archive, and roadmap capture the 13/13 requirement pass, optimized benchmark attribution repair (XBN-01), and unary SML rule-debt cleanup.
Generated kernel_x86_64 architecture docs (markdown + mermaid) are updated to show explicit SML transitions on dispatch_op_mul_mat for q2_K/q3_K/q6_K × q8_K SIMD guards/effects and on dispatch_op_flash_attn_ext for the AVX2/FMA one-chunk path vs shared fallback, matching the v1.27 “NEON-style” host contract + attribution story in planning text.
.codex/config.toml rewrites all GSD agent config_file paths from a developer-local tree to /shared/stateforward/emel.cpp/.codex/agents/....
Reviewed by Cursor Bugbot for commit 4850c04. Bugbot is set up for automated code reviews on this repo.