x86_64 CPU parity: AVX2+FMA quantized matmul kernels and kernel-machine routing by gabewillen · Pull Request #89 · stateforward/emel.cpp

gabewillen · 2026-07-02T13:22:48Z

Resubmission of #87 (reverted in #88 to go through proper review).

Summary

Add AVX2+FMA x86_64 kernels: Q4_K×Q8_K matmul, legacy quants Q4_0/Q4_1/Q5_0/Q8_0×Q8_0, FMA F32 blocked GEMM, and a dedicated f32 matrix×vector kernel (the GEMM degenerates to scalar for n==1)
Fix latent scalar bug: run_mul_mat validated q4_0/q4_1 but had no compute branch (reported done with untouched dst); scalar branches added
Route embeddings, whisper encoder/decoder, and sortformer matmuls through the kernel machine (emel::kernel::any) so sm guards do all arch/ISA routing; per-domain #if aarch64 / scalar dispatch duplicates deleted
Cover all whisper linear weight-variant template instantiations with machine-level tests (changed-line coverage gate: 65% → 100%)
Includes stacked parent commits: v1.27 ryzen avx2 fma support and the view-sliced parallel matmul cutover
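
To make the scalar fallback concrete, here is a minimal sketch of what a Q4_0×Q8_0 scalar row-dot can look like. It is not the PR's code: the struct names, the function name, and the use of a `float` scale (ggml-style blocks store the scale as fp16) are simplifying assumptions for illustration. The nibble layout (low nibbles hold the first 16 values, high nibbles the last 16, each offset by 8) follows the ggml Q4_0 convention.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified block layouts (scale kept as float for portability).
constexpr std::size_t kBlock = 32;

struct block_q4_0_sketch {
  float d;                                  // per-block scale
  std::array<std::uint8_t, kBlock / 2> qs;  // two 4-bit quants per byte
};

struct block_q8_0_sketch {
  float d;                                  // per-block scale
  std::array<std::int8_t, kBlock> qs;       // 8-bit quants
};

// Scalar dot product of one quantized weight row with one quantized
// activation column, block by block.
inline float row_dot_q4_0_q8_0(const block_q4_0_sketch *w,
                               const block_q8_0_sketch *a,
                               std::size_t n_blocks) {
  assert(w != nullptr && a != nullptr);
  float acc = 0.0f;
  for (std::size_t b = 0; b < n_blocks; ++b) {
    std::int32_t sum = 0;
    for (std::size_t j = 0; j < kBlock / 2; ++j) {
      const int lo = (w[b].qs[j] & 0x0F) - 8;  // values 0..15 of the block
      const int hi = (w[b].qs[j] >> 4) - 8;    // values 16..31 of the block
      sum += lo * a[b].qs[j] + hi * a[b].qs[j + kBlock / 2];
    }
    acc += w[b].d * a[b].d * static_cast<float>(sum);
  }
  return acc;
}
```

The point of the fix described above is that every type the validator accepts needs a branch like this one; otherwise a request can be reported done while `dst` is never written.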

Performance (Ryzen 9 5950X)

Sortformer AMI diarization bench: 20.07s → 2.24s (9×), output checksum unchanged
Kernel bench vs llama.cpp reference lane: q2_k 0.78x, q3_k 1.09x, f32 mul_mat 0.39x (lower = emel faster)

Gates

kernel parity ok, lint clean, changed-line coverage 100%, full test suite green
Known red (documented): diarization bench baseline in snapshots/bench/benchmarks.txt was recorded on the aarch64 host and is unreachable on x86 (needs per-arch baselines — separate decision); snapshots/parity/generation_lfm2_5_230m_q8_0_* baselines were never committed (pre-existing gap)

Note

Low Risk
This diff is planning, generated architecture diagrams, and local Codex path retargeting—no runtime C++ changes in the shown patch—so merge risk to production behavior is low unless unstaged kernel commits ship separately.

Overview
Closes v1.27 Ryzen AVX2/FMA Kernel Support in planning artifacts: PROJECT, STATE, ROADMAP, MILESTONES, and RETROSPECTIVE now record v1.27 as shipped (2026-06-25), no active milestone, and v1.26 moved to archived milestone docs including a passed v1.26-MILESTONE-AUDIT and full v1.26-ROADMAP under .planning/milestones/. New v1.27 audit, requirements archive, and roadmap capture the 13/13 requirement pass, optimized benchmark attribution repair (XBN-01), and unary SML rule-debt cleanup.

Generated kernel_x86_64 architecture docs (markdown + mermaid) are updated to show explicit SML transitions on dispatch_op_mul_mat for q2_K/q3_K/q6_K × q8_K SIMD guards/effects and on dispatch_op_flash_attn_ext for the AVX2/FMA one-chunk path vs shared fallback, matching the v1.27 “NEON-style” host contract + attribution story in planning text.

.codex/config.toml rewrites all GSD agent config_file paths from a developer-local tree to /shared/stateforward/emel.cpp/.codex/agents/....

Reviewed by Cursor Bugbot for commit 4850c04. Bugbot is set up for automated code reviews on this repo.

…n lanes

Parallel matmul cutover:
- Remove ggml-inherited ith/nth thread-partition fields from all kernel op events; tensor views are now the only slice descriptor, with partition policy living solely at the orchestration fork site.
- Add 8-lane view-sliced fork/join matmul routes for prefill (GEMM) and decode (GEMV) behind explicit guards; lane kernels and thread pool are constructed once at backend init, dispatch stays allocation-free.
- Fix thread-pool scheduler fork/join lost-wakeup (Dekker race and destroy-during-release) in the join latch.

Decode wavefront:
- New text/generator/decode_wavefront component (co_sm lane orchestration) with lifecycle tests, focused bench, and eval tool.

Cross-engine comparison lanes:
- parallel_matmul bench gains a ggml reference lane (warm 8-thread ggml threadpool, plain GGUF-native blocks both sides). Evidence at dim 2048: EMEL wins prefill GEMM 0.843x; ggml leads GEMV (q8_0 2.7x, q6_k 2.5x, q4_k 4.2x, f32 9.6x) on per-kernel arithmetic, not orchestration.
- generation bench gains EMEL_BENCH_REFERENCE_THREADS for matched-core end-to-end compares; maintained publication rows stay at 1 thread.
- LFM2.5-230M-Q8_0 fixture wired end to end: fixture registry, workload manifests, preserve_thinking ChatML formatter contract (both EMEL and reference resolvers), 230M strict model contract.
- LFM2 attention-layer layout is now metadata-driven from per-layer lfm2.attention.head_count_kv instead of the hardcoded 1.2B block list, with lifecycle tests for the pattern layout and contradiction rejection.

Matched-thread evidence (230M, 8 threads both sides): steady-state decode within ~1.15x of llama.cpp; end-to-end gap concentrated in a ~443 ms EMEL first-token path (top follow-up), full decomposition in coroutine-plan.md.

Also: coverage lane fixes for gcovr 8.6 (merge-mode-functions separate, atomic profile updates, negative-hits tolerance); benchmark and lint snapshot refreshes pre-approved for this changeset.

Known open items: changed-line coverage for the new parallel decode action structs is below gate (needs lifecycle tests driving those routes); strict LFM2 1.2B x86 lane still blocked on the optimized plain-Q4 kernel.
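
The view-sliced partition idea from this commit can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: `row_view_sketch` and `slice_rows` are hypothetical names, and the even row split is just one possible partition policy. What it shows is that the fork site alone decides the slices, so a lane kernel only ever sees a view and needs no ith/nth fields.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical view descriptor: a contiguous range of output rows.
struct row_view_sketch {
  std::size_t row_begin;
  std::size_t row_count;
};

constexpr std::size_t kLaneCount = 8;  // matches the 8-lane routes above

// Partition policy lives here, at the fork site: rows are split as evenly
// as possible, with the first (total_rows % kLaneCount) lanes taking one
// extra row. The result is a fixed-size array, so no allocation occurs.
inline std::array<row_view_sketch, kLaneCount> slice_rows(std::size_t total_rows) {
  std::array<row_view_sketch, kLaneCount> views{};
  const std::size_t base = total_rows / kLaneCount;
  const std::size_t extra = total_rows % kLaneCount;
  std::size_t begin = 0;
  for (std::size_t lane = 0; lane < kLaneCount; ++lane) {
    const std::size_t count = base + (lane < extra ? 1 : 0);
    views[lane] = row_view_sketch{begin, count};
    begin += count;
  }
  assert(begin == total_rows);  // views cover every row exactly once
  return views;
}
```

Each lane would then run the same matmul kernel on its own view and the caller joins on all lanes before reading the destination.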

…Q8_0) and FMA F32 GEMM

- Port reference AVX2 arithmetic for Q4_K x Q8_K mul_mat with guard-routed sm dispatch and optimized/shared counters
- Add AVX2+FMA row-dot kernels and execute paths for legacy quants Q4_0/Q4_1/Q5_0/Q8_0 x Q8_0
- Add FMA variant of the blocked F32 GEMM, selected by explicit guard when fma_available; AVX2-only path remains as fallback
- Fix latent scalar gap: run_mul_mat validated q4_0/q4_1 but had no compute branch (requests reported done with untouched dst); add scalar branches
- clang-format src/emel/model/data.{cpp,hpp} (unformatted additions from parent branch commit; lint lane had been skipped on hosts without clang-format)

Gates: kernel parity ok, bench snapshot ok, changed-line coverage 97.6%, lint clean. Paritychecker generation-baseline failures are pre-existing (missing snapshots/parity/generation_lfm2_5_230m_q8_0_*).
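
As a rough picture of the guard-selected GEMM route described in this commit, the sketch below picks between an FMA path, an AVX2-only path, and a shared scalar path. All names here (`host_caps_sketch`, `gemm_route`, the guard functions) are hypothetical, and plain `if` statements stand in for the state-machine transition rows the project uses; only the ordering (FMA variant when `fma_available`, AVX2-only as fallback) comes from the commit message.

```cpp
#include <cassert>

// Hypothetical capability flags, assumed to be filled in once at startup.
struct host_caps_sketch {
  bool avx2_available;
  bool fma_available;
};

enum class gemm_route { avx2_fma, avx2_only, shared_scalar };

// Guards are pure predicates over the detected capabilities.
inline bool guard_avx2_fma(const host_caps_sketch &caps) {
  return caps.avx2_available && caps.fma_available;
}

inline bool guard_avx2_only(const host_caps_sketch &caps) {
  return caps.avx2_available && !caps.fma_available;
}

// Route selection: the most specific guard wins, the shared path is last.
inline gemm_route select_f32_gemm_route(const host_caps_sketch &caps) {
  if (guard_avx2_fma(caps)) return gemm_route::avx2_fma;
  if (guard_avx2_only(caps)) return gemm_route::avx2_only;
  return gemm_route::shared_scalar;
}
```

Keeping the guards mutually exclusive means exactly one route fires for any host, which is what lets the optimized/shared counters attribute each dispatch unambiguously.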

…tor AVX2+FMA kernel

- Embeddings, whisper encoder/decoder, and sortformer contexts own an emel::kernel::sm (kernel::any) and dispatch op_mul_mat via process_event; per-domain #if-aarch64/scalar dispatch duplicates deleted. Kernel machine guards now do arch/ISA routing for all these domains.
- Add shared emel::kernel::detect_host_kind() helper in any.hpp
- Fix dense src0 views carrying quantized byte-stride (nb[0]) that kernel validation would reject
- Add execute_avx2_fma_mul_mat_f32_vector_unchecked: dedicated matrix x vector (n==1) AVX2+FMA kernel with guard-routed sm row; the blocked GEMM degenerates to scalar tail for n==1
- x86 sortformer AMI bench: 20.07s (pre-refactor scalar) -> 2.24s, output checksum unchanged vs baseline

Known outstanding (documented, needs user decision):
- bench lane: snapshots/bench/benchmarks.txt baselines were recorded on the aarch64 host (contains kernel/aarch64 entries); x86 cannot meet ARM absolute-ns baselines for diarization_sortformer
- coverage lane: whisper decoder forward-pass lines are a pre-existing coverage hole now gating because the refactor touched them
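
The shape of a dedicated matrix x vector kernel can be sketched in portable C++. This is a scalar stand-in under assumptions, not `execute_avx2_fma_mul_mat_f32_vector_unchecked` itself: the function names are hypothetical, and the eight independent accumulators only mimic the eight f32 lanes of one 256-bit register, with `std::fma` playing the role of the fused multiply-add instruction.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kAccLanes = 8;  // one 256-bit register holds 8 floats

// Dot product of one weight row with the input vector. The main loop keeps
// kAccLanes independent accumulators (the vectorizable part); the remainder
// is handled by a short scalar tail.
inline float row_dot_f32(const float *w, const float *x, std::size_t k) {
  std::array<float, kAccLanes> acc{};
  std::size_t i = 0;
  for (; i + kAccLanes <= k; i += kAccLanes) {
    for (std::size_t lane = 0; lane < kAccLanes; ++lane) {
      acc[lane] = std::fma(w[i + lane], x[i + lane], acc[lane]);
    }
  }
  float sum = 0.0f;
  for (float v : acc) sum += v;  // horizontal reduction
  for (; i < k; ++i) sum = std::fma(w[i], x[i], sum);
  return sum;
}

// dst[r] = dot(W[r, :], x) for a row-major rows x k weight matrix.
inline void mul_mat_f32_vector(const float *w, const float *x, float *dst,
                               std::size_t rows, std::size_t k) {
  assert(w != nullptr && x != nullptr && dst != nullptr);
  for (std::size_t r = 0; r < rows; ++r) {
    dst[r] = row_dot_f32(w + r * k, x, k);
  }
}
```

A blocked GEMM tiles over output columns, so with a single column (n==1) there is nothing left to tile and the work ends up in its tail path; vectorizing along the shared dimension, as above, is what a dedicated n==1 kernel is for.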

…matrix matmul path

- Encoder/decoder lifecycle tests now run q4_0, q4_1, and q8_0+f32-aux weight variants end-to-end through the machines with the tiny fixture, exercising compute_decoder_cross_cache, run_decoder_layer_sequence, the logits paths, and the kernel-machine dispatch wiring per variant
- Root cause: the changed-line coverage gate keys records by line number, so never-executed template instantiations shadowed covered q8_0 records
- Add embeddings matmul_f32_matrix multi-column test and pointwise lane-rejection test
- sortformer_bench: thread the kernel machine into the stage-profile compute_speaker_probabilities call (one-time static construction)

Coverage gate: changed-line 83/83 (100%), branches 55.8%, exit 0. Speech shard 50/50 cases green.
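
The shadowing effect named as the root cause can be illustrated with a small model of a line-keyed coverage index. This is only one plausible mechanism, sketched under assumptions: the record type and both indexing functions are hypothetical, and the actual gate's data model is not shown in the PR. It demonstrates how a zero-hit record from an instantiation no test runs can hide hits recorded for the same source line by another instantiation.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical coverage record: one per (template instantiation, line).
struct line_record_sketch {
  int line;
  long hits;
};

// Keyed purely by line number with last-writer-wins semantics: a later
// zero-hit record replaces an earlier covered one.
inline std::map<int, long> index_last_wins(const std::vector<line_record_sketch> &recs) {
  std::map<int, long> by_line;
  for (const auto &r : recs) by_line[r.line] = r.hits;
  return by_line;
}

// Alternative that sums hits across records sharing a line.
inline std::map<int, long> index_summed(const std::vector<line_record_sketch> &recs) {
  std::map<int, long> by_line;
  for (const auto &r : recs) by_line[r.line] += r.hits;
  return by_line;
}

inline int covered_lines(const std::map<int, long> &by_line) {
  int covered = 0;
  for (const auto &entry : by_line) covered += entry.second > 0 ? 1 : 0;
  return covered;
}
```

The commit resolves this from the test side, by driving every weight-variant instantiation so that no record for a changed line has zero hits.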

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4912bab.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4912bab5bb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…ont lanes inline

- CMake: compiler flag acceptance alone allowed -mavx2/-mfma/-mf16c into every consumer TU on x86_64 hosts that cannot execute AVX2, bypassing the runtime CPUID guards (SIGILL). Add a check_cxx_source_runs cpuid/xgetbv probe (no libgcc __cpu_model dependency, works under zig c++) so host-tuned codegen is only enabled where it runs.
- decode_wavefront: a rejected try_submit (queue full or caller is a pool worker) discarded the lane; it now runs inline per the documented backpressure contract used by the parallel matmul helper.

Addresses PR #89 review findings.
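
A cpuid/xgetbv probe of the kind this commit describes might look like the sketch below. The function name is hypothetical and this is not the probe from the repository's CMakeLists.txt; it assumes a GCC- or Clang-compatible compiler that ships `<cpuid.h>`, and it reports `false` on any other target. The bit positions are the architectural ones: CPUID leaf 1 ECX bits 12 (FMA), 27 (OSXSAVE), 28 (AVX), 29 (F16C); CPUID leaf 7 subleaf 0 EBX bit 5 (AVX2); XCR0 bits 1 and 2 (OS saves XMM and YMM state).

```cpp
#include <cassert>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

// True only if the CPU reports AVX2, FMA and F16C *and* the OS has enabled
// YMM state saving. Avoids __builtin_cpu_supports, so it carries no libgcc
// __cpu_model dependency.
inline bool host_can_run_avx2_fma_f16c() {
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
  const bool fma = (ecx & (1u << 12)) != 0;
  const bool osxsave = (ecx & (1u << 27)) != 0;
  const bool avx = (ecx & (1u << 28)) != 0;
  const bool f16c = (ecx & (1u << 29)) != 0;
  if (!osxsave || !avx) return false;  // xgetbv would fault without OSXSAVE

  unsigned xcr0_lo = 0, xcr0_hi = 0;
  __asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0u));
  if ((xcr0_lo & 0x6u) != 0x6u) return false;  // XMM and YMM state enabled

  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
  const bool avx2 = (ebx & (1u << 5)) != 0;
  return avx2 && fma && f16c;
}
#else
// Non-x86 targets never enable the x86 host-tuned flags.
inline bool host_can_run_avx2_fma_f16c() { return false; }
#endif
```

For use with check_cxx_source_runs, the probe program's `main` would return 0 when this function is true and non-zero otherwise, so the flags are enabled only on hosts where the probe actually runs to completion.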

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd3c26b5e0

The negated operand is always a quantize_row_q8_0_strided activation block (clamped to [-127,127]); weight lanes may hold -128 but are only abs'd, where the u8 reinterpretation is exact. Addresses PR #89 review finding.
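
The asymmetry between the two operands can be checked with a scalar emulation of the usual 8-bit sign trick: take the absolute value of one operand as an unsigned byte, move its sign onto the other operand, then multiply. The helper names below are hypothetical and the code only emulates 8-bit wraparound in plain C++; it is a sketch of the arithmetic argument, not the kernel.

```cpp
#include <cassert>
#include <cstdint>

// |x| reinterpreted as an unsigned byte. Exact for every int8 value,
// including -128, which becomes 128.
inline std::uint8_t abs_as_u8(std::int8_t x) {
  return static_cast<std::uint8_t>(x < 0 ? -static_cast<int>(x) : static_cast<int>(x));
}

// Gives y the sign of x with 8-bit wraparound: negating -128 cannot be
// represented in int8 and stays -128.
inline std::int8_t apply_sign_wrapping(std::int8_t y, std::int8_t x) {
  if (x == 0) return 0;
  if (x > 0) return y;
  const int negated = -static_cast<int>(y);
  return static_cast<std::int8_t>(negated == 128 ? -128 : negated);
}

// weight * activation computed as |weight| (unsigned) times the
// sign-adjusted activation (signed).
inline int product_via_sign_trick(std::int8_t weight, std::int8_t activation) {
  return static_cast<int>(abs_as_u8(weight)) *
         static_cast<int>(apply_sign_wrapping(activation, weight));
}
```

With a weight of -128 and an activation of 5 the result is the correct -640, because the weight is only abs'd. With a weight of -3 and an activation of -128 the trick yields -384 instead of 384, which is why the negated operand has to be the activation block clamped to [-127,127].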

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: adbe964bd3

guard_valid_compute_reserved accepted any well-formed compute_reserved as long as some reservation existed, so a stale or mismatched request could execute one tensor lifecycle against another reservation's topology. The guard now requires the request manifest's tensor bindings (pointer and count) to match the reservation's; per-phase manifests over the same binding set remain accepted. Regression test added. Addresses PR #89 review finding.
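
A minimal sketch of the tightened check, assuming hypothetical manifest and reservation types (the real guard's signature and event types are not shown in the PR text): the request is accepted only if its tensor bindings are the same pointer and count as the reservation's, while the phase is deliberately not compared so that per-phase manifests over one binding set still pass.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the real graph types.
struct tensor_binding_sketch {
  int tensor_id;
};

struct manifest_sketch {
  const tensor_binding_sketch *bindings;
  std::size_t binding_count;
  int phase;  // differs between per-phase manifests; not part of the match
};

struct reservation_sketch {
  const tensor_binding_sketch *bindings;
  std::size_t binding_count;
};

// Accept only when a reservation exists AND the request refers to the same
// binding set (pointer identity plus count).
inline bool guard_bindings_match(const manifest_sketch &request,
                                 const reservation_sketch *reservation) {
  return reservation != nullptr &&
         request.bindings == reservation->bindings &&
         request.binding_count == reservation->binding_count;
}
```

Comparing pointer and count keeps the guard constant-time and allocation-free, at the cost of rejecting a request that rebuilt an identical binding array at a different address.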

gabewillen · 2026-07-02T14:20:47Z

@codex

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82dbe475c7

guard_parallel_dispatch scheduled every compatible lane on the pool without checking graph actor distinctness; two lanes sharing one emel::graph::sm would process_event concurrently and break the RTC single-writer contract. Parallel dispatch now requires pairwise-distinct graph refs (statically bounded scan); duplicates fall to the serial path, same algorithm and output. Regression test asserts the serial route (no pool scheduling) and correct completion for shared-actor lanes. Addresses PR #89 review finding (P1).
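
The distinctness requirement can be sketched as a bounded pairwise scan over the lanes' graph pointers. The types, the function name, and the lane bound of 8 are assumptions for illustration; only the rule itself (parallel dispatch requires pairwise-distinct graph refs, duplicates take the serial path) comes from the commit message.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a graph actor that must have a single writer.
struct graph_actor_sketch {};

constexpr std::size_t kMaxLanes = 8;  // assumed static bound on lane count

struct lane_set_sketch {
  std::array<const graph_actor_sketch *, kMaxLanes> graphs{};
  std::size_t count = 0;
};

// True when no two active lanes refer to the same graph actor. The scan is
// at most kMaxLanes * (kMaxLanes - 1) / 2 comparisons, so its cost is
// statically bounded and it needs no allocation.
inline bool guard_graphs_pairwise_distinct(const lane_set_sketch &lanes) {
  if (lanes.count > kMaxLanes) return false;
  for (std::size_t i = 0; i < lanes.count; ++i) {
    for (std::size_t j = i + 1; j < lanes.count; ++j) {
      if (lanes.graphs[i] == lanes.graphs[j]) return false;
    }
  }
  return true;
}
```

When the guard fails, the lanes are processed one after another on the calling thread, so two lanes never call `process_event` on the same actor concurrently.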

gabewillen · 2026-07-02T14:39:09Z

@codex

gabewillen · 2026-07-02T14:55:49Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4850c046c1

gabewillen added 6 commits July 2, 2026 13:20

feat: ship v1.27 ryzen avx2 fma support

a6b60b3

chore: refresh quality gates timing data

4912bab

Copilot AI review requested due to automatic review settings July 2, 2026 13:22

Copilot AI reviewed Jul 2, 2026

cursor Bot reviewed Jul 2, 2026

Comment thread .codex/config.toml

chatgpt-codex-connector Bot reviewed Jul 2, 2026

Comment thread CMakeLists.txt

Comment thread src/emel/text/generator/decode_wavefront/actions.hpp Outdated

chatgpt-codex-connector Bot reviewed Jul 2, 2026

Comment thread src/emel/kernel/x86_64/actions.hpp

chatgpt-codex-connector Bot reviewed Jul 2, 2026

Comment thread src/emel/graph/guards.hpp

chatgpt-codex-connector Bot reviewed Jul 2, 2026

Comment thread src/emel/text/generator/decode_wavefront/actions.hpp

chatgpt-codex-connector Bot reviewed Jul 2, 2026

Comment thread CMakeLists.txt

gabewillen merged commit c0a3918 into main Jul 2, 2026
1 check passed

gabewillen deleted the x86-avx2-quant-kernels-v2 branch July 2, 2026 15:55

2 participants
