x86_64 CPU parity: AVX2+FMA quantized matmul kernels and kernel-machine routing #89

Merged
gabewillen merged 10 commits into main from x86-avx2-quant-kernels-v2
Jul 2, 2026
Conversation

@gabewillen gabewillen commented Jul 2, 2026

Resubmission of #87 (reverted in #88 to go through proper review).

Summary

  • Add AVX2+FMA x86_64 kernels: Q4_K×Q8_K matmul, legacy quants Q4_0/Q4_1/Q5_0/Q8_0×Q8_0, FMA F32 blocked GEMM, and a dedicated f32 matrix×vector kernel (the GEMM degenerates to scalar for n==1)
  • Fix latent scalar bug: run_mul_mat validated q4_0/q4_1 but had no compute branch (reported done with untouched dst); scalar branches added
  • Route embeddings, whisper encoder/decoder, and sortformer matmuls through the kernel machine (emel::kernel::any) so sm guards do all arch/ISA routing; per-domain #if aarch64 / scalar dispatch duplicates deleted
  • Cover all whisper linear weight-variant template instantiations with machine-level tests (changed-line coverage gate: 65% → 100%)
  • Includes stacked parent commits: v1.27 ryzen avx2 fma support and the view-sliced parallel matmul cutover

Performance (Ryzen 9 5950X)

  • Sortformer AMI diarization bench: 20.07s → 2.24s (9×), output checksum unchanged
  • Kernel bench vs llama.cpp reference lane: q2_k 0.78x, q3_k 1.09x, f32 mul_mat 0.39x (lower = emel faster)

Gates

  • kernel parity ok, lint clean, changed-line coverage 100%, full test suite green
  • Known red (documented): diarization bench baseline in snapshots/bench/benchmarks.txt was recorded on the aarch64 host and is unreachable on x86 (needs per-arch baselines — separate decision); snapshots/parity/generation_lfm2_5_230m_q8_0_* baselines were never committed (pre-existing gap)

Note

Low Risk
This diff covers planning artifacts, generated architecture diagrams, and local Codex path retargeting, with no runtime C++ changes in the shown patch, so merge risk to production behavior is low unless unstaged kernel commits ship separately.

Overview
Closes v1.27 Ryzen AVX2/FMA Kernel Support in planning artifacts: PROJECT, STATE, ROADMAP, MILESTONES, and RETROSPECTIVE now record v1.27 as shipped (2026-06-25), no active milestone, and v1.26 moved to archived milestone docs including a passed v1.26-MILESTONE-AUDIT and full v1.26-ROADMAP under .planning/milestones/. New v1.27 audit, requirements archive, and roadmap capture the 13/13 requirement pass, optimized benchmark attribution repair (XBN-01), and unary SML rule-debt cleanup.

Generated kernel_x86_64 architecture docs (markdown + mermaid) are updated to show explicit SML transitions on dispatch_op_mul_mat for q2_K/q3_K/q6_K × q8_K SIMD guards/effects and on dispatch_op_flash_attn_ext for the AVX2/FMA one-chunk path vs shared fallback, matching the v1.27 “NEON-style” host contract + attribution story in planning text.

.codex/config.toml rewrites all GSD agent config_file paths from a developer-local tree to /shared/stateforward/emel.cpp/.codex/agents/....

Reviewed by Cursor Bugbot for commit 4850c04.

…n lanes

Parallel matmul cutover:
- Remove ggml-inherited ith/nth thread-partition fields from all kernel op
  events; tensor views are now the only slice descriptor, with partition
  policy living solely at the orchestration fork site.
- Add 8-lane view-sliced fork/join matmul routes for prefill (GEMM) and
  decode (GEMV) behind explicit guards; lane kernels and thread pool are
  constructed once at backend init, dispatch stays allocation-free.
- Fix thread-pool scheduler fork/join lost-wakeup (Dekker race and
  destroy-during-release) in the join latch.
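
To illustrate the view-sliced partitioning described above, here is a minimal sketch of how a fork site might cut a matrix's rows into 8 contiguous views. The `row_view` type and `slice_rows` helper are hypothetical names for illustration; the PR's actual tensor-view type and partition policy are not shown in this conversation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical row-range view; the PR's real tensor-view type is not shown.
struct row_view {
  std::size_t first_row;
  std::size_t row_count;
};

constexpr std::size_t k_lanes = 8;  // lane count taken from the commit message

// Split `rows` into k_lanes contiguous views whose sizes differ by at most one.
// When rows < k_lanes, the trailing lanes receive empty views.
inline std::array<row_view, k_lanes> slice_rows(std::size_t rows) {
  std::array<row_view, k_lanes> views{};
  const std::size_t base = rows / k_lanes;
  const std::size_t extra = rows % k_lanes;
  std::size_t cursor = 0;
  for (std::size_t lane = 0; lane < k_lanes; ++lane) {
    const std::size_t count = base + (lane < extra ? 1 : 0);
    views[lane] = row_view{cursor, count};
    cursor += count;
  }
  assert(cursor == rows);  // every row is assigned to exactly one lane
  return views;
}
```

Because each lane receives only its own view, a kernel needs no `ith`/`nth` arguments to find its slice; the split is decided once at the call site.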

Decode wavefront:
- New text/generator/decode_wavefront component (co_sm lane orchestration)
  with lifecycle tests, focused bench, and eval tool.

Cross-engine comparison lanes:
- parallel_matmul bench gains a ggml reference lane (warm 8-thread ggml
  threadpool, plain GGUF-native blocks both sides). Evidence at dim 2048:
  EMEL wins prefill GEMM 0.843x; ggml leads GEMV (q8_0 2.7x, q6_k 2.5x,
  q4_k 4.2x, f32 9.6x) on per-kernel arithmetic, not orchestration.
- generation bench gains EMEL_BENCH_REFERENCE_THREADS for matched-core
  end-to-end compares; maintained publication rows stay at 1 thread.
- LFM2.5-230M-Q8_0 fixture wired end to end: fixture registry, workload
  manifests, preserve_thinking ChatML formatter contract (both EMEL and
  reference resolvers), 230M strict model contract.
- LFM2 attention-layer layout is now metadata-driven from per-layer
  lfm2.attention.head_count_kv instead of the hardcoded 1.2B block list,
  with lifecycle tests for the pattern layout and contradiction rejection.

Matched-thread evidence (230M, 8 threads both sides): steady-state decode
within ~1.15x of llama.cpp; end-to-end gap concentrated in a ~443 ms EMEL
first-token path (top follow-up), full decomposition in coroutine-plan.md.

Also: coverage lane fixes for gcovr 8.6 (merge-mode-functions separate,
atomic profile updates, negative-hits tolerance); benchmark and lint
snapshot refreshes pre-approved for this changeset.

Known open items: changed-line coverage for the new parallel decode action
structs is below gate (needs lifecycle tests driving those routes); strict
LFM2 1.2B x86 lane still blocked on the optimized plain-Q4 kernel.

…Q8_0) and FMA F32 GEMM

- Port reference AVX2 arithmetic for Q4_K x Q8_K mul_mat with guard-routed
  sm dispatch and optimized/shared counters
- Add AVX2+FMA row-dot kernels and execute paths for legacy quants
  Q4_0/Q4_1/Q5_0/Q8_0 x Q8_0
- Add FMA variant of the blocked F32 GEMM, selected by explicit guard when
  fma_available; AVX2-only path remains as fallback
- Fix latent scalar gap: run_mul_mat validated q4_0/q4_1 but had no compute
  branch (requests reported done with untouched dst); add scalar branches
- clang-format src/emel/model/data.{cpp,hpp} (unformatted additions from
  parent branch commit; lint lane had been skipped on hosts without
  clang-format)
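
For readers unfamiliar with the legacy quant formats, the following is a rough sketch of what a scalar Q4_0 x Q8_0 row dot product computes. It is not the PR's `run_mul_mat` branch. The block structs are simplified assumptions: the GGUF formats store the scale as fp16, while a plain `float` is used here to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int k_block = 32;  // elements per block in the Q4_0 / Q8_0 layouts

// Simplified block layouts (assumption: float scale instead of fp16).
struct block_q4_0 {
  float d;
  std::uint8_t qs[k_block / 2];  // two 4-bit quants per byte, stored with a bias of 8
};
struct block_q8_0 {
  float d;
  std::int8_t qs[k_block];
};

// Scalar dot product of one quantized weight row with one quantized activation row.
inline float dot_q4_0_q8_0(const block_q4_0* w, const block_q8_0* a, std::size_t blocks) {
  assert(w != nullptr && a != nullptr);
  float sum = 0.0f;
  for (std::size_t b = 0; b < blocks; ++b) {
    std::int32_t acc = 0;
    for (int j = 0; j < k_block / 2; ++j) {
      // Low nibble holds element j, high nibble holds element j + 16.
      const int lo = (w[b].qs[j] & 0x0F) - 8;
      const int hi = (w[b].qs[j] >> 4) - 8;
      acc += lo * a[b].qs[j] + hi * a[b].qs[j + k_block / 2];
    }
    // Integer accumulation per block, then one scale multiply.
    sum += static_cast<float>(acc) * w[b].d * a[b].d;
  }
  return sum;
}
```

The missing-branch bug described above would correspond to a dispatcher that accepts these types during validation but never calls a routine like this, leaving `dst` unwritten.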

Gates: kernel parity ok, bench snapshot ok, changed-line coverage 97.6%,
lint clean. Paritychecker generation-baseline failures are pre-existing
(missing snapshots/parity/generation_lfm2_5_230m_q8_0_*).

…tor AVX2+FMA kernel

- Embeddings, whisper encoder/decoder, and sortformer contexts own an
  emel::kernel::sm (kernel::any) and dispatch op_mul_mat via
  process_event; per-domain #if-aarch64/scalar dispatch duplicates deleted.
  Kernel machine guards now do arch/ISA routing for all these domains.
- Add shared emel::kernel::detect_host_kind() helper in any.hpp
- Fix dense src0 views carrying quantized byte-stride (nb[0]) that kernel
  validation would reject
- Add execute_avx2_fma_mul_mat_f32_vector_unchecked: dedicated matrix x
  vector (n==1) AVX2+FMA kernel with guard-routed sm row; the blocked GEMM
  degenerates to scalar tail for n==1
- x86 sortformer AMI bench: 20.07s (pre-refactor scalar) -> 2.24s, output
  checksum unchanged vs baseline
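
As a rough illustration of why a dedicated matrix x vector path helps when n==1, the sketch below vectorizes along the shared dimension of each row instead of tiling across output columns. It is not the body of `execute_avx2_fma_mul_mat_f32_vector_unchecked`; all function names here are hypothetical, and the runtime check uses the GCC/Clang builtin `__builtin_cpu_supports` rather than the kernel machine's guards.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAS_AVX2_FMA 1
#else
#define SKETCH_HAS_AVX2_FMA 0
#endif

// Portable reference: dst[r] = dot(row r of `a`, x). `a` is row-major, rows x cols.
inline void matvec_scalar(const float* a, const float* x, float* dst,
                          std::size_t rows, std::size_t cols) {
  for (std::size_t r = 0; r < rows; ++r) {
    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) sum += a[r * cols + c] * x[c];
    dst[r] = sum;
  }
}

#if SKETCH_HAS_AVX2_FMA
// One row at a time, 8 floats per FMA, scalar tail for cols % 8.
__attribute__((target("avx2,fma")))
inline void matvec_avx2_fma(const float* a, const float* x, float* dst,
                            std::size_t rows, std::size_t cols) {
  for (std::size_t r = 0; r < rows; ++r) {
    const float* row = a + r * cols;
    __m256 acc = _mm256_setzero_ps();
    std::size_t c = 0;
    for (; c + 8 <= cols; c += 8) {
      acc = _mm256_fmadd_ps(_mm256_loadu_ps(row + c), _mm256_loadu_ps(x + c), acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    float sum = _mm_cvtss_f32(s);
    for (; c < cols; ++c) sum += row[c] * x[c];
    dst[r] = sum;
  }
}
#endif

// Runtime guard: take the SIMD path only when the CPU reports AVX2 and FMA.
inline void matvec(const float* a, const float* x, float* dst,
                   std::size_t rows, std::size_t cols) {
  assert(a != nullptr && x != nullptr && dst != nullptr);
#if SKETCH_HAS_AVX2_FMA
  if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) {
    matvec_avx2_fma(a, x, dst, rows, cols);
    return;
  }
#endif
  matvec_scalar(a, x, dst, rows, cols);
}
```

Note that the SIMD path sums in a different order than the scalar loop, so results can differ in the last bits for general inputs; a parity check would normally compare within a tolerance.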

Known outstanding (documented, needs user decision):
- bench lane: snapshots/bench/benchmarks.txt baselines were recorded on the
  aarch64 host (contains kernel/aarch64 entries); x86 cannot meet ARM
  absolute-ns baselines for diarization_sortformer
- coverage lane: whisper decoder forward-pass lines are a pre-existing
  coverage hole now gating because the refactor touched them

…matrix matmul path

- Encoder/decoder lifecycle tests now run q4_0, q4_1, and q8_0+f32-aux
  weight variants end-to-end through the machines with the tiny fixture,
  exercising compute_decoder_cross_cache, run_decoder_layer_sequence, the
  logits paths, and the kernel-machine dispatch wiring per variant
- Root cause: the changed-line coverage gate keys records by line number,
  so never-executed template instantiations shadowed covered q8_0 records
- Add embeddings matmul_f32_matrix multi-column test and pointwise
  lane-rejection test
- sortformer_bench: thread the kernel machine into the stage-profile
  compute_speaker_probabilities call (one-time static construction)
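
The shadowing described in the root-cause bullet can be pictured with a small model. This is one plausible reading, not the gate's actual code: `line_record` and both merge functions are hypothetical. The PR resolved the issue by adding tests for every weight variant rather than by changing how the gate merges records.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical coverage record: one entry per (template instantiation, line).
struct line_record {
  int line;
  long hits;
};

// Keyed by line only, later records overwrite earlier ones, so an
// instantiation that never ran can hide one that did.
inline std::map<int, long> merge_last_wins(const std::vector<line_record>& records) {
  std::map<int, long> by_line;
  for (const line_record& r : records) by_line[r.line] = r.hits;
  return by_line;
}

// Alternative merge: a line counts as covered if any instantiation executed it.
inline std::map<int, long> merge_max(const std::vector<line_record>& records) {
  std::map<int, long> by_line;
  for (const line_record& r : records) {
    assert(r.hits >= 0);
    long& slot = by_line[r.line];  // value-initialized to 0 on first sight
    slot = std::max(slot, r.hits);
  }
  return by_line;
}
```

With records `{42, 7}` (the q8_0 instantiation) followed by `{42, 0}` (a variant no test drives), the first merge reports line 42 as uncovered while the second reports 7 hits.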

Coverage gate: changed-line 83/83 (100%), branches 55.8%, exit 0.
Speech shard 50/50 cases green.

Copilot AI review requested due to automatic review settings July 2, 2026 13:22

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4912bab.

Comment thread .codex/config.toml

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4912bab5bb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread CMakeLists.txt
Comment thread src/emel/text/generator/decode_wavefront/actions.hpp Outdated
…ont lanes inline

- CMake: compiler flag acceptance alone allowed -mavx2/-mfma/-mf16c into
  every consumer TU on x86_64 hosts that cannot execute AVX2, bypassing the
  runtime CPUID guards (SIGILL). Add a check_cxx_source_runs cpuid/xgetbv
  probe (no libgcc __cpu_model dependency, works under zig c++) so
  host-tuned codegen is only enabled where it runs.
- decode_wavefront: a rejected try_submit (queue full or caller is a pool
  worker) discarded the lane; it now runs inline per the documented
  backpressure contract used by the parallel matmul helper.
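
The CMake bullet above describes a run-time probe built from cpuid and xgetbv. The program handed to `check_cxx_source_runs` is not shown in this conversation, so the following is only a sketch of the checks such a probe typically performs. The function names are hypothetical; the bit positions are the architectural ones (CPUID leaf 1 ECX bits 12/27/28/29, leaf 7 EBX bit 5, XCR0 bits 1 and 2).

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#endif

// Pure decision function: given raw CPUID/XCR0 values, can AVX2+FMA+F16C code run?
// ecx1 = CPUID.(EAX=1):ECX, ebx7 = CPUID.(EAX=7,ECX=0):EBX, xcr0 = XGETBV(0).
inline bool avx2_fma_f16c_usable(std::uint32_t ecx1, std::uint32_t ebx7, std::uint64_t xcr0) {
  const bool fma = ((ecx1 >> 12) & 1u) != 0;
  const bool osxsave = ((ecx1 >> 27) & 1u) != 0;
  const bool avx = ((ecx1 >> 28) & 1u) != 0;
  const bool f16c = ((ecx1 >> 29) & 1u) != 0;
  const bool avx2 = ((ebx7 >> 5) & 1u) != 0;
  // The OS must enable both SSE (bit 1) and AVX (bit 2) state in XCR0.
  const bool os_ymm = (xcr0 & 0x6u) == 0x6u;
  return fma && osxsave && avx && f16c && avx2 && os_ymm;
}

// Host probe built on <cpuid.h> and inline asm only. Returns false off x86_64.
inline bool host_runs_avx2_fma_f16c() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
  const std::uint32_t ecx1 = ecx;
  if ((ecx1 & (1u << 27)) == 0) return false;  // XGETBV is only legal with OSXSAVE
  unsigned ebx7 = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx7, &ecx, &edx)) return false;
  unsigned lo = 0, hi = 0;
  __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
  const std::uint64_t xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
  return avx2_fma_f16c_usable(ecx1, ebx7, xcr0);
#else
  return false;
#endif
}
```

A probe program would wrap this as `int main() { return host_runs_avx2_fma_f16c() ? 0 : 1; }`, since `check_cxx_source_runs` treats a zero exit status as success.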

Addresses PR #89 review findings.
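
The decode_wavefront fix amounts to a submit-or-run-inline rule. A minimal single-threaded model is sketched below. `bounded_queue` and `submit_or_run_inline` are hypothetical stand-ins, and `std::function` is used only for brevity (the PR describes its dispatch as allocation-free, which this sketch is not).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical bounded queue standing in for the thread pool's submission side.
class bounded_queue {
 public:
  explicit bounded_queue(std::size_t capacity) : capacity_(capacity) {}

  // Returns false when the queue is full; the task is not stored in that case.
  bool try_submit(const std::function<void()>& task) {
    if (pending_.size() >= capacity_) return false;
    pending_.push_back(task);
    return true;
  }

  // Stand-in for worker threads: run everything that was accepted.
  void drain() {
    for (auto& task : pending_) task();
    pending_.clear();
  }

 private:
  std::size_t capacity_;
  std::vector<std::function<void()>> pending_;
};

// Backpressure rule: a rejected lane is executed by the caller instead of dropped.
inline void submit_or_run_inline(bounded_queue& queue, const std::function<void()>& lane) {
  assert(static_cast<bool>(lane));
  if (!queue.try_submit(lane)) lane();
}
```

With a capacity of one, the first lane is queued and the second runs immediately on the caller, so no lane's work is lost.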

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd3c26b5e0

Comment thread src/emel/kernel/x86_64/actions.hpp

The negated operand is always a quantize_row_q8_0_strided activation block
(clamped to [-127,127]); weight lanes may hold -128 but are only abs'd,
where the u8 reinterpretation is exact. Addresses PR #89 review finding.
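
The reasoning above can be checked with a scalar model of one byte lane. The sketch assumes the common AVX2 int8 dot-product idiom (`_mm256_sign_epi8` followed by `_mm256_maddubs_epi16`); whether the kernel under review uses exactly this sequence is not shown in the conversation, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of the idiom:
//   ax = _mm256_sign_epi8(x, x)   -> |x|, later read as unsigned
//   sy = _mm256_sign_epi8(y, x)   -> y negated where x < 0, zeroed where x == 0
//   _mm256_maddubs_epi16(ax, sy)  -> u8 * s8 products
inline std::uint8_t abs_as_u8(std::int8_t x) {
  // |-128| = 128 does not fit in int8 but is exact once the byte is read as u8.
  return static_cast<std::uint8_t>(x < 0 ? -static_cast<int>(x) : static_cast<int>(x));
}

inline std::int8_t sign_apply(std::int8_t y, std::int8_t x) {
  if (x == 0) return 0;
  if (x > 0) return y;
  // 8-bit two's-complement negate: -(-128) wraps back to -128.
  return static_cast<std::int8_t>(static_cast<std::uint8_t>(0u - static_cast<std::uint8_t>(y)));
}

// Product of one weight lane x and one activation lane y as the idiom computes it.
inline int lane_product(std::int8_t x, std::int8_t y) {
  return static_cast<int>(abs_as_u8(x)) * static_cast<int>(sign_apply(y, x));
}
```

A weight of -128 against an activation of 127 yields the exact product, whereas an activation of -128 with a negative weight would come out with the wrong sign. That second case is why the clamp of activations to [-127, 127] matters.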

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: adbe964bd3

Comment thread src/emel/graph/guards.hpp

guard_valid_compute_reserved accepted any well-formed compute_reserved as
long as some reservation existed, so a stale or mismatched request could
execute one tensor lifecycle against another reservation's topology. The
guard now requires the request manifest's tensor bindings (pointer and
count) to match the reservation's; per-phase manifests over the same
binding set remain accepted. Regression test added.

Addresses PR #89 review finding.
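
The tightened guard can be pictured as an identity check on the binding set. The types below are hypothetical simplifications, not the real `emel::graph` structures, and only model the "pointer and count must match" rule stated above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shapes; the PR's real manifest and reservation types are not shown.
struct tensor_binding {
  int id;
};

struct manifest {
  const tensor_binding* bindings;
  std::size_t binding_count;
  int phase;  // differs between per-phase manifests over the same bindings
};

struct reservation {
  bool active;
  const tensor_binding* bindings;
  std::size_t binding_count;
};

// Accept a request only if it targets the binding set the reservation was built for.
// The phase is deliberately not compared, so per-phase manifests stay valid.
inline bool guard_bindings_match(const manifest& request, const reservation& reserved) {
  return reserved.active && request.bindings != nullptr &&
         request.bindings == reserved.bindings &&
         request.binding_count == reserved.binding_count;
}
```

Under the old behaviour described above, only the `reserved.active` part was effectively checked, which is how a stale request could run against another reservation's topology.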
@gabewillen

@codex

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82dbe475c7

Comment thread src/emel/text/generator/decode_wavefront/actions.hpp

guard_parallel_dispatch scheduled every compatible lane on the pool without
checking graph actor distinctness; two lanes sharing one emel::graph::sm
would process_event concurrently and break the RTC single-writer contract.
Parallel dispatch now requires pairwise-distinct graph refs (statically
bounded scan); duplicates fall to the serial path, same algorithm and
output. Regression test asserts the serial route (no pool scheduling) and
correct completion for shared-actor lanes.

Addresses PR #89 review finding (P1).
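
A statically bounded pairwise-distinctness scan of the kind described above might look like the following. The lane bound of 8 and the `graph_actor` stand-in are assumptions for illustration; the real guard operates on `emel::graph::sm` references.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t k_max_lanes = 8;  // assumed static bound on lane count

struct graph_actor {};  // stand-in for emel::graph::sm

// True when no two of the first `count` lanes reference the same actor.
// The nested loop is bounded by k_max_lanes, so the scan needs no allocation.
inline bool lanes_have_distinct_graphs(
    const std::array<const graph_actor*, k_max_lanes>& lanes, std::size_t count) {
  assert(count <= k_max_lanes);
  for (std::size_t i = 0; i < count; ++i) {
    for (std::size_t j = i + 1; j < count; ++j) {
      if (lanes[i] == lanes[j]) return false;  // shared actor: take the serial route
    }
  }
  return true;
}
```

At most 28 pointer comparisons are made for 8 lanes, which keeps the guard cheap enough to evaluate on every dispatch.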
@gabewillen

@codex

@gabewillen

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4850c046c1

Comment thread CMakeLists.txt
@gabewillen gabewillen merged commit c0a3918 into main Jul 2, 2026
1 check passed
@gabewillen gabewillen deleted the x86-avx2-quant-kernels-v2 branch July 2, 2026 15:55
2 participants