arena: recast ToT outer-contraction scale as strided BLAS GEMM #557

Merged
evaleev merged 1 commit into master from evaleev/feature/tot-scale-strided-gemm
May 26, 2026

evaleev (Member) commented May 26, 2026

The ToT "scale" outer contraction, result[m,n;a] += Σ_k left[m,k;a]·right[k,n] (left operand a tensor-of-tensors, right operand a plain tensor of scalars), together with the symmetric left-plain case T*ToT, was executed as a triple loop of per-cell AXPYs through the type-erased element op. On tiny inner cells, that path is memory- and dispatch-bound.

This recasts each row (resp. column) as one strided BLAS GEMM directly on the arena slab (zero-copy): the inner-cell extent becomes the GEMM's N dimension, and the SIMD-padded inter-cell stride is the leading dimension (ldb/ldc).
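
To make the mapping concrete, here is a minimal sketch for one outer row m of the ToT*T case. It is not the code from this PR: `strided_gemm` is a hand-written stand-in for a BLAS GEMM call, and `row_axpy`, `row_gemm`, `paths_agree`, the row-major layout of `right`, and the example sizes are all assumptions made for illustration. The plain operand is addressed through generic strides because the description does not say how it is handed to BLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a BLAS GEMM call (hypothetical helper, not a library API):
//   C[i*ldc + j] += sum_p A[i*a_rs + p*a_cs] * B[p*ldb + j]
void strided_gemm(std::size_t M, std::size_t N, std::size_t K,
                  const double* A, std::size_t a_rs, std::size_t a_cs,
                  const double* B, std::size_t ldb, double* C, std::size_t ldc) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t p = 0; p < K; ++p) {
      const double a = A[i * a_rs + p * a_cs];
      for (std::size_t j = 0; j < N; ++j) C[i * ldc + j] += a * B[p * ldb + j];
    }
}

// Baseline for one outer row m: one AXPY per (n, k) pair of inner cells.
// left_row holds k_outer cells and result_row holds n_outer cells, each cell
// `inner` elements long and `stride` elements apart (stride >= inner).
void row_axpy(std::size_t n_outer, std::size_t k_outer, std::size_t inner,
              std::size_t stride, const double* left_row, const double* right,
              double* result_row) {
  for (std::size_t n = 0; n < n_outer; ++n)
    for (std::size_t k = 0; k < k_outer; ++k) {
      const double r = right[k * n_outer + n];  // right[k,n], assumed row-major
      for (std::size_t a = 0; a < inner; ++a)
        result_row[n * stride + a] += r * left_row[k * stride + a];
    }
}

// Same row as one GEMM: C (n_outer x inner) += right^T (n_outer x k_outer)
// times L (k_outer x inner). The inner extent is the GEMM N dimension and the
// padded inter-cell stride serves as both ldb and ldc.
void row_gemm(std::size_t n_outer, std::size_t k_outer, std::size_t inner,
              std::size_t stride, const double* left_row, const double* right,
              double* result_row) {
  strided_gemm(n_outer, inner, k_outer, right, /*a_rs=*/1, /*a_cs=*/n_outer,
               left_row, /*ldb=*/stride, result_row, /*ldc=*/stride);
}

// Runs both paths on small integer-valued data (so the arithmetic is exact)
// and reports whether the results, padding slots included, are identical.
bool paths_agree() {
  const std::size_t n_outer = 3, k_outer = 4, inner = 5, stride = 8;
  std::vector<double> left_row(k_outer * stride, 0.0), right(k_outer * n_outer);
  for (std::size_t k = 0; k < k_outer; ++k)
    for (std::size_t a = 0; a < inner; ++a)
      left_row[k * stride + a] = static_cast<double>(k + 2 * a + 1);
  for (std::size_t i = 0; i < right.size(); ++i)
    right[i] = static_cast<double>(i % 5) - 2.0;
  std::vector<double> c_axpy(n_outer * stride, 0.0), c_gemm(n_outer * stride, 0.0);
  row_axpy(n_outer, k_outer, inner, stride, left_row.data(), right.data(), c_axpy.data());
  row_gemm(n_outer, k_outer, inner, stride, left_row.data(), right.data(), c_gemm.data());
  return c_axpy == c_gemm;
}
```

In this sketch the padding elements between cells are never read or written, which is why the slab can be used in place. A real BLAS may accumulate in a different order than the reference loops here.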

Correctness guards — the fast path applies only when:

  • both operands are NoTranspose, and the scalar types match;
  • the row/column is clean: all cells present, uniform inner size, and laid out as one contiguous single-page arena run (verified by checking the inter-cell stride is constant across all cells — an incrementally-built/uncompacted multi-page tile fails this).

Anything else (multi-page tiles, ragged inner sizes, transposes) falls back to the existing per-cell AXPY loop, so results are unchanged.
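
The "clean row/column" guard listed above could be sketched roughly as follows. `CellRef`, `clean_run`, and `guard_examples_ok` are hypothetical names invented for this illustration; the arena's real cell representation is not shown in this PR, so treat the fields as assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one inner cell; not the arena's real type.
struct CellRef {
  bool present;        // false for an absent cell
  int page;            // arena page that holds the cell
  std::size_t offset;  // element offset of the cell within its page
  std::size_t size;    // inner-cell extent
};

// Returns true if the run of cells could be handed to one strided GEMM.
// On success `stride` receives the common inter-cell stride.
bool clean_run(const std::vector<CellRef>& cells, std::size_t& stride) {
  if (cells.empty() || !cells[0].present || cells[0].size == 0) return false;
  const CellRef& first = cells[0];
  std::size_t step = first.size;  // a lone cell is trivially uniform
  for (std::size_t i = 1; i < cells.size(); ++i) {
    const CellRef& prev = cells[i - 1];
    const CellRef& cur = cells[i];
    if (!cur.present || cur.size != first.size) return false;  // hole or ragged
    if (cur.page != first.page || cur.offset <= prev.offset)
      return false;  // multi-page run, or cells out of order
    const std::size_t d = cur.offset - prev.offset;
    if (i == 1) step = d;
    if (d != step || d < first.size) return false;  // stride varies or cells overlap
  }
  stride = step;
  return true;
}

// Exercises the guard on one clean run and four runs that must fall back.
bool guard_examples_ok() {
  std::size_t s = 0;
  const std::vector<CellRef> clean = {{true, 0, 0, 5}, {true, 0, 8, 5}, {true, 0, 16, 5}};
  const std::vector<CellRef> ragged = {{true, 0, 0, 5}, {true, 0, 8, 4}, {true, 0, 16, 5}};
  const std::vector<CellRef> paged = {{true, 0, 0, 5}, {true, 1, 0, 5}};
  const std::vector<CellRef> holed = {{true, 0, 0, 5}, {false, 0, 0, 0}, {true, 0, 16, 5}};
  const std::vector<CellRef> uneven = {{true, 0, 0, 5}, {true, 0, 8, 5}, {true, 0, 24, 5}};
  return clean_run(clean, s) && s == 8 && !clean_run(ragged, s) &&
         !clean_run(paged, s) && !clean_run(holed, s) && !clean_run(uneven, s);
}
```

Any run that fails the check would take the per-cell AXPY loop, so the guard only decides which path runs and never changes the result.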

Perf: ~1.7× faster on the scale operation itself (the dominant ToT outer-contraction work in CSV-CC / PNO-CCSD). The whole-job wall-time gain is smaller (the scale is bandwidth-bound and parallel scaling, not the per-op kernel, is the current ceiling).

Validation: end-to-end PNO-CCSD correlation energy is bit-identical to the AXPY path (with the fast path 100% triggered after the consumer compacts its coefficient tiles). A dedicated unit test for the scale-via-gemm path can be added as follow-up — happy to include one if preferred.

The ToT "scale" outer contraction -- result[m,n;a] += sum_k left[m,k;a] *
right[k,n], and the symmetric left-plain T*ToT -- was executed as a triple
loop of per-cell AXPYs through the type-erased element op, which is
memory/dispatch bound on tiny inner cells. Recast each row (resp. column)
as one strided BLAS GEMM directly on the arena slab (zero-copy): the inner
cell extent becomes the GEMM's N dimension and the SIMD-padded inter-cell
stride is the leading dimension.

Guards keep it correct: applies only for NoTranspose, matching scalar
type, and "clean" rows/columns -- all cells present, uniform inner size,
and laid out as one contiguous single-page arena run (verified by constant
stride across all cells). Anything else (multi-page/uncompacted tiles,
ragged inner sizes, transposes) falls back to the per-cell AXPY loop.

~1.7x faster on the scale operation itself (the dominant ToT
outer-contraction work in CSV-CC). Validated end-to-end: PNO-CCSD
correlation energy bit-identical to the AXPY path.

evaleev merged commit db0bff5 into master May 26, 2026
16 of 17 checks passed
evaleev deleted the evaleev/feature/tot-scale-strided-gemm branch May 26, 2026 03:33