arena: recast ToT outer-contraction scale as strided BLAS GEMM #557
Merged
The ToT "scale" outer contraction -- result[m,n;a] += sum_k left[m,k;a] * right[k,n], and the symmetric left-plain T*ToT -- was executed as a triple loop of per-cell AXPYs through the type-erased element op, which is memory/dispatch bound on tiny inner cells. Recast each row (resp. column) as one strided BLAS GEMM directly on the arena slab (zero-copy): the inner cell extent becomes the GEMM's N dimension and the SIMD-padded inter-cell stride is the leading dimension. Guards keep it correct: applies only for NoTranspose, matching scalar type, and "clean" rows/columns -- all cells present, uniform inner size, and laid out as one contiguous single-page arena run (verified by constant stride across all cells). Anything else (multi-page/uncompacted tiles, ragged inner sizes, transposes) falls back to the per-cell AXPY loop. ~1.7x faster on the scale operation itself (the dominant ToT outer-contraction work in CSV-CC). Validated end-to-end: PNO-CCSD correlation energy bit-identical to the AXPY path.
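Written out for one fixed outer row m, the recast described above amounts to the following. The matrix names A, B, C, the inner extent n_a, and the base addresses b_0, c_0 are notation introduced here for illustration; they are not taken from the patch.

```latex
\begin{align*}
% Strided views of one outer row m; a runs over the inner cell extent n_a
C_{n a} &\equiv \mathrm{result}[m,n;a], &
B_{k a} &\equiv \mathrm{left}[m,k;a], &
A_{n k} &\equiv \mathrm{right}[k,n] \\[4pt]
% The per-cell AXPY triple loop collapses into one matrix product
C_{n a} &\mathrel{+}= \sum_{k} A_{n k}\, B_{k a}
  && \text{(one GEMM whose } N \text{ dimension is } n_a\text{)} \\[4pt]
% Zero-copy addressing on the slab: the padded inter-cell stride is the leading dimension
\mathrm{addr}(B_{k a}) &= b_0 + k\,\mathrm{ldb} + a, &
\mathrm{addr}(C_{n a}) &= c_0 + n\,\mathrm{ldc} + a, &
\mathrm{ldb},\ \mathrm{ldc} &\ge n_a
\end{align*}
```

The addressing only holds when consecutive cells are evenly spaced, which is why the constant-stride check is part of the guard.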
The ToT scale outer contraction, `result[m,n;a] += Σ_k left[m,k;a]·right[k,n]` (left ToT, right plain scalar), and the symmetric left-plain `T*ToT`, was executed as a triple loop of per-cell AXPYs through the type-erased element op. On tiny inner cells that is memory/dispatch-bound.

This recasts each row (resp. column) as one strided BLAS GEMM directly on the arena slab (zero-copy): the inner-cell extent becomes the GEMM's N dimension, and the SIMD-padded inter-cell stride is the leading dimension (`ldb`/`ldc`).

Correctness guards (the fast path applies only when all of these hold):
- the op is `NoTranspose`, and the scalar types match;
- the row/column is "clean": all cells are present and share a uniform inner size;
- the cells are laid out as one contiguous single-page arena run (verified by constant stride across all cells).

Anything else (multi-page/uncompacted tiles, ragged inner sizes, transposes) falls back to the existing per-cell AXPY loop, so results are unchanged.
Perf: ~1.7× faster on the scale operation itself (the dominant ToT outer-contraction work in CSV-CC / PNO-CCSD). The whole-job wall-time gain is smaller: the scale is bandwidth-bound, and parallel scaling, not the per-op kernel, is the current ceiling.
Validation: end-to-end PNO-CCSD correlation energy is bit-identical to the AXPY path (with the fast path 100% triggered after the consumer compacts its coefficient tiles). A dedicated unit test for the scale-via-gemm path can be added as a follow-up; happy to include one if preferred.