arena: recast ToT outer-contraction scale as strided BLAS GEMM by evaleev · Pull Request #557 · ValeevGroup/tiledarray

evaleev · 2026-05-26T02:47:33Z

The ToT scale outer contraction, result[m,n;a] += Σ_k left[m,k;a]·right[k,n] (left a ToT, right a plain tensor of scalars), along with the symmetric left-plain T*ToT case, was executed as a triple loop of per-cell AXPYs through the type-erased element op. With tiny inner cells, that loop is bound by memory traffic and dispatch overhead rather than by arithmetic.

This recasts each row (resp. column) as one strided BLAS GEMM directly on the arena slab (zero-copy): the inner-cell extent becomes the GEMM's N dimension, and the SIMD-padded inter-cell stride is the leading dimension (ldb/ldc).
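To make the row recast concrete, here is a minimal sketch for the left-ToT case. It is an illustration under stated assumptions, not the code from this PR: `scale_row_axpy`, `scale_row_gemm`, `gemm_tn_ref`, and the flat `double` slab layout are hypothetical names and simplifications, and the naive `gemm_tn_ref` loop stands in for a real BLAS call. How the PR itself maps its operands onto BLAS transposition flags is not stated in the description; this is just one consistent mapping.

```cpp
#include <cassert>
#include <cstddef>

// Per-cell AXPY form for one result row m (left is the ToT operand):
//   result_row[n; a] += right[k, n] * left_row[k; a]
// Assumption: every cell holds `inner` doubles and cells start every
// `cell_stride` doubles (cell_stride >= inner because of SIMD padding).
void scale_row_axpy(std::size_t n_cols, std::size_t n_k, std::size_t inner,
                    std::size_t cell_stride, const double* left_row,
                    const double* right, double* result_row) {
  assert(cell_stride >= inner);
  for (std::size_t n = 0; n < n_cols; ++n)
    for (std::size_t k = 0; k < n_k; ++k) {
      const double alpha = right[k * n_cols + n];  // right is row-major K x N
      for (std::size_t a = 0; a < inner; ++a)
        result_row[n * cell_stride + a] += alpha * left_row[k * cell_stride + a];
    }
}

// Hypothetical naive stand-in for a row-major GEMM, C += A^T * B:
//   C[i*ldc + j] += sum_p A[p*lda + i] * B[p*ldb + j],  i < M, j < N, p < K.
void gemm_tn_ref(std::size_t M, std::size_t N, std::size_t K,
                 const double* A, std::size_t lda,
                 const double* B, std::size_t ldb,
                 double* C, std::size_t ldc) {
  assert(lda >= M && ldb >= N && ldc >= N);
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t p = 0; p < K; ++p) {
      const double a = A[p * lda + i];
      for (std::size_t j = 0; j < N; ++j) C[i * ldc + j] += a * B[p * ldb + j];
    }
}

// Same row as a single GEMM directly on the slabs (no copies):
// GEMM N = inner-cell extent, ldb = ldc = padded inter-cell stride.
void scale_row_gemm(std::size_t n_cols, std::size_t n_k, std::size_t inner,
                    std::size_t cell_stride, const double* left_row,
                    const double* right, double* result_row) {
  gemm_tn_ref(/*M=*/n_cols, /*N=*/inner, /*K=*/n_k,
              /*A=*/right, /*lda=*/n_cols,
              /*B=*/left_row, /*ldb=*/cell_stride,
              /*C=*/result_row, /*ldc=*/cell_stride);
}
```

Because the GEMM's N equals the inner extent while the leading dimensions equal the padded stride, the padding lanes between cells are never read or written. With a CBLAS backend, the stand-in would correspond to a `cblas_dgemm` call with the same M, N, K and leading dimensions and with alpha = beta = 1.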

Correctness guards — the fast path applies only when:

both operands are NoTranspose, and the scalar types match;
the row/column is clean: all cells present, uniform inner size, and laid out as one contiguous single-page arena run (verified by checking the inter-cell stride is constant across all cells — an incrementally-built/uncompacted multi-page tile fails this).

Anything else (multi-page tiles, ragged inner sizes, transposes) falls back to the existing per-cell AXPY loop, so results are unchanged.
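The "clean row/column" guard above can be sketched as a single pass over per-cell descriptors. This is a hedged illustration only: the `Cell` struct, its fields, and `clean_run_stride` are hypothetical and do not reflect the actual arena data structures in the PR.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical per-cell descriptor (not the PR's arena types).
struct Cell {
  bool present;        // false if the inner cell is absent
  int page;            // arena page holding the cell's data
  std::size_t offset;  // element offset of the cell within its page
  std::size_t size;    // inner-cell extent (unpadded)
};

// Returns the constant inter-cell stride if the run is "clean": all cells
// present, on one page, uniform inner size, evenly spaced in increasing
// order. Returns std::nullopt otherwise, in which case the caller would
// take the per-cell AXPY fallback.
std::optional<std::size_t> clean_run_stride(const std::vector<Cell>& cells) {
  if (cells.empty()) return std::nullopt;
  const Cell& first = cells.front();
  if (!first.present || first.size == 0) return std::nullopt;

  std::size_t stride = first.size;  // a single cell is trivially contiguous
  for (std::size_t i = 1; i < cells.size(); ++i) {
    const Cell& c = cells[i];
    if (!c.present || c.page != first.page || c.size != first.size)
      return std::nullopt;
    if (c.offset <= cells[i - 1].offset) return std::nullopt;
    const std::size_t step = c.offset - cells[i - 1].offset;
    if (i == 1) {
      if (step < first.size) return std::nullopt;  // cells would overlap
      stride = step;
    } else if (step != stride) {
      return std::nullopt;  // stride not constant across the run
    }
  }
  return stride;
}
```

A caller would use the returned stride as the GEMM leading dimension when a value is present, and run the existing per-cell loop otherwise, so a failed check costs only the scan.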

Perf: ~1.7× faster on the scale operation itself (the dominant ToT outer-contraction work in CSV-CC / PNO-CCSD). The whole-job wall-time gain is smaller: the scale is bandwidth-bound, and parallel scaling, rather than the per-op kernel, is the current ceiling.

Validation: the end-to-end PNO-CCSD correlation energy is bit-identical to the AXPY path (with the fast path triggered 100% of the time once the consumer compacts its coefficient tiles). A dedicated unit test for the scale-via-gemm path can be added as a follow-up; happy to include one if preferred.

The ToT "scale" outer contraction -- result[m,n;a] += sum_k left[m,k;a] * right[k,n], and the symmetric left-plain T*ToT -- was executed as a triple loop of per-cell AXPYs through the type-erased element op, which is memory/dispatch bound on tiny inner cells. Recast each row (resp. column) as one strided BLAS GEMM directly on the arena slab (zero-copy): the inner cell extent becomes the GEMM's N dimension and the SIMD-padded inter-cell stride is the leading dimension. Guards keep it correct: applies only for NoTranspose, matching scalar type, and "clean" rows/columns -- all cells present, uniform inner size, and laid out as one contiguous single-page arena run (verified by constant stride across all cells). Anything else (multi-page/uncompacted tiles, ragged inner sizes, transposes) falls back to the per-cell AXPY loop. ~1.7x faster on the scale operation itself (the dominant ToT outer-contraction work in CSV-CC). Validated end-to-end: PNO-CCSD correlation energy bit-identical to the AXPY path.

evaleev merged commit db0bff5 into master May 26, 2026
16 of 17 checks passed

evaleev deleted the evaleev/feature/tot-scale-strided-gemm branch May 26, 2026 03:33

evaleev merged 1 commit into master from evaleev/feature/tot-scale-strided-gemm
