Batch the non-grouped Conv GEMM with gemmStridedBatched by harz05 · Pull Request #30 · ML4EP/SOFIE

harz05 · 2026-06-03T10:25:06Z

Implements #29

Changes in ROperator_Conv.hxx (non-grouped path only; grouped unchanged):

Initialize: _xcol is sized to hold the im2col output of all B samples instead of a single slice.
Generate_GPU_ALPAKA: each sample's im2col writes to its own slice of _xcol; the per-sample matmul loop is replaced by a single gemmStridedBatched call over the batch; the inter-sample alpaka::wait calls are removed.
GetBlasConfig: returns empty for the non-grouped path (legacy cuBLAS, no cuBLASLt layout).
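A minimal NumPy sketch of the idea behind the change, not the actual ROperator_Conv.hxx code: the baseline reuses one im2col slice and runs one GEMM per sample, while the batched variant fills a (B, K, M) column buffer and issues a single batched matmul. The function names, the stride-1 / no-padding convolution, and the row ordering of the column matrix are assumptions made for illustration only.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold one sample x of shape (C, H, W) into a (C*kh*kw, out_h*out_w) matrix.
    Assumes stride 1 and no padding (illustrative choice)."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # Each row holds one kernel tap seen from every output position.
                cols[row] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv_loop(x, w):
    """Baseline: one column slice rebuilt per sample, one GEMM per sample."""
    f, c, kh, kw = w.shape
    w_mat = w.reshape(f, c * kh * kw)
    outs = []
    for n in range(x.shape[0]):
        xcol = im2col(x[n], kh, kw)      # single slice, rebuilt each iteration
        outs.append(w_mat @ xcol)        # per-sample GEMM
    return np.stack(outs)

def conv_batched(x, w):
    """Batched: every sample writes its own slice, then one batched matmul."""
    b = x.shape[0]
    f, c, kh, kw = w.shape
    w_mat = w.reshape(f, c * kh * kw)
    out_h, out_w = x.shape[2] - kh + 1, x.shape[3] - kw + 1
    xcol = np.empty((b, c * kh * kw, out_h * out_w), dtype=x.dtype)
    for n in range(b):
        xcol[n] = im2col(x[n], kh, kw)   # slice n of the enlarged buffer
    # w_mat is broadcast across the batch dimension; xcol advances one slice per sample.
    return np.matmul(w_mat, xcol)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 6, 6)).astype(np.float32)
w = rng.standard_normal((5, 3, 3, 3)).astype(np.float32)
print(np.allclose(conv_loop(x, w), conv_batched(x, w), atol=1e-5))
```

In strided-batched GEMM terms this would correspond to a shared weight matrix, a column-buffer stride of K*M elements, and an output stride of F*M elements per sample. That mapping is an assumption about the layout, not something read from the PR's code. The sketch checks agreement only to floating-point tolerance; the bit-identical claim applies to the PR's GPU output, not to this NumPy version.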

Test (output is bit-identical to the loop; existing Conv tests + ConvBatch4 pass):

ConvBatchModelGenerator.py (model + numpy reference)
input_models/ConvBatch4.onnx, references/ConvBatch4.ref.hxx, references/ConvBatch4_input.ref.hxx
ConvBatch4 TEST_F in TestCustomModelsFromONNXForAlpakaCuda.cxx

Benchmarked two ways on T4 Colab (baseline = the per-sample loop):

Fixed model, varying batch (8-layer conv stack, C=16, 16x16), speedup over baseline: batch 1 ~1x (neutral), batch 4 2.6x, batch 8 3.4x, batch 16 4.1x
Single conv layer, varying GEMM size at batch 8: C16 8x8 2.6x, C64 32x32 1.9x, C128 28x28 1.17x

Memory tradeoff (from the code, not separately measured on Colab): _xcol grows from one slice to B slices, so the extra memory is (B-1) * colElements * 4 bytes per conv layer, where colElements = gemm_k * gemm_m. For the configs here that is ~19 MB total for the 8-layer stack at batch 16, and ~50 MB for the single C64 56x56 layer at batch 8. This is modest on a 16 GB T4, but it scales with batch x spatial x channels, so it can become significant for large-batch configs.
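The formula above can be worked through with a small helper. This is a sketch under stated assumptions: the helper name is hypothetical, gemm_k is taken to be C * kh * kw and gemm_m to be out_h * out_w, and the 3x3 kernel with a 56x56 output is assumed, since the PR does not state the kernel size or padding.

```python
def extra_xcol_bytes(batch, gemm_k, gemm_m, bytes_per_elem=4):
    """Extra _xcol memory per conv layer: (B-1) * colElements * bytes per element."""
    col_elements = gemm_k * gemm_m
    return (batch - 1) * col_elements * bytes_per_elem

# Single C64 layer at batch 8; 3x3 kernel and 56x56 output are assumptions.
gemm_k = 64 * 3 * 3      # input channels * kernel height * kernel width (assumed)
gemm_m = 56 * 56         # output height * output width (assumed)
extra = extra_xcol_bytes(8, gemm_k, gemm_m)
print(f"{extra / 1e6:.1f} MB")   # prints "50.6 MB"
```

Under those assumptions the result lines up with the ~50 MB quoted for the single C64 56x56 layer at batch 8. The default of 4 bytes per element corresponds to float32.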

To conclude, the benchmarks point to the gain coming from dropping the per-sample syncs and letting cuBLAS batch the small GEMMs: the gain grows with batch size and shrinks as each GEMM becomes large enough to saturate the GPU on its own. It is neutral at batch=1, and no regression was seen in any case tested so far.

Colab test notebook: Link

harz05 added 2 commits June 3, 2026 13:34

Add ConvBatch4 batch test for conv

4f63c42

Use strided-batched GEMM for non-grouped Conv batch path

f3f3b36

harz05 wants to merge 2 commits into ML4EP:gpu/alpaka from harz05:feat/conv-batched-gemm
