Batch the non-grouped Conv GEMM with gemmStridedBatched #30

Open
harz05 wants to merge 2 commits into ML4EP:gpu/alpaka from harz05:feat/conv-batched-gemm

@harz05 harz05 commented Jun 3, 2026

Implements #29

Changes in ROperator_Conv.hxx (non-grouped path only; grouped unchanged):

  • Initialize: _xcol sized to hold all B samples' im2col instead of one slice.
  • Generate_GPU_ALPAKA: each sample's im2col writes to its own slice of _xcol; the per-sample matmul loop is replaced by a single gemmStridedBatched call over the batch; the inter-sample alpaka::wait calls are removed.
  • GetBlasConfig: returns empty for the non-grouped path (legacy cuBLAS, no cuBLASLt layout).
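As an illustration of what the batched call computes, here is a minimal CPU reference sketch of strided-batched GEMM semantics. It is not the code in this PR: the function names `gemm_ref` and `gemm_strided_batched_ref` are hypothetical, the layout is row-major for readability (cuBLAS itself is column-major), and the assumption that the weight matrix is shared across the batch with a stride of 0 is mine, not stated in the description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference: C = A * B for one row-major GEMM.
// A is m x k, B is k x n, C is m x n.
void gemm_ref(const float* A, const float* B, float* C,
              std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// Hypothetical reference for strided-batched semantics: problem b reads
// A + b * strideA and B + b * strideB, and writes C + b * strideC.
// A stride of 0 means the same matrix is reused for every problem.
void gemm_strided_batched_ref(const float* A, std::size_t strideA,
                              const float* B, std::size_t strideB,
                              float* C, std::size_t strideC,
                              std::size_t m, std::size_t n, std::size_t k,
                              std::size_t batch) {
    assert(batch == 0 || (A != nullptr && B != nullptr && C != nullptr));
    for (std::size_t b = 0; b < batch; ++b) {
        gemm_ref(A + b * strideA, B + b * strideB, C + b * strideC, m, n, k);
    }
}
```

Under these assumptions, the Conv case would pass the weights as `A` with `strideA = 0`, the batched `_xcol` buffer as `B` with `strideB = gemm_k * gemm_m`, and the output as `C` with one output slice per sample. The loop above is exactly the per-sample loop the PR replaces; on the GPU the batched call expresses the same work as one launch without a sync between samples.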

Test (output is bit-identical to the loop; existing Conv tests + ConvBatch4 pass):

  • ConvBatchModelGenerator.py (model + numpy reference)
  • input_models/ConvBatch4.onnx, references/ConvBatch4.ref.hxx, references/ConvBatch4_input.ref.hxx
  • ConvBatch4 TEST_F in TestCustomModelsFromONNXForAlpakaCuda.cxx

Benchmarked two ways on T4 Colab (baseline = the per-sample loop):

  1. Fixed model, varying batch (8-layer conv stack, C=16, 16x16): batch 1 ~1x (neutral), batch 4 2.6x, batch 8 3.4x, batch 16 4.1x

  2. Single conv layer, varying GEMM size at batch 8: C16 8x8 2.6x, C64 32x32 1.9x, C128 28x28 1.17x

Memory tradeoff (derived from the code, not separately measured on Colab): _xcol grows from one slice to B slices, so the extra memory is (B-1) * colElements * 4 bytes per conv layer, where colElements = gemm_k * gemm_m. For the configs here that is ~19 MB total for the 8-layer stack at batch 16, and ~50 MB for the single C64 56x56 layer at batch 8. That is modest on a 16 GB T4, but it scales with batch x spatial x channels, so it can become significant for large-batch configurations.
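The formula above can be checked with a small helper. This is a sketch only: `extra_xcol_bytes` is a hypothetical name, and the worked value assumes a 3x3 kernel with stride 1 and same-size output for the C64 56x56 layer (so gemm_k = 64 * 3 * 3 = 576 and gemm_m = 56 * 56 = 3136), which the description does not state.

```cpp
#include <cstdint>

// Hypothetical helper: extra _xcol bytes for one conv layer when the buffer
// grows from one im2col slice to `batch` slices.
// extra = (B - 1) * gemm_k * gemm_m * elem_bytes, with elem_bytes = 4 for float.
constexpr std::uint64_t extra_xcol_bytes(std::uint64_t batch,
                                         std::uint64_t gemm_k,
                                         std::uint64_t gemm_m,
                                         std::uint64_t elem_bytes = 4) {
    return batch == 0 ? 0 : (batch - 1) * gemm_k * gemm_m * elem_bytes;
}

// Assumed 3x3 kernel, C = 64, 56x56 output, batch 8: about 50.6 MB extra.
static_assert(extra_xcol_bytes(8, 576, 3136) == 50577408ULL,
              "7 extra slices of 576 * 3136 floats");
// Batch 1 needs no extra memory, consistent with the neutral batch=1 result.
static_assert(extra_xcol_bytes(1, 576, 3136) == 0, "no extra slice at batch 1");
```

With that assumed kernel size the result lands on the ~50 MB quoted for the single-layer case.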

In summary, the benchmarks point to the gain coming from dropping the per-sample syncs and letting cuBLAS batch the small GEMMs: the speedup grows with batch size and shrinks as each GEMM gets large enough to saturate the GPU on its own. The change is neutral at batch=1, with no regression in any case tested so far.

Colab test notebook: Link
