Batch the non-grouped Conv GEMM with gemmStridedBatched #30
Open
harz05 wants to merge 2 commits into
Implements #29
Changes in ROperator_Conv.hxx (non-grouped path only; grouped unchanged):
- Initialize: _xcol is sized to hold the im2col output of all B samples instead of a single slice.
- Generate_GPU_ALPAKA: each sample's im2col writes to its own slice; the per-sample matmul loop is replaced by one gemmStridedBatched call over the batch; the inter-sample alpaka::wait calls are removed.
- GetBlasConfig: now returns empty for the non-grouped path (legacy cuBLAS, no cuBLASLt layout).

Test: output is bit-identical to the loop; the existing Conv tests and ConvBatch4 pass.
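For reviewers less familiar with strided-batched GEMM, the change above can be pictured with a small CPU reference. This is only a sketch and not the code in ROperator_Conv.hxx: the names gemmRef, gemmStridedBatchedRef and batchedMatchesPerSampleLoop are hypothetical, the layout is row-major for readability (cuBLAS itself is column-major), and using stride 0 for the shared weight matrix is an assumption about how the call would be set up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single row-major GEMM: C (M x N) = A (M x K) * B (K x N).
void gemmRef(std::size_t M, std::size_t N, std::size_t K,
             const float* A, const float* B, float* C) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

// Strided-batched semantics: sample b uses A + b*strideA, B + b*strideB and
// writes C + b*strideC (strides in elements). strideA == 0 reuses one A for
// every sample, which is how shared conv weights are modelled in this sketch.
void gemmStridedBatchedRef(std::size_t M, std::size_t N, std::size_t K,
                           const float* A, std::size_t strideA,
                           const float* B, std::size_t strideB,
                           float* C, std::size_t strideC,
                           std::size_t batch) {
    for (std::size_t b = 0; b < batch; ++b) {
        gemmRef(M, N, K, A + b * strideA, B + b * strideB, C + b * strideC);
    }
}

// Compares the batched call against a per-sample loop that reuses a single
// im2col slice buffer, mimicking the old one-slice _xcol.
bool batchedMatchesPerSampleLoop() {
    const std::size_t M = 2, K = 3, N = 4, batch = 3;  // illustrative sizes
    std::vector<float> weights(M * K);
    std::vector<float> xcol(batch * K * N);
    for (std::size_t i = 0; i < weights.size(); ++i) weights[i] = static_cast<float>(i + 1);
    for (std::size_t i = 0; i < xcol.size(); ++i) xcol[i] = 0.5f * static_cast<float>(i);

    std::vector<float> outBatched(batch * M * N, 0.0f);
    gemmStridedBatchedRef(M, N, K, weights.data(), 0, xcol.data(), K * N,
                          outBatched.data(), M * N, batch);

    std::vector<float> outLoop(batch * M * N, 0.0f);
    std::vector<float> slice(K * N);
    for (std::size_t b = 0; b < batch; ++b) {
        // Old scheme: copy sample b into the single slice, then one GEMM.
        slice.assign(xcol.data() + b * K * N, xcol.data() + (b + 1) * K * N);
        gemmRef(M, N, K, weights.data(), slice.data(), outLoop.data() + b * M * N);
    }
    return outBatched == outLoop;
}
```

Both paths run the same arithmetic in the same order in this reference, so the comparison is exact here; on the GPU the bit-identical result is the observation reported by the test above, not something this sketch demonstrates.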
Benchmarked two ways on T4 Colab (baseline = the per-sample loop):
- Fixed model, varying batch (8-layer conv stack, C=16, 16x16): batch 1 ~1x (neutral), batch 4 2.6x, batch 8 3.4x, batch 16 4.1x.
- Single conv layer, varying GEMM size at batch 8: C16 8x8 2.6x, C64 32x32 1.9x, C128 28x28 1.17x.
Memory tradeoff (from the code, not separately measured on Colab):
_xcol grows from one slice to B slices, so the extra memory is (B-1) * colElements * 4 bytes per conv layer, where colElements = gemm_k * gemm_m. For the configs here that is ~19 MB total for the 8-layer stack at batch 16, and ~50 MB for the single C64 56x56 layer at batch 8. This is modest on a 16 GB T4, but it scales with batch x spatial x channels, so it can become significant for large-batch configs.

To conclude, the benchmarks point to the gain coming from dropping the per-sample syncs and letting cuBLAS batch the small GEMMs: the gain grows with batch size and shrinks as the GEMM gets large enough to saturate the GPU on its own. It is neutral at batch=1, and no regression was observed in any case tested so far.
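The memory formula above can be checked with a small helper. This is a sketch: extraXcolBytes is a hypothetical name, and the mapping gemm_k = C * kH * kW, gemm_m = outH * outW is an assumption about the im2col layout rather than something stated in this PR.

```cpp
#include <cassert>
#include <cstddef>

// Extra bytes one conv layer needs when _xcol grows from 1 slice to `batch`
// slices: (B - 1) * colElements * 4, with colElements = gemm_k * gemm_m.
// The factor 4 is the size of a 32-bit float element.
constexpr std::size_t extraXcolBytes(std::size_t batch,
                                     std::size_t gemmK,
                                     std::size_t gemmM) {
    return batch == 0 ? 0 : (batch - 1) * (gemmK * gemmM) * 4;
}
```

Assuming a 3x3 kernel and a 56x56 output for the single C64 layer (the kernel size and padding are assumptions), extraXcolBytes(8, 64 * 3 * 3, 56 * 56) gives 50,577,408 bytes, which is in line with the ~50 MB quoted above.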
Colab test notebook: Link