Pool (MaxPool, AvgPool, GlobalAveragePool) GPU support along with tests #27
Open
harz05 wants to merge 4 commits into
Closes: #26
This PR adds GPU codegen for MaxPool 1D, 2D and 3D via three Generate_GPU_* methods on ROperator_Pool. The kernels are per-op constexpr alpaka functors that use one thread per output element and decode the flat index with the row-major formula. The window reduction math is the same as in the CPU Generate(). OperatorKind::POOL is added to the enum and wired in the constructor. This is only a baseline implementation; I'll follow up with some optimisations I have in mind (shared-memory tiling, fusion with the upstream Conv, etc.) and would like to test them against this baseline.
Tested on Colab T4: ctest passes 85/85 (3 new MaxPool tests + 82 pre-existing, no regressions). The MaxPool1d/2d/3d gtest cases compare GPU output against the existing PyTorch references.
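For reviewers who want the index math spelled out, here is a minimal host-side sketch of the "one thread per output element" scheme for the 2D case. It is plain C++ rather than the emitted alpaka functor: the loop in `MaxPool2D` stands in for the GPU thread grid, and `Pool2DShape`, `MaxPoolElement` and `MaxPool2D` are hypothetical names used only for illustration. It assumes NCHW layout and symmetric padding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical shape descriptor (illustrative names, not the PR's).
struct Pool2DShape {
   long N, C, H, W; // input: batch, channels, height, width
   long kH, kW;     // kernel size
   long sH, sW;     // strides
   long pH, pW;     // symmetric padding (assumption of this sketch)
   long OutH() const { return (H + 2 * pH - kH) / sH + 1; }
   long OutW() const { return (W + 2 * pW - kW) / sW + 1; }
};

// Work of a single "thread": decode the flat output index, then reduce the window.
inline float MaxPoolElement(const std::vector<float> &x, const Pool2DShape &s, long idx)
{
   const long oH = s.OutH(), oW = s.OutW();
   // Row-major decode: idx -> (n, c, oh, ow)
   const long ow = idx % oW;
   const long oh = (idx / oW) % oH;
   const long c = (idx / (oW * oH)) % s.C;
   const long n = idx / (oW * oH * s.C);

   float best = std::numeric_limits<float>::lowest();
   for (long kh = 0; kh < s.kH; ++kh) {
      for (long kw = 0; kw < s.kW; ++kw) {
         const long ih = oh * s.sH + kh - s.pH;
         const long iw = ow * s.sW + kw - s.pW;
         if (ih < 0 || ih >= s.H || iw < 0 || iw >= s.W)
            continue; // pad cells never win the max
         const long in = ((n * s.C + c) * s.H + ih) * s.W + iw;
         best = std::max(best, x[static_cast<std::size_t>(in)]);
      }
   }
   return best;
}

// Host loop standing in for the kernel launch: one iteration per output element.
inline std::vector<float> MaxPool2D(const std::vector<float> &x, const Pool2DShape &s)
{
   assert(x.size() == static_cast<std::size_t>(s.N * s.C * s.H * s.W));
   const long total = s.N * s.C * s.OutH() * s.OutW();
   std::vector<float> y(static_cast<std::size_t>(total));
   for (long idx = 0; idx < total; ++idx)
      y[static_cast<std::size_t>(idx)] = MaxPoolElement(x, s, idx);
   return y;
}
```

On a GPU, `idx` would come from the thread's global index and the loop in `MaxPool2D` would disappear; the body of `MaxPoolElement` is the part that corresponds to the kernel.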
Edit:
Extended this PR to also cover AveragePool and GlobalAveragePool on GPU.
AvgPool reuses MaxPool's exact index math, just swapping the max reduction for a sum-and-divide. To avoid duplicating the 1D/2D/3D bodies, I pulled the three mode-dependent bits (init, accumulate, finalize) into small lambdas in the kernel generator, so MaxPool and AvgPool share the same scaffolding. MaxPool's emitted code is unchanged.
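As a rough illustration of that factoring, the sketch below shows a toy source-text generator for a 1D window loop in which only three lambdas depend on the pooling mode. `PoolMode`, `EmitPool1DBody` and the emitted variable names (`acc`, `x`, `y`, `start`, `idx`) are assumptions made for this example; the actual generator in the PR emits alpaka code and handles bounds and padding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class PoolMode { Max, Average };

// Hypothetical generator: returns the body of a 1D pooling loop as C++ source text.
// Only init / accumulate / finalize differ between modes; the loop scaffold is shared.
inline std::string EmitPool1DBody(PoolMode mode, std::size_t kernel)
{
   assert(kernel > 0);
   const std::string k = std::to_string(kernel);

   auto init = [mode]() -> std::string {
      if (mode == PoolMode::Max)
         return "float acc = -INFINITY;\n";
      return "float acc = 0.f;\n";
   };
   auto accumulate = [mode](const std::string &v) -> std::string {
      if (mode == PoolMode::Max)
         return "acc = acc > " + v + " ? acc : " + v + ";\n";
      return "acc += " + v + ";\n";
   };
   auto finalize = [mode, &k]() -> std::string {
      if (mode == PoolMode::Max)
         return "y[idx] = acc;\n";
      return "y[idx] = acc / " + k + ".f;\n"; // constant-divisor case only
   };

   // Shared scaffolding: identical for both modes.
   std::string out;
   out += init();
   out += "for (int k = 0; k < " + k + "; ++k) {\n";
   out += "   " + accumulate("x[start + k]");
   out += "}\n";
   out += finalize();
   return out;
}
```

Because the loop text is built once and only the three snippets vary, adding AvgPool this way leaves the Max branch of each lambda, and therefore the emitted MaxPool code, untouched.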
The average divisor follows the CPU Generate() exactly: when count_include_pad is 0 (the ONNX default) and there is padding, pad cells are excluded, so the divisor is the in-bounds cell count, computed at run time; otherwise it is the constant kernel area (kh*kw). GlobalAveragePool does not need its own kernel: it reaches the GPU path as an AveragePool with kernel equal to the image size (handled in Initialize), so it takes the constant-divisor path.
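The divisor rule can be sketched on the host as follows. This is a single-channel, square-kernel simplification written for illustration; `AvgPool2D` and its parameters are hypothetical names, and the real kernels are emitted alpaka code that works on full NCHW tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-channel 2D average pooling with symmetric padding (illustrative sketch).
inline std::vector<float> AvgPool2D(const std::vector<float> &x, long H, long W, long k,
                                    long stride, long pad, bool countIncludePad)
{
   assert(x.size() == static_cast<std::size_t>(H * W));
   assert(k > 0 && stride > 0 && pad >= 0);
   const long oH = (H + 2 * pad - k) / stride + 1;
   const long oW = (W + 2 * pad - k) / stride + 1;
   // Run-time count is needed only when pad cells exist and must be excluded.
   const bool runtimeCount = !countIncludePad && pad > 0;

   std::vector<float> y(static_cast<std::size_t>(oH * oW));
   for (long oh = 0; oh < oH; ++oh) {
      for (long ow = 0; ow < oW; ++ow) {
         float sum = 0.f;
         long inBounds = 0;
         for (long kh = 0; kh < k; ++kh) {
            for (long kw = 0; kw < k; ++kw) {
               const long ih = oh * stride + kh - pad;
               const long iw = ow * stride + kw - pad;
               if (ih < 0 || ih >= H || iw < 0 || iw >= W)
                  continue; // pad cells contribute zero to the sum
               sum += x[static_cast<std::size_t>(ih * W + iw)];
               ++inBounds;
            }
         }
         const long divisor = runtimeCount ? inBounds : k * k;
         y[static_cast<std::size_t>(oh * oW + ow)] = sum / static_cast<float>(divisor);
      }
   }
   return y;
}
```

With a 2x2 input `{1, 2, 3, 4}`, kernel 2, stride 1 and pad 1, the corner output is 1 when pad cells are excluded (one in-bounds cell) and 0.25 when they are counted (divisor 4). Calling the same function with `k` equal to the image size and `pad` 0 gives the global average, which is the constant-divisor route described above.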
Added 4 gtests: AvgPool (reuses the existing model and its trusted CPU reference, no padding), AvgPoolPad (count_include_pad 0, the run-time count path), AvgPoolCountIncludePad (count_include_pad 1, constant divisor), and GlobalAvgPool2d. Padding in the new models is symmetric on purpose, so they test the divisor logic without depending on the begin/end pad layout. ctest now passes 89/89 on Colab T4, no regressions.
One thing to note again: this is the initial implementation, and I'll be updating it with further optimisations where possible.