
Pool (MaxPool, AvgPool, GlobalAveragePool) GPU support along with tests #27

Open
harz05 wants to merge 4 commits into ML4EP:gpu/alpaka from harz05:feat/maxpool-gpu



harz05 commented May 29, 2026


Closes: #26

This PR adds GPU codegen for MaxPool 1D, 2D and 3D via three Generate_GPU_* methods on ROperator_Pool. The kernels are per-op constexpr alpaka functors: one thread per output element, with the output coordinates decoded from the flat thread index using the row-major formula. The window reduction math is the same as in the CPU Generate(). OperatorKind::POOL is added to the enum and wired in the constructor.
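
To make the "one thread per output element" scheme concrete, here is a minimal host-side sketch in plain C++ (no alpaka) for the 2D case. It is not the code emitted by this PR: `Pool2DShape`, `MaxPoolElement` and `MaxPool2D` are hypothetical names, and the sketch assumes NCHW layout, symmetric padding and floor rounding of the output size. The loop in `MaxPool2D` stands in for the thread grid; each iteration does what a single thread would do.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical shape bundle for illustration; not a type from the PR.
struct Pool2DShape {
    std::size_t N, C, H, W;  // input: batch, channels, height, width (NCHW)
    std::size_t kH, kW;      // kernel size
    std::size_t sH, sW;      // strides
    std::size_t pH, pW;      // symmetric padding (assumed begin == end)
    std::size_t outH() const { return (H + 2 * pH - kH) / sH + 1; }  // floor mode assumed
    std::size_t outW() const { return (W + 2 * pW - kW) / sW + 1; }
};

// Work of one "thread": decode the flat output index, then reduce its window.
float MaxPoolElement(const std::vector<float>& x, const Pool2DShape& s, std::size_t idx) {
    const std::size_t oH = s.outH(), oW = s.outW();
    assert(idx < s.N * s.C * oH * oW);
    // Row-major decode of idx = ((n * C + c) * oH + oh) * oW + ow.
    const std::size_t ow = idx % oW;
    const std::size_t oh = (idx / oW) % oH;
    const std::size_t nc = idx / (oW * oH);  // fused n * C + c
    const float* plane = x.data() + nc * s.H * s.W;

    float best = std::numeric_limits<float>::lowest();
    for (std::size_t kh = 0; kh < s.kH; ++kh) {
        const std::ptrdiff_t ih = static_cast<std::ptrdiff_t>(oh * s.sH + kh) -
                                  static_cast<std::ptrdiff_t>(s.pH);
        if (ih < 0 || ih >= static_cast<std::ptrdiff_t>(s.H)) continue;  // skip pad rows
        for (std::size_t kw = 0; kw < s.kW; ++kw) {
            const std::ptrdiff_t iw = static_cast<std::ptrdiff_t>(ow * s.sW + kw) -
                                      static_cast<std::ptrdiff_t>(s.pW);
            if (iw < 0 || iw >= static_cast<std::ptrdiff_t>(s.W)) continue;  // skip pad cols
            const float v = plane[ih * static_cast<std::ptrdiff_t>(s.W) + iw];
            if (v > best) best = v;
        }
    }
    return best;
}

// Host loop standing in for the GPU grid: one iteration per output element.
std::vector<float> MaxPool2D(const std::vector<float>& x, const Pool2DShape& s) {
    assert(x.size() == s.N * s.C * s.H * s.W);
    std::vector<float> y(s.N * s.C * s.outH() * s.outW());
    for (std::size_t idx = 0; idx < y.size(); ++idx) y[idx] = MaxPoolElement(x, s, idx);
    return y;
}
```

For a 1x1x4x4 input holding 0..15 with a 2x2 kernel, stride 2 and no padding, `MaxPool2D` returns the four window maxima 5, 7, 13 and 15. The 1D and 3D cases differ only in how many coordinates are peeled off the flat index.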

However, this is just a baseline implementation. I'll follow up with some optimisations I have in mind (shared-memory tiling, fusion with the upstream Conv, etc.) and would like to test them against this baseline.

Tested on Colab T4: ctest passes 85/85 (3 new MaxPool tests + 82 pre-existing, no regressions). The MaxPool1d/2d/3d gtest cases compare GPU output against the existing PyTorch references.


Edit:

Extended this PR to also cover AveragePool and GlobalAveragePool on GPU.

AvgPool reuses MaxPool's exact index math, just swapping the max reduction for a sum-and-divide. To avoid duplicating the 1D/2D/3D bodies, I pulled the three mode-dependent bits (init, accumulate, finalize) into small lambdas in the kernel generator, so MaxPool and AvgPool share the same scaffolding. MaxPool's emitted code is unchanged.
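
A rough sketch of that init/accumulate/finalize split, reduced to a 1D body with no padding. This is an illustration under assumptions, not the generator from this PR: `PoolMode`, `EmitPool1DBody` and the emitted variable names (`acc`, `x`, `y`, `start`, `idx`) are all hypothetical. Only the three lambdas depend on the mode; the loop scaffolding around them is emitted once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class PoolMode { Max, Average };  // hypothetical, for illustration only

// Emits the per-element body of a 1D pooling kernel as source text.
// k is the kernel size; padding is ignored in this sketch.
std::string EmitPool1DBody(PoolMode mode, std::size_t k) {
    assert(k > 0);
    // Mode-dependent snippet 1: how the accumulator starts.
    auto init = [mode]() -> std::string {
        if (mode == PoolMode::Max) return "float acc = std::numeric_limits<float>::lowest();";
        return "float acc = 0.f;";
    };
    // Mode-dependent snippet 2: how one window value is folded in.
    auto accumulate = [mode](const std::string& v) -> std::string {
        if (mode == PoolMode::Max) return "acc = (" + v + " > acc) ? " + v + " : acc;";
        return "acc += " + v + ";";
    };
    // Mode-dependent snippet 3: how the result is written out.
    auto finalize = [mode, k]() -> std::string {
        if (mode == PoolMode::Max) return "y[idx] = acc;";
        return "y[idx] = acc / " + std::to_string(k) + ".f;";
    };

    // Shared scaffolding: identical for both modes.
    std::string out;
    out += "   " + init() + "\n";
    out += "   for (std::size_t i = 0; i < " + std::to_string(k) + "; ++i) {\n";
    out += "      " + accumulate("x[start + i]") + "\n";
    out += "   }\n";
    out += "   " + finalize() + "\n";
    return out;
}
```

Calling `EmitPool1DBody(PoolMode::Max, 3)` and `EmitPool1DBody(PoolMode::Average, 3)` yields two bodies with the same `for` loop and different accumulator lines, which is the property that lets the Max path stay textually unchanged when the Average path is added.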

The average divisor follows the CPU Generate() exactly: when count_include_pad is 0 (the ONNX default) and there is padding, pad cells are excluded, so the divisor is the in-bounds cell count, computed at run time; otherwise it is the constant kernel area (kh*kw). GlobalAveragePool does not need its own kernel: it reaches the GPU path as an AveragePool with kernel equal to the image size (handled in Initialize), so it takes the constant-divisor path.
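
The divisor rule can be sketched on a single H x W plane in plain C++. The names `AvgPoolElement` and `GlobalAvgPool` are hypothetical, and the sketch assumes one symmetric pad value and one stride for both axes; it illustrates the rule described above and is not the emitted kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One output element (oh, ow) of 2D average pooling over a single plane.
// Assumes the same stride and symmetric pad on both axes (illustrative only).
float AvgPoolElement(const std::vector<float>& plane, std::size_t H, std::size_t W,
                     std::size_t kH, std::size_t kW, std::size_t stride, std::size_t pad,
                     bool countIncludePad, std::size_t oh, std::size_t ow) {
    assert(plane.size() == H * W);
    float sum = 0.f;
    std::size_t inBounds = 0;  // number of window cells that hit real input
    for (std::size_t kh = 0; kh < kH; ++kh) {
        const std::ptrdiff_t ih = static_cast<std::ptrdiff_t>(oh * stride + kh) -
                                  static_cast<std::ptrdiff_t>(pad);
        if (ih < 0 || ih >= static_cast<std::ptrdiff_t>(H)) continue;
        for (std::size_t kw = 0; kw < kW; ++kw) {
            const std::ptrdiff_t iw = static_cast<std::ptrdiff_t>(ow * stride + kw) -
                                      static_cast<std::ptrdiff_t>(pad);
            if (iw < 0 || iw >= static_cast<std::ptrdiff_t>(W)) continue;
            sum += plane[ih * static_cast<std::ptrdiff_t>(W) + iw];
            ++inBounds;
        }
    }
    // Run-time count only when pad cells are excluded and padding exists;
    // otherwise the divisor is the constant kernel area.
    const bool useRuntimeCount = !countIncludePad && pad > 0;
    const std::size_t divisor = useRuntimeCount ? inBounds : kH * kW;
    assert(divisor > 0);
    return sum / static_cast<float>(divisor);
}

// GlobalAveragePool expressed as AveragePool with kernel == image size and no
// padding, so it always takes the constant-divisor branch.
float GlobalAvgPool(const std::vector<float>& plane, std::size_t H, std::size_t W) {
    return AvgPoolElement(plane, H, W, H, W, 1, 0, false, 0, 0);
}
```

On a 2x2 plane filled with 2.0, using a 2x2 kernel with pad 1, the corner output at (0, 0) overlaps a single real cell: with count_include_pad 0 the result is 2.0 / 1 = 2.0, and with count_include_pad 1 it is 2.0 / 4 = 0.5.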

Added 4 gtests: AvgPool (reuses the existing model and its trusted CPU reference, no padding), AvgPoolPad (count_include_pad 0, the run-time count path), AvgPoolCountIncludePad (count_include_pad 1, constant divisor), and GlobalAvgPool2d. Padding in the new models is symmetric on purpose, so they test the divisor logic without depending on the begin/end pad layout. ctest now passes 89/89 on Colab T4, no regressions.

One thing to note again: this is the initial implementation, and I'll be refining it with further optimisations.

harz05 force-pushed the feat/maxpool-gpu branch from d01c916 to 22b059c on June 7, 2026 10:28
harz05 changed the title from "MaxPool 1D, 2D and 3D gpu support along with tests" to "Pool (MaxPool, AvgPool, GlobalAveragePool) GPU support along with tests" on Jun 7, 2026