Pool (MaxPool, AvgPool, GlobalAveragePool) GPU support along with tests by harz05 · Pull Request #27 · ML4EP/SOFIE

harz05 · 2026-05-29T17:55:43Z

Closes: #26

This PR adds GPU codegen for MaxPool 1D, 2D and 3D via three Generate_GPU_* methods on ROperator_Pool. The kernels are per-op constexpr alpaka functors, with one thread per output element and a row-major index decode. The window reduction math is the same as in the CPU Generate(). OperatorKind::POOL is added to the enum and wired in the constructor.
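To make the "one thread per output element" mapping concrete, here is a minimal host-side sketch in plain C++ (no alpaka) of what a single thread would compute for the 2D case. The `Pool2DParams` struct and the function names are hypothetical and are not taken from the SOFIE sources. Skipping out-of-bounds cells in the window is an assumption about the padding semantics, not a copy of the emitted kernel.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical parameter bundle for illustration; not SOFIE's actual types.
struct Pool2DParams {
    std::size_t N, C, H, W;  // input shape, NCHW
    std::size_t kh, kw;      // kernel height and width
    std::size_t sh, sw;      // strides
    std::size_t ph, pw;      // begin padding
    std::size_t OH, OW;      // output spatial shape
};

// Work of one "thread": decode the flat output index, then reduce its window.
inline float MaxPool2DElement(const float* x, const Pool2DParams& p, std::size_t idx) {
    // Row-major decode of idx into (n, c, oh, ow).
    const std::size_t ow = idx % p.OW;
    const std::size_t oh = (idx / p.OW) % p.OH;
    const std::size_t c  = (idx / (p.OW * p.OH)) % p.C;
    const std::size_t n  = idx / (p.OW * p.OH * p.C);

    const float* plane = x + (n * p.C + c) * p.H * p.W;
    float best = std::numeric_limits<float>::lowest();
    for (std::size_t i = 0; i < p.kh; ++i) {
        const long long ih = static_cast<long long>(oh * p.sh + i) - static_cast<long long>(p.ph);
        if (ih < 0 || ih >= static_cast<long long>(p.H)) continue;  // assumed: pad cells are skipped
        for (std::size_t j = 0; j < p.kw; ++j) {
            const long long iw = static_cast<long long>(ow * p.sw + j) - static_cast<long long>(p.pw);
            if (iw < 0 || iw >= static_cast<long long>(p.W)) continue;
            const float v = plane[ih * static_cast<long long>(p.W) + iw];
            if (v > best) best = v;
        }
    }
    return best;
}

// Host loop standing in for the GPU launch: one iteration per output element.
inline std::vector<float> MaxPool2D(const std::vector<float>& x, const Pool2DParams& p) {
    std::vector<float> y(p.N * p.C * p.OH * p.OW);
    for (std::size_t idx = 0; idx < y.size(); ++idx)
        y[idx] = MaxPool2DElement(x.data(), p, idx);
    return y;
}
```

On a GPU the outer loop in `MaxPool2D` disappears: each thread would receive its own `idx` and write a single output value, so no synchronisation between threads is needed.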

However, this is just a baseline implementation. I'll follow up with some optimisations I have in mind (shared memory tiling, fusion with the upstream Conv, etc.) and would like to test them against this.

Tested on Colab T4: ctest passes 85/85 (3 new MaxPool tests + 82 pre-existing, no regressions). The MaxPool1d/2d/3d gtest cases compare GPU output against the existing PyTorch references.

Edit:

Extended this PR to also cover AveragePool and GlobalAveragePool on GPU.

AvgPool reuses MaxPool's exact index math, just swapping the max reduction for a sum-and-divide. To avoid duplicating the 1D/2D/3D bodies, I pulled the three mode-dependent bits (init, accumulate, finalize) into small lambdas in the kernel generator, so MaxPool and AvgPool share the same scaffolding. MaxPool's emitted code is unchanged.
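As a rough illustration of the init/accumulate/finalize split, the sketch below shows a toy string-emitting generator for a 1D pooling loop body. Everything in it (`PoolMode`, `EmitPool1DBody`, and the variable names in the emitted text such as `acc`, `base`, `stride`, `pad`, `W`) is hypothetical and does not reproduce the code this PR emits. For brevity the average branch always divides by the constant kernel size.

```cpp
#include <sstream>
#include <string>

enum class PoolMode { Max, Average };  // hypothetical; not SOFIE's enum

// Emits the body of a 1D pooling loop as text. The loop scaffolding is shared;
// only the three lambdas below depend on the pooling mode.
inline std::string EmitPool1DBody(PoolMode mode, int kernel) {
    auto init = [&]() -> std::string {
        return mode == PoolMode::Max
                   ? std::string("float acc = -std::numeric_limits<float>::infinity();\n")
                   : std::string("float acc = 0.f;\n");
    };
    auto accumulate = [&](const std::string& v) -> std::string {
        return mode == PoolMode::Max
                   ? std::string("acc = (") + v + " > acc) ? " + v + " : acc;"
                   : std::string("acc += ") + v + ";";
    };
    auto finalize = [&]() -> std::string {
        // Constant divisor only; the run-time count path is left out of this sketch.
        return mode == PoolMode::Max
                   ? std::string("y[idx] = acc;\n")
                   : std::string("y[idx] = acc / ") + std::to_string(kernel) + ".f;\n";
    };

    std::ostringstream out;
    out << init();
    out << "for (int k = 0; k < " << kernel << "; ++k) {\n";
    out << "   int i = o * stride + k - pad;\n";
    out << "   if (i < 0 || i >= W) continue;\n";
    out << "   " << accumulate("x[base + i]") << "\n";
    out << "}\n";
    out << finalize();
    return out.str();
}
```

Calling `EmitPool1DBody(PoolMode::Max, 3)` and `EmitPool1DBody(PoolMode::Average, 3)` yields two bodies with an identical loop and bounds check, differing only in the three mode-dependent lines.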

The average divisor follows the CPU Generate() exactly: when count_include_pad is 0 (the ONNX default) and there is padding, pad cells are excluded so the divisor is the in-bounds cell count, computed at run time; otherwise it is the constant kernel area (kh*kw). GlobalAveragePool does not need its own kernel: it reaches the GPU path as an AveragePool with kernel equal to the image size (handled in Initialize), so it takes the constant-divisor path.
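The divisor rule described above can be sketched for the 2D case as follows. The `AvgPool2DConfig` struct and `AvgPoolDivisor` function are hypothetical names used only for illustration, and the sketch assumes symmetric begin padding `ph`, `pw`.

```cpp
#include <algorithm>

// Hypothetical configuration for illustration; not SOFIE's actual attributes.
struct AvgPool2DConfig {
    int H, W;              // input spatial shape
    int kh, kw;            // kernel height and width
    int sh, sw;            // strides
    int ph, pw;            // begin padding (assumed symmetric here)
    bool countIncludePad;  // ONNX count_include_pad != 0
};

// Divisor for the output element at (oh, ow).
inline int AvgPoolDivisor(const AvgPool2DConfig& c, int oh, int ow) {
    const bool hasPadding = c.ph > 0 || c.pw > 0;
    // Constant path: pad cells are counted, or there are none to exclude.
    if (c.countIncludePad || !hasPadding) return c.kh * c.kw;

    // Run-time path: count only the window cells that fall inside the input.
    const int h0 = std::max(oh * c.sh - c.ph, 0);
    const int h1 = std::min(oh * c.sh - c.ph + c.kh, c.H);
    const int w0 = std::max(ow * c.sw - c.pw, 0);
    const int w1 = std::min(ow * c.sw - c.pw + c.kw, c.W);
    return (h1 - h0) * (w1 - w0);
}
```

For a 4x4 input with a 3x3 kernel, stride 1 and padding 1, a corner output covers only a 2x2 in-bounds region, so its divisor is 4 when count_include_pad is 0 and 9 when it is 1. A GlobalAveragePool expressed as a 4x4 kernel with no padding always gets the constant 16.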

Added 4 gtests: AvgPool (reuses the existing model and its trusted CPU reference, no padding), AvgPoolPad (count_include_pad 0, the run-time count path), AvgPoolCountIncludePad (count_include_pad 1, constant divisor), and GlobalAvgPool2d. Padding in the new models is symmetric on purpose, so they test the divisor logic without depending on the begin/end pad layout. ctest now passes 89/89 on Colab T4, no regressions.

One thing to note again: this is the initial implementation, and I'll be updating it with further optimisations where possible.

harz05 added 3 commits May 29, 2026 17:32

2d maxpool gpu kernel and test

c126ada

empty codegen for unsupported pool variants

ae76b18

1d and 3d maxpool gpu kernels and tests

21b0fbb

harz05 mentioned this pull request Jun 1, 2026

fix wrong pad indices in pool for asym padding #28

Open

added AvgPool and GlobalAvgPool GPU kernels + tests

22b059c

harz05 force-pushed the feat/maxpool-gpu branch from d01c916 to 22b059c Compare June 7, 2026 10:28

harz05 changed the title ~~MaxPool 1D, 2D and 3D gpu support along with tests~~ Pool (MaxPool, AvgPool, GlobalAveragePool) GPU support along with tests Jun 7, 2026

harz05 wants to merge 4 commits into ML4EP:gpu/alpaka from harz05:feat/maxpool-gpu
