Add GPU support by hakkelt · Pull Request #40 · kul-optec/AbstractOperators.jl

hakkelt · 2026-04-21T13:34:03Z

High-Level Overview

This PR adds GPU support for all operators except WaveletOp.

Most of the operators needed only minimal changes because they were implemented with broadcast operations.
The following operators require GPU-specific implementations, which were added as package extensions: Filt, MIMOFilt, DFT, SignAlternation, NFFTOp, GetIndex, Variation, and ZeroPad.
A GPU implementation of DCT is tricky, and a package for it already exists: AcceleratedDCTs. I didn't want to add it to the dependencies of the FFTWOperators subpackage, so I implemented an extension that activates only when AcceleratedDCTs is loaded. Whether this is a good design is an open question:

using AbstractOperators, FFTWOperators, CUDA

x_gpu = CUDA.randn(Float32, 64)
dct_op = DCT(x_gpu)                         # this line fails
idct_op = IDCT(x_gpu)                       # this also fails

using AcceleratedDCTs
dct_op = DCT(x_gpu)                         # now it's ok
idct_op = IDCT(x_gpu)                       # now it's ok
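
The behaviour above (a constructor that fails until another package is loaded) can be sketched in plain Julia. This is only an illustration of the dispatch pattern: `DeviceVector` and `make_dct` are hypothetical names, not part of FFTWOperators or AcceleratedDCTs, and in a real package the second method would live in an extension module that Julia loads when the trigger package is imported.

```julia
# Hypothetical stand-in for a GPU vector type (not a real GPU array).
struct DeviceVector{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::DeviceVector) = size(v.data)
Base.getindex(v::DeviceVector, i::Int) = v.data[i]

# "Base package": only the CPU method exists at first.
make_dct(x::Array) = "CPU DCT for $(length(x)) samples"

x = DeviceVector(rand(Float32, 64))

# Without the extra method, the call fails with a MethodError.
failed_before = try
    make_dct(x)
    false
catch err
    err isa MethodError
end

# "Extension": adds the device method once the optional package is available.
make_dct(x::DeviceVector) = "device DCT for $(length(x)) samples"

result_after = make_dct(x)
println(failed_before, " / ", result_after)
```

The trade-off in this pattern is the one raised above: the base package stays free of the extra dependency, but users see a failure until they load the optional package.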

A new operator, OperatorWrapper, allows CPU operators to be combined with GPU operators:

using AbstractOperators, WaveletOperators, CUDA

cpu_op = WaveletOp(Float32, wavelet(WT.db4), (16, 16))
gpu_op1 = DiagOp(CUDA.randn(16, 16))
gpu_op2 = DiagOp(CUDA.randn(16, 16))

combined_on_gpu = gpu_op1 * OperatorWrapper(cpu_op, array_type = CuArray) * gpu_op2
y_gpu = combined_on_gpu * CUDA.randn(16, 16) # input and output on GPU

combined_on_cpu = cpu_op * OperatorWrapper(gpu_op2)
y_cpu = combined_on_cpu * randn(Float32, 16, 16) # input and output on CPU
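
The PR description does not show how OperatorWrapper is implemented. As a rough sketch of the general idea (copy the input to the memory the wrapped operator expects, apply it, copy the result back), the following uses only Base. `DeviceArray`, `to_host`, `to_device`, and `WrappedOp` are hypothetical names introduced for illustration.

```julia
# Hypothetical stand-in for a GPU array type such as CuArray.
struct DeviceArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::DeviceArray) = size(a.data)
Base.getindex(a::DeviceArray{T,N}, i::Vararg{Int,N}) where {T,N} = a.data[i...]

# Explicit transfers between "device" and host memory (plain copies here).
to_host(a::DeviceArray) = copy(a.data)
to_device(a::Array) = DeviceArray(copy(a))

# Wraps a function that only accepts plain Arrays.
struct WrappedOp{F}
    op::F
end

# Device in, device out: transfer, apply on the host, transfer back.
function (w::WrappedOp)(x::DeviceArray)
    y_host = w.op(to_host(x))
    return to_device(y_host)
end

cpu_only(x::Array) = 2 .* x          # an operator with no device method

w = WrappedOp(cpu_only)
x_dev = to_device(Float32[1 2; 3 4])
y_dev = w(x_dev)
println(typeof(y_dev), " ", to_host(y_dev))
```

Each application pays for two transfers, so a wrapper like this is mainly useful when a single CPU-only operator sits inside an otherwise GPU-resident chain.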

API Changes

All operator constructors gained a new keyword argument, array_type, which defaults to Array but allows specifying other computing backends (e.g., CuArray, ROCArray, MArray).
domain_storage_type and codomain_storage_type are renamed to domain_array_type and codomain_array_type.
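
A minimal sketch of how a constructor might honour an array_type keyword, assuming the array type can be parameterized by element type and constructed with `undef` and a dimension tuple. `ScaleOp` is a hypothetical operator used only for illustration; it is not part of AbstractOperators.

```julia
# Hypothetical operator that stores one buffer of weights.
struct ScaleOp{A<:AbstractArray}
    weights::A
end

# The keyword selects the container type; Array is the default backend.
function ScaleOp(::Type{T}, dims::Dims; array_type::Type = Array) where {T}
    w = array_type{T}(undef, dims)   # allocate on the chosen backend
    fill!(w, one(T))
    return ScaleOp(w)
end

op = ScaleOp(Float32, (2, 3))
println(typeof(op.weights))
```

With a GPU package loaded, the same call with `array_type = CuArray` would allocate the buffer on the device instead, provided the array type supports this constructor form.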

List of Changes

Added GPU support across the operator stack with backend-specific extensions.
- Introduced GpuExt for the core package.
- Added GPU extensions for DSPOperators, FFTWOperators, and NFFTOperators.
- Added CPU/GPU dispatch guards and GPU-aware buffer handling.
- Added AcceleratedDCTs integration so DCT/IDCT can run on GPU when that package is imported.
- Kept WaveletOperators CPU-only and documented that limitation explicitly.
Refactored operator internals to reduce type instability and improve composition behavior.
- Tightened type parameters and constructor signatures in several calculus operators.
- Updated Compose, HCAT, VCAT, DCAT, Ax_mul_Bx, Ax_mul_Bxt, and Axt_mul_Bx.
- Improved storage-type propagation and buffer allocation logic.
- Added/expanded helper abstractions for operator wrapping and property queries.
Reworked FFT and NFFT operator implementations.
- Updated IRDFT, RDFT, DCT, Shift, and FFT combination rules.
- Added GPU-aware NFFT operator support via array_type.
- Added a dedicated NFFT extension and normal-operator updates.
Overhauled tests and test infrastructure.
- Switched GPU tests to use GPUEnv.
- Added more GPU coverage across linear, calculus, batching, FFT, DSP, NFFT, and wavelet operators.
- Refactored tests to use domain_array_type and codomain_array_type.
- Added quality and syntax checks, plus new test helpers and GPU-specific test items.
Added benchmarking support.
- Introduced benchmark/gpu_crossover.jl.
- Updated benchmark setup to handle the new GPU backend matrix.
Expanded and corrected documentation.
- Added a GPU support guide in docs/src/gpu.md.
- Updated the top-level README.md and subpackage READMEs.
- Clarified which operators support CUDA, AMDGPU, oneAPI, and OpenCL.
- Documented backend-specific exceptions and GPU activation requirements.
Updated package metadata.
- Bumped versions and adjusted dependency bounds across the core package and subpackages.
- Updated project manifests for the new GPU-oriented structure.

codecov · 2026-04-21T17:12:44Z

Codecov Report

❌ Patch coverage is 84.91879% with 130 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.13%. Comparing base (0a053c3) to head (42cfa93).

Files with missing lines	Patch %	Lines
src/calculus/OperatorWrapper.jl	66.66%	17 Missing ⚠️
src/linearoperators/Eye.jl	58.82%	14 Missing ⚠️
src/calculus/BroadCast.jl	74.41%	11 Missing ⚠️
src/calculus/DCAT.jl	89.21%	11 Missing ⚠️
src/calculus/Sum.jl	18.18%	9 Missing ⚠️
src/calculus/HadamardProd.jl	61.11%	7 Missing ⚠️
ext/GpuExt/linearoperators/getindex.jl	85.00%	6 Missing ⚠️
ext/GpuExt/linearoperators/variation.jl	89.83%	6 Missing ⚠️
src/properties.jl	83.33%	6 Missing ⚠️
src/linearoperators/DiagOp.jl	77.27%	5 Missing ⚠️
... and 16 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #40      +/-   ##
==========================================
- Coverage   89.28%   87.13%   -2.16%     
==========================================
  Files          45       51       +6     
  Lines        3267     3716     +449     
==========================================
+ Hits         2917     3238     +321     
- Misses        350      478     +128

Add array_type keyword constructors and domain_storage_type/codomain_storage_type storage type traits to all operators. Add core GPU extension (GpuExt) with operator overrides for GetIndex, Variation, and ZeroPad. Add GPU extensions for DSPOperators, FFTWOperators, NFFTOperators, and WaveletOperators subpackages.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update test infrastructure to support GPU (JLArray) testing. Add :jlarray tags to relevant testitems. Add gpu_utils.jl helper. Add GPU quality tests. Update operator testitems with proper tags (:linearoperator, :nonlinearoperator, etc.). Rename CpuWrapper tests to OperatorWrapper.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Updated tests for Eye, FiniteDiff, GetIndex, L-BFGS, LMatrixOp, MatrixOp, MyLinOp, Variation, Zeros, and nonlinear operators to utilize GPUEnv for backend management.
- Removed specific CUDA and AMDGPU checks, replacing them with a loop over available GPU backends.
- Simplified test setups by eliminating redundant code and ensuring compatibility with various GPU array types.
- Ensured all tests are now tagged appropriately for GPU execution without dependency on specific GPU libraries.

- Updated various test files to replace domain_storage_type and codomain_storage_type with domain_array_type and codomain_array_type for consistency and clarity.
- Removed unnecessary verbose print statements in tests to streamline output.
- Adjusted GPU-related tests to ensure proper handling of array types.
- Ensured that all tests maintain functionality while improving readability and maintainability.

Replace the FFT-based adjoint mul! with a tiled FIR direct convolution on CPU paths (H <: Array{T}). The GPU fallback keeps the FFT-based approach.

Algorithm: y[j] = Σ_k h[k] * b[padlen+j-k], unrolled 8-wide so all accumulators (a0..a7) live in registers. Reads b[base:base+7] consecutively per inner k-iteration (cache-friendly), writes y only once.

Benchmark (n=32768, h length 21, Float64, 1 FFTW thread):
- Before (65536-pt FFT): ~484 μs
- After (tiled FIR): ~107 μs (~4.5× speedup)

Baseline on benchmark machine was ~418 μs, so this should close the regression seen in PR kul-optec#40.
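
The tiling described in this commit message can be sketched as follows. This is an illustrative reconstruction from the formula y[j] = Σ_k h[k] * b[padlen+j-k], not the code from the PR: the function names, the choice padlen = length(h), and the test data are assumptions, and bounds checks are left on for safety.

```julia
# Straightforward reference: one accumulator per output sample.
function fir_reference!(y, h, b, padlen)
    for j in eachindex(y)
        acc = zero(eltype(y))
        for k in eachindex(h)
            acc += h[k] * b[padlen + j - k]
        end
        y[j] = acc
    end
    return y
end

# Tiled variant: eight outputs share each h[k], with eight scalar accumulators.
function fir_tiled!(y, h, b, padlen)
    n = length(y)
    T = eltype(y)
    j = 1
    while j + 7 <= n
        a0 = a1 = a2 = a3 = a4 = a5 = a6 = a7 = zero(T)
        for k in eachindex(h)
            hk = h[k]
            base = padlen + j - k        # b[base:base+7] is read consecutively
            a0 += hk * b[base]
            a1 += hk * b[base + 1]
            a2 += hk * b[base + 2]
            a3 += hk * b[base + 3]
            a4 += hk * b[base + 4]
            a5 += hk * b[base + 5]
            a6 += hk * b[base + 6]
            a7 += hk * b[base + 7]
        end
        # Each output is written exactly once.
        y[j] = a0; y[j + 1] = a1; y[j + 2] = a2; y[j + 3] = a3
        y[j + 4] = a4; y[j + 5] = a5; y[j + 6] = a6; y[j + 7] = a7
        j += 8
    end
    # Remainder when n is not a multiple of 8.
    while j <= n
        acc = zero(T)
        for k in eachindex(h)
            acc += h[k] * b[padlen + j - k]
        end
        y[j] = acc
        j += 1
    end
    return y
end

h = [1.0, 2.0, 3.0, 4.0, 5.0]
n = 21                                    # two full tiles plus a remainder of 5
padlen = length(h)                        # assumed, so that padlen + 1 - k >= 1
b = collect(1.0:(padlen + n - 1))         # long enough for the largest index
y_ref = fir_reference!(zeros(n), h, b, padlen)
y_tiled = fir_tiled!(zeros(n), h, b, padlen)
println(y_ref ≈ y_tiled)
```

Both functions accumulate over k in the same order for each output, so the results should agree; any speed difference depends on the compiler keeping the eight accumulators in registers.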

- Add XcorrAdjFFT helper struct carrying the adjoint FFT buffers and plans
- Xcorr.adj_fft is Nothing for CPU arrays (H <: Array): no adjoint FFT buffers are allocated; the tiled FIR path (mul! on Xcorr{<:Array}) is used
- For GPU backends adj_fft is a XcorrAdjFFT; the FFT-based adjoint dispatches on <:XcorrAdjFFT instead of the old top-level struct fields
- Add _xcorr_plan_kwargs helper: passes flags=FFTW.MEASURE for CPU Arrays only, no flags for GPU backends; fixes FFTW.MEASURE incompatibility with cuFFT
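
A helper like _xcorr_plan_kwargs can be approximated by dispatching on the array type and returning a NamedTuple that is splatted into the planning call. The sketch below avoids depending on FFTW: `plan_kwargs` and `fake_plan` are hypothetical, the symbol `:measure` is a placeholder for FFTW.MEASURE, and a SubArray stands in for a non-Array backend.

```julia
# CPU Arrays get a planner flag; every other array type gets no extra keywords.
plan_kwargs(::Type{<:Array}) = (flags = :measure,)      # placeholder for FFTW.MEASURE
plan_kwargs(::Type{<:AbstractArray}) = NamedTuple()

# Hypothetical planner that just reports which flag it received.
fake_plan(x; flags = :default) = flags

x_cpu = zeros(Float64, 8)
x_other = view(x_cpu, 1:4)                               # any non-Array type

cpu_flag = fake_plan(x_cpu; plan_kwargs(typeof(x_cpu))...)
other_flag = fake_plan(x_other; plan_kwargs(typeof(x_other))...)
println(cpu_flag, " ", other_flag)
```

Keeping the backend decision in one small function means the planning call itself stays identical for every backend.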

github-actions · 2026-05-29T14:26:33Z

Benchmark Results (Julia v1.12.6)

🚀 6 benchmarks improved in time · 🚀 1 benchmark use less memory

Time benchmarks

Benchmark	base	head	Ratio (base/head)
batching/SimpleBatchOp/adjoint-single	47.3 ± 2.7 μs	46.5 ± 3.2 μs	1.02 ± 0.18
batching/SimpleBatchOp/forward-single	43.1 ± 4.1 μs	41.3 ± 2.7 μs	1.04 ± 0.24
batching/SpreadingBatchOp/adjoint-single	24.3 ± 1.6 μs	24 ± 1.1 μs	1.01 ± 0.16
batching/SpreadingBatchOp/forward-single	21.1 ± 1.3 μs	22 ± 1.2 μs	0.957 ± 0.16
calculus/AffineAdd/adjoint	6.26 ± 0.42 μs	6.83 ± 0.55 μs	0.916 ± 0.19
calculus/AffineAdd/forward	18 ± 1.2 μs	17 ± 1.2 μs	1.06 ± 0.21
calculus/Ax_mul_Bx/forward	49.7 ± 0.73 μs	50 ± 0.38 μs	0.995 ± 0.033
calculus/Ax_mul_Bx/jacobian-adjoint	72.4 ± 1.1 μs	71.1 ± 1.5 μs	1.02 ± 0.052
calculus/Ax_mul_Bxt/forward	49.8 ± 0.33 μs	49.8 ± 0.34 μs	1 ± 0.019
calculus/Ax_mul_Bxt/jacobian-adjoint	71.7 ± 1.3 μs	70.4 ± 1.7 μs	1.02 ± 0.062
calculus/Axt_mul_Bx/forward	49.6 ± 0.28 μs	49.7 ± 0.32 μs	0.999 ± 0.017
calculus/Axt_mul_Bx/jacobian-adjoint	71.6 ± 1.1 μs	71.2 ± 1.1 μs	1.01 ± 0.043
calculus/BroadCast/identity-single	61 ± 7.1 μs	72.8 ± 9.6 μs	0.839 ± 0.3
calculus/BroadCast/operator-single-adjoint	1.72 ± 0.086 μs	1.72 ± 0.081 μs	0.999 ± 0.14
calculus/BroadCast/operator-single-forward	401 ± 46 ns	391 ± 36 ns	1.03 ± 0.3
calculus/Compose/adjoint	12.7 ± 0.66 μs	13.4 ± 0.96 μs	0.945 ± 0.17
calculus/Compose/forward	15.4 ± 0.89 μs	14.6 ± 0.76 μs	1.05 ± 0.16
calculus/DCAT/adjoint	25 ± 1.5 μs	24.3 ± 1.1 μs	1.03 ± 0.15
calculus/DCAT/forward	26.3 ± 1.3 μs	25.8 ± 1.5 μs	1.02 ± 0.15
calculus/HCAT/adjoint	26.4 ± 1.6 μs	25.2 ± 1.2 μs	1.05 ± 0.16
calculus/HCAT/forward	40.6 ± 2.1 μs	39.9 ± 2.8 μs	1.02 ± 0.18
calculus/HadamardProd/forward	857 ± 4.3 μs	831 ± 6.6 μs	1.03 ± 0.019
calculus/HadamardProd/jacobian-adjoint	928 ± 8.1 μs	896 ± 8.5 μs	1.04 ± 0.027
calculus/Jacobian/sigmoid-adjoint	160 ± 3 μs	159 ± 2.3 μs	1 ± 0.047
calculus/Reshape/adjoint	6.45 ± 0.46 μs	6.57 ± 0.38 μs	0.982 ± 0.18
calculus/Reshape/forward	9.58 ± 0.71 μs	9.09 ± 0.57 μs	1.05 ± 0.2
calculus/Scale/adjoint	10.3 ± 0.66 μs	10.1 ± 0.52 μs	1.02 ± 0.17
calculus/Scale/forward	12.9 ± 0.74 μs	13.2 ± 1.1 μs	0.973 ± 0.2
calculus/Sum/adjoint	35.6 ± 1.8 μs	35.1 ± 3 μs	1.01 ± 0.2
calculus/Sum/forward	37.5 ± 2.5 μs	36.8 ± 3.3 μs	1.02 ± 0.23
calculus/VCAT/adjoint	38 ± 1.7 μs	37.6 ± 2.9 μs	1.01 ± 0.18
calculus/VCAT/forward	25.9 ± 1.3 μs	25.4 ± 1.3 μs	1.02 ± 0.14
dspoperators/Filt/adjoint	222 ± 5.3 μs	221 ± 4.9 μs	1 ± 0.065
dspoperators/Filt/forward	230 ± 4.8 μs	228 ± 5 μs	1.01 ± 0.061
dspoperators/MIMOFilt/adjoint	181 ± 2.1 μs	180 ± 1.9 μs	1.01 ± 0.031
dspoperators/MIMOFilt/forward	184 ± 2 μs	186 ± 2.4 μs	0.992 ± 0.034
dspoperators/Xcorr/adjoint	434 ± 8.8 μs	52.6 ± 1.1 μs	8.26 ± 0.49 🚀
dspoperators/Xcorr/forward	1.3 ± 0.063 ms	475 ± 58 μs	2.74 ± 0.71 🚀
fftwoperators/DFT/adjoint	170 ± 7.8 μs	170 ± 6.9 μs	0.998 ± 0.12
fftwoperators/DFT/forward	149 ± 7.6 μs	149 ± 7.4 μs	1 ± 0.14
linearoperators/DiagOp/adjoint-single	267 ± 15 μs	253 ± 13 μs	1.05 ± 0.16
linearoperators/DiagOp/adjoint-threaded	244 ± 9.8 μs	252 ± 11 μs	0.966 ± 0.12
linearoperators/DiagOp/forward-single	274 ± 16 μs	257 ± 15 μs	1.07 ± 0.17
linearoperators/DiagOp/forward-threaded	245 ± 10 μs	251 ± 11 μs	0.976 ± 0.12
linearoperators/Eye/forward	461 ± 38 μs	399 ± 26 μs	1.16 ± 0.24
linearoperators/FiniteDiff/adjoint	437 ± 17 μs	437 ± 15 μs	1 ± 0.11
linearoperators/FiniteDiff/forward	429 ± 15 μs	435 ± 16 μs	0.986 ± 0.1
linearoperators/GetIndex/adjoint	867 ± 52 μs	800 ± 40 μs	1.08 ± 0.17
linearoperators/GetIndex/forward	592 ± 42 μs	529 ± 41 μs	1.12 ± 0.23
linearoperators/LBFGS/mul	53.5 ± 2.1 μs	53.5 ± 2.3 μs	0.999 ± 0.12
linearoperators/LBFGS/update	12.2 ± 0.99 μs	12.3 ± 0.85 μs	0.987 ± 0.21
linearoperators/LMatrixOp/adjoint	243 ± 15 μs	232 ± 20 μs	1.05 ± 0.22
linearoperators/LMatrixOp/forward	245 ± 24 μs	250 ± 28 μs	0.978 ± 0.29
linearoperators/MatrixOp/adjoint	183 ± 5 μs	184 ± 5.9 μs	0.996 ± 0.084
linearoperators/MatrixOp/forward	188 ± 4.8 μs	189 ± 5.7 μs	0.994 ± 0.078
linearoperators/MyLinOp/adjoint	255 ± 14 μs	257 ± 13 μs	0.99 ± 0.14
linearoperators/MyLinOp/forward	263 ± 17 μs	268 ± 17 μs	0.981 ± 0.18
linearoperators/Variation/adjoint-single	864 ± 2.1 μs	864 ± 2 μs	1 ± 0.0068
linearoperators/Variation/forward-single	403 ± 30 μs	82 ± 7.2 μs	4.91 ± 1.1 🚀
linearoperators/ZeroPad/adjoint	922 ± 8.3 μs	143 ± 1.2 μs	6.45 ± 0.16 🚀
linearoperators/ZeroPad/forward	1.83 ± 0.015 ms	64.3 ± 4.2 μs	28.4 ± 3.7 🚀
linearoperators/Zeros/forward	856 ± 69 μs	763 ± 30 μs	1.12 ± 0.2
nfftoperators/NFFTOp/adjoint	258 ± 7.9 μs	258 ± 8.1 μs	1 ± 0.088
nfftoperators/NFFTOp/forward	214 ± 5.6 μs	214 ± 5.1 μs	0.999 ± 0.071
nonlinearoperators/Atan/forward	442 ± 6.9 μs	441 ± 7.4 μs	1 ± 0.046
nonlinearoperators/Atan/jacobian-adjoint	13.6 ± 0.76 μs	15.9 ± 1.5 μs	0.855 ± 0.19
nonlinearoperators/Cos/forward	397 ± 6 μs	398 ± 6.3 μs	0.996 ± 0.044
nonlinearoperators/Cos/jacobian-adjoint	426 ± 6.9 μs	428 ± 6.9 μs	0.995 ± 0.046
nonlinearoperators/Exp/forward	519 ± 11 μs	514 ± 6.5 μs	1.01 ± 0.05
nonlinearoperators/Exp/jacobian-adjoint	600 ± 7.4 μs	603 ± 6.7 μs	0.995 ± 0.033
nonlinearoperators/Pow/forward	540 ± 5.9 μs	539 ± 6 μs	1 ± 0.031
nonlinearoperators/Pow/jacobian-adjoint	399 ± 6.2 μs	400 ± 6.7 μs	0.997 ± 0.046
nonlinearoperators/Sech/forward	236 ± 4.9 μs	247 ± 4.8 μs	0.957 ± 0.054
nonlinearoperators/Sech/jacobian-adjoint	657 ± 6.4 μs	660 ± 7 μs	0.994 ± 0.028
nonlinearoperators/Sigmoid/forward	349 ± 5.5 μs	344 ± 5.7 μs	1.02 ± 0.047
nonlinearoperators/Sigmoid/jacobian-adjoint	328 ± 6.2 μs	334 ± 5.9 μs	0.984 ± 0.051
nonlinearoperators/Sin/forward	389 ± 5.6 μs	392 ± 5.7 μs	0.993 ± 0.041
nonlinearoperators/Sin/jacobian-adjoint	414 ± 6.2 μs	415 ± 6.4 μs	0.998 ± 0.043
nonlinearoperators/SoftMax/forward	323 ± 6.2 μs	324 ± 6 μs	0.997 ± 0.053
nonlinearoperators/SoftMax/jacobian-adjoint	390 ± 1.8 ms	366 ± 8.6 μs	1.07e+03 ± 51 🚀
nonlinearoperators/SoftPlus/forward	920 ± 2.1 μs	921 ± 1.6 μs	0.999 ± 0.0057
nonlinearoperators/SoftPlus/jacobian-adjoint	411 ± 6.5 μs	407 ± 6.6 μs	1.01 ± 0.046
nonlinearoperators/Tanh/forward	335 ± 5.6 μs	335 ± 5.7 μs	1 ± 0.048
nonlinearoperators/Tanh/jacobian-adjoint	266 ± 5.5 μs	265 ± 5.3 μs	1 ± 0.058
normaloperators/DFT/mul	4.72 ± 0.19 μs	4.59 ± 0.23 μs	1.03 ± 0.13
normaloperators/DiagOp/mul	256 ± 13 μs	263 ± 19 μs	0.973 ± 0.17
normaloperators/NFFTOp/mul	162 ± 10 μs	162 ± 11 μs	0.999 ± 0.19
waveletoperators/WaveletOp/adjoint	3.13 ± 0.089 ms	3.12 ± 0.092 ms	1 ± 0.082
waveletoperators/WaveletOp/forward	1.22 ± 0.018 ms	1.24 ± 0.03 ms	0.986 ± 0.055

Memory benchmarks

Benchmark	base	head	Ratio (base/head)
batching/SimpleBatchOp/adjoint-single	768 allocs (277.00 KiB)	768 allocs (277.00 KiB)	1
batching/SimpleBatchOp/forward-single	384 allocs (265.00 KiB)	384 allocs (265.00 KiB)	1
batching/SpreadingBatchOp/adjoint-single	416 allocs (140.00 KiB)	418 allocs (140.09 KiB)	0.999
batching/SpreadingBatchOp/forward-single	224 allocs (134.00 KiB)	224 allocs (134.00 KiB)	1
calculus/AffineAdd/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/AffineAdd/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Ax_mul_Bx/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Ax_mul_Bx/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Ax_mul_Bxt/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Ax_mul_Bxt/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Axt_mul_Bx/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Axt_mul_Bx/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/BroadCast/identity-single	1 allocs (48 bytes)	1 allocs (48 bytes)	1
calculus/BroadCast/operator-single-adjoint	31 allocs (1.08 KiB)	31 allocs (1.08 KiB)	1
calculus/BroadCast/operator-single-forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Compose/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Compose/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/DCAT/adjoint	1 allocs (32 bytes)	1 allocs (32 bytes)	1
calculus/DCAT/forward	1 allocs (32 bytes)	1 allocs (32 bytes)	1
calculus/HCAT/adjoint	131 allocs (4.70 KiB)	131 allocs (4.70 KiB)	1
calculus/HCAT/forward	70 allocs (2.52 KiB)	70 allocs (2.52 KiB)	1
calculus/HadamardProd/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/HadamardProd/jacobian-adjoint	6 allocs (512.14 KiB)	6 allocs (512.14 KiB)	1
calculus/Jacobian/sigmoid-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Reshape/adjoint	1 allocs (32 bytes)	1 allocs (32 bytes)	1
calculus/Reshape/forward	1 allocs (32 bytes)	1 allocs (32 bytes)	1
calculus/Scale/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Scale/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Sum/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/Sum/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
calculus/VCAT/adjoint	130 allocs (4.67 KiB)	130 allocs (4.67 KiB)	1
calculus/VCAT/forward	71 allocs (2.55 KiB)	71 allocs (2.55 KiB)	1
dspoperators/Filt/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
dspoperators/Filt/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
dspoperators/MIMOFilt/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
dspoperators/MIMOFilt/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
dspoperators/Xcorr/adjoint	27 allocs (772.30 KiB)	0 allocs (0 bytes)	N/A 🚀
dspoperators/Xcorr/forward	36 allocs (3.00 MiB)	0 allocs (0 bytes)	N/A 🚀
fftwoperators/DFT/adjoint	45 allocs (258.34 KiB)	45 allocs (258.34 KiB)	1
fftwoperators/DFT/forward	45 allocs (258.34 KiB)	45 allocs (258.34 KiB)	1
linearoperators/DiagOp/adjoint-single	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/DiagOp/adjoint-threaded	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/DiagOp/forward-single	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/DiagOp/forward-threaded	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/Eye/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/FiniteDiff/adjoint	12 allocs (4.00 MiB)	12 allocs (4.00 MiB)	1
linearoperators/FiniteDiff/forward	6 allocs (4.00 MiB)	6 allocs (4.00 MiB)	1
linearoperators/GetIndex/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/GetIndex/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/LBFGS/mul	65 allocs (1.33 KiB)	65 allocs (1.33 KiB)	1
linearoperators/LBFGS/update	3 allocs (48 bytes)	3 allocs (48 bytes)	1
linearoperators/LMatrixOp/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/LMatrixOp/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/MatrixOp/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/MatrixOp/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/MyLinOp/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/MyLinOp/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/Variation/adjoint-single	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/Variation/forward-single	2304 allocs (3.05 MiB)	512 allocs (16.00 KiB)	195 🚀
linearoperators/ZeroPad/adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/ZeroPad/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
linearoperators/Zeros/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nfftoperators/NFFTOp/adjoint	152 allocs (5.28 KiB)	152 allocs (5.28 KiB)	1
nfftoperators/NFFTOp/forward	149 allocs (5.17 KiB)	149 allocs (5.17 KiB)	1
nonlinearoperators/Atan/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Atan/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Cos/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Cos/jacobian-adjoint	6 allocs (512.14 KiB)	6 allocs (512.14 KiB)	1
nonlinearoperators/Exp/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Exp/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Pow/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Pow/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Sech/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Sech/jacobian-adjoint	6 allocs (512.14 KiB)	6 allocs (512.14 KiB)	1
nonlinearoperators/Sigmoid/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Sigmoid/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Sin/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Sin/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/SoftMax/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/SoftMax/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/SoftPlus/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/SoftPlus/jacobian-adjoint	3 allocs (512.07 KiB)	3 allocs (512.07 KiB)	1
nonlinearoperators/Tanh/forward	0 allocs (0 bytes)	0 allocs (0 bytes)	1
nonlinearoperators/Tanh/jacobian-adjoint	0 allocs (0 bytes)	0 allocs (0 bytes)	1
normaloperators/DFT/mul	0 allocs (0 bytes)	0 allocs (0 bytes)	1
normaloperators/DiagOp/mul	0 allocs (0 bytes)	0 allocs (0 bytes)	1
normaloperators/NFFTOp/mul	7 allocs (208 bytes)	7 allocs (208 bytes)	1
waveletoperators/WaveletOp/adjoint	9 allocs (512.34 KiB)	9 allocs (512.34 KiB)	1
waveletoperators/WaveletOp/forward	9 allocs (512.34 KiB)	9 allocs (512.34 KiB)	1

Ratio interpretation: values > 1 mean the PR is faster; values < 1 mean slower.
🚀 significant speedup · 🐢 significant slowdown

hakkelt · 2026-05-29T16:35:08Z

@lostella @nantonel this PR is ready for review. There will be a separate PR for extending test coverage because this PR is already quite large.

hakkelt force-pushed the gpu branch 2 times, most recently from 14de6d5 to 991a4a0 Compare April 21, 2026 13:46

hakkelt marked this pull request as draft April 21, 2026 14:01

hakkelt force-pushed the gpu branch from 49a98c6 to 3477f6d Compare April 21, 2026 16:56

hakkelt force-pushed the gpu branch 3 times, most recently from e65e941 to 88f5307 Compare April 27, 2026 11:01

hakkelt and others added 5 commits April 27, 2026 16:01

Enhance GPU support documentation across various operators and clarify backend-specific limitations

50fa0ce

hakkelt force-pushed the gpu branch 2 times, most recently from bb22286 to 5bd4ccf Compare April 28, 2026 08:50

hakkelt and others added 6 commits April 28, 2026 17:13

Fix OperatorWrapper construction by removing unnecessary dimensions in storage type parameters

8f4cbbd

Fix CI testing for Julia 1.10

054140d

skip persistent_tasks Aqua tests on Julia 1.10 & fix benchmarking and documentation CI job

3438692

rename *_storage_type to *_array_type in documentation and AI agent files

394e0f4

fix documentation

75a5ca7

Co-authored-by: Copilot <copilot@github.com>

fix benchmarking CI action

0e2f164

hakkelt force-pushed the gpu branch from 5bd4ccf to 0e2f164 Compare April 28, 2026 15:14

hakkelt added 2 commits April 28, 2026 17:38

remove stale compat entry for AcceleratedDCTs

28b55e5

try zsoerenm/AirspeedVelocity.jl fork for benchmarking CI action

ae50823

hakkelt force-pushed the gpu branch from baf78d4 to ae50823 Compare April 28, 2026 16:30

hakkelt added 5 commits April 28, 2026 18:47

fix commit on zsoerenm/AirspeedVelocity.jl in benchmarking CI action

723b354

fix benchmark CI job

9e1590e

replace AirspeedVelocity with a custom solution

77ecefa

add summary line to benchmark results post

7520d93

Merge branch 'custom-benchmarking' into gpu

9d63d46

hakkelt and others added 5 commits May 19, 2026 22:39

fix regresssions

d8fbd76

fix Xcorr: documenter and type issues

54ce6e2

Merge branch 'master' into gpu

2701027

Revert the unintended change in compare.jl

42cfa93

hakkelt marked this pull request as ready for review May 29, 2026 16:00

hakkelt wants to merge 24 commits into kul-optec:master from hakkelt:gpu
