Add GPU support #40

Open
hakkelt wants to merge 24 commits into kul-optec:master from hakkelt:gpu
Conversation

hakkelt (Collaborator) commented Apr 21, 2026

High-Level Overview

This PR adds GPU support for all operators except WaveletOperator.

  • Most of the operators needed only minimal changes because they were implemented with broadcast operations.
  • The following operators require GPU-specific implementations, which were added as package extensions: Filt, MIMOFilt, DFT, SignAlternation, NFFTOp, GetIndex, Variation, and ZeroPad.
  • A GPU implementation of DCT is quite tricky, and a package for it already exists: AcceleratedDCTs. I didn't want to add that package to the dependencies of the FFTWOperators subpackage, so I implemented an extension that activates when AcceleratedDCTs is loaded. Whether this is a good design is an open question:
using AbstractOperators, FFTWOperators, CUDA

x_gpu = CUDA.randn(Float32, 64)
dct_op = DCT(x_gpu)                         # this line fails
idct_op = IDCT(x_gpu)                       # this also fails

using AcceleratedDCTs
dct_op = DCT(x_gpu)                         # now it's ok
idct_op = IDCT(x_gpu)                       # now it's ok
  • A new operator, OperatorWrapper, allows CPU operators to be combined with GPU operators:
using AbstractOperators, WaveletOperators, CUDA

cpu_op = WaveletOp(Float32, wavelet(WT.db4), (16, 16))
gpu_op1 = DiagOp(CUDA.randn(16, 16))
gpu_op2 = DiagOp(CUDA.randn(16, 16))

combined_on_gpu = gpu_op1 * OperatorWrapper(cpu_op, array_type = CuArray) * gpu_op2
y_gpu = combined_on_gpu * CUDA.randn(16, 16) # input and output on GPU

combined_on_cpu = cpu_op * OperatorWrapper(gpu_op2)
y_cpu = combined_on_cpu * randn(Float32, 16, 16) # input and output on CPU

API Changes

  • All operator constructors now accept a new keyword argument, array_type, which defaults to Array but allows other computing backends to be specified (e.g., CuArray, ROCArray, MArray).
  • domain_storage_type and codomain_storage_type have been renamed to domain_array_type and codomain_array_type.

List of Changes

  • Added GPU support across the operator stack with backend-specific extensions.

    • Introduced GpuExt for the core package.
    • Added GPU extensions for DSPOperators, FFTWOperators, and NFFTOperators.
    • Added CPU/GPU dispatch guards and GPU-aware buffer handling.
    • Added AcceleratedDCTs integration so DCT/IDCT can run on GPU when that package is imported.
    • Kept WaveletOperators CPU-only and documented that limitation explicitly.
  • Refactored operator internals to reduce type instability and improve composition behavior.

    • Tightened type parameters and constructor signatures in several calculus operators.
    • Updated Compose, HCAT, VCAT, DCAT, Ax_mul_Bx, Ax_mul_Bxt, and Axt_mul_Bx.
    • Improved storage-type propagation and buffer allocation logic.
    • Added/expanded helper abstractions for operator wrapping and property queries.
  • Reworked FFT and NFFT operator implementations.

    • Updated IRDFT, RDFT, DCT, Shift, and FFT combination rules.
    • Added GPU-aware NFFT operator support via array_type.
    • Added a dedicated NFFT extension and normal-operator updates.
  • Overhauled tests and test infrastructure.

    • Switched GPU tests to use GPUEnv.
    • Added more GPU coverage across linear, calculus, batching, FFT, DSP, NFFT, and wavelet operators.
    • Refactored tests to use domain_array_type and codomain_array_type.
    • Added quality and syntax checks, plus new test helpers and GPU-specific test items.
  • Added benchmarking support.

    • Introduced benchmark/gpu_crossover.jl.
    • Updated benchmark setup to handle the new GPU backend matrix.
  • Expanded and corrected documentation.

    • Added a GPU support guide in docs/src/gpu.md.
    • Updated the top-level README.md and subpackage READMEs.
    • Clarified which operators support CUDA, AMDGPU, oneAPI, and OpenCL.
    • Documented backend-specific exceptions and GPU activation requirements.
  • Updated package metadata.

    • Bumped versions and adjusted dependency bounds across the core package and subpackages.
    • Updated project manifests for the new GPU-oriented structure.

hakkelt force-pushed the gpu branch 2 times, most recently from 14de6d5 to 991a4a0 on April 21, 2026 13:46
hakkelt marked this pull request as draft April 21, 2026 14:01
codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 84.91879% with 130 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.13%. Comparing base (0a053c3) to head (42cfa93).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/calculus/OperatorWrapper.jl | 66.66% | 17 Missing ⚠️ |
| src/linearoperators/Eye.jl | 58.82% | 14 Missing ⚠️ |
| src/calculus/BroadCast.jl | 74.41% | 11 Missing ⚠️ |
| src/calculus/DCAT.jl | 89.21% | 11 Missing ⚠️ |
| src/calculus/Sum.jl | 18.18% | 9 Missing ⚠️ |
| src/calculus/HadamardProd.jl | 61.11% | 7 Missing ⚠️ |
| ext/GpuExt/linearoperators/getindex.jl | 85.00% | 6 Missing ⚠️ |
| ext/GpuExt/linearoperators/variation.jl | 89.83% | 6 Missing ⚠️ |
| src/properties.jl | 83.33% | 6 Missing ⚠️ |
| src/linearoperators/DiagOp.jl | 77.27% | 5 Missing ⚠️ |

... and 16 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #40      +/-   ##
==========================================
- Coverage   89.28%   87.13%   -2.16%     
==========================================
  Files          45       51       +6     
  Lines        3267     3716     +449     
==========================================
+ Hits         2917     3238     +321     
- Misses        350      478     +128     

hakkelt force-pushed the gpu branch 3 times, most recently from e65e941 to 88f5307 on April 27, 2026 11:01
hakkelt and others added 5 commits April 27, 2026 16:01

Add array_type keyword constructors and domain_storage_type/codomain_storage_type
storage type traits to all operators. Add core GPU extension (GpuExt) with operator
overrides for GetIndex, Variation, and ZeroPad. Add GPU extensions for DSPOperators,
FFTWOperators, NFFTOperators, and WaveletOperators subpackages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update test infrastructure to support GPU (JLArray) testing. Add :jlarray tags
to relevant testitems. Add gpu_utils.jl helper. Add GPU quality tests. Update
operator testitems with proper tags (:linearoperator, :nonlinearoperator, etc.).
Rename CpuWrapper tests to OperatorWrapper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Updated tests for Eye, FiniteDiff, GetIndex, L-BFGS, LMatrixOp, MatrixOp, MyLinOp, Variation, Zeros, and nonlinear operators to utilize GPUEnv for backend management.
- Removed specific CUDA and AMDGPU checks, replacing them with a loop over available GPU backends.
- Simplified test setups by eliminating redundant code and ensuring compatibility with various GPU array types.
- Ensured all tests are now tagged appropriately for GPU execution without dependency on specific GPU libraries.

- Updated various test files to replace domain_storage_type and codomain_storage_type with domain_array_type and codomain_array_type for consistency and clarity.
- Removed unnecessary verbose print statements in tests to streamline output.
- Adjusted GPU-related tests to ensure proper handling of array types.
- Ensured that all tests maintain functionality while improving readability and maintainability.

hakkelt force-pushed the gpu branch 2 times, most recently from bb22286 to 5bd4ccf on April 28, 2026 08:50
hakkelt and others added 5 commits May 19, 2026 22:39
Replace the FFT-based adjoint mul! with a tiled FIR direct convolution on
CPU paths (H <: Array{T}). The GPU fallback keeps the FFT-based approach.

Algorithm: y[j] = Σ_k h[k] * b[padlen+j-k], unrolled 8-wide so all
accumulators (a0..a7) live in registers. Reads b[base:base+7] consecutively
per inner k-iteration (cache-friendly), writes y only once.

Benchmark (n=32768, h length 21, Float64, 1 FFTW thread):
- Before (65536-pt FFT):  ~484 μs
- After  (tiled FIR):      ~107 μs  (~4.5× speedup)

Baseline on benchmark machine was ~418 μs, so this should close the
regression seen in PR kul-optec#40.
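
For reviewers, the tiling idea in the commit message above can be sketched in plain Julia as follows. This is a minimal illustration, not the code in this PR: the names fir_naive! and fir_tiled! are hypothetical, and the sketch assumes 1-based vectors with padlen >= length(h) and length(b) >= padlen + length(y) - 1.

```julia
# Reference: direct evaluation of y[j] = Σ_k h[k] * b[padlen + j - k].
function fir_naive!(y, h, b, padlen)
    for j in eachindex(y)
        acc = zero(eltype(y))
        for k in eachindex(h)
            acc += h[k] * b[padlen + j - k]
        end
        y[j] = acc
    end
    return y
end

# Tiled variant (hypothetical): eight outputs per tile, accumulated in locals a0..a7.
function fir_tiled!(y, h, b, padlen)
    n = length(y)
    j = 1
    while j + 7 <= n
        a0 = a1 = a2 = a3 = a4 = a5 = a6 = a7 = zero(eltype(y))
        for k in eachindex(h)
            hk = h[k]
            base = padlen + j - k          # b[base:base+7] is read consecutively
            a0 += hk * b[base];     a1 += hk * b[base + 1]
            a2 += hk * b[base + 2]; a3 += hk * b[base + 3]
            a4 += hk * b[base + 4]; a5 += hk * b[base + 5]
            a6 += hk * b[base + 6]; a7 += hk * b[base + 7]
        end
        # Each output element is written exactly once.
        y[j], y[j + 1], y[j + 2], y[j + 3] = a0, a1, a2, a3
        y[j + 4], y[j + 5], y[j + 6], y[j + 7] = a4, a5, a6, a7
        j += 8
    end
    # Scalar tail for the remaining (fewer than 8) outputs.
    for jj in j:n
        acc = zero(eltype(y))
        for k in eachindex(h)
            acc += h[k] * b[padlen + jj - k]
        end
        y[jj] = acc
    end
    return y
end

h = [1.0, 2.0, 3.0]
padlen = length(h)                     # assumption: padlen >= length(h)
n = 19                                 # two full tiles plus a 3-element tail
b = collect(1.0:(padlen + n - 1))      # assumption: b is long enough for all reads
y_ref = fir_naive!(zeros(n), h, b, padlen)
y_tiled = fir_tiled!(zeros(n), h, b, padlen)
```

Both functions accumulate over k in the same order for every output, so the two results should agree exactly; the tiled version only changes how many outputs share one pass over h.
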
- Add XcorrAdjFFT helper struct carrying the adjoint FFT buffers and plans
- Xcorr.adj_fft is Nothing for CPU arrays (H <: Array) — no adjoint FFT
  buffers are allocated; the tiled FIR path (mul! on Xcorr{<:Array}) is used
- For GPU backends adj_fft is a XcorrAdjFFT; the FFT-based adjoint dispatches
  on <:XcorrAdjFFT instead of the old top-level struct fields
- Add _xcorr_plan_kwargs helper: passes flags=FFTW.MEASURE for CPU Arrays only,
  no flags for GPU backends — fixes FFTW.MEASURE incompatibility with cuFFT
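
The dispatch pattern described for _xcorr_plan_kwargs could look roughly like the sketch below. Everything here is a stand-in so that the snippet runs without FFTW.jl or a GPU package: MEASURE_FLAG replaces FFTW.MEASURE, FakeGpuArray replaces a real GPU array type, and plan_kwargs and fake_plan are hypothetical names, not the helpers in this PR.

```julia
# Placeholder standing in for FFTW.MEASURE so the sketch runs without FFTW.jl.
const MEASURE_FLAG = :measure

# Hypothetical stand-in for a GPU array type (e.g. CuArray); not a real backend.
struct FakeGpuArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::FakeGpuArray) = size(a.data)
Base.getindex(a::FakeGpuArray, i::Int...) = a.data[i...]

# CPU arrays request the measuring planner; every other array type passes no flags.
plan_kwargs(::Array) = (; flags = MEASURE_FLAG)
plan_kwargs(::AbstractArray) = (;)

# Stand-in for a planning call; it simply reports which flags it received.
fake_plan(x; flags = nothing) = flags

cpu_x = zeros(4)
gpu_x = FakeGpuArray(zeros(4))
cpu_flags = fake_plan(cpu_x; plan_kwargs(cpu_x)...)   # flags are forwarded
gpu_flags = fake_plan(gpu_x; plan_kwargs(gpu_x)...)   # nothing is forwarded
```

Returning a NamedTuple and splatting it into the call keeps the planning call site identical for every backend, which is presumably why a small helper is preferable to branching at each call.
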
github-actions Bot commented May 29, 2026

Benchmark Results (Julia v1.12.6)

🚀 6 benchmarks improved in time · 🚀 1 benchmark uses less memory

Time benchmarks
Benchmark base head Ratio (base/head)
batching/SimpleBatchOp/adjoint-single 47.3 ± 2.7 μs 46.5 ± 3.2 μs 1.02 ± 0.18
batching/SimpleBatchOp/forward-single 43.1 ± 4.1 μs 41.3 ± 2.7 μs 1.04 ± 0.24
batching/SpreadingBatchOp/adjoint-single 24.3 ± 1.6 μs 24 ± 1.1 μs 1.01 ± 0.16
batching/SpreadingBatchOp/forward-single 21.1 ± 1.3 μs 22 ± 1.2 μs 0.957 ± 0.16
calculus/AffineAdd/adjoint 6.26 ± 0.42 μs 6.83 ± 0.55 μs 0.916 ± 0.19
calculus/AffineAdd/forward 18 ± 1.2 μs 17 ± 1.2 μs 1.06 ± 0.21
calculus/Ax_mul_Bx/forward 49.7 ± 0.73 μs 50 ± 0.38 μs 0.995 ± 0.033
calculus/Ax_mul_Bx/jacobian-adjoint 72.4 ± 1.1 μs 71.1 ± 1.5 μs 1.02 ± 0.052
calculus/Ax_mul_Bxt/forward 49.8 ± 0.33 μs 49.8 ± 0.34 μs 1 ± 0.019
calculus/Ax_mul_Bxt/jacobian-adjoint 71.7 ± 1.3 μs 70.4 ± 1.7 μs 1.02 ± 0.062
calculus/Axt_mul_Bx/forward 49.6 ± 0.28 μs 49.7 ± 0.32 μs 0.999 ± 0.017
calculus/Axt_mul_Bx/jacobian-adjoint 71.6 ± 1.1 μs 71.2 ± 1.1 μs 1.01 ± 0.043
calculus/BroadCast/identity-single 61 ± 7.1 μs 72.8 ± 9.6 μs 0.839 ± 0.3
calculus/BroadCast/operator-single-adjoint 1.72 ± 0.086 μs 1.72 ± 0.081 μs 0.999 ± 0.14
calculus/BroadCast/operator-single-forward 401 ± 46 ns 391 ± 36 ns 1.03 ± 0.3
calculus/Compose/adjoint 12.7 ± 0.66 μs 13.4 ± 0.96 μs 0.945 ± 0.17
calculus/Compose/forward 15.4 ± 0.89 μs 14.6 ± 0.76 μs 1.05 ± 0.16
calculus/DCAT/adjoint 25 ± 1.5 μs 24.3 ± 1.1 μs 1.03 ± 0.15
calculus/DCAT/forward 26.3 ± 1.3 μs 25.8 ± 1.5 μs 1.02 ± 0.15
calculus/HCAT/adjoint 26.4 ± 1.6 μs 25.2 ± 1.2 μs 1.05 ± 0.16
calculus/HCAT/forward 40.6 ± 2.1 μs 39.9 ± 2.8 μs 1.02 ± 0.18
calculus/HadamardProd/forward 857 ± 4.3 μs 831 ± 6.6 μs 1.03 ± 0.019
calculus/HadamardProd/jacobian-adjoint 928 ± 8.1 μs 896 ± 8.5 μs 1.04 ± 0.027
calculus/Jacobian/sigmoid-adjoint 160 ± 3 μs 159 ± 2.3 μs 1 ± 0.047
calculus/Reshape/adjoint 6.45 ± 0.46 μs 6.57 ± 0.38 μs 0.982 ± 0.18
calculus/Reshape/forward 9.58 ± 0.71 μs 9.09 ± 0.57 μs 1.05 ± 0.2
calculus/Scale/adjoint 10.3 ± 0.66 μs 10.1 ± 0.52 μs 1.02 ± 0.17
calculus/Scale/forward 12.9 ± 0.74 μs 13.2 ± 1.1 μs 0.973 ± 0.2
calculus/Sum/adjoint 35.6 ± 1.8 μs 35.1 ± 3 μs 1.01 ± 0.2
calculus/Sum/forward 37.5 ± 2.5 μs 36.8 ± 3.3 μs 1.02 ± 0.23
calculus/VCAT/adjoint 38 ± 1.7 μs 37.6 ± 2.9 μs 1.01 ± 0.18
calculus/VCAT/forward 25.9 ± 1.3 μs 25.4 ± 1.3 μs 1.02 ± 0.14
dspoperators/Filt/adjoint 222 ± 5.3 μs 221 ± 4.9 μs 1 ± 0.065
dspoperators/Filt/forward 230 ± 4.8 μs 228 ± 5 μs 1.01 ± 0.061
dspoperators/MIMOFilt/adjoint 181 ± 2.1 μs 180 ± 1.9 μs 1.01 ± 0.031
dspoperators/MIMOFilt/forward 184 ± 2 μs 186 ± 2.4 μs 0.992 ± 0.034
dspoperators/Xcorr/adjoint 434 ± 8.8 μs 52.6 ± 1.1 μs 8.26 ± 0.49 🚀
dspoperators/Xcorr/forward 1.3 ± 0.063 ms 475 ± 58 μs 2.74 ± 0.71 🚀
fftwoperators/DFT/adjoint 170 ± 7.8 μs 170 ± 6.9 μs 0.998 ± 0.12
fftwoperators/DFT/forward 149 ± 7.6 μs 149 ± 7.4 μs 1 ± 0.14
linearoperators/DiagOp/adjoint-single 267 ± 15 μs 253 ± 13 μs 1.05 ± 0.16
linearoperators/DiagOp/adjoint-threaded 244 ± 9.8 μs 252 ± 11 μs 0.966 ± 0.12
linearoperators/DiagOp/forward-single 274 ± 16 μs 257 ± 15 μs 1.07 ± 0.17
linearoperators/DiagOp/forward-threaded 245 ± 10 μs 251 ± 11 μs 0.976 ± 0.12
linearoperators/Eye/forward 461 ± 38 μs 399 ± 26 μs 1.16 ± 0.24
linearoperators/FiniteDiff/adjoint 437 ± 17 μs 437 ± 15 μs 1 ± 0.11
linearoperators/FiniteDiff/forward 429 ± 15 μs 435 ± 16 μs 0.986 ± 0.1
linearoperators/GetIndex/adjoint 867 ± 52 μs 800 ± 40 μs 1.08 ± 0.17
linearoperators/GetIndex/forward 592 ± 42 μs 529 ± 41 μs 1.12 ± 0.23
linearoperators/LBFGS/mul 53.5 ± 2.1 μs 53.5 ± 2.3 μs 0.999 ± 0.12
linearoperators/LBFGS/update 12.2 ± 0.99 μs 12.3 ± 0.85 μs 0.987 ± 0.21
linearoperators/LMatrixOp/adjoint 243 ± 15 μs 232 ± 20 μs 1.05 ± 0.22
linearoperators/LMatrixOp/forward 245 ± 24 μs 250 ± 28 μs 0.978 ± 0.29
linearoperators/MatrixOp/adjoint 183 ± 5 μs 184 ± 5.9 μs 0.996 ± 0.084
linearoperators/MatrixOp/forward 188 ± 4.8 μs 189 ± 5.7 μs 0.994 ± 0.078
linearoperators/MyLinOp/adjoint 255 ± 14 μs 257 ± 13 μs 0.99 ± 0.14
linearoperators/MyLinOp/forward 263 ± 17 μs 268 ± 17 μs 0.981 ± 0.18
linearoperators/Variation/adjoint-single 864 ± 2.1 μs 864 ± 2 μs 1 ± 0.0068
linearoperators/Variation/forward-single 403 ± 30 μs 82 ± 7.2 μs 4.91 ± 1.1 🚀
linearoperators/ZeroPad/adjoint 922 ± 8.3 μs 143 ± 1.2 μs 6.45 ± 0.16 🚀
linearoperators/ZeroPad/forward 1.83 ± 0.015 ms 64.3 ± 4.2 μs 28.4 ± 3.7 🚀
linearoperators/Zeros/forward 856 ± 69 μs 763 ± 30 μs 1.12 ± 0.2
nfftoperators/NFFTOp/adjoint 258 ± 7.9 μs 258 ± 8.1 μs 1 ± 0.088
nfftoperators/NFFTOp/forward 214 ± 5.6 μs 214 ± 5.1 μs 0.999 ± 0.071
nonlinearoperators/Atan/forward 442 ± 6.9 μs 441 ± 7.4 μs 1 ± 0.046
nonlinearoperators/Atan/jacobian-adjoint 13.6 ± 0.76 μs 15.9 ± 1.5 μs 0.855 ± 0.19
nonlinearoperators/Cos/forward 397 ± 6 μs 398 ± 6.3 μs 0.996 ± 0.044
nonlinearoperators/Cos/jacobian-adjoint 426 ± 6.9 μs 428 ± 6.9 μs 0.995 ± 0.046
nonlinearoperators/Exp/forward 519 ± 11 μs 514 ± 6.5 μs 1.01 ± 0.05
nonlinearoperators/Exp/jacobian-adjoint 600 ± 7.4 μs 603 ± 6.7 μs 0.995 ± 0.033
nonlinearoperators/Pow/forward 540 ± 5.9 μs 539 ± 6 μs 1 ± 0.031
nonlinearoperators/Pow/jacobian-adjoint 399 ± 6.2 μs 400 ± 6.7 μs 0.997 ± 0.046
nonlinearoperators/Sech/forward 236 ± 4.9 μs 247 ± 4.8 μs 0.957 ± 0.054
nonlinearoperators/Sech/jacobian-adjoint 657 ± 6.4 μs 660 ± 7 μs 0.994 ± 0.028
nonlinearoperators/Sigmoid/forward 349 ± 5.5 μs 344 ± 5.7 μs 1.02 ± 0.047
nonlinearoperators/Sigmoid/jacobian-adjoint 328 ± 6.2 μs 334 ± 5.9 μs 0.984 ± 0.051
nonlinearoperators/Sin/forward 389 ± 5.6 μs 392 ± 5.7 μs 0.993 ± 0.041
nonlinearoperators/Sin/jacobian-adjoint 414 ± 6.2 μs 415 ± 6.4 μs 0.998 ± 0.043
nonlinearoperators/SoftMax/forward 323 ± 6.2 μs 324 ± 6 μs 0.997 ± 0.053
nonlinearoperators/SoftMax/jacobian-adjoint 390 ± 1.8 ms 366 ± 8.6 μs 1.07e+03 ± 51 🚀
nonlinearoperators/SoftPlus/forward 920 ± 2.1 μs 921 ± 1.6 μs 0.999 ± 0.0057
nonlinearoperators/SoftPlus/jacobian-adjoint 411 ± 6.5 μs 407 ± 6.6 μs 1.01 ± 0.046
nonlinearoperators/Tanh/forward 335 ± 5.6 μs 335 ± 5.7 μs 1 ± 0.048
nonlinearoperators/Tanh/jacobian-adjoint 266 ± 5.5 μs 265 ± 5.3 μs 1 ± 0.058
normaloperators/DFT/mul 4.72 ± 0.19 μs 4.59 ± 0.23 μs 1.03 ± 0.13
normaloperators/DiagOp/mul 256 ± 13 μs 263 ± 19 μs 0.973 ± 0.17
normaloperators/NFFTOp/mul 162 ± 10 μs 162 ± 11 μs 0.999 ± 0.19
waveletoperators/WaveletOp/adjoint 3.13 ± 0.089 ms 3.12 ± 0.092 ms 1 ± 0.082
waveletoperators/WaveletOp/forward 1.22 ± 0.018 ms 1.24 ± 0.03 ms 0.986 ± 0.055
Memory benchmarks
Benchmark base head Ratio (base/head)
batching/SimpleBatchOp/adjoint-single 768 allocs (277.00 KiB) 768 allocs (277.00 KiB) 1
batching/SimpleBatchOp/forward-single 384 allocs (265.00 KiB) 384 allocs (265.00 KiB) 1
batching/SpreadingBatchOp/adjoint-single 416 allocs (140.00 KiB) 418 allocs (140.09 KiB) 0.999
batching/SpreadingBatchOp/forward-single 224 allocs (134.00 KiB) 224 allocs (134.00 KiB) 1
calculus/AffineAdd/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/AffineAdd/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Ax_mul_Bx/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Ax_mul_Bx/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Ax_mul_Bxt/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Ax_mul_Bxt/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Axt_mul_Bx/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Axt_mul_Bx/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/BroadCast/identity-single 1 allocs (48 bytes) 1 allocs (48 bytes) 1
calculus/BroadCast/operator-single-adjoint 31 allocs (1.08 KiB) 31 allocs (1.08 KiB) 1
calculus/BroadCast/operator-single-forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Compose/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Compose/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/DCAT/adjoint 1 allocs (32 bytes) 1 allocs (32 bytes) 1
calculus/DCAT/forward 1 allocs (32 bytes) 1 allocs (32 bytes) 1
calculus/HCAT/adjoint 131 allocs (4.70 KiB) 131 allocs (4.70 KiB) 1
calculus/HCAT/forward 70 allocs (2.52 KiB) 70 allocs (2.52 KiB) 1
calculus/HadamardProd/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/HadamardProd/jacobian-adjoint 6 allocs (512.14 KiB) 6 allocs (512.14 KiB) 1
calculus/Jacobian/sigmoid-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Reshape/adjoint 1 allocs (32 bytes) 1 allocs (32 bytes) 1
calculus/Reshape/forward 1 allocs (32 bytes) 1 allocs (32 bytes) 1
calculus/Scale/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Scale/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Sum/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/Sum/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
calculus/VCAT/adjoint 130 allocs (4.67 KiB) 130 allocs (4.67 KiB) 1
calculus/VCAT/forward 71 allocs (2.55 KiB) 71 allocs (2.55 KiB) 1
dspoperators/Filt/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
dspoperators/Filt/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
dspoperators/MIMOFilt/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
dspoperators/MIMOFilt/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
dspoperators/Xcorr/adjoint 27 allocs (772.30 KiB) 0 allocs (0 bytes) N/A 🚀
dspoperators/Xcorr/forward 36 allocs (3.00 MiB) 0 allocs (0 bytes) N/A 🚀
fftwoperators/DFT/adjoint 45 allocs (258.34 KiB) 45 allocs (258.34 KiB) 1
fftwoperators/DFT/forward 45 allocs (258.34 KiB) 45 allocs (258.34 KiB) 1
linearoperators/DiagOp/adjoint-single 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/DiagOp/adjoint-threaded 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/DiagOp/forward-single 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/DiagOp/forward-threaded 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/Eye/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/FiniteDiff/adjoint 12 allocs (4.00 MiB) 12 allocs (4.00 MiB) 1
linearoperators/FiniteDiff/forward 6 allocs (4.00 MiB) 6 allocs (4.00 MiB) 1
linearoperators/GetIndex/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/GetIndex/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/LBFGS/mul 65 allocs (1.33 KiB) 65 allocs (1.33 KiB) 1
linearoperators/LBFGS/update 3 allocs (48 bytes) 3 allocs (48 bytes) 1
linearoperators/LMatrixOp/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/LMatrixOp/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/MatrixOp/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/MatrixOp/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/MyLinOp/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/MyLinOp/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/Variation/adjoint-single 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/Variation/forward-single 2304 allocs (3.05 MiB) 512 allocs (16.00 KiB) 195 🚀
linearoperators/ZeroPad/adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/ZeroPad/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
linearoperators/Zeros/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nfftoperators/NFFTOp/adjoint 152 allocs (5.28 KiB) 152 allocs (5.28 KiB) 1
nfftoperators/NFFTOp/forward 149 allocs (5.17 KiB) 149 allocs (5.17 KiB) 1
nonlinearoperators/Atan/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Atan/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Cos/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Cos/jacobian-adjoint 6 allocs (512.14 KiB) 6 allocs (512.14 KiB) 1
nonlinearoperators/Exp/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Exp/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Pow/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Pow/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Sech/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Sech/jacobian-adjoint 6 allocs (512.14 KiB) 6 allocs (512.14 KiB) 1
nonlinearoperators/Sigmoid/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Sigmoid/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Sin/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Sin/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/SoftMax/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/SoftMax/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/SoftPlus/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/SoftPlus/jacobian-adjoint 3 allocs (512.07 KiB) 3 allocs (512.07 KiB) 1
nonlinearoperators/Tanh/forward 0 allocs (0 bytes) 0 allocs (0 bytes) 1
nonlinearoperators/Tanh/jacobian-adjoint 0 allocs (0 bytes) 0 allocs (0 bytes) 1
normaloperators/DFT/mul 0 allocs (0 bytes) 0 allocs (0 bytes) 1
normaloperators/DiagOp/mul 0 allocs (0 bytes) 0 allocs (0 bytes) 1
normaloperators/NFFTOp/mul 7 allocs (208 bytes) 7 allocs (208 bytes) 1
waveletoperators/WaveletOp/adjoint 9 allocs (512.34 KiB) 9 allocs (512.34 KiB) 1
waveletoperators/WaveletOp/forward 9 allocs (512.34 KiB) 9 allocs (512.34 KiB) 1

Ratio interpretation: values > 1 mean the PR is faster; values < 1 mean slower.
🚀 significant speedup · 🐢 significant slowdown
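
As a reading aid, the ratio column can be recomputed from the displayed values with a few lines of Julia. This is only a sketch: the helper names are made up here, and because it uses the rounded numbers shown in the table, the results can differ slightly from the reported ratios, which are presumably computed from unrounded timings.

```julia
# Unit factors to convert the table's time units to seconds.
const UNIT_SECONDS = Dict("ns" => 1e-9, "μs" => 1e-6, "ms" => 1e-3, "s" => 1.0)

to_seconds(value, unit) = value * UNIT_SECONDS[unit]

# Ratio as defined in the table header: base time divided by head (PR) time.
speed_ratio(base, head) = to_seconds(base...) / to_seconds(head...)

xcorr_adjoint = speed_ratio((434, "μs"), (52.6, "μs"))      # same units
zeropad_forward = speed_ratio((1.83, "ms"), (64.3, "μs"))   # mixed units
```

Both values come out above 1, matching the 🚀 markers on the dspoperators/Xcorr/adjoint and linearoperators/ZeroPad/forward rows.
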

hakkelt marked this pull request as ready for review May 29, 2026 16:00

hakkelt (Collaborator, Author) commented May 29, 2026

@lostella @nantonel this PR is ready for review. There will be a separate PR for extending test coverage because this PR is already quite large.
