fix(tests): dynamic tolerance for cuDNN TF32 precision on Ampere+ GPUs by Zhaoxian-Wu · Pull Request #771 · IBM/aihwkit

Zhaoxian-Wu · 2026-06-08T23:32:09Z

Summary

This PR addresses issue #766, the CUDA test failures on Ampere+ GPUs caused by the precision mismatch between cuDNN's default TF32 Tensor Core path and the RPU backend's FP32 CUBLAS path.

Instead of globally disabling TF32, this follows the hardware-adaptive direction discussed in the issue:

- add cached CUDA probes for Conv3d and RNN numerical divergence
- derive assert_array_almost_equal decimal tolerances from the measured divergence
- keep CPU tests at the existing strict decimal=6
- relax CUDA tolerances only when the current GPU/cuDNN behavior requires it
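As a rough illustration of the second and third points, the mapping from a measured divergence to a `decimal` argument could look like the sketch below. It is not the code in this PR: the names `decimal_from_divergence` and `decimal_for`, the `floor` of 2, and the `safety` factor of 2.0 are all hypothetical. The only outside fact it relies on is that `numpy.testing.assert_array_almost_equal` accepts values when `abs(desired - actual) < 1.5 * 10**(-decimal)`.

```python
import math

STRICT_DECIMAL = 6  # tolerance the PR keeps for CPU tests


def decimal_from_divergence(max_abs_diff, strict=STRICT_DECIMAL, floor=2, safety=2.0):
    """Return the largest `decimal` that still accepts the measured divergence.

    `floor` and `safety` are illustrative assumptions, not values from the PR.
    """
    if max_abs_diff <= 0.0:
        return strict  # no divergence measured: keep the strict tolerance
    # numpy accepts when abs(desired - actual) < 1.5 * 10**(-decimal),
    # so solve safety * max_abs_diff < 1.5 * 10**(-decimal) for decimal.
    scaled = safety * max_abs_diff / 1.5
    decimal = math.ceil(-math.log10(scaled)) - 1
    # Never exceed the strict value, and never relax below the floor.
    return max(floor, min(strict, decimal))


def decimal_for(use_cuda, measured_divergence):
    """CPU tests stay strict; CUDA tests relax only as far as measured."""
    if not use_cuda:
        return STRICT_DECIMAL
    return decimal_from_divergence(measured_divergence)
```

With these assumed defaults, a divergence of about 1e-3 maps to `decimal=2`, while a divergence of 1e-8 leaves the tolerance at the strict `decimal=6`. The floor means that a genuinely broken result still fails instead of being absorbed by an ever looser tolerance.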

Testing

The affected CUDA tests now pass on my setup, including the previously failing RNN/LSTM and Conv3d cases on TF32-capable GPUs.

Hi @PabloCarmona, with this PR the full test suite passes locally for me. If the tests in the GitHub CI pass as well, I will update the code in my other PRs accordingly. Thanks for your effort in maintaining the library.

PabloCarmona · 2026-06-09T11:48:04Z

Thanks @Zhaoxian-Wu for your work! Alongside this one, please look at the other PRs and sync them with master so the Actions trigger again. They should work now, since I added a fix for some of the lint errors that were arising in the past.

Also run the linting tool to address these:

tests/helpers/testcases.py:56:53: E261 at least two spaces before inline comment
tests/helpers/testcases.py:58:55: E261 at least two spaces before inline comment

Zhaoxian-Wu · 2026-06-10T18:22:43Z

Got it. I forgot to run pycodestyle. This commit should be okay. Could you please trigger the tests again to check whether it's correct? @PabloCarmona

PabloCarmona

Seems OK to me. Please update the rest of the PRs with the tests accordingly and we can finally start merging them. I suppose this one doesn't need to be merged once the other PRs have their test cases updated.

cuDNN defaults to TF32 Tensor Cores on Ampere+ GPUs (sm>=80), causing ~1e-3 divergence vs the RPU backend's FP32 CUBLAS path. The existing hard-coded tolerances (decimal=4/6) fail on H100 (RNN) and Blackwell (RNN + Conv3d).

Add hardware-adaptive probes that measure the actual cuDNN-vs-non-cuDNN divergence at test session start, then derive tolerances from the measured value. CPU tests remain at decimal=6; CUDA tests relax only as much as the current GPU requires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaoxian Wu <wuzhaoxian97@gmail.com>
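A probe of the kind the commit message describes might be sketched as follows. This is an assumption about the general shape, not the probe added by the commit: the function name `conv3d_cudnn_divergence`, the layer sizes, and the input shape are made up for illustration. It runs the same Conv3d layer once with cuDNN enabled and once with cuDNN disabled via `torch.backends.cudnn.flags`, returns the largest absolute difference, and caches the result so the measurement happens once per session. Without PyTorch or a CUDA device it simply reports 0.0.

```python
from functools import lru_cache

try:
    import torch
except ImportError:  # sketch degrades to a no-op when PyTorch is missing
    torch = None


@lru_cache(maxsize=None)
def conv3d_cudnn_divergence(seed=0):
    """Max abs difference between cuDNN and non-cuDNN Conv3d outputs.

    Hypothetical helper: shapes and seed are arbitrary illustration values.
    Returns 0.0 when PyTorch or CUDA is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return 0.0
    torch.manual_seed(seed)
    layer = torch.nn.Conv3d(2, 3, kernel_size=3).cuda()
    x = torch.randn(2, 2, 8, 8, 8, device="cuda")
    with torch.no_grad():
        # cuDNN path (may use TF32 Tensor Cores on Ampere+ GPUs)
        with torch.backends.cudnn.flags(enabled=True):
            y_cudnn = layer(x)
        # Same layer and input with cuDNN turned off
        with torch.backends.cudnn.flags(enabled=False):
            y_ref = layer(x)
    return (y_cudnn - y_ref).abs().max().item()
```

Because of `lru_cache`, repeated calls from different test cases reuse the first measurement rather than re-running the layer.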

Zhaoxian-Wu · 2026-06-20T15:35:40Z

Sure, I've updated

Zhaoxian-Wu force-pushed the fix/test-numerical-precision branch from c55b024 to 76a8e81 on June 8, 2026 23:34

Zhaoxian-Wu force-pushed the fix/test-numerical-precision branch from 76a8e81 to 0078e37 on June 10, 2026 18:18

PabloCarmona approved these changes Jun 19, 2026

Zhaoxian-Wu force-pushed the fix/test-numerical-precision branch from 0078e37 to dc10cef on June 20, 2026 15:32

Zhaoxian-Wu wants to merge 1 commit into IBM:master from Zhaoxian-Wu:fix/test-numerical-precision
