Skip to content

apex.contrib.clip_grad.clip_grad_norm_ crashes with PyTorch 2.10 #8737

@KumoLiu

Description

@KumoLiu

Environment
NGC Base Image: nvcr.io/nvidia/pytorch:25.12-py3
PyTorch: 2.10.0a0+b4e4ee81d3.nv25.12
CUDA: 13
Python: 3.12
apex: as shipped in NGC 25.12

This happens because apex's multi_tensor_applier attempts to access the raw data pointer of gradient tensors, but in PyTorch 2.10 some tensors (e.g., those with lazy/functional storage) no longer expose a traditional storage, causing the operation to fail.


[2026-02-11T15:17:23.401Z] ======================================================================

[2026-02-11T15:17:23.401Z] ERROR: test_ensemble (tests.integration.test_auto3dseg_ensemble.TestEnsembleBuilder.test_ensemble)

[2026-02-11T15:17:23.401Z] ----------------------------------------------------------------------

[2026-02-11T15:17:23.401Z] Traceback (most recent call last):

[2026-02-11T15:17:23.401Z]   File "/home/jenkins/agent/workspace/YunLiu-Monai-pytorch-versions/monai/utils/misc.py", line 894, in run_cmd

[2026-02-11T15:17:23.401Z]     return subprocess.run(cmd_list, **kwargs)

[2026-02-11T15:17:23.401Z]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2026-02-11T15:17:23.401Z]   File "/usr/lib/python3.12/subprocess.py", line 571, in run

[2026-02-11T15:17:23.401Z]     raise CalledProcessError(retcode, process.args,

[2026-02-11T15:17:23.401Z] subprocess.CalledProcessError: Command '['python', '/tmp/tmp7gsbt0qp/workdir/dints_0/scripts/train.py', 'run', "--config_file='/tmp/tmp7gsbt0qp/workdir/dints_0/configs/hyper_parameters.yaml,/tmp/tmp7gsbt0qp/workdir/dints_0/configs/hyper_parameters_search.yaml,/tmp/tmp7gsbt0qp/workdir/dints_0/configs/network.yaml,/tmp/tmp7gsbt0qp/workdir/dints_0/configs/network_search.yaml,/tmp/tmp7gsbt0qp/workdir/dints_0/configs/transforms_infer.yaml,/tmp/tmp7gsbt0qp/workdir/dints_0/configs/transforms_train.yaml,/tmp/tmp7gsbt0qp/workdir/dints_0/configs/transforms_validate.yaml'", '--training#num_images_per_batch=2', '--training#num_epochs=2', '--training#num_epochs_per_validation=1']' returned non-zero exit status 1.

[2026-02-11T15:17:23.401Z] 

[2026-02-11T15:17:23.401Z] The above exception was the direct cause of the following exception:

[2026-02-11T15:17:23.401Z] 

[2026-02-11T15:17:23.401Z] Traceback (most recent call last):

[2026-02-11T15:17:23.401Z]   File "/home/jenkins/agent/workspace/YunLiu-Monai-pytorch-versions/tests/integration/test_auto3dseg_ensemble.py", line 167, in test_ensemble

[2026-02-11T15:17:23.401Z]     algo.train(_train_param)

[2026-02-11T15:17:23.401Z]   File "/tmp/tmpiqosipwp/workdir/algorithm_templates/dints/scripts/algo.py", line 497, in train

[2026-02-11T15:17:23.401Z]   File "/home/jenkins/agent/workspace/YunLiu-Monai-pytorch-versions/monai/apps/auto3dseg/bundle_gen.py", line 277, in _run_cmd

[2026-02-11T15:17:23.401Z]     return run_cmd(cmd.split(), run_cmd_verbose=True, env=ps_environ, check=True)

[2026-02-11T15:17:23.401Z]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2026-02-11T15:17:23.401Z]   File "/home/jenkins/agent/workspace/YunLiu-Monai-pytorch-versions/monai/utils/misc.py", line 898, in run_cmd

[2026-02-11T15:17:23.401Z]     raise RuntimeError(f"subprocess call error {e.returncode}: {errors}, {output}") from e

[2026-02-11T15:17:23.401Z] RuntimeError: subprocess call error 1: ERROR:torch_tensorrt._utils:CUDA 13 is not currently supported for TRT-LLM plugins. Please install pytorch with CUDA 12.x support

[2026-02-11T15:17:23.401Z] monai.transforms.spatial.dictionary Orientationd.__init__:labels: Current default value of argument `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` was changed in version None from `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` to `labels=None`. Default value changed to None meaning that the transform now uses the 'space' of a meta-tensor, if applicable, to determine appropriate axis labels.

[2026-02-11T15:17:23.401Z] The filesystem tracking backend (e.g., './mlruns') will be deprecated in February 2026. Consider transitioning to a database backend (e.g., 'sqlite:///mlflow.db') to take advantage of the latest MLflow features. See https://github.com/mlflow/mlflow/issues/18534 for more details and migration guidance. For migrating existing data, https://github.com/mlflow/mlflow-export-import can be used.

[2026-02-11T15:17:23.401Z] 2026/02/11 14:47:11 INFO mlflow.tracking.fluent: Experiment with name 'Auto3DSeg' does not exist. Creating a new experiment.

[2026-02-11T15:17:23.401Z] 
[2026-02-11T15:17:23.401Z] dints_0 - training ...:   0%|          | 0/2 [00:00<?, ?round/s]
[2026-02-11T15:17:23.401Z] dints_0 - training ...:   0%|          | 0/2 [00:01<?, ?round/s]

[2026-02-11T15:17:23.401Z] Traceback (most recent call last):

[2026-02-11T15:17:23.401Z]   File "/tmp/tmp7gsbt0qp/workdir/dints_0/scripts/train.py", line 1001, in <module>

[2026-02-11T15:17:23.401Z]     fire.Fire()

[2026-02-11T15:17:23.401Z]   File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 135, in Fire

[2026-02-11T15:17:23.401Z]     component_trace = _Fire(component, args, parsed_flag_args, context, name)

[2026-02-11T15:17:23.401Z]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2026-02-11T15:17:23.401Z]   File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 468, in _Fire

[2026-02-11T15:17:23.401Z]     component, remaining_args = _CallAndUpdateTrace(

[2026-02-11T15:17:23.401Z]                                 ^^^^^^^^^^^^^^^^^^^^

[2026-02-11T15:17:23.401Z]   File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace

[2026-02-11T15:17:23.401Z]     component = fn(*varargs, **kwargs)

[2026-02-11T15:17:23.401Z]                 ^^^^^^^^^^^^^^^^^^^^^^

[2026-02-11T15:17:23.401Z]   File "/tmp/tmp7gsbt0qp/workdir/dints_0/scripts/train.py", line 606, in run

[2026-02-11T15:17:23.401Z]     clip_grad_norm_(model.parameters(), 0.5)

[2026-02-11T15:17:23.401Z]   File "/usr/local/lib/python3.12/dist-packages/apex/contrib/clip_grad/clip_grad.py", line 80, in clip_grad_norm_

[2026-02-11T15:17:23.401Z]     multi_tensor_applier(

[2026-02-11T15:17:23.401Z]   File "/usr/local/lib/python3.12/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__

[2026-02-11T15:17:23.401Z]     return op(self.chunk_size,

[2026-02-11T15:17:23.401Z]            ^^^^^^^^^^^^^^^^^^^

[2026-02-11T15:17:23.401Z] RuntimeError: Cannot access data pointer of Tensor that doesn't have storage

[2026-02-11T15:17:23.401Z] , 

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions