From 67b327cd58234851ed8e13cc6b6eb4967e3d6521 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 21:28:31 +0000
Subject: [PATCH 01/15] draft design

---
 docs/CUDA_cuDNN_Optional_Design.md | 722 +++++++++++++++++++++++++++++
 1 file changed, 722 insertions(+)
 create mode 100644 docs/CUDA_cuDNN_Optional_Design.md

diff --git a/docs/CUDA_cuDNN_Optional_Design.md b/docs/CUDA_cuDNN_Optional_Design.md
new file mode 100644
index 0000000000000..129e2d0fe1922
--- /dev/null
+++ b/docs/CUDA_cuDNN_Optional_Design.md
@@ -0,0 +1,722 @@
+# Making cuDNN Optional for the CUDA Execution Provider
+
+Status: Draft / Proposal
+Owner: (CUDA EP)
+Scope: `onnxruntime/core/providers/cuda` (main static CUDA EP), CUDA Plugin EP
+(`BUILD_CUDA_EP_AS_PLUGIN`), and the CUDA unit tests. **Out of scope:** TensorRT EP and
+NV‑TensorRT‑RTX EP (they create and own their own cuDNN handles and inherently depend on
+cuDNN/TensorRT). The existing build-time `USE_CUDA_MINIMAL` path is also out of scope; it is
+used by TensorRT / NV‑TensorRT‑RTX integration and should remain available.
+
+---
+
+## 1. Motivation
+
+Today the CUDA EP has a **hard link‑time and load‑time dependency** on cuDNN
+(`libcudnn*.so` / `cudnn64_*.dll`) and on the header‑only `cudnn_frontend` library. If the
+cuDNN shared libraries are not present on the machine, the ORT shared library
+(`libonnxruntime.so` / `onnxruntime.dll`, or the CUDA provider DLL) **fails to load at all** —
+even for models that use no cuDNN‑backed operators.
+
+cuDNN is large (hundreds of MB across its sub‑libraries) and is only needed by a subset of
+operators (Conv, Pooling, BatchNorm, LRN, RNN/LSTM/GRU, the cuDNN reduction path, the cuDNN
+softmax path, etc.). Many transformer / LLM models do not require any of these.
+
+**Goal:** Ship a *single* CUDA EP binary that
+
+1. loads and runs **without** cuDNN present, and
+2. **lazily** attempts to load cuDNN at first use; if cuDNN is available, cuDNN‑backed
+   operators work exactly as today; if it is absent, those operators fail with a clear,
+   actionable error (Phase 1) and, incrementally, fall back to native CUDA kernels
+   (Phases 2 and 3).
+
+This is delivered in **three phases**:
+
+- **Phase 1 — Make cuDNN optional.** No new compute kernels. When cuDNN is missing, any op
+  that needs it throws a clear `NOT_IMPLEMENTED` ("cuDNN is required for operator X but was
+  not found") error at `Run` time. Everything that does not need cuDNN runs normally.
+- **Phase 2 — Remove LLM‑relevant cuDNN dependencies.** Replace the cuDNN Softmax /
+  LogSoftmax and reduction paths with existing native CUDA / CUB paths. Phase 1 + Phase 2 is
+  the first milestone because LLM models may use these ops.
+- **Phase 3 — Replace the remaining cuDNN‑backed NN ops.** Add native CUDA / CUTLASS /
+  Triton‑cubin fallbacks for Pooling, normalization, LRN, Conv, ConvTranspose, and FusedConv.
+
+---
+
+## 2. Current state (as of this writing)
+
+### 2.1 How cuDNN is linked
+
+- `cmake/onnxruntime_providers_cuda.cmake`:
+  `target_link_libraries(... CUDNN::cudnn_all cudnn_frontend ...)` — a normal link dependency,
+  resolved at process load time.
+- `cmake/onnxruntime_providers_cuda_plugin.cmake`: same (`CUDNN::cudnn_all`, `cudnn_frontend`).
+- `cmake/onnxruntime_unittests.cmake`: links `cudnn_frontend` and includes `CUDNN_INCLUDE_DIR`.
+- `cmake/deps.txt`: pins `cudnn_frontend` v1.x (header‑only C++ wrapper over the cuDNN v9
+  *backend* API).
+- `cmake/external/cudnn_frontend.cmake` fetches `cudnn_frontend`, sets `CUDNN_PATH` from
+  `onnxruntime_CUDNN_HOME`, disables its samples/tests/python bindings, and marks its headers
+  as system includes. In ORT this is a **compile-time header dependency**, not a runtime DLL.
+- There is **no delay‑load** configured for cuDNN today (delay‑load is only used for DML /
+  WebGPU / a few Win32 API sets).
+
+### 2.2 How the handle is created and reached
+
+- The cuDNN handle is created **eagerly**:
+  - `CudaStream` constructor — `cudnnCreate(&cudnn_handle_)` / `cudnnSetStream(...)`
+    (`onnxruntime/core/providers/cuda/cuda_stream_handle.cc`).
+  - `CUDAExecutionProvider::PerThreadContext` — `cudnnCreate(&cudnn_handle_)`
+    (`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`).
+  - CUDA Plugin EP — `cuda_stream_plugin.cc` and `cuda_kernel_adapter.h`.
+- Kernels obtain the handle through:
+  - `CudaKernel::GetCudnnHandle(context)` → `stream->cudnn_handle_`
+    (`onnxruntime/core/providers/cuda/cuda_kernel.h`).
+  - `CUDAExecutionProvider::PerThreadDefaultCudnnHandle()`.
+  - The public `CudaContext` resource API (`cuda_context.h`,
+    `CudaResource::cudnn_handle_t`) — used by custom ops.
+
+### 2.3 Call sites and macros
+
+- `CUDNN_CALL` / `CUDNN_CALL_THROW` (in `shared_inc/cuda_call.h`) and
+  `CUDNN_RETURN_IF_ERROR` / `CUDNN2_RETURN_IF_ERROR` (in `cuda_common.h`).
+- `CUDNN_FE_CALL` / `CUDNN_FE_RETURN_IF_ERROR` for the frontend (`cudnn_fe_call.*`).
+- `CudaErrString<cudnnStatus_t>` calls `cudnnGetErrorString` (`cuda_call.cc`).
+- All of the above are already gated by `#ifndef USE_CUDA_MINIMAL` in shared CUDA code. That
+  build‑time path is used by TensorRT / NV‑TensorRT‑RTX related builds and is a useful
+  inventory of cuDNN‑touching code, but it is *not* the CUDA EP runtime behavior we want.
+
+### 2.4 Operators / components that depend on cuDNN
+
+| Area | Files | cuDNN usage |
+|---|---|---|
+| Conv / ConvTranspose (v9 graph) | `nn/conv.cc`, `nn/conv_transpose.cc` | `cudnn_frontend` graph API + `cudnnAddTensor` |
+| Conv / ConvTranspose (legacy) | `nn/conv_8.h`, `nn/conv_transpose_8.h` | `cudnnConvolutionForward`, `cudnnConvolutionBackwardData`, algo search |
+| FusedConv (contrib) | `contrib_ops/cuda/fused_conv.cc` | `cudnnConvolutionBiasActivationForward`, activation desc |
+| Pooling | `nn/pool.cc` | `cudnnPoolingForward`, pooling desc |
+| BatchNormalization | `nn/batch_norm.cc` | `cudnnBatchNormalizationForwardInference/Training` |
+| InstanceNormalization | `nn/instance_norm.cc` | BatchNorm training helper |
+| LRN | `nn/lrn.cc` | `cudnnLRNCrossChannelForward` |
+| RNN / LSTM / GRU | `rnn/cudnn_rnn_base.*`, `rnn/{rnn,lstm,gru}.h` | `cudnnRNNForward`, RNN/dropout descriptors |
+| Reductions | `reduction/reduction_ops.*` | `cudnnReduceTensor` (Reduce\*, ArgMax/ArgMin) |
+| Softmax (cuDNN path) | `math/softmax_common.cc` | `cudnnSoftmaxForward/Backward` |
+| Einsum | `math/einsum_utils/*` | passes cuDNN handle into helpers |
+| Dropout descriptor | `cudnn_common.h` (`CudnnDropout`) | used by RNN |
+| Tensor/Filter descriptors | `cudnn_common.*` | `cudnnCreate*Descriptor`, `cudnnSet*Descriptor` |
+| Contrib attention (optional) | `contrib_ops/cuda/bert/group_query_attention.cc`, `quantization/attention_quantization.cc`, `math/bias_softmax.cc` | optional cuDNN flash attention / handle passthrough |
+
+> Note: Several of these ops already have, or can trivially get, a **non‑cuDNN** path
+> (e.g. Softmax has a native warp/block kernel; reductions have CUB‑based paths; pooling and
+> simple elementwise norms are straightforward). Softmax and reductions are Phase‑2
+> candidates; the remaining NN ops are Phase‑3 candidates.
+
+### 2.5 `cudnn_frontend` usage
+
+`cudnn_frontend` is currently used as a header-only C++ graph API wrapper around cuDNN's
+backend API:
+
+- `onnxruntime/core/providers/cuda/nn/conv.h` includes `<cudnn_frontend.h>` and stores
+  `cudnn_frontend::graph::Graph`, `Tensor_attributes`, `Pointwise_attributes`, and variant
+  packs in `CudnnConvState`.
+- `onnxruntime/core/providers/cuda/nn/conv.cc` builds cuDNN frontend graphs for v9 Conv,
+  optional bias fusion, optional activation fusion, heuristic selection, support checks, plan
+  building, workspace sizing, and graph execution.
+- `onnxruntime/core/providers/cuda/nn/conv_transpose.cc` does the same for ConvTranspose
+  using `Conv_dgrad_attributes`.
+- `onnxruntime/core/providers/cuda/cudnn_common.{h,cc}` defines `CudnnFeTensor`, a small ORT
+  helper that maps ORT tensor shapes/types into `cudnn_frontend::graph::Tensor_attributes`.
+- `onnxruntime/core/providers/cuda/shared_inc/cudnn_fe_call.h` and
+  `onnxruntime/core/providers/cuda/cudnn_fe_call.cc` adapt `cudnn_frontend::error_t` into
+  ORT's `CudaCall` error-handling path.
+- `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` has a cuDNN SDPA feature path
+  selected through kernel options, but it does not directly include the high-level frontend
+  graph headers in the same way Conv/ConvTranspose do.
+
+Important frontend detail: `cudnn_frontend` already has a dynamic-loading mode gated by
+`NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING`. In that mode, `cudnn_frontend_shim.h` does **not**
+link directly against `cudnnBackend*` symbols. Instead, it expects the embedding library to
+define `cudnn_frontend::cudnn_dlhandle`, then resolves symbols from that handle with
+`dlsym` / `GetProcAddress`.
+
+The frontend backend-symbol surface used by ORT's current graph path includes at least:
+
+- `cudnnGetVersion`, `cudnnGetErrorString`, and for cuDNN 9, `cudnnGetLastErrorString`.
+- `cudnnBackendCreateDescriptor`, `cudnnBackendDestroyDescriptor`,
+  `cudnnBackendSetAttribute`, `cudnnBackendGetAttribute`, and `cudnnBackendFinalize`.
+- `cudnnBackendExecute`.
+- Version-gated helpers such as `cudnnBackendPopulateCudaGraph`,
+  `cudnnBackendUpdateCudaGraph`, and `cudnnGetExecutionPlanWorkspaceSize` if ORT starts using
+  frontend features that require them.
+
+Therefore, `cudnn_frontend` should stay in the build while ORT still has cuDNN frontend
+Conv/ConvTranspose paths, but it should be compiled in dynamic-loading mode and wired to the
+same ORT-owned cuDNN loader used by direct cuDNN calls.
+
+---
+
+## 3. Design overview
+
+The core idea: **break the hard dependency by routing every cuDNN symbol through a thin,
+lazily‑resolved trampoline layer**, plus an availability flag the EP and kernels consult.
+
+```mermaid
+flowchart TD
+    K[CUDA kernel<br/>e.g. Conv, Pool] -->|cudnnXxx(...)| S[cuDNN shim<br/>trampolines]
+  FE[cudnn_frontend<br/>header-only] -->|dynamic mode dlsym(cudnn_dlhandle)| L
+    S -->|first call: dlopen/LoadLibrary| L[cuDNN loader]
+    L -->|present| R[(real libcudnn*)]
+    L -->|absent| U[mark unavailable<br/>return sentinel status]
+    K -.->|enable_cudnn && IsCudnnAvailable()?| L
+```
+
+### 3.1 The shim (no hard link)
+
+We **stop linking** `CUDNN::cudnn_all`. In its place we compile a generated translation unit,
+`cudnn_stub.cc`, that **defines every direct cuDNN entry point ORT references**. Each
+definition is a trampoline:
+
+```cpp
+// Pseudocode for one entry
+cudnnStatus_t cudnnConvolutionForward(cudnnHandle_t h, /* ... */) {
+  auto fn = CudnnLibrary::Get().convolution_forward;  // resolved lazily
+  if (fn == nullptr) return CUDNN_STATUS_NOT_INITIALIZED;  // cuDNN unavailable
+  return fn(h, /* ... */);
+}
+```
+
+Because the trampolines have the **exact** cuDNN symbol names and signatures, ORT's direct
+calls link against *our* definitions — so the final binary has **no `NEEDED`/import entry for
+libcudnn** from those calls.
+
+`cudnn_frontend` is handled separately: compile it with
+`NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` and define `cudnn_frontend::cudnn_dlhandle` in ORT.
+When the ORT loader successfully opens cuDNN, it sets that handle to the loaded cuDNN library
+handle. `cudnn_frontend` then resolves `cudnnBackend*` symbols from the same handle. When
+`enable_cudnn=0` or cuDNN is unavailable, ORT must guard all frontend graph-build/execute
+entry points before calling frontend APIs so the frontend shim never tries to resolve symbols
+from a null handle.
+
+This means `cudnn_frontend` stays a compile‑time‑only dependency. We still need its headers
+and the cuDNN headers to build, but not cuDNN import libraries at link time.
+
+A single loader object owns the `dlopen`/`LoadLibrary` handles and the resolved function
+pointers:
+
+```cpp
+class CudnnLibrary {            // onnxruntime/core/providers/cuda/cudnn_loader.{h,cc}
+ public:
+  static CudnnLibrary& Get();   // thread-safe singleton (std::call_once)
+  bool Available() const;       // true iff all required libs + symbols resolved
+  // ... function pointer members, one per cuDNN entry ORT/frontend uses ...
+};
+```
+
+Loader responsibilities:
+
+- On first use, attempt to load the cuDNN runtime. For cuDNN **9**, this is a small set of
+  sub‑libraries (`libcudnn.so.9` umbrella plus, depending on packaging,
+  `libcudnn_graph`, `libcudnn_ops`, `libcudnn_cnn`, `libcudnn_engines_*`). On Windows the
+  corresponding `cudnn*64_9.dll` set. We load the umbrella `libcudnn` first; cuDNN itself
+  dlopens its sub‑libraries.
+- Resolve each required symbol with `dlsym` / `GetProcAddress`.
+- If the umbrella library or any **required** symbol is missing, set `available_ = false`.
+- Honor an optional runtime directory provider option, `cudnn_path`, before falling back to
+  the default OS/library search paths. This mirrors the role of `onnxruntime_CUDNN_HOME` at
+  build time, but is deliberately runtime-only and points to the directory that contains the
+  cuDNN shared libraries (or to a cuDNN root directory with `bin` / `lib` children).
+
+On Windows, cuDNN 9 is split into multiple DLLs. The loader should not rely on the process
+working directory or global `PATH`. If `cudnn_path` is set, first add that directory to the
+DLL search path for this load, or load the required cuDNN DLLs from that directory in a known
+order before loading `cudnn64_9.dll`. This is the C++ equivalent of the Python package's
+`preload_dlls()` behavior. On Linux, prefer loading the umbrella `libcudnn.so.9` from
+`cudnn_path` and let cuDNN resolve its own sub-libraries, matching the current Python preload
+behavior.
+
+The loader must not run when the CUDA provider option `enable_cudnn=0` is set (see §3.3).
+This keeps "force no cuDNN" tests deterministic even on machines where cuDNN is installed.
+
+**Symbol manifest.** The set of symbols is finite and enumerable (see §2.4 plus
+`cudnn_common.*` and `cudnn_rnn_base.*`). We maintain direct ORT cuDNN calls as a single
+header list (`cudnn_symbols.inc`, an X‑macro list) consumed by both the loader and the stub
+generator so the two never drift. Frontend backend symbols are resolved by
+`cudnn_frontend`'s own dynamic-loading shim from the same ORT-owned cuDNN handle; maintain a
+separate frontend-symbol audit list for testing and version checks.
+
+> **Alternative considered (delay‑load only):** Windows `/DELAYLOAD:cudnn*.dll` gets us lazy
+> load on Windows, but Linux has no equivalent, and `cudnn_frontend` requires explicit dynamic
+> loading support to avoid backend API imports. Rejected because the request requires
+> identical behavior on Linux and Windows from a single binary. The direct-call trampoline
+> plus frontend dynamic-loading approach is uniform across both.
+
+### 3.2 Availability flag and handle lifecycle
+
+- `cudnnCreate` is only invoked through the shim. The eager `cudnnCreate` calls in
+  `CudaStream` / `PerThreadContext` / plugin stream become **conditional and non‑fatal**:
+  - Attempt `CudnnLibrary::Get().Available()`; if false, leave `cudnn_handle_ == nullptr` and
+    **do not throw**.
+  - If true, create the handle as today.
+- `GetCudnnHandle(context)` returns `nullptr` when cuDNN is unavailable (it already returns a
+  raw handle; today it's never null).
+- New helpers in `cuda_common.h`:
+
+```cpp
+bool CudnnAvailable(const OpKernelContext* context);  // provider option && runtime availability
+
+// For kernels: fail fast with a clear message if cuDNN is required but missing.
+#define ORT_RETURN_IF_CUDNN_UNAVAILABLE(context, op_name)                         \
+  ORT_RETURN_IF_ERROR(::onnxruntime::cuda::CheckCudnnAvailable(context, op_name))
+```
+
+- `CudaErrString<cudnnStatus_t>` must not call `cudnnGetErrorString` when cuDNN is unavailable
+  (route through the shim, which returns a static string in that case).
+
+### 3.3 CUDA provider options: `enable_cudnn` and `cudnn_path`
+
+cuDNN can be disabled explicitly with a CUDA provider option:
+
+```text
+enable_cudnn = 1  # default: try to load and use cuDNN when it is present
+enable_cudnn = 0  # do not load cuDNN; force native CUDA paths / Phase-1 NOT_IMPLEMENTED
+cudnn_path = /path/to/cudnn/lib-or-bin  # optional: runtime search directory for cuDNN DLLs/SOs
+```
+
+`enable_cudnn` and `cudnn_path` serve different purposes:
+
+- `enable_cudnn` is the policy switch. When it is `0`, ORT must not attempt to load cuDNN,
+  even if `cudnn_path` is set.
+- `cudnn_path` is a location hint. When `enable_cudnn=1`, the lazy loader searches this
+  directory first, then falls back to default OS/library paths. It should not force cuDNN to
+  be required; if the directory is missing cuDNN, the loader reports cuDNN unavailable and
+  Phase-1/2/3 behavior proceeds normally.
+- `cudnn_path` accepts a directory, not a single library file. The implementation may accept
+  either the directory that directly contains the shared libraries (`bin` on Windows, `lib`
+  on Linux package layouts) or a cuDNN root directory and internally probe common children
+  such as `bin`, `lib`, and `lib64`.
+
+Implementation details:
+
+- Add `constexpr const char* kEnableCudnn = "enable_cudnn"` in
+  `cuda::provider_option_names`.
+- Add `constexpr const char* kCudnnPath = "cudnn_path"` in `cuda::provider_option_names`.
+- Add `bool enable_cudnn{true};` to `CUDAExecutionProviderInfo`.
+- Add `std::string cudnn_path;` to `CUDAExecutionProviderInfo`.
+- Parse it with `ProviderOptionsParser::AddAssignmentToReference(...)` in
+  `CUDAExecutionProviderInfo::FromProviderOptions(...)`.
+- Emit both values from `CUDAExecutionProviderInfo::ToProviderOptions(...)`.
+- Include both values in `std::hash<CUDAExecutionProviderInfo>` because they change the EP
+  behavior.
+- Do **not** add a field to `OrtCUDAProviderOptionsV2` for Phase 1. That struct is public C
+  ABI surface; string-key provider options are sufficient and can be set through existing
+  provider-options APIs.
+- Add an EP helper such as `CUDAExecutionProvider::IsCudnnEnabled()` or
+  `CudaKernel::IsCudnnEnabled()` so kernels can distinguish:
+  - cuDNN disabled by user (`enable_cudnn=0`), and
+  - cuDNN enabled but unavailable at runtime.
+
+The effective condition for cuDNN use is:
+
+```text
+effective_cudnn_available = info.enable_cudnn && CudnnLibrary::Get().Available()
+```
+
+If `enable_cudnn=0`, ORT must not call `dlopen` / `LoadLibrary` for cuDNN and must not create
+a cuDNN handle. If `enable_cudnn=1` and `cudnn_path` is empty, ORT uses the default search
+behavior. If `cudnn_path` is non-empty, ORT searches it first.
+
+### 3.4 Phase 1 fallback behavior (chosen: throw at Run time)
+
+Per the agreed design, in Phase 1 cuDNN‑dependent kernels **remain registered** but **fail
+fast** when executed without cuDNN. The check is added at the top of each cuDNN op's
+`ComputeInternal` (or centralized in shared base helpers such as `CudnnConvState` setup,
+`CudnnRnnBase`, the reduction helper, etc.):
+
+```cpp
+Status Conv<T>::ComputeInternal(OpKernelContext* context) const {
+  ORT_RETURN_IF_CUDNN_UNAVAILABLE(context, "Conv");
+  // ... existing cuDNN path ...
+}
+```
+
+The guard should include the reason in the message:
+
+- `enable_cudnn=0`: "Operator 'Conv' on the CUDA EP requires cuDNN, but cuDNN was disabled
+  by the CUDA provider option 'enable_cudnn'."
+- cuDNN missing: "Operator 'Conv' on the CUDA EP requires cuDNN, but cuDNN was not found at
+  runtime. Install cuDNN, or disable CUDA execution for this op/model."
+
+Rationale for "throw" over "don't register / fall back to CPU":
+
+- Keeps kernel registration tables identical regardless of runtime cuDNN presence (no
+  divergence between build/load configurations; simpler, lower‑risk).
+- Produces a clear, attributable error instead of silent CPU fallback that can mask perf
+  cliffs.
+- CPU fallback for individual nodes is still achievable by the user via EP assignment; we are
+  not removing that option, only not making it implicit.
+
+### 3.5 Builds in scope
+
+- **Main static CUDA EP** — primary target; shim + loader compiled in.
+- **CUDA Plugin EP** (`BUILD_CUDA_EP_AS_PLUGIN`) — same shim/loader; the plugin's
+  `cuda_stream_plugin.cc` / `cuda_kernel_adapter.h` handle creation becomes conditional.
+- **Unit tests** — link the shim instead of cuDNN; add tests that exercise both
+  cuDNN‑present and cuDNN‑absent behavior (the latter by forcing the loader into the
+  unavailable state, see §7).
+
+Python wheel packaging is unchanged by this design: cuDNN DLLs are not packed in the wheel
+today, so Phase 1 is not introducing a new "CUDA-minimal" wheel flavor. The runtime loader
+simply makes the existing package tolerant of environments where cuDNN is absent.
+
+TensorRT / NV‑RTX EPs are untouched and continue to link cuDNN as before. (If both a TRT EP
+and the shimmed CUDA EP are in the same process, symbol collision must be avoided — see
+§8 Risks.)
+
+---
+
+## 4. Phase 1 — Make cuDNN optional (no new kernels)
+
+**Outcome:** ORT CUDA EP loads and runs without cuDNN. cuDNN‑backed ops throw a clear
+`NOT_IMPLEMENTED` error when cuDNN is absent; everything else runs normally. When cuDNN *is*
+present, behavior is byte‑for‑byte identical to today.
+
+### 4.1 Task breakdown
+
+1. **Symbol inventory & manifest.**
+   - Enumerate every direct `cudnn*` symbol referenced by ORT code (`cudnnCreate`,
+     descriptor APIs, legacy conv APIs, BN/LRN/pooling/reduction/softmax APIs, RNN APIs,
+     etc.).
+   - Capture direct calls as `cudnn_symbols.inc` (X‑macro: name, return type, signature).
+   - Separately audit the `cudnn_frontend` backend-symbol surface resolved by its dynamic
+     shim: `cudnnGetVersion`, `cudnnGetErrorString`, `cudnnBackend*` descriptor APIs,
+     `cudnnBackendExecute`, and version-gated graph helpers such as
+     `cudnnBackendPopulateCudaGraph`, `cudnnBackendUpdateCudaGraph`, and
+     `cudnnGetExecutionPlanWorkspaceSize`.
+   - *Verification:* link a probe binary that references only the direct-call manifest and
+     diff against `nm -D`/`dumpbin` of the real cuDNN to ensure completeness; add a frontend
+     graph-build/execute smoke test to prove `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` resolves
+     backend symbols through ORT's loaded cuDNN handle.
+
+2. **Loader (`cudnn_loader.{h,cc}`).**
+   - `dlopen`/`LoadLibrary` of the cuDNN umbrella lib with versioned name candidates
+     (`libcudnn.so.9`, `libcudnn.so`, `cudnn64_9.dll`, …).
+   - Search `cudnn_path` first when it is set. Accept a directory that directly contains the
+     cuDNN runtime libraries, and optionally probe `bin`, `lib`, and `lib64` if the value is a
+     cuDNN root directory.
+   - On Windows, handle cuDNN 9 sub-DLL discovery explicitly: either add the chosen cuDNN
+     directory to the DLL search path for the duration of the load, or preload required
+     cuDNN sub-DLLs in dependency order before loading `cudnn64_9.dll`.
+   - Resolve all manifest symbols; populate function‑pointer table.
+   - `Available()` + thread‑safe one‑time init; report whether cuDNN was loaded from
+     `cudnn_path` or from the default search path for diagnostics.
+   - Expose the raw library handle to `cudnn_frontend` dynamic-loading mode.
+   - Define and maintain `cudnn_frontend::cudnn_dlhandle` in one ORT translation unit when
+     `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` is enabled. Set it to the loader's cuDNN handle
+     after a successful load; keep it null when cuDNN is disabled or unavailable.
+   - Add an explicit "disabled" path: when the EP has `enable_cudnn=0`, skip loader
+     initialization entirely and report "disabled by provider option" to the error helper.
+
+3. **Stub/trampoline TU (`cudnn_stub.cc`).**
+   - Generate one trampoline per manifest entry forwarding to the loader's pointer; return a
+     sentinel `cudnnStatus_t` when unavailable.
+   - Handle the few non‑`cudnnStatus_t` entries (`cudnnGetErrorString`, `cudnnGetVersion`).
+
+  The stubs must be compiled into the same target that currently links cuDNN. For Linux,
+  prefer hidden visibility for the stub definitions where possible to avoid exporting cuDNN
+  names from ORT provider binaries. The loader should use `RTLD_LOCAL` when opening cuDNN.
+
+4. **CMake changes.**
+   - Remove `CUDNN::cudnn_all` from `target_link_libraries` for the CUDA EP, plugin EP, and
+     tests; **keep** `CUDNN_INCLUDE_DIR` (headers) and `cudnn_frontend` (headers).
+   - Compile `cudnn_stub.cc` + `cudnn_loader.cc` into the EP.
+   - Compile CUDA EP targets that include `cudnn_frontend` with
+     `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING`. This is required so frontend graph code uses
+     `dlsym` / `GetProcAddress` on `cudnn_frontend::cudnn_dlhandle` instead of creating
+     link-time imports for `cudnnBackend*` symbols.
+   - Keep `USE_CUDA_MINIMAL` working. It is used by TensorRT / NV‑TensorRT‑RTX related
+     builds and is not replaced by the optional-cuDNN runtime shim.
+
+   Current CMake anchor points:
+
+   - `cmake/onnxruntime_providers_cuda.cmake`: replace `CUDNN::cudnn_all` with the shim
+     sources/library while retaining `include(cudnn_frontend)` and the cuDNN include dirs;
+     add `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` for the provider target.
+   - `cmake/onnxruntime_providers_cuda_plugin.cmake`: same for
+     `onnxruntime_providers_cuda_plugin`.
+   - `cmake/onnxruntime_unittests.cmake`: unit tests should link against the shim path, not
+     against cuDNN import libraries, and should use the same frontend dynamic-loading define.
+   - `cmake/onnxruntime_python.cmake`: on Windows, the generated `version_info.py` currently
+     searches for `cudnn64_*.dll` and fails if it is missing. That fatal check must be
+     relaxed because cuDNN is optional; `cudnn_version` should be omitted, set to `None`, or
+     set to `"optional"` when no DLL is found.
+
+5. **Conditional handle creation.**
+   - `cuda_stream_handle.cc`, `cuda_execution_provider.cc`, `cuda_stream_plugin.cc`,
+     `cuda_kernel_adapter.h`: create the handle only if
+     `info.enable_cudnn && CudnnLibrary::Get().Available()`, otherwise leave it null and do
+     not throw.
+   - When `info.enable_cudnn` is false, skip the loader call entirely.
+   - External/custom-op resource behavior: `CudaResource::cudnn_handle_t` may be `nullptr`.
+     Any internal custom-op adapter that assumes a non-null handle must return the same clear
+     cuDNN-required error.
+
+6. **Guard all cuDNN op entry points.**
+   - Add `ORT_RETURN_IF_CUDNN_UNAVAILABLE(context, "<Op>")` or an equivalent helper to the
+     `ComputeInternal` of every op in the §2.4 table (centralize in shared bases where
+     possible: `CudnnRnnBase`, conv state setup, reduction helper, pooling,
+     batch/instance norm, LRN, cuDNN softmax path).
+   - For the cuDNN **softmax** and **reduction** paths that *already* have native
+     alternatives, prefer routing to the native path when cuDNN is absent instead of
+     throwing (Phase‑2 work; see §5). Otherwise throw.
+   - Guard frontend graph creation as well as frontend graph execution. `cudnn_frontend`'s
+     dynamic shim throws if it cannot resolve backend symbols, so ORT should fail with the
+     clearer provider-option / cuDNN-missing message before calling `validate()`,
+     `build_operation_graph()`, `create_execution_plans()`, `check_support()`,
+     `build_plans()`, or `execute()`.
+
+7. **Provider-option plumbing.**
+  - Add and parse `enable_cudnn` and `cudnn_path` in `CUDAExecutionProviderInfo`.
+  - Return them from `GetProviderOptions()` / `ToProviderOptions()`.
+  - Include them in the EP hash.
+  - Add tests for parsing: `enable_cudnn` default true, `"0"` false, `"1"` true, invalid
+    values rejected; `cudnn_path` default empty and round-trips as a string.
+
+8. **Error‑string safety.**
+   - Make `CudaErrString<cudnnStatus_t>` shim‑safe.
+  - Make `CudaErrString<cudnn_frontend::error_t>` report frontend dynamic-loading failures
+    without assuming cuDNN is available.
+
+9. **Docs & messaging.**
+   - Document the new behavior, `enable_cudnn`, and `cudnn_path`.
+   - Update Python package guidance for `onnxruntime.preload_dlls(cuda=True, cudnn=True,
+     directory=...)`: users can still preload a known cuDNN directory, but preloading is now
+     optional for CUDA EP load because the provider itself lazy-loads cuDNN.
+   - Update `onnxruntime/__init__.py` behavior as needed so missing cuDNN does not produce a
+     scary install warning by default in optional-cuDNN packages. If the user explicitly calls
+     `preload_dlls(cudnn=True)`, keep diagnostics useful and include the missing DLL name.
+
+10. **CI workflow for no-cuDNN builds.**
+    - Add a focused CI workflow/job that configures and builds the CUDA EP without cuDNN
+      import libraries available at link time. It should still provide cuDNN headers, because
+      the optional-cuDNN design remains source-compatible with cuDNN APIs and
+      `cudnn_frontend` headers.
+    - The job should verify the produced CUDA provider binary has no direct cuDNN runtime
+      dependency (`readelf -d` / `ldd` on Linux, `dumpbin /dependents` on Windows).
+    - Run at least a smoke test that imports ORT, initializes the CUDA EP, and executes a
+      non-cuDNN CUDA model with cuDNN runtime libraries absent from the runtime library path.
+    - Run a negative smoke test for one cuDNN-backed op, such as Conv, and assert the clear
+      `NOT_IMPLEMENTED` error rather than a dynamic-loader failure.
+    - Start with Linux CUDA CI, then add the equivalent Windows CUDA CI leg once the Windows
+      sub-DLL search behavior is implemented and stable.
+
+### 4.2 Acceptance criteria (Phase 1)
+
+- With cuDNN **removed** from the system:
+  - `libonnxruntime`/CUDA provider loads; a model with no cuDNN ops runs correctly on CUDA.
+  - A model with a cuDNN op (e.g. Conv) fails with the clear `NOT_IMPLEMENTED` message, not a
+    crash or loader error.
+- With cuDNN **present** and `enable_cudnn=0`:
+  - ORT does not load cuDNN or create a cuDNN handle.
+  - Phase‑1 cuDNN ops fail with the "disabled by provider option" `NOT_IMPLEMENTED` message.
+  - Phase‑2 / Phase‑3 native fallback ops run through the native path once implemented.
+- With cuDNN **present**: full existing test suite passes unchanged (no perf/accuracy
+  regression).
+- Both main CUDA EP and plugin EP build and pass.
+- A dedicated no-cuDNN CI job builds the CUDA EP without cuDNN import libraries, confirms the
+  provider binary has no direct cuDNN runtime dependency, and runs the no-cuDNN smoke tests.
+
+---
+
+## 5. Phase 2 — Replace LLM‑relevant cuDNN paths
+
+**Outcome:** LLM‑focused CUDA workloads can run without cuDNN for common Softmax /
+LogSoftmax and reduction patterns. This phase is part of the first milestone with Phase 1:
+Phase 1 makes cuDNN optional, and Phase 2 removes the cuDNN dependency from ops that LLM
+models may still use.
+
+### 5.1 Scope
+
+1. **Softmax / LogSoftmax** — native warp/block kernels already exist; make them the default
+   and drop the cuDNN path (or keep cuDNN only as an opt‑in fast path).
+2. **Reductions / ArgMax / ArgMin** — CUB‑based implementations; remove the
+   `cudnnReduceTensor` dependency.
+
+### 5.2 Mechanism
+
+For these ops, prefer the native implementation regardless of cuDNN availability, unless a
+specific cuDNN fast path is intentionally kept behind an opt‑in provider option:
+
+```text
+use native CUDA / CUB implementation
+optional: if provider option requests cuDNN fast path and cuDNN is available, use cuDNN
+```
+
+The important Phase‑2 property is that these ops must not require a non-null cuDNN handle.
+If `enable_cudnn=0`, or if cuDNN is absent, they should still run through the native path.
+
+### 5.3 Acceptance criteria (Phase 2, per op)
+
+- Native path matches cuDNN within tolerance on the op's existing unit tests.
+- With cuDNN absent, the op runs (no `NOT_IMPLEMENTED`).
+- With cuDNN present, no regression in correctness; any retained cuDNN fast path is explicit
+  and test-covered.
+- With `enable_cudnn=0`, no dynamic cuDNN load is attempted.
+
+---
+
+## 6. Phase 3 — Replace remaining cuDNN‑backed NN ops
+
+**Outcome:** Broader CNN / vision-style CUDA workloads can run without cuDNN where practical.
+These ops are outside the first Phase 1 + Phase 2 milestone because they are less central to
+LLM workloads and, for convolution, much more expensive to replace well.
+
+### 6.1 Scope
+
+1. **Pooling** (Max/Average, global variants) — straightforward native kernels.
+2. **BatchNormalization / InstanceNormalization (inference)** — elementwise affine over
+  precomputed stats; native kernel is simple.
+3. **LRN** — native kernel.
+4. **Conv / ConvTranspose / FusedConv** — the hard part. Options:
+  - implicit‑GEMM CUDA kernels for common cases,
+  - CUTLASS conv,
+  - precompiled Triton conv cubins,
+  - or im2col + existing GEMM as a correctness fallback.
+  Keep cuDNN as the preferred fast path when available; native kernel as fallback.
+
+RNN / LSTM / GRU are intentionally not part of the Phase‑3 scope for now. They are the
+heaviest to replace and may remain cuDNN‑only with a clear `NOT_IMPLEMENTED` when cuDNN is
+absent unless there is product demand.
+
+### 6.2 Mechanism
+
+For each Phase‑3 op, introduce a dispatch at `ComputeInternal`:
+
+```text
+if info.enable_cudnn and CudnnLibrary::Get().Available() and (cuDNN path preferred): use cuDNN
+else: use native CUDA / CUTLASS / Triton-cubin fallback
+```
+
+This preserves cuDNN performance where present while removing the hard requirement.
+Triton cubins (see `docs/ORT_Use_Triton_Kernel.md`) are an option for fused/normalization
+kernels; CUTLASS (already vendored) for conv/GEMM‑shaped work.
+
+### 6.3 Acceptance criteria (Phase 3, per op)
+
+- Native path matches cuDNN within tolerance on the op's existing unit tests.
+- With cuDNN absent, the op runs (no `NOT_IMPLEMENTED`).
+- With cuDNN present, no regression (cuDNN path still selected unless configured otherwise).
+- With `enable_cudnn=0`, the op uses the native fallback and does not load cuDNN.
+
+---
+
+## 7. Testing strategy
+
+- **Loader unit tests:** simulate cuDNN present vs absent.
+  - "Absent" is forced via a test hook on `CudnnLibrary` (e.g. an internal
+    `SetForceUnavailableForTest(true)` compiled only in test/internal builds), avoiding the
+    need to physically remove cuDNN in CI.
+  - "Disabled" is tested with CUDA provider option `enable_cudnn=0`; this should not touch
+    the dynamic loader at all.
+  - `cudnn_path` is tested with a temporary directory / fake loader hook to verify search
+    precedence, missing-directory handling, and that `enable_cudnn=0` suppresses all loads
+    even when `cudnn_path` is set.
+- **Op‑level tests:** for each cuDNN op, assert the clear `NOT_IMPLEMENTED` error in the
+  forced‑absent mode (Phase 1), and correctness in present mode.
+- **cuDNN frontend tests:** add a Conv / ConvTranspose test that exercises frontend graph
+  creation and execution with cuDNN present, and verifies that `enable_cudnn=0` fails before
+  `cudnn_frontend` attempts to resolve backend symbols.
+- **Binary dependency tests:** inspect the CUDA provider binary (`ldd` / `readelf -d` on
+  Linux, `dumpbin /dependents` on Windows) and confirm there is no direct dependency on
+  `libcudnn*` / `cudnn64_*.dll` even though `cudnn_frontend` headers are compiled in.
+- **Phase‑2 native path tests:** run Softmax / LogSoftmax and reduction tests with
+  `enable_cudnn=0` and with the forced‑absent loader hook. These should pass, not throw.
+- **Phase‑3 native path tests:** as each Phase‑3 op is implemented, add the same
+  `enable_cudnn=0` / forced‑absent coverage for that op.
+- **Load test:** a process that initializes the CUDA EP with the cuDNN libs unavailable and
+  runs a non‑cuDNN model end‑to‑end.
+- **No-cuDNN CI workflow:** add a workflow/job, initially in Linux CUDA CI, that removes cuDNN
+  import libraries from the link/runtime environment while keeping cuDNN headers available.
+  It should build the CUDA EP, inspect dynamic dependencies, run the non-cuDNN smoke model,
+  and verify a cuDNN-backed op fails with ORT's `NOT_IMPLEMENTED` message. Add the Windows
+  equivalent after the Windows cuDNN sub-DLL loading path is covered.
+- **Python preload tests:** verify `onnxruntime.preload_dlls(cudnn=True, directory=...)`
+  still preloads a user-provided cuDNN directory, while normal `import onnxruntime` and CUDA
+  EP initialization do not fail or print install guidance solely because cuDNN is missing.
+- **Regression:** full existing CUDA suite with cuDNN present must stay green.
+- CI legs that genuinely lack cuDNN can additionally validate the real (not forced) path.
+
+---
+
+## 8. Risks and mitigations
+
+- **Symbol‑manifest completeness for direct cuDNN calls.** Missing a direct-call symbol in
+  `cudnn_symbols.inc` would cause an unresolved-symbol link error or a runtime trampoline
+  failure. *Mitigation:* the probe-binary diff in §4.1-1, and CI that builds with cuDNN
+  import libs unavailable.
+- **`cudnn_frontend` dynamic-loading integration.** If
+  `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` is not applied consistently, frontend graph code may
+  still create imports for `cudnnBackend*` symbols. If `cudnn_frontend::cudnn_dlhandle` is not
+  defined/set by ORT, frontend calls will throw while resolving backend symbols. *Mitigation:*
+  apply the compile definition to every target that includes frontend headers, define the
+  global handle once in ORT, and add binary dependency plus Conv/ConvTranspose frontend smoke
+  tests.
+- **`cudnn_frontend` may add new backend symbols across dependency bumps.** A future
+  `cudnn_frontend` version may resolve additional backend, CUDA, CUDART, NVRTC, or experimental
+  symbols. *Mitigation:* after every `cudnn_frontend` update, grep/audit
+  `cudnn_frontend_shim.h` and experimental shims, then update the frontend-symbol audit tests.
+- **cuDNN sub‑library packaging differences (v9 split libs; distro/conda/pip layouts).**
+  *Mitigation:* provide the `cudnn_path` provider option. On Linux, load only the umbrella
+  `libcudnn` and let cuDNN resolve its own sub‑libs. On Windows, add the chosen directory to
+  the DLL search path or preload required cuDNN sub-DLLs before `cudnn64_9.dll`, following the
+  same ordering already encoded in Python `preload_dlls()`.
+- **Python preload behavior can conflict with optional cuDNN.** Today
+  `onnxruntime.preload_dlls(cudnn=True)` tries to load cuDNN and prints installation guidance
+  on failure. That is useful when the user requested preloading, but too alarming if cuDNN is
+  optional. *Mitigation:* keep explicit preloading available, but avoid invoking or requiring
+  cuDNN preload as part of normal optional-cuDNN package import / CUDA EP load. Update docs so
+  users who need a specific cuDNN directory can use either provider option `cudnn_path` or
+  `preload_dlls(cudnn=True, directory=...)` before creating the session.
+- **Python version metadata currently assumes cuDNN on Windows CUDA builds.**
+  `cmake/onnxruntime_python.cmake` fails if no `cudnn64_*.dll` is found when generating
+  `version_info.py`. *Mitigation:* make `cudnn_version` optional in that generated file.
+- **Symbol collision when a cuDNN‑linking EP (TensorRT) and the shimmed CUDA EP coexist in
+  one process.** Our trampolines define real cuDNN symbol names; if TRT's cuDNN is also
+  loaded, the dynamic linker could bind either. *Mitigation:* keep the shim symbols with
+  internal/hidden visibility where possible and resolve the real library explicitly via the
+  loader (`RTLD_LOCAL`); document the constraint; consider a build option that keeps the
+  classic hard‑link behavior for TRT‑combined packages.
+- **ABI/version skew** (built against cuDNN headers vN, loads runtime vM). *Mitigation:*
+  check `cudnnGetVersion` in the loader and refuse (mark unavailable) on incompatible major
+  versions.
+- **Performance:** one extra indirect call per cuDNN entry — negligible relative to kernel
+  cost.
+
+---
+
+## 9. Backward compatibility
+
+- Default behavior with cuDNN installed is unchanged (same kernels, same perf).
+- The build still requires cuDNN **headers** (and `cudnn_frontend` headers) to compile; it no
+  longer requires cuDNN **import libraries** to link the in‑scope targets.
+- `cudnn_frontend` remains a compile-time dependency while ORT uses frontend graph APIs for
+  Conv / ConvTranspose. It should not introduce a runtime cuDNN dependency when compiled with
+  `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` and wired to ORT's loader handle.
+- The wheel package is unchanged: cuDNN DLLs are not packed into the wheel today, and this
+  design does not introduce a separate CUDA-minimal wheel.
+- `onnxruntime.preload_dlls()` remains supported for users who want Python to preload CUDA /
+  cuDNN libraries from PyTorch, NVIDIA site packages, or an explicit directory. It becomes an
+  optional convenience path for cuDNN, not a requirement for importing ORT or initializing the
+  CUDA EP without cuDNN.
+- No public API change is required for Phase 1. The custom‑op `CudaContext::cudnn_handle`
+  may now be `nullptr`; this is documented, and `FetchResource` returns null gracefully.
+
+---
+
+## 10. Resolved decisions
+
+- Force-disabling cuDNN is a **CUDA provider option**, not a session option. Use
+  `enable_cudnn=0`.
+- Providing a custom cuDNN runtime directory is also a **CUDA provider option**. Use
+  `cudnn_path=<directory>` while keeping `enable_cudnn=1`.
+- No new "CUDA-minimal" wheel is required for Phase 1. cuDNN DLLs are not packed in the wheel
+  today.
+- Keep the existing `USE_CUDA_MINIMAL` build-time path. It is used by RTX/TensorRT-related EP
+  builds and is not replaced by the CUDA EP runtime shim.

From 41457e3276067a03f22dc181e3db42df1fcba75e Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 23:43:28 +0000
Subject: [PATCH 02/15] draft of phase 1

---
 .github/workflows/linux_cuda_no_cudnn.yml     | 118 ++++++
 cmake/onnxruntime_providers_cuda.cmake        |   4 +-
 cmake/onnxruntime_providers_cuda_plugin.cmake |   9 +-
 cmake/onnxruntime_python.cmake                |  11 +-
 onnxruntime/__init__.py                       |  16 +-
 onnxruntime/core/providers/cuda/cuda_call.cc  |  19 +
 .../providers/cuda/cuda_execution_provider.cc |  25 +-
 .../providers/cuda/cuda_execution_provider.h  |   3 +-
 .../cuda/cuda_execution_provider_info.cc      |   8 +
 .../cuda/cuda_execution_provider_info.h       |   6 +
 onnxruntime/core/providers/cuda/cuda_kernel.h |  16 +-
 .../core/providers/cuda/cuda_stream_handle.cc |  15 +-
 .../core/providers/cuda/cudnn_fe_call.cc      |  13 +
 .../core/providers/cuda/cudnn_loader.cc       | 211 +++++++++
 .../core/providers/cuda/cudnn_loader.h        |  50 +++
 onnxruntime/core/providers/cuda/cudnn_stub.cc | 401 ++++++++++++++++++
 .../core/providers/cuda/plugin/cuda_ep.cc     |   6 +-
 .../core/providers/cuda/plugin/cuda_ep.h      |   2 +
 .../providers/cuda/plugin/cuda_ep_factory.cc  |  22 +-
 .../cuda/plugin/cuda_kernel_adapter.h         |  39 +-
 .../providers/cuda/plugin/cuda_plugin_utils.h |  17 +
 .../cuda/plugin/cuda_stream_plugin.cc         |  14 +-
 .../cuda/plugin/cuda_stream_plugin.h          |   3 +-
 .../test/python/onnxruntime_test_python.py    |   4 +
 24 files changed, 978 insertions(+), 54 deletions(-)
 create mode 100644 .github/workflows/linux_cuda_no_cudnn.yml
 create mode 100644 onnxruntime/core/providers/cuda/cudnn_loader.cc
 create mode 100644 onnxruntime/core/providers/cuda/cudnn_loader.h
 create mode 100644 onnxruntime/core/providers/cuda/cudnn_stub.cc

diff --git a/.github/workflows/linux_cuda_no_cudnn.yml b/.github/workflows/linux_cuda_no_cudnn.yml
new file mode 100644
index 0000000000000..deee62a643578
--- /dev/null
+++ b/.github/workflows/linux_cuda_no_cudnn.yml
@@ -0,0 +1,118 @@
+name: Linux CUDA No cuDNN CI
+
+on:
+  pull_request:
+    branches: [main, 'rel-*']
+    paths:
+      - '.github/workflows/linux_cuda_no_cudnn.yml'
+      - 'cmake/onnxruntime_providers_cuda.cmake'
+      - 'cmake/onnxruntime_python.cmake'
+      - 'onnxruntime/__init__.py'
+      - 'onnxruntime/core/providers/cuda/**'
+  workflow_dispatch:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event_name == 'pull_request' && github.ref || github.sha }}
+  cancel-in-progress: true
+
+permissions:
+  contents: read
+  packages: write
+  attestations: write
+  id-token: write
+
+jobs:
+  build-linux-cuda-no-cudnn-x64-release:
+    name: Build Linux CUDA x64 Release without cuDNN link
+    uses: ./.github/workflows/reusable_linux_build.yml
+    with:
+      pool_name: "onnxruntime-github-Ubuntu2204-AMD-CPU"
+      build_config: Release
+      architecture: x64
+      dockerfile_path: tools/ci_build/github/linux/docker/Dockerfile.manylinux2_28_cuda
+      docker_build_args: '--build-arg BASEIMAGE=onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda13_x64_almalinux8_gcc14:20251107.1'
+      docker_image_repo: onnxruntimecuda13manylinuxbuild
+      extra_build_flags: '--use_binskim_compliant_compile_flags --parallel --nvcc_threads 4 --flash_nvcc_threads 4 --cuda_version=13.0 --cuda_home=/usr/local/cuda-13.0 --cudnn_home=/usr/local/cuda-13.0 --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=86 onnxruntime_BUILD_UNIT_TESTS=ON'
+      python_path_prefix: 'PATH=/opt/python/cp310-cp310/bin:$PATH'
+      run_tests: false
+      upload_build_output: true
+      execution_providers: 'cuda'
+      job_identifier: build-linux-cuda-no-cudnn-x64-release
+    secrets:
+      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+  smoke-linux-cuda-no-cudnn-x64-release:
+    name: Smoke Linux CUDA x64 Release without cuDNN runtime use
+    needs: build-linux-cuda-no-cudnn-x64-release
+    runs-on:
+      - self-hosted
+      - "1ES.Pool=onnxruntime-github-linux-a10"
+      - "1ES.ImageOverride=onnxruntime-ubuntu2204-CUDA-A10-Test"
+      - "JobId=smoke-linux-cuda-no-cudnn-x64-release-${{ github.run_id }}-${{ github.run_number }}-${{ github.run_attempt }}"
+    permissions:
+      contents: read
+      packages: read
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v6
+
+      - uses: microsoft/onnxruntime-github-actions/build-docker-image@8bad63a3c05d448311dfa8e5f531171c97471aa1
+        id: build_docker_image_step
+        with:
+          dockerfile: ${{ github.workspace }}/tools/ci_build/github/linux/docker/Dockerfile.manylinux2_28_cuda
+          image-name: ghcr.io/microsoft/onnxruntime/onnxruntimecuda13manylinuxbuild
+          build-args: '--build-arg BASEIMAGE=onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda13_x64_almalinux8_gcc14:20251107.1'
+          push: true
+          azure-container-registry-name: onnxruntimebuildcache
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Download Build Artifact
+        uses: actions/download-artifact@v7
+        with:
+          name: build-output-x64-Release
+          path: ${{ runner.temp }}/Release
+
+      - name: Restore Executable Permissions
+        working-directory: ${{ runner.temp }}/Release
+        run: |
+          if [ -f perms.txt ]; then
+            while IFS= read -r file; do
+              if [ -f "$file" ]; then
+                chmod +x "$file"
+              fi
+            done < perms.txt
+          fi
+
+      - name: Verify CUDA provider has no direct cuDNN dependency
+        run: |
+          docker run --rm --gpus all \
+            -v "${{ runner.temp }}/Release:/onnxruntime_src/build/Release" \
+            "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
+            bash -lc 'ldd /onnxruntime_src/build/Release/libonnxruntime_providers_cuda.so | tee /tmp/ldd.txt && ! grep -i cudnn /tmp/ldd.txt'
+
+      - name: Run no-cuDNN CUDA EP smoke test
+        run: |
+          docker run --rm --gpus all \
+            -v "${{ runner.temp }}/Release:/onnxruntime_src/build/Release" \
+            "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
+            bash -lc 'PATH=/opt/python/cp310-cp310/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python - <<"PY"
+          import numpy as np
+          import onnx
+          import onnxruntime as ort
+          from onnx import TensorProto, helper
+
+          x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [2, 3])
+          y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [2, 3])
+          node = helper.make_node("Add", ["x", "x"], ["y"])
+          graph = helper.make_graph([node], "cuda_no_cudnn_smoke", [x], [y])
+          model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
+          model.ir_version = 10
+
+          providers = [("CUDAExecutionProvider", {"enable_cudnn": "0"})]
+          sess = ort.InferenceSession(model.SerializeToString(), providers=providers)
+          data = np.arange(6, dtype=np.float32).reshape(2, 3)
+          result = sess.run(None, {"x": data})[0]
+          np.testing.assert_allclose(result, data + data)
+          print("CUDA no-cuDNN smoke test passed")
+          PY'
diff --git a/cmake/onnxruntime_providers_cuda.cmake b/cmake/onnxruntime_providers_cuda.cmake
index f692f1f5e0a57..fdc3fe8786425 100644
--- a/cmake/onnxruntime_providers_cuda.cmake
+++ b/cmake/onnxruntime_providers_cuda.cmake
@@ -399,7 +399,9 @@
           message( WARNING "To compile with NHWC ops enabled please compile against cuDNN 9 or newer." )
         endif()
       endif()
-      target_link_libraries(${target} PRIVATE CUDA::cublasLt CUDA::cublas CUDNN::cudnn_all cudnn_frontend CUDA::curand CUDA::cufft CUDA::cudart CUDA::nvrtc CUDA::cuda_driver
+      target_compile_definitions(${target} PRIVATE NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING)
+      target_include_directories(${target} PRIVATE ${CUDNN_INCLUDE_DIR})
+      target_link_libraries(${target} PRIVATE CUDA::cublasLt CUDA::cublas cudnn_frontend CUDA::curand CUDA::cufft CUDA::cudart CUDA::nvrtc CUDA::cuda_driver
               ${ABSEIL_LIBS} ${ONNXRUNTIME_PROVIDERS_SHARED} Boost::mp11 safeint_interface)
     endif()
 
diff --git a/cmake/onnxruntime_providers_cuda_plugin.cmake b/cmake/onnxruntime_providers_cuda_plugin.cmake
index 86e5579eb6761..f2c53aab31b39 100644
--- a/cmake/onnxruntime_providers_cuda_plugin.cmake
+++ b/cmake/onnxruntime_providers_cuda_plugin.cmake
@@ -348,12 +348,12 @@ endif()
 set(CUDA_PLUGIN_CUDNN_INCLUDE_DIR ${CUDNN_INCLUDE_DIR})
 set(CUDA_PLUGIN_CUDNN_LIBRARY ${cudnn_LIBRARY})
 
-if(NOT CUDA_PLUGIN_CUDNN_INCLUDE_DIR OR NOT CUDA_PLUGIN_CUDNN_LIBRARY)
-  message(FATAL_ERROR "cuDNN not found (from main ORT search) for CUDA Plugin EP.")
+if(NOT CUDA_PLUGIN_CUDNN_INCLUDE_DIR)
+  message(FATAL_ERROR "cuDNN headers not found (from main ORT search) for CUDA Plugin EP.")
 endif()
 
 message(STATUS "CUDA Plugin EP: cuDNN include: ${CUDA_PLUGIN_CUDNN_INCLUDE_DIR}")
-message(STATUS "CUDA Plugin EP: cuDNN library: ${CUDA_PLUGIN_CUDNN_LIBRARY}")
+message(STATUS "CUDA Plugin EP: cuDNN runtime library: ${CUDA_PLUGIN_CUDNN_LIBRARY}")
 
 # Include directories — only public ORT headers + CUDA toolkit + cuDNN + internal headers for adapter
 target_include_directories(onnxruntime_providers_cuda_plugin PRIVATE
@@ -388,7 +388,6 @@ target_link_libraries(onnxruntime_providers_cuda_plugin PRIVATE
     CUDA::cufft
     CUDA::nvrtc
     CUDA::cuda_driver
-    CUDNN::cudnn_all
     cudnn_frontend
     Boost::mp11
     safeint_interface
@@ -403,6 +402,8 @@ target_link_libraries(onnxruntime_providers_cuda_plugin PRIVATE
     ${PROTOBUF_LIB}
 )
 
+  target_compile_definitions(onnxruntime_providers_cuda_plugin PRIVATE NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING)
+
 if (onnxruntime_ENABLE_CUDA_PROFILING)
     target_link_libraries(onnxruntime_providers_cuda_plugin PRIVATE CUDA::cupti)
     target_compile_definitions(onnxruntime_providers_cuda_plugin PRIVATE ENABLE_CUDA_PROFILING)
diff --git a/cmake/onnxruntime_python.cmake b/cmake/onnxruntime_python.cmake
index de1d7559a1572..2b30ef0f239d5 100644
--- a/cmake/onnxruntime_python.cmake
+++ b/cmake/onnxruntime_python.cmake
@@ -339,16 +339,19 @@ if (WIN32)
         endif()
       endforeach()
       if(NOT CUDNN_DLL_PATH)
-        message(FATAL_ERROR "cuDNN not found in ${onnxruntime_CUDNN_HOME}")
+        message(STATUS "cuDNN not found in ${onnxruntime_CUDNN_HOME}. Python package metadata will record an empty cuDNN version.")
       endif()
     else()
       file(GLOB CUDNN_DLL_PATH "${onnxruntime_CUDA_HOME}/bin/cudnn64_*.dll")
       if (NOT CUDNN_DLL_PATH)
-        message(FATAL_ERROR "cuDNN not found in ${onnxruntime_CUDA_HOME}")
+        message(STATUS "cuDNN not found in ${onnxruntime_CUDA_HOME}. Python package metadata will record an empty cuDNN version.")
       endif()
     endif()
-    get_filename_component(CUDNN_DLL_NAME ${CUDNN_DLL_PATH} NAME_WE)
-    string(REPLACE "cudnn64_" "" CUDNN_VERSION "${CUDNN_DLL_NAME}")
+    set(CUDNN_VERSION "")
+    if(CUDNN_DLL_PATH)
+      get_filename_component(CUDNN_DLL_NAME ${CUDNN_DLL_PATH} NAME_WE)
+      string(REPLACE "cudnn64_" "" CUDNN_VERSION "${CUDNN_DLL_NAME}")
+    endif()
     if(NOT onnxruntime_CUDA_VERSION)
       set(onnxruntime_CUDA_VERSION ${CUDAToolkit_VERSION})
       message("onnxruntime_CUDA_VERSION=${onnxruntime_CUDA_VERSION}")
diff --git a/onnxruntime/__init__.py b/onnxruntime/__init__.py
index df14bc8c57f24..00a737243ddaf 100644
--- a/onnxruntime/__init__.py
+++ b/onnxruntime/__init__.py
@@ -291,14 +291,14 @@ def is_target_dll(path: str):
 
 
 def preload_dlls(cuda: bool = True, cudnn: bool = True, msvc: bool = True, directory=None):
-    """Preload CUDA 12.x+ and cuDNN 9.x DLLs in Windows or Linux, and MSVC runtime DLLs in Windows.
+    """Preload CUDA 12.x+ and optional cuDNN 9.x DLLs in Windows or Linux, and MSVC runtime DLLs in Windows.
 
        When the installed PyTorch is compatible (using same major version of CUDA and cuDNN),
        there is no need to call this function if `import torch` is done before `import onnxruntime`.
 
     Args:
         cuda (bool, optional): enable loading CUDA DLLs. Defaults to True.
-        cudnn (bool, optional): enable loading cuDNN DLLs. Defaults to True.
+        cudnn (bool, optional): enable loading cuDNN DLLs. Defaults to True. Missing cuDNN DLLs are ignored.
         msvc (bool, optional): enable loading MSVC DLLs in Windows. Defaults to True.
         directory(str, optional): a directory contains CUDA or cuDNN DLLs. It can be an absolute path,
            or a path relative to the directory of this file.
@@ -391,6 +391,7 @@ def preload_dlls(cuda: bool = True, cudnn: bool = True, msvc: bool = True, direc
 
     # Try load DLLs from nvidia site packages.
     dll_paths = _get_nvidia_dll_paths(is_windows, cuda, cudnn)
+    optional_dll_filenames = {relative_path[-1] for relative_path in _get_nvidia_dll_paths(is_windows, False, cudnn)}
     loaded_dlls = []
     for relative_path in dll_paths:
         dll_path = (
@@ -403,11 +404,8 @@ def preload_dlls(cuda: bool = True, cudnn: bool = True, msvc: bool = True, direc
                 _ = ctypes.CDLL(dll_path)
                 loaded_dlls.append(relative_path[-1])
             except Exception as e:
-                print(f"Failed to load {dll_path}: {e}")
-
-    # cuDNN DLLs that only exist in newer cuDNN releases (e.g. >= 9.23) and are
-    # optional for inference. Missing them on older cuDNN must not be treated as a failure.
-    _optional_dll_filenames = {"cudnn_engines_tensor_ir64_9.dll"}
+                if relative_path[-1] not in optional_dll_filenames:
+                    print(f"Failed to load {dll_path}: {e}")
 
     # Try load DLLs with default path settings.
     has_failure = False
@@ -417,9 +415,9 @@ def preload_dlls(cuda: bool = True, cudnn: bool = True, msvc: bool = True, direc
             try:
                 _ = ctypes.CDLL(dll_filename)
             except Exception as e:
-                if dll_filename not in _optional_dll_filenames:
+                if dll_filename not in optional_dll_filenames:
                     has_failure = True
                     print(f"Failed to load {dll_filename}: {e}")
 
     if has_failure:
-        print("Please follow https://onnxruntime.ai/docs/install/#cuda-and-cudnn to install CUDA and CuDNN.")
+        print("Please follow https://onnxruntime.ai/docs/install/#cuda-and-cudnn to install CUDA.")
diff --git a/onnxruntime/core/providers/cuda/cuda_call.cc b/onnxruntime/core/providers/cuda/cuda_call.cc
index b9e909714f1c0..d27dbf266d183 100644
--- a/onnxruntime/core/providers/cuda/cuda_call.cc
+++ b/onnxruntime/core/providers/cuda/cuda_call.cc
@@ -9,6 +9,11 @@
 #else
 #include <core/platform/env.h>
 #endif
+#ifndef USE_CUDA_MINIMAL
+#include "core/providers/cuda/cudnn_loader.h"
+#endif
+
+#include <type_traits>
 
 #ifdef _WIN32
 #else  // POSIX
@@ -89,6 +94,20 @@ std::conditional_t<THRW, void, Status> CudaCall(
     ERRTYPE retCode, const char* exprString, const char* libName, SUCCTYPE successCode, const char* msg,
     const char* file, const int line) {
   if (retCode != successCode) {
+#ifndef USE_CUDA_MINIMAL
+    if constexpr (std::is_same_v<ERRTYPE, cudnnStatus_t>) {
+      if (!cuda::CudnnLibrary::Get().Available()) {
+        auto status = ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
+                                      "cuDNN is unavailable for CUDA Execution Provider: ",
+                                      cuda::CudnnLibrary::Get().Error());
+        if constexpr (THRW) {
+          ORT_THROW(status.ErrorMessage());
+        } else {
+          return status;
+        }
+      }
+    }
+#endif
     try {
 #ifdef _WIN32
       std::string hostname_str = GetEnvironmentVar("COMPUTERNAME");
diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
index db24f43223613..db7d278bc454d 100755
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
@@ -18,6 +18,7 @@
 #include "core/providers/cuda/gpu_data_transfer.h"
 #include "core/providers/cuda/cuda_profiler.h"
 #include "core/providers/cuda/cuda_mempool_arena.h"
+#include "core/providers/cuda/cudnn_loader.h"
 #include "core/session/onnxruntime_run_options_config_keys.h"
 
 #ifndef USE_CUDA_MINIMAL
@@ -223,18 +224,19 @@ AllocatorPtr CUDAExecutionProvider::CreateCudaPinnedAllocator(const CUDAAllocato
   return CreateAllocator(pinned_memory_info);
 }
 
-CUDAExecutionProvider::PerThreadContext::PerThreadContext(OrtDevice::DeviceId device_id, cudaStream_t stream, size_t /*gpu_mem_limit*/,
-                                                          ArenaExtendStrategy /*arena_extend_strategy*/, CUDAExecutionProviderExternalAllocatorInfo /*external_allocator_info*/,
-                                                          OrtArenaCfg* /*default_memory_arena_cfg*/) {
+CUDAExecutionProvider::PerThreadContext::PerThreadContext(OrtDevice::DeviceId device_id, cudaStream_t stream,
+                                                          const CUDAExecutionProviderInfo& info) {
   CUDA_CALL_THROW(cudaSetDevice(device_id));
 #ifndef USE_CUDA_MINIMAL
   CUBLAS_CALL_THROW(cublasCreate(&cublas_handle_));
   CUBLAS_CALL_THROW(cublasLtCreate(&cublas_lt_handle_));
   CUBLAS_CALL_THROW(cublasSetStream(cublas_handle_, stream));
 
-  CUDNN_CALL_THROW(cudnnCreate(&cudnn_handle_));
-  CUDNN_CALL_THROW(cudnnSetStream(cudnn_handle_, stream));
-  LOGS_DEFAULT(INFO) << "cuDNN version: " << cudnnGetVersion();
+  if (info.enable_cudnn && cuda::CudnnLibrary::Get().Available()) {
+    CUDNN_CALL_THROW(cudnnCreate(&cudnn_handle_));
+    CUDNN_CALL_THROW(cudnnSetStream(cudnn_handle_, stream));
+    LOGS_DEFAULT(INFO) << "cuDNN version: " << cudnnGetVersion();
+  }
 #endif
   cuda_graph_.SetStream(stream);
 }
@@ -243,7 +245,9 @@ CUDAExecutionProvider::PerThreadContext::~PerThreadContext() {
 #ifndef USE_CUDA_MINIMAL
   ORT_IGNORE_RETURN_VALUE(CUBLAS_CALL(cublasDestroy(cublas_handle_)));
   ORT_IGNORE_RETURN_VALUE(CUBLAS_CALL(cublasLtDestroy(cublas_lt_handle_)));
-  ORT_IGNORE_RETURN_VALUE(CUDNN_CALL(cudnnDestroy(cudnn_handle_)));
+  if (cudnn_handle_ != nullptr) {
+    ORT_IGNORE_RETURN_VALUE(CUDNN_CALL(cudnnDestroy(cudnn_handle_)));
+  }
 #endif
 }
 
@@ -334,6 +338,10 @@ CUDAExecutionProvider::CUDAExecutionProvider(const CUDAExecutionProviderInfo& in
   ORT_ENFORCE(info_.prefer_nhwc == 0, "This build does not support NHWC layout");
 #endif
 
+#ifndef USE_CUDA_MINIMAL
+  cuda::CudnnLibrary::Get().Configure(info_.enable_cudnn, info_.cudnn_path);
+#endif
+
   CUDA_CALL_THROW(cudaSetDevice(info_.device_id));
 
   // must wait GPU idle, otherwise cudaGetDeviceProperties might fail
@@ -446,8 +454,7 @@ CUDAExecutionProvider::PerThreadContext& CUDAExecutionProvider::GetPerThreadCont
 
     // get or create a context
     if (context_state_.retired_context_pool.empty()) {
-      context = std::make_shared<PerThreadContext>(info_.device_id, stream_, info_.gpu_mem_limit,
-                                                   info_.arena_extend_strategy, info_.external_allocator_info, info_.default_memory_arena_cfg);
+      context = std::make_shared<PerThreadContext>(info_.device_id, stream_, info_);
     } else {
       context = context_state_.retired_context_pool.back();
       context_state_.retired_context_pool.pop_back();
diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider.h b/onnxruntime/core/providers/cuda/cuda_execution_provider.h
index a3a7c2dd13c52..4c61f68c158a6 100644
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider.h
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider.h
@@ -143,8 +143,7 @@ class CUDAExecutionProvider : public IExecutionProvider {
 
   class PerThreadContext final {
    public:
-    PerThreadContext(OrtDevice::DeviceId device_id, cudaStream_t stream, size_t cuda_mem_limit, ArenaExtendStrategy arena_extend_strategy,
-                     CUDAExecutionProviderExternalAllocatorInfo external_alloc_info, OrtArenaCfg* arena_cfg);
+    PerThreadContext(OrtDevice::DeviceId device_id, cudaStream_t stream, const CUDAExecutionProviderInfo& info);
     ~PerThreadContext();
     ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(PerThreadContext);
 
diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc
index 14195703d5963..71e7ba73e68e4 100644
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc
@@ -27,6 +27,8 @@ constexpr const char* kGpuExternalEmptyCache = "gpu_external_empty_cache";
 constexpr const char* kCudnnConvUseMaxWorkspace = "cudnn_conv_use_max_workspace";
 constexpr const char* kEnableCudaGraph = "enable_cuda_graph";
 constexpr const char* kCudnnConv1dPadToNc1d = "cudnn_conv1d_pad_to_nc1d";
+constexpr const char* kEnableCudnn = "enable_cudnn";
+constexpr const char* kCudnnPath = "cudnn_path";
 constexpr const char* kTunableOpEnable = "tunable_op_enable";
 constexpr const char* kTunableOpTuningEnable = "tunable_op_tuning_enable";
 constexpr const char* kTunableOpMaxTuningDurationMs = "tunable_op_max_tuning_duration_ms";
@@ -115,6 +117,8 @@ CUDAExecutionProviderInfo CUDAExecutionProviderInfo::FromProviderOptions(const P
           .AddAssignmentToReference(cuda::provider_option_names::kCudnnConvUseMaxWorkspace, info.cudnn_conv_use_max_workspace)
           .AddAssignmentToReference(cuda::provider_option_names::kEnableCudaGraph, info.enable_cuda_graph)
           .AddAssignmentToReference(cuda::provider_option_names::kCudnnConv1dPadToNc1d, info.cudnn_conv1d_pad_to_nc1d)
+          .AddAssignmentToReference(cuda::provider_option_names::kEnableCudnn, info.enable_cudnn)
+          .AddAssignmentToReference(cuda::provider_option_names::kCudnnPath, info.cudnn_path)
           .AddAssignmentToReference(cuda::provider_option_names::kEnableSkipLayerNormStrictMode, info.enable_skip_layer_norm_strict_mode)
           .AddAssignmentToReference(cuda::provider_option_names::kPreferNHWCMode, info.prefer_nhwc)
           .AddAssignmentToReference(cuda::provider_option_names::kUseEPLevelUnifiedStream, info.use_ep_level_unified_stream)
@@ -167,6 +171,8 @@ ProviderOptions CUDAExecutionProviderInfo::ToProviderOptions(const CUDAExecution
       {cuda::provider_option_names::kCudnnConvUseMaxWorkspace, MakeStringWithClassicLocale(info.cudnn_conv_use_max_workspace)},
       {cuda::provider_option_names::kEnableCudaGraph, MakeStringWithClassicLocale(info.enable_cuda_graph)},
       {cuda::provider_option_names::kCudnnConv1dPadToNc1d, MakeStringWithClassicLocale(info.cudnn_conv1d_pad_to_nc1d)},
+      {cuda::provider_option_names::kEnableCudnn, MakeStringWithClassicLocale(info.enable_cudnn)},
+      {cuda::provider_option_names::kCudnnPath, info.cudnn_path},
       {cuda::provider_option_names::kTunableOpEnable, MakeStringWithClassicLocale(info.tunable_op.enable)},
       {cuda::provider_option_names::kTunableOpTuningEnable, MakeStringWithClassicLocale(info.tunable_op.tuning_enable)},
       {cuda::provider_option_names::kTunableOpMaxTuningDurationMs, MakeStringWithClassicLocale(info.tunable_op.max_tuning_duration_ms)},
@@ -192,6 +198,8 @@ ProviderOptions CUDAExecutionProviderInfo::ToProviderOptions(const OrtCUDAProvid
       {cuda::provider_option_names::kDoCopyInDefaultStream, MakeStringWithClassicLocale(info.do_copy_in_default_stream)},
       {cuda::provider_option_names::kCudnnConvUseMaxWorkspace, MakeStringWithClassicLocale(info.cudnn_conv_use_max_workspace)},
       {cuda::provider_option_names::kCudnnConv1dPadToNc1d, MakeStringWithClassicLocale(info.cudnn_conv1d_pad_to_nc1d)},
+      {cuda::provider_option_names::kEnableCudnn, MakeStringWithClassicLocale(true)},
+      {cuda::provider_option_names::kCudnnPath, ""},
       {cuda::provider_option_names::kTunableOpEnable, MakeStringWithClassicLocale(info.tunable_op_enable)},
       {cuda::provider_option_names::kTunableOpTuningEnable, MakeStringWithClassicLocale(info.tunable_op_tuning_enable)},
       {cuda::provider_option_names::kTunableOpMaxTuningDurationMs, MakeStringWithClassicLocale(info.tunable_op_max_tuning_duration_ms)},
diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h
index bfd50ca8d40a1..a3095c8afb23b 100644
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h
@@ -6,6 +6,7 @@
 
 #include <functional>
 #include <limits>
+#include <string>
 
 #include "core/common/hash_combine.h"
 #include "core/framework/arena_extend_strategy.h"
@@ -69,6 +70,9 @@ struct CUDAExecutionProviderInfo {
   // By default, for Conv1D, will pad [N,C,D] to [N,C,D,1], if turn on, will pad to [N,C,1,D].
   bool cudnn_conv1d_pad_to_nc1d{false};
 
+  bool enable_cudnn{true};
+  std::string cudnn_path;
+
   cuda::TunableOpInfo tunable_op{};
 
   bool enable_skip_layer_norm_strict_mode{false};
@@ -115,6 +119,8 @@ struct std::hash<::onnxruntime::CUDAExecutionProviderInfo> {
     onnxruntime::HashCombine(info.gpu_mem_limit, value);
     onnxruntime::HashCombine(info.tunable_op.max_tuning_duration_ms, value);
     onnxruntime::HashCombine(info.sdpa_kernel, value);
+    onnxruntime::HashCombine(info.enable_cudnn, value);
+    onnxruntime::HashCombine(info.cudnn_path, value);
 
     // Memory pointers
     onnxruntime::HashCombine(reinterpret_cast<size_t>(info.user_compute_stream), value);
diff --git a/onnxruntime/core/providers/cuda/cuda_kernel.h b/onnxruntime/core/providers/cuda/cuda_kernel.h
index 1d891f204b9bd..b2d6d69bb6eca 100644
--- a/onnxruntime/core/providers/cuda/cuda_kernel.h
+++ b/onnxruntime/core/providers/cuda/cuda_kernel.h
@@ -7,6 +7,7 @@
 
 #include "core/providers/cuda/cuda_common.h"
 #include "core/providers/cuda/cuda_execution_provider.h"
+#include "core/providers/cuda/cudnn_loader.h"
 #include "core/providers/cuda/cuda_fwd.h"
 #include <mutex>
 #include "core/providers/cuda/cuda_stream_handle.h"
@@ -129,7 +130,7 @@ class CudaKernel : public OpKernel {
   }
 
   inline cudnnHandle_t GetCudnnHandle(OpKernelContext* ctx) const {
-    return GetCudnnHandle(static_cast<CudaStream*>(ctx->GetComputeStream()));
+    return RequireCudnnHandle(GetCudnnHandle(static_cast<CudaStream*>(ctx->GetComputeStream())));
   }
 
   static inline cudnnHandle_t GetCudnnHandle(onnxruntime::CudaStream* stream) {
@@ -143,7 +144,7 @@ class CudaKernel : public OpKernel {
   }
 
   inline cudnnHandle_t GetCudnnHandleOrDefault(onnxruntime::Stream* stream) const {
-    return stream ? GetCudnnHandle(stream) : DefaultCudnnHandle();
+    return stream ? RequireCudnnHandle(GetCudnnHandle(stream)) : DefaultCudnnHandle();
   }
 
   inline cublasHandle_t GetCublasHandle(OpKernelContext* ctx) const {
@@ -258,7 +259,16 @@ class CudaKernel : public OpKernel {
   }
 
   inline cudnnHandle_t DefaultCudnnHandle() const {
-    return provider_->PerThreadDefaultCudnnHandle();
+    return RequireCudnnHandle(provider_->PerThreadDefaultCudnnHandle());
+  }
+
+  static inline cudnnHandle_t RequireCudnnHandle(cudnnHandle_t handle) {
+    if (handle == nullptr) {
+      ORT_THROW_IF_ERROR(ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
+                                         "cuDNN is unavailable or disabled for CUDA Execution Provider: ",
+                                         cuda::CudnnLibrary::Get().Error()));
+    }
+    return handle;
   }
 
   inline cudaStream_t DefaultCudaStream() const {
diff --git a/onnxruntime/core/providers/cuda/cuda_stream_handle.cc b/onnxruntime/core/providers/cuda/cuda_stream_handle.cc
index c4e3bd7e63e5c..5ab6c5b08acab 100644
--- a/onnxruntime/core/providers/cuda/cuda_stream_handle.cc
+++ b/onnxruntime/core/providers/cuda/cuda_stream_handle.cc
@@ -3,6 +3,7 @@
 #include "core/providers/cuda/cuda_resource.h"
 #include "core/providers/cuda/cuda_stream_handle.h"
 #include "core/providers/cuda/cuda_common.h"
+#include "core/providers/cuda/cudnn_loader.h"
 #include "core/common/spin_pause.h"
 
 namespace onnxruntime {
@@ -77,13 +78,17 @@ CudaStream::CudaStream(cudaStream_t stream,
   if (own_flag) {
     CUBLAS_CALL_THROW(cublasCreate(&cublas_handle_));
     CUBLAS_CALL_THROW(cublasSetStream(cublas_handle_, stream));
-    CUDNN_CALL_THROW(cudnnCreate(&cudnn_handle_));
-    CUDNN_CALL_THROW(cudnnSetStream(cudnn_handle_, stream));
+    if (ep_info_.enable_cudnn && cuda::CudnnLibrary::Get().Available()) {
+      CUDNN_CALL_THROW(cudnnCreate(&cudnn_handle_));
+      CUDNN_CALL_THROW(cudnnSetStream(cudnn_handle_, stream));
+    }
   } else {
     cublas_handle_ = external_cublas_handle;
     CUBLAS_CALL_THROW(cublasSetStream(cublas_handle_, stream));
     cudnn_handle_ = external_cudnn_handle;
-    CUDNN_CALL_THROW(cudnnSetStream(cudnn_handle_, stream));
+    if (cudnn_handle_ != nullptr) {
+      CUDNN_CALL_THROW(cudnnSetStream(cudnn_handle_, stream));
+    }
   }
 #else
   (void)(external_cudnn_handle);
@@ -96,7 +101,9 @@ CudaStream::~CudaStream() {
 #ifndef USE_CUDA_MINIMAL
   if (own_stream_) {
     cublasDestroy(cublas_handle_);
-    cudnnDestroy(cudnn_handle_);
+    if (cudnn_handle_ != nullptr) {
+      cudnnDestroy(cudnn_handle_);
+    }
     auto* handle = GetHandle();
     if (handle)
       cudaStreamDestroy(static_cast<cudaStream_t>(handle));
diff --git a/onnxruntime/core/providers/cuda/cudnn_fe_call.cc b/onnxruntime/core/providers/cuda/cudnn_fe_call.cc
index 60d6b85544269..ce48c696425e8 100644
--- a/onnxruntime/core/providers/cuda/cudnn_fe_call.cc
+++ b/onnxruntime/core/providers/cuda/cudnn_fe_call.cc
@@ -11,6 +11,7 @@
 #endif
 #if !defined(__CUDACC__) && !defined(USE_CUDA_MINIMAL)
 #include <cudnn_frontend.h>
+#include "core/providers/cuda/cudnn_loader.h"
 #endif
 #ifdef _WIN32
 #else  // POSIX
@@ -71,6 +72,18 @@ std::conditional_t<THRW, void, Status> CudaCall(
     ERRTYPE retCode, const char* exprString, const char* libName, SUCCTYPE successCode, const char* msg,
     const char* file, const int line) {
   if (retCode != successCode) {
+#if !defined(__CUDACC__) && !defined(USE_CUDA_MINIMAL)
+    if (!cuda::CudnnLibrary::Get().Available()) {
+      auto status = ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
+                                    "cuDNN is unavailable for CUDA Execution Provider: ",
+                                    cuda::CudnnLibrary::Get().Error());
+      if constexpr (THRW) {
+        ORT_THROW(status.ErrorMessage());
+      } else {
+        return status;
+      }
+    }
+#endif
     try {
 #ifdef _WIN32
       std::string hostname_str = GetEnvironmentVar("COMPUTERNAME");
diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.cc b/onnxruntime/core/providers/cuda/cudnn_loader.cc
new file mode 100644
index 0000000000000..ffdf322efd317
--- /dev/null
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.cc
@@ -0,0 +1,211 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#include "core/providers/cuda/cudnn_loader.h"
+
+#ifndef USE_CUDA_MINIMAL
+
+#include <utility>
+#include <vector>
+
+#ifdef _WIN32
+#define NOMINMAX
+#include <windows.h>
+#else
+#include <dlfcn.h>
+#endif
+
+namespace {
+
+std::string JoinPath(std::string_view base, std::string_view child) {
+  if (base.empty()) {
+    return std::string(child);
+  }
+  const char last = base.back();
+  if (last == '/' || last == '\\') {
+    return std::string(base) + std::string(child);
+  }
+#ifdef _WIN32
+  return std::string(base) + "\\" + std::string(child);
+#else
+  return std::string(base) + "/" + std::string(child);
+#endif
+}
+
+std::vector<std::string> GetCandidateLibraryNames(std::string_view cudnn_path) {
+#ifdef _WIN32
+  constexpr const char* kCudnnLibraryName = "cudnn64_9.dll";
+#else
+  constexpr const char* kCudnnLibraryName = "libcudnn.so.9";
+  constexpr const char* kCudnnUnversionedLibraryName = "libcudnn.so";
+#endif
+
+  std::vector<std::string> candidates;
+  if (!cudnn_path.empty()) {
+    candidates.push_back(JoinPath(cudnn_path, kCudnnLibraryName));
+#ifdef _WIN32
+    candidates.push_back(JoinPath(JoinPath(cudnn_path, "bin"), kCudnnLibraryName));
+#else
+    candidates.push_back(JoinPath(cudnn_path, kCudnnUnversionedLibraryName));
+    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib"), kCudnnLibraryName));
+    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib"), kCudnnUnversionedLibraryName));
+    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib64"), kCudnnLibraryName));
+    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib64"), kCudnnUnversionedLibraryName));
+#endif
+  }
+
+  candidates.push_back(kCudnnLibraryName);
+#ifndef _WIN32
+  candidates.push_back(kCudnnUnversionedLibraryName);
+#endif
+  return candidates;
+}
+
+void* LoadLibraryCandidate(const std::string& candidate, std::string& error) {
+#ifdef _WIN32
+  HMODULE handle = LoadLibraryA(candidate.c_str());
+  if (handle == nullptr) {
+    error = "LoadLibrary failed for " + candidate + " with error " + std::to_string(GetLastError());
+  }
+  return reinterpret_cast<void*>(handle);
+#else
+  dlerror();
+  void* handle = dlopen(candidate.c_str(), RTLD_NOW | RTLD_LOCAL);
+  if (handle == nullptr) {
+    const char* dl_error = dlerror();
+    error = "dlopen failed for " + candidate + ": " + (dl_error != nullptr ? dl_error : "unknown error");
+  }
+  return handle;
+#endif
+}
+
+void* GetLibrarySymbol(void* handle, const char* symbol, std::string& error) {
+#ifdef _WIN32
+  void* address = reinterpret_cast<void*>(GetProcAddress(reinterpret_cast<HMODULE>(handle), symbol));
+  if (address == nullptr) {
+    error = "GetProcAddress failed for " + std::string(symbol) + " with error " + std::to_string(GetLastError());
+  }
+  return address;
+#else
+  dlerror();
+  void* address = dlsym(handle, symbol);
+  const char* dl_error = dlerror();
+  if (address == nullptr || dl_error != nullptr) {
+    error = "dlsym failed for " + std::string(symbol) + ": " + (dl_error != nullptr ? dl_error : "unknown error");
+  }
+  return address;
+#endif
+}
+
+}  // namespace
+
+#if defined(NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING)
+namespace cudnn_frontend {
+#ifdef _WIN32
+HMODULE cudnn_dlhandle = nullptr;
+#else
+void* cudnn_dlhandle = nullptr;
+#endif
+}  // namespace cudnn_frontend
+#endif
+
+namespace onnxruntime::cuda {
+
+CudnnLibrary& CudnnLibrary::Get() {
+  static CudnnLibrary library;
+  return library;
+}
+
+void CudnnLibrary::Configure(bool enabled, std::string cudnn_path) {
+  std::lock_guard<std::mutex> lock(mutex_);
+  if (!enabled) {
+    return;
+  }
+
+  enabled_ = enabled;
+  if (!load_attempted_) {
+    cudnn_path_ = std::move(cudnn_path);
+  }
+}
+
+bool CudnnLibrary::Available() {
+  return EnsureLoaded();
+}
+
+const char* CudnnLibrary::Error() const {
+  std::lock_guard<std::mutex> lock(mutex_);
+  return error_.empty() ? CudnnUnavailableErrorString() : error_.c_str();
+}
+
+void* CudnnLibrary::Handle() {
+  return EnsureLoaded() ? handle_ : nullptr;
+}
+
+bool CudnnLibrary::EnsureLoaded() {
+  std::lock_guard<std::mutex> lock(mutex_);
+  if (!enabled_) {
+    available_ = false;
+    error_ = "cuDNN was disabled by CUDA provider option enable_cudnn=0";
+    return false;
+  }
+
+  if (load_attempted_) {
+    return available_;
+  }
+
+  load_attempted_ = true;
+  std::string last_error;
+  for (const auto& candidate : GetCandidateLibraryNames(cudnn_path_)) {
+    handle_ = LoadLibraryCandidate(candidate, last_error);
+    if (handle_ != nullptr) {
+      available_ = true;
+      error_.clear();
+#if defined(NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING)
+#ifdef _WIN32
+      cudnn_frontend::cudnn_dlhandle = reinterpret_cast<HMODULE>(handle_);
+#else
+      cudnn_frontend::cudnn_dlhandle = handle_;
+#endif
+#endif
+      return true;
+    }
+  }
+
+  available_ = false;
+  error_ = last_error.empty() ? "cuDNN library was not found" : last_error;
+  return false;
+}
+
+void* CudnnLibrary::ResolveSymbol(const char* symbol) {
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    auto it = symbols_.find(symbol);
+    if (it != symbols_.end()) {
+      return it->second;
+    }
+  }
+
+  if (!EnsureLoaded()) {
+    return nullptr;
+  }
+
+  std::lock_guard<std::mutex> lock(mutex_);
+  std::string symbol_error;
+  void* address = GetLibrarySymbol(handle_, symbol, symbol_error);
+  if (address == nullptr) {
+    available_ = false;
+    error_ = symbol_error;
+    return nullptr;
+  }
+
+  symbols_.emplace(symbol, address);
+  return address;
+}
+
+const char* CudnnUnavailableErrorString() {
+  return "cuDNN is not available. Install cuDNN, set CUDA provider option cudnn_path, or set enable_cudnn=0 to force native CUDA paths where available.";
+}
+
+}  // namespace onnxruntime::cuda
+
+#endif  // USE_CUDA_MINIMAL
diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.h b/onnxruntime/core/providers/cuda/cudnn_loader.h
new file mode 100644
index 0000000000000..47f78fa75f006
--- /dev/null
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.h
@@ -0,0 +1,50 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+
+#ifndef USE_CUDA_MINIMAL
+
+#include <mutex>
+#include <string>
+#include <unordered_map>
+
+#include "core/providers/cuda/cuda_pch.h"
+
+namespace onnxruntime::cuda {
+
+class CudnnLibrary {
+ public:
+  static CudnnLibrary& Get();
+
+  void Configure(bool enabled, std::string cudnn_path);
+  bool Available();
+  const char* Error() const;
+  void* Handle();
+
+  template <typename T>
+  T Resolve(const char* symbol) {
+    return reinterpret_cast<T>(ResolveSymbol(symbol));
+  }
+
+ private:
+  CudnnLibrary() = default;
+
+  bool EnsureLoaded();
+  void* ResolveSymbol(const char* symbol);
+
+  mutable std::mutex mutex_;
+  bool enabled_{true};
+  bool load_attempted_{false};
+  bool available_{false};
+  std::string cudnn_path_;
+  std::string error_;
+  void* handle_{nullptr};
+  std::unordered_map<std::string, void*> symbols_;
+};
+
+const char* CudnnUnavailableErrorString();
+
+}  // namespace onnxruntime::cuda
+
+#endif  // USE_CUDA_MINIMAL
diff --git a/onnxruntime/core/providers/cuda/cudnn_stub.cc b/onnxruntime/core/providers/cuda/cudnn_stub.cc
new file mode 100644
index 0000000000000..a02a00fb7fba1
--- /dev/null
+++ b/onnxruntime/core/providers/cuda/cudnn_stub.cc
@@ -0,0 +1,401 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#include "core/providers/cuda/cudnn_loader.h"
+
+#ifndef USE_CUDA_MINIMAL
+
+#define ORT_CUDNN_FORWARD_STATUS(name, ...)              \
+  using Fn = decltype(&name);                            \
+  auto fn = onnxruntime::cuda::CudnnLibrary::Get().Resolve<Fn>(#name); \
+  return fn != nullptr ? fn(__VA_ARGS__) : CUDNN_STATUS_NOT_INITIALIZED
+
+extern "C" {
+
+size_t CUDNNWINAPI cudnnGetVersion(void) {
+  using Fn = decltype(&cudnnGetVersion);
+  auto fn = onnxruntime::cuda::CudnnLibrary::Get().Resolve<Fn>("cudnnGetVersion");
+  return fn != nullptr ? fn() : 0;
+}
+
+const char* CUDNNWINAPI cudnnGetErrorString(cudnnStatus_t status) {
+  using Fn = decltype(&cudnnGetErrorString);
+  auto fn = onnxruntime::cuda::CudnnLibrary::Get().Resolve<Fn>("cudnnGetErrorString");
+  return fn != nullptr ? fn(status) : onnxruntime::cuda::CudnnUnavailableErrorString();
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreate(cudnnHandle_t* handle) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreate, handle);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroy(cudnnHandle_t handle) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroy, handle);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetStream(cudnnHandle_t handle, cudaStream_t streamId) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetStream, handle, streamId);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateTensorDescriptor(cudnnTensorDescriptor_t* tensorDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateTensorDescriptor, tensorDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyTensorDescriptor(cudnnTensorDescriptor_t tensorDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyTensorDescriptor, tensorDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetTensorNdDescriptor(cudnnTensorDescriptor_t tensorDesc, cudnnDataType_t dataType,
+                                                     int nbDims, const int dimA[], const int strideA[]) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetTensorNdDescriptor, tensorDesc, dataType, nbDims, dimA, strideA);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetTensor4dDescriptor(cudnnTensorDescriptor_t tensorDesc, cudnnTensorFormat_t format,
+                                                     cudnnDataType_t dataType, int n, int c, int h, int w) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetTensor4dDescriptor, tensorDesc, format, dataType, n, c, h, w);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnGetTensorNdDescriptor(const cudnnTensorDescriptor_t tensorDesc, int nbDimsRequested,
+                                                     cudnnDataType_t* dataType, int* nbDims, int dimA[], int strideA[]) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnGetTensorNdDescriptor, tensorDesc, nbDimsRequested, dataType, nbDims, dimA, strideA);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateFilterDescriptor(cudnnFilterDescriptor_t* filterDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateFilterDescriptor, filterDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyFilterDescriptor(cudnnFilterDescriptor_t filterDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyFilterDescriptor, filterDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetFilterNdDescriptor(cudnnFilterDescriptor_t filterDesc, cudnnDataType_t dataType,
+                                                     cudnnTensorFormat_t format, int nbDims, const int filterDimA[]) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetFilterNdDescriptor, filterDesc, dataType, format, nbDims, filterDimA);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetFilter4dDescriptor(cudnnFilterDescriptor_t filterDesc, cudnnDataType_t dataType,
+                                                     cudnnTensorFormat_t format, int k, int c, int h, int w) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetFilter4dDescriptor, filterDesc, dataType, format, k, c, h, w);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateConvolutionDescriptor(cudnnConvolutionDescriptor_t* convDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateConvolutionDescriptor, convDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyConvolutionDescriptor(cudnnConvolutionDescriptor_t convDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyConvolutionDescriptor, convDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetConvolutionNdDescriptor(cudnnConvolutionDescriptor_t convDesc, int arrayLength,
+                                                          const int padA[], const int filterStrideA[],
+                                                          const int dilationA[], cudnnConvolutionMode_t mode,
+                                                          cudnnDataType_t computeType) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetConvolutionNdDescriptor, convDesc, arrayLength, padA, filterStrideA, dilationA, mode,
+                           computeType);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetConvolutionGroupCount(cudnnConvolutionDescriptor_t convDesc, int groupCount) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetConvolutionGroupCount, convDesc, groupCount);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetConvolutionMathType(cudnnConvolutionDescriptor_t convDesc,
+                                                      cudnnMathType_t mathType) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetConvolutionMathType, convDesc, mathType);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateActivationDescriptor(cudnnActivationDescriptor_t* activationDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateActivationDescriptor, activationDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyActivationDescriptor(cudnnActivationDescriptor_t activationDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyActivationDescriptor, activationDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetActivationDescriptor(cudnnActivationDescriptor_t activationDesc,
+                                                       cudnnActivationMode_t mode,
+                                                       cudnnNanPropagation_t reluNanOpt, double coef) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetActivationDescriptor, activationDesc, mode, reluNanOpt, coef);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreatePoolingDescriptor(cudnnPoolingDescriptor_t* poolingDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreatePoolingDescriptor, poolingDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyPoolingDescriptor(cudnnPoolingDescriptor_t poolingDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyPoolingDescriptor, poolingDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetPoolingNdDescriptor(cudnnPoolingDescriptor_t poolingDesc,
+                                                      const cudnnPoolingMode_t mode,
+                                                      const cudnnNanPropagation_t maxpoolingNanOpt, int nbDims,
+                                                      const int windowDimA[], const int paddingA[],
+                                                      const int strideA[]) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetPoolingNdDescriptor, poolingDesc, mode, maxpoolingNanOpt, nbDims, windowDimA,
+                           paddingA, strideA);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateLRNDescriptor(cudnnLRNDescriptor_t* normDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateLRNDescriptor, normDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyLRNDescriptor(cudnnLRNDescriptor_t lrnDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyLRNDescriptor, lrnDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetLRNDescriptor(cudnnLRNDescriptor_t normDesc, unsigned lrnN, double lrnAlpha,
+                                                double lrnBeta, double lrnK) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetLRNDescriptor, normDesc, lrnN, lrnAlpha, lrnBeta, lrnK);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateReduceTensorDescriptor(cudnnReduceTensorDescriptor_t* reduceTensorDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateReduceTensorDescriptor, reduceTensorDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyReduceTensorDescriptor(cudnnReduceTensorDescriptor_t reduceTensorDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyReduceTensorDescriptor, reduceTensorDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetReduceTensorDescriptor(cudnnReduceTensorDescriptor_t reduceTensorDesc,
+                                                         cudnnReduceTensorOp_t reduceTensorOp,
+                                                         cudnnDataType_t reduceTensorCompType,
+                                                         cudnnNanPropagation_t reduceTensorNanOpt,
+                                                         cudnnReduceTensorIndices_t reduceTensorIndices,
+                                                         cudnnIndicesType_t reduceTensorIndicesType) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetReduceTensorDescriptor, reduceTensorDesc, reduceTensorOp, reduceTensorCompType,
+                           reduceTensorNanOpt, reduceTensorIndices, reduceTensorIndicesType);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateRNNDescriptor(cudnnRNNDescriptor_t* rnnDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateRNNDescriptor, rnnDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyRNNDescriptor(cudnnRNNDescriptor_t rnnDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyRNNDescriptor, rnnDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetRNNDescriptor_v8(cudnnRNNDescriptor_t rnnDesc, cudnnRNNAlgo_t algo,
+                                                   cudnnRNNMode_t cellMode, cudnnRNNBiasMode_t biasMode,
+                                                   cudnnDirectionMode_t dirMode, cudnnRNNInputMode_t inputMode,
+                                                   cudnnDataType_t dataType, cudnnDataType_t mathPrec,
+                                                   cudnnMathType_t mathType, int32_t inputSize, int32_t hiddenSize,
+                                                   int32_t projSize, int32_t numLayers,
+                                                   cudnnDropoutDescriptor_t dropoutDesc, uint32_t auxFlags) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetRNNDescriptor_v8, rnnDesc, algo, cellMode, biasMode, dirMode, inputMode, dataType,
+                           mathPrec, mathType, inputSize, hiddenSize, projSize, numLayers, dropoutDesc, auxFlags);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateRNNDataDescriptor(cudnnRNNDataDescriptor_t* rnnDataDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateRNNDataDescriptor, rnnDataDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyRNNDataDescriptor(cudnnRNNDataDescriptor_t rnnDataDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyRNNDataDescriptor, rnnDataDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetRNNDataDescriptor(cudnnRNNDataDescriptor_t rnnDataDesc, cudnnDataType_t dataType,
+                                                    cudnnRNNDataLayout_t layout, int maxSeqLength, int batchSize,
+                                                    int vectorSize, const int seqLengthArray[], void* paddingFill) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetRNNDataDescriptor, rnnDataDesc, dataType, layout, maxSeqLength, batchSize,
+                           vectorSize, seqLengthArray, paddingFill);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnCreateDropoutDescriptor(cudnnDropoutDescriptor_t* dropoutDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnCreateDropoutDescriptor, dropoutDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDestroyDropoutDescriptor(cudnnDropoutDescriptor_t dropoutDesc) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDestroyDropoutDescriptor, dropoutDesc);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSetDropoutDescriptor(cudnnDropoutDescriptor_t dropoutDesc, cudnnHandle_t handle,
+                                                    float dropout, void* states, size_t stateSizeInBytes,
+                                                    unsigned long long seed) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSetDropoutDescriptor, dropoutDesc, handle, dropout, states, stateSizeInBytes, seed);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDropoutGetStatesSize(cudnnHandle_t handle, size_t* sizeInBytes) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDropoutGetStatesSize, handle, sizeInBytes);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnDeriveBNTensorDescriptor(cudnnTensorDescriptor_t derivedBnDesc,
+                                                        const cudnnTensorDescriptor_t xDesc,
+                                                        cudnnBatchNormMode_t mode) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnDeriveBNTensorDescriptor, derivedBnDesc, xDesc, mode);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnAddTensor(cudnnHandle_t handle, const void* alpha,
+                                         const cudnnTensorDescriptor_t aDesc, const void* A, const void* beta,
+                                         const cudnnTensorDescriptor_t cDesc, void* C) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnAddTensor, handle, alpha, aDesc, A, beta, cDesc, C);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnActivationForward(cudnnHandle_t handle, cudnnActivationDescriptor_t activationDesc,
+                                                 const void* alpha, const cudnnTensorDescriptor_t xDesc, const void* x,
+                                                 const void* beta, const cudnnTensorDescriptor_t yDesc, void* y) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnActivationForward, handle, activationDesc, alpha, xDesc, x, beta, yDesc, y);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnPoolingForward(cudnnHandle_t handle, const cudnnPoolingDescriptor_t poolingDesc,
+                                              const void* alpha, const cudnnTensorDescriptor_t xDesc, const void* x,
+                                              const void* beta, const cudnnTensorDescriptor_t yDesc, void* y) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnPoolingForward, handle, poolingDesc, alpha, xDesc, x, beta, yDesc, y);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnLRNCrossChannelForward(cudnnHandle_t handle, cudnnLRNDescriptor_t normDesc,
+                                                      cudnnLRNMode_t lrnMode, const void* alpha,
+                                                      const cudnnTensorDescriptor_t xDesc, const void* x,
+                                                      const void* beta, const cudnnTensorDescriptor_t yDesc, void* y) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnLRNCrossChannelForward, handle, normDesc, lrnMode, alpha, xDesc, x, beta, yDesc, y);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSoftmaxForward(cudnnHandle_t handle, cudnnSoftmaxAlgorithm_t algo,
+                                              cudnnSoftmaxMode_t mode, const void* alpha,
+                                              const cudnnTensorDescriptor_t xDesc, const void* x, const void* beta,
+                                              const cudnnTensorDescriptor_t yDesc, void* y) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSoftmaxForward, handle, algo, mode, alpha, xDesc, x, beta, yDesc, y);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnSoftmaxBackward(cudnnHandle_t handle, cudnnSoftmaxAlgorithm_t algo,
+                                               cudnnSoftmaxMode_t mode, const void* alpha,
+                                               const cudnnTensorDescriptor_t yDesc, const void* y,
+                                               const cudnnTensorDescriptor_t dyDesc, const void* dy, const void* beta,
+                                               const cudnnTensorDescriptor_t dxDesc, void* dx) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnSoftmaxBackward, handle, algo, mode, alpha, yDesc, y, dyDesc, dy, beta, dxDesc, dx);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnBatchNormalizationForwardInference(
+    cudnnHandle_t handle, cudnnBatchNormMode_t mode, const void* alpha, const void* beta,
+    const cudnnTensorDescriptor_t xDesc, const void* x, const cudnnTensorDescriptor_t yDesc, void* y,
+    const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc, const void* bnScale, const void* bnBias,
+    const void* estimatedMean, const void* estimatedVariance, double epsilon) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnBatchNormalizationForwardInference, handle, mode, alpha, beta, xDesc, x, yDesc, y,
+                           bnScaleBiasMeanVarDesc, bnScale, bnBias, estimatedMean, estimatedVariance, epsilon);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnBatchNormalizationForwardTraining(
+    cudnnHandle_t handle, cudnnBatchNormMode_t mode, const void* alpha, const void* beta,
+    const cudnnTensorDescriptor_t xDesc, const void* x, const cudnnTensorDescriptor_t yDesc, void* y,
+    const cudnnTensorDescriptor_t bnScaleBiasMeanVarDesc, const void* bnScale, const void* bnBias,
+    double exponentialAverageFactor, void* resultRunningMean, void* resultRunningVariance, double epsilon,
+    void* resultSaveMean, void* resultSaveInvVariance) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnBatchNormalizationForwardTraining, handle, mode, alpha, beta, xDesc, x, yDesc, y,
+                           bnScaleBiasMeanVarDesc, bnScale, bnBias, exponentialAverageFactor, resultRunningMean,
+                           resultRunningVariance, epsilon, resultSaveMean, resultSaveInvVariance);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnConvolutionForward(
+    cudnnHandle_t handle, const void* alpha, const cudnnTensorDescriptor_t xDesc, const void* x,
+    const cudnnFilterDescriptor_t wDesc, const void* w, const cudnnConvolutionDescriptor_t convDesc,
+    cudnnConvolutionFwdAlgo_t algo, void* workSpace, size_t workSpaceSizeInBytes, const void* beta,
+    const cudnnTensorDescriptor_t yDesc, void* y) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnConvolutionForward, handle, alpha, xDesc, x, wDesc, w, convDesc, algo, workSpace,
+                           workSpaceSizeInBytes, beta, yDesc, y);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnConvolutionBiasActivationForward(
+    cudnnHandle_t handle, const void* alpha1, const cudnnTensorDescriptor_t xDesc, const void* x,
+    const cudnnFilterDescriptor_t wDesc, const void* w, const cudnnConvolutionDescriptor_t convDesc,
+    cudnnConvolutionFwdAlgo_t algo, void* workSpace, size_t workSpaceSizeInBytes, const void* alpha2,
+    const cudnnTensorDescriptor_t zDesc, const void* z, const cudnnTensorDescriptor_t biasDesc, const void* bias,
+    const cudnnActivationDescriptor_t activationDesc, const cudnnTensorDescriptor_t yDesc, void* y) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnConvolutionBiasActivationForward, handle, alpha1, xDesc, x, wDesc, w, convDesc, algo,
+                           workSpace, workSpaceSizeInBytes, alpha2, zDesc, z, biasDesc, bias, activationDesc, yDesc,
+                           y);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnConvolutionBackwardData(
+    cudnnHandle_t handle, const void* alpha, const cudnnFilterDescriptor_t wDesc, const void* w,
+    const cudnnTensorDescriptor_t dyDesc, const void* dy, const cudnnConvolutionDescriptor_t convDesc,
+    cudnnConvolutionBwdDataAlgo_t algo, void* workSpace, size_t workSpaceSizeInBytes, const void* beta,
+    const cudnnTensorDescriptor_t dxDesc, void* dx) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnConvolutionBackwardData, handle, alpha, wDesc, w, dyDesc, dy, convDesc, algo,
+                           workSpace, workSpaceSizeInBytes, beta, dxDesc, dx);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnFindConvolutionForwardAlgorithmEx(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc, const void* x, const cudnnFilterDescriptor_t wDesc,
+    const void* w, const cudnnConvolutionDescriptor_t convDesc, const cudnnTensorDescriptor_t yDesc, void* y,
+    const int requestedAlgoCount, int* returnedAlgoCount, cudnnConvolutionFwdAlgoPerf_t* perfResults,
+    void* workSpace, size_t workSpaceSizeInBytes) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnFindConvolutionForwardAlgorithmEx, handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
+                           requestedAlgoCount, returnedAlgoCount, perfResults, workSpace, workSpaceSizeInBytes);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnFindConvolutionBackwardDataAlgorithmEx(
+    cudnnHandle_t handle, const cudnnFilterDescriptor_t wDesc, const void* w, const cudnnTensorDescriptor_t dyDesc,
+    const void* dy, const cudnnConvolutionDescriptor_t convDesc, const cudnnTensorDescriptor_t dxDesc, void* dx,
+    const int requestedAlgoCount, int* returnedAlgoCount, cudnnConvolutionBwdDataAlgoPerf_t* perfResults,
+    void* workSpace, size_t workSpaceSizeInBytes) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnFindConvolutionBackwardDataAlgorithmEx, handle, wDesc, w, dyDesc, dy, convDesc, dxDesc,
+                           dx, requestedAlgoCount, returnedAlgoCount, perfResults, workSpace, workSpaceSizeInBytes);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnGetConvolutionForwardAlgorithm_v7(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t srcDesc, const cudnnFilterDescriptor_t filterDesc,
+    const cudnnConvolutionDescriptor_t convDesc, const cudnnTensorDescriptor_t destDesc, const int requestedAlgoCount,
+    int* returnedAlgoCount, cudnnConvolutionFwdAlgoPerf_t* perfResults) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnGetConvolutionForwardAlgorithm_v7, handle, srcDesc, filterDesc, convDesc, destDesc,
+                           requestedAlgoCount, returnedAlgoCount, perfResults);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnGetConvolutionForwardWorkspaceSize(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc, const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc, const cudnnTensorDescriptor_t yDesc, cudnnConvolutionFwdAlgo_t algo,
+    size_t* sizeInBytes) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnGetConvolutionForwardWorkspaceSize, handle, xDesc, wDesc, convDesc, yDesc, algo,
+                           sizeInBytes);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnGetReductionIndicesSize(cudnnHandle_t handle,
+                                                       const cudnnReduceTensorDescriptor_t reduceTensorDesc,
+                                                       const cudnnTensorDescriptor_t aDesc,
+                                                       const cudnnTensorDescriptor_t cDesc, size_t* sizeInBytes) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnGetReductionIndicesSize, handle, reduceTensorDesc, aDesc, cDesc, sizeInBytes);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnGetReductionWorkspaceSize(cudnnHandle_t handle,
+                                                         const cudnnReduceTensorDescriptor_t reduceTensorDesc,
+                                                         const cudnnTensorDescriptor_t aDesc,
+                                                         const cudnnTensorDescriptor_t cDesc, size_t* sizeInBytes) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnGetReductionWorkspaceSize, handle, reduceTensorDesc, aDesc, cDesc, sizeInBytes);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnReduceTensor(cudnnHandle_t handle, const cudnnReduceTensorDescriptor_t reduceTensorDesc,
+                                            void* indices, size_t indicesSizeInBytes, void* workspace,
+                                            size_t workspaceSizeInBytes, const void* alpha,
+                                            const cudnnTensorDescriptor_t aDesc, const void* A, const void* beta,
+                                            const cudnnTensorDescriptor_t cDesc, void* C) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnReduceTensor, handle, reduceTensorDesc, indices, indicesSizeInBytes, workspace,
+                           workspaceSizeInBytes, alpha, aDesc, A, beta, cDesc, C);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnGetRNNTempSpaceSizes(cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc,
+                                                    cudnnForwardMode_t fwdMode, cudnnRNNDataDescriptor_t xDesc,
+                                                    size_t* workSpaceSize, size_t* reserveSpaceSize) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnGetRNNTempSpaceSizes, handle, rnnDesc, fwdMode, xDesc, workSpaceSize,
+                           reserveSpaceSize);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnGetRNNWeightParams(
+    cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc, int32_t pseudoLayer, size_t weightSpaceSize,
+    const void* weightSpace, int32_t linLayerID, cudnnTensorDescriptor_t mDesc, void** mAddr,
+    cudnnTensorDescriptor_t bDesc, void** bAddr) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnGetRNNWeightParams, handle, rnnDesc, pseudoLayer, weightSpaceSize, weightSpace,
+                           linLayerID, mDesc, mAddr, bDesc, bAddr);
+}
+
+cudnnStatus_t CUDNNWINAPI cudnnRNNForward(
+    cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc, cudnnForwardMode_t fwdMode, const int32_t devSeqLengths[],
+    cudnnRNNDataDescriptor_t xDesc, const void* x, cudnnRNNDataDescriptor_t yDesc, void* y,
+    cudnnTensorDescriptor_t hDesc, const void* hx, void* hy, cudnnTensorDescriptor_t cDesc, const void* cx, void* cy,
+    size_t weightSpaceSize, const void* weightSpace, size_t workSpaceSize, void* workSpace, size_t reserveSpaceSize,
+    void* reserveSpace) {
+  ORT_CUDNN_FORWARD_STATUS(cudnnRNNForward, handle, rnnDesc, fwdMode, devSeqLengths, xDesc, x, yDesc, y, hDesc, hx,
+                           hy, cDesc, cx, cy, weightSpaceSize, weightSpace, workSpaceSize, workSpace,
+                           reserveSpaceSize, reserveSpace);
+}
+
+}  // extern "C"
+
+#undef ORT_CUDNN_FORWARD_STATUS
+
+#endif  // USE_CUDA_MINIMAL
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc b/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
index 13f7bfd7a40cf..6555f55541c4c 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
@@ -3,6 +3,7 @@
 
 #include "cuda_ep.h"
 #include "cuda_ep_factory.h"
+#include "core/providers/cuda/cudnn_loader.h"
 #include "cuda_stream_plugin.h"
 #include "cuda_graph_plugin.h"
 #include "core/providers/cuda/plugin/cuda_kernel_adapter.h"
@@ -127,6 +128,8 @@ CudaEp::CudaEp(CudaEpFactory& factory, const Config& config, const OrtLogger& lo
   // ORT uses it to avoid reading OrtEp struct fields that did not exist when the plugin was compiled.
   ort_version_supported = ORT_API_VERSION;
 
+  onnxruntime::cuda::CudnnLibrary::Get().Configure(config_.enable_cudnn, config_.cudnn_path);
+
   // The plugin is compiled against the latest ORT headers (ORT_API_VERSION) but may be loaded by an
   // older ORT runtime, down to the floor declared in plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION. Some
   // OrtEp callbacks below — and the OrtEpApi functions their implementations call — only exist in
@@ -206,6 +209,7 @@ CudaEp::CudaEp(CudaEpFactory& factory, const Config& config, const OrtLogger& lo
   adapter_config.cudnn_conv_algo = config_.cudnn_conv_algo;
   adapter_config.cudnn_conv_use_max_workspace = config_.cudnn_conv_use_max_workspace;
   adapter_config.cudnn_conv1d_pad_to_nc1d = config_.cudnn_conv1d_pad_to_nc1d;
+  adapter_config.enable_cudnn = config_.enable_cudnn;
   adapter_config.fuse_conv_bias = config_.fuse_conv_bias;
   adapter_config.sdpa_kernel = config_.sdpa_kernel;
   adapter_config.device_id = config_.device_id;
@@ -402,7 +406,7 @@ OrtStatus* ORT_API_CALL CudaEp::CreateSyncStreamForDeviceImpl(
     return Ort::GetApi().CreateStatus(ORT_INVALID_ARGUMENT, error.c_str());
   }
 
-  auto cuda_stream = std::make_unique<CudaSyncStream>(ep->factory_, device_id, this_ptr);
+  auto cuda_stream = std::make_unique<CudaSyncStream>(ep->factory_, device_id, ep->config_.enable_cudnn, this_ptr);
 
   if (ep->config_.has_user_compute_stream) {
     // A user-provided compute stream is honored for kernels regardless of whether CUDA graph
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep.h b/onnxruntime/core/providers/cuda/plugin/cuda_ep.h
index 346f73a4cfa0b..38917cd1be07a 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep.h
@@ -31,6 +31,8 @@ class CudaEp : public onnxruntime::ep::adapter::Ep {
     int cudnn_conv_algo = 0;                          ///< cuDNN convolution algorithm selection.
     bool cudnn_conv_use_max_workspace = true;         ///< Use maximum workspace for cuDNN conv algo search.
     bool cudnn_conv1d_pad_to_nc1d = false;            ///< Pad 1D convolutions to NC1D format.
+    bool enable_cudnn = true;                         ///< Enable runtime loading and use of cuDNN-backed kernels.
+    std::string cudnn_path;                           ///< Optional directory containing cuDNN runtime libraries.
     bool fuse_conv_bias = false;                      ///< Enable cuDNN frontend conv+bias fusion.
     int sdpa_kernel = 0;                              ///< Attention backend bitmask override.
     bool enable_cuda_graph = false;                   ///< Enable CUDA graph capture and replay.
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc b/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
index d445d8bab033c..3b79d3ca4ef97 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
@@ -420,6 +420,18 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateEpImpl(
     }
   };
 
+  auto read_session_config_string = [&](std::initializer_list<std::string_view> keys, std::string& value) {
+    for (const auto& key : keys) {
+      auto raw_value = try_get_session_config(key);
+      if (!raw_value.has_value()) {
+        continue;
+      }
+
+      value = std::move(*raw_value);
+      return;
+    }
+  };
+
   auto read_cudnn_conv_algo = [&](std::initializer_list<std::string_view> keys, int& value) {
     for (const auto& key : keys) {
       auto raw_value = try_get_session_config(key);
@@ -487,6 +499,8 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateEpImpl(
   const std::string cudnn_conv1d_pad_key = ep_options_prefix + "cudnn_conv1d_pad_to_nc1d";
   const std::string cudnn_conv_algo_key = ep_options_prefix + "cudnn_conv_algo";
   const std::string cudnn_conv_algo_search_key = ep_options_prefix + "cudnn_conv_algo_search";
+  const std::string enable_cudnn_key = ep_options_prefix + "enable_cudnn";
+  const std::string cudnn_path_key = ep_options_prefix + "cudnn_path";
   const std::string fuse_conv_bias_key = ep_options_prefix + "fuse_conv_bias";
   const std::string sdpa_kernel_key = ep_options_prefix + "sdpa_kernel";
   const std::string enable_cuda_graph_key = ep_options_prefix + "enable_cuda_graph";
@@ -518,6 +532,12 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateEpImpl(
       {cudnn_conv_algo_search_key, cudnn_conv_algo_key, "ep.cuda.cudnn_conv_algo_search", "ep.cuda.cudnn_conv_algo",
        "cudnn_conv_algo_search", "cudnn_conv_algo"},
       config.cudnn_conv_algo);
+    read_session_config_bool(
+      {enable_cudnn_key, "ep.cuda.enable_cudnn", "enable_cudnn"},
+      config.enable_cudnn);
+    read_session_config_string(
+      {cudnn_path_key, "ep.cuda.cudnn_path", "cudnn_path"},
+      config.cudnn_path);
   read_session_config_bool(
       {fuse_conv_bias_key, "ep.cuda.fuse_conv_bias", "fuse_conv_bias"},
       config.fuse_conv_bias);
@@ -885,7 +905,7 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateSyncStreamForDeviceImpl(
 
   auto* factory = static_cast<CudaEpFactory*>(this_ptr);
   int req_device_id = factory->ep_api_.MemoryDevice_GetDeviceId(memory_device);
-  auto cuda_stream = std::make_unique<CudaSyncStream>(*factory, req_device_id, nullptr);
+  auto cuda_stream = std::make_unique<CudaSyncStream>(*factory, req_device_id, true, nullptr);
 
   // Initialize CUDA handles (stream, cuBLAS, cuDNN)
   RETURN_IF_ERROR(cuda_stream->InitHandles());
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h b/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
index f134c599d5b46..d8fb2b39d6393 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
@@ -32,6 +32,7 @@
 #include <cublas_v2.h>
 #include <cudnn.h>
 #include "core/providers/cuda/shared_inc/cuda_call.h"
+#include "core/providers/cuda/cudnn_loader.h"
 #include "contrib_ops/cuda/bert/attention_kernel_options.h"
 
 #ifdef __CUDACC__
@@ -179,6 +180,11 @@ using ::onnxruntime::HandleNegativeAxis;
   {                                                                                                                                                                 \
     cudnnStatus_t _status = (expr);                                                                                                                                 \
     if (_status != CUDNN_STATUS_SUCCESS) {                                                                                                                          \
+      if (!onnxruntime::cuda::CudnnLibrary::Get().Available()) {                                                                                                     \
+        return onnxruntime::common::Status(onnxruntime::common::ONNXRUNTIME, onnxruntime::common::NOT_IMPLEMENTED,                                                  \
+                                           std::string("cuDNN is unavailable for CUDA Plugin Execution Provider: ") +                                                \
+                                               onnxruntime::cuda::CudnnLibrary::Get().Error());                                                                      \
+      }                                                                                                                                                             \
       return onnxruntime::common::Status(onnxruntime::common::ONNXRUNTIME, onnxruntime::common::FAIL, std::string("cuDNN error: ") + cudnnGetErrorString(_status)); \
     }                                                                                                                                                               \
   }
@@ -436,6 +442,7 @@ struct CudaKernelAdapterRuntimeConfig {
   int cudnn_conv_algo = 0;
   bool cudnn_conv_use_max_workspace = true;
   bool cudnn_conv1d_pad_to_nc1d = false;
+  bool enable_cudnn = true;
   bool fuse_conv_bias = false;
   int sdpa_kernel = 0;
   int device_id = 0;
@@ -516,14 +523,16 @@ inline DefaultCudaHandles& GetDefaultCudaHandlesForDevice(int device_id) {
       handles_by_device.erase(it);
       ORT_THROW("Failed to create default cuBLAS handle for CUDA plugin device ", device_id);
     }
-    if (cudnnCreate(&it->second.cudnn) != CUDNN_STATUS_SUCCESS) {
-      cublasDestroy(it->second.cublas);
-      it->second.cublas = nullptr;
-      if (get_device_result == cudaSuccess) {
-        cudaSetDevice(prev_device);
+    if (onnxruntime::cuda::CudnnLibrary::Get().Available()) {
+      if (cudnnCreate(&it->second.cudnn) != CUDNN_STATUS_SUCCESS) {
+        cublasDestroy(it->second.cublas);
+        it->second.cublas = nullptr;
+        if (get_device_result == cudaSuccess) {
+          cudaSetDevice(prev_device);
+        }
+        handles_by_device.erase(it);
+        ORT_THROW("Failed to create default cuDNN handle for CUDA plugin device ", device_id);
       }
-      handles_by_device.erase(it);
-      ORT_THROW("Failed to create default cuDNN handle for CUDA plugin device ", device_id);
     }
     if (cublasLtCreate(&it->second.cublas_lt) != CUBLAS_STATUS_SUCCESS) {
       cudnnDestroy(it->second.cudnn);
@@ -646,6 +655,7 @@ inline void SetCudaKernelAdapterRuntimeConfigForProvider(
   config->cudnn_conv_algo = init_config.cudnn_conv_algo;
   config->cudnn_conv_use_max_workspace = init_config.cudnn_conv_use_max_workspace;
   config->cudnn_conv1d_pad_to_nc1d = init_config.cudnn_conv1d_pad_to_nc1d;
+  config->enable_cudnn = init_config.enable_cudnn;
   config->fuse_conv_bias = init_config.fuse_conv_bias;
   config->sdpa_kernel = init_config.sdpa_kernel;
   config->device_id = init_config.device_id;
@@ -890,7 +900,12 @@ class CudaKernel : public OpKernel {
 
   inline cudaStream_t DefaultCudaStream() const { return Stream(static_cast<OpKernelContext*>(nullptr)); }
   inline cublasHandle_t DefaultCublasHandle() const { return detail::GetDefaultCudaHandlesForDevice(device_id_).cublas; }
-  inline cudnnHandle_t DefaultCudnnHandle() const { return detail::GetDefaultCudaHandlesForDevice(device_id_).cudnn; }
+  inline cudnnHandle_t DefaultCudnnHandle() const {
+    if (!runtime_config_->enable_cudnn || !onnxruntime::cuda::CudnnLibrary::Get().Available()) {
+      return nullptr;
+    }
+    return detail::GetDefaultCudaHandlesForDevice(device_id_).cudnn;
+  }
   inline cublasLtHandle_t DefaultCublasLtHandle() const { return detail::GetDefaultCudaHandlesForDevice(device_id_).cublas_lt; }
 
   inline Status CopyTensor(const onnxruntime::Tensor& src, onnxruntime::Tensor& dst, onnxruntime::Stream& stream) const {
@@ -934,7 +949,13 @@ class CudaKernel : public OpKernel {
     }
 
     handle = DefaultCudnnHandle();
-    if (stream != nullptr) {
+    if (handle == nullptr) {
+      ORT_THROW_IF_ERROR(onnxruntime::common::Status(
+          onnxruntime::common::ONNXRUNTIME, onnxruntime::common::NOT_IMPLEMENTED,
+            std::string("cuDNN is unavailable or disabled for CUDA Plugin Execution Provider: ") +
+              onnxruntime::cuda::CudnnLibrary::Get().Error()));
+    }
+    if (handle != nullptr && stream != nullptr) {
       CUDNN_CALL_THROW(cudnnSetStream(handle, stream));
     }
     return handle;
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h b/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h
index 6ad2bb0f53ccc..d3ab3a6a18b1e 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h
@@ -30,6 +30,8 @@
 #include <cudnn.h>
 #include <cublasLt.h>
 
+#include "core/providers/cuda/cudnn_loader.h"
+
 // Error handling macros
 
 #ifndef PL_CUDA_RETURN_IF_ERROR
@@ -73,6 +75,14 @@ inline Ort::Status StatusFromCudnnError(cudnnStatus_t cudnn_err) {
     return Ort::Status{};
   }
 
+  if (!onnxruntime::cuda::CudnnLibrary::Get().Available()) {
+    return Ort::Status{
+        (std::string("cuDNN is unavailable for CUDA Plugin Execution Provider: ") +
+         onnxruntime::cuda::CudnnLibrary::Get().Error())
+            .c_str(),
+        ORT_NOT_IMPLEMENTED};
+  }
+
   return Ort::Status{
       (std::string("cuDNN error: ") + cudnnGetErrorString(cudnn_err)).c_str(),
       ORT_EP_FAIL};
@@ -114,6 +124,13 @@ inline bool TryGetCurrentCudaDevice(int& device_id) noexcept {
   do {                                            \
     cudnnStatus_t _cudnn_err = (cudnn_call_expr); \
     if (_cudnn_err != CUDNN_STATUS_SUCCESS) {     \
+      if (!onnxruntime::cuda::CudnnLibrary::Get().Available()) { \
+        return Ort::GetApi().CreateStatus(        \
+            ORT_NOT_IMPLEMENTED,                  \
+            (std::string("cuDNN is unavailable for CUDA Plugin Execution Provider: ") + \
+             onnxruntime::cuda::CudnnLibrary::Get().Error()) \
+                .c_str());                        \
+      }                                           \
       return Ort::GetApi().CreateStatus(          \
           ORT_EP_FAIL,                            \
           (std::string("cuDNN error: ") +         \
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc b/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc
index e497a9151c73d..6ff83c344553f 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc
@@ -3,6 +3,7 @@
 
 #include "cuda_stream_plugin.h"
 #include "cuda_ep_factory.h"
+#include "core/providers/cuda/cudnn_loader.h"
 #include <atomic>
 #include <mutex>
 #include <shared_mutex>
@@ -39,11 +40,12 @@ std::atomic<uint64_t>& GetStreamMapGeneration() {
 // CudaSyncStream
 // ---------------------------------------------------------------------------
 
-CudaSyncStream::CudaSyncStream(CudaEpFactory& factory, int device_id,
+CudaSyncStream::CudaSyncStream(CudaEpFactory& factory, int device_id, bool enable_cudnn,
                                const OrtEp* /*ep*/)
     : OrtSyncStreamImpl{},
       factory_(factory),
-      device_id_(device_id) {
+  device_id_(device_id),
+  enable_cudnn_(enable_cudnn) {
   ort_version_supported = ORT_API_VERSION;
   GetHandle = GetHandleImpl;
   CreateNotification = CreateNotificationImpl;
@@ -124,10 +126,10 @@ OrtStatus* CudaSyncStream::InitHandles() {
   if (status.IsOK()) {
     status = StatusFromCublasError(cublasSetStream(cublas_handle_, cuda_stream_));
   }
-  if (status.IsOK()) {
+  if (status.IsOK() && enable_cudnn_ && onnxruntime::cuda::CudnnLibrary::Get().Available()) {
     status = StatusFromCudnnError(cudnnCreate(&cudnn_handle_));
   }
-  if (status.IsOK()) {
+  if (status.IsOK() && cudnn_handle_ != nullptr) {
     status = StatusFromCudnnError(cudnnSetStream(cudnn_handle_, cuda_stream_));
   }
   if (status.IsOK()) {
@@ -192,10 +194,10 @@ OrtStatus* CudaSyncStream::InitHandlesWithUserStream(cudaStream_t user_stream) {
   if (status.IsOK()) {
     status = StatusFromCublasError(cublasSetStream(cublas_handle_, cuda_stream_));
   }
-  if (status.IsOK()) {
+  if (status.IsOK() && enable_cudnn_ && onnxruntime::cuda::CudnnLibrary::Get().Available()) {
     status = StatusFromCudnnError(cudnnCreate(&cudnn_handle_));
   }
-  if (status.IsOK()) {
+  if (status.IsOK() && cudnn_handle_ != nullptr) {
     status = StatusFromCudnnError(cudnnSetStream(cudnn_handle_, cuda_stream_));
   }
   if (status.IsOK()) {
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h b/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h
index 1f1e5fade3de1..94d458149e2ca 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.h
@@ -28,7 +28,7 @@ class CudaEpFactory;
 /// or wraps an external stream for graph-mode registration/lifecycle tracking.
 class CudaSyncStream : public OrtSyncStreamImpl {
  public:
-  CudaSyncStream(CudaEpFactory& factory, int device_id,
+  CudaSyncStream(CudaEpFactory& factory, int device_id, bool enable_cudnn,
                  const OrtEp* ep);
   ~CudaSyncStream();
 
@@ -70,6 +70,7 @@ class CudaSyncStream : public OrtSyncStreamImpl {
 
   CudaEpFactory& factory_;
   int device_id_;
+  bool enable_cudnn_ = true;
   cudaStream_t cuda_stream_ = nullptr;
   bool owns_stream_ = true;  ///< False when wrapping an external stream (e.g., for CUDA graph).
   cublasHandle_t cublas_handle_ = nullptr;
diff --git a/onnxruntime/test/python/onnxruntime_test_python.py b/onnxruntime/test/python/onnxruntime_test_python.py
index bebd290bbe810..7351ea3f3cac2 100644
--- a/onnxruntime/test/python/onnxruntime_test_python.py
+++ b/onnxruntime/test/python/onnxruntime_test_python.py
@@ -439,6 +439,10 @@ def test_get_and_set_option_with_values(option_name, option_values):
 
                 test_get_and_set_option_with_values("cudnn_conv_algo_search", ["DEFAULT", "EXHAUSTIVE", "HEURISTIC"])
 
+                test_get_and_set_option_with_values("enable_cudnn", ["1", "0"])
+
+                test_get_and_set_option_with_values("cudnn_path", ["", "/tmp/ort_cudnn"])
+
                 test_get_and_set_option_with_values("do_copy_in_default_stream", [0, 1])
 
                 test_get_and_set_option_with_values("tunable_op_enable", ["1", "0"])

From 23e39da6614ab57eb2b26db2f9b04758a27753db Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 23:57:19 +0000
Subject: [PATCH 03/15] remove cudnn_path provider option

---
 docs/CUDA_cuDNN_Optional_Design.md            | 103 ++++++++----------
 .../providers/cuda/cuda_execution_provider.cc |   2 +-
 .../cuda/cuda_execution_provider_info.cc      |   4 -
 .../cuda/cuda_execution_provider_info.h       |   2 -
 .../core/providers/cuda/cudnn_loader.cc       |  40 +------
 .../core/providers/cuda/cudnn_loader.h        |   3 +-
 .../core/providers/cuda/plugin/cuda_ep.cc     |   2 +-
 .../core/providers/cuda/plugin/cuda_ep.h      |   1 -
 .../providers/cuda/plugin/cuda_ep_factory.cc  |  18 +--
 .../test/python/onnxruntime_test_python.py    |   2 -
 10 files changed, 54 insertions(+), 123 deletions(-)

diff --git a/docs/CUDA_cuDNN_Optional_Design.md b/docs/CUDA_cuDNN_Optional_Design.md
index 129e2d0fe1922..8d24b8db30186 100644
--- a/docs/CUDA_cuDNN_Optional_Design.md
+++ b/docs/CUDA_cuDNN_Optional_Design.md
@@ -219,18 +219,17 @@ Loader responsibilities:
   dlopens its sub‑libraries.
 - Resolve each required symbol with `dlsym` / `GetProcAddress`.
 - If the umbrella library or any **required** symbol is missing, set `available_ = false`.
-- Honor an optional runtime directory provider option, `cudnn_path`, before falling back to
-  the default OS/library search paths. This mirrors the role of `onnxruntime_CUDNN_HOME` at
-  build time, but is deliberately runtime-only and points to the directory that contains the
-  cuDNN shared libraries (or to a cuDNN root directory with `bin` / `lib` children).
+- Do not accept a provider option that names a cuDNN runtime path. A provider-controlled
+  native library path is equivalent to a native code loading hook if provider options are
+  influenced by untrusted input.
 
 On Windows, cuDNN 9 is split into multiple DLLs. The loader should not rely on the process
-working directory or global `PATH`. If `cudnn_path` is set, first add that directory to the
-DLL search path for this load, or load the required cuDNN DLLs from that directory in a known
-order before loading `cudnn64_9.dll`. This is the C++ equivalent of the Python package's
-`preload_dlls()` behavior. On Linux, prefer loading the umbrella `libcudnn.so.9` from
-`cudnn_path` and let cuDNN resolve its own sub-libraries, matching the current Python preload
-behavior.
+working directory. Application or package code that needs an explicit directory should use a
+trusted process-level preload mechanism, such as Python `preload_dlls(cudnn=True,
+directory=...)`, before creating the session. On Linux, the C++ loader uses the default
+dynamic loader search behavior for the umbrella `libcudnn.so.9`; deployment should provide
+trusted library paths via the system loader, container image, package manager, or explicit
+application preload.
 
 The loader must not run when the CUDA provider option `enable_cudnn=0` is set (see §3.3).
 This keeps "force no cuDNN" tests deterministic even on machines where cuDNN is installed.
@@ -270,41 +269,29 @@ bool CudnnAvailable(const OpKernelContext* context);  // provider option && runt
 - `CudaErrString<cudnnStatus_t>` must not call `cudnnGetErrorString` when cuDNN is unavailable
   (route through the shim, which returns a static string in that case).
 
-### 3.3 CUDA provider options: `enable_cudnn` and `cudnn_path`
+### 3.3 CUDA provider option: `enable_cudnn`
 
 cuDNN can be disabled explicitly with a CUDA provider option:
 
 ```text
 enable_cudnn = 1  # default: try to load and use cuDNN when it is present
 enable_cudnn = 0  # do not load cuDNN; force native CUDA paths / Phase-1 NOT_IMPLEMENTED
-cudnn_path = /path/to/cudnn/lib-or-bin  # optional: runtime search directory for cuDNN DLLs/SOs
 ```
 
-`enable_cudnn` and `cudnn_path` serve different purposes:
-
-- `enable_cudnn` is the policy switch. When it is `0`, ORT must not attempt to load cuDNN,
-  even if `cudnn_path` is set.
-- `cudnn_path` is a location hint. When `enable_cudnn=1`, the lazy loader searches this
-  directory first, then falls back to default OS/library paths. It should not force cuDNN to
-  be required; if the directory is missing cuDNN, the loader reports cuDNN unavailable and
-  Phase-1/2/3 behavior proceeds normally.
-- `cudnn_path` accepts a directory, not a single library file. The implementation may accept
-  either the directory that directly contains the shared libraries (`bin` on Windows, `lib`
-  on Linux package layouts) or a cuDNN root directory and internally probe common children
-  such as `bin`, `lib`, and `lib64`.
+`enable_cudnn` is the policy switch. When it is `0`, ORT must not attempt to load cuDNN. ORT
+intentionally does not provide a `cudnn_path` provider option because provider options can be
+supplied by higher-level configuration systems, and allowing them to choose a native DLL/SO
+path would create a library-loading security risk.
 
 Implementation details:
 
 - Add `constexpr const char* kEnableCudnn = "enable_cudnn"` in
   `cuda::provider_option_names`.
-- Add `constexpr const char* kCudnnPath = "cudnn_path"` in `cuda::provider_option_names`.
 - Add `bool enable_cudnn{true};` to `CUDAExecutionProviderInfo`.
-- Add `std::string cudnn_path;` to `CUDAExecutionProviderInfo`.
 - Parse it with `ProviderOptionsParser::AddAssignmentToReference(...)` in
   `CUDAExecutionProviderInfo::FromProviderOptions(...)`.
-- Emit both values from `CUDAExecutionProviderInfo::ToProviderOptions(...)`.
-- Include both values in `std::hash<CUDAExecutionProviderInfo>` because they change the EP
-  behavior.
+- Emit it from `CUDAExecutionProviderInfo::ToProviderOptions(...)`.
+- Include it in `std::hash<CUDAExecutionProviderInfo>` because it changes the EP behavior.
 - Do **not** add a field to `OrtCUDAProviderOptionsV2` for Phase 1. That struct is public C
   ABI surface; string-key provider options are sufficient and can be set through existing
   provider-options APIs.
@@ -320,8 +307,9 @@ effective_cudnn_available = info.enable_cudnn && CudnnLibrary::Get().Available()
 ```
 
 If `enable_cudnn=0`, ORT must not call `dlopen` / `LoadLibrary` for cuDNN and must not create
-a cuDNN handle. If `enable_cudnn=1` and `cudnn_path` is empty, ORT uses the default search
-behavior. If `cudnn_path` is non-empty, ORT searches it first.
+a cuDNN handle. If `enable_cudnn=1`, ORT uses trusted process-level library discovery: the
+system loader search path, package/container deployment, or an explicit preload performed by
+application code before session creation.
 
 ### 3.4 Phase 1 fallback behavior (chosen: throw at Run time)
 
@@ -398,15 +386,14 @@ present, behavior is byte‑for‑byte identical to today.
 2. **Loader (`cudnn_loader.{h,cc}`).**
    - `dlopen`/`LoadLibrary` of the cuDNN umbrella lib with versioned name candidates
      (`libcudnn.so.9`, `libcudnn.so`, `cudnn64_9.dll`, …).
-   - Search `cudnn_path` first when it is set. Accept a directory that directly contains the
-     cuDNN runtime libraries, and optionally probe `bin`, `lib`, and `lib64` if the value is a
-     cuDNN root directory.
-   - On Windows, handle cuDNN 9 sub-DLL discovery explicitly: either add the chosen cuDNN
-     directory to the DLL search path for the duration of the load, or preload required
-     cuDNN sub-DLLs in dependency order before loading `cudnn64_9.dll`.
+   - Do not accept a provider-supplied cuDNN path. Rely on trusted deployment/library-search
+     mechanisms or application-controlled preloading.
+   - On Windows, avoid relying on the process working directory. Python package users who need
+     an explicit directory should call `preload_dlls(cudnn=True, directory=...)` from trusted
+     application code before creating a CUDA EP session.
    - Resolve all manifest symbols; populate function‑pointer table.
-   - `Available()` + thread‑safe one‑time init; report whether cuDNN was loaded from
-     `cudnn_path` or from the default search path for diagnostics.
+   - `Available()` + thread‑safe one‑time init; report loader diagnostics without exposing a
+     provider-controlled library path option.
    - Expose the raw library handle to `cudnn_frontend` dynamic-loading mode.
    - Define and maintain `cudnn_frontend::cudnn_dlhandle` in one ORT translation unit when
      `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` is enabled. Set it to the loader's cuDNN handle
@@ -473,19 +460,19 @@ present, behavior is byte‑for‑byte identical to today.
      `build_plans()`, or `execute()`.
 
 7. **Provider-option plumbing.**
-  - Add and parse `enable_cudnn` and `cudnn_path` in `CUDAExecutionProviderInfo`.
-  - Return them from `GetProviderOptions()` / `ToProviderOptions()`.
-  - Include them in the EP hash.
+  - Add and parse `enable_cudnn` in `CUDAExecutionProviderInfo`.
+  - Return it from `GetProviderOptions()` / `ToProviderOptions()`.
+  - Include it in the EP hash.
   - Add tests for parsing: `enable_cudnn` default true, `"0"` false, `"1"` true, invalid
-    values rejected; `cudnn_path` default empty and round-trips as a string.
+    values rejected.
 
 8. **Error‑string safety.**
    - Make `CudaErrString<cudnnStatus_t>` shim‑safe.
-  - Make `CudaErrString<cudnn_frontend::error_t>` report frontend dynamic-loading failures
-    without assuming cuDNN is available.
+   - Make `CudaErrString<cudnn_frontend::error_t>` report frontend dynamic-loading failures
+     without assuming cuDNN is available.
 
 9. **Docs & messaging.**
-   - Document the new behavior, `enable_cudnn`, and `cudnn_path`.
+   - Document the new behavior and `enable_cudnn`.
    - Update Python package guidance for `onnxruntime.preload_dlls(cuda=True, cudnn=True,
      directory=...)`: users can still preload a known cuDNN directory, but preloading is now
      optional for CUDA EP load because the provider itself lazy-loads cuDNN.
@@ -615,9 +602,6 @@ kernels; CUTLASS (already vendored) for conv/GEMM‑shaped work.
     need to physically remove cuDNN in CI.
   - "Disabled" is tested with CUDA provider option `enable_cudnn=0`; this should not touch
     the dynamic loader at all.
-  - `cudnn_path` is tested with a temporary directory / fake loader hook to verify search
-    precedence, missing-directory handling, and that `enable_cudnn=0` suppresses all loads
-    even when `cudnn_path` is set.
 - **Op‑level tests:** for each cuDNN op, assert the clear `NOT_IMPLEMENTED` error in the
   forced‑absent mode (Phase 1), and correctness in present mode.
 - **cuDNN frontend tests:** add a Conv / ConvTranspose test that exercises frontend graph
@@ -663,17 +647,21 @@ kernels; CUTLASS (already vendored) for conv/GEMM‑shaped work.
   symbols. *Mitigation:* after every `cudnn_frontend` update, grep/audit
   `cudnn_frontend_shim.h` and experimental shims, then update the frontend-symbol audit tests.
 - **cuDNN sub‑library packaging differences (v9 split libs; distro/conda/pip layouts).**
-  *Mitigation:* provide the `cudnn_path` provider option. On Linux, load only the umbrella
-  `libcudnn` and let cuDNN resolve its own sub‑libs. On Windows, add the chosen directory to
-  the DLL search path or preload required cuDNN sub-DLLs before `cudnn64_9.dll`, following the
-  same ordering already encoded in Python `preload_dlls()`.
+  *Mitigation:* rely on trusted deployment mechanisms for library discovery. Python users who
+  need a specific directory can call `onnxruntime.preload_dlls(cudnn=True, directory=...)`
+  from application code before creating the session; C++ deployments should use container,
+  package-manager, or system loader configuration.
+- **Native library path provider options can become code-loading hooks.** If an attacker can
+  influence a provider option that names a DLL/SO directory, they can potentially cause ORT
+  to load attacker-controlled native code. *Mitigation:* do not expose a `cudnn_path` provider
+  option. Keep custom library discovery at trusted process/deployment layers instead.
 - **Python preload behavior can conflict with optional cuDNN.** Today
   `onnxruntime.preload_dlls(cudnn=True)` tries to load cuDNN and prints installation guidance
   on failure. That is useful when the user requested preloading, but too alarming if cuDNN is
   optional. *Mitigation:* keep explicit preloading available, but avoid invoking or requiring
   cuDNN preload as part of normal optional-cuDNN package import / CUDA EP load. Update docs so
-  users who need a specific cuDNN directory can use either provider option `cudnn_path` or
-  `preload_dlls(cudnn=True, directory=...)` before creating the session.
+  users who need a specific cuDNN directory can call `preload_dlls(cudnn=True, directory=...)`
+  before creating the session.
 - **Python version metadata currently assumes cuDNN on Windows CUDA builds.**
   `cmake/onnxruntime_python.cmake` fails if no `cudnn64_*.dll` is found when generating
   `version_info.py`. *Mitigation:* make `cudnn_version` optional in that generated file.
@@ -714,8 +702,9 @@ kernels; CUTLASS (already vendored) for conv/GEMM‑shaped work.
 
 - Force-disabling cuDNN is a **CUDA provider option**, not a session option. Use
   `enable_cudnn=0`.
-- Providing a custom cuDNN runtime directory is also a **CUDA provider option**. Use
-  `cudnn_path=<directory>` while keeping `enable_cudnn=1`.
+- Providing a custom cuDNN runtime directory is **not** a CUDA provider option. Use trusted
+  deployment/library-search configuration, or Python `preload_dlls(cudnn=True, directory=...)`
+  from application code before creating the session.
 - No new "CUDA-minimal" wheel is required for Phase 1. cuDNN DLLs are not packed in the wheel
   today.
 - Keep the existing `USE_CUDA_MINIMAL` build-time path. It is used by RTX/TensorRT-related EP
diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
index db7d278bc454d..65107dca173ae 100755
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
@@ -339,7 +339,7 @@ CUDAExecutionProvider::CUDAExecutionProvider(const CUDAExecutionProviderInfo& in
 #endif
 
 #ifndef USE_CUDA_MINIMAL
-  cuda::CudnnLibrary::Get().Configure(info_.enable_cudnn, info_.cudnn_path);
+  cuda::CudnnLibrary::Get().Configure(info_.enable_cudnn);
 #endif
 
   CUDA_CALL_THROW(cudaSetDevice(info_.device_id));
diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc
index 71e7ba73e68e4..496eb84e747c0 100644
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc
@@ -28,7 +28,6 @@ constexpr const char* kCudnnConvUseMaxWorkspace = "cudnn_conv_use_max_workspace"
 constexpr const char* kEnableCudaGraph = "enable_cuda_graph";
 constexpr const char* kCudnnConv1dPadToNc1d = "cudnn_conv1d_pad_to_nc1d";
 constexpr const char* kEnableCudnn = "enable_cudnn";
-constexpr const char* kCudnnPath = "cudnn_path";
 constexpr const char* kTunableOpEnable = "tunable_op_enable";
 constexpr const char* kTunableOpTuningEnable = "tunable_op_tuning_enable";
 constexpr const char* kTunableOpMaxTuningDurationMs = "tunable_op_max_tuning_duration_ms";
@@ -118,7 +117,6 @@ CUDAExecutionProviderInfo CUDAExecutionProviderInfo::FromProviderOptions(const P
           .AddAssignmentToReference(cuda::provider_option_names::kEnableCudaGraph, info.enable_cuda_graph)
           .AddAssignmentToReference(cuda::provider_option_names::kCudnnConv1dPadToNc1d, info.cudnn_conv1d_pad_to_nc1d)
           .AddAssignmentToReference(cuda::provider_option_names::kEnableCudnn, info.enable_cudnn)
-          .AddAssignmentToReference(cuda::provider_option_names::kCudnnPath, info.cudnn_path)
           .AddAssignmentToReference(cuda::provider_option_names::kEnableSkipLayerNormStrictMode, info.enable_skip_layer_norm_strict_mode)
           .AddAssignmentToReference(cuda::provider_option_names::kPreferNHWCMode, info.prefer_nhwc)
           .AddAssignmentToReference(cuda::provider_option_names::kUseEPLevelUnifiedStream, info.use_ep_level_unified_stream)
@@ -172,7 +170,6 @@ ProviderOptions CUDAExecutionProviderInfo::ToProviderOptions(const CUDAExecution
       {cuda::provider_option_names::kEnableCudaGraph, MakeStringWithClassicLocale(info.enable_cuda_graph)},
       {cuda::provider_option_names::kCudnnConv1dPadToNc1d, MakeStringWithClassicLocale(info.cudnn_conv1d_pad_to_nc1d)},
       {cuda::provider_option_names::kEnableCudnn, MakeStringWithClassicLocale(info.enable_cudnn)},
-      {cuda::provider_option_names::kCudnnPath, info.cudnn_path},
       {cuda::provider_option_names::kTunableOpEnable, MakeStringWithClassicLocale(info.tunable_op.enable)},
       {cuda::provider_option_names::kTunableOpTuningEnable, MakeStringWithClassicLocale(info.tunable_op.tuning_enable)},
       {cuda::provider_option_names::kTunableOpMaxTuningDurationMs, MakeStringWithClassicLocale(info.tunable_op.max_tuning_duration_ms)},
@@ -199,7 +196,6 @@ ProviderOptions CUDAExecutionProviderInfo::ToProviderOptions(const OrtCUDAProvid
       {cuda::provider_option_names::kCudnnConvUseMaxWorkspace, MakeStringWithClassicLocale(info.cudnn_conv_use_max_workspace)},
       {cuda::provider_option_names::kCudnnConv1dPadToNc1d, MakeStringWithClassicLocale(info.cudnn_conv1d_pad_to_nc1d)},
       {cuda::provider_option_names::kEnableCudnn, MakeStringWithClassicLocale(true)},
-      {cuda::provider_option_names::kCudnnPath, ""},
       {cuda::provider_option_names::kTunableOpEnable, MakeStringWithClassicLocale(info.tunable_op_enable)},
       {cuda::provider_option_names::kTunableOpTuningEnable, MakeStringWithClassicLocale(info.tunable_op_tuning_enable)},
       {cuda::provider_option_names::kTunableOpMaxTuningDurationMs, MakeStringWithClassicLocale(info.tunable_op_max_tuning_duration_ms)},
diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h
index a3095c8afb23b..521c392cf3f09 100644
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider_info.h
@@ -71,7 +71,6 @@ struct CUDAExecutionProviderInfo {
   bool cudnn_conv1d_pad_to_nc1d{false};
 
   bool enable_cudnn{true};
-  std::string cudnn_path;
 
   cuda::TunableOpInfo tunable_op{};
 
@@ -120,7 +119,6 @@ struct std::hash<::onnxruntime::CUDAExecutionProviderInfo> {
     onnxruntime::HashCombine(info.tunable_op.max_tuning_duration_ms, value);
     onnxruntime::HashCombine(info.sdpa_kernel, value);
     onnxruntime::HashCombine(info.enable_cudnn, value);
-    onnxruntime::HashCombine(info.cudnn_path, value);
 
     // Memory pointers
     onnxruntime::HashCombine(reinterpret_cast<size_t>(info.user_compute_stream), value);
diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.cc b/onnxruntime/core/providers/cuda/cudnn_loader.cc
index ffdf322efd317..2709da40b3705 100644
--- a/onnxruntime/core/providers/cuda/cudnn_loader.cc
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.cc
@@ -5,7 +5,6 @@
 
 #ifndef USE_CUDA_MINIMAL
 
-#include <utility>
 #include <vector>
 
 #ifdef _WIN32
@@ -17,22 +16,7 @@
 
 namespace {
 
-std::string JoinPath(std::string_view base, std::string_view child) {
-  if (base.empty()) {
-    return std::string(child);
-  }
-  const char last = base.back();
-  if (last == '/' || last == '\\') {
-    return std::string(base) + std::string(child);
-  }
-#ifdef _WIN32
-  return std::string(base) + "\\" + std::string(child);
-#else
-  return std::string(base) + "/" + std::string(child);
-#endif
-}
-
-std::vector<std::string> GetCandidateLibraryNames(std::string_view cudnn_path) {
+std::vector<std::string> GetCandidateLibraryNames() {
 #ifdef _WIN32
   constexpr const char* kCudnnLibraryName = "cudnn64_9.dll";
 #else
@@ -41,19 +25,6 @@ std::vector<std::string> GetCandidateLibraryNames(std::string_view cudnn_path) {
 #endif
 
   std::vector<std::string> candidates;
-  if (!cudnn_path.empty()) {
-    candidates.push_back(JoinPath(cudnn_path, kCudnnLibraryName));
-#ifdef _WIN32
-    candidates.push_back(JoinPath(JoinPath(cudnn_path, "bin"), kCudnnLibraryName));
-#else
-    candidates.push_back(JoinPath(cudnn_path, kCudnnUnversionedLibraryName));
-    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib"), kCudnnLibraryName));
-    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib"), kCudnnUnversionedLibraryName));
-    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib64"), kCudnnLibraryName));
-    candidates.push_back(JoinPath(JoinPath(cudnn_path, "lib64"), kCudnnUnversionedLibraryName));
-#endif
-  }
-
   candidates.push_back(kCudnnLibraryName);
 #ifndef _WIN32
   candidates.push_back(kCudnnUnversionedLibraryName);
@@ -116,16 +87,13 @@ CudnnLibrary& CudnnLibrary::Get() {
   return library;
 }
 
-void CudnnLibrary::Configure(bool enabled, std::string cudnn_path) {
+void CudnnLibrary::Configure(bool enabled) {
   std::lock_guard<std::mutex> lock(mutex_);
   if (!enabled) {
     return;
   }
 
   enabled_ = enabled;
-  if (!load_attempted_) {
-    cudnn_path_ = std::move(cudnn_path);
-  }
 }
 
 bool CudnnLibrary::Available() {
@@ -155,7 +123,7 @@ bool CudnnLibrary::EnsureLoaded() {
 
   load_attempted_ = true;
   std::string last_error;
-  for (const auto& candidate : GetCandidateLibraryNames(cudnn_path_)) {
+  for (const auto& candidate : GetCandidateLibraryNames()) {
     handle_ = LoadLibraryCandidate(candidate, last_error);
     if (handle_ != nullptr) {
       available_ = true;
@@ -203,7 +171,7 @@ void* CudnnLibrary::ResolveSymbol(const char* symbol) {
 }
 
 const char* CudnnUnavailableErrorString() {
-  return "cuDNN is not available. Install cuDNN, set CUDA provider option cudnn_path, or set enable_cudnn=0 to force native CUDA paths where available.";
+  return "cuDNN is not available. Install cuDNN, update the system library search path, or set enable_cudnn=0 to force native CUDA paths where available.";
 }
 
 }  // namespace onnxruntime::cuda
diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.h b/onnxruntime/core/providers/cuda/cudnn_loader.h
index 47f78fa75f006..1328104074820 100644
--- a/onnxruntime/core/providers/cuda/cudnn_loader.h
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.h
@@ -17,7 +17,7 @@ class CudnnLibrary {
  public:
   static CudnnLibrary& Get();
 
-  void Configure(bool enabled, std::string cudnn_path);
+  void Configure(bool enabled);
   bool Available();
   const char* Error() const;
   void* Handle();
@@ -37,7 +37,6 @@ class CudnnLibrary {
   bool enabled_{true};
   bool load_attempted_{false};
   bool available_{false};
-  std::string cudnn_path_;
   std::string error_;
   void* handle_{nullptr};
   std::unordered_map<std::string, void*> symbols_;
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc b/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
index 6555f55541c4c..d5f3b79783895 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
@@ -128,7 +128,7 @@ CudaEp::CudaEp(CudaEpFactory& factory, const Config& config, const OrtLogger& lo
   // ORT uses it to avoid reading OrtEp struct fields that did not exist when the plugin was compiled.
   ort_version_supported = ORT_API_VERSION;
 
-  onnxruntime::cuda::CudnnLibrary::Get().Configure(config_.enable_cudnn, config_.cudnn_path);
+  onnxruntime::cuda::CudnnLibrary::Get().Configure(config_.enable_cudnn);
 
   // The plugin is compiled against the latest ORT headers (ORT_API_VERSION) but may be loaded by an
   // older ORT runtime, down to the floor declared in plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION. Some
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep.h b/onnxruntime/core/providers/cuda/plugin/cuda_ep.h
index 38917cd1be07a..e68adaef99851 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep.h
@@ -32,7 +32,6 @@ class CudaEp : public onnxruntime::ep::adapter::Ep {
     bool cudnn_conv_use_max_workspace = true;         ///< Use maximum workspace for cuDNN conv algo search.
     bool cudnn_conv1d_pad_to_nc1d = false;            ///< Pad 1D convolutions to NC1D format.
     bool enable_cudnn = true;                         ///< Enable runtime loading and use of cuDNN-backed kernels.
-    std::string cudnn_path;                           ///< Optional directory containing cuDNN runtime libraries.
     bool fuse_conv_bias = false;                      ///< Enable cuDNN frontend conv+bias fusion.
     int sdpa_kernel = 0;                              ///< Attention backend bitmask override.
     bool enable_cuda_graph = false;                   ///< Enable CUDA graph capture and replay.
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc b/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
index 3b79d3ca4ef97..ffd89e2b1e8db 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
@@ -420,18 +420,6 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateEpImpl(
     }
   };
 
-  auto read_session_config_string = [&](std::initializer_list<std::string_view> keys, std::string& value) {
-    for (const auto& key : keys) {
-      auto raw_value = try_get_session_config(key);
-      if (!raw_value.has_value()) {
-        continue;
-      }
-
-      value = std::move(*raw_value);
-      return;
-    }
-  };
-
   auto read_cudnn_conv_algo = [&](std::initializer_list<std::string_view> keys, int& value) {
     for (const auto& key : keys) {
       auto raw_value = try_get_session_config(key);
@@ -500,7 +488,6 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateEpImpl(
   const std::string cudnn_conv_algo_key = ep_options_prefix + "cudnn_conv_algo";
   const std::string cudnn_conv_algo_search_key = ep_options_prefix + "cudnn_conv_algo_search";
   const std::string enable_cudnn_key = ep_options_prefix + "enable_cudnn";
-  const std::string cudnn_path_key = ep_options_prefix + "cudnn_path";
   const std::string fuse_conv_bias_key = ep_options_prefix + "fuse_conv_bias";
   const std::string sdpa_kernel_key = ep_options_prefix + "sdpa_kernel";
   const std::string enable_cuda_graph_key = ep_options_prefix + "enable_cuda_graph";
@@ -532,12 +519,9 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateEpImpl(
       {cudnn_conv_algo_search_key, cudnn_conv_algo_key, "ep.cuda.cudnn_conv_algo_search", "ep.cuda.cudnn_conv_algo",
        "cudnn_conv_algo_search", "cudnn_conv_algo"},
       config.cudnn_conv_algo);
-    read_session_config_bool(
+  read_session_config_bool(
       {enable_cudnn_key, "ep.cuda.enable_cudnn", "enable_cudnn"},
       config.enable_cudnn);
-    read_session_config_string(
-      {cudnn_path_key, "ep.cuda.cudnn_path", "cudnn_path"},
-      config.cudnn_path);
   read_session_config_bool(
       {fuse_conv_bias_key, "ep.cuda.fuse_conv_bias", "fuse_conv_bias"},
       config.fuse_conv_bias);
diff --git a/onnxruntime/test/python/onnxruntime_test_python.py b/onnxruntime/test/python/onnxruntime_test_python.py
index 7351ea3f3cac2..d60acfc8bc34f 100644
--- a/onnxruntime/test/python/onnxruntime_test_python.py
+++ b/onnxruntime/test/python/onnxruntime_test_python.py
@@ -441,8 +441,6 @@ def test_get_and_set_option_with_values(option_name, option_values):
 
                 test_get_and_set_option_with_values("enable_cudnn", ["1", "0"])
 
-                test_get_and_set_option_with_values("cudnn_path", ["", "/tmp/ort_cudnn"])
-
                 test_get_and_set_option_with_values("do_copy_in_default_stream", [0, 1])
 
                 test_get_and_set_option_with_values("tunable_op_enable", ["1", "0"])

From 6e3d2433915869dc83ff404fad2919ada04d4331 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Thu, 25 Jun 2026 00:51:24 +0000
Subject: [PATCH 04/15] update plugin test

---
 .github/workflows/windows_cuda_no_cudnn.yml   | 238 ++++++++++++++++++
 docs/cuda_plugin_ep/QUICK_START.md            |  21 ++
 docs/cuda_plugin_ep/cuda_plugin_ep_design.md  |  17 +-
 .../transformers/test_cuda_plugin_ep.py       | 103 +++++---
 4 files changed, 346 insertions(+), 33 deletions(-)
 create mode 100644 .github/workflows/windows_cuda_no_cudnn.yml

diff --git a/.github/workflows/windows_cuda_no_cudnn.yml b/.github/workflows/windows_cuda_no_cudnn.yml
new file mode 100644
index 0000000000000..51ff918458a5a
--- /dev/null
+++ b/.github/workflows/windows_cuda_no_cudnn.yml
@@ -0,0 +1,238 @@
+name: Windows CUDA No cuDNN CI
+
+on:
+  pull_request:
+    branches: [main, 'rel-*']
+    paths:
+      - '.github/workflows/windows_cuda_no_cudnn.yml'
+      - 'cmake/onnxruntime_providers_cuda.cmake'
+      - 'cmake/onnxruntime_providers_cuda_plugin.cmake'
+      - 'cmake/onnxruntime_python.cmake'
+      - 'docs/CUDA_cuDNN_Optional_Design.md'
+      - 'docs/cuda_plugin_ep/**'
+      - 'onnxruntime/__init__.py'
+      - 'onnxruntime/core/providers/cuda/**'
+      - 'onnxruntime/test/python/transformers/test_cuda_plugin_ep.py'
+  workflow_dispatch:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event_name == 'pull_request' && github.ref || github.sha }}
+  cancel-in-progress: true
+
+jobs:
+  build:
+    name: Windows CUDA Plugin EP Build without cuDNN
+    runs-on: [
+        "self-hosted",
+        "1ES.Pool=onnxruntime-github-vs2022-latest",
+        "JobId=windows-cuda-plugin-no-cudnn-${{ github.run_id }}-${{ github.run_number }}-${{ github.run_attempt }}"
+        ]
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          submodules: 'none'
+
+      - uses: actions/setup-python@v6
+        with:
+          python-version: '3.14'
+          architecture: x64
+
+      - name: Locate vcvarsall and Setup Env
+        uses: ./.github/actions/locate-vcvarsall-and-setup-env
+        with:
+          architecture: x64
+
+      - name: Install python modules
+        run: python -m pip install -r .\tools\ci_build\github\windows\python\requirements.txt
+        working-directory: ${{ github.workspace }}
+        shell: cmd
+
+      - name: Download CUDA SDK v13.0
+        working-directory: ${{ runner.temp }}
+        run: |
+          azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/cuda_sdk/v13.0" .
+          dir
+        shell: pwsh
+
+      - name: Download cuDNN headers for CUDA 13.0
+        working-directory: ${{ runner.temp }}
+        run: |
+          azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/cudnn_sdk/$env:CUDNN_FOLDER" .
+          dir
+        shell: pwsh
+
+      - name: Add CUDA to PATH
+        shell: pwsh
+        run: |
+          Write-Host "Adding CUDA to PATH without adding any cuDNN directory"
+          Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\bin"
+          Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\extras\CUPTI\lib64"
+
+      - name: Set OnnxRuntimeBuildDirectory
+        shell: pwsh
+        run: |
+          $buildDir = Join-Path $env:RUNNER_TEMP "build"
+          echo "OnnxRuntimeBuildDirectory=$buildDir" >> $env:GITHUB_ENV
+
+      - name: Build ONNX Runtime with CUDA Plugin EP and no cuDNN runtime path
+        working-directory: ${{ runner.temp }}
+        shell: pwsh
+        run: |
+          python.exe ${{ github.workspace }}\tools\ci_build\build.py `
+            --update --build --config Release `
+            --build_dir build `
+            --skip_submodule_sync `
+            --parallel `
+            --nvcc_threads 4 `
+            --flash_nvcc_threads 4 `
+            --use_binskim_compliant_compile_flags `
+            --cmake_generator "Visual Studio 17 2022" `
+            --build_shared_lib `
+            --build_wheel `
+            --use_cuda `
+            --cuda_version=13.0 `
+            --cuda_home="$env:RUNNER_TEMP\v13.0" `
+            --cudnn_home="$env:RUNNER_TEMP\$env:CUDNN_FOLDER" `
+            --skip_tests `
+            --use_vcpkg `
+            --use_vcpkg_ms_internal_asset_cache `
+            --enable_cuda_profiling `
+            --cmake_extra_defines onnxruntime_QUICK_BUILD=ON `
+            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=86 `
+            --cmake_extra_defines onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON
+
+          if ($lastExitCode -ne 0) {
+            exit $lastExitCode
+          }
+
+          $outputDir = "${{ runner.temp }}\build\Release"
+          Write-Host "Cleaning up files from $outputDir..."
+          Remove-Item -Path "$outputDir\onnxruntime" -Recurse -Force -ErrorAction SilentlyContinue
+          Remove-Item -Path "$outputDir\pybind11" -Recurse -Force -ErrorAction SilentlyContinue
+          Remove-Item -Path "$outputDir\models" -Recurse -Force -ErrorAction SilentlyContinue
+          Remove-Item -Path "$outputDir\vcpkg_installed" -Recurse -Force -ErrorAction SilentlyContinue
+          Remove-Item -Path "$outputDir\_deps" -Recurse -Force -ErrorAction SilentlyContinue
+          Remove-Item -Path "$outputDir\CMakeCache.txt" -Force -ErrorAction SilentlyContinue
+          Remove-Item -Path "$outputDir\CMakeFiles" -Recurse -Force -ErrorAction SilentlyContinue
+          Remove-Item -Path $outputDir -Include "*.obj" -Recurse
+
+      - name: Upload build artifacts
+        uses: actions/upload-artifact@v6
+        with:
+          name: cuda-plugin-no-cudnn-build-artifacts
+          path: ${{ runner.temp }}\build
+    env:
+      DOTNET_SKIP_FIRST_TIME_EXPERIENCE: true
+      setVcvars: true
+      ALLOW_RELEASED_ONNX_OPSET_ONLY: '0'
+      ONNXRUNTIME_TEST_GPU_DEVICE_ID: '0'
+      AZCOPY_AUTO_LOGIN_TYPE: MSI
+      AZCOPY_MSI_CLIENT_ID: 63b63039-6328-442f-954b-5a64d124e5b4
+      CUDNN_FOLDER: 9.14.0.64_cuda13
+
+  test:
+    name: Windows CUDA Plugin EP Test without cuDNN
+    needs: build
+    timeout-minutes: 120
+    runs-on: [
+      "self-hosted",
+      "1ES.Pool=onnxruntime-github-Win2022-GPU-A10",
+      "JobId=windows-cuda-plugin-no-cudnn-test-${{ github.run_id }}-${{ github.run_number }}-${{ github.run_attempt }}"
+    ]
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          submodules: 'none'
+
+      - name: Download build artifacts
+        uses: actions/download-artifact@v7
+        with:
+          name: cuda-plugin-no-cudnn-build-artifacts
+          path: ${{ runner.temp }}\build
+
+      - uses: actions/setup-python@v6
+        with:
+          python-version: '3.14'
+          architecture: x64
+
+      - name: Locate vcvarsall and Setup Env
+        uses: ./.github/actions/locate-vcvarsall-and-setup-env
+        with:
+          architecture: x64
+
+      - name: Install python modules
+        run: python -m pip install -r .\tools\ci_build\github\windows\python\requirements.txt
+        working-directory: ${{ github.workspace }}
+        shell: cmd
+
+      - name: Install torch for CPU only
+        run: python -m pip install torch
+        working-directory: ${{ github.workspace }}
+        shell: cmd
+
+      - name: Download CUDA SDK v13.0
+        working-directory: ${{ runner.temp }}
+        run: |
+          azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/cuda_sdk/v13.0" .
+          dir
+        shell: pwsh
+
+      - name: Add CUDA to PATH
+        shell: pwsh
+        run: |
+          Write-Host "Adding CUDA to PATH without adding any cuDNN directory"
+          Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\bin"
+          Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\extras\CUPTI\lib64"
+
+      - name: Set OnnxRuntimeBuildDirectory
+        shell: pwsh
+        run: |
+          $buildDir = Join-Path $env:RUNNER_TEMP "build"
+          echo "OnnxRuntimeBuildDirectory=$buildDir" >> $env:GITHUB_ENV
+
+      - name: Install ONNX Runtime Wheel
+        uses: ./.github/actions/install-onnxruntime-wheel
+        with:
+          whl-directory: ${{ runner.temp }}\build\Release\Release\dist
+
+      - name: Verify GPU access
+        shell: pwsh
+        run: nvidia-smi
+
+      - name: Verify CUDA plugin has no direct cuDNN dependency
+        shell: pwsh
+        run: |
+          $pluginPath = "${{ runner.temp }}\build\Release\Release\onnxruntime_providers_cuda_plugin.dll"
+          if (-not (Test-Path $pluginPath)) {
+            Write-Error "CUDA plugin EP library not found at $pluginPath"
+            exit 1
+          }
+
+          dumpbin /dependents $pluginPath | Tee-Object -FilePath $env:RUNNER_TEMP\cuda_plugin_dependents.txt
+          if (Select-String -Path $env:RUNNER_TEMP\cuda_plugin_dependents.txt -Pattern "cudnn" -SimpleMatch -Quiet) {
+            Write-Error "CUDA plugin EP has a direct cuDNN dependency"
+            exit 1
+          }
+
+      - name: Run CUDA Plugin EP Python Tests without cuDNN
+        working-directory: ${{ github.workspace }}\onnxruntime\test\python\transformers
+        shell: pwsh
+        run: |
+          $env:ORT_CUDA_PLUGIN_PATH = "${{ runner.temp }}\build\Release\Release\onnxruntime_providers_cuda_plugin.dll"
+          $env:ORT_TEST_CUDA_PLUGIN_EP = "1"
+          $env:ORT_TEST_CUDA_PLUGIN_NO_CUDNN = "1"
+          Write-Host "ORT_CUDA_PLUGIN_PATH=$env:ORT_CUDA_PLUGIN_PATH"
+          python test_cuda_plugin_ep.py
+          if ($lastExitCode -ne 0) {
+            exit $lastExitCode
+          }
+    env:
+      DOTNET_SKIP_FIRST_TIME_EXPERIENCE: true
+      setVcvars: true
+      ALLOW_RELEASED_ONNX_OPSET_ONLY: '0'
+      ONNXRUNTIME_TEST_GPU_DEVICE_ID: '0'
+      AZCOPY_AUTO_LOGIN_TYPE: MSI
+      AZCOPY_MSI_CLIENT_ID: 63b63039-6328-442f-954b-5a64d124e5b4
+      CUDNN_FOLDER: 9.14.0.64_cuda13
diff --git a/docs/cuda_plugin_ep/QUICK_START.md b/docs/cuda_plugin_ep/QUICK_START.md
index 4055e1056b507..b5acab30748d9 100644
--- a/docs/cuda_plugin_ep/QUICK_START.md
+++ b/docs/cuda_plugin_ep/QUICK_START.md
@@ -15,6 +15,18 @@ build.bat --cmake_generator "Visual Studio 17 2022" --config Release --build_whe
           --cmake_extra_defines "onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON"
 ```
 
+### Building and testing without cuDNN at runtime
+
+The CUDA Plugin EP build still requires cuDNN headers, but the plugin library must not have a hard runtime dependency on cuDNN. When cuDNN is not present, non-cuDNN kernels can still run. Kernels that still require cuDNN fail with `NOT_IMPLEMENTED` unless they have a native CUDA fallback.
+
+For local Linux CUDA 13 validation, use the no-cuDNN helper script. It keeps `CUDNN_HOME` available for headers, excludes cuDNN directories from `PATH` and `LD_LIBRARY_PATH`, verifies the plugin has no direct cuDNN dependency, and runs plugin tests in no-cuDNN mode:
+
+```bash
+bash .env/cuda_130_plugin_no_cudnn.sh --build --test_plugin
+```
+
+The test mode sets `ORT_TEST_CUDA_PLUGIN_EP=1` and `ORT_TEST_CUDA_PLUGIN_NO_CUDNN=1`, which passes `enable_cudnn=0` to plugin sessions and skips plugin tests for operators that still require cuDNN, such as Conv, ConvTranspose, BatchNormalization, InstanceNormalization, LRN, ArgMax, reductions, Einsum, and cuDNN-backed pooling paths.
+
 ## Minimum ONNX Runtime Version
 
 The plugin is compiled against the ONNX Runtime headers in this repository, but it is designed to load into an **older** ONNX Runtime runtime as well. The minimum compatible version is declared in [`plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION`](../../plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION) (currently **1.24.4**) and is the single source of truth:
@@ -159,6 +171,15 @@ python test_cuda_plugin_ep.py
 
 The script validates plugin registration, device enumeration, provider options, operator coverage, and that key nodes are actually assigned to `CudaPluginExecutionProvider`.
 
+To run the same focused test against a plugin build without cuDNN in the runtime search path:
+
+```bash
+export ORT_TEST_CUDA_PLUGIN_NO_CUDNN=1
+export ORT_TEST_CUDA_PLUGIN_EP=1
+export ORT_CUDA_PLUGIN_PATH=/path/to/build/Release/libonnxruntime_providers_cuda_plugin.so
+python test_cuda_plugin_ep.py
+```
+
 ### Test against the minimum supported ORT version
 
 The plugin must keep working on the oldest supported ONNX Runtime (see [Minimum ONNX Runtime Version](#minimum-onnx-runtime-version)), not just the version it was built against. To validate this locally, install the minimum base runtime and run the same test against the freshly built plugin library:
diff --git a/docs/cuda_plugin_ep/cuda_plugin_ep_design.md b/docs/cuda_plugin_ep/cuda_plugin_ep_design.md
index 438fb8606fc09..953dcf9f4f738 100644
--- a/docs/cuda_plugin_ep/cuda_plugin_ep_design.md
+++ b/docs/cuda_plugin_ep/cuda_plugin_ep_design.md
@@ -26,6 +26,21 @@ The ORT CUDA build produces four separate libraries:
 | `onnxruntime_providers_cuda` | `libonnxruntime_providers_cuda.so` | Shared module | In-tree CUDA EP (uses `SHARED_PROVIDER` bridge) |
 | `onnxruntime_providers_cuda_plugin` | `libonnxruntime_providers_cuda_plugin.so` | Shared module | Plugin CUDA EP (uses EP API adapters) |
 
+### 2.1.1 Optional cuDNN Runtime Dependency
+
+The CUDA Plugin EP follows the in-tree CUDA EP's optional-cuDNN model. cuDNN headers are still required at build time, but the plugin shared library must not link directly to cuDNN or contain a cuDNN DLL/SO in its dynamic dependency table. cuDNN is loaded lazily through the ORT cuDNN loader when `enable_cudnn` is enabled and the runtime libraries are available through trusted process-level library discovery.
+
+The plugin exposes the same `enable_cudnn` provider option as the in-tree CUDA EP:
+
+```text
+enable_cudnn = 1  # default: try to load and use cuDNN when available
+enable_cudnn = 0  # do not load cuDNN; run native CUDA paths or fail cuDNN-required ops with NOT_IMPLEMENTED
+```
+
+There is intentionally no provider option for a custom cuDNN DLL/SO path. Provider options can flow from higher-level configuration systems, so allowing them to choose a native library path would create a code-loading security risk. Deployments that need a specific cuDNN directory should use trusted process-level mechanisms, such as the OS loader configuration, container image setup, or Python `preload_dlls(cudnn=True, directory=...)` before plugin registration.
+
+No-cuDNN plugin validation runs `test_cuda_plugin_ep.py` with `ORT_TEST_CUDA_PLUGIN_EP=1` and `ORT_TEST_CUDA_PLUGIN_NO_CUDNN=1`. That mode passes `enable_cudnn=0` to plugin sessions and skips tests for operators that still require cuDNN in the current implementation. Non-cuDNN operator coverage, plugin registration, device enumeration, graph assignment, CUDA graph, I/O binding, and profiling tests continue to run.
+
 ### 2.2 Preprocessor Defines
 
 Each build target uses different preprocessor defines that control how framework types are resolved:
@@ -615,7 +630,7 @@ The in-tree CUDA EP and shared provider bridge are compiled identically regardle
 
 ### 9.3 Plugin Independence
 
-`libonnxruntime_providers_cuda_plugin.so` is **fully self-contained**. It does not depend on `libonnxruntime_providers_cuda.so` or `libonnxruntime_providers_shared.so` at load time. It statically links against `onnxruntime_framework`, `onnxruntime_graph`, `onnxruntime_common`, `onnxruntime_mlas`, `onnxruntime_flatbuffers`, and links dynamically against CUDA (`cudart`, `cublas`, `cublasLt`, `cufft`), cuDNN, and protobuf. Communication with the ORT runtime happens exclusively through the C API (`OrtApi`/`OrtEpApi`) passed at load time.
+`libonnxruntime_providers_cuda_plugin.so` is **fully self-contained**. It does not depend on `libonnxruntime_providers_cuda.so` or `libonnxruntime_providers_shared.so` at load time. It statically links against `onnxruntime_framework`, `onnxruntime_graph`, `onnxruntime_common`, `onnxruntime_mlas`, `onnxruntime_flatbuffers`, and links dynamically against CUDA (`cudart`, `cublas`, `cublasLt`, `cufft`) and protobuf. cuDNN is loaded lazily only when enabled and available at runtime. Communication with the ORT runtime happens exclusively through the C API (`OrtApi`/`OrtEpApi`) passed at load time.
 
 ### 9.4 Build Outputs
 
diff --git a/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py b/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
index 3cb1392919398..3ab06edb92011 100644
--- a/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
+++ b/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
@@ -27,6 +27,22 @@
 TEST_SKIP = "SKIP"
 TEST_FAIL = "FAIL"
 EP_GRAPH_ASSIGNMENT_CONFIG_KEY = "session.record_ep_graph_assignment_info"
+NO_CUDNN_PLUGIN_TEST = os.getenv("ORT_TEST_CUDA_PLUGIN_NO_CUDNN", "").upper() in {"1", "ON", "TRUE", "YES"}
+requires_cudnn = unittest.skipIf(NO_CUDNN_PLUGIN_TEST, "test requires cuDNN-backed CUDA plugin kernels")
+DEFAULT_ONNX_OPSET = 26
+
+
+def _make_released_opset_model(graph, producer_name="onnx-example"):
+    opset = OperatorSetIdProto()
+    opset.version = DEFAULT_ONNX_OPSET
+    return helper.make_model(graph, producer_name=producer_name, opset_imports=[opset])
+
+
+def _plugin_provider_options(extra_options=None):
+    options = {"enable_cudnn": "0"} if NO_CUDNN_PLUGIN_TEST else {}
+    if extra_options:
+        options.update(extra_options)
+    return options
 
 
 def require_cuda_plugin_ep():
@@ -140,7 +156,7 @@ def create_add_model(model_path):
         ],
         [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [3, 2])],
     )
-    model_def = helper.make_model(graph_def, producer_name="onnx-example")
+    model_def = _make_released_opset_model(graph_def)
     save(model_def, model_path)
 
 
@@ -156,7 +172,7 @@ def create_matmul_model(model_path):
         ],
         [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [3, 5])],
     )
-    model_def = helper.make_model(graph_def, producer_name="onnx-example")
+    model_def = _make_released_opset_model(graph_def)
     save(model_def, model_path)
 
 
@@ -181,7 +197,7 @@ def create_gemm_model(model_path, alpha=1.0, beta=1.0, transA=0, transB=0):
         ],
         [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [m, n])],
     )
-    model_def = helper.make_model(graph_def, producer_name="onnx-example")
+    model_def = _make_released_opset_model(graph_def)
     save(model_def, model_path)
 
 
@@ -323,7 +339,7 @@ def run_operator_test(
     try:
         model_creator(model_path)
         sess_options = _create_session_options(session_config)
-        sess_options.add_provider_for_devices([target_device], {})
+        sess_options.add_provider_for_devices([target_device], _plugin_provider_options())
         sess = onnxrt.InferenceSession(model_path, sess_options=sess_options)
 
         active_providers = sess.get_providers()
@@ -375,7 +391,7 @@ def run_provider_options_test(provider_options, expect_plugin_provider=True):
         model_path = tmp.name
     try:
         create_add_model(model_path)
-        providers = [(CUDA_PLUGIN_EP_NAME, provider_options), "CPUExecutionProvider"]
+        providers = [(CUDA_PLUGIN_EP_NAME, _plugin_provider_options(provider_options)), "CPUExecutionProvider"]
         sess = onnxrt.InferenceSession(model_path, sess_options=_create_session_options(), providers=providers)
         active_providers = sess.get_providers()
         assigned_nodes, assignment_info = _get_assigned_nodes(sess, CUDA_PLUGIN_EP_NAME)
@@ -483,7 +499,7 @@ def _run_nhwc_model_test(target_device, op_name, model, feed_dict, expected_fn,
     try:
         save(model, model_path)
         sess_options = _create_session_options(_NHWC_CONFIG)
-        sess_options.add_provider_for_devices([target_device], {})
+        sess_options.add_provider_for_devices([target_device], _plugin_provider_options())
         sess = onnxrt.InferenceSession(model_path, sess_options=sess_options)
         assigned_nodes, assignment_info = _get_assigned_nodes(sess, CUDA_PLUGIN_EP_NAME)
         if not assigned_nodes:
@@ -556,7 +572,7 @@ def _run_model_test(
     try:
         save(model, model_path)
         sess_options = _create_session_options()
-        sess_options.add_provider_for_devices([target_device], {})
+        sess_options.add_provider_for_devices([target_device], _plugin_provider_options())
         sess = onnxrt.InferenceSession(model_path, sess_options=sess_options)
         active_providers = sess.get_providers()
         assigned_nodes, assignment_info = _get_assigned_nodes(sess, ep_name)
@@ -628,6 +644,7 @@ def test_registration_gemm(self):
         )
         self.assertTrue(result, "Gemm plugin registration test failed")
 
+    @requires_cudnn
     def test_registration_conv(self):
         target_device = get_cuda_plugin_device()
         inputs = {
@@ -658,7 +675,7 @@ def test_provider_options_second_device(self):
             model_path = tmp.name
         try:
             create_add_model(model_path)
-            providers = [(CUDA_PLUGIN_EP_NAME, {"device_id": "1"}), "CPUExecutionProvider"]
+            providers = [(CUDA_PLUGIN_EP_NAME, _plugin_provider_options({"device_id": "1"})), "CPUExecutionProvider"]
             sess = onnxrt.InferenceSession(model_path, sess_options=_create_session_options(), providers=providers)
 
             active_providers = sess.get_providers()
@@ -687,6 +704,7 @@ def test_provider_options_second_device(self):
 
     # ---- NHWC layout tests ----
 
+    @requires_cudnn
     def test_nhwc_conv(self):
         target_device = get_cuda_plugin_device()
         inputs = {
@@ -703,6 +721,7 @@ def test_nhwc_conv(self):
         )
         self.assertTrue(result, "Conv (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_batch_normalization(self):
         target_device = get_cuda_plugin_device()
         inputs = {"X": np.random.rand(1, 3, 4, 4).astype(np.float32)}
@@ -716,6 +735,7 @@ def test_nhwc_batch_normalization(self):
         )
         self.assertTrue(result, "BatchNormalization (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_maxpool(self):
         target_device = get_cuda_plugin_device()
         inputs = {"X": np.random.rand(1, 3, 4, 4).astype(np.float32)}
@@ -729,6 +749,7 @@ def test_nhwc_maxpool(self):
         )
         self.assertTrue(result, "MaxPool (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_avgpool(self):
         target_device = get_cuda_plugin_device()
         inputs = {"X": np.random.rand(1, 3, 4, 4).astype(np.float32)}
@@ -742,6 +763,7 @@ def test_nhwc_avgpool(self):
         )
         self.assertTrue(result, "AveragePool (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_conv_transpose(self):
         target_device = get_cuda_plugin_device()
         # ConvTranspose: input [1,2,4,4], weight [2,3,3,3] -> output [1,3,6,6] with stride=2, padding=1, output_padding=1
@@ -782,6 +804,7 @@ def expected_fn(feed):
         result = _run_nhwc_model_test(target_device, "ConvTranspose", model, {"X": x, "W": w}, expected_fn)
         self.assertEqual(result, TEST_PASS, "ConvTranspose (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_global_max_pool(self):
         target_device = get_cuda_plugin_device()
         f_dtype = TensorProto.FLOAT
@@ -800,6 +823,7 @@ def expected_fn(feed):
         result = _run_nhwc_model_test(target_device, "GlobalMaxPool", model, {"X": x}, expected_fn)
         self.assertEqual(result, TEST_PASS, "GlobalMaxPool (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_global_average_pool(self):
         target_device = get_cuda_plugin_device()
         f_dtype = TensorProto.FLOAT
@@ -867,6 +891,7 @@ def expected_fn(feed):
         result = _run_nhwc_model_test(target_device, "SpaceToDepth", model, {"X": x}, expected_fn)
         self.assertEqual(result, TEST_PASS, "SpaceToDepth (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_lrn(self):
         target_device = get_cuda_plugin_device()
         f_dtype = TensorProto.FLOAT
@@ -910,6 +935,7 @@ def expected_fn(feed):
         result = _run_nhwc_model_test(target_device, "GridSample", model, {"X": x, "grid": grid}, expected_fn)
         self.assertEqual(result, TEST_PASS, "GridSample (NHWC) plugin test failed")
 
+    @requires_cudnn
     def test_nhwc_conv_with_resource_accounting(self):
         # Smoke test for the NHWC two-pass partitioning flow combined with the resource
         # accountant (session.resource_cuda_partitioning_settings). The NHWC layout
@@ -1592,26 +1618,32 @@ def test_plugin_ep_claims_key_ops(self):
             ),
             # second unary
             ("Sigmoid", "", 13, [("X", TensorProto.FLOAT, [2, 4])], [("Y", TensorProto.FLOAT, [2, 4])], None),
-            # cuDNN: ConvTranspose (Conv already tested by test_registration_conv)
-            (
-                "ConvTranspose",
-                "",
-                13,
-                [("X", TensorProto.FLOAT, [1, 2, 3, 3]), ("W", TensorProto.FLOAT, [2, 3, 3, 3])],
-                [("Y", TensorProto.FLOAT, [1, 3, 5, 5])],
-                None,
-            ),
-            # cuDNN: LRN (local response normalization)
-            (
-                "LRN",
-                "",
-                13,
-                [("X", TensorProto.FLOAT, [1, 2, 4, 4])],
-                [("Y", TensorProto.FLOAT, [1, 2, 4, 4])],
-                {"size": 3},
-            ),
         ]
 
+        if not NO_CUDNN_PLUGIN_TEST:
+            probe_specs.extend(
+                [
+                    # cuDNN: ConvTranspose (Conv already tested by test_registration_conv)
+                    (
+                        "ConvTranspose",
+                        "",
+                        13,
+                        [("X", TensorProto.FLOAT, [1, 2, 3, 3]), ("W", TensorProto.FLOAT, [2, 3, 3, 3])],
+                        [("Y", TensorProto.FLOAT, [1, 3, 5, 5])],
+                        None,
+                    ),
+                    # cuDNN: LRN (local response normalization)
+                    (
+                        "LRN",
+                        "",
+                        13,
+                        [("X", TensorProto.FLOAT, [1, 2, 4, 4])],
+                        [("Y", TensorProto.FLOAT, [1, 2, 4, 4])],
+                        {"size": 3},
+                    ),
+                ]
+            )
+
         claimed = []
         not_claimed = []
         errors = []
@@ -1624,7 +1656,7 @@ def test_plugin_ep_claims_key_ops(self):
                 save(model, model_path)
                 sess_options = _create_session_options()
                 sess_options.graph_optimization_level = onnxrt.GraphOptimizationLevel.ORT_DISABLE_ALL
-                sess_options.add_provider_for_devices([target_device], {})
+                sess_options.add_provider_for_devices([target_device], _plugin_provider_options())
                 sess = onnxrt.InferenceSession(model_path, sess_options=sess_options)
                 assigned_nodes, _ = _get_assigned_nodes(sess, CUDA_PLUGIN_EP_NAME)
                 if assigned_nodes:
@@ -1650,6 +1682,7 @@ def test_plugin_ep_claims_key_ops(self):
 
     # ---- Newly-included ops that previously lacked tests ----
 
+    @requires_cudnn
     def test_op_einsum(self):
         """Test Einsum op (recently un-excluded from plugin build)."""
         target_device = get_cuda_plugin_device()
@@ -1664,6 +1697,7 @@ def test_op_einsum(self):
         result = _run_model_test(target_device, "Einsum", model, feed, lambda f: f["A"] @ f["B"])
         self.assertEqual(result, TEST_PASS, "Einsum test failed")
 
+    @requires_cudnn
     def test_op_einsum_batch(self):
         """Test Einsum op with batch matrix multiply."""
         target_device = get_cuda_plugin_device()
@@ -1793,6 +1827,7 @@ def test_op_flatten(self):
         result = _run_model_test(target_device, "Flatten", model, feed, lambda f: f["X"].reshape(2, 12))
         self.assertEqual(result, TEST_PASS, "Flatten test failed")
 
+    @requires_cudnn
     def test_op_argmax(self):
         target_device = get_cuda_plugin_device()
         model = _make_simple_model(
@@ -1863,6 +1898,7 @@ def expected(f):
         result = _run_model_test(target_device, "LayerNormalization", model, feed, expected)
         self.assertEqual(result, TEST_PASS, "LayerNormalization test failed")
 
+    @requires_cudnn
     def test_op_instance_normalization(self):
         target_device = get_cuda_plugin_device()
         n_channels = 3
@@ -1898,6 +1934,7 @@ def expected(f):
         result = _run_model_test(target_device, "InstanceNormalization", model, feed, expected)
         self.assertEqual(result, TEST_PASS, "InstanceNormalization test failed")
 
+    @requires_cudnn
     def test_op_conv_transpose(self):
         target_device = get_cuda_plugin_device()
         model = _make_simple_model(
@@ -1920,6 +1957,7 @@ def expected(f):
         result = _run_model_test(target_device, "ConvTranspose", model, feed, expected)
         self.assertEqual(result, TEST_PASS, "ConvTranspose test failed")
 
+    @requires_cudnn
     def test_op_reduce_mean(self):
         target_device = get_cuda_plugin_device()
         model = _make_simple_model(
@@ -1935,6 +1973,7 @@ def test_op_reduce_mean(self):
         )
         self.assertEqual(result, TEST_PASS, "ReduceMean test failed")
 
+    @requires_cudnn
     def test_op_reduce_sum(self):
         target_device = get_cuda_plugin_device()
         model = _make_simple_model(
@@ -2284,7 +2323,7 @@ def _create_cuda_graph_session(self, model_path, extra_session_config=None, prov
         if extra_session_config:
             for key, value in extra_session_config.items():
                 sess_options.add_session_config_entry(key, value)
-        provider_options = {"enable_cuda_graph": "1", **(provider_options or {})}
+        provider_options = _plugin_provider_options({"enable_cuda_graph": "1", **(provider_options or {})})
         providers = [(CUDA_PLUGIN_EP_NAME, provider_options), "CPUExecutionProvider"]
         return onnxrt.InferenceSession(model_path, sess_options=sess_options, providers=providers)
 
@@ -2470,7 +2509,7 @@ def test_cuda_graph_annotation_id(self):
                 ],
                 [helper.make_tensor_value_info("Y", TensorProto.FLOAT, ["M", "N"])],
             )
-            model_def = helper.make_model(graph_def, producer_name="onnx-example")
+            model_def = _make_released_opset_model(graph_def)
             save(model_def, model_path)
 
             session = self._create_cuda_graph_session(model_path)
@@ -2637,7 +2676,7 @@ def test_iobinding_add(self):
         try:
             create_add_model(model_path)
             sess_options = _create_session_options()
-            sess_options.add_provider_for_devices([target_device], {})
+            sess_options.add_provider_for_devices([target_device], _plugin_provider_options())
             sess = onnxrt.InferenceSession(model_path, sess_options=sess_options)
 
             assigned_nodes, assignment_info = _get_assigned_nodes(sess, CUDA_PLUGIN_EP_NAME)
@@ -2678,7 +2717,7 @@ def test_iobinding_matmul(self):
         try:
             create_matmul_model(model_path)
             sess_options = _create_session_options()
-            sess_options.add_provider_for_devices([target_device], {})
+            sess_options.add_provider_for_devices([target_device], _plugin_provider_options())
             sess = onnxrt.InferenceSession(model_path, sess_options=sess_options)
 
             assigned_nodes, assignment_info = _get_assigned_nodes(sess, CUDA_PLUGIN_EP_NAME)
@@ -2726,7 +2765,7 @@ def _run_profiling_test(self):
         try:
             create_matmul_model(model_path)
             sess_options = _create_session_options()
-            sess_options.add_provider_for_devices([target_device], {})
+            sess_options.add_provider_for_devices([target_device], _plugin_provider_options())
 
             profile_prefix = os.path.join(tempfile.gettempdir(), "cuda_plugin_ep_profiling_test")
             sess_options.enable_profiling = True

From a826d7e21b1802e2256d97591b90dbf9f2c5f38c Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 22:08:41 -0700
Subject: [PATCH 05/15] fix(cuda): resolve no-cuDNN CI build/link failures and
 address review feedback

- Guard NOMINMAX redefinition in cudnn_loader.cc (Windows -Werror)
- Link cudnn_loader into TensorRT and NV TensorRT RTX providers to
  resolve undefined CudnnLibrary::Get() symbol
- Guard CudnnLibrary reference in cuda_kernel.h for CUDA minimal build
- CudnnLibrary::Configure() now honors enable_cudnn=0 (disable request)
- Install numpy/onnx in Linux no-cuDNN smoke test before running it
- Use latest released ai.onnx opset instead of hard-coded value in test
- Use Python 3.12 in new no-cuDNN CI workflows
- Apply clang-format/ruff formatting
---
 .github/workflows/linux_cuda_no_cudnn.yml     |  4 +--
 .github/workflows/windows_cuda_no_cudnn.yml   |  4 +--
 cmake/onnxruntime_providers_nv.cmake          |  2 ++
 cmake/onnxruntime_providers_tensorrt.cmake    |  2 ++
 onnxruntime/core/providers/cuda/cuda_kernel.h |  5 +++
 .../core/providers/cuda/cudnn_loader.cc       |  6 ++--
 onnxruntime/core/providers/cuda/cudnn_stub.cc |  4 +--
 .../cuda/plugin/cuda_kernel_adapter.h         |  8 ++---
 .../providers/cuda/plugin/cuda_plugin_utils.h | 32 +++++++++----------
 .../cuda/plugin/cuda_stream_plugin.cc         |  4 +--
 .../test/python/onnxruntime_test_python.py    |  4 +--
 .../transformers/test_cuda_plugin_ep.py       |  3 +-
 12 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/.github/workflows/linux_cuda_no_cudnn.yml b/.github/workflows/linux_cuda_no_cudnn.yml
index deee62a643578..db20e35f2d851 100644
--- a/.github/workflows/linux_cuda_no_cudnn.yml
+++ b/.github/workflows/linux_cuda_no_cudnn.yml
@@ -33,7 +33,7 @@ jobs:
       docker_build_args: '--build-arg BASEIMAGE=onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda13_x64_almalinux8_gcc14:20251107.1'
       docker_image_repo: onnxruntimecuda13manylinuxbuild
       extra_build_flags: '--use_binskim_compliant_compile_flags --parallel --nvcc_threads 4 --flash_nvcc_threads 4 --cuda_version=13.0 --cuda_home=/usr/local/cuda-13.0 --cudnn_home=/usr/local/cuda-13.0 --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=86 onnxruntime_BUILD_UNIT_TESTS=ON'
-      python_path_prefix: 'PATH=/opt/python/cp310-cp310/bin:$PATH'
+      python_path_prefix: 'PATH=/opt/python/cp312-cp312/bin:$PATH'
       run_tests: false
       upload_build_output: true
       execution_providers: 'cuda'
@@ -96,7 +96,7 @@ jobs:
           docker run --rm --gpus all \
             -v "${{ runner.temp }}/Release:/onnxruntime_src/build/Release" \
             "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
-            bash -lc 'PATH=/opt/python/cp310-cp310/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python - <<"PY"
+            bash -lc 'PATH=/opt/python/cp312-cp312/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python -m pip install --no-cache-dir numpy onnx && PATH=/opt/python/cp312-cp312/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python - <<"PY"
           import numpy as np
           import onnx
           import onnxruntime as ort
diff --git a/.github/workflows/windows_cuda_no_cudnn.yml b/.github/workflows/windows_cuda_no_cudnn.yml
index 51ff918458a5a..18b538761d304 100644
--- a/.github/workflows/windows_cuda_no_cudnn.yml
+++ b/.github/workflows/windows_cuda_no_cudnn.yml
@@ -35,7 +35,7 @@ jobs:
 
       - uses: actions/setup-python@v6
         with:
-          python-version: '3.14'
+          python-version: '3.12'
           architecture: x64
 
       - name: Locate vcvarsall and Setup Env
@@ -154,7 +154,7 @@ jobs:
 
       - uses: actions/setup-python@v6
         with:
-          python-version: '3.14'
+          python-version: '3.12'
           architecture: x64
 
       - name: Locate vcvarsall and Setup Env
diff --git a/cmake/onnxruntime_providers_nv.cmake b/cmake/onnxruntime_providers_nv.cmake
index 92fc0ccadd667..ffb6fd236cf48 100644
--- a/cmake/onnxruntime_providers_nv.cmake
+++ b/cmake/onnxruntime_providers_nv.cmake
@@ -140,6 +140,8 @@ endif ()
     "${ONNXRUNTIME_ROOT}/core/providers/shared_library/*.cc"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_stream_handle.h"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_stream_handle.cc"
+    "${ONNXRUNTIME_ROOT}/core/providers/cuda/cudnn_loader.h"
+    "${ONNXRUNTIME_ROOT}/core/providers/cuda/cudnn_loader.cc"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_graph.h"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_graph.cc"
   )
diff --git a/cmake/onnxruntime_providers_tensorrt.cmake b/cmake/onnxruntime_providers_tensorrt.cmake
index 4184e0b049afc..073e26728c29e 100644
--- a/cmake/onnxruntime_providers_tensorrt.cmake
+++ b/cmake/onnxruntime_providers_tensorrt.cmake
@@ -161,6 +161,8 @@
     "${ONNXRUNTIME_ROOT}/core/providers/shared_library/*.cc"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_stream_handle.h"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_stream_handle.cc"
+    "${ONNXRUNTIME_ROOT}/core/providers/cuda/cudnn_loader.h"
+    "${ONNXRUNTIME_ROOT}/core/providers/cuda/cudnn_loader.cc"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_graph.h"
     "${ONNXRUNTIME_ROOT}/core/providers/cuda/cuda_graph.cc"
   )
diff --git a/onnxruntime/core/providers/cuda/cuda_kernel.h b/onnxruntime/core/providers/cuda/cuda_kernel.h
index b2d6d69bb6eca..85d629764da48 100644
--- a/onnxruntime/core/providers/cuda/cuda_kernel.h
+++ b/onnxruntime/core/providers/cuda/cuda_kernel.h
@@ -264,9 +264,14 @@ class CudaKernel : public OpKernel {
 
   static inline cudnnHandle_t RequireCudnnHandle(cudnnHandle_t handle) {
     if (handle == nullptr) {
+#ifndef USE_CUDA_MINIMAL
       ORT_THROW_IF_ERROR(ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
                                          "cuDNN is unavailable or disabled for CUDA Execution Provider: ",
                                          cuda::CudnnLibrary::Get().Error()));
+#else
+      ORT_THROW_IF_ERROR(ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
+                                         "cuDNN is unavailable for CUDA Execution Provider in a CUDA minimal build."));
+#endif
     }
     return handle;
   }
diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.cc b/onnxruntime/core/providers/cuda/cudnn_loader.cc
index 2709da40b3705..e8b599bc4d2b6 100644
--- a/onnxruntime/core/providers/cuda/cudnn_loader.cc
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.cc
@@ -8,7 +8,9 @@
 #include <vector>
 
 #ifdef _WIN32
+#ifndef NOMINMAX
 #define NOMINMAX
+#endif
 #include <windows.h>
 #else
 #include <dlfcn.h>
@@ -89,10 +91,6 @@ CudnnLibrary& CudnnLibrary::Get() {
 
 void CudnnLibrary::Configure(bool enabled) {
   std::lock_guard<std::mutex> lock(mutex_);
-  if (!enabled) {
-    return;
-  }
-
   enabled_ = enabled;
 }
 
diff --git a/onnxruntime/core/providers/cuda/cudnn_stub.cc b/onnxruntime/core/providers/cuda/cudnn_stub.cc
index a02a00fb7fba1..97a0c2e05299d 100644
--- a/onnxruntime/core/providers/cuda/cudnn_stub.cc
+++ b/onnxruntime/core/providers/cuda/cudnn_stub.cc
@@ -5,8 +5,8 @@
 
 #ifndef USE_CUDA_MINIMAL
 
-#define ORT_CUDNN_FORWARD_STATUS(name, ...)              \
-  using Fn = decltype(&name);                            \
+#define ORT_CUDNN_FORWARD_STATUS(name, ...)                            \
+  using Fn = decltype(&name);                                          \
   auto fn = onnxruntime::cuda::CudnnLibrary::Get().Resolve<Fn>(#name); \
   return fn != nullptr ? fn(__VA_ARGS__) : CUDNN_STATUS_NOT_INITIALIZED
 
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h b/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
index d8fb2b39d6393..704666ccee463 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
@@ -180,10 +180,10 @@ using ::onnxruntime::HandleNegativeAxis;
   {                                                                                                                                                                 \
     cudnnStatus_t _status = (expr);                                                                                                                                 \
     if (_status != CUDNN_STATUS_SUCCESS) {                                                                                                                          \
-      if (!onnxruntime::cuda::CudnnLibrary::Get().Available()) {                                                                                                     \
+      if (!onnxruntime::cuda::CudnnLibrary::Get().Available()) {                                                                                                    \
         return onnxruntime::common::Status(onnxruntime::common::ONNXRUNTIME, onnxruntime::common::NOT_IMPLEMENTED,                                                  \
-                                           std::string("cuDNN is unavailable for CUDA Plugin Execution Provider: ") +                                                \
-                                               onnxruntime::cuda::CudnnLibrary::Get().Error());                                                                      \
+                                           std::string("cuDNN is unavailable for CUDA Plugin Execution Provider: ") +                                               \
+                                               onnxruntime::cuda::CudnnLibrary::Get().Error());                                                                     \
       }                                                                                                                                                             \
       return onnxruntime::common::Status(onnxruntime::common::ONNXRUNTIME, onnxruntime::common::FAIL, std::string("cuDNN error: ") + cudnnGetErrorString(_status)); \
     }                                                                                                                                                               \
@@ -952,7 +952,7 @@ class CudaKernel : public OpKernel {
     if (handle == nullptr) {
       ORT_THROW_IF_ERROR(onnxruntime::common::Status(
           onnxruntime::common::ONNXRUNTIME, onnxruntime::common::NOT_IMPLEMENTED,
-            std::string("cuDNN is unavailable or disabled for CUDA Plugin Execution Provider: ") +
+          std::string("cuDNN is unavailable or disabled for CUDA Plugin Execution Provider: ") +
               onnxruntime::cuda::CudnnLibrary::Get().Error()));
     }
     if (handle != nullptr && stream != nullptr) {
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h b/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h
index d3ab3a6a18b1e..6f8e913ac9124 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_plugin_utils.h
@@ -120,23 +120,23 @@ inline bool TryGetCurrentCudaDevice(int& device_id) noexcept {
 #endif
 
 #ifndef PL_CUDNN_RETURN_IF_ERROR
-#define PL_CUDNN_RETURN_IF_ERROR(cudnn_call_expr) \
-  do {                                            \
-    cudnnStatus_t _cudnn_err = (cudnn_call_expr); \
-    if (_cudnn_err != CUDNN_STATUS_SUCCESS) {     \
-      if (!onnxruntime::cuda::CudnnLibrary::Get().Available()) { \
-        return Ort::GetApi().CreateStatus(        \
-            ORT_NOT_IMPLEMENTED,                  \
+#define PL_CUDNN_RETURN_IF_ERROR(cudnn_call_expr)                                       \
+  do {                                                                                  \
+    cudnnStatus_t _cudnn_err = (cudnn_call_expr);                                       \
+    if (_cudnn_err != CUDNN_STATUS_SUCCESS) {                                           \
+      if (!onnxruntime::cuda::CudnnLibrary::Get().Available()) {                        \
+        return Ort::GetApi().CreateStatus(                                              \
+            ORT_NOT_IMPLEMENTED,                                                        \
             (std::string("cuDNN is unavailable for CUDA Plugin Execution Provider: ") + \
-             onnxruntime::cuda::CudnnLibrary::Get().Error()) \
-                .c_str());                        \
-      }                                           \
-      return Ort::GetApi().CreateStatus(          \
-          ORT_EP_FAIL,                            \
-          (std::string("cuDNN error: ") +         \
-           cudnnGetErrorString(_cudnn_err))       \
-              .c_str());                          \
-    }                                             \
+             onnxruntime::cuda::CudnnLibrary::Get().Error())                            \
+                .c_str());                                                              \
+      }                                                                                 \
+      return Ort::GetApi().CreateStatus(                                                \
+          ORT_EP_FAIL,                                                                  \
+          (std::string("cuDNN error: ") +                                               \
+           cudnnGetErrorString(_cudnn_err))                                             \
+              .c_str());                                                                \
+    }                                                                                   \
   } while (0)
 #endif
 
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc b/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc
index 6ff83c344553f..80184dfe101f3 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_stream_plugin.cc
@@ -44,8 +44,8 @@ CudaSyncStream::CudaSyncStream(CudaEpFactory& factory, int device_id, bool enabl
                                const OrtEp* /*ep*/)
     : OrtSyncStreamImpl{},
       factory_(factory),
-  device_id_(device_id),
-  enable_cudnn_(enable_cudnn) {
+      device_id_(device_id),
+      enable_cudnn_(enable_cudnn) {
   ort_version_supported = ORT_API_VERSION;
   GetHandle = GetHandleImpl;
   CreateNotification = CreateNotificationImpl;
diff --git a/onnxruntime/test/python/onnxruntime_test_python.py b/onnxruntime/test/python/onnxruntime_test_python.py
index d60acfc8bc34f..a9fee586a790a 100644
--- a/onnxruntime/test/python/onnxruntime_test_python.py
+++ b/onnxruntime/test/python/onnxruntime_test_python.py
@@ -16,11 +16,11 @@
 
 import numpy as np
 from helper import get_name
-
-import onnxruntime as onnxrt
 from onnxruntime.capi import _pybind_state as C
 from onnxruntime.capi.onnxruntime_pybind11_state import Fail, OrtValueVector, RunOptions
 
+import onnxruntime as onnxrt
+
 # handle change from python 3.8 and on where loading a dll from the current directory needs to be explicitly allowed.
 if platform.system() == "Windows" and sys.version_info.major >= 3 and sys.version_info.minor >= 8:  # noqa: YTT204
     os.add_dll_directory(os.getcwd())
diff --git a/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py b/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
index 3ab06edb92011..51ee1a4568e95 100644
--- a/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
+++ b/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
@@ -29,7 +29,8 @@
 EP_GRAPH_ASSIGNMENT_CONFIG_KEY = "session.record_ep_graph_assignment_info"
 NO_CUDNN_PLUGIN_TEST = os.getenv("ORT_TEST_CUDA_PLUGIN_NO_CUDNN", "").upper() in {"1", "ON", "TRUE", "YES"}
 requires_cudnn = unittest.skipIf(NO_CUDNN_PLUGIN_TEST, "test requires cuDNN-backed CUDA plugin kernels")
-DEFAULT_ONNX_OPSET = 26
+# Use the latest released ai.onnx opset so the model builders stay current as ONNX releases new opsets.
+DEFAULT_ONNX_OPSET = max(v for (d, v) in helper.OP_SET_ID_VERSION_MAP if d == "ai.onnx")
 
 
 def _make_released_opset_model(graph, producer_name="onnx-example"):

From 4494a3b308a41e93f72df224b700a36c32ba665c Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 22:28:47 -0700
Subject: [PATCH 06/15] lintrunner

---
 onnxruntime/test/python/onnxruntime_test_python.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/onnxruntime/test/python/onnxruntime_test_python.py b/onnxruntime/test/python/onnxruntime_test_python.py
index a9fee586a790a..d60acfc8bc34f 100644
--- a/onnxruntime/test/python/onnxruntime_test_python.py
+++ b/onnxruntime/test/python/onnxruntime_test_python.py
@@ -16,10 +16,10 @@
 
 import numpy as np
 from helper import get_name
-from onnxruntime.capi import _pybind_state as C
-from onnxruntime.capi.onnxruntime_pybind11_state import Fail, OrtValueVector, RunOptions
 
 import onnxruntime as onnxrt
+from onnxruntime.capi import _pybind_state as C
+from onnxruntime.capi.onnxruntime_pybind11_state import Fail, OrtValueVector, RunOptions
 
 # handle change from python 3.8 and on where loading a dll from the current directory needs to be explicitly allowed.
 if platform.system() == "Windows" and sys.version_info.major >= 3 and sys.version_info.minor >= 8:  # noqa: YTT204

From b725fbc93f02344c95d7e67c4f90ddc514c7878a Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 22:56:32 -0700
Subject: [PATCH 07/15] fix(cuda): make CudnnLibrary a pure loader; gate
 enable_cudnn per-stream

Address review feedback on the process-global CudnnLibrary singleton:

- Remove CudnnLibrary::Configure()/enabled_ so the singleton is purely a
  process-wide cuDNN library loader. Per-session enable_cudnn no longer
  mutates global state, eliminating cross-session interference and the
  one-time-init bug where disabling cuDNN in one session permanently
  prevented later sessions from loading it.
- Drop the now-redundant Configure() calls in CUDAExecutionProvider and
  the plugin CudaEp constructors. cuDNN usage is already gated per-stream
  via 'enable_cudnn && CudnnLibrary::Get().Available()', whose
  short-circuit avoids any dlopen when enable_cudnn=0.
- Default factory-level CreateSyncStreamForDeviceImpl streams to
  enable_cudnn=false so they never trigger an unexpected cuDNN load;
  EP-owned compute streams still honor the EP's enable_cudnn setting.
- Reorder imports in onnxruntime_test_python.py per ruff.
---
 .../core/providers/cuda/cuda_execution_provider.cc    |  4 ----
 onnxruntime/core/providers/cuda/cudnn_loader.cc       | 11 -----------
 onnxruntime/core/providers/cuda/cudnn_loader.h        |  2 --
 onnxruntime/core/providers/cuda/plugin/cuda_ep.cc     |  2 --
 .../core/providers/cuda/plugin/cuda_ep_factory.cc     |  6 +++++-
 onnxruntime/test/python/onnxruntime_test_python.py    |  4 ++--
 6 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
index 65107dca173ae..86620ae3c60af 100755
--- a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
+++ b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc
@@ -338,10 +338,6 @@ CUDAExecutionProvider::CUDAExecutionProvider(const CUDAExecutionProviderInfo& in
   ORT_ENFORCE(info_.prefer_nhwc == 0, "This build does not support NHWC layout");
 #endif
 
-#ifndef USE_CUDA_MINIMAL
-  cuda::CudnnLibrary::Get().Configure(info_.enable_cudnn);
-#endif
-
   CUDA_CALL_THROW(cudaSetDevice(info_.device_id));
 
   // must wait GPU idle, otherwise cudaGetDeviceProperties might fail
diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.cc b/onnxruntime/core/providers/cuda/cudnn_loader.cc
index e8b599bc4d2b6..5886f92618e86 100644
--- a/onnxruntime/core/providers/cuda/cudnn_loader.cc
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.cc
@@ -89,11 +89,6 @@ CudnnLibrary& CudnnLibrary::Get() {
   return library;
 }
 
-void CudnnLibrary::Configure(bool enabled) {
-  std::lock_guard<std::mutex> lock(mutex_);
-  enabled_ = enabled;
-}
-
 bool CudnnLibrary::Available() {
   return EnsureLoaded();
 }
@@ -109,12 +104,6 @@ void* CudnnLibrary::Handle() {
 
 bool CudnnLibrary::EnsureLoaded() {
   std::lock_guard<std::mutex> lock(mutex_);
-  if (!enabled_) {
-    available_ = false;
-    error_ = "cuDNN was disabled by CUDA provider option enable_cudnn=0";
-    return false;
-  }
-
   if (load_attempted_) {
     return available_;
   }
diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.h b/onnxruntime/core/providers/cuda/cudnn_loader.h
index 1328104074820..b6d501e7772ad 100644
--- a/onnxruntime/core/providers/cuda/cudnn_loader.h
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.h
@@ -17,7 +17,6 @@ class CudnnLibrary {
  public:
   static CudnnLibrary& Get();
 
-  void Configure(bool enabled);
   bool Available();
   const char* Error() const;
   void* Handle();
@@ -34,7 +33,6 @@ class CudnnLibrary {
   void* ResolveSymbol(const char* symbol);
 
   mutable std::mutex mutex_;
-  bool enabled_{true};
   bool load_attempted_{false};
   bool available_{false};
   std::string error_;
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc b/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
index d5f3b79783895..1d9fe1312a282 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep.cc
@@ -128,8 +128,6 @@ CudaEp::CudaEp(CudaEpFactory& factory, const Config& config, const OrtLogger& lo
   // ORT uses it to avoid reading OrtEp struct fields that did not exist when the plugin was compiled.
   ort_version_supported = ORT_API_VERSION;
 
-  onnxruntime::cuda::CudnnLibrary::Get().Configure(config_.enable_cudnn);
-
   // The plugin is compiled against the latest ORT headers (ORT_API_VERSION) but may be loaded by an
   // older ORT runtime, down to the floor declared in plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION. Some
   // OrtEp callbacks below — and the OrtEpApi functions their implementations call — only exist in
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc b/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
index ffd89e2b1e8db..c9e739f9d7c6c 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc
@@ -889,7 +889,11 @@ OrtStatus* ORT_API_CALL CudaEpFactory::CreateSyncStreamForDeviceImpl(
 
   auto* factory = static_cast<CudaEpFactory*>(this_ptr);
   int req_device_id = factory->ep_api_.MemoryDevice_GetDeviceId(memory_device);
-  auto cuda_stream = std::make_unique<CudaSyncStream>(*factory, req_device_id, true, nullptr);
+  // Factory-level streams are not tied to a specific EP instance's enable_cudnn policy. Default cuDNN
+  // off here so this path never triggers a cuDNN load or handle creation; kernels that need cuDNN run
+  // on EP-owned streams created with the EP's actual enable_cudnn setting, and otherwise fall back to
+  // the per-thread default cuDNN handle.
+  auto cuda_stream = std::make_unique<CudaSyncStream>(*factory, req_device_id, false, nullptr);
 
   // Initialize CUDA handles (stream, cuBLAS, cuDNN)
   RETURN_IF_ERROR(cuda_stream->InitHandles());
diff --git a/onnxruntime/test/python/onnxruntime_test_python.py b/onnxruntime/test/python/onnxruntime_test_python.py
index d60acfc8bc34f..a9fee586a790a 100644
--- a/onnxruntime/test/python/onnxruntime_test_python.py
+++ b/onnxruntime/test/python/onnxruntime_test_python.py
@@ -16,11 +16,11 @@
 
 import numpy as np
 from helper import get_name
-
-import onnxruntime as onnxrt
 from onnxruntime.capi import _pybind_state as C
 from onnxruntime.capi.onnxruntime_pybind11_state import Fail, OrtValueVector, RunOptions
 
+import onnxruntime as onnxrt
+
 # handle change from python 3.8 and on where loading a dll from the current directory needs to be explicitly allowed.
 if platform.system() == "Windows" and sys.version_info.major >= 3 and sys.version_info.minor >= 8:  # noqa: YTT204
     os.add_dll_directory(os.getcwd())

From 2b17352ab6b3feac34db50f405e716f39836ff4f Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 22:56:53 -0700
Subject: [PATCH 08/15] lintrunner

---
 onnxruntime/test/python/onnxruntime_test_python.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/onnxruntime/test/python/onnxruntime_test_python.py b/onnxruntime/test/python/onnxruntime_test_python.py
index a9fee586a790a..d60acfc8bc34f 100644
--- a/onnxruntime/test/python/onnxruntime_test_python.py
+++ b/onnxruntime/test/python/onnxruntime_test_python.py
@@ -16,10 +16,10 @@
 
 import numpy as np
 from helper import get_name
-from onnxruntime.capi import _pybind_state as C
-from onnxruntime.capi.onnxruntime_pybind11_state import Fail, OrtValueVector, RunOptions
 
 import onnxruntime as onnxrt
+from onnxruntime.capi import _pybind_state as C
+from onnxruntime.capi.onnxruntime_pybind11_state import Fail, OrtValueVector, RunOptions
 
 # handle change from python 3.8 and on where loading a dll from the current directory needs to be explicitly allowed.
 if platform.system() == "Windows" and sys.version_info.major >= 3 and sys.version_info.minor >= 8:  # noqa: YTT204

From 39bc7399a562c308d8ecd4c63a42a1e8123eb4d9 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Wed, 24 Jun 2026 23:30:40 -0700
Subject: [PATCH 09/15] Address review: lazy cuDNN fallback handle and safe
 Windows DLL search

- cuda_kernel_adapter.h: split fallback cuDNN handle creation out of
  GetDefaultCudaHandlesForDevice() into a lazy GetDefaultCudnnHandleForDevice()
  so cuBLAS-only paths (and enable_cudnn=0 sessions) never trigger a cuDNN load.
- cudnn_loader.cc: load cuDNN on Windows via LoadLibraryExA with
  LOAD_LIBRARY_SEARCH_DEFAULT_DIRS to exclude the process CWD from the DLL
  search order.
---
 .../core/providers/cuda/cudnn_loader.cc       |  7 ++-
 .../cuda/plugin/cuda_kernel_adapter.h         | 44 +++++++++++++------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/onnxruntime/core/providers/cuda/cudnn_loader.cc b/onnxruntime/core/providers/cuda/cudnn_loader.cc
index 5886f92618e86..aa21781346973 100644
--- a/onnxruntime/core/providers/cuda/cudnn_loader.cc
+++ b/onnxruntime/core/providers/cuda/cudnn_loader.cc
@@ -36,7 +36,12 @@ std::vector<std::string> GetCandidateLibraryNames() {
 
 void* LoadLibraryCandidate(const std::string& candidate, std::string& error) {
 #ifdef _WIN32
-  HMODULE handle = LoadLibraryA(candidate.c_str());
+  // Use LOAD_LIBRARY_SEARCH_DEFAULT_DIRS so cuDNN is resolved only from the
+  // application directory, %WINDIR%\System32, and directories added via
+  // AddDllDirectory/SetDefaultDllDirectories. This deliberately excludes the
+  // current working directory from the search order to avoid loading an
+  // attacker-controlled DLL from the process CWD.
+  HMODULE handle = LoadLibraryExA(candidate.c_str(), nullptr, LOAD_LIBRARY_SEARCH_DEFAULT_DIRS);
   if (handle == nullptr) {
     error = "LoadLibrary failed for " + candidate + " with error " + std::to_string(GetLastError());
   }
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h b/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
index 704666ccee463..4dffbf7866c9f 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h
@@ -510,6 +510,10 @@ inline DefaultCudaHandles& GetDefaultCudaHandlesForDevice(int device_id) {
   // Fallback handles are only used for code paths that need cuBLAS/cuDNN
   // without an active CudaSyncStream. Keep them thread-local so they are not
   // shared across callers that may use the libraries concurrently.
+  //
+  // Only cuBLAS/cuBLASLt are created here. The cuDNN fallback handle is created
+  // lazily by GetDefaultCudnnHandleForDevice() so that cuBLAS-only paths (and
+  // sessions with enable_cudnn=0) never trigger a cuDNN load.
   thread_local std::unordered_map<int, DefaultCudaHandles> handles_by_device;
   auto [it, inserted] = handles_by_device.try_emplace(device_id);
   if (inserted) {
@@ -523,20 +527,7 @@ inline DefaultCudaHandles& GetDefaultCudaHandlesForDevice(int device_id) {
       handles_by_device.erase(it);
       ORT_THROW("Failed to create default cuBLAS handle for CUDA plugin device ", device_id);
     }
-    if (onnxruntime::cuda::CudnnLibrary::Get().Available()) {
-      if (cudnnCreate(&it->second.cudnn) != CUDNN_STATUS_SUCCESS) {
-        cublasDestroy(it->second.cublas);
-        it->second.cublas = nullptr;
-        if (get_device_result == cudaSuccess) {
-          cudaSetDevice(prev_device);
-        }
-        handles_by_device.erase(it);
-        ORT_THROW("Failed to create default cuDNN handle for CUDA plugin device ", device_id);
-      }
-    }
     if (cublasLtCreate(&it->second.cublas_lt) != CUBLAS_STATUS_SUCCESS) {
-      cudnnDestroy(it->second.cudnn);
-      it->second.cudnn = nullptr;
       cublasDestroy(it->second.cublas);
       it->second.cublas = nullptr;
       if (get_device_result == cudaSuccess) {
@@ -553,6 +544,31 @@ inline DefaultCudaHandles& GetDefaultCudaHandlesForDevice(int device_id) {
   return it->second;
 }
 
+// Lazily creates the thread-local fallback cuDNN handle for the device. Callers
+// must check enable_cudnn and CudnnLibrary::Available() before invoking this so
+// that cuBLAS-only paths never trigger a cuDNN load.
+inline cudnnHandle_t GetDefaultCudnnHandleForDevice(int device_id) {
+  DefaultCudaHandles& handles = GetDefaultCudaHandlesForDevice(device_id);
+  if (handles.cudnn != nullptr) {
+    return handles.cudnn;
+  }
+
+  int prev_device = -1;
+  const cudaError_t get_device_result = cudaGetDevice(&prev_device);
+  PL_CUDA_CALL_THROW(cudaSetDevice(device_id));
+  cudnnHandle_t cudnn = nullptr;
+  const cudnnStatus_t status = cudnnCreate(&cudnn);
+  if (get_device_result == cudaSuccess) {
+    cudaSetDevice(prev_device);
+  }
+  if (status != CUDNN_STATUS_SUCCESS) {
+    ORT_THROW("Failed to create default cuDNN handle for CUDA plugin device ", device_id);
+  }
+
+  handles.cudnn = cudnn;
+  return cudnn;
+}
+
 inline const cudaDeviceProp& GetDevicePropForDevice(int device_id) {
   static std::mutex mutex;
   static std::unordered_map<int, std::unique_ptr<cudaDeviceProp>> props;
@@ -904,7 +920,7 @@ class CudaKernel : public OpKernel {
     if (!runtime_config_->enable_cudnn || !onnxruntime::cuda::CudnnLibrary::Get().Available()) {
       return nullptr;
     }
-    return detail::GetDefaultCudaHandlesForDevice(device_id_).cudnn;
+    return detail::GetDefaultCudnnHandleForDevice(device_id_);
   }
   inline cublasLtHandle_t DefaultCublasLtHandle() const { return detail::GetDefaultCudaHandlesForDevice(device_id_).cublas_lt; }
 

From 02fb23d01eaa9a4ba57174112cbfb350c9602591 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Thu, 25 Jun 2026 17:16:16 +0000
Subject: [PATCH 10/15] fix CI

---
 .github/workflows/linux_cuda_no_cudnn.yml   |  4 +--
 .github/workflows/windows_cuda_no_cudnn.yml | 31 ++++++++++++++++++---
 cmake/onnxruntime_providers_cuda.cmake      | 24 ++++++++++++++++
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/linux_cuda_no_cudnn.yml b/.github/workflows/linux_cuda_no_cudnn.yml
index db20e35f2d851..23a7a754ce74e 100644
--- a/.github/workflows/linux_cuda_no_cudnn.yml
+++ b/.github/workflows/linux_cuda_no_cudnn.yml
@@ -87,14 +87,14 @@ jobs:
       - name: Verify CUDA provider has no direct cuDNN dependency
         run: |
           docker run --rm --gpus all \
-            -v "${{ runner.temp }}/Release:/onnxruntime_src/build/Release" \
+            -v "${{ runner.temp }}:/onnxruntime_src/build" \
             "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
             bash -lc 'ldd /onnxruntime_src/build/Release/libonnxruntime_providers_cuda.so | tee /tmp/ldd.txt && ! grep -i cudnn /tmp/ldd.txt'
 
       - name: Run no-cuDNN CUDA EP smoke test
         run: |
           docker run --rm --gpus all \
-            -v "${{ runner.temp }}/Release:/onnxruntime_src/build/Release" \
+            -v "${{ runner.temp }}:/onnxruntime_src/build" \
             "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
             bash -lc 'PATH=/opt/python/cp312-cp312/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python -m pip install --no-cache-dir numpy onnx && PATH=/opt/python/cp312-cp312/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python - <<"PY"
           import numpy as np
diff --git a/.github/workflows/windows_cuda_no_cudnn.yml b/.github/workflows/windows_cuda_no_cudnn.yml
index 18b538761d304..8e84e1f2bf944 100644
--- a/.github/workflows/windows_cuda_no_cudnn.yml
+++ b/.github/workflows/windows_cuda_no_cudnn.yml
@@ -55,11 +55,21 @@ jobs:
           dir
         shell: pwsh
 
-      - name: Download cuDNN headers for CUDA 13.0
+      - name: Prepare cuDNN SDK without runtime DLLs
         working-directory: ${{ runner.temp }}
         run: |
-          azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/cudnn_sdk/$env:CUDNN_FOLDER" .
-          dir
+          $cudnnSdkUri = "https://lotusscus.blob.core.windows.net/models/cudnn_sdk/$env:CUDNN_FOLDER"
+          azcopy.exe cp --recursive $cudnnSdkUri .
+          $cudnnRoot = Join-Path $env:RUNNER_TEMP $env:CUDNN_FOLDER
+          if (-not (Test-Path $cudnnRoot)) {
+            Write-Error "cuDNN SDK was not downloaded to the expected folder: $cudnnRoot"
+            exit 1
+          }
+          Get-ChildItem -Path $cudnnRoot -Recurse -Include "cudnn*.dll" | Remove-Item -Force
+          if (Get-ChildItem -Path $cudnnRoot -Recurse -Include "cudnn*.dll" -ErrorAction SilentlyContinue) {
+            Write-Error "cuDNN runtime DLLs must not be present in the no-cuDNN build environment"
+            exit 1
+          }
         shell: pwsh
 
       - name: Add CUDA to PATH
@@ -69,6 +79,13 @@ jobs:
           Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\bin"
           Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\extras\CUPTI\lib64"
 
+      - name: Install CUDA Visual Studio integration
+        shell: pwsh
+        run: |
+          $sourceDir = "$env:RUNNER_TEMP\v13.0\extras\visual_studio_integration\MSBuildExtensions"
+          $targetDir = "${env:ProgramFiles}\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations"
+          Copy-Item -Path "$sourceDir\CUDA 13.0.*" -Destination $targetDir -Force
+
       - name: Set OnnxRuntimeBuildDirectory
         shell: pwsh
         run: |
@@ -117,6 +134,13 @@ jobs:
           Remove-Item -Path "$outputDir\CMakeFiles" -Recurse -Force -ErrorAction SilentlyContinue
           Remove-Item -Path $outputDir -Include "*.obj" -Recurse
 
+          $cudnnArtifacts = Get-ChildItem -Path $outputDir -Recurse -Include "cudnn*.dll" -ErrorAction SilentlyContinue
+          if ($cudnnArtifacts) {
+            $cudnnArtifacts | ForEach-Object { Write-Host $_.FullName }
+            Write-Error "cuDNN runtime DLLs must not be present in no-cuDNN build artifacts"
+            exit 1
+          }
+
       - name: Upload build artifacts
         uses: actions/upload-artifact@v6
         with:
@@ -235,4 +259,3 @@ jobs:
       ONNXRUNTIME_TEST_GPU_DEVICE_ID: '0'
       AZCOPY_AUTO_LOGIN_TYPE: MSI
       AZCOPY_MSI_CLIENT_ID: 63b63039-6328-442f-954b-5a64d124e5b4
-      CUDNN_FOLDER: 9.14.0.64_cuda13
diff --git a/cmake/onnxruntime_providers_cuda.cmake b/cmake/onnxruntime_providers_cuda.cmake
index fdc3fe8786425..882b1063805e9 100644
--- a/cmake/onnxruntime_providers_cuda.cmake
+++ b/cmake/onnxruntime_providers_cuda.cmake
@@ -604,6 +604,30 @@
       )
   set_target_properties(onnxruntime_providers_cuda PROPERTIES
     PUBLIC_HEADER "${ONNXRUNTIME_CUDA_PROVIDER_PUBLIC_HEADERS}")
+  if(WIN32 AND NOT onnxruntime_CUDA_MINIMAL)
+    set(ORT_CUDNN_DLL_PATH "")
+    if(onnxruntime_CUDNN_HOME)
+      set(ORT_CUDNN_DLL_SEARCH_PATHS
+        "${onnxruntime_CUDNN_HOME}/bin/cudnn64_*.dll"
+        "${onnxruntime_CUDNN_HOME}/bin/x64/cudnn64_*.dll"
+        "${onnxruntime_CUDNN_HOME}/bin/${onnxruntime_CUDA_VERSION}/cudnn64_*.dll"
+        "${onnxruntime_CUDNN_HOME}/bin/${onnxruntime_CUDA_VERSION}/x64/cudnn64_*.dll"
+      )
+    else()
+      set(ORT_CUDNN_DLL_SEARCH_PATHS "${onnxruntime_CUDA_HOME}/bin/cudnn64_*.dll")
+    endif()
+    foreach(search_path ${ORT_CUDNN_DLL_SEARCH_PATHS})
+      file(GLOB ORT_CUDNN_DLL_PATH "${search_path}")
+      if(ORT_CUDNN_DLL_PATH)
+        break()
+      endif()
+    endforeach()
+    if(ORT_CUDNN_DLL_PATH)
+      add_custom_command(TARGET onnxruntime_providers_cuda POST_BUILD
+        COMMAND ${CMAKE_COMMAND} -E copy_if_different ${ORT_CUDNN_DLL_PATH} $<TARGET_FILE_DIR:onnxruntime_providers_cuda>
+      )
+    endif()
+  endif()
   install(TARGETS onnxruntime_providers_cuda
           PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/onnxruntime/core/providers/cuda
           ARCHIVE  DESTINATION ${CMAKE_INSTALL_LIBDIR}

From ebf322e0ddee0cdf4b6bdc1c57da8927098825b6 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Thu, 25 Jun 2026 15:03:14 -0700
Subject: [PATCH 11/15] install wheel in CI

---
 .github/workflows/linux_cuda_no_cudnn.yml | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/linux_cuda_no_cudnn.yml b/.github/workflows/linux_cuda_no_cudnn.yml
index 23a7a754ce74e..c1a1012afc361 100644
--- a/.github/workflows/linux_cuda_no_cudnn.yml
+++ b/.github/workflows/linux_cuda_no_cudnn.yml
@@ -96,7 +96,16 @@ jobs:
           docker run --rm --gpus all \
             -v "${{ runner.temp }}:/onnxruntime_src/build" \
             "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
-            bash -lc 'PATH=/opt/python/cp312-cp312/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python -m pip install --no-cache-dir numpy onnx && PATH=/opt/python/cp312-cp312/bin:$PATH PYTHONPATH=/onnxruntime_src/build/Release python - <<"PY"
+            bash -lc 'set -e
+          PATH=/opt/python/cp312-cp312/bin:$PATH
+          WHEEL_PATH=$(find /onnxruntime_src/build/Release -type f -name "onnxruntime_gpu-*.whl" | head -n 1)
+          if [ -z "$WHEEL_PATH" ]; then
+            echo "No built onnxruntime GPU wheel found under /onnxruntime_src/build/Release" >&2
+            exit 1
+          fi
+          echo "Installing $WHEEL_PATH"
+          python -m pip install --no-cache-dir --force-reinstall --no-deps numpy onnx "$WHEEL_PATH"
+          python - <<"PY"
           import numpy as np
           import onnx
           import onnxruntime as ort

From 0853a0542c381a7b02595f9ed67759abffa84603 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Thu, 25 Jun 2026 16:03:00 -0700
Subject: [PATCH 12/15] ci: fix Windows no-cuDNN CUDA 13 setup

---
 .github/workflows/windows_cuda_no_cudnn.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/windows_cuda_no_cudnn.yml b/.github/workflows/windows_cuda_no_cudnn.yml
index 8e84e1f2bf944..2b41645beb74c 100644
--- a/.github/workflows/windows_cuda_no_cudnn.yml
+++ b/.github/workflows/windows_cuda_no_cudnn.yml
@@ -77,6 +77,7 @@ jobs:
         run: |
           Write-Host "Adding CUDA to PATH without adding any cuDNN directory"
           Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\bin"
+          Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\bin\x64"
           Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\extras\CUPTI\lib64"
 
       - name: Install CUDA Visual Studio integration
@@ -84,7 +85,7 @@ jobs:
         run: |
           $sourceDir = "$env:RUNNER_TEMP\v13.0\extras\visual_studio_integration\MSBuildExtensions"
           $targetDir = "${env:ProgramFiles}\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations"
-          Copy-Item -Path "$sourceDir\CUDA 13.0.*" -Destination $targetDir -Force
+          Copy-Item -Path "$sourceDir\*" -Destination $targetDir -Force
 
       - name: Set OnnxRuntimeBuildDirectory
         shell: pwsh
@@ -208,6 +209,7 @@ jobs:
         run: |
           Write-Host "Adding CUDA to PATH without adding any cuDNN directory"
           Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\bin"
+          Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\bin\x64"
           Add-Content -Path $env:GITHUB_PATH -Value "$env:RUNNER_TEMP\v13.0\extras\CUPTI\lib64"
 
       - name: Set OnnxRuntimeBuildDirectory

From abacfeea4cce21921286d2ff98d9bed81fce5ad0 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Fri, 26 Jun 2026 11:07:41 -0700
Subject: [PATCH 13/15] update

---
 .github/workflows/windows_cuda_no_cudnn.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/windows_cuda_no_cudnn.yml b/.github/workflows/windows_cuda_no_cudnn.yml
index 2b41645beb74c..a8b98f2dbf1e8 100644
--- a/.github/workflows/windows_cuda_no_cudnn.yml
+++ b/.github/workflows/windows_cuda_no_cudnn.yml
@@ -153,7 +153,7 @@ jobs:
       ALLOW_RELEASED_ONNX_OPSET_ONLY: '0'
       ONNXRUNTIME_TEST_GPU_DEVICE_ID: '0'
       AZCOPY_AUTO_LOGIN_TYPE: MSI
-      AZCOPY_MSI_CLIENT_ID: 63b63039-6328-442f-954b-5a64d124e5b4
+      AZCOPY_MSI_CLIENT_ID: d712a4c7-a0cf-4e87-af75-31510eba0a8e
       CUDNN_FOLDER: 9.14.0.64_cuda13
 
   test:
@@ -260,4 +260,4 @@ jobs:
       ALLOW_RELEASED_ONNX_OPSET_ONLY: '0'
       ONNXRUNTIME_TEST_GPU_DEVICE_ID: '0'
       AZCOPY_AUTO_LOGIN_TYPE: MSI
-      AZCOPY_MSI_CLIENT_ID: 63b63039-6328-442f-954b-5a64d124e5b4
+      AZCOPY_MSI_CLIENT_ID: d712a4c7-a0cf-4e87-af75-31510eba0a8e

From e9ce61e38c026217e39314e70660529e2d7faf77 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Fri, 26 Jun 2026 11:15:38 -0700
Subject: [PATCH 14/15] update ci

---
 .github/workflows/linux_cuda_no_cudnn.yml | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/linux_cuda_no_cudnn.yml b/.github/workflows/linux_cuda_no_cudnn.yml
index c1a1012afc361..abe92589aa705 100644
--- a/.github/workflows/linux_cuda_no_cudnn.yml
+++ b/.github/workflows/linux_cuda_no_cudnn.yml
@@ -32,7 +32,7 @@ jobs:
       dockerfile_path: tools/ci_build/github/linux/docker/Dockerfile.manylinux2_28_cuda
       docker_build_args: '--build-arg BASEIMAGE=onnxruntimebuildcache.azurecr.io/internal/azureml/onnxruntime/build/cuda13_x64_almalinux8_gcc14:20251107.1'
       docker_image_repo: onnxruntimecuda13manylinuxbuild
-      extra_build_flags: '--use_binskim_compliant_compile_flags --parallel --nvcc_threads 4 --flash_nvcc_threads 4 --cuda_version=13.0 --cuda_home=/usr/local/cuda-13.0 --cudnn_home=/usr/local/cuda-13.0 --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=86 onnxruntime_BUILD_UNIT_TESTS=ON'
+      extra_build_flags: '--use_binskim_compliant_compile_flags --build_wheel --parallel --nvcc_threads 4 --flash_nvcc_threads 4 --cuda_version=13.0 --cuda_home=/usr/local/cuda-13.0 --cudnn_home=/usr/local/cuda-13.0 --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=86 onnxruntime_BUILD_UNIT_TESTS=ON'
       python_path_prefix: 'PATH=/opt/python/cp312-cp312/bin:$PATH'
       run_tests: false
       upload_build_output: true
@@ -87,20 +87,24 @@ jobs:
       - name: Verify CUDA provider has no direct cuDNN dependency
         run: |
           docker run --rm --gpus all \
-            -v "${{ runner.temp }}:/onnxruntime_src/build" \
+            -v "${{ runner.temp }}/Release:/build/Release" \
             "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
-            bash -lc 'ldd /onnxruntime_src/build/Release/libonnxruntime_providers_cuda.so | tee /tmp/ldd.txt && ! grep -i cudnn /tmp/ldd.txt'
+            bash -lc 'set -euo pipefail
+          ldd /build/Release/Release/libonnxruntime_providers_cuda.so | tee /tmp/ldd.txt
+          ! grep -i cudnn /tmp/ldd.txt'
 
       - name: Run no-cuDNN CUDA EP smoke test
         run: |
           docker run --rm --gpus all \
-            -v "${{ runner.temp }}:/onnxruntime_src/build" \
+            -v "${{ runner.temp }}/Release:/build/Release" \
             "${{ steps.build_docker_image_step.outputs.full-image-name }}" \
             bash -lc 'set -e
           PATH=/opt/python/cp312-cp312/bin:$PATH
-          WHEEL_PATH=$(find /onnxruntime_src/build/Release -type f -name "onnxruntime_gpu-*.whl" | head -n 1)
+          LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH:-}
+          export PATH LD_LIBRARY_PATH
+          WHEEL_PATH=$(find /build/Release/Release/dist -type f -name "onnxruntime_gpu-*.whl" | head -n 1)
           if [ -z "$WHEEL_PATH" ]; then
-            echo "No built onnxruntime GPU wheel found under /onnxruntime_src/build/Release" >&2
+            echo "No built onnxruntime GPU wheel found under /build/Release/Release/dist" >&2
             exit 1
           fi
           echo "Installing $WHEEL_PATH"

From 09b09c062fc3dba8fa1865dd65f16bb0b3feec15 Mon Sep 17 00:00:00 2001
From: Tianlei Wu <tlwu@microsoft.com>
Date: Fri, 26 Jun 2026 15:07:55 -0700
Subject: [PATCH 15/15] fix(cuda): repair Windows CI for runtime-optional cuDNN

Windows CUDA/TensorRT test jobs failed because enabling
NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING makes the cudnn_frontend shim dlopen the
CUDA runtime/driver using Linux names only (libcudart.so.*, libcuda.so.1). On
Windows dlopen maps to LoadLibrary, so every Conv routed to the cuDNN frontend
threw 'Unable to load any libcudart.so.* library'. Patch the shim to use the
Windows DLL names (cudart64_12/13.dll, nvcuda.dll) and clear the thread error
state before each load so the GetLastError-based success check is not tripped
by a stale error.

Also fix the Windows no-cuDNN plugin build: passing both --cuda_version and
--cuda_home made build.py emit '-T cuda=13.0' (version form), which MSBuild
cannot resolve to a CudaToolkitDir for the downloaded, unregistered CUDA SDK.
Drop --cuda_version so the toolset uses the cuda_home path form, matching the
working windows_cuda_plugin.yml.
---
 .github/workflows/windows_cuda_no_cudnn.yml   |  1 -
 cmake/external/cudnn_frontend.cmake           |  5 +++
 .../cudnn_frontend_win_dynamic_loading.patch  | 40 +++++++++++++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)
 create mode 100644 cmake/patches/cudnn_frontend/cudnn_frontend_win_dynamic_loading.patch

diff --git a/.github/workflows/windows_cuda_no_cudnn.yml b/.github/workflows/windows_cuda_no_cudnn.yml
index a8b98f2dbf1e8..fe3269416bb73 100644
--- a/.github/workflows/windows_cuda_no_cudnn.yml
+++ b/.github/workflows/windows_cuda_no_cudnn.yml
@@ -109,7 +109,6 @@ jobs:
             --build_shared_lib `
             --build_wheel `
             --use_cuda `
-            --cuda_version=13.0 `
             --cuda_home="$env:RUNNER_TEMP\v13.0" `
             --cudnn_home="$env:RUNNER_TEMP\$env:CUDNN_FOLDER" `
             --skip_tests `
diff --git a/cmake/external/cudnn_frontend.cmake b/cmake/external/cudnn_frontend.cmake
index 9af915a272b07..e8a84f99354f0 100644
--- a/cmake/external/cudnn_frontend.cmake
+++ b/cmake/external/cudnn_frontend.cmake
@@ -1,8 +1,13 @@
 
+# The cudnn_frontend shim only loads the CUDA runtime / driver libraries with
+# their Linux names (libcudart.so.*, libcuda.so.1) when NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING
+# is enabled. On Windows this always fails (dlopen is mapped to LoadLibrary), so patch the
+# shim to use the Windows DLL names (cudart64_*.dll, nvcuda.dll).
 onnxruntime_fetchcontent_declare(
   cudnn_frontend
   URL ${DEP_URL_cudnn_frontend}
   URL_HASH SHA1=${DEP_SHA1_cudnn_frontend}
+  PATCH_COMMAND ${Patch_EXECUTABLE} --binary --ignore-whitespace -p1 < ${PROJECT_SOURCE_DIR}/patches/cudnn_frontend/cudnn_frontend_win_dynamic_loading.patch
   EXCLUDE_FROM_ALL
 )
 
diff --git a/cmake/patches/cudnn_frontend/cudnn_frontend_win_dynamic_loading.patch b/cmake/patches/cudnn_frontend/cudnn_frontend_win_dynamic_loading.patch
new file mode 100644
index 0000000000000..754335e9ab9cc
--- /dev/null
+++ b/cmake/patches/cudnn_frontend/cudnn_frontend_win_dynamic_loading.patch
@@ -0,0 +1,40 @@
+--- a/include/cudnn_frontend_shim.h	2026-06-26 15:06:33.854043748 -0700
++++ b/include/cudnn_frontend_shim.h	2026-06-26 15:06:33.966044710 -0700
+@@ -69,7 +69,14 @@
+     dlerror();
+ 
+     // Attempt to open the cuda library
++#ifdef _WIN32
++    // Reset the thread error state so the GetLastError()-based success check
++    // below is not tripped by a stale error from an earlier call.
++    SetLastError(0);
++    HMODULE handle    = dlopen("nvcuda.dll", RTLD_NOW);
++#else
+     HMODULE handle    = dlopen("libcuda.so.1", RTLD_NOW);
++#endif
+     const char *error = reinterpret_cast<const char *>(dlerror());
+     if (!handle || error) {
+         // If opening the library fails, throw an exception with the error message
+@@ -85,13 +92,22 @@
+     dlerror();
+ 
+     // List of potential libcudart libraries (Adding major version to support python package)
++#ifdef _WIN32
++    constexpr const char *libs[] = {"cudart64_12.dll", "cudart64_13.dll"};
++#else
+     constexpr const char *libs[] = {"libcudart.so.12", "libcudart.so.13"};
++#endif
+     constexpr size_t num_libs    = sizeof(libs) / sizeof(libs[0]);
+ 
+     HMODULE lib_handle = nullptr;
+     int loaded_index   = -1;
+ 
+     for (size_t i = 0; i < num_libs; ++i) {
++#ifdef _WIN32
++        // Reset the thread error state so the GetLastError()-based success check
++        // below is not tripped by a stale error from a previous failed attempt.
++        SetLastError(0);
++#endif
+         HMODULE handle    = dlopen(libs[i], RTLD_NOW);
+         const char *error = reinterpret_cast<const char *>(dlerror());
+