Convert onnx_perf_test to standalone WinML mode #1

Open
Sumit2318 wants to merge 7 commits into user/chrisd/rel-1.25.1-qol-ifdef from winml-standalone

Conversation

@Sumit2318 Sumit2318 commented May 6, 2026

Collaborator

Summary

Converts the winappsdk_onnxruntime_perf_test target to use standalone WinML via the flat-C WinMLEpCatalog API from the
Microsoft.Windows.AI.MachineLearning NuGet package. This removes the dependency on WindowsAppSDK bootstrap/WinRT activation and allows the perf test EXE to run standalone on any Windows machine with the WinML and ONNX Runtime DLLs present alongside the executable.

Key design decisions

  1. ORT API version fallback: The WinML NuGet package (v2.0.297-preview) ships ORT 1.24.4 (API v24),
    while the repo headers define v25. At runtime, we try the compile-time version first, then fall back to the highest version the runtime supports.

  2. Glob exclusion: winml_standalone.cc/h are excluded from the regular onnxruntime_perf_test target to prevent compilation errors (those files depend on WinML NuGet APIs).

Files

  • New: onnxruntime/test/perftest/windows/winml_standalone.cc/h — Flat-C WinMLEpCatalog implementation
  • Deleted: onnxruntime/test/perftest/windows/winappsdk_bootstrap.cc/h — Old WinRT bootstrap
  • Renamed: cmake/winappsdk_onnxruntime_perf_test.cmake → cmake/winml_standalone_perf_test.cmake
  • Modified: cmake/CMakeLists.txt, cmake/onnxruntime_unittests.cmake, main.cc, command_args_parser.cc, test_configuration.h,
    ort_test_session.cc, chrisd/b.cmd, chrisd/p-x64.cmd, chrisd/p-arm.cmd

Build & usage

Configure
cmake -B build -Donnxruntime_BUILD_WINML_STANDALONE_PERF_TEST=ON -Donnxruntime_BUILD_SHARED_LIB=ON -Donnxruntime_USE_DML=OFF -Donnxruntime_USE_WINML=OFF ... -A x64 -G "Visual Studio 17 2022"

Build
cmake --build build --config RelWithDebInfo --target winml_standalone_perf_test

Run
winml_standalone_perf_test.exe --required_device_type npu -e vitisai -m duration -t 30 -I "C:\Users\Sumit\standalone-perf-test\abc.quant.onnx"

Testing

  • ✅ Built and tested on x64 and arm64 devices
  • ✅ EP discovery and registration works (tested with VitisAI EP)
  • ✅ Regular onnxruntime_perf_test target unaffected

Vitis Test logs

[WinML Standalone] Loaded: C:\Users\Wssiidc\Downloads\standalone-perf-test\onnxruntime.dll
The requested API version [25] is not available, only API versions [1, 24] are supported in this build. Current ORT Version is: 1.24.4
[WinML Standalone] Using ORT API version 24 (compile-time: 25)
[WinML Standalone] Discovering and registering EPs...
[WinML Standalone] Found provider: MIGraphXExecutionProvider
  Skipping MIGraphXExecutionProvider (NotPresent)
[WinML Standalone] Found provider: VitisAIExecutionProvider
  VitisAIExecutionProvider is Ready
  Registering from: C:\Program Files\WindowsApps\MicrosoftCorporationII.WinML.AMD.NPU.EP.1.8_1.8.61.0_x64__8wekyb3d8bbwe\ExecutionProvider\onnxruntime_vitisai_ep.dll
  Registered successfully
ONNX Runtime C++ API version: 25
-------------------------------------------
[WinML Standalone] provider_Type_Name:VitisAIExecutionProvider
[WinML Standalone] has_Required_Device_Type:1
[WinML Standalone] required_Device_Type:2
[WinML Standalone] model_file_path:"C:\Users\Sumit\\standalone-perf-test\abc.quant.onnx"
-------------------------------------------
[WinML Standalone] EP Device [Index: 2, Name: VitisAIExecutionProvider] added to session.
[WinML Standalone] provider_names: VitisAIExecutionProviderVitisAIExecutionProvider|
Session creation time cost: 2.00946 s
First inference time cost: 27 ms
Total inference time cost: 30.0271 s
Total inference requests: 1073
Average inference time cost total: 27.984287 ms
Total inference run time: 30.0342 s
Number of inferences per second: 35.7259
Avg CPU usage: 0 %
Peak working set size: 364568576 bytes
Avg CPU usage:0
Peak working set size:364568576
Runs:1073
Min Latency: 0.026123 s
Max Latency: 0.0294023 s
P50 Latency: 0.0282335 s
P90 Latency: 0.0288627 s
P95 Latency: 0.0289866 s
P99 Latency: 0.029173 s
P999 Latency: 0.0293428 s
retval: 0
Shutting down Protobuf library...

QNN Test logs

[WinML Standalone] Loaded: C:\Users\ROM-MS-IDCLAB-27\Downloads\standalone-perf-test\onnxruntime.dll
The requested API version [25] is not available, only API versions [1, 24] are supported in this build. Current ORT Version is: 1.24.4
[WinML Standalone] Using ORT API version 24 (compile-time: 25)
[WinML Standalone] Discovering and registering EPs...
[WinML Standalone] Found provider: QNNExecutionProvider
  QNNExecutionProvider is Ready
  Registering from: C:\Program Files\WindowsApps\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.2_2.2450.47.0_arm64__8wekyb3d8bbwe\ExecutionProvider\onnxruntime_providers_qnn.dll
  Registered successfully
ONNX Runtime C++ API version: 25
-------------------------------------------
[WinML Standalone] provider_Type_Name:QNNExecutionProvider
[WinML Standalone] has_Required_Device_Type:1
[WinML Standalone] required_Device_Type:2
[WinML Standalone] model_file_path:C:\Users\Sumit\Downloads\standalone-perf-test\abc.quant.onnx
-------------------------------------------
[WinML Standalone] EP Device [Index: 2, Name: QNNExecutionProvider] added to session.
[WinML Standalone] provider_names: QNNExecutionProviderQNNExecutionProvider|
Starting stage: Graph Preparation Initializing
Completed stage: Graph Preparation Initializing (2443 us)
Starting stage: Graph Optimizations
Completed stage: Graph Optimizations (2223404 us)
Starting stage: Post Graph Optimization
Completed stage: Post Graph Optimization (121442 us)
Starting stage: Graph Sequencing for Target
Completed stage: Graph Sequencing for Target (1542431 us)
Starting stage: VTCM Allocation
Completed stage: VTCM Allocation (369298 us)
Starting stage: Parallelization Optimization
Completed stage: Parallelization Optimization (86381 us)
Starting stage: Finalizing Graph Sequence

====== DDR bandwidth summary ======
spill_bytes=0
fill_bytes=0
write_total_bytes=8192
read_total_bytes=113033216

Completed stage: Finalizing Graph Sequence (104543 us)
Starting stage: Completion
Completed stage: Completion (4064 us)
2026-05-06 11:37:37.4691729 [W:onnxruntime:, session_state.cc:1327 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-05-06 11:37:37.4771654 [W:onnxruntime:, session_state.cc:1329 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Session creation time cost: 10.3265 s
First inference time cost: 56 ms
Total inference time cost: 30.0173 s
Total inference requests: 1025
Average inference time cost total: 29.285208 ms
Total inference run time: 30.0208 s
Number of inferences per second: 34.143
Avg CPU usage: 0 %
Peak working set size: 1089642496 bytes
Avg CPU usage:0
Peak working set size:1089642496
Runs:1025
Min Latency: 0.0285205 s
Max Latency: 0.0418038 s
P50 Latency: 0.028985 s
P90 Latency: 0.0299876 s
P95 Latency: 0.0304149 s
P99 Latency: 0.0338743 s
P999 Latency: 0.0415718 s
retval: 0
Shutting down Protobuf library...
~fin~

sumikuma and others added 5 commits May 6, 2026 23:51
Replace WindowsAppSDK framework dependencies with the flat-C
WinMLEpCatalog API from Microsoft.Windows.AI.MachineLearning NuGet
package.

Changes:
- Add winml_standalone.cc/h using WinMLEpCatalog API for EP discovery
- Remove winappsdk_bootstrap.cc/h (WinAppSDK/CppWinRT bootstrap)
- Replace 6 WinAppSDK NuGet packages with single
  Microsoft.Windows.AI.MachineLearning 2.0.297-preview
- Remove CppWinRT, WIL, Foundation, onecoreuap.lib dependencies
- Use BUILD_WINML_STANDALONE_PERF_TEST compile definition
- Add post-build copy of onnxruntime.dll from NuGet package
- Remove --winappsdk_version flag (no longer needed)
- Move WinML init after Ort::Env creation with scope-guard cleanup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The regular onnxruntime_perf_test target globs all files in
perftest/windows/ but should not compile winml_standalone.cc/h
since those depend on the WinML NuGet package which is only
linked by the standalone target.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename cmake file: winappsdk_onnxruntime_perf_test.cmake -> winml_standalone_perf_test.cmake
- Rename cmake option: onnxruntime_BUILD_WINAPPSDK_PERF_TEST -> onnxruntime_BUILD_WINML_STANDALONE_PERF_TEST
- Rename target/EXE: winappsdk_onnxruntime_perf_test -> winml_standalone_perf_test
- Rename flag: --winappsdk_register_provider -> --winml_register_provider
- Rename struct member: winappsdk_register_provider -> winml_register_provider
- Add null check for OrtGetApiBase() return before dereferencing
- Update all log messages from [WinAppSDK] to [WinML Standalone]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- p-x64.cmd: BUILD_WINAPPSDK_PERF_TEST -> BUILD_WINML_STANDALONE_PERF_TEST, removed CPPWINRT_VERSION
- p-arm.cmd: same option rename, removed CPPWINRT_VERSION
- b.cmd: target winappsdk_onnxruntime_perf_test -> winml_standalone_perf_test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The winappsdk_bootstrap.cc/h files were deleted, so the glob
filters excluding them are no-ops. Remove them to reduce clutter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Sumit2318 Sumit2318 marked this pull request as ready for review May 6, 2026 21:20

@chrisdMSFT chrisdMSFT left a comment

Owner

PR #1 Review — Convert onnx_perf_test to standalone WinML mode

Repo: chrisdMSFT/onnxruntime
PR: #1
Base: user/chrisd/rel-1.25.1-qol-ifdef
Head: winml-standalone
Stats: 15 files, +401 / −516

Overall the change is well-scoped: it cleanly swaps the WinAppSDK/WinRT
bootstrap for the flat-C WinMLEpCatalog API, keeps the regular
onnxruntime_perf_test target unaffected (via glob exclusion), and
preserves ORT_API_MANUAL_INIT consistency. The renames
(winappsdk_* → winml_standalone_*, flag, struct member, target,
cmake file) are applied uniformly.

The issues below are ordered by severity. None are necessarily blockers
for a personal/dev branch, but several should be addressed before this
ships to a wider audience.


High severity

H1. ORT API-version fallback risks UB if any v25-only API is called

File: onnxruntime/test/perftest/main.cc (around the new
BUILD_WINML_STANDALONE_PERF_TEST block)

g_ort = api_base->GetApi(ORT_API_VERSION);          // tries v25
if (g_ort == nullptr) {
  for (uint32_t v = ORT_API_VERSION - 1; v >= 1; --v) {
    g_ort = api_base->GetApi(v);
    if (g_ort != nullptr) { ... break; }
  }
}

GetApi(v) from a v24 runtime returns an OrtApi* whose memory layout
is the v24 struct. The compile-time headers (ORT_API_VERSION == 25,
see include/onnxruntime/core/session/onnxruntime_c_api.h:41) describe
the v25 struct, which has new function pointers appended after the
v24 layout. Any call into a v25-only member (g_ort->SomeNewApi(...))
will dereference past the actual struct allocated by the older DLL —
undefined behavior, typically a crash or wild jump.

The PR works today only because the perf-test code paths happen to call
v1–v24 functions. A future contributor adding a v25 API call will
silently introduce a crash on the v24-shipping NuGet, with no compile-
time signal.
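
The layout hazard can be illustrated with two stand-in structs. This is a minimal sketch, not the real OrtApi: ApiV24, ApiV25, their members, and MemberIsInBounds are hypothetical names, chosen only to show that a member appended in a newer version sits at an offset beyond the end of the older struct.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for two versions of a C function table.
// The real OrtApi is much larger; only the append-only shape matters here.
struct ApiV24 {
  int (*CreateSession)(int);
  int (*Run)(int);
};

struct ApiV25 {
  int (*CreateSession)(int);
  int (*Run)(int);
  int (*SomeNewApi)(int);  // appended in the newer version
};

// The shared prefix is laid out identically, so old calls keep working.
static_assert(offsetof(ApiV25, Run) == offsetof(ApiV24, Run),
              "prefix layout must match");

// The appended member starts at or beyond the end of the old struct, so
// reading it through a pointer to a v24 table would go out of bounds.
static_assert(offsetof(ApiV25, SomeNewApi) >= sizeof(ApiV24),
              "new member lies past the old struct");

// Returns true when a function-pointer member at `member_offset` fits inside
// a table of `table_size` bytes.
constexpr bool MemberIsInBounds(std::size_t member_offset, std::size_t table_size) {
  return member_offset + sizeof(void (*)()) <= table_size;
}
```

Nothing in the type system distinguishes the two tables once they are viewed through the newer header, which is why the review asks for a compile-time or startup-time check.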

Recommended mitigations (any one):

  • Refuse to fall back: hard-error out if GetApi(ORT_API_VERSION)
    returns nullptr, and instead bump the bundled NuGet (or pin
    ORT_API_VERSION to v24 for this target via a private header).
  • Build the standalone target against a v24-only header copy, so the
    compiler enforces the contract.
  • At minimum, log loudly (e.g. red banner) and record the resolved
    version so a g_ort_runtime_version can be checked before any
    v25-specific call site.
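
The first mitigation ("refuse to fall back") might look roughly like the sketch below. FakeApi, FakeApiBase, and ResolveApiStrict are hypothetical names used for illustration; the only behavior assumed from the real runtime is the one visible in the PR snippet and logs, namely that GetApi returns nullptr for an unsupported version.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a runtime that supports API versions 1..max_version.
struct FakeApi {
  std::uint32_t version;
};

class FakeApiBase {
 public:
  explicit FakeApiBase(std::uint32_t max_version) : max_version_(max_version) {}

  // Mirrors the assumed contract: nullptr for any unsupported version.
  const FakeApi* GetApi(std::uint32_t version) const {
    if (version < 1 || version > max_version_) return nullptr;
    api_.version = version;
    return &api_;
  }

 private:
  std::uint32_t max_version_;
  mutable FakeApi api_{0};
};

// Resolves exactly the compile-time version or throws. There is deliberately
// no loop over older versions.
inline const FakeApi& ResolveApiStrict(const FakeApiBase& base,
                                       std::uint32_t compiled_version) {
  const FakeApi* api = base.GetApi(compiled_version);
  if (api == nullptr) {
    throw std::runtime_error("runtime does not support API version " +
                             std::to_string(compiled_version));
  }
  return *api;
}
```

With a runtime capped at v24, a request for v24 succeeds and a request for v25 throws at startup, which turns a latent out-of-bounds call into an immediate, explainable failure.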

H2. UTF-8 → UTF-16 path conversion is incorrect

File: onnxruntime/test/perftest/windows/winml_standalone.cc:158

std::string libPath(pathSize, '\0');
...
HRESULT pathHr = WinMLEpGetLibraryPath(ep, pathSize, libPath.data(), &used);
...
std::wstring wpath(libPath.begin(), libPath.end());   // <-- broken for non-ASCII
ctx->env->RegisterExecutionProviderLibrary(providerName.c_str(), wpath);

The two-iterator std::wstring constructor converts each char to a
wchar_t individually, with no UTF-8 decoding. This only round-trips
ASCII. If the EP package family or install path contains any non-ASCII
byte (e.g. a localized %LOCALAPPDATA% profile name, accented user
folder, CJK), every multi-byte UTF-8 sequence is split into multiple
bogus UTF-16 code units and LoadLibraryW fails (or, worse, loads the
wrong file).

Use MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...):

int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            libPath.data(), (int)libPath.size(), nullptr, 0);
std::wstring wpath(n, L'\0');
MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                    libPath.data(), (int)libPath.size(),
                    wpath.data(), n);

(or query the path directly as wide if the WinML API offers a wide
variant — worth checking WinMLEpCatalog.h.)
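
The corruption is easy to reproduce without any Windows API. The sketch below is portable and purely illustrative: NaiveWiden and CountCodePoints are hypothetical helpers, not code from the PR. It shows that byte-wise widening produces one wide unit per input byte, so a two-byte UTF-8 sequence can never come out as a single character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Widens byte by byte, as the iterator-pair std::wstring constructor does.
// Every input byte becomes its own wchar_t, regardless of UTF-8 structure.
inline std::wstring NaiveWiden(const std::string& utf8) {
  return std::wstring(utf8.begin(), utf8.end());
}

// Counts Unicode code points in a UTF-8 string by counting the bytes that are
// not continuation bytes (continuation bytes have the bit pattern 10xxxxxx).
inline std::size_t CountCodePoints(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

For a pure-ASCII path the two counts agree, which is why the bug stays hidden on typical install locations. For "caf\xC3\xA9" (four code points, five bytes) the naive result has five wide units instead of four.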

H3. WinMLEpGetLibraryPath length handling is contract-fragile

File: onnxruntime/test/perftest/windows/winml_standalone.cc:137-153

size_t pathSize = 0;
HRESULT pathSizeHr = WinMLEpGetLibraryPathSize(ep, &pathSize);
...
std::string libPath(pathSize, '\0');
size_t used = 0;
HRESULT pathHr = WinMLEpGetLibraryPath(ep, pathSize, libPath.data(), &used);
...
libPath.resize(used > 0 ? used - 1 : 0);  // trim null terminator

Two undocumented assumptions:

  1. used always includes the null terminator. If a future runtime
    returns the strlen instead, used - 1 chops the last character of
    the path.
  2. pathSize already includes the null. If it doesn't, the buffer is
    one byte too small.

Both are pure conjecture without the WinMLEpCatalog.h contract in
hand. Please:

  • Confirm the contract from the header/docs and add a comment quoting
    it.
  • Add a defensive if (used > pathSize) { error } and prefer
    strnlen(libPath.data(), pathSize) over the used - 1 trim.
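
A defensive version of the trim, independent of whether the reported size counts the terminator, could be sketched as follows. TrimToTerminator is a hypothetical helper name, and std::find stands in for strnlen only to keep the sketch in standard C++; the real WinML contract is still an assumption until confirmed from the header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Shrinks `buffer` to the string it actually holds.
// `capacity` is the size passed to the API, `used` is what the API reported.
// Neither value is assumed to include or exclude the NUL terminator.
inline std::string TrimToTerminator(std::string buffer, std::size_t capacity,
                                    std::size_t used) {
  assert(buffer.size() >= capacity);
  if (used > capacity) {
    throw std::length_error("API reported more bytes than the buffer holds");
  }
  // Same result as strnlen(buffer.data(), capacity): stop at the first NUL,
  // or at `capacity` if there is none.
  const auto begin = buffer.begin();
  const auto end = begin + static_cast<std::ptrdiff_t>(capacity);
  const auto length = static_cast<std::size_t>(std::find(begin, end, '\0') - begin);
  buffer.resize(length);
  return buffer;
}
```

Because the length comes from the buffer contents rather than from `used - 1`, the result is the same whether the API reports the size with or without the terminator.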

Medium severity

M1. WinMLEpCatalogEnumProviders return value is ignored

File: winml_standalone.cc:87

WinMLEpCatalogEnumProviders(catalog, [](...) -> BOOL { ... }, &ctx);

If enumeration itself fails (HRESULT or BOOL return), the function
silently completes with zero registered providers — and then session
creation later fails with a misleading "no EP available" error. Capture
and surface the result.

M2. Per-provider failures are silently ignored

The lambda returns TRUE for every error path (NotPresent,
EnsureReady failed, GetLibraryPathSize failed, RegisterExecutionProvider
threw). That is reasonable for "skip this EP and try the next", but the
aggregate outcome is never reported. If the user typed
--winml_register_provider=VitisAIExecutionProvider and that one EP
failed to register, the perf test still runs (against CPU only) and the
real failure is buried in stderr.

Suggest: track registered_count, and if provider_filter is non-empty
and any requested name was not successfully registered, throw / exit
non-zero from WinML_FindAndRegisterAllProviders.
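
The aggregate check suggested here can be sketched in a few lines. FindMissingProviders and RequireRequestedProviders are hypothetical names, and the sketch assumes the filter is available as a list of provider names, which may not match how the PR stores it.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the requested names that are absent from `registered`, in request order.
inline std::vector<std::string> FindMissingProviders(
    const std::vector<std::string>& requested,
    const std::vector<std::string>& registered) {
  std::vector<std::string> missing;
  for (const std::string& name : requested) {
    if (std::find(registered.begin(), registered.end(), name) == registered.end()) {
      missing.push_back(name);
    }
  }
  return missing;
}

// Throws when a non-empty filter was not fully satisfied. An empty filter means
// "register whatever is available", so it never fails here.
inline void RequireRequestedProviders(const std::vector<std::string>& requested,
                                      const std::vector<std::string>& registered) {
  const std::vector<std::string> missing = FindMissingProviders(requested, registered);
  if (missing.empty()) return;
  std::string message = "requested providers failed to register:";
  for (const std::string& name : missing) message += " " + name;
  throw std::runtime_error(message);
}
```

Calling the check once after enumeration keeps the per-provider "skip and continue" behavior while still failing the run when an explicitly requested EP is absent.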

M3. File-static globals make re-init unsafe / leak the catalog

File: winml_standalone.cc:21-22

static WinMLEpCatalogHandle g_ep_catalog = nullptr;
static std::vector<std::string> g_registered_providers;

Calling WinML_InitializeAndRegisterAllProviders twice without a
WinML_Uninitialize in between leaks the previous catalog handle and
appends to g_registered_providers. Either:

  • Document "must be called exactly once" and assert(!g_ep_catalog) at
    entry, or
  • Stash the handle/list inside a small struct returned to the caller (or
    hung off Ort::Env via user-data) so re-entry is well-defined.
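
One shape the second option could take is a move-only owner around the handle. The sketch below uses hypothetical stand-ins (FakeHandle, FakeCreate, FakeRelease, CatalogOwner) rather than the WinML catalog functions, whose exact signatures are not shown in this review.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for a C-style create/release pair. The counter lets
// the example observe how many handles are currently open.
using FakeHandle = int*;
inline int& OpenHandleCount() {
  static int count = 0;
  return count;
}
inline FakeHandle FakeCreate() {
  ++OpenHandleCount();
  return new int(0);
}
inline void FakeRelease(FakeHandle handle) {
  --OpenHandleCount();
  delete handle;
}

// Move-only owner: each instance releases its own handle exactly once, so
// constructing a second instance cannot leak or clobber the first one.
class CatalogOwner {
 public:
  CatalogOwner() : handle_(FakeCreate()) {}
  ~CatalogOwner() { Reset(); }
  CatalogOwner(const CatalogOwner&) = delete;
  CatalogOwner& operator=(const CatalogOwner&) = delete;
  CatalogOwner(CatalogOwner&& other) noexcept
      : handle_(std::exchange(other.handle_, nullptr)) {}
  CatalogOwner& operator=(CatalogOwner&& other) noexcept {
    if (this != &other) {
      Reset();
      handle_ = std::exchange(other.handle_, nullptr);
    }
    return *this;
  }
  bool valid() const { return handle_ != nullptr; }

 private:
  void Reset() noexcept {
    if (handle_ != nullptr) FakeRelease(std::exchange(handle_, nullptr));
  }
  FakeHandle handle_;
};
```

With ownership tied to an object, "called twice" stops being a leak and simply means two independent owners.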

M4. Loop counter uint32_t v wraps around if ORT_API_VERSION is 0

File: main.cc

for (uint32_t v = ORT_API_VERSION - 1; v >= 1; --v) { ... }

If ORT_API_VERSION is ever defined as 0 (currently it's 25, but
hypothetically), v initializes to 0xFFFFFFFF and the loop runs ~4B
times. Trivial fix:

for (uint32_t v = ORT_API_VERSION; v > 1; ) {
  --v;
  ...
}

…or use a signed counter. Not exploitable today; flagged for
robustness.
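
To make the boundary behavior concrete, the sketch below wraps the suggested loop shape in a hypothetical helper, FallbackCandidates, that records which versions would be tried. It is an illustration of the loop only, not code from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lists the versions a descending fallback search would try, newest first,
// excluding `compiled_version` itself. The `v > 1` test runs before the
// decrement, so v never wraps below 1, even when compiled_version is 0.
inline std::vector<std::uint32_t> FallbackCandidates(std::uint32_t compiled_version) {
  std::vector<std::uint32_t> candidates;
  for (std::uint32_t v = compiled_version; v > 1;) {
    --v;
    candidates.push_back(v);
  }
  return candidates;
}
```

For a compiled version of 4 the helper yields 3, 2, 1; for 0 or 1 it yields nothing, where the original loop would have started from 0xFFFFFFFF in the zero case.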

M5. OrtGetApiBase redefinition is a fragile linker contract

File: winml_standalone.cc:24-52

The standalone target re-implements extern "C" const OrtApiBase* __cdecl OrtGetApiBase() to dynamically LoadLibraryExW("onnxruntime.dll")
from the EXE directory. This relies on no other static lib in the link
line (onnx_test_runner_common, onnxruntime_test_utils,
onnxruntime_common, onnxruntime_flatbuffers,
onnx_test_data_proto) ever pulling in a definition of
OrtGetApiBase. Today it links; if anyone later adds a static lib that
brings the symbol along, you'll get a duplicate-symbol error with no
clear hint pointing here. Add a prominent comment and ideally an
/INCLUDE: link assertion (or a unit-test build step) to catch this.

Also: the magic-static lambda is fine for thread safety, and the
success path already logs the resolved DLL path via
std::wcout << L"[WinML Standalone] Loaded: " << ortPath. Consider also
logging the resolved OrtApiBase* and the GetVersionString() to make
support diagnosis trivial.


Low severity / nits

L1. Missing using namespace std::filesystem or fs:: alias

Stylistic only — std::filesystem::path is used twice and reads fine.
No change needed.

L2. ready_state_to_string is unused

File: winml_standalone.cc:54-63

The helper function is defined but never called. It's also not declared
static / in an unnamed namespace, so it pollutes the global symbol
table.

L3. Mixed I/O streams (std::cout / std::wcout / printf)

The new code mixes narrow std::cout, wide std::wcout, and the
existing fprintf(stdout, ...) calls. Without a std::ios::sync_with_stdio
guarantee, output ordering can be surprising under buffering. Pick one
flavor for the new file (probably narrow std::cout everywhere except
where wchar paths are needed) and stick to it. Existing precedent in
the codebase is fprintf, but this is a minor preference.

L4. EnumContext ctx shadows the lambda's local auto* ctx

File: winml_standalone.cc:85-88

EnumContext ctx{ &provider_filter, &env, &g_registered_providers };
WinMLEpCatalogEnumProviders(catalog, [](..., void* context) -> BOOL {
    auto* ctx = static_cast<EnumContext*>(context);   // shadows outer ctx
    ...
}, &ctx);

Lambdas don't capture, so the inner ctx doesn't actually hide
anything reachable, but the name collision is mildly confusing. Rename
the inner local to c or ec.

L5. WinML_InitializeAndRegisterAllProviders is a one-line forwarder

It only forwards to WinML_FindAndRegisterAllProviders. Either inline the
Find... body or drop the wrapper; having two near-identical names is
clutter.

L6. gsl::finally placement comment

File: main.cc

auto winml_cleanup_at_scope_exit = gsl::finally([&]() {
  WinML_Uninitialize(env);
});

Good pattern, and the "must happen before env destruction" comment is
correct. Worth adding "and before g_ort is reset to nullptr (if you
ever add that)" because UnregisterExecutionProviderLibrary will dive
through the C API.
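
The ordering guarantee rests on locals being destroyed in reverse declaration order. The sketch below demonstrates that with a small guard written in the spirit of gsl::finally. ScopeExit, FakeEnv, and RunAndRecordTeardown are hypothetical names, and the guard is a simplified stand-in rather than the GSL implementation.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal scope guard: runs the callable when the guard is destroyed.
class ScopeExit {
 public:
  explicit ScopeExit(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeExit() { fn_(); }
  ScopeExit(const ScopeExit&) = delete;
  ScopeExit& operator=(const ScopeExit&) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical environment object that records its own destruction.
struct FakeEnv {
  explicit FakeEnv(std::vector<std::string>& log) : log_(log) {}
  ~FakeEnv() { log_.push_back("env destroyed"); }
  std::vector<std::string>& log_;
};

// Declares the environment first and the guard second. Locals are destroyed in
// reverse declaration order, so the cleanup runs while the env is still alive.
inline std::vector<std::string> RunAndRecordTeardown() {
  std::vector<std::string> log;
  {
    FakeEnv env(log);
    ScopeExit cleanup([&log] { log.push_back("providers unregistered"); });
    log.push_back("work done");
  }
  return log;
}
```

Swapping the two declarations would reverse the last two log entries, which is the failure mode the comment in main.cc is meant to guard against.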

L7. winappsdk_version removal — unused-flag deprecation note

The flag is gone entirely (good, since the WinAppSDK runtime is gone).
If any internal CI / dev scripts still pass --winappsdk_version=1.8,
absl will refuse with an unknown-flag error rather than ignoring it.
The PR does update chrisd/p-x64.cmd and chrisd/p-arm.cmd, but a
quick grep for winappsdk_version in any other internal pipeline is
worth doing before merge.

L8. onnxruntime.dll resolution is EXE-dir only

File: winml_standalone.cc:28-39

The shim hard-binds to <exe_dir>\onnxruntime.dll and ignores
PATH/SetDllDirectory. That matches the post-build copy in
cmake/winml_standalone_perf_test.cmake:110-112, so it's correct for
the in-tree build, but it means a developer can't drop a sideloaded
onnxruntime.dll into PATH and override. If that's intentional (which
seems likely, to lock the version against the WinML NuGet's runtime),
add a one-line comment saying so.

L9. BUILD_WINML_STANDALONE_PERF_TEST #error guard

File: winml_standalone.cc:17-19

#ifndef BUILD_WINML_STANDALONE_PERF_TEST
#error "This file should only be compiled when BUILD_WINML_STANDALONE_PERF_TEST is ON"
#endif

Great defensive guard — paired with the glob filter in
cmake/onnxruntime_unittests.cmake it ensures the regular target
fails loudly rather than silently linking in this file. ✅

L10. CMake: ${WINML_BINARY_DIR} is undocumented

File: cmake/winml_standalone_perf_test.cmake:111-115

"${WINML_BINARY_DIR}/onnxruntime.dll"
"${WINML_BINARY_DIR}/DirectML.dll"

WINML_BINARY_DIR is presumably exported by the
microsoft.windows.ai.machinelearning CMake config. If that variable
isn't set (e.g. NuGet package version drift), the copy_if_different
silently skips and you get a runtime "DLL not found" instead of a
config-time error. Add:

if(NOT WINML_BINARY_DIR)
  message(FATAL_ERROR "WINML_BINARY_DIR not set by Microsoft.Windows.AI.MachineLearning package")
endif()

L11. CMake: FetchContent from a personal-fork repo

File: cmake/winml_standalone_perf_test.cmake:27-31

GIT_REPOSITORY https://github.com/mschofie/NuGetCMakePackage
GIT_TAG dc9e92672c6eb1c11f0d29d4f94731b3404cc096

Pinning to a SHA is good. Pulling from mschofie/* (an individual
account) is a long-term liability — any account rename / repo deletion
breaks this build forever. Consider mirroring into a Microsoft-owned
repo before this lands on a shared branch.

L12. CMake glob for winml_standalone_perf_test_src

The glob picks up everything in perftest/ and perftest/windows/,
including any new file added by other PRs. That's the same pattern used
by onnxruntime_perf_test, so consistency is fine, but it means the
"don't compile winml_standalone.cc into the regular target" filter in
onnxruntime_unittests.cmake is the only line keeping the two
targets disjoint. If someone adds e.g. winml_standalone_helpers.cc,
they must remember to extend the regex. Consider switching to an
EXCLUDE REGEX ".*/winml_standalone.*\\.(cc|h)$" (prefix match) so
new files in this family are auto-excluded.
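
The effect of the prefix-match pattern can be checked outside CMake. The sketch below feeds the same pattern to std::regex. IsExcludedFromRegularTarget is a hypothetical helper, and since CMake uses its own regex engine rather than the ECMAScript grammar, this is an approximation of the filter, not a test of the build itself.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when `path` would be dropped by the prefix-match exclusion pattern.
// The pattern uses only constructs (., *, grouping, alternation, $) that are
// expected to behave the same way in CMake's engine.
inline bool IsExcludedFromRegularTarget(const std::string& path) {
  static const std::regex pattern(R"(.*/winml_standalone.*\.(cc|h)$)");
  return std::regex_search(path, pattern);
}
```

Under this pattern a future winml_standalone_helpers.cc is excluded automatically, while unrelated files such as main.cc are kept.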


Things that look right

  • Renaming is consistent across cmake, source, batch scripts, and
    defines — no stragglers found in this branch.
  • ORT_API_MANUAL_INIT is propagated to onnx_test_runner_common
    and onnxruntime_test_utils consistently with the regular target,
    avoiding the #pragma detect_mismatch link error.
  • gsl::finally ensures WinML_Uninitialize runs before Ort::Env
    destruction, even on an exception path through real_main.
  • WinML_Uninitialize swallows exceptions in the unregister loop —
    appropriate for best-effort cleanup.
  • Removal of winrt::hresult_error catch in main.cc is correct
    given CppWinRT is no longer linked. C++ std::exception catch
    remains.
  • Glob exclusion of winml_standalone.{cc,h} from
    onnxruntime_perf_test is correct and prevents an accidental WinML
    NuGet dependency on the regular target.
  • Manifest file (onnxruntime/test/perftest/windows/app.manifest)
    exists and is referenced by the cmake target.

Suggested follow-ups (not blocking)

  1. Add a smoke-test build job (even just a CI matrix entry) for
    -Donnxruntime_BUILD_WINML_STANDALONE_PERF_TEST=ON. Without one,
    silent breakage is very easy.
  2. Move the static g_ep_catalog / g_registered_providers state into
    a class so WinML_* becomes RAII and the order-of-destruction
    guarantee in main.cc is enforced by the type system.
  3. Document the EXE deployment story (which DLLs go next to the EXE,
    from where) in a short README under onnxruntime/test/perftest/.

Generated by Copilot CLI code review on 2026-05-06.

chrisdMSFT and others added 2 commits May 6, 2026 19:32
High severity:
- H1 (main.cc): Remove ORT API version fallback. Hard-fail when the
  bundled onnxruntime.dll is older than the compile-time
  ORT_API_VERSION; falling back to an older v_N-shaped struct risks UB
  if any v25-only API is called. Newer is fine; older is not.
- H2 (winml_standalone.cc): Replace the iterator-pair std::wstring
  ctor with MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...).
  Verified against WinMLEpCatalogApi.cpp that library paths come back
  as UTF-8.
- H3 (winml_standalone.cc): Defensively bound `used` against `pathSize`
  and use strnlen() to determine the string length, instead of trusting
  `used - 1`. Documents the verified contract from the WinML source
  (size includes the NUL).

Medium severity:
- M1: Capture the HRESULT returned by WinMLEpCatalogEnumProviders and
  throw on failure so silent enumeration errors are surfaced.
- M2: Track which filter-requested providers were registered. If any
  requested name failed to register, throw with the missing names so
  the perf test does not silently fall back to CPU-only.
- M3: assert(!g_ep_catalog) on entry to
  WinML_InitializeAndRegisterAllProviders + comment that it must be
  called exactly once before WinML_Uninitialize.
- M4: Removed naturally with H1 (the underflowing fallback loop is
  gone).
- M5: Add a prominent link-order warning above the OrtGetApiBase
  redefinition.

Low severity / nits:
- L2: Remove unused ready_state_to_string helper.
- L4: Rename the inner lambda local from `ctx` to `ec` so it does not
  visually shadow the outer EnumContext ctx.
- L5: Drop the one-line WinML_FindAndRegisterAllProviders forwarder;
  keep only WinML_InitializeAndRegisterAllProviders.
- L6: Extend the gsl::finally cleanup comment in main.cc to note the
  ordering with respect to g_ort.
- L8: Add a comment explaining the EXE-dir-only DLL resolution is
  intentional (locks the runtime version against the WinML NuGet).
- L10 (cmake): Fail at configure time with FATAL_ERROR if
  WINML_BINARY_DIR was not set by the
  microsoft.windows.ai.machinelearning package, rather than silently
  emitting a no-op post-build copy_if_different.
- L12 (cmake): Switch the WinML standalone glob exclusion to a
  prefix match so any future winml_standalone_*.cc/h is automatically
  excluded from the regular onnxruntime_perf_test target.

L7 follow-up (stale name references):
- Rename winappsdk_onnxruntime_perf_test.md ->
  winml_standalone_perf_test.md and rewrite to drop WindowsAppSDK /
  CppWinRT details, document the new --winml_register_provider flag,
  and describe the ORT API version contract.
- Update chrisd/copy-perf-test.cmd, go.cmd, go-all.cmd,
  go-nvidia-tests.cmd, go-openvino.cmd, go-qnn.cmd, and
  simple-intel-test-ape.cmd to reference winml_standalone_perf_test
  and drop --winappsdk_version flags.

Out of scope this round: L3 (mixed I/O streams), L11 (NuGetCMakePackage
mirroring), CI matrix entry, RAII refactor of the catalog handle.

Verified: cmake configure + RelWithDebInfo build of both
winml_standalone_perf_test and onnxruntime_perf_test succeed. The
regular target's compile list confirms winml_standalone.cc is excluded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement the three "Suggested follow-ups (not blocking)" from the PR
#1 review.

#1 (CI smoke build) -- .github/workflows/windows_winml_standalone.yml
    Configure + build the winml_standalone_perf_test target (and the
    regular onnxruntime_perf_test target as a sanity check) on
    windows-latest, RelWithDebInfo, x64. Triggered on push / pull_request
    on main, rel-*, winml-standalone*, rel-*-winml-standalone, and any
    user/** branch, scoped to the perftest source tree, the cmake files
    that drive the target, and the workflow file itself. No test-run
    step (would require NPU/GPU EP devices not present on hosted
    runners).

    Uses a GitHub-hosted runner instead of the 1ES self-hosted pools
    that the rest of the windows_*.yml workflows use, because those
    pools are gated to Microsoft's CI infrastructure and would not run
    on a personal fork. ilammy/msvc-dev-cmd is used for vcvars setup
    since the in-repo locate-vcvarsall-and-setup-env composite action
    targets the self-hosted images.

#2 (RAII for catalog state) -- supersedes the round-1 M3 minimal-assert
    decision. Replace the file-static g_ep_catalog and
    g_registered_providers globals plus the WinML_* free functions
    with a move-only WinMLStandaloneRegistration class
    in winml_standalone.{h,cc}. Construction opens the catalog,
    enumerates providers, registers each one (subject to the
    --winml_register_provider filter), and throws on
    WinMLEpCatalogCreate failure, WinMLEpCatalogEnumProviders failure,
    or any requested provider failing to register. Destruction
    unregisters in reverse order and releases the catalog.

    Partial-construction safety: the catalog handle is taken into the
    member as soon as WinMLEpCatalogCreate succeeds, then the rest of
    the constructor runs inside a try/catch that calls Cleanup() before
    rethrowing. This keeps the destructor and the constructor's failure
    path sharing a single noexcept Cleanup() implementation.

    Header keeps the catalog handle as a void* so winml_standalone.h
    does not pull <WinMLEpCatalog.h> (and therefore the WinML NuGet
    headers) into main.cc.

    main.cc now uses straight RAII -- the WinMLStandaloneRegistration
    object is declared right after the Ort::Env, replacing both the
    WinML_InitializeAndRegisterAllProviders call and the
    gsl::finally([&]{ WinML_Uninitialize(env); }) block. The
    <gsl/util> include is hoisted out of the BUILD_WINML_STANDALONE_PERF_TEST
    guard because main.cc still uses gsl::finally for plugin EP unregister.
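
The partial-construction pattern described above can be sketched without the WinML API. Registration and its parameters are hypothetical: provider names stand in for registered EP libraries, `failing` simulates one provider that cannot be registered, and the move operations of the real class are omitted for brevity.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Registration object whose constructor can fail midway.
// `events` records register/unregister calls so teardown order is observable.
class Registration {
 public:
  Registration(std::vector<std::string> providers, const std::string& failing,
               std::vector<std::string>& events)
      : events_(events) {
    try {
      for (std::string& name : providers) {
        if (name == failing) throw std::runtime_error("failed to register " + name);
        events_.push_back("register " + name);
        registered_.push_back(std::move(name));
      }
    } catch (...) {
      Cleanup();  // undo what was registered so far, then report the failure
      throw;
    }
  }
  ~Registration() { Cleanup(); }
  Registration(const Registration&) = delete;
  Registration& operator=(const Registration&) = delete;

 private:
  // Shared by the destructor and the constructor's failure path.
  // Unregisters in reverse order of registration.
  void Cleanup() noexcept {
    while (!registered_.empty()) {
      events_.push_back("unregister " + registered_.back());
      registered_.pop_back();
    }
  }

  std::vector<std::string>& events_;
  std::vector<std::string> registered_;
};
```

The explicit catch is needed because a destructor does not run for an object whose constructor throws, so routing both paths through one Cleanup() keeps the teardown logic in a single place.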

#3 (deployment README) -- onnxruntime/test/perftest/WINML_STANDALONE.md
    Short, focused doc that answers exactly the reviewer's question:
    which DLLs land next to the EXE and where does each one come from.
    Includes a table mapping each file to its source NuGet/build
    artifact and the cmake mechanism that copies it, an explanation of
    why onnxruntime.dll resolution is intentionally EXE-dir-only, the
    minimum redeployment payload, and the ORT API version contract.
    Cross-links to the comprehensive winml_standalone_perf_test.md at
    the repo root for build/run details.

Verified: configure was already done previously; cmake --build build
--config RelWithDebInfo --target winml_standalone_perf_test and
--target onnxruntime_perf_test both succeed. The standalone target's
compile list contains winml_standalone.cc; the regular target's does
not (the prefix-match glob exclusion in cmake/onnxruntime_unittests.cmake
still keeps them disjoint).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chrisdMSFT added a commit that referenced this pull request May 7, 2026
Port the 16 review-fix items from #1 commit
c28d084 to this branch. The pre-fix files are byte-identical between
this branch (rel-1.24.6-winml-standalone, head 17dae90) and PR #1's
pre-fix head (e5ee69c), so the post-fix files are taken straight
from c28d084 via `git checkout c28d084 -- <path>`. Only the L12
prefix-match regex change in cmake/onnxruntime_unittests.cmake is
applied as a surgical hunk because that file diverges by ~370 lines
between the two branch bases.

Maps to review items from the PR #1 review:

H1 (main.cc): Remove the "fall back to older ORT API version" loop and
    hard-fail when the runtime DLL does not support ORT_API_VERSION.
    Falling back would return a struct laid out for an older version,
    so any call to a newer API entry would read past the end of the
    struct the runtime actually provides. The runtime must support
    the compile-time version or newer.
H2 (winml_standalone.cc): Replace the iterator-pair std::wstring
    constructor with MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
    ...) for the EP library path conversion. The previous code zero-
    extended each char to wchar_t, which silently corrupted multi-byte
    UTF-8 sequences from localized user folders.
H3 (winml_standalone.cc): Defensively bound `used` against `pathSize`
    and use strnlen() to compute the final string length, with a
    comment quoting the verified WinMLEpCatalogApi.cpp contract
    (pathSize includes the NUL; used = pathSize on success).
M1 (winml_standalone.cc): Capture the HRESULT from
    WinMLEpCatalogEnumProviders and throw on failure, instead of
    silently completing with zero registered providers.
M2 (winml_standalone.cc): When --winml_register_provider is non-empty
    and any requested provider failed to register, throw before the
    caller silently falls back to CPU.
M3 (winml_standalone.cc): Document the "must be called once" lifecycle
    of WinML_InitializeAndRegisterAllProviders and add an
    assert(!g_ep_catalog) on entry. (A follow-up commit will refactor
    the file-static globals into an RAII class.)
M4 (main.cc): Removed naturally with H1 (the `for (uint32_t v ...)`
    underflow loop is gone).
M5 (winml_standalone.cc): Add a prominent link-order warning banner
    above the OrtGetApiBase redefinition explaining the no-other-static-
    lib-defines-OrtGetApiBase contract.
L2 (winml_standalone.cc): Remove the unused ready_state_to_string
    helper.
L4 (winml_standalone.cc): Rename the lambda's inner `auto* ctx` to
    `auto* ec` so it does not shadow the outer EnumContext local.
L5 (winml_standalone.cc): Drop the one-line WinML_FindAndRegisterAllProviders
    forwarder; merge its body into WinML_InitializeAndRegisterAllProviders.
L6 (main.cc): Extend the gsl::finally cleanup comment to spell out
    that WinML_Uninitialize must run before g_ort is reset (if that is
    ever added) because UnregisterExecutionProviderLibrary calls
    through the C API.
L7 (chrisd scripts + winappsdk_onnxruntime_perf_test.md): Rename
    stale `winappsdk_onnxruntime_perf_test` references to
    `winml_standalone_perf_test` in copy-perf-test.cmd, go.cmd,
    go-all.cmd, go-nvidia-tests.cmd, go-openvino.cmd, go-qnn.cmd, and
    simple-intel-test-ape.cmd. Drop the obsolete --winappsdk_version
    flag and replace --winappsdk_register_provider with
    --winml_register_provider. Rename winappsdk_onnxruntime_perf_test.md
    to winml_standalone_perf_test.md and rewrite content to reflect
    the standalone WinML model (drop the WindowsAppSDK bootstrap, drop
    Microsoft.Windows.CppWinRT, replace BUILD_WINAPPSDK_PERF_TEST with
    BUILD_WINML_STANDALONE_PERF_TEST, document the API version
    contract).
L8 (winml_standalone.cc): Add a comment above the OrtGetApiBase
    LoadLibraryExW call explaining that EXE-dir-only resolution is
    intentional (the bundled WinML NuGet package is the contract owner
    of the runtime version).
L10 (cmake/winml_standalone_perf_test.cmake): Add an
    `if(NOT WINML_BINARY_DIR) message(FATAL_ERROR ...)` guard right
    after find_package, so a NuGet-package layout drift fails at
    configure time rather than emitting a silent post-build copy that
    leaves the EXE missing onnxruntime.dll at runtime.
L12 (cmake/onnxruntime_unittests.cmake): Switch the WinML standalone
    glob exclusion regex to a prefix match so any future
    winml_standalone_*.cc/h is automatically excluded from the regular
    onnxruntime_perf_test target without a manual regex update.
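
To illustrate the H2 bug, here is a minimal portable sketch of why the iterator-pair constructor mangles UTF-8. `WidenPerByte` and `CountCodePoints` are hypothetical helpers written for this illustration only; the actual fix uses `MultiByteToWideChar`, which is Windows-only and is not reproduced here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper reproducing the pre-fix pattern: every byte of the
// UTF-8 input is widened to its own wchar_t, so a multi-byte sequence is
// split into several unrelated characters.
std::wstring WidenPerByte(const std::string& utf8) {
  return std::wstring(utf8.begin(), utf8.end());
}

// Counts Unicode code points in valid UTF-8 by skipping continuation
// bytes (those of the form 10xxxxxx).
std::size_t CountCodePoints(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

For a path such as `"C:\\Users\\Ren\xC3\xA9"` (where `0xC3 0xA9` encodes "é"), `WidenPerByte` returns 14 elements although the path has only 13 code points, so the last character no longer names the real folder.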

Verified: configure (chrisd\p-x64.cmd) succeeded. cmake --build build
--config RelWithDebInfo --target winml_standalone_perf_test and
--target onnxruntime_perf_test both succeed. The standalone target's
compile list contains winml_standalone.cc; the regular target's does
not.

ORT_API_VERSION on this branch is 24 (vs 25 on the chrisdMSFT/onnxruntime
PR #1 head); all fixes use the macro, so no source adjustment was
needed for the version delta.
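
The hard-fail policy from H1 can be sketched with stand-in types. `FakeApi`, `FakeApiBase`, and `RequireApi` are hypothetical names that only mimic the shape of a versioned lookup returning null for an unsupported version; this is not the code in main.cc.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for a versioned API table and its lookup.
struct FakeApi {
  uint32_t version;
};

struct FakeApiBase {
  uint32_t max_supported;  // highest API version this "runtime" implements
  FakeApi api;

  // Returns nullptr when the requested version is outside the supported range.
  const FakeApi* GetApi(uint32_t requested) const {
    return (requested >= 1 && requested <= max_supported) ? &api : nullptr;
  }
};

// Hard-fail: request exactly the compile-time version and never retry with
// a lower one, because the caller was compiled against the newer layout.
const FakeApi& RequireApi(const FakeApiBase& base, uint32_t compile_time_version) {
  const FakeApi* api = base.GetApi(compile_time_version);
  if (api == nullptr) {
    throw std::runtime_error("runtime does not support API version " +
                             std::to_string(compile_time_version));
  }
  return *api;
}
```

With a runtime that tops out at version 24, `RequireApi(runtime, 24)` succeeds and `RequireApi(runtime, 25)` throws instead of silently handing back an older table.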

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chrisdMSFT added a commit that referenced this pull request May 7, 2026
Implement the three "Suggested follow-ups (not blocking)" from the PR
#1 review.

#1 (CI smoke build) -- .github/workflows/windows_winml_standalone.yml
    Configure + build the winml_standalone_perf_test target (and the
    regular onnxruntime_perf_test target as a sanity check) on
    windows-latest, RelWithDebInfo, x64. Triggered on push / pull_request
    on main, rel-*, winml-standalone*, rel-*-winml-standalone, and any
    user/** branch, scoped to the perftest source tree, the cmake files
    that drive the target, and the workflow file itself. No test-run
    step (would require NPU/GPU EP devices not present on hosted
    runners).

    Uses a GitHub-hosted runner instead of the 1ES self-hosted pools
    that the rest of the windows_*.yml workflows use, because those
    pools are gated to Microsoft's CI infrastructure and would not run
    on a personal fork. ilammy/msvc-dev-cmd is used for vcvars setup
    since the in-repo locate-vcvarsall-and-setup-env composite action
    targets the self-hosted images.

#2 (RAII for catalog state) -- supersedes the round-1 M3 minimal-assert
    decision. Replace the file-static g_ep_catalog and
    g_registered_providers globals plus the WinML_* free functions
    with a move-only WinMLStandaloneRegistration class
    in winml_standalone.{h,cc}. Construction opens the catalog,
    enumerates providers, registers each one (subject to the
    --winml_register_provider filter), and throws on
    WinMLEpCatalogCreate failure, WinMLEpCatalogEnumProviders failure,
    or any requested provider failing to register. Destruction
    unregisters in reverse order and releases the catalog.

    Partial-construction safety: the catalog handle is taken into the
    member as soon as WinMLEpCatalogCreate succeeds, then the rest of
    the constructor runs inside a try/catch that calls Cleanup() before
    rethrowing. This lets the destructor and the constructor's failure
    path share a single noexcept Cleanup() implementation.

    Header keeps the catalog handle as a void* so winml_standalone.h
    does not pull <WinMLEpCatalog.h> (and therefore the WinML NuGet
    headers) into main.cc.

    main.cc now uses straight RAII -- the WinMLStandaloneRegistration
    object is declared right after the Ort::Env, replacing both the
    WinML_InitializeAndRegisterAllProviders call and the
    gsl::finally([&]{ WinML_Uninitialize(env); }) block. The
    <gsl/util> include is hoisted out of the BUILD_WINML_STANDALONE_PERF_TEST
    guard because main.cc still uses gsl::finally for plugin EP unregister.

#3 (deployment README) -- onnxruntime/test/perftest/WINML_STANDALONE.md
    Short, focused doc that answers exactly the reviewer's question:
    which DLLs land next to the EXE and where each one comes from.
    Includes a table mapping each file to its source NuGet/build
    artifact and the cmake mechanism that copies it, an explanation of
    why onnxruntime.dll resolution is intentionally EXE-dir-only, the
    minimum redeployment payload, and the ORT API version contract.
    Cross-links to the comprehensive winml_standalone_perf_test.md at
    the repo root for build/run details.
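
The RAII shape described in #2 above (register in the constructor, clean up on partial failure, unregister in reverse order on destruction) can be sketched as follows. `Backend` and `ScopedRegistration` are hypothetical stand-ins for the catalog and for WinMLStandaloneRegistration; none of the WinML calls are modeled.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical backend that records calls instead of touching a real catalog.
struct Backend {
  std::vector<std::string> log;
  std::string fail_on;  // name that fails to register; empty means none

  void Register(const std::string& name) {
    if (name == fail_on) throw std::runtime_error("register failed: " + name);
    log.push_back("+" + name);
  }
  void Unregister(const std::string& name) { log.push_back("-" + name); }
};

class ScopedRegistration {
 public:
  ScopedRegistration(Backend& backend, const std::vector<std::string>& names)
      : backend_(&backend) {
    try {
      for (const auto& name : names) {
        backend_->Register(name);
        registered_.push_back(name);
      }
    } catch (...) {
      Cleanup();  // the destructor will not run, so undo partial work here
      throw;
    }
  }

  ~ScopedRegistration() { Cleanup(); }

  ScopedRegistration(ScopedRegistration&& other) noexcept
      : backend_(std::exchange(other.backend_, nullptr)),
        registered_(std::move(other.registered_)) {
    other.registered_.clear();
  }

  ScopedRegistration& operator=(ScopedRegistration&& other) noexcept {
    if (this != &other) {
      Cleanup();
      backend_ = std::exchange(other.backend_, nullptr);
      registered_ = std::move(other.registered_);
      other.registered_.clear();
    }
    return *this;
  }

  ScopedRegistration(const ScopedRegistration&) = delete;
  ScopedRegistration& operator=(const ScopedRegistration&) = delete;

 private:
  // Shared by the destructor and the constructor's failure path.
  void Cleanup() noexcept {
    if (backend_ == nullptr) return;
    for (auto it = registered_.rbegin(); it != registered_.rend(); ++it) {
      backend_->Unregister(*it);
    }
    registered_.clear();
    backend_ = nullptr;
  }

  Backend* backend_;
  std::vector<std::string> registered_;
};
```

Declaring the object right after the environment it depends on gives the reverse-order teardown for free, which is the property the gsl::finally block previously had to provide by hand.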

Verified: configure had already been run; cmake --build build
--config RelWithDebInfo --target winml_standalone_perf_test and
--target onnxruntime_perf_test both succeed. The standalone target's
compile list contains winml_standalone.cc; the regular target's does
not (the prefix-match glob exclusion in cmake/onnxruntime_unittests.cmake
still keeps them disjoint).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chrisdMSFT pushed a commit that referenced this pull request Jun 11, 2026
…ft#28503)

### Description

Add an internal session config entry, `"session.compile_only"`, set by `CompileModel()` before session initialization. The NvTensorRTRTX EP reads it in `NvExecutionProviderInfo::FromProviderOptions()` and, when set, skips `deserializeCudaEngine()` / `createExecutionContext()` in `CreateNodeComputeInfoFromGraph()`.

The EP context node is still saved: that path uses the serialized engine buffer directly and does not depend on the deserialized engine. A stub compute function is registered to satisfy the framework; it returns `NOT_IMPLEMENTED` if called, which cannot happen in practice because compile-only sessions are destroyed without inference.
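
As a rough illustration of the control flow described above, the following sketch uses stand-in types rather than the EP's real classes. `Status`, `NodeComputeInfo`, `ConfigMap`, and `CreateNodeComputeInfo` are hypothetical, and treating the value `"1"` as "set" is an assumption; the description does not say how the entry is encoded.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

enum class Status { OK, NOT_IMPLEMENTED };

// Hypothetical stand-in for the per-node state an EP builds during Compile().
struct NodeComputeInfo {
  bool engine_deserialized = false;
  std::function<Status()> compute;
};

using ConfigMap = std::unordered_map<std::string, std::string>;

// Sketch: when the compile-only entry is present with value "1" (assumed
// encoding), skip the expensive deserialization and install a stub.
NodeComputeInfo CreateNodeComputeInfo(const ConfigMap& session_config) {
  const auto it = session_config.find("session.compile_only");
  const bool compile_only = it != session_config.end() && it->second == "1";

  NodeComputeInfo info;
  if (compile_only) {
    // The stub exists only to satisfy the framework; it reports an error
    // if anyone actually tries to run inference.
    info.compute = [] { return Status::NOT_IMPLEMENTED; };
    return info;
  }
  info.engine_deserialized = true;  // stands in for deserialize + context creation
  info.compute = [] { return Status::OK; };
  return info;
}
```

The design point is that the flag changes only what happens at node creation; the serialized engine written into the EP context model is untouched.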


### Motivation and Context

`OrtCompileAPI::CompileModel()` creates an `InferenceSession` solely to drive `EP::Compile()` and write out the EPContext model, then destroys it without running inference. During that session, the NvTensorRTRTX EP was performing a full `deserializeCudaEngine()` and `createExecutionContext()`, uploading engine weights to the GPU and JIT-ing the engine, only to free everything when the session was destroyed.

When the user then loads the EPContext model in a real session, the same JIT and upload happen again. Net effect on the typical "compile, then load and run" flow:

  ```
  ONNX model
      → CompileModel()         [JIT + GPU upload #1 — discarded]
      → EP context model saved to disk
      → Session from EP context model
                               [JIT + GPU upload #2 — necessary]
      → Inference
  ```

  JIT and GPU upload run twice.