[CI] Cross-platform — Part 3: Windows workflow #6086

Closed
hujc7 wants to merge 42 commits into develop from jichuanh/windows-spark-ci-perception

hujc7 commented Jun 9, 2026

Collaborator

Summary

Same-repo version of #5700, opened from a branch in isaac-sim/IsaacLab (not a fork) so the Windows CI job receives the org secrets it needs. GitHub does not deliver workflow secrets to pull_request runs from forks, so the fork-based #5700 cannot exercise the authenticated Isaac Sim install — this PR can.

Adds .github/workflows/windows-ci.yaml — CI pipeline for Windows GPU self-hosted runners, native (non-Docker) install path.

  • Installs the develop-aligned Isaac Sim from the internal Artifactory index (via tools/resolve_isaacsim_develop.py), so native Windows tests the same Sim build as the Linux/ARM develop containers instead of the older public pip release.
  • Authenticates the internal index with the ISAACSIM_ARTIFACTORY_READONLY_USERNAME / ISAACSIM_ARTIFACTORY_READONLY_PASSWORD org secrets (anonymous Artifactory access was removed).
  • Single consolidated job: setup → deps smoke → path-io → Kit headless boot → cartpole training smoke (state + perception) → cartpole-camera perception → wheel build + reinstall.
  • All pytest invocations use --timeout so hung tests fail fast instead of hanging the job.

TEMP (revert before final review)

To save runner time/cost while iterating, the heavy non-Windows PR workflows are force-skipped (each marked TEMP): Docker + Tests, Installation Tests, Build PIP Wheel, License Check, Multi-GPU.

Notes

  • The durable answer for fork/contributor PRs is to bake the read-only Artifactory creds into the self-hosted runner image (per the platform security guidance), removing the need for workflow secrets entirely. This PR validates the wiring works with the secrets in the meantime.
  • Supersedes [CI] Cross-platform — Part 3: Windows workflow #5700.

Test plan

  • windows-ci.yaml triggers on this same-repo PR; setup step authenticates and installs develop-aligned Isaac Sim.
  • perception steps fail fast on Vulkan/runtime errors instead of hanging.

hujc7 added 30 commits May 20, 2026 23:11

Foundation for cross-platform CI. Registers four pytest markers
(windows, windows_ci, arm, arm_ci), teaches AppLauncher to recognize
them in argv so they do not leak into Isaac Sim's argparse, and moves
the AssetConverterBase USD scratch directory from a hardcoded
/tmp/IsaacLab to tempfile.gettempdir() for cross-platform compatibility.
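
A minimal sketch of the scratch-directory change described above. The helper name `usd_scratch_dir` is hypothetical; the actual AssetConverterBase code is not shown in this PR conversation.

```python
import os
import tempfile


def usd_scratch_dir(subdir: str = "IsaacLab") -> str:
    """Return a per-platform scratch directory (hypothetical helper)."""
    # tempfile.gettempdir() consults TMPDIR/TEMP/TMP, so it typically resolves
    # to /tmp on Linux and to the user's Temp folder on Windows.
    path = os.path.join(tempfile.gettempdir(), subdir)
    os.makedirs(path, exist_ok=True)
    return path
```

The point of the change is that no path separator or root directory is hardcoded, so the same code works on both platforms.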

Tags source/isaaclab/test/deps/test_torch.py and test_scipy.py with
the new markers so they are selectable by future cross-platform jobs.
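
For illustration, marker registration of this kind is usually done in a `conftest.py` hook. The sketch below uses pytest's `config.addinivalue_line("markers", ...)`; the marker descriptions are assumptions, not text from the repository.

```python
# conftest.py sketch: register the four cross-platform markers so that
# `pytest -m windows_ci` selects tagged tests without "unknown marker" warnings.
MARKERS = {
    "windows": "test is expected to work on Windows (assumed description)",
    "windows_ci": "test is selected by the Windows CI job (assumed description)",
    "arm": "test is expected to work on ARM (assumed description)",
    "arm_ci": "test is selected by the ARM CI job (assumed description)",
}


def pytest_configure(config):
    """pytest hook: declare each marker in the 'markers' ini option."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test module then opts in with a module-level `pytestmark = pytest.mark.windows_ci`.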

Workflow files (arm-ci.yaml, windows-ci.yaml) ship in follow-up PRs.

Same shape as arm-ci.yaml but the install path is native pip + uv on
the Windows host (no Docker for Linux-based Isaac Sim wheels).

Jobs (all continue-on-error: true):
  Tier 1 — general-windows, install-windows, kit-launch-windows
  Tier 2 — path-io-windows, perception-windows

Every pytest invocation passes --timeout=N + --timeout-method=thread
(signal is unavailable on Windows) plus --continue-on-collection-errors
so a hung test cannot consume the full job slot and a broken neighbor
file does not poison the marker-driven discovery.

perception-windows wraps the cartpole-camera smoke in an inline Python
script with explicit assertions and an inner watchdog thread that aborts
the process after 180s. This replaces the previous pattern where Vulkan
init failures hung the job instead of erroring.
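
A rough sketch of what an in-process watchdog thread of this kind can look like. The function name and the `on_timeout` parameter are illustrative assumptions; the actual inline script is not reproduced in this conversation.

```python
import os
import threading


def start_watchdog(timeout_s: float, on_timeout=None) -> threading.Timer:
    """Abort the process if it is still running after timeout_s seconds."""
    if on_timeout is None:
        # os._exit skips atexit handlers and finalizers, so it still works
        # when normal interpreter shutdown would block.
        def on_timeout():
            os._exit(1)

    timer = threading.Timer(timeout_s, on_timeout)
    timer.daemon = True  # the watchdog must not keep the process alive
    timer.start()
    return timer
```

Typical use would be `watchdog = start_watchdog(180.0)` before the risky call and `watchdog.cancel()` after it returns. A timer thread only fires if the interpreter can schedule it, which is the weakness of any in-process watchdog.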

Tags four path-IO test files (test_configclass, test_dict,
test_episode_data, test_hdf5_dataset_file_handler) with the windows_ci
marker so path-io-windows picks them up via marker-driven discovery.

Forces run_docker_tests=false in build.yaml's changes job so all gated
test jobs skip via their existing if-gate. Must be reverted before
final review.

Kit bootstrap aborts on the Windows runner with 'Unable to bootstrap
inner kit kernel: EOF when reading a line' when stdin is not a tty and
no EULA env vars are set. Set OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA /
PRIVACY_CONSENT at the workflow level so every job inherits them.

Bare 'isaacsim' on Windows pulls only isaacsim + isaacsim-kernel; Kit
bootstrap then warns 'PYTHONPATH path doesn't exist
(...site-packages/isaacsim/exts/isaacsim.simulation_app)' / 'Unable to
expose isaacsim.simulation_app API: Extension not found', and
'from isaacsim import SimulationApp' resolves to None, so
AppLauncher dies with 'TypeError: NoneType object is not callable'.
Match install.py / wheel_builder canonical spec: isaacsim[all]>=6.0.0.

Pytest collection over source/isaaclab/test imports
sensors/test_tiled_camera_env.py whose module-level argparse.parse_args
consumes pytest's --ignore=... / -m windows_ci flags and INTERNALERRORs
collection (collected 595 items / 48 errors). The windows_ci-tagged
path-IO tests on this branch all live in test/utils, so narrow the
pytest scope to that subdir — keeps the marker filter intact without
forcing every test file in the tree to be importable bare.
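
The commit above sidesteps the problem by narrowing the pytest scope. Purely as an illustration of the failure mode (not a change made in this PR), the sketch below contrasts `parse_args`, which rejects flags it does not own, with `parse_known_args`, which hands them back. The `--num_envs` flag is a hypothetical stand-in for whatever the test module defines.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--num_envs", type=int, default=1)  # hypothetical script flag
    return parser


def parse_script_args(argv):
    """Parse only the flags this module owns and return the rest untouched."""
    args, unknown = build_parser().parse_known_args(argv)
    return args, unknown


def strict_parse_fails(argv) -> bool:
    """Return True if a strict parse_args() would exit on these arguments."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return True
    return False
```

With pytest-style arguments such as `-m windows_ci --ignore=tools/conftest.py`, the strict parse exits during import, which pytest reports as a collection error; the tolerant parse leaves those flags in `unknown`.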
…lder)

Bare 'isaacsim[all]' on Windows fails Kit startup with
'ImportError: cannot import name get_metrics_assembler_interface from
omni.metrics.assembler.core (unknown location)' — the extension is
registered but its implementation isn't on disk because the extscache
extra wasn't requested. wheel_builder/res/python_packages.toml pins
'isaacsim[all,extscache]==6.0.0.*' for exactly this reason; mirror it.

import isaaclab_tasks walks all task packages, which transitively
touches GroundPlaneCfg.physics_material -> isaaclab.sim.spawners.materials
forwarding shim, which raises 'RigidBodyMaterialCfg has moved to
isaaclab_physx.sim.spawners.materials. Install the isaaclab_physx
extension or update your import.' Install it editable before
isaaclab_assets / isaaclab_tasks so the shim resolves.

isaaclab-physx==1.1.0 declares a hard dep on isaaclab-ppisp which is
not in source/ and not on any package index, so uv refuses the install
with 'isaaclab-ppisp was not found in the package registry'. The
ppisp import in isaaclab_physx is lazy (runtime, not at import), so
--no-deps gets us a working editable install. Mirrors the same
workaround used by the ARM-side install path (see install.py).

Three distinct gaps surfaced in path-io-windows on commit 683c110:
1. test_episode_data[cuda:0] parametrize: 'Torch not compiled with CUDA
   enabled' — default torch wheel on Windows pypi is CPU-only. Install
   torch + torchvision from download.pytorch.org/whl/cu128.
2. test_hdf5_dataset_file_handler: 'No module named h5py' — h5py was
   never declared by the isaaclab core dep set on Windows. Install it.
3. test_version.py / test_wrench_composer_*.py: KeyError 'EXP_PATH' at
   collection. Those files instantiate AppLauncher at module load and
   need an Isaac Sim install path-IO does not provide.

Replace the '-m windows_ci' marker filter (which still imports every
file in test/utils for collection) with explicit windows_ci-tagged
file paths. Also drop --ignore=tools/conftest.py since no conftest sits
under utils/.

The Windows runner reports 'vkEnumeratePhysicalDevices failed. No
physical device is found.' / 'Failed to create any GPU devices' when
Kit boots with --enable_cameras=True. Kit then hangs (the in-script
3-min watchdog can't reliably preempt a C-level GIL-held call), the
job consumes its full timeout-minutes, and every other queued job on
the same runner gets cancelled.

Set the perception job's 'if' to false so it never claims the runner.
Also tighten timeout-minutes from 30 to 10 so even when re-enabled it
fails fast rather than starving siblings. Flip 'if' back to
needs.changes.outputs.run_windows_ci == 'true' once the runner is
confirmed GPU-capable.

Python thread watchdogs cannot preempt a Kit/Vulkan init that hangs
in a C call holding the GIL — observed on this runner where the
3-min in-script time.sleep + os._exit never fired and perception_smoke
held the Windows runner for the full 40-min job timeout, starving
every other job.

Replace the thread watchdog inside perception_smoke.py with a
PowerShell Start-Process + WaitForExit at the shell layer (OS-level
process kill, immune to GIL). Apply the same pattern to
kit-launch-windows's inline python invocation.
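
The workflow implements this in PowerShell (Start-Process + WaitForExit). The same idea, sketched in Python as an analogue rather than the code used in the workflow: the supervising process waits with a timeout and kills the child from outside, so a hang inside the child cannot block the kill. The function name is an assumption.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s):
    """Run cmd as a child process and kill it if it exceeds timeout_s.

    Returns (exit_code, timed_out).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        proc.kill()  # OS-level kill issued by the parent process
        proc.wait()  # reap the child so its handle is released
        return proc.returncode, True


if __name__ == "__main__":
    # Example: a child that sleeps far longer than the allowed 1 second.
    print(run_with_watchdog([sys.executable, "-c", "import time; time.sleep(60)"], 1.0))
```

Note that `Popen.kill()` targets only the direct child; grandchildren spawned by it would need separate handling.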

Tighten per-job timeout-minutes:
  general-windows  30 -> 15
  install-windows  45 -> 30
  kit-launch       30 -> 15
  path-io          30 -> 15
The hard upper bound is now the second line of defence; the
PowerShell watchdog catches runaway python first.

PowerShell on the Windows runner doesn't have bash on PATH:
  bash : The term 'bash' is not recognized as the name of a cmdlet ...
Git for Windows installs bash.exe at C:\Program Files\Git\bin\bash.exe;
invoke it directly with a Test-Path guard and exit-code check so
failures fast-fail.

build.sh hardcoded python3. Linux installs expose python3 (and that
remains the default), but Windows git-bash only has python (no python3
symlink), so the build was dying with 'python3: command not found'
the moment install-windows tried to run the canonical wheel build.

Make build.sh use ${PYTHON:-python3} for every interpreter call and
pass PYTHON=python from the Windows workflow before invoking it. Linux
behavior unchanged; one variable lets Windows reuse the same script.

PowerShell on the Windows runner reads the yaml as a non-UTF-8 code
page; em-dashes (U+2014) inside the Write-Host string literals got
mojibake'd to 'â€"' and tripped the parser:
  ParserError: TerminatorExpectedAtEndOfString

Replace the two affected em-dashes with ASCII '-'. Comment-line
em-dashes elsewhere in the file are harmless (tokenizer skips them)
and stay as-is to avoid touching unrelated lines.

build.sh runs 'python -m pip install build wheel' inside the venv.
uv venv ships without pip by default, so this failed with
  C:\...\env_isaaclab_uv\Scripts\python.exe: No module named pip
right after gen_pyproject.py emitted the generated pyproject.toml.

Add --seed to the install-windows venv create so pip / setuptools /
wheel land inside the venv; the other 3 jobs don't call build.sh and
keep the lighter seedless venvs.

Flips perception-windows from 'if: false' back to the standard
needs.changes.outputs.run_windows_ci gate. The PowerShell process-level
watchdog around the inline Kit boot stays as the inner guard; the
tightened 10-min job timeout-minutes is the outer guard so a Vulkan
init regression cannot starve other queued jobs again.

The watchdog used $proc.Kill($true), which compiles on .NET 5+ but
not on PowerShell 5.1's .NET Framework (Process.Kill has no (bool)
overload there). It still surfaced 'MethodCountCouldNotFindBest' on
the runner after the kill ::error was emitted. Switch to
Stop-Process -Id $proc.Id -Force -ErrorAction SilentlyContinue which
is PS5-native and idempotent.

Adds .github/actions/windows-instance-state composite action with a
single 'phase' input:

  pre  : print disk free + sizes of cache and user-state dirs
  post : print state, wipe non-cache user state, print state again

Each of the 5 Windows-runner jobs now reports state right after
checkout (BEFORE) and at the end with if: always() (AFTER), so any
poisoned state shows up immediately and the runner is left net-zero
outside intentional content caches.

Cleaned in 'post' (state, chain-risk):
  %APPDATA%\NVIDIA Corporation\Omniverse Kit
  %USERPROFILE%\Documents\Kit
  %TEMP%\Kit* / hub-* / omniverse-* crash scratch dirs
  %APPDATA%\Python\Python312\site-packages\{build,wheel}  (escaped
    from build.sh's pip install --user fallback)

Kept across runs (content-addressed, no chain):
  %LOCALAPPDATA%\uv\cache
  %LOCALAPPDATA%\pip\Cache
  %LOCALAPPDATA%\NVIDIA\Omniverse  (Kit shader cache; invalidated
    by Kit itself on version mismatch)

Extend the workflow-level env block with the headless/no-window/EULA
flags that PR #4018's known-working build.yml proved out:
  ISAACSIM_ACCEPT_EULA=YES        # different layer from ACCEPT_EULA
  HEADLESS=1, ISAAC_SIM_HEADLESS=1, ISAAC_SIM_LOW_MEMORY=1
  WINDOWS_PLATFORM=true
  OMNI_KIT_NO_WINDOW=1            # critical: blocks Kit from trying to
                                  # open a display when no desktop session
  OMNI_KIT_DISABLE_WATCHDOG=1, OMNI_KIT_TELEMETRY=0
  CARB_LOGGING_SEVERITY=error
  PYTHONUNBUFFERED=1, PYTHONIOENCODING=utf-8

Add .github/actions/windows-sim-paths/ composite action that re-activates
the caller's venv, resolves the Isaac Sim install root via
pip show isaacsim-kernel, and exports:
  ISAAC_PATH, CARB_APP_PATH (sim/kit), EXP_PATH (workspace/apps), RESOURCE_NAME

It also prepends <sim>/kit/plugins and <sim>/bin to PATH so the Vulkan
loader can find NVIDIA's ICD DLLs (likely root cause of
'vkEnumeratePhysicalDevices failed. No physical device is found.' on
this runner — DLL search defaults do not include the Sim install).
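
The composite action itself is PowerShell. As a sketch of the two operations it performs, the Python below extracts the `Location:` field from `pip show`-style output and builds a PATH value with the Sim directories prepended. How the Sim root is derived from the reported location is not stated here, so `sim_root` is left as a caller-supplied parameter; both function names are assumptions.

```python
import os


def parse_location(show_output: str) -> str:
    """Extract the 'Location:' field from `pip show` / `uv pip show` output."""
    for line in show_output.splitlines():
        if line.startswith("Location:"):
            # maxsplit=1 keeps drive-letter colons (C:\...) intact.
            return line.split(":", 1)[1].strip()
    raise ValueError("no 'Location:' line found")


def prepend_sim_dirs(sim_root: str, path: str) -> str:
    """Return `path` with <sim>/kit/plugins and <sim>/bin prepended."""
    extra = [
        os.path.join(sim_root, "kit", "plugins"),
        os.path.join(sim_root, "bin"),
    ]
    return os.pathsep.join(extra + [path])
```

Prepending rather than appending matters because the first matching DLL on the search path wins.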

Wire into kit-launch-windows and perception-windows by splitting their
'install + launch' steps into three: install isaacsim, resolve Sim
paths (this action), boot Kit. Install-windows and path-io-windows
don't boot Kit so don't need this.

Extend the windows-instance-state action's report with nvidia-smi
output so 'no GPU' vs 'GPU present, Vulkan can't load' is visible
in every job's pre-state dump. Also harden the size measurement
against junctions/reparse points that have no Length property
(suppresses the GenericMeasurePropertyNotFound noise observed
in the previous run).

'python -m pip show isaacsim-kernel' inside the uv venv failed with
'No module named pip' because uv venvs are created without seeding
pip / setuptools / wheel by default. uv itself can introspect the
venv (it tracks its own install metadata) so 'uv pip show' is the
correct lookup here.

PowerShell treats 'Using Python 3.12.13 environment at: env_isaaclab_uv'
(uv banner on stderr) as a NativeCommandError record when captured via
'2>&1' under $ErrorActionPreference='Stop', failing the step before
parsing the Location: line. Drop the 2>&1 so stderr just streams to
the host log; rely on $LASTEXITCODE for failure detection.

Also surfaces an important data point this run captured for free:
  nvidia-smi: NVIDIA L40S, 582.53, 46068 MiB
The runner DOES have a real GPU. The earlier
'vkEnumeratePhysicalDevices failed' was DLL-discovery, not GPU
absence — which is exactly what this PATH prepend (Sim bin +
kit/plugins) is supposed to fix once the path resolution runs
cleanly.

Duplicate test_cartpole_training_smoke.py from PR #5698's branch so PR
#5700 doesn't chain on it. Cross-platform tweaks vs ARM's copy:
  - pytestmark = [arm_ci, windows_ci]   # dual marker
  - _LAUNCHER picks isaaclab.bat on Windows, isaaclab.sh elsewhere

Add training-smoke-windows job that pytests this file in the same
install + Sim-paths context as perception-windows. continue-on-error
true and timeout-minutes 30 mirror the other Windows jobs.

State case (Isaac-Cartpole-Direct-v0 / rsl_rl) should pass on TCC —
no RTX, no Vulkan touch. Perception case
(Isaac-Cartpole-RGB-Camera-Direct-v0 / rl_games) needs Vulkan and
will fail on this runner until WDDM is enabled.

Whichever of #5698 / #5700 merges first wins the test file; the other
PR will drop the duplicate on rebase.

test_cartpole_training_smoke.py invokes
  scripts/reinforcement_learning/rsl_rl/train.py    (state case)
  scripts/reinforcement_learning/rl_games/train.py  (perception case)
Both train scripts import rsl_rl / rl_games as their first non-stdlib
imports — and the previous Windows training-smoke install didn't pull
either, so both cases hit:
  ModuleNotFoundError: No module named 'rsl_rl'
  ModuleNotFoundError: No module named 'rl_games'
isaaclab_rl/setup.py declares these as extras [rsl_rl] / [rl_games];
install the editable package with both extras so the framework
packages (rsl-rl-lib + rl-games) end up in the venv.

Same coverage as before — deps smoke + path-IO + kit-launch +
cartpole training smoke + perception + wheel build — but as
sequential steps inside a single runs-on: [self-hosted, gpu-windows]
job. Why:

1. Single venv create + single isaacsim install shared across all
   test steps. Saves ~5 venv setups (~3 min each = ~15 min wall).
2. The runner gets ONE allocation, stays continuously busy, never
   sees an inter-job idle gap. Autoscaler can't tear it down and
   strand queued siblings (the cancellation cascade we kept hitting).
3. Same affinity guarantee as Linux/ARM single-job model — every
   test step touches the same runner's filesystem and Sim install.

Each test step has continue-on-error: true and writes its own
JUnit XML. A final aggregate step parses outcomes and fails the
job iff any non-perception step failed. perception is gated as
'warning, not failure' until the runner pool fixes TCC->WDDM, so
the workflow doesn't lie about overall status while still
surfacing the failure clearly.
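
The aggregate step is written in PowerShell in the workflow. Its gating rule can be sketched in Python as follows; the step names are placeholders, and treating every non-"success" outcome as a failure is an assumption of this sketch.

```python
def aggregate(outcomes, advisory=frozenset({"perception"})):
    """Split step outcomes into gating failures and advisory warnings.

    `outcomes` maps a step name to its GitHub Actions outcome string
    ("success", "failure", "cancelled" or "skipped").
    """
    failed = [s for s, o in outcomes.items() if o != "success" and s not in advisory]
    warned = [s for s, o in outcomes.items() if o != "success" and s in advisory]
    return failed, warned


if __name__ == "__main__":
    failed, warned = aggregate({"deps": "success", "path-io": "failure", "perception": "failure"})
    for step in warned:
        print(f"::warning::{step} failed (advisory only)")
    if failed:
        raise SystemExit(f"::error::gating steps failed: {', '.join(failed)}")
```

The job fails only when `failed` is non-empty, so a perception failure stays visible as a warning without turning the job red.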
The self-hosted Windows runner uses an NVIDIA L40S, a Data Center GPU.
On bare-metal Windows, NVIDIA's data-center driver does not expose
graphics APIs (OpenGL/Vulkan/DirectX) for these SKUs regardless of TCC
vs WDDM driver mode; per the Data Center GPU driver release notes,
vGPU is required to expose them. Kit's boot path reflects this exactly:
vkEnumeratePhysicalDevices returns no devices, gpu.foundation logs
"TCC is not supported. GPU(s) should be in WDDM mode.", and Kit then
hangs in omni.gpu_foundation_factory until the OS-level watchdog
fires.

Comment out the perception step (preserve verbatim for restoration),
drop the now-dangling perception_smoke.py artifact path and the
steps.test-perception.outcome reference in the Aggregate step, and
note in the file header that perception is disabled. The disabled-step
context block lists the three independent unblock criteria (vGPU on
L40S, swap runner SKU, or move perception coverage to Linux) so the
next maintainer can pick whichever lands first.

The cross-platform CI series adds source/isaaclab_tasks/test/
test_cartpole_training_smoke.py without a paired fragment, so the
nightly Check changelog fragments gate currently rejects the PR.
Add a .skip entry under source/isaaclab_tasks/changelog.d/ matching
the existing source/isaaclab/changelog.d/jichuanh-windows-ci.skip
convention (CI/test-only, no user-facing API change).

Three related changes that together unblock the consolidated windows-ci
job from the latent failures uncovered once the perception step stopped
masking everything else:

* Install `isaacsim[all,extscache]==6.0.0.*` BEFORE the cu128 torch
  upgrade. `isaacsim` pulls CPU torch transitively and was silently
  overwriting the cu128 wheel installed earlier; `[cuda:0]`-parametrized
  cases in Deps smoke and Path-IO then fail with "Torch not compiled
  with CUDA enabled". The new order mirrors install.py
  (_install_isaacsim() then _ensure_cuda_torch()).

* Install `source/isaaclab_newton` with `--no-deps`. cartpole_env_cfg.py
  imports `isaaclab_newton.physics` at module load, so every cartpole
  task fails with `ModuleNotFoundError: No module named 'isaaclab_newton'`
  without it. Same `--no-deps` reason as isaaclab_physx (both declare a
  bare-name dep on isaaclab_ppisp that's not yet on this branch nor on
  any index; the ppisp import is lazy at runtime). The smoke-import line
  is extended so this regression fails fast in setup, not in a later
  test step.

* Replace the em-dash in the Aggregate step's `Write-Host "::error::"`
  with an ASCII hyphen. PowerShell 5.1 reads the temp .ps1 as cp1252, so
  the 3-byte UTF-8 em-dash mis-decodes inside the string and the closing
  quote is mis-detected, raising "The string is missing the terminator".
  The path was never executed before because `$failed` was always empty
  (only perception had failed, and it was excluded from the gating set).
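
The mis-decoding described in the last bullet can be reproduced directly. This sketch encodes an em-dash (U+2014) as UTF-8 and decodes the bytes as cp1252; the third resulting character is U+201D, a right double quotation mark, which would plausibly explain why the string terminator is mis-detected. The `to_ascii_dashes` helper is a hypothetical illustration of the fix.

```python
EM_DASH = "\u2014"

utf8_bytes = EM_DASH.encode("utf-8")
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)                      # b'\xe2\x80\x94'
print([hex(ord(c)) for c in misread])  # ['0xe2', '0x20ac', '0x201d']


def to_ascii_dashes(text: str) -> str:
    """Replace em-dashes with ASCII hyphens so the script is encoding-proof."""
    return text.replace(EM_DASH, "-")
```

Keeping string literals in run: blocks pure ASCII avoids the problem regardless of which code page the shell assumes.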
The temp_dir fixture used `tempfile.mkdtemp()` + `shutil.rmtree()` for
cleanup. On Windows, h5py's libhdf5 keeps an internal handle to the
file briefly after `.close()`, so `rmtree` races with the handle
release and raises `PermissionError [WinError 32]` on teardown of
`test_write_and_load_episode[cuda:0]`. The assertions had already
passed; only the cleanup was failing.

Switch to `tempfile.TemporaryDirectory(ignore_cleanup_errors=True)`
(Python 3.10+). On Linux/macOS this flag is a no-op since no cleanup
error is raised; on Windows it absorbs the libhdf5 handle-release
race without masking real failures (the test body still asserts via
the explicit `dataset_file_handler.close()` calls).

Drop the now-unused `shutil` import.

Pull in source/isaaclab_ppisp (the ppisp package missing from this branch)
and the updated install.py that includes isaaclab_ppisp and isaaclab_newton
in CORE_ISAACLAB_SUBMODULES. With ppisp present, the workflow no longer
needs --no-deps workarounds for isaaclab_physx / isaaclab_newton; the
subsequent commit collapses the hand-rolled pip sequence into a single
./isaaclab.bat -i call.

hujc7 added 12 commits May 27, 2026 21:17

Replace the hand-rolled `uv pip install ...` sequence in the setup step
with a single `.\isaaclab.bat -i 'isaacsim,rl[rsl_rl,rl_games]'` call,
now that the develop merge brings in `source/isaaclab_ppisp/` and the
updated install.py that includes `isaaclab_ppisp` and `isaaclab_newton`
in CORE_ISAACLAB_SUBMODULES.

The hand-rolled sequence had grown three latent issues, all of which
the canonical install.py path avoids:

* Install order — `_install_isaacsim()` runs before `_ensure_cuda_torch()`
  inside install.py, so isaacsim's transitive CPU torch can't shadow the
  cu128 wheel. The previous hand-rolled order had the cu128 upgrade
  first and broke `[cuda:0]`-parametrized tests.
* Missing isaaclab_newton — install.py walks CORE_ISAACLAB_SUBMODULES,
  so isaaclab_newton is installed automatically. cartpole_env_cfg.py's
  import of `isaaclab_newton.physics` no longer fails.
* No more --no-deps workarounds — with `source/isaaclab_ppisp/` present
  the renderer-backend bare-name dep resolves through the local editable
  install.

The workflow keeps the test-only `pytest pytest-timeout h5py` install
(install.py doesn't carry pytest plumbing) and the post-install smoke
import. Setup-step body shrinks from ~25 lines to ~3 substantive lines.

Matches the "Mirror Linux CI setup for new platforms" rule: same entry
point as Linux CI (`./isaaclab.sh -i`), so install-order bugs and new
core submodules are picked up automatically when install.py changes.

PowerShell / pytest commands inside YAML run: blocks render as plain
text in editors without an embedded-language highlighter, so heavy
inline commentary inside those blocks becomes visual noise rather than
documentation. Strip it.

Inter-step comments (section headers, pre-step rationale, the
disabled-perception context block) are kept — those sit at the YAML
level and read fine without syntax-highlighting help.

Net: -80 lines, mostly redundant restatement of what surrounding
identifiers and commit history already make clear.

`test_train_cartpole_perception` builds Isaac-Cartpole-RGB-Camera-Direct-v0
which boots Kit with `enable_cameras=True`, hits the L40S TCC / no-vGPU
Vulkan path, and hangs until the pytest 600s timeout fires (logs show
`Stack of MainThread` thread dumps). Same blocker as the disabled
standalone perception smoke.

Filter the training-smoke pytest invocation with `-k 'not perception'`
so the state subcase (Isaac-Cartpole-Direct-v0 + rsl_rl) is the only
case exercised on the current Windows runner pool. Latest CI run shows
the state subcase passes in ~30s. Drop the filter when the L40S vGPU
unblock criterion lands (same condition tracked in the disabled
perception step's context block).

Independent probe of the Vulkan loader on the runner, separate from
Kit. Captures nvidia-smi driver+display info, lists vulkan-1.dll and
ICD registry entries, and runs vulkaninfo --summary if available
(falls back to a ctypes-based vkCreateInstance +
vkEnumeratePhysicalDevices probe via the existing uv venv when the
SDK isn't installed). Output goes to reports/vulkan-probe.txt and is
included in the windows-ci-reports artifact.

continue-on-error: true so the probe is informational only and does
not gate the job. Added to the Aggregate $results listing for
visibility.

Background: PR 5700 perception step fails on the runner with
"vkEnumeratePhysicalDevices failed. No physical device is found." +
"TCC is not supported. GPU(s) should be in WDDM mode." Adding the
direct vulkaninfo / loader probe answers the question of what the
Vulkan ICD stack itself sees, independent of Kit's bootstrap path.

Last CI run's probe step parse-failed because PowerShell doesn't
support bash heredoc (<<'PYEOF') and the YAML block scalar couldn't
host an unindented PowerShell here-string for the embedded Python.

Move the ctypes Vulkan loader probe out of the workflow into a
standalone tools/vulkan_probe.py:

* Loads vulkan-1.dll / libvulkan.so.1 via ctypes.
* Calls vkCreateInstance + vkEnumeratePhysicalDevices.
* Reports loader-load, instance-create, and physical-device count.
* No dependencies beyond the OS Vulkan loader; cross-platform.
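
The contents of tools/vulkan_probe.py are not shown in this conversation. The following is an independent sketch of a probe with the same three stages (loader load, instance create, physical-device count), using only ctypes and the standard Vulkan 1.0 entry points; the function name and result keys are assumptions.

```python
import ctypes
import sys

VK_SUCCESS = 0
VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO = 1


class VkInstanceCreateInfo(ctypes.Structure):
    # Field order follows the Vulkan 1.0 VkInstanceCreateInfo definition.
    _fields_ = [
        ("sType", ctypes.c_int32),
        ("pNext", ctypes.c_void_p),
        ("flags", ctypes.c_uint32),
        ("pApplicationInfo", ctypes.c_void_p),
        ("enabledLayerCount", ctypes.c_uint32),
        ("ppEnabledLayerNames", ctypes.c_void_p),
        ("enabledExtensionCount", ctypes.c_uint32),
        ("ppEnabledExtensionNames", ctypes.c_void_p),
    ]


def probe_vulkan() -> dict:
    """Report how far the Vulkan loader gets: load, instance, device count."""
    result = {"loader_loaded": False, "instance_created": False, "device_count": None}
    name = "vulkan-1.dll" if sys.platform == "win32" else "libvulkan.so.1"
    try:
        vk = ctypes.CDLL(name)
    except OSError:
        return result  # loader library not present on this machine
    result["loader_loaded"] = True

    vk.vkCreateInstance.restype = ctypes.c_int32
    vk.vkCreateInstance.argtypes = [
        ctypes.POINTER(VkInstanceCreateInfo), ctypes.c_void_p, ctypes.POINTER(ctypes.c_void_p)]
    vk.vkEnumeratePhysicalDevices.restype = ctypes.c_int32
    vk.vkEnumeratePhysicalDevices.argtypes = [
        ctypes.c_void_p, ctypes.POINTER(ctypes.c_uint32), ctypes.c_void_p]
    vk.vkDestroyInstance.restype = None
    vk.vkDestroyInstance.argtypes = [ctypes.c_void_p, ctypes.c_void_p]

    info = VkInstanceCreateInfo(sType=VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO)
    instance = ctypes.c_void_p()
    if vk.vkCreateInstance(ctypes.byref(info), None, ctypes.byref(instance)) != VK_SUCCESS:
        return result
    result["instance_created"] = True

    # Passing NULL for the device array asks only for the device count.
    count = ctypes.c_uint32(0)
    if vk.vkEnumeratePhysicalDevices(instance, ctypes.byref(count), None) == VK_SUCCESS:
        result["device_count"] = count.value
    vk.vkDestroyInstance(instance, None)
    return result


if __name__ == "__main__":
    print(probe_vulkan())
```

Because each stage is reported separately, a run distinguishes "no loader installed" from "loader present but no usable device", which is the question the probe is meant to answer.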

The workflow now invokes it with system Python on the runner. Probe
moves to the first runnable step (right after instance-state report)
so diagnostic data appears in ~30 seconds instead of after the 15-min
isaaclab.bat -i install. All other test steps gated off (`if: false`)
for now while we iterate; aggregate gates the job purely on the
probe's outcome. Disabled-perception context block left intact for
the next maintainer.

The Windows runner GPUs are now in WDDM mode, so Kit's RTX/Vulkan path
can enumerate a device. Re-enable the camera perception smoke and the
perception subcase of the cartpole training smoke that were gated off
under the data-center (TCC) driver, and add perception to the aggregate
gating and report artifacts.

Native Windows installed isaacsim from the public pip index (pinned to
the 5.1.0 release in source/isaaclab/setup.py), while the Linux/ARM CI
runs the develop-branch Isaac Sim container. Windows therefore tested a
different, older Sim than the rest of the matrix.

Resolve the develop-aligned build from the internal Artifactory index and
pin it, verifying the build's commit is on omni_isaac_sim develop when a
gitlab token is available and falling back to the newest 6.0.0 build with
a warning otherwise. Install Isaac Sim from that index, then install
IsaacLab without the isaacsim/all extras that would re-pin the public
release. Add tools/resolve_isaacsim_develop.py and its unit tests.

The internal-index egress and a develop win_amd64 wheel are CI-infra
prerequisites tracked separately.

The internal Artifactory index that serves the develop-aligned Isaac Sim
wheels dropped anonymous access, so the native Windows install path now
needs credentials.

Add ISAACSIM_ARTIFACTORY_READONLY_USERNAME / _PASSWORD to the setup step:
resolve_isaacsim_develop.py reads them from the environment and sends a
Basic auth header on the simple-index fetch, and the uv pip install builds
authenticated --extra-index-url values from them. This puts native Windows
on the same internal develop registry the Linux/ARM CI uses.
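
As a sketch of the two credential uses described above (a Basic auth header for the index fetch and an authenticated index URL for the installer), assuming standard HTTP Basic authentication. The function names are hypothetical and the actual resolve_isaacsim_develop.py may differ; only the two environment-variable names come from this PR.

```python
import base64
import os
from urllib.parse import quote, urlsplit, urlunsplit


def read_credentials(env=os.environ):
    """Read the read-only Artifactory credentials; fail loudly if empty."""
    user = env.get("ISAACSIM_ARTIFACTORY_READONLY_USERNAME", "")
    password = env.get("ISAACSIM_ARTIFACTORY_READONLY_PASSWORD", "")
    if not user or not password:
        raise RuntimeError("Artifactory credentials are missing or empty")
    return user, password


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization: Basic ...' header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def authenticated_index_url(index_url: str, username: str, password: str) -> str:
    """Embed percent-encoded credentials in an index URL."""
    parts = urlsplit(index_url)
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{parts.netloc}"))
```

Percent-encoding matters because a password containing `@` or `:` would otherwise corrupt the URL. Failing early on empty values makes an unprovisioned secret show up as a clear error instead of an opaque 401.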
Force-skip the heavy install/build/multi-GPU PR workflows while iterating
Windows CI on this PR, to save runner time and cost during the back-and-
forth. Each guard is marked TEMP and reverts before final review; build.yaml
already does the same for the Docker test matrix.

- install-ci.yml: force run_install_tests=false
- wheel.yml: force run_build=false (detect step still runs, check stays green)
- license-check.yaml: job-level if:false
- test-multi-gpu.yaml: job-level if:false (this PR touches app_launcher.py,
  which would otherwise trigger the multi-GPU self-hosted runners)

…park-ci-perception

# Conflicts:
#	pyproject.toml

github-actions bot added the isaac-lab and infrastructure labels Jun 9, 2026

isaaclab-review-bot bot left a comment

🤖 Isaac Lab Review Bot — PR #6086

[CI] Cross-platform — Part 3: Windows workflow
Verdict: Good overall architecture; a few items to address before merge.


Findings

🔴 Critical: TEMP workflow-skip blocks must be reverted before merge

Files: build.yaml, install-ci.yml, license-check.yaml, test-multi-gpu.yaml, wheel.yml

Five existing workflows are force-disabled (if: false or hardcoded 'false' outputs). The PR body calls these out as "TEMP (revert before final review)" — confirming the author intends to revert them. However, if these slip through merge, the entire Linux/multi-GPU CI surface goes dark on develop. Consider gating on a [skip-other-ci] label or environment variable instead of modifying production workflow files, so there's no risk of accidental merge without revert.

🟡 Warning: windows-ci job fails on current HEAD — setup step exits non-zero

The latest CI run (job windows-ci) fails at "Setup venv + install develop-aligned Isaac Sim" (and all downstream test steps are skipped). This suggests either the Artifactory secrets are not yet provisioned for this branch/repo context, or resolve_isaacsim_develop.py can't reach the index. Since the PR explicitly validates secret wiring, this may be expected iteration — but merging with a red required check risks blocking the merge queue for other PRs if windows-ci becomes required.

🟡 Warning: --timeout-method=thread has known limitations on Windows

The workflow uses pytest --timeout-method=thread (correctly noting SIGALRM is Unix-only). However, thread-based timeouts cannot interrupt blocking native/C calls (e.g., Kit/Vulkan hangs in driver code). The Start-Process + WaitForExit watchdog pattern used in the Kit-launch and perception steps is the correct mitigation — but the test_cartpole_training_smoke.py tests use subprocess.run(timeout=600) without an external watchdog. If the subprocess itself hangs inside a C extension, subprocess.run may not reliably kill the process tree on Windows. Consider wrapping with Start-Process + WaitForExit like the other Kit-launching steps for consistency.

🔵 Suggestion: pytestmark placement in test_configclass.py is mid-import

In source/isaaclab/test/utils/test_configclass.py, the pytestmark = pytest.mark.windows_ci line is inserted between two import blocks (after from isaaclab.utils.configclass import ... but before from isaaclab.utils.dict import ...). While functionally correct (pytest reads module-level pytestmark regardless of position), this is unusual and could confuse linters or readers expecting all imports grouped. Move it after all imports for clarity.

🔵 Suggestion: Hardcoded C:\Program Files\Git\bin\bash.exe path in wheel-build step

The wheel-build step assumes Git Bash is installed at the standard path. If a self-hosted runner has Git installed elsewhere (e.g., via Chocolatey to a non-default path), this will fail silently. Consider using Get-Command bash or where.exe bash with a fallback, or document the runner prerequisite.
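
The suggested lookup-with-fallback can be sketched as below. The workflow step itself is PowerShell, so this Python version only illustrates the logic; `find_bash` is a hypothetical name.

```python
import os
import shutil

DEFAULT_GIT_BASH = r"C:\Program Files\Git\bin\bash.exe"


def find_bash(fallback: str = DEFAULT_GIT_BASH):
    """Prefer bash on PATH; fall back to the default Git for Windows path."""
    found = shutil.which("bash")
    if found:
        return found
    # Return None rather than a non-existent path so the caller can fail
    # with an explicit "bash not found" message.
    return fallback if os.path.isfile(fallback) else None
```

Returning `None` instead of raising leaves the error wording to the caller, which can then name the runner prerequisite explicitly.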

🔵 Suggestion: resolve_isaacsim_develop.py could benefit from a --timeout CLI arg

The script uses a hardcoded 30s timeout for HTTP requests. On slow/flaky corporate networks (common for internal Artifactory), this could cause intermittent failures. Exposing it as a CLI argument (defaulting to 30s) would improve debuggability without code changes.
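
A minimal sketch of what such an argument could look like with argparse. The parser below is illustrative only and does not reflect the script's real option set.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Resolve the develop-aligned Isaac Sim build (sketch)."
    )
    parser.add_argument(
        "--timeout",
        type=float,
        default=30.0,
        help="HTTP request timeout in seconds (default: 30)",
    )
    return parser


if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    print(f"using HTTP timeout of {args.timeout:g}s")
```

The parsed value would then be passed to whichever HTTP call the script makes, keeping 30 seconds as the default.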


Summary

Well-structured Windows CI pipeline with solid watchdog patterns for Kit hangs, proper DLL path setup, and cross-platform temp-dir fixes. The main concern is the TEMP workflow disablement — ensure those are reverted before merge to avoid silencing Linux CI on develop. The setup failure on the current run needs investigation (likely secrets provisioning) before this can go green.

Reviewed at: ad92f43

hujc7 commented Jun 9, 2026

Collaborator Author

Closing. This same-repo PR was opened to test whether fork-PR secret withholding was the blocker. It is not: the Windows CI guard still reports the secrets as empty on this in-repo run, so the actual problem is that the ISAACSIM_ARTIFACTORY_READONLY_* secrets are not scoped to isaac-sim/IsaacLab at all, independent of the fork rule. Continuing on the original fork PR #5700.

hujc7 closed this Jun 9, 2026
hujc7 deleted the jichuanh/windows-spark-ci-perception branch June 9, 2026 22:35
hujc7 restored the jichuanh/windows-spark-ci-perception branch June 9, 2026 23:23
Labels

infrastructure, isaac-lab
