[CI] Cross-platform - Part 3: Windows workflow #6086
Foundation for cross-platform CI. Registers four pytest markers (windows, windows_ci, arm, arm_ci), teaches AppLauncher to recognize them in argv so they do not leak into Isaac Sim's argparse, and moves the AssetConverterBase USD scratch directory from a hardcoded /tmp/IsaacLab to tempfile.gettempdir() for cross-platform compatibility. Tags source/isaaclab/test/deps/test_torch.py and test_scipy.py with the new markers so they are selectable by future cross-platform jobs. Workflow files (arm-ci.yaml, windows-ci.yaml) ship in follow-up PRs.
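As a rough illustration of the scratch-directory change, the sketch below derives the path from tempfile.gettempdir() rather than a hardcoded /tmp/IsaacLab. The helper name `usd_scratch_dir` and its `subdir` parameter are assumptions made for this example; the actual AssetConverterBase code may be organized differently.

```python
import os
import tempfile


def usd_scratch_dir(subdir: str = "IsaacLab") -> str:
    """Return a scratch directory under the platform temp dir, creating it if needed.

    Hypothetical helper for illustration only; not the AssetConverterBase API.
    """
    # gettempdir() resolves to /tmp on most Linux hosts and to %TEMP% on Windows.
    path = os.path.join(tempfile.gettempdir(), subdir)
    os.makedirs(path, exist_ok=True)
    return path
```

The same call works unchanged on Linux, macOS, and Windows, which is the point of the move.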
Same shape as arm-ci.yaml, but the install path is native pip + uv on the Windows host (no Docker for Linux-based Isaac Sim wheels).
Jobs (all continue-on-error: true):
Tier 1: general-windows, install-windows, kit-launch-windows
Tier 2: path-io-windows, perception-windows
Every pytest invocation passes --timeout=N + --timeout-method=thread (the signal method relies on SIGALRM, which is unavailable on Windows) plus --continue-on-collection-errors, so a hung test cannot consume the full job slot and a broken neighbor file does not poison the marker-driven discovery.
perception-windows wraps the cartpole-camera smoke in an inline Python script with explicit assertions and an inner watchdog thread that aborts the process after 180s. This replaces the previous pattern where Vulkan init failures hung the job instead of erroring.
Tags four path-IO test files (test_configclass, test_dict, test_episode_data, test_hdf5_dataset_file_handler) with the windows_ci marker so path-io-windows picks them up via marker-driven discovery.
Forces run_docker_tests=false in build.yaml's changes job so all gated test jobs skip via their existing if-gate. Must be reverted before final review.
Kit bootstrap aborts on the Windows runner with 'Unable to bootstrap inner kit kernel: EOF when reading a line' when stdin is not a tty and no EULA env vars are set. Set OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA / PRIVACY_CONSENT at the workflow level so every job inherits them.
Bare 'isaacsim' on Windows pulls only isaacsim + isaacsim-kernel; Kit bootstrap then warns 'PYTHONPATH path doesn't exist (...site-packages/isaacsim/exts/isaacsim.simulation_app)' / 'Unable to expose isaacsim.simulation_app API: Extension not found', and 'from isaacsim import SimulationApp' resolves to None, so AppLauncher dies with 'TypeError: NoneType object is not callable'. Match install.py / wheel_builder canonical spec: isaacsim[all]>=6.0.0.
Pytest collection over source/isaaclab/test imports sensors/test_tiled_camera_env.py whose module-level argparse.parse_args consumes pytest's --ignore=... / -m windows_ci flags and INTERNALERRORs collection (collected 595 items / 48 errors). The windows_ci-tagged path-IO tests on this branch all live in test/utils, so narrow the pytest scope to that subdir — keeps the marker filter intact without forcing every test file in the tree to be importable bare.
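The collection failure described here can be reproduced in miniature. The sketch below assumes a test module that builds an argparse parser at import time; the `--num_envs` flag and the sample argv are hypothetical. It contrasts parse_args(), which exits on flags it does not own, with parse_known_args(), which leaves them alone.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--num_envs", type=int, default=1)  # hypothetical test flag
    return parser


# Roughly what sys.argv[1:] holds when pytest imports the module during collection.
pytest_argv = ["-m", "windows_ci", "--ignore=tools/conftest.py", "--num_envs", "4"]

# parse_args() rejects pytest's flags and raises SystemExit, which aborts collection.
try:
    build_parser().parse_args(pytest_argv)
    strict_ok = True
except SystemExit:
    strict_ok = False

# parse_known_args() consumes only its own flags and returns the rest untouched.
args, leftover = build_parser().parse_known_args(pytest_argv)
```

Narrowing the pytest scope, as this commit does, avoids importing such modules at all; switching them to parse_known_args() would be an alternative that this PR does not take.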
…lder) Bare 'isaacsim[all]' on Windows fails Kit startup with 'ImportError: cannot import name get_metrics_assembler_interface from omni.metrics.assembler.core (unknown location)' — the extension is registered but its implementation isn't on disk because the extscache extra wasn't requested. wheel_builder/res/python_packages.toml pins 'isaacsim[all,extscache]==6.0.0.*' for exactly this reason; mirror it.
import isaaclab_tasks walks all task packages, which transitively touches GroundPlaneCfg.physics_material -> isaaclab.sim.spawners.materials forwarding shim, which raises 'RigidBodyMaterialCfg has moved to isaaclab_physx.sim.spawners.materials. Install the isaaclab_physx extension or update your import.' Install it editable before isaaclab_assets / isaaclab_tasks so the shim resolves.
isaaclab-physx==1.1.0 declares a hard dep on isaaclab-ppisp which is not in source/ and not on any package index, so uv refuses the install with 'isaaclab-ppisp was not found in the package registry'. The ppisp import in isaaclab_physx is lazy (runtime, not at import), so --no-deps gets us a working editable install. Mirrors the same workaround used by the ARM-side install path (see install.py).
Three distinct gaps surfaced in path-io-windows on commit 683c110:
1. test_episode_data[cuda:0] parametrize: 'Torch not compiled with CUDA enabled'. The default torch wheel on PyPI for Windows is CPU-only. Install torch + torchvision from download.pytorch.org/whl/cu128.
2. test_hdf5_dataset_file_handler: 'No module named h5py'. h5py was never declared by the isaaclab core dep set on Windows. Install it.
3. test_version.py / test_wrench_composer_*.py: KeyError 'EXP_PATH' at collection. Those files instantiate AppLauncher at module load and need an Isaac Sim install that path-IO does not provide. Replace the '-m windows_ci' marker filter (which still imports every file in test/utils for collection) with explicit windows_ci-tagged file paths.
Also drop --ignore=tools/conftest.py since no conftest sits under utils/.
The Windows runner reports 'vkEnumeratePhysicalDevices failed. No physical device is found.' / 'Failed to create any GPU devices' when Kit boots with --enable_cameras=True. Kit then hangs (the in-script 3-min watchdog can't reliably preempt a C-level GIL-held call), the job consumes its full timeout-minutes, and every other queued job on the same runner gets cancelled. Set the perception job's 'if' to false so it never claims the runner. Also tighten timeout-minutes from 30 to 10 so even when re-enabled it fails fast rather than starving siblings. Flip 'if' back to needs.changes.outputs.run_windows_ci == 'true' once the runner is confirmed GPU-capable.
Python thread watchdogs cannot preempt a Kit/Vulkan init that hangs in a C call holding the GIL. This was observed on this runner, where the 3-min in-script time.sleep + os._exit never fired and perception_smoke held the Windows runner for the full 40-min job timeout, starving every other job.
Replace the thread watchdog inside perception_smoke.py with a PowerShell Start-Process + WaitForExit at the shell layer (OS-level process kill, immune to GIL). Apply the same pattern to kit-launch-windows's inline python invocation.
Tighten per-job timeout-minutes:
general-windows 30 -> 15
install-windows 45 -> 30
kit-launch 30 -> 15
path-io 30 -> 15
The hard upper bound is now the second line of defence; the PowerShell watchdog catches runaway python first.
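The workflow implements this watchdog in PowerShell; the sketch below is only a Python analog of the same idea, with a hypothetical helper name `run_with_watchdog`. The key property is that the kill is issued by the parent process, so it does not depend on the hung child being able to execute any Python code.

```python
import subprocess
import sys


def run_with_watchdog(cmd: list, timeout_s: float) -> tuple:
    """Run cmd as a child process and kill it if it outlives timeout_s.

    Returns (exit_code, timed_out). Hypothetical helper, not the workflow's code.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        proc.kill()  # TerminateProcess on Windows, SIGKILL on POSIX
        proc.wait()  # reap the child so no handle or zombie is left behind
        return proc.returncode, True


if __name__ == "__main__":
    # A child that sleeps far past the limit stands in for a hung Kit boot.
    print(run_with_watchdog([sys.executable, "-c", "import time; time.sleep(30)"], 2.0))
```

Note that this kills only the direct child, not any grandchildren it spawned; a process-tree kill needs extra handling on Windows.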
PowerShell on the Windows runner doesn't have bash on PATH:
bash : The term 'bash' is not recognized as the name of a cmdlet ...
Git for Windows installs bash.exe at C:\Program Files\Git\bin\bash.exe; invoke it directly with a Test-Path guard and exit-code check so failures fast-fail.
build.sh hardcoded python3. Linux installs expose python3 (and that
remains the default), but Windows git-bash only has python (no python3
symlink), so the build was dying with 'python3: command not found'
the moment install-windows tried to run the canonical wheel build.
Make build.sh use ${PYTHON:-python3} for every interpreter call and
pass PYTHON=python from the Windows workflow before invoking it. Linux
behavior unchanged; one variable lets Windows reuse the same script.
PowerShell on the Windows runner reads the yaml as a non-UTF-8 code page; em-dashes (U+2014) inside the Write-Host string literals got mojibake'd to 'â€"' and tripped the parser:
ParserError: TerminatorExpectedAtEndOfString
Replace the two affected em-dashes with ASCII '-'. Comment-line em-dashes elsewhere in the file are harmless (tokenizer skips them) and stay as-is to avoid touching unrelated lines.
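The mis-decoding can be reproduced outside PowerShell. The sketch below assumes the code page in question is cp1252 and shows what the three UTF-8 bytes of U+2014 turn into when read that way.

```python
EM_DASH = "\u2014"

# UTF-8 encodes the em-dash as three bytes: E2 80 94.
utf8_bytes = EM_DASH.encode("utf-8")

# Reading those bytes as cp1252 yields three separate characters.
misread = utf8_bytes.decode("cp1252")

# List the resulting code points to see exactly what the parser receives.
code_points = [f"U+{ord(ch):04X}" for ch in misread]
```

The last of the three characters is U+201D, a right double quotation mark. PowerShell's grammar appears to accept curly double quotes as string delimiters, which would explain why a string containing the mis-decoded dash ends early and the parser reports a missing terminator.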
build.sh runs 'python -m pip install build wheel' inside the venv. uv venv ships without pip by default, so this failed with C:\...\env_isaaclab_uv\Scripts\python.exe: No module named pip right after gen_pyproject.py emitted the generated pyproject.toml. Add --seed to the install-windows venv create so pip / setuptools / wheel land inside the venv; the other 3 jobs don't call build.sh and keep the lighter seedless venvs.
Flips perception-windows from 'if: false' back to the standard needs.changes.outputs.run_windows_ci gate. The PowerShell process-level watchdog around the inline Kit boot stays as the inner guard; the tightened 10-min job timeout-minutes is the outer guard so a Vulkan init regression cannot starve other queued jobs again.
The watchdog used $proc.Kill($true), which compiles on .NET 5+ but not on PowerShell 5.1's .NET Framework (Process.Kill has no (bool) overload there). It still surfaced 'MethodCountCouldNotFindBest' on the runner after the kill ::error was emitted. Switch to Stop-Process -Id $proc.Id -Force -ErrorAction SilentlyContinue which is PS5-native and idempotent.
Adds .github/actions/windows-instance-state composite action with a
single 'phase' input:
pre : print disk free + sizes of cache and user-state dirs
post : print state, wipe non-cache user state, print state again
Each of the 5 Windows-runner jobs now reports state right after
checkout (BEFORE) and at the end with if: always() (AFTER), so any
poisoned state shows up immediately and the runner is left net-zero
outside intentional content caches.
Cleaned in 'post' (state, chain-risk):
%APPDATA%\NVIDIA Corporation\Omniverse Kit
%USERPROFILE%\Documents\Kit
%TEMP%\Kit* / hub-* / omniverse-* crash scratch dirs
%APPDATA%\Python\Python312\site-packages\{build,wheel} (escaped
from build.sh's pip install --user fallback)
Kept across runs (content-addressed, no chain):
%LOCALAPPDATA%\uv\cache
%LOCALAPPDATA%\pip\Cache
%LOCALAPPDATA%\NVIDIA\Omniverse (Kit shader cache; invalidated
by Kit itself on version mismatch)
Extend the workflow-level env block with the headless/no-window/EULA flags that PR #4018's known-working build.yml proved out:
ISAACSIM_ACCEPT_EULA=YES # different layer from ACCEPT_EULA
HEADLESS=1, ISAAC_SIM_HEADLESS=1, ISAAC_SIM_LOW_MEMORY=1
WINDOWS_PLATFORM=true
OMNI_KIT_NO_WINDOW=1 # critical: blocks Kit from trying to open a display when no desktop session
OMNI_KIT_DISABLE_WATCHDOG=1, OMNI_KIT_TELEMETRY=0
CARB_LOGGING_SEVERITY=error
PYTHONUNBUFFERED=1, PYTHONIOENCODING=utf-8
Add .github/actions/windows-sim-paths/ composite action that re-activates the caller's venv, resolves the Isaac Sim install root via pip show isaacsim-kernel, and exports:
ISAAC_PATH, CARB_APP_PATH (sim/kit), EXP_PATH (workspace/apps), RESOURCE_NAME
It also prepends <sim>/kit/plugins and <sim>/bin to PATH so the Vulkan loader can find NVIDIA's ICD DLLs (likely root cause of 'vkEnumeratePhysicalDevices failed. No physical device is found.' on this runner; DLL search defaults do not include the Sim install).
Wire into kit-launch-windows and perception-windows by splitting their 'install + launch' steps into three: install isaacsim, resolve Sim paths (this action), boot Kit. Install-windows and path-io-windows don't boot Kit so don't need this.
Extend the windows-instance-state action's report with nvidia-smi output so 'no GPU' vs 'GPU present, Vulkan can't load' is visible in every job's pre-state dump. Also harden the size measurement against junctions/reparse points that have no Length property (suppresses the GenericMeasurePropertyNotFound noise observed in the previous run).
'python -m pip show isaacsim-kernel' inside the uv venv failed with 'No module named pip' because uv venvs are created without seeding pip / setuptools / wheel by default. uv itself can introspect the venv (it tracks its own install metadata) so 'uv pip show' is the correct lookup here.
PowerShell treats 'Using Python 3.12.13 environment at: env_isaaclab_uv' (uv banner on stderr) as a NativeCommandError record when captured via '2>&1' under $ErrorActionPreference='Stop', failing the step before parsing the Location: line. Drop the 2>&1 so stderr just streams to the host log; rely on $LASTEXITCODE for failure detection.
Also surfaces an important data point this run captured for free:
nvidia-smi: NVIDIA L40S, 582.53, 46068 MiB
The runner DOES have a real GPU. The earlier 'vkEnumeratePhysicalDevices failed' was DLL-discovery, not GPU absence, which is exactly what this PATH prepend (Sim bin + kit/plugins) is supposed to fix once the path resolution runs cleanly.
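The Location: lookup itself is simple once stdout is captured on its own. The composite action does this in PowerShell; the Python sketch below shows the equivalent parse, using a hypothetical helper name and an invented sample of `pip show` style output.

```python
from typing import Optional


def parse_show_location(show_output: str) -> Optional[str]:
    """Extract the 'Location:' field from `pip show` / `uv pip show` style output."""
    for line in show_output.splitlines():
        # partition() splits at the first colon only, so a Windows drive
        # letter such as C:\ in the value is left intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Location":
            return value.strip()
    return None


# Illustrative stdout only; the real path on the runner will differ.
sample = "Name: isaacsim-kernel\nVersion: 6.0.0\nLocation: C:\\venv\\Lib\\site-packages\n"
location = parse_show_location(sample)
```

Because only stdout is parsed, the uv banner on stderr never reaches this function.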
Duplicate test_cartpole_training_smoke.py from PR #5698's branch so PR #5700 doesn't chain on it. Cross-platform tweaks vs ARM's copy:
- pytestmark = [arm_ci, windows_ci] # dual marker
- _LAUNCHER picks isaaclab.bat on Windows, isaaclab.sh elsewhere
Add training-smoke-windows job that pytests this file in the same install + Sim-paths context as perception-windows. continue-on-error true and timeout-minutes 30 mirror the other Windows jobs.
State case (Isaac-Cartpole-Direct-v0 / rsl_rl) should pass on TCC: no RTX, no Vulkan touch.
Perception case (Isaac-Cartpole-RGB-Camera-Direct-v0 / rl_games) needs Vulkan and will fail on this runner until WDDM is enabled.
Whichever of #5698 / #5700 merges first wins the test file; the other PR will drop the duplicate on rebase.
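The launcher selection can be sketched as follows. The function name `launcher_for` and its parameters are assumptions for illustration; the test file itself uses a module-level `_LAUNCHER` whose exact definition is not shown here.

```python
import sys
from pathlib import Path


def launcher_for(platform: str = sys.platform, repo_root: Path = Path(".")) -> Path:
    """Pick the Isaac Lab launcher script for a given sys.platform value."""
    # sys.platform is "win32" on native Windows, including 64-bit builds.
    name = "isaaclab.bat" if platform.startswith("win") else "isaaclab.sh"
    return repo_root / name
```

Taking the platform as a parameter keeps the choice testable from any host, for example `launcher_for("win32")` on a Linux machine.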
test_cartpole_training_smoke.py invokes
scripts/reinforcement_learning/rsl_rl/train.py (state case)
scripts/reinforcement_learning/rl_games/train.py (perception case)
Both train scripts import rsl_rl / rl_games as their first non-stdlib imports, and the previous Windows training-smoke install didn't pull either, so both cases hit:
ModuleNotFoundError: No module named 'rsl_rl'
ModuleNotFoundError: No module named 'rl_games'
isaaclab_rl/setup.py declares these as extras [rsl_rl] / [rl_games]; install the editable package with both extras so the framework packages (rsl-rl-lib + rl-games) end up in the venv.
Same coverage as before (deps smoke + path-IO + kit-launch + cartpole training smoke + perception + wheel build), but as sequential steps inside a single runs-on: [self-hosted, gpu-windows] job.
Why:
1. Single venv create + single isaacsim install shared across all test steps. Saves ~5 venv setups (~3 min each = ~15 min wall).
2. The runner gets ONE allocation, stays continuously busy, never sees an inter-job idle gap. Autoscaler can't tear it down and strand queued siblings (the cancellation cascade we kept hitting).
3. Same affinity guarantee as Linux/ARM single-job model: every test step touches the same runner's filesystem and Sim install.
Each test step has continue-on-error: true and writes its own JUnit XML. A final aggregate step parses outcomes and fails the job iff any non-perception step failed. perception is gated as 'warning, not failure' until the runner pool fixes TCC->WDDM, so the workflow doesn't lie about overall status while still surfacing the failure clearly.
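The aggregate rule (fail iff any non-perception step failed, warn on perception) can be written down compactly. The workflow does this in PowerShell; the Python sketch below only illustrates the logic, and the step names and the `aggregate` helper are hypothetical. The outcome strings follow GitHub Actions' `steps.<id>.outcome` values.

```python
def aggregate(outcomes: dict, non_gating: frozenset = frozenset({"perception"})) -> tuple:
    """Split failed steps into gating failures and non-gating warnings."""
    failed = [step for step, o in outcomes.items() if o == "failure" and step not in non_gating]
    warned = [step for step, o in outcomes.items() if o == "failure" and step in non_gating]
    return failed, warned


# Hypothetical step outcomes for one run.
outcomes = {
    "deps-smoke": "success",
    "path-io": "failure",
    "kit-launch": "success",
    "perception": "failure",
}
failed, warned = aggregate(outcomes)
exit_code = 1 if failed else 0  # the job fails only on gating failures
```

Keeping perception in a separate warning list means its failure stays visible in the log without deciding the job's status.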
The self-hosted Windows runner uses an NVIDIA L40S, a Data Center GPU. On bare-metal Windows, NVIDIA's data-center driver does not expose graphics APIs (OpenGL/Vulkan/DirectX) for these SKUs regardless of TCC vs WDDM driver mode; per the Data Center GPU driver release notes, vGPU is required to expose them. Kit's boot path reflects this exactly: vkEnumeratePhysicalDevices returns no devices, gpu.foundation logs "TCC is not supported. GPU(s) should be in WDDM mode.", and Kit then hangs in omni.gpu_foundation_factory until the OS-level watchdog fires. Comment out the perception step (preserve verbatim for restoration), drop the now-dangling perception_smoke.py artifact path and the steps.test-perception.outcome reference in the Aggregate step, and note in the file header that perception is disabled. The disabled-step context block lists the three independent unblock criteria (vGPU on L40S, swap runner SKU, or move perception coverage to Linux) so the next maintainer can pick whichever lands first.
The cross-platform CI series adds source/isaaclab_tasks/test/test_cartpole_training_smoke.py without a paired fragment, so the nightly Check changelog fragments gate currently rejects the PR. Add a .skip entry under source/isaaclab_tasks/changelog.d/ matching the existing source/isaaclab/changelog.d/jichuanh-windows-ci.skip convention (CI/test-only, no user-facing API change).
Three related changes that together unblock the consolidated windows-ci job from the latent failures uncovered once the perception step stopped masking everything else:
* Install `isaacsim[all,extscache]==6.0.0.*` BEFORE the cu128 torch upgrade. `isaacsim` pulls CPU torch transitively and was silently overwriting the cu128 wheel installed earlier; `[cuda:0]`-parametrized cases in Deps smoke and Path-IO then fail with "Torch not compiled with CUDA enabled". The new order mirrors install.py (_install_isaacsim() then _ensure_cuda_torch()).
* Install `source/isaaclab_newton` with `--no-deps`. cartpole_env_cfg.py imports `isaaclab_newton.physics` at module load, so every cartpole task fails with `ModuleNotFoundError: No module named 'isaaclab_newton'` without it. Same `--no-deps` reason as isaaclab_physx (both declare a bare-name dep on isaaclab_ppisp that's not yet on this branch nor on any index; the ppisp import is lazy at runtime). The smoke-import line is extended so this regression fails fast in setup, not in a later test step.
* Replace the em-dash in the Aggregate step's `Write-Host "::error::"` with an ASCII hyphen. PowerShell 5.1 reads the temp .ps1 as cp1252, so the 3-byte UTF-8 em-dash mis-decodes inside the string and the closing quote is mis-detected, raising "The string is missing the terminator". The path was never executed before because `$failed` was always empty (only perception had failed, and it was excluded from the gating set).
The temp_dir fixture used `tempfile.mkdtemp()` + `shutil.rmtree()` for cleanup. On Windows, h5py's libhdf5 keeps an internal handle to the file briefly after `.close()`, so `rmtree` races with the handle release and raises `PermissionError [WinError 32]` on teardown of `test_write_and_load_episode[cuda:0]`. The assertions had already passed; only the cleanup was failing. Switch to `tempfile.TemporaryDirectory(ignore_cleanup_errors=True)` (Python 3.10+). On Linux/macOS this flag is a no-op since no cleanup error is raised; on Windows it absorbs the libhdf5 handle-release race without masking real failures (the test body still asserts via the explicit `dataset_file_handler.close()` calls). Drop the now-unused `shutil` import.
Pull in source/isaaclab_ppisp (the ppisp package missing from this branch) and the updated install.py that includes isaaclab_ppisp and isaaclab_newton in CORE_ISAACLAB_SUBMODULES. With ppisp present, the workflow no longer needs --no-deps workarounds for isaaclab_physx / isaaclab_newton; the subsequent commit collapses the hand-rolled pip sequence into a single ./isaaclab.bat -i call.
Replace the hand-rolled `uv pip install ...` sequence in the setup step with a single `.\isaaclab.bat -i 'isaacsim,rl[rsl_rl,rl_games]'` call, now that the develop merge brings in `source/isaaclab_ppisp/` and the updated install.py that includes `isaaclab_ppisp` and `isaaclab_newton` in CORE_ISAACLAB_SUBMODULES.
The hand-rolled sequence had grown three latent issues, all of which the canonical install.py path avoids:
* Install order: `_install_isaacsim()` runs before `_ensure_cuda_torch()` inside install.py, so isaacsim's transitive CPU torch can't shadow the cu128 wheel. The previous hand-rolled order had the cu128 upgrade first and broke `[cuda:0]`-parametrized tests.
* Missing isaaclab_newton: install.py walks CORE_ISAACLAB_SUBMODULES, so isaaclab_newton is installed automatically. cartpole_env_cfg.py's import of `isaaclab_newton.physics` no longer fails.
* No more --no-deps workarounds: with `source/isaaclab_ppisp/` present the renderer-backend bare-name dep resolves through the local editable install.
The workflow keeps the test-only `pytest pytest-timeout h5py` install (install.py doesn't carry pytest plumbing) and the post-install smoke import. Setup-step body shrinks from ~25 lines to ~3 substantive lines.
Matches the "Mirror Linux CI setup for new platforms" rule: same entry point as Linux CI (`./isaaclab.sh -i`), so install-order bugs and new core submodules are picked up automatically when install.py changes.
PowerShell / pytest commands inside YAML run: blocks render as plain text in editors without an embedded-language highlighter, so heavy inline commentary inside those blocks becomes visual noise rather than documentation. Strip it. Inter-step comments (section headers, pre-step rationale, the disabled-perception context block) are kept — those sit at the YAML level and read fine without syntax-highlighting help. Net: -80 lines, mostly redundant restatement of what surrounding identifiers and commit history already make clear.
`test_train_cartpole_perception` builds Isaac-Cartpole-RGB-Camera-Direct-v0 which boots Kit with `enable_cameras=True`, hits the L40S TCC / no-vGPU Vulkan path, and hangs until the pytest 600s timeout fires (logs show `Stack of MainThread` thread dumps). Same blocker as the disabled standalone perception smoke. Filter the training-smoke pytest invocation with `-k 'not perception'` so the state subcase (Isaac-Cartpole-Direct-v0 + rsl_rl) is the only case exercised on the current Windows runner pool. Latest CI run shows the state subcase passes in ~30s. Drop the filter when the L40S vGPU unblock criterion lands (same condition tracked in the disabled perception step's context block).
Independent probe of the Vulkan loader on the runner, separate from Kit. Captures nvidia-smi driver+display info, lists vulkan-1.dll and ICD registry entries, and runs vulkaninfo --summary if available (falls back to a ctypes-based vkCreateInstance + vkEnumeratePhysicalDevices probe via the existing uv venv when the SDK isn't installed). Output goes to reports/vulkan-probe.txt and is included in the windows-ci-reports artifact. continue-on-error: true so the probe is informational only and does not gate the job. Added to the Aggregate $results listing for visibility. Background: PR 5700 perception step fails on the runner with "vkEnumeratePhysicalDevices failed. No physical device is found." + "TCC is not supported. GPU(s) should be in WDDM mode." Adding the direct vulkaninfo / loader probe answers the question of what the Vulkan ICD stack itself sees, independent of Kit's bootstrap path.
Last CI run's probe step parse-failed because PowerShell doesn't support bash heredoc (<<'PYEOF') and the YAML block scalar couldn't host an unindented PowerShell here-string for the embedded Python.
Move the ctypes Vulkan loader probe out of the workflow into a standalone tools/vulkan_probe.py:
* Loads vulkan-1.dll / libvulkan.so.1 via ctypes.
* Calls vkCreateInstance + vkEnumeratePhysicalDevices.
* Reports loader-load, instance-create, and physical-device count.
* No dependencies beyond the OS Vulkan loader; cross-platform.
The workflow now invokes it with system Python on the runner. Probe moves to the first runnable step (right after instance-state report) so diagnostic data appears in ~30 seconds instead of after the 15-min isaaclab.bat -i install.
All other test steps gated off (`if: false`) for now while we iterate; aggregate gates the job purely on the probe's outcome. Disabled-perception context block left intact for the next maintainer.
…ools/" This reverts commit 966e6d3.
…nfo)" This reverts commit cd1e739.
The Windows runner GPUs are now in WDDM mode, so Kit's RTX/Vulkan path can enumerate a device. Re-enable the camera perception smoke and the perception subcase of the cartpole training smoke that were gated off under the data-center (TCC) driver, and add perception to the aggregate gating and report artifacts.
Native Windows installed isaacsim from the public pip index (pinned to the 5.1.0 release in source/isaaclab/setup.py), while the Linux/ARM CI runs the develop-branch Isaac Sim container. Windows therefore tested a different, older Sim than the rest of the matrix. Resolve the develop-aligned build from the internal Artifactory index and pin it, verifying the build's commit is on omni_isaac_sim develop when a gitlab token is available and falling back to the newest 6.0.0 build with a warning otherwise. Install Isaac Sim from that index, then install IsaacLab without the isaacsim/all extras that would re-pin the public release. Add tools/resolve_isaacsim_develop.py and its unit tests. The internal-index egress and a develop win_amd64 wheel are CI-infra prerequisites tracked separately.
The internal Artifactory index that serves the develop-aligned Isaac Sim wheels dropped anonymous access, so the native Windows install path now needs credentials. Add ISAACSIM_ARTIFACTORY_READONLY_USERNAME / _PASSWORD to the setup step: resolve_isaacsim_develop.py reads them from the environment and sends a Basic auth header on the simple-index fetch, and the uv pip install builds authenticated --extra-index-url values from them. This puts native Windows on the same internal develop registry the Linux/ARM CI uses.
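A minimal sketch of the credential handling described here, assuming the script builds its request with urllib. The helper name `authed_request` and the example URL are hypothetical; only the two environment variable names come from this commit.

```python
import base64
import os
import urllib.request


def authed_request(url: str, env=os.environ) -> urllib.request.Request:
    """Build a request that carries a Basic auth header when credentials are set."""
    user = env.get("ISAACSIM_ARTIFACTORY_READONLY_USERNAME", "")
    password = env.get("ISAACSIM_ARTIFACTORY_READONLY_PASSWORD", "")
    req = urllib.request.Request(url)
    if user and password:
        # Basic auth is base64("user:password") per RFC 7617.
        token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
        req.add_header("Authorization", f"Basic {token}")
    return req


# Placeholder URL and credentials for illustration; no network call is made here.
example = authed_request(
    "https://index.example.invalid/simple/isaacsim/",
    {"ISAACSIM_ARTIFACTORY_READONLY_USERNAME": "u", "ISAACSIM_ARTIFACTORY_READONLY_PASSWORD": "p"},
)
```

When either variable is empty the request is returned without the header, which makes a missing secret show up as an authorization failure from the index rather than a crash in the script.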
Force-skip the heavy install/build/multi-GPU PR workflows while iterating Windows CI on this PR, to save runner time and cost during the back-and-forth. Each guard is marked TEMP and reverts before final review; build.yaml already does the same for the Docker test matrix.
- install-ci.yml: force run_install_tests=false
- wheel.yml: force run_build=false (detect step still runs, check stays green)
- license-check.yaml: job-level if:false
- test-multi-gpu.yaml: job-level if:false (this PR touches app_launcher.py, which would otherwise trigger the multi-GPU self-hosted runners)
…park-ci-perception # Conflicts: # pyproject.toml
🤖 Isaac Lab Review Bot — PR #6086
[CI] Cross-platform — Part 3: Windows workflow
Verdict: Good overall architecture; a few items to address before merge.
Findings
🔴 Critical: TEMP workflow-skip blocks must be reverted before merge
Files: build.yaml, install-ci.yml, license-check.yaml, test-multi-gpu.yaml, wheel.yml
Five existing workflows are force-disabled (if: false or hardcoded 'false' outputs). The PR body calls these out as "TEMP (revert before final review)" — confirming the author intends to revert them. However, if these slip through merge, the entire Linux/multi-GPU CI surface goes dark on develop. Consider gating on a [skip-other-ci] label or environment variable instead of modifying production workflow files, so there's no risk of accidental merge without revert.
🟡 Warning: windows-ci job fails on current HEAD — setup step exits non-zero
The latest CI run (job windows-ci) fails at "Setup venv + install develop-aligned Isaac Sim" (and all downstream test steps are skipped). This suggests either the Artifactory secrets are not yet provisioned for this branch/repo context, or resolve_isaacsim_develop.py can't reach the index. Since the PR explicitly validates secret wiring, this may be expected iteration — but merging with a red required check risks blocking the merge queue for other PRs if windows-ci becomes required.
🟡 Warning: --timeout-method=thread has known limitations on Windows
The workflow uses pytest --timeout-method=thread (correctly noting SIGALRM is Unix-only). However, thread-based timeouts cannot interrupt blocking native/C calls (e.g., Kit/Vulkan hangs in driver code). The Start-Process + WaitForExit watchdog pattern used in the Kit-launch and perception steps is the correct mitigation — but the test_cartpole_training_smoke.py tests use subprocess.run(timeout=600) without an external watchdog. If the subprocess itself hangs inside a C extension, subprocess.run may not reliably kill the process tree on Windows. Consider wrapping with Start-Process + WaitForExit like the other Kit-launching steps for consistency.
🔵 Suggestion: pytestmark placement in test_configclass.py is mid-import
In source/isaaclab/test/utils/test_configclass.py, the pytestmark = pytest.mark.windows_ci line is inserted between two import blocks (after from isaaclab.utils.configclass import ... but before from isaaclab.utils.dict import ...). While functionally correct (pytest reads module-level pytestmark regardless of position), this is unusual and could confuse linters or readers expecting all imports grouped. Move it after all imports for clarity.
🔵 Suggestion: Hardcoded C:\Program Files\Git\bin\bash.exe path in wheel-build step
The wheel-build step assumes Git Bash is installed at the standard path. If a self-hosted runner has Git installed elsewhere (e.g., via Chocolatey to a non-default path), this will fail silently. Consider using Get-Command bash or where.exe bash with a fallback, or document the runner prerequisite.
🔵 Suggestion: resolve_isaacsim_develop.py could benefit from a --timeout CLI arg
The script uses a hardcoded 30s timeout for HTTP requests. On slow/flaky corporate networks (common for internal Artifactory), this could cause intermittent failures. Exposing it as a CLI argument (defaulting to 30s) would improve debuggability without code changes.
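One possible shape for that option, as a sketch only: the parser below is hypothetical and does not reflect the script's actual CLI, beyond the 30s default mentioned above.

```python
import argparse


def parse_cli(argv=None) -> argparse.Namespace:
    """Parse a --timeout option that defaults to the current hardcoded 30 seconds."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    parser.add_argument(
        "--timeout",
        type=float,
        default=30.0,
        help="HTTP request timeout in seconds (default: 30).",
    )
    return parser.parse_args(argv)
```

The parsed value would then be passed through to whatever HTTP call the script makes, for example as the `timeout` argument of `urllib.request.urlopen`.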
Summary
Well-structured Windows CI pipeline with solid watchdog patterns for Kit hangs, proper DLL path setup, and cross-platform temp-dir fixes. The main concern is the TEMP workflow disablement — ensure those are reverted before merge to avoid silencing Linux CI on develop. The setup failure on the current run needs investigation (likely secrets provisioning) before this can go green.
Reviewed at: ad92f43
Closing. The same-repo PR was to test whether fork-PR secret withholding was the blocker. It isn't: the Windows CI guard still reports the secrets empty on this in-repo run, so the issue is that
Summary
Same-repo version of #5700, opened from a branch in isaac-sim/IsaacLab (not a fork) so the Windows CI job receives the org secrets it needs. GitHub does not deliver workflow secrets to pull_request runs from forks, so the fork-based #5700 cannot exercise the authenticated Isaac Sim install; this PR can.
Adds
- .github/workflows/windows-ci.yaml: CI pipeline for Windows GPU self-hosted runners, native (non-Docker) install path.
- Develop-aligned Isaac Sim resolution (tools/resolve_isaacsim_develop.py), so native Windows tests the same Sim build as the Linux/ARM develop containers instead of the older public pip release.
- Authenticated install via the ISAACSIM_ARTIFACTORY_READONLY_USERNAME / ISAACSIM_ARTIFACTORY_READONLY_PASSWORD org secrets (anonymous Artifactory access was removed).
- pytest --timeout so hung tests fail fast instead of hanging the job.
TEMP (revert before final review)
To save runner time/cost while iterating, the heavy non-Windows PR workflows are force-skipped (each marked TEMP): Docker + Tests, Installation Tests, Build PIP Wheel, License Check, Multi-GPU.
Notes
Test plan