feat: diagnose missing render nodes and surface SYCL crash errors #8

Open
offbyonebit wants to merge 2 commits into main from claude/exciting-dirac-BhnnF

Conversation

@offbyonebit

Owner

Summary

  • detect.py: Adds a note to DetectedGPU when a driver (xe/i915) is bound but no DRM render node appears in sysfs, which is the direct cause of SYCL's "No device of requested type available" crash. Also exposes a new render_nodes_in_dev() helper that checks /dev/dri/ independently of sysfs.

  • launcher.py: When no log_dir is provided, stderr is now captured to a temp file instead of being discarded to /dev/null. On early exit, wait_ready() calls _surface_crash_logs(), which reads the tail of that file and logs it. If the tail contains the canonical SYCL no-device message, it also emits a numbered checklist (render nodes, render group, dmesg, sycl-ls). The temp file is deleted in stop().

  • cli.py: arc-llama doctor gains three new sections:

    1. /dev/dri/ render nodes — lists all renderD* entries (or explains why the directory is missing, e.g. in a container)
    2. sycl-ls device enumeration — runs sycl-ls if available and shows its output, making it immediately obvious whether SYCL can see the GPU
    3. Targeted WARNING block — fires when a GPU has its driver bound but no render node, printing the exact steps to resolve the issue on bare metal and in containers
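
For reviewers, here is a minimal sketch of how a "run sycl-ls if available" check like section 2 could look. It is not the code in this PR: the function name `sycl_ls_report`, the injectable `which` parameter, and the returned messages are all assumptions made for illustration.

```python
import shutil
import subprocess
from typing import Callable, Optional


def sycl_ls_report(
    which: Callable[[str], Optional[str]] = shutil.which,
    timeout: float = 10.0,
) -> str:
    """Return sycl-ls output, or a short explanation of why it is unavailable.

    Hypothetical helper: the name, parameters, and messages are illustrative.
    """
    exe = which("sycl-ls")
    if exe is None:
        # sycl-ls is not installed or not on PATH.
        return "sycl-ls not found on PATH"
    try:
        proc = subprocess.run(
            [exe], capture_output=True, text=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The binary could not be started, or it hung past the timeout.
        return f"sycl-ls failed to run: {exc}"
    # Prefer stdout; fall back to stderr so error text is still shown.
    output = (proc.stdout or proc.stderr).strip()
    return output or f"sycl-ls exited with code {proc.returncode} and no output"
```

Passing `which` as a parameter keeps the check testable on machines without oneAPI installed; `print(sycl_ls_report())` would show either the device list or the reason it could not be produced.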

Test plan

  • pytest passes (46 passed, 9 skipped; no regressions)
  • arc-llama doctor on a host with xe loaded but no /dev/dri/renderD* shows the WARNING block and numbered checklist
  • arc-llama doctor on a working system shows green render nodes and sycl-ls device list
  • llama-server crash with "No device of requested type available" now logs the checklist to the arc-llama logger

https://claude.ai/code/session_01GNwxSoERiDh1WWnS7Z2LvM

Generated by Claude Code

detect.py: add note when driver is bound but no DRM render node exists in
sysfs (the direct cause of "No device of requested type available"), and
expose render_nodes_in_dev() to check /dev/dri/ independently.
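
A rough sketch of a /dev/dri/ check in the spirit of render_nodes_in_dev() follows. The real helper's signature and return type are not shown in this PR text, so the name `list_render_nodes`, the `dev_dri` parameter, and the list-of-names return value are assumptions.

```python
from pathlib import Path
from typing import List


def list_render_nodes(dev_dri: Path = Path("/dev/dri")) -> List[str]:
    """Return sorted renderD* entry names under dev_dri.

    Hypothetical stand-in for render_nodes_in_dev(). Returns an empty list
    when the directory is missing, as can happen inside a container.
    """
    if not dev_dri.is_dir():
        return []
    # Only renderD* entries are render nodes; cardN entries are ignored.
    return sorted(p.name for p in dev_dri.glob("renderD*"))
```

Reading /dev/dri/ directly, rather than sysfs, makes it possible to tell "the kernel created no render node" apart from "the node exists but was not exposed to this environment".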

launcher.py: capture stderr to a temp file when no log_dir is set so
wait_ready() can read and surface the crash output.  _surface_crash_logs()
detects the canonical SYCL no-device message and emits a numbered checklist
(render nodes, render group, dmesg, sycl-ls).  Temp file is cleaned up in stop().
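
The crash-surfacing step can be sketched roughly as below. This is an illustration only: `tail_lines`, `crash_hints`, and the checklist wording are hypothetical, and only the matched message and the four checklist topics come from the description above.

```python
from pathlib import Path
from typing import List

# Message quoted in the PR description.
SYCL_NO_DEVICE = "No device of requested type available"

# Topics taken from the PR description; the exact wording is assumed.
CHECKLIST = [
    "Check that /dev/dri/ contains a renderD* node",
    "Check that the user belongs to the render group",
    "Inspect dmesg for driver errors",
    "Run sycl-ls to see which devices SYCL can enumerate",
]


def tail_lines(path: Path, max_lines: int = 20) -> List[str]:
    """Return up to the last max_lines lines of a text file."""
    if max_lines <= 0:
        return []
    # errors="replace" keeps going if the crashed process wrote invalid UTF-8.
    text = path.read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-max_lines:]


def crash_hints(stderr_path: Path) -> List[str]:
    """Return a numbered checklist if the SYCL no-device message is in the tail."""
    tail = tail_lines(stderr_path)
    if any(SYCL_NO_DEVICE in line for line in tail):
        return [f"{i}. {item}" for i, item in enumerate(CHECKLIST, start=1)]
    return []
```

Matching on the tail instead of the whole file keeps the check cheap for long-running servers, at the cost of missing a message that scrolled out of the last `max_lines` lines.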

cli.py: extend `arc-llama doctor` with a /dev/dri/ render-node listing,
a sycl-ls device enumeration block, and a targeted WARNING block when a GPU
has its driver bound but no render node — giving users the exact steps needed
to unblock the xe driver in bare-metal and container environments.

https://claude.ai/code/session_01GNwxSoERiDh1WWnS7Z2LvM
@offbyonebit marked this pull request as ready for review May 27, 2026 17:45
2 participants