fix(hub): broaden allowedCudaVersions and lower default log level by TimPietruskyRunPod · Pull Request #222 · runpod-workers/worker-comfyui

TimPietruskyRunPod · 2026-06-02T10:16:18Z

Summary

Two related hub-config fixes that came out of a live validation of #186 (PR #197) on 5.8.5:

`allowedCudaVersions` in `.runpod/hub.json` and `.runpod/tests_.json` was locked to `["12.7", "12.6"]`. That is an allow-list, not a minimum, so every modern Runpod host running CUDA 12.8/12.9/13.0 drivers (which can run a 12.6 container through backward compatibility) was excluded from scheduling. As a result, hub deployments landed on the small pool of legacy hosts and crash-looped with:
```
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6,
please update your driver to a newer version, or use an earlier cuda container
```
The 5.8.5 worker on a 4090 sat in this loop until I patched the live endpoint via the REST API to include 12.8/12.9/13.0. A fresh worker then spawned on a compatible host, and the FLUX.1-dev-fp8 cold-start test job completed in 23.1s execution time (validating PR #197, "fix: poll ComfyUI server indefinitely while process is alive").
`COMFY_LOG_LEVEL` default was `DEBUG`. In a real cold-start log (1330 lines), only ~25 lines were actual worker-comfyui output; the other ~1300 were low-level comfy internals: hundreds of `Backend eager selected for apply_rope1`, `aimdo: src/control.c:34:DEBUG: --- VRAM Stats ---`, and `Popen([...])` subprocess noise. Switching the default to `INFO` keeps the handler output readable while leaving `DEBUG` available for users who explicitly opt in.
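
To illustrate the first point, here is a minimal sketch of why an exact-match allow-list excludes newer hosts while a minimum-version check would not. It is a hypothetical model of a scheduling filter, not Runpod's actual scheduler; the function names and the sample host versions are made up for the example.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn "12.6" into (12, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def eligible_by_allow_list(host_cuda: str, allowed: list[str]) -> bool:
    """Allow-list semantics: the host version must appear in the list exactly."""
    return host_cuda in allowed


def eligible_by_minimum(host_cuda: str, minimum: str) -> bool:
    """Minimum semantics: any host at or above the minimum qualifies."""
    return parse_version(host_cuda) >= parse_version(minimum)


OLD_ALLOWED = ["12.7", "12.6"]                          # list before this PR
NEW_ALLOWED = ["12.6", "12.7", "12.8", "12.9", "13.0"]  # list after this PR

# Hypothetical host pool, chosen only to show the difference.
hosts = ["12.4", "12.6", "12.8", "13.0"]

old_pool = [h for h in hosts if eligible_by_allow_list(h, OLD_ALLOWED)]
new_pool = [h for h in hosts if eligible_by_allow_list(h, NEW_ALLOWED)]
min_pool = [h for h in hosts if eligible_by_minimum(h, "12.6")]

print(old_pool)  # only the exact matches survive
print(new_pool)  # newer hosts are now schedulable
print(min_pool)  # what a ">= 12.6" rule would have selected
```

With the old list, the 12.8 and 13.0 hosts are filtered out even though they satisfy a `>= 12.6` requirement. Because the field is an allow-list, each newer version has to be added explicitly.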

Related issues

The same symptom likely contributes to #94 ("[BUG]: RuntimeError: Found no NVIDIA driver on your system") and to the unresolvable-CUDA failure mode discussed in #186 ("[BUG] Sporadic errors - "ComfyUI server (127.0.0.1:8188) not reachable after multiple retries."").

Test plan

Reproduced the OCI failure on a fresh hub deploy of 5.8.5
After expanding allowedCudaVersions via REST on the live endpoint, fresh worker spawned and completed a FLUX.1-dev-fp8 job (23.1s exec, 1 image returned)
Re-deploy from hub after merge + release and verify cold start lands on a 12.8+ host without intervention
Confirm logs at the new INFO default no longer contain the apply_rope1 / aimdo: spam
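
The last check in the test plan could be automated along these lines. This is a rough sketch: the noise patterns are taken from the log excerpts quoted in the summary, while the helper name and the sample handler line are hypothetical.

```python
import re

# Substrings quoted in the PR summary as DEBUG-level noise.
NOISE_PATTERNS = [
    re.compile(r"apply_rope1"),
    re.compile(r"aimdo:"),
    re.compile(r"Popen\("),
]


def count_noise_lines(log_text: str) -> int:
    """Count log lines that match any of the known noise patterns."""
    return sum(
        1
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in NOISE_PATTERNS)
    )


# Sample logs; the "worker-comfyui" line is a made-up stand-in for handler output.
debug_log = "\n".join([
    "Backend eager selected for apply_rope1",
    "aimdo: src/control.c:34:DEBUG: --- VRAM Stats ---",
    "Popen([...])",
    "worker-comfyui - job finished",
])
info_log = "worker-comfyui - job finished"

debug_noise = count_noise_lines(debug_log)
info_noise = count_noise_lines(info_log)
print(debug_noise, info_noise)
```

A log captured at the new `INFO` default should give a count of zero. Any other value means some of the comfy internals are still being logged.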

…onfig
- Expand allowedCudaVersions to include 12.8, 12.9, 13.0 in addition to 12.6/12.7. The narrow allow-list excluded modern Runpod hosts running CUDA 12.8+ drivers, causing hub deployments to crash-loop on incompatible hosts with `nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6`.
- Change default COMFY_LOG_LEVEL from DEBUG to INFO. At DEBUG, >75% of worker log lines were low-level comfy internals (apply_rope1, Backend eager selected, aimdo VRAM stats, Popen subprocess calls), burying the actual handler output. DEBUG is still available explicitly when users need it.

TimPietruskyRunPod · 2026-06-02T12:45:40Z

Validation status: ✅ confirmed

Live test today: deployed a serverless endpoint from the hub listing of 5.8.5 (which has the old `allowedCudaVersions: ["12.7", "12.6"]` config). The worker landed on a host that crash-looped with:

```
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6
```

Patched the live endpoint's `allowedCudaVersions` via the REST API to the new list `["12.6","12.7","12.8","12.9","13.0"]`. The crash-loop stopped on the next worker spawn (landed on a compatible host) and a FLUX.1-dev-fp8 cold-start job completed successfully (~23 s exec time). Validates both:

The current allow-list is too narrow (the bug)
The proposed wider allow-list resolves it (the fix)

Also verifies that #197's process-liveness logic works for cold starts that take longer than the previous 25s ceiling.

No Docker rebuild is required: this is pure hub config and takes effect on the next deploy after merge + release.
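
For reference, the widened list used in the live patch can be derived from the old one as sketched below. The comment does not give the REST route, method, or authentication, so the sketch stops at building a JSON body. Placing `allowedCudaVersions` at the top level of that body is an assumption, and `widen_allow_list` is a hypothetical helper.

```python
import json


def widen_allow_list(current: list[str], extra: list[str]) -> list[str]:
    """Merge two version lists, drop duplicates, and sort numerically."""
    merged = set(current) | set(extra)
    return sorted(merged, key=lambda v: tuple(int(p) for p in v.split(".")))


current = ["12.7", "12.6"]  # hub config shipped with 5.8.5
widened = widen_allow_list(current, ["12.8", "12.9", "13.0"])

# Assumed body shape; check the REST API reference before sending it.
patch_body = json.dumps({"allowedCudaVersions": widened})
print(patch_body)
```

Sorting on the parsed numeric tuple rather than the raw string keeps the order correct if a two-digit minor version such as 12.10 is ever added.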

This was referenced Jun 2, 2026

[BUG]: RuntimeError: Found no NVIDIA driver on your system #94 (Closed)
Connection Refused #67 (Closed)
[BUG]: Unable to run Custom Nodes / Impact Pack #86 (Closed)
Support nvidia blackwell (5090) #124 (Closed)
ci: build base image on every pull request #223 (Open)

TimPietruskyRunPod wants to merge 1 commit into main from fix/hub-cuda-versions-and-log-level

2 participants
