fix(hub): broaden allowedCudaVersions and lower default log level #222
TimPietruskyRunPod wants to merge 1 commit
…onfig

- Expand `allowedCudaVersions` to include 12.8, 12.9, and 13.0 in addition to 12.6/12.7. The narrow allow-list excluded modern Runpod hosts running CUDA 12.8+ drivers, causing hub deployments to crash-loop on incompatible hosts with `nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6`.
- Change the default `COMFY_LOG_LEVEL` from `DEBUG` to `INFO`. At `DEBUG`, more than 75% of worker log lines were low-level comfy internals (`apply_rope1`, `Backend eager selected`, aimdo VRAM stats, `Popen` subprocess calls), burying the actual handler output. `DEBUG` is still available explicitly when users need it.
Validation status: ✅ confirmed

Live test today: deployed a serverless endpoint from the hub listing of 5.8.5 (which has the old `allowedCudaVersions: ["12.7", "12.6"]` config). The worker landed on a host that crash-looped with `nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6`.

Patched the live endpoint's `allowedCudaVersions` via the REST API to the new list `["12.6","12.7","12.8","12.9","13.0"]`. The crash-loop stopped on the next worker spawn (which landed on a compatible host), and a FLUX.1-dev-fp8 cold-start job completed successfully (~23 s exec time). Validates both:

Also verifies that #197's process-liveness logic works for cold starts that take longer than the previous 25s ceiling.

No Docker rebuild is required: this is pure hub config and takes effect on the next deploy after merge + release.
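To illustrate why the allow-list change widens the host pool, here is a minimal sketch of exact-match filtering versus a minimum-version check. It is a toy model only: the host names, the `HOSTS` mapping, and both helper functions are hypothetical and do not represent Runpod's scheduler or its API.

```python
# Toy model: hypothetical hosts keyed by the CUDA version their driver reports.
HOSTS = {
    "host-a": "12.6",
    "host-b": "12.7",
    "host-c": "12.8",
    "host-d": "12.9",
    "host-e": "13.0",
}


def parse(version: str) -> tuple[int, int]:
    """Turn '12.8' into (12, 8) so versions compare numerically."""
    major, minor = version.split(".")
    return int(major), int(minor)


def eligible_by_allow_list(hosts: dict[str, str], allowed: list[str]) -> list[str]:
    """Exact-match semantics: a host qualifies only if its version is listed."""
    allowed_set = set(allowed)
    return sorted(name for name, v in hosts.items() if v in allowed_set)


def eligible_by_minimum(hosts: dict[str, str], minimum: str) -> list[str]:
    """Minimum semantics: any host at or above the given version qualifies."""
    floor = parse(minimum)
    return sorted(name for name, v in hosts.items() if parse(v) >= floor)


old_pool = eligible_by_allow_list(HOSTS, ["12.7", "12.6"])
new_pool = eligible_by_allow_list(HOSTS, ["12.6", "12.7", "12.8", "12.9", "13.0"])
min_pool = eligible_by_minimum(HOSTS, "12.6")

print("old allow-list:", old_pool)
print("new allow-list:", new_pool)
print("minimum 12.6:  ", min_pool)
```

In this model the old list admits only the two oldest hosts, while the expanded list admits the same hosts a `>= 12.6` minimum would. Any future version still has to be added to the list explicitly.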
Summary
Two related hub-config fixes that came out of a live validation of #186 (PR #197) on 5.8.5:
- `allowedCudaVersions` in `.runpod/hub.json` and `.runpod/tests_.json` was locked to `["12.7", "12.6"]`. That is an allow-list, not a minimum, so every modern Runpod host running CUDA 12.8/12.9/13.0 drivers (which can run a 12.6 container through backward compatibility) was excluded from scheduling. Result: hub deployments landed on the small pool of legacy hosts and crash-looped with `nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6`. The 5.8.5 worker on a 4090 sat in this loop until I patched the live endpoint via the REST API to include 12.8/12.9/13.0. A fresh worker then spawned on a compatible host, and the FLUX.1-dev-fp8 cold-start test job completed in 23.1s execution time, validating PR #197 ("fix: poll ComfyUI server indefinitely while process is alive").
- The `COMFY_LOG_LEVEL` default was `DEBUG`. In a real cold-start log (1330 lines), only ~25 lines were actual worker-comfyui output. The other ~1300 were low-level comfy internals: hundreds of `Backend eager selected for apply_rope1`, `aimdo: src/control.c:34:DEBUG: --- VRAM Stats ---`, and `Popen([...])` subprocess noise. Switching the default to `INFO` keeps the handler output readable while leaving `DEBUG` available for users who explicitly opt in.

Related issues
Found no NVIDIA driver) and the unresolvable-CUDA failure mode discussed in #186 ([BUG] Sporadic errors - "ComfyUI server (127.0.0.1:8188) not reachable after multiple retries.").

Test plan
- `allowedCudaVersions` via REST on the live endpoint, fresh worker spawned and completed a FLUX.1-dev-fp8 job (23.1s exec, 1 image returned)
- `apply_rope1`/`aimdo:` spam
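
The default-with-opt-in behavior for `COMFY_LOG_LEVEL` described in the summary can be sketched as follows. This is an illustration under assumptions, not the worker's actual code: the `resolve_log_level` helper and the `LEVELS` table are hypothetical, and how worker-comfyui passes the level on to ComfyUI is not shown in this PR.

```python
import logging
import os

# Hypothetical lookup table; only standard logging level names are accepted.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(env=None, default="INFO"):
    """Return the numeric level for COMFY_LOG_LEVEL, using `default` when unset.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    name = str(env.get("COMFY_LOG_LEVEL", default)).strip().upper()
    # Unknown names fall back to INFO instead of raising.
    return LEVELS.get(name, logging.INFO)


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level())
    log = logging.getLogger("worker-sketch")
    log.debug("hidden unless COMFY_LOG_LEVEL=DEBUG is set explicitly")
    log.info("handler output stays visible at the INFO default")
```

With no variable set, only the `info` line is emitted. Exporting `COMFY_LOG_LEVEL=DEBUG` before running restores the verbose output.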