fix(hub): broaden allowedCudaVersions and lower default log level #222

Open
TimPietruskyRunPod wants to merge 1 commit into main from fix/hub-cuda-versions-and-log-level

@TimPietruskyRunPod
Contributor

Summary

Two related hub-config fixes that came out of a live validation of #186 (PR #197) on 5.8.5:

  1. allowedCudaVersions in .runpod/hub.json and .runpod/tests_.json was locked to ["12.7", "12.6"]. That's an allow-list, not a minimum, so every modern Runpod host running CUDA 12.8/12.9/13.0 drivers (which are backward compatible and can run a 12.6 container fine) was excluded from scheduling. Result: hub deployments landed on the small pool of legacy hosts and crash-looped with:

    nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6,
    please update your driver to a newer version, or use an earlier cuda container
    

    The 5.8.5 worker on a 4090 sat in this loop until I patched the live endpoint via REST API to include 12.8/12.9/13.0. A fresh worker then spawned on a compatible host and the FLUX.1-dev-fp8 cold-start test job completed in 23.1s execution time (validating PR #197, "fix: poll ComfyUI server indefinitely while process is alive").

  2. COMFY_LOG_LEVEL default was DEBUG. In a real cold-start log (1330 lines), only ~25 lines were actual worker-comfyui output. The other ~1300 were low-level comfy internals: hundreds of "Backend eager selected for apply_rope1" lines, "aimdo: src/control.c:34:DEBUG: --- VRAM Stats ---" dumps, and Popen([...]) subprocess noise. Switching the default to INFO keeps the handler output readable while leaving DEBUG available for users who explicitly opt in.
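
To make the allow-list vs. minimum distinction from item 1 concrete, here is a minimal Python sketch. It is not Runpod's scheduler: the helper names `host_allowed_by_list` and `host_meets_minimum` are hypothetical, and it assumes versions are plain `major.minor` strings.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Turn "12.8" into (12, 8) so versions compare numerically."""
    major, minor = version.split(".")
    return int(major), int(minor)


def host_allowed_by_list(host_cuda: str, allowed: list[str]) -> bool:
    """Allow-list semantics: the host version must appear in the list exactly."""
    return host_cuda in allowed


def host_meets_minimum(host_cuda: str, minimum: str) -> bool:
    """Minimum semantics: any host at or above the minimum qualifies."""
    return parse_version(host_cuda) >= parse_version(minimum)


old_list = ["12.7", "12.6"]                          # list before this PR
new_list = ["12.6", "12.7", "12.8", "12.9", "13.0"]  # list proposed in this PR

for host in ["12.6", "12.8", "13.0"]:
    print(
        host,
        "old list:", host_allowed_by_list(host, old_list),
        "new list:", host_allowed_by_list(host, new_list),
        "minimum 12.6:", host_meets_minimum(host, "12.6"),
    )
```

Under exact-match semantics a 12.8 host is rejected by the old list even though it would satisfy a "12.6 or newer" rule, which is why the fix has to enumerate the newer versions.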

Related issues

Test plan

  • Reproduced the OCI failure on a fresh hub deploy of 5.8.5
  • After expanding allowedCudaVersions via REST on the live endpoint, fresh worker spawned and completed a FLUX.1-dev-fp8 job (23.1s exec, 1 image returned)
  • Re-deploy from hub after merge + release and verify cold start lands on a 12.8+ host without intervention
  • Confirm logs at the new INFO default no longer contain the apply_rope1 / aimdo: spam
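
For the last check, a small script along these lines could count how many log lines still match the noise described above. This is only a sketch: the marker strings are taken from the description, `summarize_log` is a hypothetical helper, and the third sample line is a made-up placeholder rather than real worker output.

```python
# Substrings that identified low-level comfy noise in the DEBUG log.
NOISE_MARKERS = ("Backend eager selected", "aimdo:", "Popen(")


def summarize_log(lines: list[str]) -> tuple[int, int]:
    """Return (total line count, count of lines matching any noise marker)."""
    noisy = [line for line in lines if any(marker in line for marker in NOISE_MARKERS)]
    return len(lines), len(noisy)


sample = [
    "Backend eager selected for apply_rope1",
    "aimdo: src/control.c:34:DEBUG: --- VRAM Stats ---",
    "placeholder handler line (illustrative, not real output)",
]

total, noisy = summarize_log(sample)
print(f"{noisy} of {total} lines match a noise marker")
```

Running it over a cold-start log captured at the INFO default should report zero matching lines if the change works as intended.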

…onfig

- Expand allowedCudaVersions to include 12.8, 12.9, 13.0 in addition to
  12.6/12.7. The narrow allow-list excluded modern Runpod hosts running
  CUDA 12.8+ drivers, causing hub deployments to crash-loop on
  incompatible hosts with `nvidia-container-cli: requirement error:
  unsatisfied condition: cuda>=12.6`.

- Change default COMFY_LOG_LEVEL from DEBUG to INFO. At DEBUG, >75% of
  worker log lines were low-level comfy internals (apply_rope1,
  Backend eager selected, aimdo VRAM stats, Popen subprocess calls),
  burying the actual handler output. DEBUG is still available
  explicitly when users need it.
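
As an illustration of "INFO by default, DEBUG on explicit opt-in", the following Python sketch resolves a level from the COMFY_LOG_LEVEL environment variable. It is not the worker's actual code (this PR only changes hub config), and `resolve_log_level` is a hypothetical helper.

```python
import logging
import os

# Level names this sketch accepts; anything else falls back to the default.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}
DEFAULT_LEVEL = "INFO"


def resolve_log_level(env=None) -> int:
    """Read COMFY_LOG_LEVEL from env, defaulting to INFO when unset or unknown."""
    env = os.environ if env is None else env
    name = env.get("COMFY_LOG_LEVEL", DEFAULT_LEVEL).upper()
    return LEVELS.get(name, LEVELS[DEFAULT_LEVEL])


print(logging.getLevelName(resolve_log_level({})))
print(logging.getLevelName(resolve_log_level({"COMFY_LOG_LEVEL": "DEBUG"})))
```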
@TimPietruskyRunPod
Contributor Author

Validation status: ✅ confirmed

Live test today: deployed a serverless endpoint from the hub listing of 5.8.5 (which has the old `allowedCudaVersions: ["12.7", "12.6"]` config). Worker landed on a host that crash-looped with:

```
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.6
```

Patched the live endpoint's `allowedCudaVersions` via the REST API to the new list `["12.6","12.7","12.8","12.9","13.0"]`. The crash-loop stopped on the next worker spawn (landed on a compatible host) and a FLUX.1-dev-fp8 cold-start job completed successfully (~23 s exec time). Validates both:

  1. The current allow-list is too narrow (the bug)
  2. The proposed wider allow-list resolves it (the fix)

Also verifies that #197's process-liveness logic works for cold starts that take longer than the previous 25s ceiling.

No Docker rebuild is required: this is pure hub config and takes effect on the next deploy after merge + release.
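
Since the change is config only, a quick pre-release check could confirm that each `allowedCudaVersions` list contains the full set. The Python sketch below searches the parsed JSON recursively so it does not depend on where the key sits; the `config` wrapper in the sample strings is an assumed layout for illustration, and `missing_versions` is a hypothetical helper.

```python
import json

# Versions the new allow-list is expected to contain.
EXPECTED = {"12.6", "12.7", "12.8", "12.9", "13.0"}


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            else:
                yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def missing_versions(config_text: str) -> list[set[str]]:
    """For each allowedCudaVersions list found, return the expected versions it lacks."""
    parsed = json.loads(config_text)
    return [EXPECTED - set(value) for value in find_key(parsed, "allowedCudaVersions")]


# Sample documents; the surrounding "config" object is an assumption.
old_config = '{"config": {"allowedCudaVersions": ["12.7", "12.6"]}}'
new_config = '{"config": {"allowedCudaVersions": ["12.6", "12.7", "12.8", "12.9", "13.0"]}}'

print(missing_versions(old_config))
print(missing_versions(new_config))
```

An empty set for every list means the file carries the widened allow-list; an empty result list means the key was not found at all.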
