feat: multi-GPU parallel session execution #9263
Run one generation session per configured GPU concurrently, with a tiled progress preview. Multi-user isolation is unchanged. Backed by five seams:

- Per-thread device context (TorchDevice.set/get/clear_session_device); choose_torch_device() consults it first, so all device-selecting call sites resolve to the calling worker's GPU with no per-node changes.
- Per-device model caches: build_model_manager builds one ModelCache per generation device; ModelLoadService.ram_cache resolves by current thread device; ram_caches fans out clear/drop/shutdown.
- Atomic concurrent dequeue: a dequeue lock makes select+claim atomic so concurrent workers never claim the same item (works on FIFO; round-robin from invoke-ai#9086 slots in later).
- Worker pool: one _SessionWorker per device, each pinning torch.cuda.set_device and its session device, with its own runner and cancel event; cancellation routes via an {item_id -> worker} lookup. Single-device installs keep the exact legacy single-worker behavior. Profiling disabled when >1 worker.
- New config `generation_devices`; unset = legacy single-worker mode.

Frontend: the canvas staging area already tiles per queue item; the main ImageViewer now tracks progress per session and renders a tile grid (ProgressImageTiles) when more than one session is active.

Also adds a lock to ObjectSerializerForwardCache for concurrent access.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
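The per-thread device context described in this commit can be illustrated with a minimal sketch built on `threading.local`. This is not InvokeAI's code: the module-level functions, `DEFAULT_DEVICE`, `choose_device` and `run_workers` are hypothetical stand-ins for the `TorchDevice` methods and `choose_torch_device()`, and device names are plain strings so the example runs without torch.

```python
import threading
from typing import Dict, List, Optional

# Hypothetical stand-in for the per-thread device context; not InvokeAI's actual code.
_session = threading.local()
DEFAULT_DEVICE = "cpu"  # assumed fallback for this sketch


def set_session_device(device: str) -> None:
    """Pin a device for the calling thread only."""
    _session.device = device


def get_session_device() -> Optional[str]:
    """Return the calling thread's pinned device, or None if nothing is pinned."""
    return getattr(_session, "device", None)


def clear_session_device() -> None:
    """Remove the calling thread's pinned device, if any."""
    if hasattr(_session, "device"):
        del _session.device


def choose_device() -> str:
    """Consult the calling thread's pinned device first, then fall back."""
    pinned = get_session_device()
    return pinned if pinned is not None else DEFAULT_DEVICE


def run_workers(devices: List[str]) -> Dict[str, str]:
    """Start one worker thread per device and record what each one resolves."""
    seen: Dict[str, str] = {}

    def worker(device: str) -> None:
        set_session_device(device)
        try:
            seen[device] = choose_device()
        finally:
            clear_session_device()

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Because the pinned value lives in thread-local storage, code that calls `choose_device()` deep inside a node needs no extra argument to land on its worker's GPU, which is the property the commit relies on.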
test_model_load_device_routing mutated the process-wide get_config() singleton (device = "cuda:0") to exercise the per-thread cache routing, but never restored it. The leaked CUDA device was then picked up by a later test (test_model_load::test_loading) via choose_torch_device(), which crashed with "Torch not compiled with CUDA enabled" on the CUDA-less CI runner. Add an autouse fixture to save/restore device and clear any pinned session device. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n_devices Regenerate openapi.json (make frontend-openapi) and the frontend schema.ts types (make frontend-typegen) so they include the new generation_devices config field, fixing the openapi-checks and typegen-checks CI jobs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`make frontend-openapi` used a bare `python` from a different environment that emitted the CacheStats @dataclass docstring as a schema description. CI generates the schema via `uv run`, which does not, so openapi-checks failed on the diff. Regenerate with the uv-locked environment to drop the stray description while keeping the generation_devices field. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o prevent meta-device corruption Parallel multi-GPU session workers could intermittently crash with "unrecognized device meta" (denoise) or "Cannot copy out of meta tensor; no data!" (l2i), because model loading relies on process-global, non-thread-safe monkey-patches. accelerate.init_empty_weights() (used directly by the loaders and implicitly by diffusers' default low_cpu_mem_usage=True in from_pretrained) swaps torch.nn.Module.register_parameter globally for the duration of a load, routing every newly-registered parameter to the meta device. The model cache's VRAM load/unload runs nn.Module.load_state_dict(assign=True), whose assign path does setattr -> __setattr__ -> register_parameter. When one worker's VRAM move overlapped another worker's from_pretrained, the move's real weights got hijacked onto meta and blew up on the next .to(device). Introduce MODEL_LOAD_LOCK, a write-preferring readers-writer lock: - write lock = model construction (_load_and_cache, load_model_from_path), exclusive. - read lock = VRAM load/unload (ModelCache.lock(), repair_required_tensors_on_device). VRAM transfers across GPUs still overlap each other; they only block while a construction holds the write lock. The lock is always acquired before any per-cache lock to keep a consistent order and avoid an AB-BA deadlock with the writer's make_room/put. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
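A write-preferring readers-writer lock of the kind this commit describes can be sketched with a single `threading.Condition`. The class name and its `read()`/`write()` context managers are assumptions made for illustration; the actual MODEL_LOAD_LOCK implementation is not shown in this PR text and may differ.

```python
import threading
from contextlib import contextmanager


class WritePreferringRWLock:
    """Many concurrent readers or one writer; waiting writers block new readers."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._waiting_writers = 0

    @contextmanager
    def read(self):
        with self._cond:
            # New readers yield to any active or waiting writer (write preference).
            while self._writer_active or self._waiting_writers > 0:
                self._cond.wait()
            self._active_readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_readers -= 1
                if self._active_readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            self._waiting_writers += 1
            try:
                # A writer needs exclusive access: no readers and no other writer.
                while self._writer_active or self._active_readers > 0:
                    self._cond.wait()
            finally:
                self._waiting_writers -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()
```

In the terms of the commit message, model construction would run under `write()` and VRAM load/unload under `read()`, so transfers on different GPUs overlap each other and pause only while a construction is in flight.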
…ions Image.open() is lazy: it reads the header but defers pixel decoding (and holds the file handle open) until the first .load()/.copy()/.convert(). The opened object was cached and the same object handed to every caller, so in multi-GPU parallel mode two session-processor worker threads could call .copy() on it concurrently and race on the shared file handle and decoder state. This surfaced as "broken data stream when reading image file" and "AssertionError: self.png is not None" during inpainting with batch >1. Force the decode (image.load()) before the object enters the cache so the cached object is safe for concurrent reads, and guard the cache structures (__cache / __cache_ids) with a lock since they are now mutated from multiple threads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generation progress bars (under the Invoke button and the Viewer tab) both read a single global $lastProgressEvent atom, which every session overwrites. With parallel multi-GPU sessions this made the bar jump back and forth between sessions. Track progress per queue item id and render one bar per in-flight session, stacked vertically, each removed as its session reaches a terminal state.

- stores.ts: add $progressEvents (map keyed by item_id), $activeProgressEvents (sorted), and set/clear helpers.
- setEventListeners.tsx: populate per-item progress on invocation_progress; clear per item on terminal status; clear all on connect/disconnect/queue cleared.
- ProgressBar.tsx: render a vertical stack of bars (one per active session) with a single-bar fallback for the idle / model-loading window; add containerProps so dockview tabs can position the stack.
- Dockview tab call sites: move positioning into containerProps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
$progressEvents is only referenced within stores.ts (via the $activeProgressEvents computed and the set/clear helpers), so exporting it tripped knip's unused-exports check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With 4 GPUs the stacked per-session progress bars grew past the bottom strip of the dockview tab and overlapped the "Viewer" label. Add a fitHeightPx prop: in fit mode the stack is capped to the available strip (10px below the ~40px tab's centered label) and the bars flex to share it, shrinking below their natural height only once they no longer fit. With 1-2 sessions the bars keep their familiar thin height; with 3+ they scale down to stay within the strip. The sidebar bar is unaffected and continues to stack at natural height (it has the vertical room). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fault

generation_devices now accepts "auto" (the new default), which expands to every visible CUDA device, so multi-GPU parallel generation works out of the box without manually listing devices. On GPU-less systems "auto" resolves to the single cpu/mps device, preserving serial behavior.

- config_default.py: type is now Union[Literal["auto"], list[str]], default "auto"; validator accepts "auto" or a list of device strings.
- devices.py: add TorchDevice.get_generation_devices(), the single resolver that expands "auto", normalizes, and deduplicates.
- session_processor / model_manager: both consumers use the resolver instead of iterating the raw config value (which would have iterated the characters of the "auto" string).
- Regenerated docs/src/generated/settings.json.
- Tests for the resolver (auto-with/without-CUDA, dedup, empty).

An explicit single-device list (e.g. [cuda:0]) or an empty list opts out of parallelism.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
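The resolver behavior described here (expand "auto", normalize, deduplicate) can be sketched without torch by passing the CUDA device count in as a parameter. The function name, its parameters and the lowercase/strip normalization are assumptions for illustration, not the signature of `TorchDevice.get_generation_devices()`. The empty-list fallback follows this commit's description; a later commit in this PR rejects an empty list instead.

```python
from typing import List, Union


def resolve_generation_devices(
    value: Union[str, List[str]],
    cuda_device_count: int,
    default_device: str = "cpu",
) -> List[str]:
    """Expand "auto", normalize device strings, and drop duplicates in order."""
    if value == "auto":
        if cuda_device_count > 0:
            return [f"cuda:{i}" for i in range(cuda_device_count)]
        # No CUDA: "auto" resolves to the single default device (serial mode).
        return [default_device]
    if isinstance(value, str):
        # Guard against iterating a bare string character by character.
        raise ValueError('generation_devices must be "auto" or a list of device strings')
    resolved: List[str] = []
    for raw in value:
        device = raw.strip().lower()  # assumed normalization for this sketch
        if device not in resolved:
            resolved.append(device)
    # As described in this commit, an empty list opts out of parallelism.
    return resolved or [default_device]
```

Keeping the expansion in one function is what lets both consumers (session processor and model manager) agree on the device list.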
Render device badges as "cuda:0 (RTX 3090 #1)" so identical cards can be told apart. Strips the "NVIDIA GeForce" vendor prefix and adds a 1-based "#N" suffix only when multiple cards share a name. The full device name remains available as the badge tooltip. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
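The badge rule above (strip the vendor prefix, add a 1-based "#N" only when several cards share a name) is implemented in the frontend; the following is a Python sketch of the same rule for illustration only. `device_badges` and `VENDOR_PREFIX` are hypothetical names.

```python
from collections import Counter
from typing import Dict

VENDOR_PREFIX = "NVIDIA GeForce "  # prefix named in the commit message


def device_badges(device_names: Dict[str, str]) -> Dict[str, str]:
    """Map device id -> badge text such as "cuda:0 (RTX 3090 #1)"."""
    short = {
        dev: name[len(VENDOR_PREFIX):] if name.startswith(VENDOR_PREFIX) else name
        for dev, name in device_names.items()
    }
    totals = Counter(short.values())  # how many cards share each short name
    seen: Counter = Counter()         # running 1-based index per short name
    badges: Dict[str, str] = {}
    for dev, name in short.items():
        if totals[name] > 1:
            seen[name] += 1
            badges[dev] = f"{dev} ({name} #{seen[name]})"
        else:
            badges[dev] = f"{dev} ({name})"
    return badges
```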
Help users track which CUDA device is processing each session:

- Model-load log: "Loaded model ... onto cuda device #N in ..s"
- Denoise progress bars: "Denoising (#N)" across all architectures (SD1.5/SDXL, FLUX, FLUX2, Z-Image, Anima, SD3, CogView4)
- Progress preview circle: GPU number centered in the ring, via a new `device` field on InvocationProgressEvent (resolved from the worker's thread-local session device)
- Session Queue: new "GPU #" column between STATUS and TIME, backed by a `device` column on session_queue (migration_32) recorded when a worker claims an item

Adds TorchDevice.get_session_device_label()/get_session_device_index() helpers and a frontend getCudaDeviceIndex() parser (with tests). Shows the number on CUDA only; CPU/MPS show nothing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
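A device-index parser in the spirit of the frontend `getCudaDeviceIndex()` mentioned above might look like the following Python sketch. The function name, the regular expression and the treatment of a bare "cuda" string (returns None here) are assumptions; the real parser is TypeScript and its exact rules are not shown in this PR text.

```python
import re
from typing import Optional

_CUDA_RE = re.compile(r"^cuda:(\d+)$")


def get_cuda_device_index(device: Optional[str]) -> Optional[int]:
    """Return the index from a "cuda:N" string, or None for CPU/MPS/unknown."""
    if not device:
        return None
    match = _CUDA_RE.match(device.strip().lower())
    return int(match.group(1)) if match else None
```

Returning None for non-CUDA devices matches the stated behavior that CPU and MPS show no number.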
# Conflicts: # invokeai/frontend/web/src/features/gallery/components/ImageViewer/CurrentImagePreview.tsx # invokeai/frontend/web/src/features/gallery/components/ImageViewer/context.tsx # invokeai/frontend/web/src/services/events/setEventListeners.tsx
I can take a look at this but I don't have the ability to test it. Recommend assigning to a second person as well.
Resolve migration_32 conflict: main's migration_32 (model_relationships FK repair) is kept, and the multi-gpu device-column migration is moved to migration_33 and registered after it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rcles

- Startup log lists each generation device with its GPU number and id, e.g. "Using torch device: [AMD Radeon PRO W7900 #1 (cuda:0), ...]". Single-device setups keep the bare device name.
- Canvas progress circles now show the CUDA device index in the center, matching the viewer panel.
- Progress-circle tooltips show the device name and number on hover.
- Both are hidden when only a single GPU is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
I have converted back to a draft temporarily while I work out excessive RAM consumption issues.
Some comments with the current code (not posted as a review):
… caches

In multi-GPU mode the model manager builds one ModelCache per generation device, each with storage_device="cpu" and its own RAM-resident copy of every model. A model loaded on N GPUs therefore occupied N copies in RAM, and each cache sized itself against max_cache_ram_gb independently, so RAM use during the text/reference-image encoding phases skyrocketed and the system swapped, worst when two images rendered at once. This deduplicates the CPU-resident weights and makes RAM accounting global.

- SharedCpuWeightsStore: process-/manager-global, refcounted store of one canonical CPU state_dict per model key. The first device to load a key registers its weights; subsequent devices adopt the canonical tensors and re-point their module's params at them (load_state_dict(assign=True)), freeing the duplicate. Weights live once in RAM regardless of GPU count; freed only when the last device releases. Per-device modules are kept (params are device-shuffled in place, so two GPUs need two modules), but their CPU-resident params alias the shared tensors.
- RamBudget: single system-wide RAM authority. Splits RAM into shared (counted once via the store) and non-shared (per-instance). ModelCache eviction now runs against the global, deduplicated total and re-checks availability each iteration, since evicting a model another device still holds frees no RAM. build_model_manager wires one store + one budget into all device caches; the cap is max_cache_ram_gb as a true system-wide limit, else the sum of per-cache heuristics. Passing ram_budget=None preserves the prior local accounting.
- LoRA/patch safety: direct LoRA patching did an in-place copy_ on the weight, which would corrupt the now-shared canonical tensor (and taint keep_ram_copy even with one GPU) when patching a CPU-resident weight. Switched to an out-of-place add (memory-equivalent) so the canonical tensor is never mutated; fixed the FluxControlLoRA expansion path to target the module's live parameter. Sidecar patching and FreeU/Seamless (which patch forward methods) were already safe.

Validated on 2x AMD W7900 / ROCm: correct inference on both GPUs from one shared copy (full + partial load + Q8_0 GGUF quantized), concurrent load/unload without corruption, and LoRA isolation across devices. ~40 new tests; existing suites unchanged. Adds scripts/multigpu_ram_driver.py to drive concurrent dual-GPU generations via the queue API and measure peak RSS / leak drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
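The refcounting idea behind SharedCpuWeightsStore can be sketched with plain Python objects standing in for state_dicts. The class below, including its `acquire`/`release` method names and return values, is a hypothetical illustration rather than the store's actual interface, and it omits the tensor re-pointing step entirely.

```python
import threading
from typing import Any, Dict, Tuple


class SharedWeightsStore:
    """Refcounted store holding one canonical weights object per model key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[Any, int]] = {}

    def acquire(self, key: str, weights: Any) -> Any:
        """Register weights on first use; later callers adopt the canonical copy."""
        with self._lock:
            if key in self._entries:
                canonical, refs = self._entries[key]
                self._entries[key] = (canonical, refs + 1)
                return canonical
            self._entries[key] = (weights, 1)
            return weights

    def release(self, key: str) -> bool:
        """Drop one reference; return True when the last holder frees the entry."""
        with self._lock:
            canonical, refs = self._entries[key]
            if refs > 1:
                self._entries[key] = (canonical, refs - 1)
                return False
            del self._entries[key]
            return True

    def __len__(self) -> int:
        with self._lock:
            return len(self._entries)
```

A second device that calls `acquire` with its own freshly loaded copy gets the first device's object back and can drop its duplicate, so the weights occupy RAM once however many GPUs hold the model.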
…(multi-GPU) With one session-processor worker per device, multiple queue items can be in_progress at once. cancel_by_batch_ids(), cancel_by_destination() and cancel_by_queue_id() excluded in_progress rows from their bulk UPDATE and then canceled only the single get_current() item (LIMIT 1), so on multi-GPU the other running items kept consuming a GPU and could still produce output after the user requested cancellation. Each running item must be canceled via _set_queue_item_status(), which emits the QueueItemStatusChangedEvent that the processor maps to the worker running that item_id and uses to set its cancel event. Add _cancel_in_progress_matching() to cancel every in-progress item matching the same filter (with user-id scoping preserved) and call it from all three bulk-cancel methods. The returned `canceled` count now includes canceled in-progress items. Adds regression tests that dequeue two items onto separate devices and assert every bulk cancel API moves all matching in_progress items to canceled and emits a cancel event for each (and that user-scoped cancel leaves another user's in-progress item running). Reported by JPPhoto in review of invoke-ai#9263. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vice guards, refcount leak)

Fixes from the code review of PR invoke-ai#9263:

- Cancellation could be silently lost around dequeue: the per-iteration worker.cancel_event.clear() ran AFTER dequeue + gc.collect() + logging, so a cancel arriving in that window was set by the status handler and then wiped. Move the clear to before dequeue, and after claiming an item re-check (cancel_event + a fresh DB status read via _is_queue_item_terminal) and skip running if it is already terminal, closing both race windows. The runner's stale queue_item.status check could not catch this.
- delete_by_destination only stopped one in-progress item (get_current) before deleting all matching rows, leaving other GPU workers running (and then failing to update a deleted row). Cancel every matching in-progress item via _cancel_in_progress_matching first.
- generation_devices validation: a bare non-"auto" string (e.g. "cuda:0") was iterated character-by-character; an empty list silently fell back to one device. Reject both with a clear message.
- get_generation_devices now fails fast on a CUDA device that does not exist (index past device_count, or CUDA unavailable) instead of starting a worker that errors cryptically at first allocation.
- Shared-weights wrappers: if the canonical re-point (load_state_dict assign=True) threw after acquire(), the reference was leaked (the wrapper never entered the cache). Compute size metadata first, make acquire the last step, and release on failure.

Adds tests for each: post-dequeue terminal guard, delete_by_destination cancellation, generation_devices validation, absent-device rejection, and acquire-released-on-repoint-failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JPPhoto thanks for the careful review — both points were spot on. Bulk cancellation in multi-GPU mode — fixed.
All three are covered by new regression tests (two Per-user fairness — correct, this PR intentionally does not implement it; |
- Apply ruff 0.11.2 formatting to the files flagged by `ruff format --check`. - The new fail-fast guard in get_generation_devices() (reject a CUDA device that doesn't exist) made the pre-existing test_get_generation_devices_explicit_list_is_deduplicated fail on CPU-only CI runners, since it passes a cuda list with no CUDA present. Mock torch.cuda.is_available/device_count in that test (matching the existing pattern in this file) so it validates dedup on any runner. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There are some more issues that need resolving:
Three RAM fixes for multi-GPU (and one that helps single-GPU too), addressing transient spikes to ~100% RAM and swapping during text-encode/transformer loads:

1. Cap the global RAM-cache budget at a safe fraction of system RAM. When max_cache_ram_gb is unset, the budget was the *sum* of the per-device cache heuristics, so N GPUs each claiming ~50% of RAM summed to ~N*50% and starved the OS. Now clamp the sum to ModelCache.calc_system_ram_headroom_bytes() (50% of RAM - 2GB baseline, floored at 4GB). Promote the sizing magic numbers to named constants shared by the per-device heuristic and the global cap.
2. Adopt already-resident CPU weights across devices at load time. When a second device loads a model another device already holds, deep-copy a registered meta-weight structural clone and assign the shared canonical weights, instead of re-reading the model from disk and materializing a full transient second copy. Loader-agnostic (one mechanism in ModelLoader, no per-loader code): works for diffusers, single-file checkpoint, GGUF and transformers models, and preserves registered hooks (e.g. fp8 layerwise-cast). Best-effort with a meta-tensor self-check and fallback to a normal disk load on any failure. Skipped on single-device installs.
3. Dequantize FLUX.2 FP8 checkpoints straight to bf16. _dequantize_fp8_weights materialized the whole model in float32 (~36GB for 9B) before a later cast to bf16; now the multiply is done in float32 but stored bf16 per-weight, so the model is never held in float32. Numerically identical; halves the cold-load transient (helps single-GPU too).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
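The budget clamp in the first fix is simple arithmetic and can be sketched as below. The constants come from the commit message (50% of RAM, 2GB baseline, 4GB floor); the function names, and the choice of 2**30 bytes per GB, are assumptions for illustration and may not match `ModelCache.calc_system_ram_headroom_bytes()`.

```python
from typing import List

GB = 2**30                # assumption: binary gigabytes
RAM_FRACTION = 0.5        # from the commit message: 50% of system RAM
BASELINE_BYTES = 2 * GB   # from the commit message: 2GB baseline
FLOOR_BYTES = 4 * GB      # from the commit message: floored at 4GB


def system_ram_headroom_bytes(total_ram_bytes: int) -> int:
    """50% of RAM minus the baseline, never below the floor."""
    return max(int(total_ram_bytes * RAM_FRACTION) - BASELINE_BYTES, FLOOR_BYTES)


def global_cache_budget_bytes(per_device_bytes: List[int], total_ram_bytes: int) -> int:
    """Sum of the per-device heuristics, clamped to the system-wide headroom."""
    return min(sum(per_device_bytes), system_ram_headroom_bytes(total_ram_bytes))
```

On a 64GB machine with two GPUs each claiming 30GB, the unclamped sum would be 60GB; under these assumptions the clamp brings the global budget back to 30GB.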
Lots of changes in commit 2d3802a . Previously each GPU had its own RAM cache, which meant that the same model could be loaded and stay resident twice, doubling the amount of RAM needed. These changes:
The Qwen Image VAE encode/decode invocations called model_on_device() without a working-memory estimate, unlike every other VAE family (SD/SDXL/SD3/CogView4/FLUX). So the model cache reserved only its small default working memory, never offloaded a large resident transformer (the VAE weights themselves are tiny), and the VAE's forward-pass activations then OOM'd VRAM — e.g. a ~40GB Qwen Image Edit transformer left ~1GB free while decode needed ~5GB. Reproduces single-GPU; unrelated to the multi-GPU RAM work. Add estimate_vae_working_memory_qwen_image() (same per-output-pixel scaling as the other estimators, handling the 5D Qwen latents) and pass it from both the i2l (encode, used for reference images in Image Edit) and l2i (decode) nodes, so the cache offloads the transformer before the VAE runs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The FLUX.2 VAE encoder's mid-block self-attention scales quadratically with the input's spatial size, and on ROCm scaled_dot_product_attention falls back to a materialized attention matrix. Encoding a reference image (kontext) at full size therefore allocated ~15GB in a single attention call at 1024px — and hundreds of GB at the 2024px reference cap — OOMing VRAM regardless of how much other model memory was freed. Tile the reference-image encode to bound per-tile attention. The VAE's default tile size equals its sample_size (1024), whose per-tile attention still OOMs, so force a 512px tile (with a matching latent tile size derived from the config). Save/restore the VAE's tiling config since it is a shared, cached instance, so the final image decode does not inherit these settings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ModelCache._get_vram_in_use() called torch.cuda.memory_allocated() with no device argument, while _get_vram_available() reads memory_allocated(execution_device). The formula relies on those two canceling. In multi-GPU mode each worker calls torch.cuda.set_device for its own GPU, so the process-current device flips between workers; the no-argument call can then read a different (e.g. idle) GPU's allocation, breaking the cancellation and inflating "available" VRAM toward the card total. The cache then believes there is room and never offloads, so VRAM offloading effectively ignores device_working_mem_gb in multi-GPU. Single-GPU was unaffected (current device always equals the execution device). Query self._execution_device in both _get_vram_in_use() and the cache-state debug log. Add a regression test asserting the per-cache execution device is used. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… decode peak The Qwen Image VAE is a 3D-conv (video) VAE whose decode allocates large conv3d feature maps. A ~1MP decode was measured to peak at ~17 GiB of VRAM — far above what the generic 2200/1100 SD/FLUX constants reserved (~4.6 GiB), so the cache concluded the decode "fit" alongside the resident 20GB transformer + 15GB text encoder, never offloaded them, and OOMed. The offload only frees ~(working_mem - free) bytes, so the reservation must both cover the real peak and be large enough to trigger the offload of models the decode doesn't need. Raise the Qwen decode/encode constants (13000/6500) to match the measured peak. It's linear in output pixels, so it over-reserves past ~1.5MP (where the decode can exceed the card even after offloading) — that case is covered by force_tiled_decode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Qwen Image latents-to-image node hardcoded vae.disable_tiling(), ignoring the global force_tiled_decode setting that the SD/SDXL l2i node honors. Wire it up the same way so users can opt into tiled VAE decode for very large outputs that exceed VRAM even after the transformer/text encoder are offloaded. Off by default, so normal-size decodes are unchanged (full-frame, no tile blending). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The preview-panel progress circle re-renders on every InvocationProgressEvent. The parent passes a fresh progressEvent object each event, so the CircularProgress re-rendered constantly; during the indeterminate phases (everything except denoising) that restarted its CSS spin animation each time, which looked like the disk flashing. (Determinate denoising was unaffected because the value genuinely changes per step.) Split the circle into a memoized, ref-forwarding subcomponent keyed on its visual props (isIndeterminate, value, device label) so message-only updates no longer re-render it and the spin animation stays continuous. The Tooltip still anchors to it via the forwarded ref. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
This PR adds multi-GPU parallel generation: on a machine with more than one GPU, InvokeAI runs several generation sessions concurrently — one per GPU — instead of draining the queue one job at a time. Jobs are distributed fairly across users so a single user's large batch can't monopolize every GPU while others wait.
It's controlled by a new `generation_devices` config setting (defaults to `auto` = use every available CUDA GPU). Setting it to a single device, or leaving CUDA out of the picture, preserves the previous serial behavior exactly. The choice of GPUs can also be controlled via a new section of the Settings dialogue (restart required to take effect).
Demo (turn on the sound!)
invoke-mgpu.mp4
How it works. The change is built around five small backend seams plus a frontend update, rather than per-node edits:
- Per-thread device context (`invokeai/backend/util/devices.py`): a thread-local `set/get/clear_session_device` on `TorchDevice`; `choose_torch_device()` consults it first. This is the lynchpin: the ~79 existing call sites resolve to the worker's GPU with no per-node changes.
- Per-device model caches (`model_manager_default`, `model_load_default`): one `ModelCache` per device, resolved by the current thread's device, with fan-out for clear/drop/shutdown. Model construction is serialized against VRAM moves to prevent meta-device corruption.
- Atomic concurrent dequeue (`session_queue_sqlite.dequeue`): a lock makes select+claim atomic so concurrent workers never grab the same queue item.
- Worker pool (`session_processor_default`): one `_SessionWorker` per device, each pinning `torch.cuda.set_device` + the session device, with its own runner and cancel event; cancellation is routed per item. Profiling is disabled when more than one worker is active.
- Thread safety: added a `Lock` to `ObjectSerializerForwardCache` and made `DiskImageFileStorage` thread-safe for parallel sessions.

Frontend: during parallel generation the progress display stacks one progress bar per active session (each disappears as its session finishes), and the image viewer tiles per-session progress previews when ≥2 sessions are active.
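The atomic dequeue seam listed above can be sketched with an in-memory SQLite table and a process-level lock. The table schema, `make_queue` and `dequeue` here are hypothetical and far simpler than the real session queue; the point is only that the SELECT and the claiming UPDATE happen under one lock.

```python
import sqlite3
import threading
from typing import Optional

# Hypothetical process-wide lock that makes select+claim one atomic step.
_dequeue_lock = threading.Lock()


def make_queue(n_items: int) -> sqlite3.Connection:
    """Create a toy queue table with n pending items (ids start at 1)."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(
        "CREATE TABLE queue (item_id INTEGER PRIMARY KEY, status TEXT, device TEXT)"
    )
    conn.executemany("INSERT INTO queue (status) VALUES (?)", [("pending",)] * n_items)
    conn.commit()
    return conn


def dequeue(conn: sqlite3.Connection, device: str) -> Optional[int]:
    """Claim the oldest pending item for this device, or return None if empty."""
    with _dequeue_lock:
        row = conn.execute(
            "SELECT item_id FROM queue WHERE status = 'pending' "
            "ORDER BY item_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Still under the lock, so no other worker can select the same row.
        conn.execute(
            "UPDATE queue SET status = 'in_progress', device = ? WHERE item_id = ?",
            (device, row[0]),
        )
        conn.commit()
        return row[0]
```

Without the lock, two workers could both select the same pending row before either marks it in progress, which is the double-claim the PR's dequeue lock is meant to rule out.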
Related Issues / Discussions
QA Instructions
On a multi-GPU machine:
- With the default (`generation_devices: auto`), enqueue a batch larger than the GPU count and confirm multiple sessions run simultaneously (one per GPU), with stacked progress bars and tiled previews in the viewer.
- Set `generation_devices: [cuda:0]` and confirm generation runs serially, exactly as before this PR.
- Set `generation_devices: [cuda:0, cuda:2]` and confirm only those devices are used.
- On a machine without multiple GPUs, confirm `auto` resolves to the one best device and behavior is unchanged.

New automated tests cover device routing (`test_model_load_device_routing.py`), dequeue concurrency (`test_session_queue_dequeue_concurrency.py`), and device resolution (`test_devices.py`).
Merge Plan
Standard merge. Adds one DB schema migration (migration_33, a `device` column on session_queue); no redux migrations. Touches the session processor and model cache, so worth a careful look from those areas' owners.
Checklist
What's New copy (if doing a release after this PR)