feat: multi-GPU parallel session execution by lstein · Pull Request #9263 · invoke-ai/InvokeAI

lstein · 2026-06-03T02:13:05Z

Summary

This PR adds multi-GPU parallel generation: on a machine with more than one GPU, InvokeAI runs several generation sessions concurrently — one per GPU — instead of draining the queue one job at a time. Jobs are distributed fairly across users so a single user's large batch can't monopolize every GPU while others wait.

It's controlled by a new generation_devices config setting (defaults to auto = use every available CUDA GPU). Setting it to a single device, or leaving CUDA out of the picture, preserves the previous serial behavior exactly. The choice of GPUs can also be controlled via a new section of the Settings dialogue (restart required to take effect).

Demo (turn on the sound!)

invoke-mgpu.mp4

How it works — the change is built around five small backend seams plus a frontend update, rather than per-node edits:

Device context (invokeai/backend/util/devices.py): a thread-local set/get/clear_session_device on TorchDevice; choose_torch_device() consults it first. This is the lynchpin — the ~79 existing call sites resolve to the worker's GPU with no per-node changes.
Per-device model caches (model_manager_default, model_load_default): one ModelCache per device, resolved by the current thread's device, with fan-out for clear/drop/shutdown. Model construction is serialized against VRAM moves to prevent meta-device corruption.
Atomic dequeue (session_queue_sqlite.dequeue): a lock makes select+claim atomic so concurrent workers never grab the same queue item.
Worker pool (session_processor_default): one _SessionWorker per device, each pinning torch.cuda.set_device + the session device, with its own runner and cancel event; cancellation is routed per item. Profiling is disabled when more than one worker is active.
Concurrency hardening: added a Lock to ObjectSerializerForwardCache and made DiskImageFileStorage thread-safe for parallel sessions.
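
As a rough illustration of the device-context seam, the sketch below pins a device per thread and has a selector consult that pin first. It uses plain strings and `threading.local` rather than torch; the function names mirror the PR's description, but `choose_device` and `run_worker` are hypothetical stand-ins, not the actual `TorchDevice` code.

```python
import threading
from typing import Dict, Optional

# Hypothetical stand-in for the per-thread session device described above.
_session = threading.local()


def set_session_device(device: str) -> None:
    """Pin a device for the calling thread only."""
    _session.device = device


def get_session_device() -> Optional[str]:
    return getattr(_session, "device", None)


def clear_session_device() -> None:
    if hasattr(_session, "device"):
        del _session.device


def choose_device(default: str = "cpu") -> str:
    """Consult the per-thread pin first, then fall back to the process default."""
    return get_session_device() or default


def run_worker(device: str, results: Dict[str, str]) -> None:
    # Each worker pins its own device; existing call sites just call choose_device().
    set_session_device(device)
    try:
        results[device] = choose_device()
    finally:
        clear_session_device()


results: Dict[str, str] = {}
threads = [threading.Thread(target=run_worker, args=(d, results)) for d in ("cuda:0", "cuda:1")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the pin is thread-local, the main thread (which never pinned anything) still resolves to the default, which is why unmodified call sites keep their serial behavior.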

Frontend: during parallel generation the progress display stacks one progress bar per active session (each disappears as its session finishes), and the image viewer tiles per-session progress previews when ≥2 sessions are active.

Related Issues / Discussions

Builds conceptually alongside #9086, "[feat] Round robin job scheduling in multiuser mode" (round-robin per-user dequeue). This PR does not depend on #9086 being merged: the atomic claim works on the current FIFO dequeue, and #9086's round-robin CTE can slot in cleanly when it lands.

QA Instructions

On a multi-GPU machine:

With default config (generation_devices: auto), enqueue a batch larger than the GPU count and confirm multiple sessions run simultaneously (one per GPU), with stacked progress bars and tiled previews in the viewer.
Set generation_devices: [cuda:0] and confirm generation runs serially, exactly as before this PR.
Set generation_devices: [cuda:0, cuda:2] and confirm only those devices are used.
Cancel an in-flight item and confirm only that session stops.
On a single-GPU / CPU / MPS machine, confirm auto resolves to the one best device and behavior is unchanged.

New automated tests cover device routing (test_model_load_device_routing.py), dequeue concurrency (test_session_queue_dequeue_concurrency.py), and device resolution (test_devices.py).

Merge Plan

Standard merge. No DB schema or redux migrations. Touches the session processor and model cache, so worth a careful look from those areas' owners.

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
❗Changes to a redux slice have a corresponding migration — N/A, no slice changes
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)

Run one generation session per configured GPU concurrently, with a tiled progress preview. Multi-user isolation is unchanged. Backed by five seams:

- Per-thread device context (TorchDevice.set/get/clear_session_device); choose_torch_device() consults it first, so all device-selecting call sites resolve to the calling worker's GPU with no per-node changes.
- Per-device model caches: build_model_manager builds one ModelCache per generation device; ModelLoadService.ram_cache resolves by current thread device; ram_caches fans out clear/drop/shutdown.
- Atomic concurrent dequeue: a dequeue lock makes select+claim atomic so concurrent workers never claim the same item (works on FIFO; round-robin from invoke-ai#9086 slots in later).
- Worker pool: one _SessionWorker per device, each pinning torch.cuda.set_device and its session device, with its own runner and cancel event; cancellation routes via an {item_id -> worker} lookup. Single-device installs keep the exact legacy single-worker behavior. Profiling disabled when >1 worker.
- New config `generation_devices`; unset = legacy single-worker mode.

Frontend: the canvas staging area already tiles per queue item; the main ImageViewer now tracks progress per session and renders a tile grid (ProgressImageTiles) when more than one session is active.

Also adds a lock to ObjectSerializerForwardCache for concurrent access.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
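
A minimal sketch of the atomic select+claim idea, assuming a simplified `queue` table and an in-memory SQLite database. The schema, `dequeue()` body, and `worker()` loop here are illustrative only and are not the code in session_queue_sqlite.

```python
import sqlite3
import threading

# Illustrative schema only; the real session_queue table has many more columns.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE queue (item_id INTEGER PRIMARY KEY, priority INTEGER, status TEXT)")
conn.executemany("INSERT INTO queue (priority, status) VALUES (?, 'pending')", [(0,)] * 20)
conn.commit()

dequeue_lock = threading.Lock()


def dequeue():
    """Select the next pending item and claim it as one atomic step."""
    with dequeue_lock:
        row = conn.execute(
            "SELECT item_id FROM queue WHERE status = 'pending' "
            "ORDER BY priority DESC, item_id ASC LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE queue SET status = 'in_progress' WHERE item_id = ?", (row[0],))
        conn.commit()
        return row[0]


claimed = []  # list.append is atomic in CPython, so workers can share this list


def worker() -> None:
    # Keep claiming items until the queue is drained.
    while (item_id := dequeue()) is not None:
        claimed.append(item_id)


workers = [threading.Thread(target=worker) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Without the lock, two workers could both run the SELECT before either runs the UPDATE and end up processing the same item.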

test_model_load_device_routing mutated the process-wide get_config() singleton (device = "cuda:0") to exercise the per-thread cache routing, but never restored it. The leaked CUDA device was then picked up by a later test (test_model_load::test_loading) via choose_torch_device(), which crashed with "Torch not compiled with CUDA enabled" on the CUDA-less CI runner. Add an autouse fixture to save/restore device and clear any pinned session device. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…n_devices Regenerate openapi.json (make frontend-openapi) and the frontend schema.ts types (make frontend-typegen) so they include the new generation_devices config field, fixing the openapi-checks and typegen-checks CI jobs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`make frontend-openapi` used a bare `python` from a different environment that emitted the CacheStats @DataClass docstring as a schema description. CI generates the schema via `uv run`, which does not, so openapi-checks failed on the diff. Regenerate with the uv-locked environment to drop the stray description while keeping the generation_devices field. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…o prevent meta-device corruption

Parallel multi-GPU session workers could intermittently crash with "unrecognized device meta" (denoise) or "Cannot copy out of meta tensor; no data!" (l2i), because model loading relies on process-global, non-thread-safe monkey-patches. accelerate.init_empty_weights() (used directly by the loaders and implicitly by diffusers' default low_cpu_mem_usage=True in from_pretrained) swaps torch.nn.Module.register_parameter globally for the duration of a load, routing every newly-registered parameter to the meta device. The model cache's VRAM load/unload runs nn.Module.load_state_dict(assign=True), whose assign path does setattr -> __setattr__ -> register_parameter. When one worker's VRAM move overlapped another worker's from_pretrained, the move's real weights got hijacked onto meta and blew up on the next .to(device).

Introduce MODEL_LOAD_LOCK, a write-preferring readers-writer lock:

- write lock = model construction (_load_and_cache, load_model_from_path), exclusive.
- read lock = VRAM load/unload (ModelCache.lock(), repair_required_tensors_on_device). VRAM transfers across GPUs still overlap each other; they only block while a construction holds the write lock.

The lock is always acquired before any per-cache lock to keep a consistent order and avoid an AB-BA deadlock with the writer's make_room/put.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
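
For readers unfamiliar with the pattern, here is one way a write-preferring readers-writer lock can be built on `threading.Condition`. This is a generic sketch under that assumption; the class name and internals are hypothetical and are not taken from MODEL_LOAD_LOCK.

```python
import threading
from contextlib import contextmanager


class WritePreferringRWLock:
    """Many concurrent readers, one exclusive writer; queued writers block new readers."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_active = False
        self._writers_waiting = 0

    @contextmanager
    def read_lock(self):
        with self._cond:
            # New readers wait while a writer holds the lock or is queued for it.
            while self._writer_active or self._writers_waiting > 0:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write_lock(self):
        with self._cond:
            self._writers_waiting += 1
            # The writer waits for active readers and any other writer to finish.
            while self._writer_active or self._readers > 0:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()


lock = WritePreferringRWLock()
events = []
with lock.read_lock():
    with lock.read_lock():  # two readers (e.g. two VRAM moves) overlap
        events.append(lock._readers)
with lock.write_lock():  # a construction runs alone
    events.append("write")
```

In the PR's terms, VRAM load/unload would take the read side and model construction the write side, so transfers overlap each other but never overlap a construction.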

…ions Image.open() is lazy: it reads the header but defers pixel decoding (and holds the file handle open) until the first .load()/.copy()/.convert(). The opened object was cached and the same object handed to every caller, so in multi-GPU parallel mode two session-processor worker threads could call .copy() on it concurrently and race on the shared file handle and decoder state. This surfaced as "broken data stream when reading image file" and "AssertionError: self.png is not None" during inpainting with batch >1. Force the decode (image.load()) before the object enters the cache so the cached object is safe for concurrent reads, and guard the cache structures (__cache / __cache_ids) with a lock since they are now mutated from multiple threads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The generation progress bars (under the Invoke button and the Viewer tab) both read a single global $lastProgressEvent atom, which every session overwrites. With parallel multi-GPU sessions this made the bar jump back and forth between sessions. Track progress per queue item id and render one bar per in-flight session, stacked vertically, each removed as its session reaches a terminal state.

- stores.ts: add $progressEvents (map keyed by item_id), $activeProgressEvents (sorted), and set/clear helpers.
- setEventListeners.tsx: populate per-item progress on invocation_progress; clear per item on terminal status; clear all on connect/disconnect/queue cleared.
- ProgressBar.tsx: render a vertical stack of bars (one per active session) with a single-bar fallback for the idle / model-loading window; add containerProps so dockview tabs can position the stack.
- Dockview tab call sites: move positioning into containerProps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

$progressEvents is only referenced within stores.ts (via the $activeProgressEvents computed and the set/clear helpers), so exporting it tripped knip's unused-exports check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

With 4 GPUs the stacked per-session progress bars grew past the bottom strip of the dockview tab and overlapped the "Viewer" label. Add a fitHeightPx prop: in fit mode the stack is capped to the available strip (10px below the ~40px tab's centered label) and the bars flex to share it, shrinking below their natural height only once they no longer fit. With 1-2 sessions the bars keep their familiar thin height; with 3+ they scale down to stay within the strip. The sidebar bar is unaffected and continues to stack at natural height (it has the vertical room). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…fault

generation_devices now accepts "auto" (the new default), which expands to every visible CUDA device, so multi-GPU parallel generation works out of the box without manually listing devices. On GPU-less systems "auto" resolves to the single cpu/mps device, preserving serial behavior.

- config_default.py: type is now Union[Literal["auto"], list[str]], default "auto"; validator accepts "auto" or a list of device strings.
- devices.py: add TorchDevice.get_generation_devices(), the single resolver that expands "auto", normalizes, and deduplicates.
- session_processor / model_manager: both consumers use the resolver instead of iterating the raw config value (which would have iterated the characters of the "auto" string).
- Regenerated docs/src/generated/settings.json.
- Tests for the resolver (auto-with/without-CUDA, dedup, empty).

An explicit single-device list (e.g. [cuda:0]) or an empty list opts out of parallelism.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
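
The resolver behavior described in this commit might look roughly like the sketch below. It takes the CUDA device count as a parameter instead of querying torch, and the function name, the lowercase normalization, and the bare-"cuda" handling are assumptions made for illustration, not the actual TorchDevice.get_generation_devices() logic.

```python
from typing import List, Union


def resolve_generation_devices(
    setting: Union[str, List[str]],
    cuda_device_count: int,
    fallback: str = "cpu",
) -> List[str]:
    """Expand "auto", normalize device strings, and drop duplicates (order preserved)."""
    if setting == "auto":
        if cuda_device_count > 0:
            return [f"cuda:{i}" for i in range(cuda_device_count)]
        return [fallback]  # GPU-less system: one device, serial behavior
    if isinstance(setting, str):
        # Guard against iterating the characters of a bare string.
        raise ValueError('generation_devices must be "auto" or a list of device strings')
    resolved: List[str] = []
    for device in setting:
        normalized = device.strip().lower()
        if normalized == "cuda":  # assumption: a bare "cuda" means the first GPU
            normalized = "cuda:0"
        if normalized not in resolved:
            resolved.append(normalized)
    # As of this commit an empty list opts out of parallelism (single device).
    return resolved or [fallback]
```

For example, `resolve_generation_devices(["cuda:0", "CUDA:0", "cuda:2"], 4)` collapses the duplicate and keeps the listed order.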

Render device badges as "cuda:0 (RTX 3090 #1)" so identical cards can be told apart. Strips the "NVIDIA GeForce" vendor prefix and adds a 1-based "#N" suffix only when multiple cards share a name. The full device name remains available as the badge tooltip. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
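
A small sketch of the badge-labeling rule, assuming the "#N" suffix counts cards per shared name in device order. The helper `device_badges` and that numbering choice are illustrative guesses, not the frontend code.

```python
from collections import Counter
from typing import Dict

VENDOR_PREFIX = "NVIDIA GeForce "


def device_badges(names: Dict[str, str]) -> Dict[str, str]:
    """Map device id -> badge text, numbering only cards that share a name."""
    short = {
        dev: (name[len(VENDOR_PREFIX):] if name.startswith(VENDOR_PREFIX) else name)
        for dev, name in names.items()
    }
    totals = Counter(short.values())
    seen: Counter = Counter()
    badges: Dict[str, str] = {}
    for dev, name in short.items():
        if totals[name] > 1:
            seen[name] += 1  # 1-based ordinal among identically named cards
            badges[dev] = f"{dev} ({name} #{seen[name]})"
        else:
            badges[dev] = f"{dev} ({name})"
    return badges


badges = device_badges(
    {
        "cuda:0": "NVIDIA GeForce RTX 3090",
        "cuda:1": "NVIDIA GeForce RTX 3090",
        "cuda:2": "NVIDIA GeForce RTX 4090",
    }
)
```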

Help users track which CUDA device is processing each session:

- Model-load log: "Loaded model ... onto cuda device #N in ..s"
- Denoise progress bars: "Denoising (#N)" across all architectures (SD1.5/SDXL, FLUX, FLUX2, Z-Image, Anima, SD3, CogView4)
- Progress preview circle: GPU number centered in the ring, via a new `device` field on InvocationProgressEvent (resolved from the worker's thread-local session device)
- Session Queue: new "GPU #" column between STATUS and TIME, backed by a `device` column on session_queue (migration_32) recorded when a worker claims an item

Adds TorchDevice.get_session_device_label()/get_session_device_index() helpers and a frontend getCudaDeviceIndex() parser (with tests). Shows the number on CUDA only; CPU/MPS show nothing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
# invokeai/frontend/web/src/features/gallery/components/ImageViewer/CurrentImagePreview.tsx
# invokeai/frontend/web/src/features/gallery/components/ImageViewer/context.tsx
# invokeai/frontend/web/src/services/events/setEventListeners.tsx

JPPhoto · 2026-06-23T14:25:04Z

I can take a look at this but I don't have the ability to test it. Recommend assigning to a second person as well.

Resolve migration_32 conflict: main's migration_32 (model_relationships FK repair) is kept, and the multi-gpu device-column migration is moved to migration_33 and registered after it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rcles

- Startup log lists each generation device with its GPU number and id, e.g. "Using torch device: [AMD Radeon PRO W7900 #1 (cuda:0), ...]". Single-device setups keep the bare device name.
- Canvas progress circles now show the CUDA device index in the center, matching the viewer panel.
- Progress-circle tooltips show the device name and number on hover.
- Both are hidden when only a single GPU is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lstein · 2026-06-25T12:30:34Z

I have converted back to a draft temporarily while I work out excessive RAM consumption issues.

JPPhoto · 2026-06-25T13:43:52Z

Some comments on the current code (not posted as a review):

invokeai/invokeai/app/services/session_queue/session_queue_sqlite.py:478 and :651

Bulk cancellation only cancels one active item in multi-GPU mode. The new processor starts one worker per device and each worker calls dequeue() independently, so there can be multiple in_progress rows at the same time. But cancel_by_batch_ids(), cancel_by_destination(), and cancel_by_queue_id() all exclude status = 'in_progress' from the bulk update, then call get_current() once. get_current() uses LIMIT 1 at line 287, so only one active item is moved to canceled; other active items in the same queue/batch/destination continue running.

Trigger: two GPUs are running two items from the same batch or destination, then the user cancels that batch/destination or queue.

Consequence: cancellation appears to succeed but at least one generation keeps consuming GPU and can still produce output after the user requested cancellation.

To expose this issue, add a test that dequeues two items into in_progress with the same batch/destination, calls each bulk cancel API, and asserts all matching in_progress items become canceled and emit cancel events.

invokeai/invokeai/app/services/session_queue/session_queue_sqlite.py:234

The documented per-user fairness is not implemented. The PR docs say a single user's large batch cannot monopolize every GPU, but dequeue() still orders only by priority DESC, item_id ASC. With two workers, if user A enqueues a large batch and user B enqueues after it, every worker keeps claiming A's lower item_id rows until A's older batch is drained. There is no user grouping, round-robin CTE, or last-served-user state in this PR.

Trigger: multiuser mode, multiple GPUs, user A enqueues many items before user B enqueues one item at the same priority.

Consequence: user B can wait behind user A's entire older batch, contradicting the user-facing behavior described in docs/src/content/docs/configuration/invokeai-yaml.mdx:119.

To expose this issue, add a queue test that enqueues many pending items for user A, then one for user B, simulates repeated concurrent dequeue() calls, and asserts the second user is selected before A monopolizes every worker cycle.

… caches

In multi-GPU mode the model manager builds one ModelCache per generation device, each with storage_device="cpu" and its own RAM-resident copy of every model. A model loaded on N GPUs therefore occupied N copies in RAM, and each cache sized itself against max_cache_ram_gb independently, so RAM use during the text/reference-image encoding phases skyrocketed and the system swapped; this was worst when two images rendered at once. This deduplicates the CPU-resident weights and makes RAM accounting global.

- SharedCpuWeightsStore: process-/manager-global, refcounted store of one canonical CPU state_dict per model key. The first device to load a key registers its weights; subsequent devices adopt the canonical tensors and re-point their module's params at them (load_state_dict(assign=True)), freeing the duplicate. Weights live once in RAM regardless of GPU count; freed only when the last device releases. Per-device modules are kept (params are device-shuffled in place, so two GPUs need two modules), but their CPU-resident params alias the shared tensors.
- RamBudget: single system-wide RAM authority. Splits RAM into shared (counted once via the store) and non-shared (per-instance). ModelCache eviction now runs against the global, deduplicated total and re-checks availability each iteration, since evicting a model another device still holds frees no RAM. build_model_manager wires one store + one budget into all device caches; the cap is max_cache_ram_gb as a true system-wide limit, else the sum of per-cache heuristics. Passing ram_budget=None preserves the prior local accounting.
- LoRA/patch safety: direct LoRA patching did an in-place copy_ on the weight, which would corrupt the now-shared canonical tensor (and taint keep_ram_copy even with one GPU) when patching a CPU-resident weight. Switched to an out-of-place add (memory-equivalent) so the canonical tensor is never mutated; fixed the FluxControlLoRA expansion path to target the module's live parameter. Sidecar patching and FreeU/Seamless (which patch forward methods) were already safe.

Validated on 2x AMD W7900 / ROCm: correct inference on both GPUs from one shared copy (full + partial load + Q8_0 GGUF quantized), concurrent load/unload without corruption, and LoRA isolation across devices. ~40 new tests; existing suites unchanged. Adds scripts/multigpu_ram_driver.py to drive concurrent dual-GPU generations via the queue API and measure peak RSS / leak drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
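
The refcounting idea behind a shared weights store can be sketched as follows, using plain Python objects instead of tensors. `RefcountedStore` and its methods are hypothetical names for illustration and do not reproduce SharedCpuWeightsStore.

```python
import threading
from typing import Any, Dict


class RefcountedStore:
    """Keep one canonical object per key; free it when the last holder releases."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, Any] = {}
        self._refs: Dict[str, int] = {}

    def acquire(self, key: str, candidate: Any) -> Any:
        """Return the canonical object; the first caller's candidate becomes canonical."""
        with self._lock:
            if key not in self._entries:
                self._entries[key] = candidate
                self._refs[key] = 0
            self._refs[key] += 1
            return self._entries[key]

    def release(self, key: str) -> bool:
        """Drop one reference; return True if the entry was freed."""
        with self._lock:
            self._refs[key] -= 1
            if self._refs[key] == 0:
                del self._entries[key]
                del self._refs[key]
                return True
            return False


store = RefcountedStore()
weights_gpu0 = {"w": [1.0, 2.0]}  # loaded by the first device
weights_gpu1 = {"w": [1.0, 2.0]}  # duplicate loaded by the second device
canonical_0 = store.acquire("model-key", weights_gpu0)
canonical_1 = store.acquire("model-key", weights_gpu1)  # adopts the first copy
freed_after_first = store.release("model-key")
freed_after_second = store.release("model-key")
```

The second device receives the first device's object back and can drop its own duplicate; the entry disappears only after both holders release.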

…(multi-GPU) With one session-processor worker per device, multiple queue items can be in_progress at once. cancel_by_batch_ids(), cancel_by_destination() and cancel_by_queue_id() excluded in_progress rows from their bulk UPDATE and then canceled only the single get_current() item (LIMIT 1), so on multi-GPU the other running items kept consuming a GPU and could still produce output after the user requested cancellation. Each running item must be canceled via _set_queue_item_status(), which emits the QueueItemStatusChangedEvent that the processor maps to the worker running that item_id and uses to set its cancel event. Add _cancel_in_progress_matching() to cancel every in-progress item matching the same filter (with user-id scoping preserved) and call it from all three bulk-cancel methods. The returned `canceled` count now includes canceled in-progress items. Adds regression tests that dequeue two items onto separate devices and assert every bulk cancel API moves all matching in_progress items to canceled and emits a cancel event for each (and that user-scoped cancel leaves another user's in-progress item running). Reported by JPPhoto in review of invoke-ai#9263. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
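
To make the fix concrete, here is a simplified sketch of canceling every matching in-progress row one by one (so each emits its own status event) after bulk-canceling the pending rows. The table layout, `set_status`, and `cancel_by_batch` are assumptions for illustration, not the actual _cancel_in_progress_matching() implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue (item_id INTEGER PRIMARY KEY, batch_id TEXT, user_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO queue (batch_id, user_id, status) VALUES (?, ?, ?)",
    [
        ("b1", "alice", "in_progress"),  # running on one GPU
        ("b1", "alice", "in_progress"),  # running on another GPU
        ("b1", "alice", "pending"),
        ("b1", "bob", "in_progress"),  # another user's item must keep running
    ],
)

events = []  # stands in for the emitted status-changed events


def set_status(item_id: int, status: str) -> None:
    """Per-item update that also emits an event, so the owning worker can react."""
    conn.execute("UPDATE queue SET status = ? WHERE item_id = ?", (status, item_id))
    events.append((item_id, status))


def cancel_by_batch(batch_id: str, user_id: str) -> int:
    # Pending rows can be canceled with one bulk UPDATE.
    cur = conn.execute(
        "UPDATE queue SET status = 'canceled' "
        "WHERE batch_id = ? AND user_id = ? AND status = 'pending'",
        (batch_id, user_id),
    )
    canceled = cur.rowcount
    # Every running row is canceled individually, not just the first one found.
    rows = conn.execute(
        "SELECT item_id FROM queue "
        "WHERE batch_id = ? AND user_id = ? AND status = 'in_progress' ORDER BY item_id",
        (batch_id, user_id),
    ).fetchall()
    for (item_id,) in rows:
        set_status(item_id, "canceled")
    conn.commit()
    return canceled + len(rows)


canceled_count = cancel_by_batch("b1", "alice")
bob_status = conn.execute("SELECT status FROM queue WHERE item_id = 4").fetchone()[0]
```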

…vice guards, refcount leak)

Fixes from the code review of PR invoke-ai#9263:

- Cancellation could be silently lost around dequeue: the per-iteration worker.cancel_event.clear() ran AFTER dequeue + gc.collect() + logging, so a cancel arriving in that window was set by the status handler and then wiped. Move the clear to before dequeue, and after claiming an item re-check (cancel_event + a fresh DB status read via _is_queue_item_terminal) and skip running if it is already terminal, closing both race windows. The runner's stale queue_item.status check could not catch this.
- delete_by_destination only stopped one in-progress item (get_current) before deleting all matching rows, leaving other GPU workers running (and then failing to update a deleted row). Cancel every matching in-progress item via _cancel_in_progress_matching first.
- generation_devices validation: a bare non-"auto" string (e.g. "cuda:0") was iterated character-by-character; an empty list silently fell back to one device. Reject both with a clear message.
- get_generation_devices now fails fast on a CUDA device that does not exist (index past device_count, or CUDA unavailable) instead of starting a worker that errors cryptically at first allocation.
- Shared-weights wrappers: if the canonical re-point (load_state_dict assign=True) threw after acquire(), the reference was leaked (the wrapper never entered the cache). Compute size metadata first, make acquire the last step, and release on failure.

Adds tests for each: post-dequeue terminal guard, delete_by_destination cancellation, generation_devices validation, absent-device rejection, and acquire-released-on-repoint-failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
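
The ordering fix for the cancel-loss race can be illustrated with a stripped-down worker iteration. `Worker`, `run_one_iteration`, and the callables passed in are hypothetical; the point is only that the event is cleared before the claim and re-checked after it.

```python
import threading
from typing import Callable, List, Optional


class Worker:
    def __init__(self) -> None:
        self.cancel_event = threading.Event()


def run_one_iteration(
    worker: Worker,
    dequeue: Callable[[], Optional[int]],
    is_terminal: Callable[[int], bool],
    run: Callable[[int], None],
) -> str:
    # Clear BEFORE dequeue: a cancel that arrives after the claim is no longer wiped.
    worker.cancel_event.clear()
    item_id = dequeue()
    if item_id is None:
        return "idle"
    # Re-check after claiming: the event plus a fresh status read.
    if worker.cancel_event.is_set() or is_terminal(item_id):
        return "skipped"
    run(item_id)
    return "ran"


worker = Worker()
ran: List[int] = []


def dequeue_with_racing_cancel() -> int:
    # Simulates the status handler firing while the item is being claimed.
    worker.cancel_event.set()
    return 7


outcome_raced = run_one_iteration(worker, dequeue_with_racing_cancel, lambda i: False, ran.append)
outcome_normal = run_one_iteration(worker, lambda: 8, lambda i: False, ran.append)
outcome_terminal = run_one_iteration(worker, lambda: 9, lambda i: True, ran.append)
```

With the old ordering (clear after dequeue), the first call would have erased the cancel and run item 7.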

lstein · 2026-06-26T01:22:21Z

@JPPhoto thanks for the careful review — both points were spot on.

Bulk cancellation in multi-GPU mode — fixed. cancel_by_batch_ids, cancel_by_destination, and cancel_by_queue_id now cancel every matching in_progress item (via a shared _cancel_in_progress_matching helper) instead of only the single get_current() item, so each running worker receives its QueueItemStatusChangedEvent and stops. While in there I found and fixed two siblings:

delete_by_destination had the same one-item limitation (it canceled one, then deleted all matching rows — leaving the other workers running against deleted rows).
A genuine cancel-loss race in the worker loop: cancel_event.clear() ran after dequeue() + gc.collect(), so a cancel arriving in that window got wiped. The clear now happens before dequeue, plus a fresh post-claim status re-check so a cancel that races the claim still skips the run.

All three are covered by new regression tests (two in_progress items dequeued onto separate devices, asserting all matching items are canceled and emit cancel events; user-scoping preserved). They fail on the old code and pass on the fix.

Per-user fairness — correct, this PR intentionally does not implement it; dequeue() is still priority DESC, item_id ASC. The round-robin scheduler is coming in #9086, which I've checked for compatibility with this branch: the two are orthogonal (round-robin picks which pending item; multi-GPU atomically claims it under a lock), and they actually compose well — an in_progress item counts as "recently served" via the existing started_at trigger, so round-robin naturally spreads the concurrent GPU slots across distinct users. The plan is to land #9086 first and rebase this on top (minor migration renumber + a small dequeue() merge), so fairness arrives through that PR rather than being duplicated here.

- Apply ruff 0.11.2 formatting to the files flagged by `ruff format --check`.
- The new fail-fast guard in get_generation_devices() (reject a CUDA device that doesn't exist) made the pre-existing test_get_generation_devices_explicit_list_is_deduplicated fail on CPU-only CI runners, since it passes a cuda list with no CUDA present. Mock torch.cuda.is_available/device_count in that test (matching the existing pattern in this file) so it validates dedup on any runner.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JPPhoto · 2026-06-26T20:37:28Z

There are some more issues that need resolving:

invokeai/invokeai/backend/model_manager/load/model_cache/shared_cpu_weights.py:64

Stale shared CPU weights can survive a model setting change and be adopted by a rebuilt per-device cache. SharedCpuWeightsStore.acquire() returns the existing canonical state dict for the same cache key. drop_model() only marks locked entries stale at invokeai/invokeai/backend/model_manager/load/model_cache/model_cache.py:1068, so their shared-store reference remains live until unlock. If another device cache reloads the same model key before that unlock, the wrapper re-points the newly built module at the old canonical tensors at invokeai/invokeai/backend/model_manager/load/model_cache/cached_model/cached_model_with_partial_load.py:61 and cached_model_only_full_load.py:52. This defeats the invalidation added for load-affecting settings in invokeai/invokeai/app/api/routers/model_manager.py:445.

Trigger: one GPU is running a model while an admin toggles a load-affecting setting such as fp8_storage; another GPU then loads that model before the first GPU unlocks.

Consequence: the "rebuilt" cache entry can silently use the pre-change canonical weights and keep them alive under the same key.

To expose this issue, add a test that loads the same key into two caches sharing a SharedCpuWeightsStore, locks one entry, calls drop_model() on both caches, then puts a changed model under the same key into the unlocked cache and asserts it does not adopt the old canonical tensors.

invokeai/invokeai/app/api/routers/app_info.py:151

The runtime settings API validates only the device string pattern, not the same constraints enforced at startup. UpdateAppGenerationSettingsRequest.validate_generation_devices() accepts values like ["cuda:99"] if they match the regex, and update_runtime_config() persists them at lines 219 and 226. Startup later calls TorchDevice.get_generation_devices() from the model manager path, where unavailable or out-of-range CUDA devices raise at invokeai/invokeai/backend/util/devices.py:200. Empty lists are also accepted by the request model but rejected by InvokeAIAppConfig at invokeai/invokeai/app/services/config/config_default.py:272, producing an unhandled server error instead of a 422.

Trigger: an admin or API client PATCHes /api/v1/app/runtime_config with generation_devices: ["cuda:99"] or [].

Consequence: the server can persist a config that fails on restart, or return a 500 for input the request schema accepted.

To expose this issue, add a test that PATCHes out-of-range CUDA and empty-list values and asserts the API rejects them with 422 without mutating or writing config.

invokeai/invokeai/app/services/session_queue/session_queue_sqlite.py:235

The PR and docs still promise per-user fairness, but dequeue remains strict priority DESC, item_id ASC. The author comment says fairness is intentionally deferred to PR 9086, but this PR's docs still say a single user's large batch cannot monopolize every GPU at invokeai/docs/src/content/docs/configuration/invokeai-yaml.mdx:119 and generated settings repeat that at invokeai/docs/src/generated/settings.json:496.

Trigger: user A enqueues a large batch before user B at the same priority on a multi-GPU system.

Consequence: all workers keep claiming user A's older rows until they drain, so user B can still be starved despite the user-facing claim.

To expose this issue, add a queue test that enqueues many items for user A, then one for user B, calls dequeue() repeatedly for multiple worker slots, and asserts user B is selected before A monopolizes all slots. Otherwise, remove the fairness claim from the docs until PR 9086 is merged.

invokeai/docs/src/content/docs/configuration/invokeai-yaml.mdx:132

The docs advertise generation_devices: [] as a valid serial fallback, but the config validator now rejects empty lists at invokeai/invokeai/app/services/config/config_default.py:272 and tests assert rejection in tests/app/services/config/test_config_generation_devices.py:31. The same docs also say weights are duplicated in RAM per GPU at line 144, while this PR now adds shared CPU weights to avoid that.

Trigger: Set generation_devices: [].

Consequence: InvokeAI rejects the config instead of starting serially as documented.

Expected docs fix: update docs/src/content/docs/configuration/invokeai-yaml.mdx and regenerated settings copy to match the accepted values and the new shared-RAM behavior.

Three RAM fixes for multi-GPU (and one that helps single-GPU too), addressing transient spikes to ~100% RAM and swapping during text-encode/transformer loads:

1. Cap the global RAM-cache budget at a safe fraction of system RAM. When max_cache_ram_gb is unset, the budget was the *sum* of the per-device cache heuristics, so N GPUs each claiming ~50% of RAM summed to ~N*50% and starved the OS. Now clamp the sum to ModelCache.calc_system_ram_headroom_bytes() (50% of RAM - 2GB baseline, floored at 4GB). Promote the sizing magic numbers to named constants shared by the per-device heuristic and the global cap.
2. Adopt already-resident CPU weights across devices at load time. When a second device loads a model another device already holds, deep-copy a registered meta-weight structural clone and assign the shared canonical weights, instead of re-reading the model from disk and materializing a full transient second copy. Loader-agnostic (one mechanism in ModelLoader, no per-loader code): works for diffusers, single-file checkpoint, GGUF and transformers models, and preserves registered hooks (e.g. fp8 layerwise-cast). Best-effort with a meta-tensor self-check and fallback to a normal disk load on any failure. Skipped on single-device installs.
3. Dequantize FLUX.2 FP8 checkpoints straight to bf16. _dequantize_fp8_weights materialized the whole model in float32 (~36GB for 9B) before a later cast to bf16; now the multiply is done in float32 but stored bf16 per-weight, so the model is never held in float32. Numerically identical; halves the cold-load transient (helps single-GPU too).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
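
The budget clamp in item 1 can be read as the arithmetic below. This is one interpretation of "50% of RAM - 2GB baseline, floored at 4GB"; the constant and function names are hypothetical, and the real calc_system_ram_headroom_bytes() may differ in detail.

```python
from typing import List

GIB = 1024**3

# Hypothetical named constants mirroring the figures quoted in the commit message.
RAM_FRACTION = 0.5
BASELINE_BYTES = 2 * GIB
FLOOR_BYTES = 4 * GIB


def system_ram_headroom_bytes(total_ram_bytes: int) -> int:
    """50% of system RAM minus a 2 GiB baseline, never below 4 GiB."""
    return max(int(total_ram_bytes * RAM_FRACTION) - BASELINE_BYTES, FLOOR_BYTES)


def global_ram_budget_bytes(per_device_heuristics: List[int], total_ram_bytes: int) -> int:
    """Clamp the summed per-device heuristics to the system-wide headroom."""
    return min(sum(per_device_heuristics), system_ram_headroom_bytes(total_ram_bytes))


# Two GPUs on a 96 GiB machine, each cache heuristically wanting about half of RAM.
budget = global_ram_budget_bytes([46 * GIB, 46 * GIB], 96 * GIB)
```

Unclamped, the two heuristics would sum to 92 GiB of a 96 GiB machine; with the clamp the shared budget is 46 GiB.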

lstein · 2026-06-26T21:06:54Z

Lots of changes in commit 2d3802a. Previously each GPU had its own RAM cache, which meant that the same model could be loaded and stay resident twice, doubling the amount of RAM needed. These changes:

De-duplicate models so that the same model is resident only once.
Handle the case of a model being modified by a LoRA in one GPU session and not in the other. The RAM copy holds the canonical unmodified model, and the LoRA-modified version exists only transiently in the VRAM of the GPU session that needs it. The same logic applies to reference images and controlnets.
I was able to test on a machine with the interesting configuration of two 48 GB VRAM GPUs and 96 GB RAM. This exposed a bunch of places where model loading was being handled inefficiently and causing RAM spikes. A variety of checks have been implemented to avoid double-loading, OOMs and thrashing.

The Qwen Image VAE encode/decode invocations called model_on_device() without a working-memory estimate, unlike every other VAE family (SD/SDXL/SD3/CogView4/FLUX). So the model cache reserved only its small default working memory, never offloaded a large resident transformer (the VAE weights themselves are tiny), and the VAE's forward-pass activations then OOM'd VRAM — e.g. a ~40GB Qwen Image Edit transformer left ~1GB free while decode needed ~5GB. Reproduces single-GPU; unrelated to the multi-GPU RAM work. Add estimate_vae_working_memory_qwen_image() (same per-output-pixel scaling as the other estimators, handling the 5D Qwen latents) and pass it from both the i2l (encode, used for reference images in Image Edit) and l2i (decode) nodes, so the cache offloads the transformer before the VAE runs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The FLUX.2 VAE encoder's mid-block self-attention scales quadratically with the input's spatial size, and on ROCm scaled_dot_product_attention falls back to a materialized attention matrix. Encoding a reference image (kontext) at full size therefore allocated ~15GB in a single attention call at 1024px (and hundreds of GB at the 2024px reference cap), OOMing VRAM regardless of how much other model memory was freed.
Tile the reference-image encode to bound per-tile attention. The VAE's default tile size equals its sample_size (1024), whose per-tile attention still OOMs, so force a 512px tile (with a matching latent tile size derived from the config). Save/restore the VAE's tiling config since it is a shared, cached instance, so the final image decode does not inherit these settings.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
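The save/restore step could be sketched as a context manager. This is a minimal illustration, not the PR's code: the attribute names (use_tiling, tile_sample_min_size, tile_latent_min_size) are assumptions modeled on common VAE implementations, the 64-unit latent tile assumes an 8x scale factor, and a SimpleNamespace stands in for the shared cached VAE.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_tiling(vae, tile_sample_size, tile_latent_size):
    """Force a tile size for one operation, then restore the previous settings."""
    saved = (vae.use_tiling, vae.tile_sample_min_size, vae.tile_latent_min_size)
    vae.use_tiling = True
    vae.tile_sample_min_size = tile_sample_size
    vae.tile_latent_min_size = tile_latent_size
    try:
        yield vae
    finally:
        # Restore even if the encode raises, because the VAE instance is shared.
        vae.use_tiling, vae.tile_sample_min_size, vae.tile_latent_min_size = saved


# Stand-in for a cached VAE whose default tile size equals its sample_size (1024).
vae = SimpleNamespace(use_tiling=False, tile_sample_min_size=1024, tile_latent_min_size=128)

with temporary_tiling(vae, tile_sample_size=512, tile_latent_size=64) as tiled:
    inside = (tiled.use_tiling, tiled.tile_sample_min_size, tiled.tile_latent_min_size)

after = (vae.use_tiling, vae.tile_sample_min_size, vae.tile_latent_min_size)
```

Restoring in a finally block means a failed encode cannot leave the cached instance in tiled mode for the final decode.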

ModelCache._get_vram_in_use() called torch.cuda.memory_allocated() with no device argument, while _get_vram_available() reads memory_allocated(execution_device). The formula relies on those two canceling. In multi-GPU mode each worker calls torch.cuda.set_device for its own GPU, so the process-current device flips between workers; the no-argument call can then read a different (e.g. idle) GPU's allocation, breaking the cancellation and inflating "available" VRAM toward the card total. The cache then believes there is room and never offloads, so VRAM offloading effectively ignores device_working_mem_gb in multi-GPU. Single-GPU was unaffected (current device always equals the execution device).
Query self._execution_device in both _get_vram_in_use() and the cache-state debug log. Add a regression test asserting the per-cache execution device is used.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
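The failure mode can be reproduced without a GPU using a small stand-in for torch.cuda. FakeCuda and the two helper functions below are hypothetical; only the behavior that memory_allocated() with no argument reads the process-current device mirrors the real API.

```python
GB = 2**30


class FakeCuda:
    """Stand-in for torch.cuda that tracks per-device allocations."""

    def __init__(self, allocated):
        self.allocated = allocated  # device index -> bytes allocated
        self.current = 0  # process-current device

    def set_device(self, index):
        self.current = index

    def memory_allocated(self, device=None):
        # With no argument, read the process-current device.
        index = self.current if device is None else device
        return self.allocated[index]


def vram_in_use_buggy(cuda):
    """Reads whichever device happens to be current."""
    return cuda.memory_allocated()


def vram_in_use_fixed(cuda, execution_device):
    """Always reads the cache's own execution device."""
    return cuda.memory_allocated(execution_device)


cuda = FakeCuda({0: 40 * GB, 1: 0})
cuda.set_device(1)  # another worker pinned its own (idle) GPU

buggy = vram_in_use_buggy(cuda)  # reads GPU 1 instead of GPU 0
fixed = vram_in_use_fixed(cuda, execution_device=0)
print(buggy // GB, fixed // GB)  # 0 40
```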

… decode peak
The Qwen Image VAE is a 3D-conv (video) VAE whose decode allocates large conv3d feature maps. A ~1MP decode was measured to peak at ~17 GiB of VRAM, far above what the generic 2200/1100 SD/FLUX constants reserved (~4.6 GiB), so the cache concluded the decode "fit" alongside the resident 20GB transformer + 15GB text encoder, never offloaded them, and OOMed. The offload only frees ~(working_mem - free) bytes, so the reservation must both cover the real peak and be large enough to trigger the offload of models the decode doesn't need.
Raise the Qwen decode/encode constants (13000/6500) to match the measured peak. It's linear in output pixels, so it over-reserves past ~1.5MP (where the decode can exceed the card even after offloading); that case is covered by force_tiled_decode.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
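The relationship between the reservation and the amount offloaded can be shown with simple arithmetic. The helper name and the 10 GiB of free VRAM are hypothetical; the ~4.6 GiB and ~17 GiB figures come from the commit message.

```python
GIB = 2**30


def bytes_to_offload(working_mem_bytes: int, free_vram_bytes: int) -> int:
    """The cache frees only the shortfall between the reservation and free VRAM."""
    return max(0, working_mem_bytes - free_vram_bytes)


free_vram = 10 * GIB  # hypothetical free VRAM next to the resident models
old_reservation = int(4.6 * GIB)  # what the generic constants reserved
measured_peak = 17 * GIB  # the measured ~1MP decode peak

# The old reservation "fits", so nothing is offloaded and the decode later OOMs.
old_offload = bytes_to_offload(old_reservation, free_vram)

# Reserving the real peak forces the cache to free the shortfall first.
new_offload = bytes_to_offload(measured_peak, free_vram)
print(old_offload // GIB, new_offload // GIB)  # 0 7
```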

The Qwen Image latents-to-image node hardcoded vae.disable_tiling(), ignoring the global force_tiled_decode setting that the SD/SDXL l2i node honors. Wire it up the same way so users can opt into tiled VAE decode for very large outputs that exceed VRAM even after the transformer/text encoder are offloaded. Off by default, so normal-size decodes are unchanged (full-frame, no tile blending).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The preview-panel progress circle re-renders on every InvocationProgressEvent. The parent passes a fresh progressEvent object each event, so the CircularProgress re-rendered constantly; during the indeterminate phases (everything except denoising) that restarted its CSS spin animation each time, which looked like the disk flashing. (Determinate denoising was unaffected because the value genuinely changes per step.)
Split the circle into a memoized, ref-forwarding subcomponent keyed on its visual props (isIndeterminate, value, device label) so message-only updates no longer re-render it and the spin animation stays continuous. The Tooltip still anchors to it via the forwarded ref.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lstein and others added 14 commits May 31, 2026 23:26

fix(backend): fix outpainting crash caused by model download collisions

7011446

chore(frontend): typegen+openapi

4209780

docs(multi-gpu): add configuration information

914c577

chore(frontend): typegen + openapi again

a928a75

lstein requested review from JPPhoto, Pfannkuchensack, blessedcoolant and dunkeroni as code owners June 3, 2026 02:13

github-actions bot added the api, python, backend, services, frontend, python-tests, and docs labels Jun 3, 2026

lstein assigned JPPhoto Jun 3, 2026

lstein added the 6.14.x label Jun 3, 2026

lstein added this to Invoke - Community Roadmap Jun 3, 2026

lstein moved this to 6.14.x Theme: USER EXPERIENCE in Invoke - Community Roadmap Jun 3, 2026

Merge branch 'main' into lstein/feat/multi-gpu

5e4e864

lstein and others added 4 commits June 3, 2026 15:23

chore(frontend): openapi

cdcf7df

Merge branch 'main' into lstein/feat/multi-gpu

d521ba4

github-actions bot added the invocations label Jun 12, 2026

lstein added 2 commits June 11, 2026 22:52

Merge branch 'main' into lstein/feat/multi-gpu

c48dfc8

lstein and others added 3 commits June 24, 2026 18:36

Merge branch 'main' into lstein/feat/multi-gpu

9c1e516

lstein marked this pull request as draft June 25, 2026 12:30

lstein and others added 3 commits June 25, 2026 20:27

lstein marked this pull request as ready for review June 26, 2026 01:53

lstein and others added 7 commits June 26, 2026 22:53

Merge branch 'main' into lstein/feat/multi-gpu

b275c38

feat: multi-GPU parallel session execution #9263
lstein wants to merge 38 commits into invoke-ai:main from lstein:lstein/feat/multi-gpu
