feat: multi-GPU parallel session execution#9263

Open
lstein wants to merge 38 commits into invoke-ai:main from lstein:lstein/feat/multi-gpu

Conversation

@lstein

@lstein lstein commented Jun 3, 2026

Collaborator

Summary

This PR adds multi-GPU parallel generation: on a machine with more than one GPU, InvokeAI runs several generation sessions concurrently — one per GPU — instead of draining the queue one job at a time. Jobs are distributed fairly across users so a single user's large batch can't monopolize every GPU while others wait.

It's controlled by a new generation_devices config setting (default: auto, which uses every available CUDA GPU). Setting it to a single device, or running on a machine without CUDA, preserves the previous serial behavior exactly. The GPUs can also be selected in a new section of the Settings dialog (a restart is required for changes to take effect).

Demo (turn on the sound!)

invoke-mgpu.mp4

How it works — the change is built around five small backend seams plus a frontend update, rather than per-node edits:

  • Device context (invokeai/backend/util/devices.py): a thread-local set/get/clear_session_device on TorchDevice; choose_torch_device() consults it first. This is the lynchpin — the ~79 existing call sites resolve to the worker's GPU with no per-node changes.
  • Per-device model caches (model_manager_default, model_load_default): one ModelCache per device, resolved by the current thread's device, with fan-out for clear/drop/shutdown. Model construction is serialized against VRAM moves to prevent meta-device corruption.
  • Atomic dequeue (session_queue_sqlite.dequeue): a lock makes select+claim atomic so concurrent workers never grab the same queue item.
  • Worker pool (session_processor_default): one _SessionWorker per device, each pinning torch.cuda.set_device + the session device, with its own runner and cancel event; cancellation is routed per item. Profiling is disabled when more than one worker is active.
  • Concurrency hardening: added a Lock to ObjectSerializerForwardCache and made DiskImageFileStorage thread-safe for parallel sessions.
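
As a rough illustration of the device-context seam described in the list above, here is a minimal sketch of a thread-local device pin. The function names mirror the PR's description, but the bodies, the `DEFAULT_DEVICE` fallback, and the `run_worker` helper are assumptions made for illustration, not the actual `TorchDevice` code.

```python
import threading
from typing import Dict, Optional

# Hypothetical stand-in for the thread-local device context; not the real TorchDevice.
_session = threading.local()
DEFAULT_DEVICE = "cpu"  # assumed fallback when no session device is pinned


def set_session_device(device: str) -> None:
    """Pin a device for the calling thread only."""
    _session.device = device


def get_session_device() -> Optional[str]:
    """Return the calling thread's pinned device, or None if unset."""
    return getattr(_session, "device", None)


def clear_session_device() -> None:
    """Remove the calling thread's pin, if any."""
    if hasattr(_session, "device"):
        del _session.device


def choose_device() -> str:
    """Consult the thread-local pin first, then fall back to the default."""
    return get_session_device() or DEFAULT_DEVICE


def run_worker(device: str, results: Dict[str, str]) -> None:
    """Simulate one worker: pin, resolve, and always unpin."""
    set_session_device(device)
    try:
        results[device] = choose_device()
    finally:
        clear_session_device()


results: Dict[str, str] = {}
threads = [threading.Thread(target=run_worker, args=(d, results)) for d in ("cuda:0", "cuda:1")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the pin lives in `threading.local`, code that only calls `choose_device()` picks up its own worker's device without being told about workers at all, which is the property the ~79 existing call sites rely on.
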

Frontend: during parallel generation the progress display stacks one progress bar per active session (each disappears as its session finishes), and the image viewer tiles per-session progress previews when ≥2 sessions are active.

Related Issues / Discussions

QA Instructions

On a multi-GPU machine:

  • With default config (generation_devices: auto), enqueue a batch larger than the GPU count and confirm multiple sessions run simultaneously (one per GPU), with stacked progress bars and tiled previews in the viewer.
  • Set generation_devices: [cuda:0] and confirm generation runs serially, exactly as before this PR.
  • Set generation_devices: [cuda:0, cuda:2] and confirm only those devices are used.
  • Cancel an in-flight item and confirm only that session stops.
  • On a single-GPU / CPU / MPS machine, confirm auto resolves to the one best device and behavior is unchanged.

New automated tests cover device routing (test_model_load_device_routing.py), dequeue concurrency (test_session_queue_dequeue_concurrency.py), and device resolution (test_devices.py).

Merge Plan

Standard merge. No DB schema or redux migrations. Touches the session processor and model cache, so worth a careful look from those areas' owners.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration — N/A, no slice changes
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

lstein and others added 14 commits May 31, 2026 23:26
Run one generation session per configured GPU concurrently, with a tiled
progress preview. Multi-user isolation is unchanged. Backed by five seams:

- Per-thread device context (TorchDevice.set/get/clear_session_device);
  choose_torch_device() consults it first, so all device-selecting call sites
  resolve to the calling worker's GPU with no per-node changes.
- Per-device model caches: build_model_manager builds one ModelCache per
  generation device; ModelLoadService.ram_cache resolves by current thread
  device; ram_caches fans out clear/drop/shutdown.
- Atomic concurrent dequeue: a dequeue lock makes select+claim atomic so
  concurrent workers never claim the same item (works on FIFO; round-robin
  from invoke-ai#9086 slots in later).
- Worker pool: one _SessionWorker per device, each pinning torch.cuda.set_device
  and its session device, with its own runner and cancel event; cancellation
  routes via an {item_id -> worker} lookup. Single-device installs keep the
  exact legacy single-worker behavior. Profiling disabled when >1 worker.
- New config `generation_devices`; unset = legacy single-worker mode.

Frontend: the canvas staging area already tiles per queue item; the main
ImageViewer now tracks progress per session and renders a tile grid
(ProgressImageTiles) when more than one session is active.

Also adds a lock to ObjectSerializerForwardCache for concurrent access.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
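
The atomic-dequeue seam in the commit above can be sketched as follows. This is a simplified illustration under stated assumptions: the table layout, the `dequeue_lock` name, and the shared in-memory connection are hypothetical and do not reproduce `session_queue_sqlite`.

```python
import sqlite3
import threading
from typing import List, Optional

# One shared connection; every access below happens under dequeue_lock.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE queue (item_id INTEGER PRIMARY KEY, priority INTEGER, status TEXT)")
conn.executemany("INSERT INTO queue (priority, status) VALUES (?, 'pending')", [(0,)] * 20)
conn.commit()

dequeue_lock = threading.Lock()


def dequeue() -> Optional[int]:
    """Select the next pending item and claim it as one atomic step."""
    with dequeue_lock:
        row = conn.execute(
            "SELECT item_id FROM queue WHERE status = 'pending' "
            "ORDER BY priority DESC, item_id ASC LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE queue SET status = 'in_progress' WHERE item_id = ?", (row[0],))
        conn.commit()
        return row[0]


claimed: List[int] = []
claimed_lock = threading.Lock()


def worker() -> None:
    """Drain the queue, recording every item this worker claimed."""
    while True:
        item_id = dequeue()
        if item_id is None:
            return
        with claimed_lock:
            claimed.append(item_id)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two workers could both run the SELECT before either runs the UPDATE and end up processing the same item.
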
test_model_load_device_routing mutated the process-wide get_config()
singleton (device = "cuda:0") to exercise the per-thread cache routing,
but never restored it. The leaked CUDA device was then picked up by a
later test (test_model_load::test_loading) via choose_torch_device(),
which crashed with "Torch not compiled with CUDA enabled" on the
CUDA-less CI runner. Add an autouse fixture to save/restore device and
clear any pinned session device.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n_devices

Regenerate openapi.json (make frontend-openapi) and the frontend
schema.ts types (make frontend-typegen) so they include the new
generation_devices config field, fixing the openapi-checks and
typegen-checks CI jobs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`make frontend-openapi` used a bare `python` from a different environment
that emitted the CacheStats @dataclass docstring as a schema description.
CI generates the schema via `uv run`, which does not, so openapi-checks
failed on the diff. Regenerate with the uv-locked environment to drop the
stray description while keeping the generation_devices field.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o prevent meta-device corruption

Parallel multi-GPU session workers could intermittently crash with "unrecognized
device meta" (denoise) or "Cannot copy out of meta tensor; no data!" (l2i), because
model loading relies on process-global, non-thread-safe monkey-patches.

accelerate.init_empty_weights() (used directly by the loaders and implicitly by
diffusers' default low_cpu_mem_usage=True in from_pretrained) swaps
torch.nn.Module.register_parameter globally for the duration of a load, routing every
newly-registered parameter to the meta device. The model cache's VRAM load/unload runs
nn.Module.load_state_dict(assign=True), whose assign path does setattr -> __setattr__ ->
register_parameter. When one worker's VRAM move overlapped another worker's from_pretrained,
the move's real weights got hijacked onto meta and blew up on the next .to(device).

Introduce MODEL_LOAD_LOCK, a write-preferring readers-writer lock:
- write lock = model construction (_load_and_cache, load_model_from_path), exclusive.
- read lock  = VRAM load/unload (ModelCache.lock(), repair_required_tensors_on_device).

VRAM transfers across GPUs still overlap each other; they only block while a construction
holds the write lock. The lock is always acquired before any per-cache lock to keep a
consistent order and avoid an AB-BA deadlock with the writer's make_room/put.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
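
For readers unfamiliar with the pattern, a write-preferring readers-writer lock can be sketched with a single condition variable. The class below is an illustrative assumption, not the PR's `MODEL_LOAD_LOCK` implementation; in the commit's terms, `write()` would wrap model construction and `read()` would wrap VRAM load/unload.

```python
import threading
from contextlib import contextmanager
from typing import List


class WritePreferringRWLock:
    """Many concurrent readers, one exclusive writer; waiting writers block new readers."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_active = False
        self._writers_waiting = 0

    @contextmanager
    def read(self):
        with self._cond:
            # New readers yield to any active or waiting writer (write preference).
            while self._writer_active or self._writers_waiting > 0:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            self._writers_waiting += 1
            try:
                # Wait until no writer is active and all readers have left.
                while self._writer_active or self._readers > 0:
                    self._cond.wait()
            finally:
                self._writers_waiting -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()


lock = WritePreferringRWLock()
events: List[str] = []
with lock.read():
    with lock.read():  # a second reader may enter while no writer is waiting
        events.append("two readers")
with lock.write():
    events.append("writer")
```

Write preference keeps a steady stream of VRAM moves from starving a model construction. The trade-off is that a thread already holding the read lock must not try to re-acquire it while a writer is waiting, or it will deadlock.
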
…ions

Image.open() is lazy: it reads the header but defers pixel decoding (and
holds the file handle open) until the first .load()/.copy()/.convert(). The
opened object was cached and the same object handed to every caller, so in
multi-GPU parallel mode two session-processor worker threads could call
.copy() on it concurrently and race on the shared file handle and decoder
state. This surfaced as "broken data stream when reading image file" and
"AssertionError: self.png is not None" during inpainting with batch >1.

Force the decode (image.load()) before the object enters the cache so the
cached object is safe for concurrent reads, and guard the cache structures
(__cache / __cache_ids) with a lock since they are now mutated from multiple
threads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generation progress bars (under the Invoke button and the Viewer tab)
both read a single global $lastProgressEvent atom, which every session
overwrites. With parallel multi-GPU sessions this made the bar jump back
and forth between sessions.

Track progress per queue item id and render one bar per in-flight session,
stacked vertically, each removed as its session reaches a terminal state.

- stores.ts: add $progressEvents (map keyed by item_id),
  $activeProgressEvents (sorted), and set/clear helpers.
- setEventListeners.tsx: populate per-item progress on invocation_progress;
  clear per item on terminal status; clear all on connect/disconnect/queue
  cleared.
- ProgressBar.tsx: render a vertical stack of bars (one per active session)
  with a single-bar fallback for the idle / model-loading window; add
  containerProps so dockview tabs can position the stack.
- Dockview tab call sites: move positioning into containerProps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
$progressEvents is only referenced within stores.ts (via the
$activeProgressEvents computed and the set/clear helpers), so exporting
it tripped knip's unused-exports check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With 4 GPUs the stacked per-session progress bars grew past the bottom
strip of the dockview tab and overlapped the "Viewer" label.

Add a fitHeightPx prop: in fit mode the stack is capped to the available
strip (10px below the ~40px tab's centered label) and the bars flex to
share it, shrinking below their natural height only once they no longer
fit. With 1-2 sessions the bars keep their familiar thin height; with 3+
they scale down to stay within the strip. The sidebar bar is unaffected
and continues to stack at natural height (it has the vertical room).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fault

generation_devices now accepts "auto" (the new default), which expands to
every visible CUDA device — so multi-GPU parallel generation works out of
the box without manually listing devices. On GPU-less systems "auto"
resolves to the single cpu/mps device, preserving serial behavior.

- config_default.py: type is now Union[Literal["auto"], list[str]],
  default "auto"; validator accepts "auto" or a list of device strings.
- devices.py: add TorchDevice.get_generation_devices(), the single resolver
  that expands "auto", normalizes, and deduplicates.
- session_processor / model_manager: both consumers use the resolver
  instead of iterating the raw config value (which would have iterated the
  characters of the "auto" string).
- Regenerated docs/src/generated/settings.json.
- Tests for the resolver (auto-with/without-CUDA, dedup, empty).

An explicit single-device list (e.g. [cuda:0]) or an empty list opts out
of parallelism.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
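
The resolver behavior described in this commit might look roughly like the sketch below. It is not `TorchDevice.get_generation_devices()`: the function name, the explicit `cuda_device_count` and `fallback_device` parameters, and the normalization rules (lowercasing, bare "cuda" mapped to "cuda:0") are assumptions chosen so the example runs without torch. It follows this commit's rule that an empty list opts out of parallelism.

```python
from typing import List, Union


def resolve_generation_devices(
    setting: Union[str, List[str]],
    cuda_device_count: int,
    fallback_device: str = "cpu",
) -> List[str]:
    """Expand "auto", normalize device strings, and drop duplicates (order-preserving)."""
    if setting == "auto":
        if cuda_device_count > 0:
            return [f"cuda:{i}" for i in range(cuda_device_count)]
        return [fallback_device]
    if isinstance(setting, str):
        # Guard against iterating the characters of a bare string.
        raise ValueError('generation_devices must be "auto" or a list of device strings')
    resolved: List[str] = []
    for device in setting:
        normalized = device.strip().lower()
        if normalized == "cuda":  # assumed normalization of a bare "cuda"
            normalized = "cuda:0"
        if normalized not in resolved:
            resolved.append(normalized)
    # In this sketch an empty list means "no parallelism": use the single fallback device.
    return resolved or [fallback_device]
```

Keeping one resolver that both the session processor and the model manager call ensures the two agree on the device list, which is what avoids the "iterate the characters of auto" bug mentioned above.
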
@github-actions github-actions Bot added the api, python, backend, services, frontend, python-tests, and docs labels Jun 3, 2026
@lstein lstein added the 6.14.x label Jun 3, 2026
@lstein lstein moved this to 6.14.x Theme: USER EXPERIENCE in Invoke - Community Roadmap Jun 3, 2026
lstein and others added 4 commits June 3, 2026 15:23
Render device badges as "cuda:0 (RTX 3090 #1)" so identical cards can be
told apart. Strips the "NVIDIA GeForce" vendor prefix and adds a 1-based
"#N" suffix only when multiple cards share a name. The full device name
remains available as the badge tooltip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
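
The badge-labeling rule in this commit can be illustrated with a short sketch. The actual implementation lives in the frontend; the Python function below, its name, and the choice to number cards per shared name are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict

VENDOR_PREFIX = "NVIDIA GeForce "  # prefix stripped from the displayed name


def device_badges(device_names: Dict[str, str]) -> Dict[str, str]:
    """Map device id -> badge text, adding a 1-based "#N" only for duplicated names."""
    short = {
        device: name[len(VENDOR_PREFIX):] if name.startswith(VENDOR_PREFIX) else name
        for device, name in device_names.items()
    }
    totals = Counter(short.values())  # how many cards share each short name
    seen: Counter = Counter()         # running index per short name
    badges: Dict[str, str] = {}
    for device, name in short.items():
        if totals[name] > 1:
            seen[name] += 1
            badges[device] = f"{device} ({name} #{seen[name]})"
        else:
            badges[device] = f"{device} ({name})"
    return badges


badges = device_badges(
    {
        "cuda:0": "NVIDIA GeForce RTX 3090",
        "cuda:1": "NVIDIA GeForce RTX 3090",
        "cuda:2": "NVIDIA RTX A6000",
    }
)
```

Here the two identical cards receive "#1" and "#2", while the unique card keeps its plain name.
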
Help users track which CUDA device is processing each session:

- Model-load log: "Loaded model ... onto cuda device #N in ..s"
- Denoise progress bars: "Denoising (#N)" across all architectures
  (SD1.5/SDXL, FLUX, FLUX2, Z-Image, Anima, SD3, CogView4)
- Progress preview circle: GPU number centered in the ring, via a new
  `device` field on InvocationProgressEvent (resolved from the worker's
  thread-local session device)
- Session Queue: new "GPU #" column between STATUS and TIME, backed by a
  `device` column on session_queue (migration_32) recorded when a worker
  claims an item

Adds TorchDevice.get_session_device_label()/get_session_device_index()
helpers and a frontend getCudaDeviceIndex() parser (with tests). Shows the
number on CUDA only; CPU/MPS show nothing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added the invocations label Jun 12, 2026
lstein added 2 commits June 11, 2026 22:52
# Conflicts:
#	invokeai/frontend/web/src/features/gallery/components/ImageViewer/CurrentImagePreview.tsx
#	invokeai/frontend/web/src/features/gallery/components/ImageViewer/context.tsx
#	invokeai/frontend/web/src/services/events/setEventListeners.tsx
@JPPhoto

JPPhoto commented Jun 23, 2026

Collaborator

I can take a look at this, but I don't have the ability to test it. I recommend assigning a second reviewer as well.

lstein and others added 3 commits June 24, 2026 18:36
Resolve migration_32 conflict: main's migration_32 (model_relationships FK
repair) is kept, and the multi-gpu device-column migration is moved to
migration_33 and registered after it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rcles

- Startup log lists each generation device with its GPU number and id,
  e.g. "Using torch device: [AMD Radeon PRO W7900 #1 (cuda:0), ...]".
  Single-device setups keep the bare device name.
- Canvas progress circles now show the CUDA device index in the center,
  matching the viewer panel.
- Progress-circle tooltips show the device name and number on hover.
- Both are hidden when only a single GPU is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lstein lstein marked this pull request as draft June 25, 2026 12:30
@lstein

lstein commented Jun 25, 2026

Collaborator Author

I have converted back to a draft temporarily while I work out excessive RAM consumption issues.

@JPPhoto

JPPhoto commented Jun 25, 2026

Collaborator

Some comments on the current code (not posted as a review):

  • invokeai/invokeai/app/services/session_queue/session_queue_sqlite.py:478 and :651

    Bulk cancellation only cancels one active item in multi-GPU mode. The new processor starts one worker per device and each worker calls dequeue() independently, so there can be multiple in_progress rows at the same time. But cancel_by_batch_ids(), cancel_by_destination(), and cancel_by_queue_id() all exclude status = 'in_progress' from the bulk update, then call get_current() once. get_current() uses LIMIT 1 at line 287, so only one active item is moved to canceled; other active items in the same queue/batch/destination continue running.

    Trigger: two GPUs are running two items from the same batch or destination, then the user cancels that batch/destination or queue.

    Consequence: cancellation appears to succeed but at least one generation keeps consuming GPU and can still produce output after the user requested cancellation.

    To expose this issue, add a test that dequeues two items into in_progress with the same batch/destination, calls each bulk cancel API, and asserts all matching in_progress items become canceled and emit cancel events.

  • invokeai/invokeai/app/services/session_queue/session_queue_sqlite.py:234

    The documented per-user fairness is not implemented. The PR docs say a single user's large batch cannot monopolize every GPU, but dequeue() still orders only by priority DESC, item_id ASC. With two workers, if user A enqueues a large batch and user B enqueues after it, every worker keeps claiming A's lower item_id rows until A's older batch is drained. There is no user grouping, round-robin CTE, or last-served-user state in this PR.

    Trigger: multiuser mode, multiple GPUs, user A enqueues many items before user B enqueues one item at the same priority.

    Consequence: user B can wait behind user A's entire older batch, contradicting the user-facing behavior described in docs/src/content/docs/configuration/invokeai-yaml.mdx:119.

    To expose this issue, add a queue test that enqueues many pending items for user A, then one for user B, simulates repeated concurrent dequeue() calls, and asserts the second user is selected before A monopolizes every worker cycle.

lstein and others added 3 commits June 25, 2026 20:27
… caches

In multi-GPU mode the model manager builds one ModelCache per generation device,
each with storage_device="cpu" and its own RAM-resident copy of every model. A model
loaded on N GPUs therefore occupied N copies in RAM, and each cache sized itself
against max_cache_ram_gb independently, so RAM use during the text/reference-image
encoding phases skyrocketed and the system swapped — worst when two images rendered
at once.

This deduplicates the CPU-resident weights and makes RAM accounting global.

- SharedCpuWeightsStore: process-/manager-global, refcounted store of one canonical
  CPU state_dict per model key. The first device to load a key registers its weights;
  subsequent devices adopt the canonical tensors and re-point their module's params at
  them (load_state_dict(assign=True)), freeing the duplicate. Weights live once in RAM
  regardless of GPU count; freed only when the last device releases. Per-device modules
  are kept (params are device-shuffled in place, so two GPUs need two modules), but
  their CPU-resident params alias the shared tensors.

- RamBudget: single system-wide RAM authority. Splits RAM into shared (counted once via
  the store) and non-shared (per-instance). ModelCache eviction now runs against the
  global, deduplicated total and re-checks availability each iteration, since evicting a
  model another device still holds frees no RAM. build_model_manager wires one store +
  one budget into all device caches; the cap is max_cache_ram_gb as a true system-wide
  limit, else the sum of per-cache heuristics. Passing ram_budget=None preserves the
  prior local accounting.

- LoRA/patch safety: direct LoRA patching did an in-place copy_ on the weight, which
  would corrupt the now-shared canonical tensor (and taint keep_ram_copy even with one
  GPU) when patching a CPU-resident weight. Switched to an out-of-place add (memory-
  equivalent) so the canonical tensor is never mutated; fixed the FluxControlLoRA
  expansion path to target the module's live parameter. Sidecar patching and
  FreeU/Seamless (which patch forward methods) were already safe.

Validated on 2x AMD W7900 / ROCm: correct inference on both GPUs from one shared copy
(full + partial load + Q8_0 GGUF quantized), concurrent load/unload without corruption,
and LoRA isolation across devices. ~40 new tests; existing suites unchanged.

Adds scripts/multigpu_ram_driver.py to drive concurrent dual-GPU generations via the
queue API and measure peak RSS / leak drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
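
The refcounting idea behind the shared store can be sketched in a few lines. This is a toy model, not `SharedCpuWeightsStore`: the class name and method signatures are hypothetical, and plain dicts stand in for state dicts of tensors.

```python
import threading
from typing import Any, Dict, List


class RefcountedStore:
    """Keep one canonical value per key, freed when the last holder releases it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, List[Any]] = {}  # key -> [canonical value, refcount]

    def acquire(self, key: str, candidate: Dict[str, Any]) -> Dict[str, Any]:
        """Register `candidate` if the key is new; otherwise adopt the canonical value."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = [candidate, 0]
                self._entries[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, key: str) -> bool:
        """Drop one reference; return True if the canonical value was freed."""
        with self._lock:
            entry = self._entries[key]
            entry[1] -= 1
            if entry[1] == 0:
                del self._entries[key]
                return True
            return False


store = RefcountedStore()
weights_gpu0 = {"layer.weight": [1.0, 2.0]}
weights_gpu1 = {"layer.weight": [1.0, 2.0]}  # duplicate load on a second device
canonical_a = store.acquire("model-key", weights_gpu0)
canonical_b = store.acquire("model-key", weights_gpu1)
```

Both callers end up holding the same object, so the duplicate can be garbage-collected once the second device re-points its parameters at the canonical copy.
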
…(multi-GPU)

With one session-processor worker per device, multiple queue items can be in_progress
at once. cancel_by_batch_ids(), cancel_by_destination() and cancel_by_queue_id() excluded
in_progress rows from their bulk UPDATE and then canceled only the single get_current()
item (LIMIT 1), so on multi-GPU the other running items kept consuming a GPU and could
still produce output after the user requested cancellation.

Each running item must be canceled via _set_queue_item_status(), which emits the
QueueItemStatusChangedEvent that the processor maps to the worker running that item_id and
uses to set its cancel event. Add _cancel_in_progress_matching() to cancel every in-progress
item matching the same filter (with user-id scoping preserved) and call it from all three
bulk-cancel methods. The returned `canceled` count now includes canceled in-progress items.

Adds regression tests that dequeue two items onto separate devices and assert every bulk
cancel API moves all matching in_progress items to canceled and emits a cancel event for
each (and that user-scoped cancel leaves another user's in-progress item running).

Reported by JPPhoto in review of invoke-ai#9263.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
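
A simplified sketch of the fix described in this commit: pending rows are canceled in bulk, and every matching in-progress row is canceled individually so that each one emits its own event. The schema, the `cancel_events` list standing in for `QueueItemStatusChangedEvent`, and the function names are assumptions for illustration only.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session_queue (item_id INTEGER PRIMARY KEY, batch_id TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO session_queue (batch_id, status) VALUES (?, ?)",
    [("b1", "in_progress"), ("b1", "in_progress"), ("b1", "pending"), ("b2", "in_progress")],
)

cancel_events: List[int] = []  # stand-in for emitted status-changed events


def set_status_canceled(item_id: int) -> None:
    """Cancel one item and record the event a worker would react to."""
    conn.execute("UPDATE session_queue SET status = 'canceled' WHERE item_id = ?", (item_id,))
    cancel_events.append(item_id)


def cancel_by_batch_id(batch_id: str) -> int:
    """Cancel pending rows in bulk, then every matching in-progress row one by one."""
    cur = conn.execute(
        "UPDATE session_queue SET status = 'canceled' WHERE batch_id = ? AND status = 'pending'",
        (batch_id,),
    )
    canceled = cur.rowcount
    running = conn.execute(
        "SELECT item_id FROM session_queue "
        "WHERE batch_id = ? AND status = 'in_progress' ORDER BY item_id",
        (batch_id,),
    ).fetchall()
    for (item_id,) in running:
        set_status_canceled(item_id)
    return canceled + len(running)


total = cancel_by_batch_id("b1")
remaining = conn.execute("SELECT item_id, status FROM session_queue ORDER BY item_id").fetchall()
```

The old behavior corresponds to adding `LIMIT 1` to the in-progress query, which would leave item 2 running.
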
…vice guards, refcount leak)

Fixes from the code review of PR invoke-ai#9263:

- Cancellation could be silently lost around dequeue: the per-iteration
  worker.cancel_event.clear() ran AFTER dequeue + gc.collect() + logging, so a cancel
  arriving in that window was set by the status handler and then wiped. Move the clear to
  before dequeue, and after claiming an item re-check (cancel_event + a fresh DB status read
  via _is_queue_item_terminal) and skip running if it is already terminal, closing both race
  windows. The runner's stale queue_item.status check could not catch this.

- delete_by_destination only stopped one in-progress item (get_current) before deleting all
  matching rows, leaving other GPU workers running (and then failing to update a deleted row).
  Cancel every matching in-progress item via _cancel_in_progress_matching first.

- generation_devices validation: a bare non-"auto" string (e.g. "cuda:0") was iterated
  character-by-character; an empty list silently fell back to one device. Reject both with a
  clear message.

- get_generation_devices now fails fast on a CUDA device that does not exist (index past
  device_count, or CUDA unavailable) instead of starting a worker that errors cryptically at
  first allocation.

- Shared-weights wrappers: if the canonical re-point (load_state_dict assign=True) threw after
  acquire(), the reference was leaked (the wrapper never entered the cache). Compute size
  metadata first, make acquire the last step, and release on failure.

Adds tests for each: post-dequeue terminal guard, delete_by_destination cancellation,
generation_devices validation, absent-device rejection, and acquire-released-on-repoint-failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
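
The cancel-loss fix in the first bullet of this commit comes down to ordering. The sketch below models one loop iteration with a hypothetical `SketchWorker`; the `after_claim` hook exists only to simulate a cancel that races the claim, and none of these names come from the PR.

```python
import threading
from typing import Callable, Dict, List, Optional, Tuple

TERMINAL = {"canceled", "failed", "completed"}


class SketchWorker:
    """Hypothetical worker loop body; the real _SessionWorker is more involved."""

    def __init__(self, pending: List[int], statuses: Dict[int, str]) -> None:
        self.pending = pending
        self.statuses = statuses
        self.cancel_event = threading.Event()

    def run_once(
        self, after_claim: Optional[Callable[[int], None]] = None
    ) -> Optional[Tuple[str, int]]:
        # Clear BEFORE dequeue, so a cancel that arrives after the claim is not wiped.
        self.cancel_event.clear()
        if not self.pending:
            return None
        item_id = self.pending.pop(0)
        self.statuses[item_id] = "in_progress"
        if after_claim is not None:
            after_claim(item_id)  # simulation hook: a cancel racing the claim
        # Re-check both the event and a fresh status read before running.
        if self.cancel_event.is_set() or self.statuses[item_id] in TERMINAL:
            return ("skipped", item_id)
        return ("ran", item_id)


statuses: Dict[int, str] = {1: "pending", 2: "pending"}
worker = SketchWorker([1, 2], statuses)


def cancel(item_id: int) -> None:
    """Simulate the status handler: mark the row canceled and signal the worker."""
    statuses[item_id] = "canceled"
    worker.cancel_event.set()


first = worker.run_once(after_claim=cancel)
second = worker.run_once()
```

If the `clear()` call were moved to after the claim, the first item's cancel signal would be erased and only the status re-check would catch it.
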
@lstein

lstein commented Jun 26, 2026

Collaborator Author

@JPPhoto thanks for the careful review — both points were spot on.

Bulk cancellation in multi-GPU mode — fixed. cancel_by_batch_ids, cancel_by_destination, and cancel_by_queue_id now cancel every matching in_progress item (via a shared _cancel_in_progress_matching helper) instead of only the single get_current() item, so each running worker receives its QueueItemStatusChangedEvent and stops. While in there I found and fixed two siblings:

  • delete_by_destination had the same one-item limitation (it canceled one, then deleted all matching rows — leaving the other workers running against deleted rows).
  • A genuine cancel-loss race in the worker loop: cancel_event.clear() ran after dequeue() + gc.collect(), so a cancel arriving in that window got wiped. The clear now happens before dequeue, plus a fresh post-claim status re-check so a cancel that races the claim still skips the run.

All three are covered by new regression tests (two in_progress items dequeued onto separate devices, asserting all matching items are canceled and emit cancel events; user-scoping preserved). They fail on the old code and pass on the fix.

Per-user fairness — correct, this PR intentionally does not implement it; dequeue() is still priority DESC, item_id ASC. The round-robin scheduler is coming in #9086, which I've checked for compatibility with this branch: the two are orthogonal (round-robin picks which pending item; multi-GPU atomically claims it under a lock), and they actually compose well — an in_progress item counts as "recently served" via the existing started_at trigger, so round-robin naturally spreads the concurrent GPU slots across distinct users. The plan is to land #9086 first and rebase this on top (minor migration renumber + a small dequeue() merge), so fairness arrives through that PR rather than being duplicated here.

- Apply ruff 0.11.2 formatting to the files flagged by `ruff format --check`.
- The new fail-fast guard in get_generation_devices() (reject a CUDA device that
  doesn't exist) made the pre-existing test_get_generation_devices_explicit_list_is_deduplicated
  fail on CPU-only CI runners, since it passes a cuda list with no CUDA present. Mock
  torch.cuda.is_available/device_count in that test (matching the existing pattern in this
  file) so it validates dedup on any runner.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lstein lstein marked this pull request as ready for review June 26, 2026 01:53
@JPPhoto

JPPhoto commented Jun 26, 2026

Collaborator

There are some more issues that need resolving:

  • invokeai/invokeai/backend/model_manager/load/model_cache/shared_cpu_weights.py:64

    Stale shared CPU weights can survive a model setting change and be adopted by a rebuilt per-device cache. SharedCpuWeightsStore.acquire() returns the existing canonical state dict for the same cache key. drop_model() only marks locked entries stale at invokeai/invokeai/backend/model_manager/load/model_cache/model_cache.py:1068, so their shared-store reference remains live until unlock. If another device cache reloads the same model key before that unlock, the wrapper re-points the newly built module at the old canonical tensors at invokeai/invokeai/backend/model_manager/load/model_cache/cached_model/cached_model_with_partial_load.py:61 and cached_model_only_full_load.py:52. This defeats the invalidation added for load-affecting settings in invokeai/invokeai/app/api/routers/model_manager.py:445.

    Trigger: one GPU is running a model while an admin toggles a load-affecting setting such as fp8_storage; another GPU then loads that model before the first GPU unlocks.

    Consequence: the "rebuilt" cache entry can silently use the pre-change canonical weights and keep them alive under the same key.

    To expose this issue, add a test that loads the same key into two caches sharing a SharedCpuWeightsStore, locks one entry, calls drop_model() on both caches, then puts a changed model under the same key into the unlocked cache and asserts it does not adopt the old canonical tensors.

  • invokeai/invokeai/app/api/routers/app_info.py:151

    The runtime settings API validates only the device string pattern, not the same constraints enforced at startup. UpdateAppGenerationSettingsRequest.validate_generation_devices() accepts values like ["cuda:99"] if they match the regex, and update_runtime_config() persists them at lines 219 and 226. Startup later calls TorchDevice.get_generation_devices() from the model manager path, where unavailable or out-of-range CUDA devices raise at invokeai/invokeai/backend/util/devices.py:200. Empty lists are also accepted by the request model but rejected by InvokeAIAppConfig at invokeai/invokeai/app/services/config/config_default.py:272, producing an unhandled server error instead of a 422.

    Trigger: an admin or API client PATCHes /api/v1/app/runtime_config with generation_devices: ["cuda:99"] or [].

    Consequence: the server can persist a config that fails on restart, or return a 500 for input the request schema accepted.

    To expose this issue, add a test that PATCHes out-of-range CUDA and empty-list values and asserts the API rejects them with 422 without mutating or writing config.

  • invokeai/invokeai/app/services/session_queue/session_queue_sqlite.py:235

    The PR and docs still promise per-user fairness, but dequeue remains strict priority DESC, item_id ASC. The author comment says fairness is intentionally deferred to PR 9086, but this PR's docs still say a single user's large batch cannot monopolize every GPU at invokeai/docs/src/content/docs/configuration/invokeai-yaml.mdx:119 and generated settings repeat that at invokeai/docs/src/generated/settings.json:496.

    Trigger: user A enqueues a large batch before user B at the same priority on a multi-GPU system.

    Consequence: all workers keep claiming user A's older rows until they drain, so user B can still be starved despite the user-facing claim.

    To expose this issue, add a queue test that enqueues many items for user A, then one for user B, calls dequeue() repeatedly for multiple worker slots, and asserts user B is selected before A monopolizes all slots. Otherwise, remove the fairness claim from the docs until PR 9086 is merged.

  • invokeai/docs/src/content/docs/configuration/invokeai-yaml.mdx:132

    The docs advertise generation_devices: [] as a valid serial fallback, but the config validator now rejects empty lists at invokeai/invokeai/app/services/config/config_default.py:272 and tests assert rejection in tests/app/services/config/test_config_generation_devices.py:31. The same docs also say weights are duplicated in RAM per GPU at line 144, while this PR now adds shared CPU weights to avoid that.

    Trigger: Set generation_devices: [].

    Consequence: InvokeAI rejects the config instead of starting serially as documented.

    Expected docs fix: update docs/src/content/docs/configuration/invokeai-yaml.mdx and regenerated settings copy to match the accepted values and the new shared-RAM behavior.

Three RAM fixes for multi-GPU (and one that helps single-GPU too), addressing
transient spikes to ~100% RAM and swapping during text-encode/transformer loads:

1. Cap the global RAM-cache budget at a safe fraction of system RAM. When
   max_cache_ram_gb is unset, the budget was the *sum* of the per-device cache
   heuristics, so N GPUs each claiming ~50% of RAM summed to ~N*50% and starved
   the OS. Now clamp the sum to ModelCache.calc_system_ram_headroom_bytes()
   (50% of RAM - 2GB baseline, floored at 4GB). Promote the sizing magic numbers
   to named constants shared by the per-device heuristic and the global cap.

2. Adopt already-resident CPU weights across devices at load time. When a second
   device loads a model another device already holds, deep-copy a registered
   meta-weight structural clone and assign the shared canonical weights, instead
   of re-reading the model from disk and materializing a full transient second
   copy. Loader-agnostic (one mechanism in ModelLoader, no per-loader code):
   works for diffusers, single-file checkpoint, GGUF and transformers models,
   and preserves registered hooks (e.g. fp8 layerwise-cast). Best-effort with a
   meta-tensor self-check and fallback to a normal disk load on any failure.
   Skipped on single-device installs.

3. Dequantize FLUX.2 FP8 checkpoints straight to bf16. _dequantize_fp8_weights
   materialized the whole model in float32 (~36GB for 9B) before a later cast to
   bf16; now the multiply is done in float32 but stored bf16 per-weight, so the
   model is never held in float32. Numerically identical; halves the cold-load
   transient (helps single-GPU too).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
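
The budget clamp in item 1 of this commit is simple arithmetic, sketched below. The constant and function names are hypothetical (the commit names only `ModelCache.calc_system_ram_headroom_bytes()`), and treating 1 GB as 2**30 bytes is an assumption.

```python
from typing import List

GB = 2**30                   # assumption: binary gigabytes
RAM_FRACTION = 0.5           # use at most 50% of system RAM
BASELINE_BYTES = 2 * GB      # minus a 2 GB baseline
MIN_BUDGET_BYTES = 4 * GB    # never go below 4 GB


def system_ram_headroom_bytes(total_ram_bytes: int) -> int:
    """50% of RAM minus the baseline, floored at the minimum budget."""
    return max(int(total_ram_bytes * RAM_FRACTION) - BASELINE_BYTES, MIN_BUDGET_BYTES)


def global_cache_budget_bytes(per_device_bytes: List[int], total_ram_bytes: int) -> int:
    """Clamp the sum of per-device heuristics to the system-wide headroom."""
    return min(sum(per_device_bytes), system_ram_headroom_bytes(total_ram_bytes))


# Illustrative numbers: 96 GB of RAM, two GPUs whose per-device heuristic each asks for 46 GB.
unclamped = sum([46 * GB, 46 * GB])
budget = global_cache_budget_bytes([46 * GB, 46 * GB], 96 * GB)
```

In this example the unclamped sum would be 92 GB, nearly all of system RAM, while the clamped budget is 46 GB.
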
@lstein

lstein commented Jun 26, 2026

Collaborator Author

Lots of changes in commit 2d3802a. Previously each GPU had its own RAM cache, which meant that the same model could be loaded and stay resident twice, doubling the amount of RAM needed. These changes:

  1. De-duplicate models so that the same model is resident in RAM only once.
  2. Handle the case of a model being modified by a LoRA in one GPU session but not in the other. The RAM copy holds the canonical unmodified model, and the LoRA-modified version exists only transiently in the VRAM of the GPU that needs it. The same logic applies to reference images and controlnets.
  3. Avoid double-loading, OOMs, and thrashing through a variety of new checks. I was able to test on a machine with the interesting configuration of two 48 GB VRAM GPUs and 96 GB RAM, which exposed a number of places where model loading was handled inefficiently and caused RAM spikes.

lstein and others added 7 commits June 26, 2026 22:53
The Qwen Image VAE encode/decode invocations called model_on_device() without a
working-memory estimate, unlike every other VAE family (SD/SDXL/SD3/CogView4/FLUX).
So the model cache reserved only its small default working memory, never offloaded
a large resident transformer (the VAE weights themselves are tiny), and the VAE's
forward-pass activations then OOM'd VRAM — e.g. a ~40GB Qwen Image Edit transformer
left ~1GB free while decode needed ~5GB. Reproduces single-GPU; unrelated to the
multi-GPU RAM work.

Add estimate_vae_working_memory_qwen_image() (same per-output-pixel scaling as the
other estimators, handling the 5D Qwen latents) and pass it from both the i2l
(encode, used for reference images in Image Edit) and l2i (decode) nodes, so the
cache offloads the transformer before the VAE runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The FLUX.2 VAE encoder's mid-block self-attention scales quadratically with the
input's spatial size, and on ROCm scaled_dot_product_attention falls back to a
materialized attention matrix. Encoding a reference image (kontext) at full size
therefore allocated ~15GB in a single attention call at 1024px — and hundreds of
GB at the 2024px reference cap — OOMing VRAM regardless of how much other model
memory was freed.
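
The figures above are consistent with a fourth-power rule of thumb, sketched below. It assumes a square input and that the materialized attention matrix dominates the allocation: the token count grows with the square of the side length, and the attention matrix with the square of the token count. The function name is hypothetical.

```python
def scaled_attention_bytes(reference_bytes: float, reference_side_px: int, side_px: int) -> float:
    """Scale a measured attention allocation to another square input size.

    tokens ~ side**2 and attention matrix ~ tokens**2, so bytes ~ side**4.
    """
    return reference_bytes * (side_px / reference_side_px) ** 4


GB = 1e9
measured_at_1024 = 15 * GB  # ~15GB at 1024px, from the commit message

at_reference_cap = scaled_attention_bytes(measured_at_1024, 1024, 2024)
at_half_side = scaled_attention_bytes(measured_at_1024, 1024, 512)
print(f"{at_reference_cap / GB:.0f} GB at 2024px, {at_half_side / GB:.2f} GB at 512px")
```

Scaling the ~15GB measurement this way gives roughly 229GB at 2024px, and a sixteenth of the measurement (under 1GB) for a 512px input.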

Tile the reference-image encode to bound per-tile attention. The VAE's default
tile size equals its sample_size (1024), whose per-tile attention still OOMs, so
force a 512px tile (with a matching latent tile size derived from the config).
Save/restore the VAE's tiling config since it is a shared, cached instance, so the
final image decode does not inherit these settings.
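
A minimal sketch of the save/restore pattern, assuming the tiling configuration lives in three plain attributes. The attribute names, the `temporary_tiling` helper, the `FakeVAE` stand-in, and the 64px latent tile are all illustrative assumptions, not the actual code in this commit.

```python
from contextlib import contextmanager

# Assumed attribute names; the real VAE class may store its tiling config differently.
_TILING_ATTRS = ("use_tiling", "tile_sample_min_size", "tile_latent_min_size")


@contextmanager
def temporary_tiling(vae, tile_sample_px: int, tile_latent_px: int):
    """Force a tile size inside the block, then restore the previous config."""
    saved = {name: getattr(vae, name) for name in _TILING_ATTRS}
    try:
        vae.use_tiling = True
        vae.tile_sample_min_size = tile_sample_px
        vae.tile_latent_min_size = tile_latent_px
        yield vae
    finally:
        # Runs even if the encode raises, so the shared instance is never left tiled.
        for name, value in saved.items():
            setattr(vae, name, value)


class FakeVAE:
    """Stand-in for a shared, cached VAE instance (hypothetical)."""

    def __init__(self) -> None:
        self.use_tiling = False
        self.tile_sample_min_size = 1024
        self.tile_latent_min_size = 128


vae = FakeVAE()
with temporary_tiling(vae, tile_sample_px=512, tile_latent_px=64):
    pass  # the reference-image encode would run here with 512px tiles
```

Restoring in a `finally` block matters because the instance is shared: an exception during the encode would otherwise leave the forced tile size in place for the next decode.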

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ModelCache._get_vram_in_use() called torch.cuda.memory_allocated() with no device
argument, while _get_vram_available() reads memory_allocated(execution_device).
The formula relies on those two canceling. In multi-GPU mode each worker calls
torch.cuda.set_device for its own GPU, so the process-current device flips between
workers; the no-argument call can then read a different (e.g. idle) GPU's
allocation, breaking the cancellation and inflating "available" VRAM toward the
card total. The cache then believes there is room and never offloads, so VRAM
offloading effectively ignores device_working_mem_gb in multi-GPU. Single-GPU was
unaffected (current device always equals the execution device).

Query self._execution_device in both _get_vram_in_use() and the cache-state debug
log. Add a regression test asserting the per-cache execution device is used.
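
The failure mode can be reproduced with a toy model. The sketch below does not use the torch API: `FakeCuda` is a hypothetical stand-in for per-device allocation counters, and the formula in `vram_available` is an assumed simplification in which the execution device's allocation is added back and the cache's in-use figure is subtracted.

```python
GB = 2**30


class FakeCuda:
    """Toy per-device allocation counters (hypothetical, not torch.cuda)."""

    def __init__(self, allocated_by_device: dict, current: int = 0) -> None:
        self.allocated_by_device = allocated_by_device
        self.current = current  # current device, which another worker may have changed

    def memory_allocated(self, device=None) -> int:
        # With no argument, fall back to whatever the current device happens to be.
        return self.allocated_by_device[self.current if device is None else device]


def vram_available(cuda, execution_device, free_bytes, working_mem_bytes, pin_device):
    """Assumed simplified formula; the two memory_allocated() reads should cancel."""
    allocated = cuda.memory_allocated(execution_device)
    in_use = cuda.memory_allocated(execution_device if pin_device else None)
    return free_bytes + allocated - working_mem_bytes - in_use


# GPU 0 holds 40GB of models; GPU 1 is idle and is currently the selected device.
cuda = FakeCuda({0: 40 * GB, 1: 0}, current=1)
fixed = vram_available(cuda, 0, free_bytes=8 * GB, working_mem_bytes=3 * GB, pin_device=True)
buggy = vram_available(cuda, 0, free_bytes=8 * GB, working_mem_bytes=3 * GB, pin_device=False)
print(fixed // GB, buggy // GB)
```

With the device pinned, the two reads cancel and 5GB is reported as available. Without it, the idle GPU's zero allocation is subtracted instead and the figure rises to 45GB, close to the size of the card, so nothing is offloaded.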

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… decode peak

The Qwen Image VAE is a 3D-conv (video) VAE whose decode allocates large conv3d
feature maps. A ~1MP decode was measured to peak at ~17 GiB of VRAM — far above
what the generic 2200/1100 SD/FLUX constants reserved (~4.6 GiB), so the cache
concluded the decode "fit" alongside the resident 20GB transformer + 15GB text
encoder, never offloaded them, and OOMed. The offload only frees ~(working_mem -
free) bytes, so the reservation must both cover the real peak and be large enough
to trigger the offload of models the decode doesn't need.

Raise the Qwen decode/encode constants (13000/6500) to match the measured peak.
The estimate is linear in output pixels, so it over-reserves past ~1.5MP (where the decode
can exceed the card even after offloading) — that case is covered by
force_tiled_decode.
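
The interaction between the reservation and the offload can be sketched as below. The 6 GiB of free VRAM is a hypothetical figure chosen for illustration, and both function names are assumptions; only the ~4.6 GiB and ~17 GiB values come from the message above.

```python
GIB = 2**30


def bytes_to_offload(working_mem_bytes: int, free_bytes: int) -> int:
    """The cache frees only the shortfall between the reservation and free VRAM."""
    return max(0, working_mem_bytes - free_bytes)


def decode_fits(real_peak_bytes: int, working_mem_bytes: int, free_bytes: int) -> bool:
    """True if the real decode peak fits in the VRAM free after the offload."""
    free_after = free_bytes + bytes_to_offload(working_mem_bytes, free_bytes)
    return real_peak_bytes <= free_after


free = 6 * GIB        # hypothetical free VRAM next to the resident models
real_peak = 17 * GIB  # measured decode peak from the commit message

old_reservation = int(4.6 * GIB)  # what the generic constants reserved
new_reservation = 17 * GIB        # reservation matched to the measured peak

print(decode_fits(real_peak, old_reservation, free))
print(decode_fits(real_peak, new_reservation, free))
```

In this scenario the old reservation is already smaller than the free VRAM, so nothing is offloaded and the real 17 GiB peak does not fit. The larger reservation forces an 11 GiB offload, which is exactly what the decode needs.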

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Qwen Image latents-to-image node hardcoded vae.disable_tiling(), ignoring the
global force_tiled_decode setting that the SD/SDXL l2i node honors. Wire it up the
same way so users can opt into tiled VAE decode for very large outputs that exceed
VRAM even after the transformer/text encoder are offloaded. Off by default, so
normal-size decodes are unchanged (full-frame, no tile blending).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The preview-panel progress circle re-renders on every InvocationProgressEvent. The
parent passes a fresh progressEvent object each event, so the CircularProgress
re-rendered constantly; during the indeterminate phases (everything except
denoising) that restarted its CSS spin animation each time, which looked like the
disk flashing. (Determinate denoising was unaffected because the value genuinely
changes per step.)

Split the circle into a memoized, ref-forwarding subcomponent keyed on its visual
props (isIndeterminate, value, device label) so message-only updates no longer
re-render it and the spin animation stays continuous. The Tooltip still anchors to
it via the forwarded ref.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Labels

6.14.x; api; backend (PRs that change backend files); docs (PRs that change docs); frontend (PRs that change frontend files); invocations (PRs that change invocations); python (PRs that change python files); python-tests (PRs that change python tests); services (PRs that change app services)

Projects

Status: 6.14.x Theme: USER EXPERIENCE


2 participants