feat(multi-gpu): offload text encoders to idle GPUs #137
Open
lstein wants to merge 1 commit into
Adds `offload_text_encoders_to_idle_gpus` (default on): when more than one generation device is configured and a GPU is idle, a session's text/prompt encoder runs on the idle GPU instead of the one running its denoise pipeline. This avoids evicting the denoise model from VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Under full load (no idle GPU) behavior is unchanged.

Mechanism:
- New GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a per-device exclusive-use lock. A native session blocking-acquires its own device's lock for the whole run; an encoder node try-borrows an idle device's lock for the duration of the node. This makes a borrowed encoder and a native session mutually exclusive on a GPU -- preventing the shared-encoder corruption that produced garbled images -- and is deadlock-free (borrows are non-blocking; a session only ever blocks on its own device).
- DefaultSessionRunner re-pins the worker thread to the borrowed device for the whole encoder node; conditioning is stored on the CPU and the denoiser picks it up on its own GPU afterward.
- Nodes opt in via @invocation(idle_gpu_offloadable=True), mirroring the existing `bottleneck` ClassVar marker. Applied to the text/prompt encoder nodes (compel + sdxl/refiner, flux, sd3, qwen-image, anima, cogview4, flux2 klein, z-image, flux_redux).

Inspired by invoke-ai#9310; supersedes it.

Tests: device-pool lock semantics, two concurrency regression tests asserting a session and a borrow never use a GPU at the same time, the runner offload context-manager behavior, and a marker-wiring check.

Docs: invokeai-yaml.mdx (config setting) and creating-nodes.mdx (how to support the feature in a node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 29, 2026
This PR depends on invoke-ai#9263 and targets its branch (`lstein/feat/multi-gpu`), not `main`. It should be reviewed and merged only after invoke-ai#9263 has been reviewed, accepted, and merged. Once invoke-ai#9263 lands, this PR's base can be retargeted to `main`.
It was inspired by invoke-ai#9310 (split-GPU text encoder) and supersedes that PR: it delivers the same idea, but reworked to compose with the multi-GPU parallel-generation architecture from invoke-ai#9263 and to reuse that branch's existing per-device caches, device-aware VRAM accounting, and shared CPU-weights store rather than re-adding them.
Remember to give full credit to @Jacid23 for the concept and initial implementation.
Summary
On a multi-GPU machine, invoke-ai#9263 runs one generation session per GPU. When fewer sessions are running than there are GPUs, the spare GPUs sit idle. This PR uses that idle capacity: a session's text/prompt encoder runs on a currently-idle GPU instead of the GPU running its denoise pipeline.
Controlled by a new `offload_text_encoders_to_idle_gpus` setting (default on). With a single device, or under full multi-GPU load (no idle GPU), encoders run on the session's own GPU exactly as before.
How it works
- New `GENERATION_DEVICE_POOL` arbiter (`backend/util/device_pool.py`) with a per-device exclusive-use lock. A native session blocking-acquires its own GPU's lock for the whole run; an encoder node try-borrows an idle GPU's lock for the duration of that node. A borrowed encoder and a native session are therefore mutually exclusive on a GPU, and the design is deadlock-free (borrows are non-blocking try-acquires; a session only ever blocks on its own device).
- `DefaultSessionRunner` temporarily re-pins the worker thread to the borrowed GPU for the whole encoder node. The encoder loads into and runs on that GPU; its conditioning is stored on the CPU (as encoder nodes already do) and the denoiser picks it up on its own GPU afterward, so the cross-GPU handoff needs no node changes.
- Nodes opt in via `@invocation(idle_gpu_offloadable=True)`, mirroring the existing `bottleneck` `ClassVar` marker (no API-schema impact). Applied to the text/prompt encoder nodes: `compel` (+ SDXL/refiner), `flux_text_encoder`, `sd3_text_encoder`, `qwen_image_text_encoder`, `anima_text_encoder`, `cogview4_text_encoder`, `flux2_klein_text_encoder`, `z_image_text_encoder`, and `flux_redux`.
Why the per-device lock
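To make the locking rules concrete, here is a minimal sketch of a per-device arbiter. It is not the code in `backend/util/device_pool.py`: the class name `DevicePool`, the methods `session` and `try_borrow`, and the `exclude` parameter are all hypothetical names chosen for illustration. It only assumes what the description states: one exclusive lock per device, a blocking acquire for a session's own device, and a non-blocking try-acquire for a borrow.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, Optional


class DevicePool:
    """Illustrative stand-in for a per-device exclusive-use arbiter (hypothetical API)."""

    def __init__(self, devices: list[str]) -> None:
        # One exclusive-use lock per generation device.
        self._locks = {device: threading.Lock() for device in devices}

    @contextmanager
    def session(self, device: str) -> Iterator[str]:
        # A native session blocks until its own device is free,
        # then holds the lock for the whole run.
        with self._locks[device]:
            yield device

    @contextmanager
    def try_borrow(self, exclude: str) -> Iterator[Optional[str]]:
        # Non-blocking: take the first device (in sorted order, so the choice
        # is deterministic) whose lock is currently free. Yields None when
        # every other device is busy; the caller then stays on its own GPU.
        borrowed: Optional[str] = None
        for device in sorted(self._locks):
            if device != exclude and self._locks[device].acquire(blocking=False):
                borrowed = device
                break
        try:
            yield borrowed
        finally:
            if borrowed is not None:
                self._locks[borrowed].release()


if __name__ == "__main__":
    pool = DevicePool(["cuda:0", "cuda:1"])
    with pool.session("cuda:0"):
        # cuda:1 is idle, so the encoder node may borrow it.
        with pool.try_borrow(exclude="cuda:0") as device:
            print("borrowed:", device)
```

Because a borrow never waits, the only blocking acquire in this sketch is a session waiting on its own device, which is the property the deadlock-freedom argument above relies on.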
An earlier iteration routed the encoder into the idle GPU's cache without exclusivity. Because two sessions using the same model/prompt resolve to the same encoder cache key, they ended up sharing one model object and running concurrent forward passes and in-place LoRA patching on it, producing garbled images. The per-device lock makes a borrow and a native session mutually exclusive on a GPU, which fixes this; `prevent_auto_evict` from invoke-ai#9310 is intentionally not ported, so a borrowed encoder yields its GPU's VRAM (via normal LRU) the moment that GPU is claimed for a real session.
Tests
- `tests/backend/util/test_device_pool.py`: arbiter lock semantics (borrow exclusion, session/borrow mutual exclusion, startup-race ordering, deterministic selection) plus a multi-threaded regression test asserting a GPU is never used by a session and a borrow at the same time.
- `tests/app/services/session_processor/test_encoder_offload.py`: the runner offload context manager (re-pin/restore, no-offload-when-busy, flag-off, restore-on-exception), the `idle_gpu_offloadable` marker wiring on real nodes, and a two-worker concurrency regression exercising the real offload path.
Docs
- `configuration/invokeai-yaml.mdx`: documents `offload_text_encoders_to_idle_gpus`.
- `development/Guides/creating-nodes.mdx`: explains how (and when) a node should set `idle_gpu_offloadable=True`.
Verification
- `test_torch_cuda_allocator.py::test_configure_torch_cuda_allocator_configures_backend` requires a working CUDA `cudaMallocAsync` allocator and fails on a CPU-only box; it touches none of this PR's code.
- `ruff check` and `ruff format --check` clean; `openapi.json` / `schema.ts` regenerated (only the new config field).
🤖 Generated with Claude Code