feat(multi-gpu): offload text encoders to idle GPUs by lstein · Pull Request #137 · lstein/InvokeAI

lstein · 2026-06-28T23:33:57Z

⚠️ Merge order

This PR depends on invoke-ai#9263 and targets its branch (lstein/feat/multi-gpu), not main. It should be reviewed and merged only after invoke-ai#9263 has been reviewed, accepted, and merged. Once invoke-ai#9263 lands, this PR's base can be retargeted to main.

This PR was inspired by invoke-ai#9310 (split-GPU text encoder) and supersedes it. It delivers the same idea, reworked to compose with the multi-GPU parallel-generation architecture from invoke-ai#9263 and to reuse that branch's existing per-device caches, device-aware VRAM accounting, and shared CPU-weights store rather than re-adding them.

Remember to give full credit to @Jacid23 for the concept and initial implementation.

Summary

On a multi-GPU machine, invoke-ai#9263 runs one generation session per GPU. When fewer sessions are running than there are GPUs, the spare GPUs sit idle. This PR uses that idle capacity: a session's text/prompt encoder runs on a currently-idle GPU instead of the GPU running its denoise pipeline.

- Avoids evicting the denoise model from VRAM just to make room for the encoder.
- Lets a cached encoder be reused across generations, making repeated single-session generations noticeably smoother.
- Purely a placement optimization: generated images are unchanged.

Controlled by a new offload_text_encoders_to_idle_gpus setting (default on). With a single device, or under full multi-GPU load (no idle GPU), encoders run on the session's own GPU exactly as before.

How it works

- A new GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) holds a per-device exclusive-use lock. A native session blocking-acquires its own GPU's lock for the whole run; an encoder node try-borrows an idle GPU's lock for the duration of that node. A borrowed encoder and a native session are therefore mutually exclusive on a GPU, and the design is deadlock-free: borrows are non-blocking try-acquires, and a session only ever blocks on its own device.
- DefaultSessionRunner temporarily re-pins the worker thread to the borrowed GPU for the whole encoder node. The encoder loads into and runs on that GPU; its conditioning is stored on the CPU (as encoder nodes already do) and the denoiser picks it up on its own GPU afterward, so the cross-GPU handoff needs no node changes.
- Nodes opt in individually via @invocation(idle_gpu_offloadable=True), mirroring the existing bottleneck ClassVar marker (no API-schema impact). Applied to the text/prompt encoder nodes: compel (+ SDXL/refiner), flux_text_encoder, sd3_text_encoder, qwen_image_text_encoder, anima_text_encoder, cogview4_text_encoder, flux2_klein_text_encoder, z_image_text_encoder, and flux_redux.

Why the per-device lock

An earlier iteration routed the encoder into the idle GPU's cache without exclusivity. Because two sessions using the same model/prompt resolve to the same encoder cache key, they ended up sharing one model object and running concurrent forward passes + in-place LoRA patching on it — producing garbled images. The per-device lock makes a borrow and a native session mutually exclusive on a GPU, which fixes this; prevent_auto_evict from invoke-ai#9310 is intentionally not ported, so a borrowed encoder yields its GPU's VRAM (via normal LRU) the moment that GPU is claimed for a real session.
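The sharing hazard can be shown with a toy model. The sketch below is only an illustration of the failure mode, assuming a cache that returns the same object for the same key; `ToyEncoder`, `ToyCache`, and `apply_lora_in_place` are hypothetical names and do not correspond to InvokeAI classes.

```python
from dataclasses import dataclass


@dataclass
class ToyEncoder:
    # Stand-in for a model; a single float plays the role of the weights.
    weights: float = 1.0


class ToyCache:
    """Hypothetical cache: one entry per key, shared by every caller."""

    def __init__(self) -> None:
        self._entries: dict[str, ToyEncoder] = {}

    def get(self, key: str) -> ToyEncoder:
        # Same key -> same object, which is what makes unguarded sharing unsafe.
        return self._entries.setdefault(key, ToyEncoder())


def apply_lora_in_place(model: ToyEncoder, delta: float) -> None:
    # In-place patching mutates the shared object rather than a private copy.
    model.weights += delta


cache = ToyCache()
session_a_encoder = cache.get("encoder:model-x")
session_b_encoder = cache.get("encoder:model-x")

# Session A patches its encoder; session B silently sees the patched weights.
apply_lora_in_place(session_a_encoder, 0.5)
```

With real models the two sessions would also interleave forward passes on that one object, so making the device exclusive for the duration of a borrow removes both sources of interference at once.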

Tests

- tests/backend/util/test_device_pool.py: arbiter lock semantics (borrow exclusion, session/borrow mutual exclusion, startup-race ordering, deterministic selection) plus a multi-threaded regression test asserting that a GPU is never used by a session and a borrow at the same time.
- tests/app/services/session_processor/test_encoder_offload.py: the runner offload context manager (re-pin/restore, no-offload-when-busy, flag-off, restore-on-exception), the idle_gpu_offloadable marker wiring on real nodes, and a two-worker concurrency regression exercising the real offload path.
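The invariant that the concurrency regressions assert (a session and a borrow never use a GPU at the same time) can be checked with a pattern like the one below. It is a sketch of the idea only, not the PR's test code; `run_exclusion_check` and its internals are hypothetical, and a plain `threading.Lock` stands in for one GPU's exclusive-use lock.

```python
import threading
import time


def run_exclusion_check(rounds: int = 50) -> int:
    """Return the peak number of simultaneous users observed on one toy GPU."""
    gpu_lock = threading.Lock()  # stands in for the device's exclusive-use lock
    guard = threading.Lock()     # protects the bookkeeping counters below
    state = {"users": 0, "max_users": 0}

    def use_gpu() -> None:
        with guard:
            state["users"] += 1
            state["max_users"] = max(state["max_users"], state["users"])
        time.sleep(0.0005)  # pretend to do some work while "on" the GPU
        with guard:
            state["users"] -= 1

    def session() -> None:
        for _ in range(rounds):
            with gpu_lock:  # native session: blocking acquire
                use_gpu()

    def borrower() -> None:
        for _ in range(rounds):
            if gpu_lock.acquire(blocking=False):  # borrow: try-acquire only
                try:
                    use_gpu()
                finally:
                    gpu_lock.release()

    threads = [threading.Thread(target=session), threading.Thread(target=borrower)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["max_users"]
```

A peak of 1 means the two roles never overlapped; any value above 1 would indicate that the lock discipline had been bypassed.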

Docs

- configuration/invokeai-yaml.mdx: documents offload_text_encoders_to_idle_gpus.
- development/Guides/creating-nodes.mdx: explains how (and when) a node should set idle_gpu_offloadable=True.
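As a rough picture of how a class-level opt-in marker of this kind can work, the sketch below uses a toy decorator. It is an assumption-laden illustration: `toy_invocation`, `ToyTextEncoder`, `ToyDenoise`, and `should_try_offload` are hypothetical, and the real @invocation decorator takes other arguments that are not modeled here.

```python
from typing import Callable


def toy_invocation(name: str, idle_gpu_offloadable: bool = False) -> Callable[[type], type]:
    """Hypothetical decorator that records an opt-in flag on the node class."""

    def wrap(cls: type) -> type:
        # Stored as plain class attributes, so the flag is not part of any instance schema.
        cls.invocation_type = name
        cls.idle_gpu_offloadable = idle_gpu_offloadable
        return cls

    return wrap


@toy_invocation("toy_text_encoder", idle_gpu_offloadable=True)
class ToyTextEncoder:
    """Produces conditioning only, so it is safe to run on another GPU."""


@toy_invocation("toy_denoise")
class ToyDenoise:
    """Did not opt in, so it always runs on the session's own GPU."""


def should_try_offload(node: object, setting_enabled: bool) -> bool:
    # The runner would consult both the global setting and the per-node marker.
    return bool(setting_enabled and getattr(type(node), "idle_gpu_offloadable", False))
```

Defaulting the marker to off keeps existing and third-party nodes on their session's own GPU unless they explicitly declare that running elsewhere is safe.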

Verification

- Full backend test suite: 2138 passed / 127 skipped. (One unrelated failure, test_torch_cuda_allocator.py::test_configure_torch_cuda_allocator_configures_backend, requires a working CUDA cudaMallocAsync allocator and fails on a CPU-only box; it touches none of this PR's code.)
- ruff check and ruff format --check clean; openapi.json / schema.ts regenerated (only the new config field).
- Manually verified on a dual-GPU machine: single-session offload, parallel sessions with the same model, and parallel sessions with two different models/encoders all produce correct images.

🤖 Generated with Claude Code

Adds `offload_text_encoders_to_idle_gpus` (default on): when more than one generation device is configured and a GPU is idle, a session's text/prompt encoder runs on the idle GPU instead of the one running its denoise pipeline. This avoids evicting the denoise model from VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Under full load (no idle GPU) behavior is unchanged.

Mechanism:
- New GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a per-device exclusive-use lock. A native session blocking-acquires its own device's lock for the whole run; an encoder node try-borrows an idle device's lock for the duration of the node. This makes a borrowed encoder and a native session mutually exclusive on a GPU -- preventing the shared-encoder corruption that produced garbled images -- and is deadlock-free (borrows are non-blocking; a session only ever blocks on its own device).
- DefaultSessionRunner re-pins the worker thread to the borrowed device for the whole encoder node; conditioning is stored on the CPU and the denoiser picks it up on its own GPU afterward.
- Nodes opt in via @invocation(idle_gpu_offloadable=True), mirroring the existing `bottleneck` ClassVar marker. Applied to the text/prompt encoder nodes (compel + sdxl/refiner, flux, sd3, qwen-image, anima, cogview4, flux2 klein, z-image, flux_redux).

Inspired by invoke-ai#9310; supersedes it.

Tests: device-pool lock semantics, two concurrency regression tests asserting a session and a borrow never use a GPU at the same time, the runner offload context-manager behavior, and a marker-wiring check.

Docs: invokeai-yaml.mdx (config setting) and creating-nodes.mdx (how to support the feature in a node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions bot added the python, python-tests, services, invocations, backend, frontend, and docs labels on Jun 28, 2026

lstein mentioned this pull request Jun 28, 2026

feat(multi-gpu): offload text encoders to idle GPUs invoke-ai/InvokeAI#9311

Closed

lstein force-pushed the lstein/feat/multi-gpu-use-idle branch from 4dccb13 to 0037a21 Compare June 29, 2026 00:39

This was referenced Jun 29, 2026

Add split GPU text encoder cache invoke-ai/InvokeAI#9310

Open

feat: multi-GPU parallel session execution invoke-ai/InvokeAI#9263

Open

lstein wants to merge 1 commit into lstein/feat/multi-gpu from lstein/feat/multi-gpu-use-idle
