feat(multi-gpu): offload text encoders to idle GPUs#137

Open: lstein wants to merge 1 commit into lstein/feat/multi-gpu from lstein/feat/multi-gpu-use-idle

@lstein lstein commented Jun 28, 2026

Owner

⚠️ Merge order

This PR depends on invoke-ai#9263 and targets its branch (lstein/feat/multi-gpu), not main. It should be reviewed and merged only after invoke-ai#9263 has been reviewed, accepted, and merged. Once invoke-ai#9263 lands, this PR's base can be retargeted to main.

It was inspired by invoke-ai#9310 (split-GPU text encoder) and supersedes that PR — it delivers the same idea, but reworked to compose with the multi-GPU parallel-generation architecture from invoke-ai#9263 and to reuse that branch's existing per-device caches, device-aware VRAM accounting, and shared CPU-weights store rather than re-adding them.

Full credit for the concept and initial implementation goes to @Jacid23.

Summary

On a multi-GPU machine, invoke-ai#9263 runs one generation session per GPU. When fewer sessions are running than there are GPUs, the spare GPUs sit idle. This PR uses that idle capacity: a session's text/prompt encoder runs on a currently-idle GPU instead of the GPU running its denoise pipeline.

  • Avoids evicting the denoise model from VRAM just to make room for the encoder.
  • Lets a cached encoder stay resident and be reused across generations, so repeated single-session generations run noticeably more smoothly.
  • Purely a placement optimization — generated images are unchanged.

Controlled by a new offload_text_encoders_to_idle_gpus setting (default on). With a single device, or under full multi-GPU load (no idle GPU), encoders run on the session's own GPU exactly as before.
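
The placement rule above can be sketched as a small pure function. This is an illustration only, not the code in this PR: `pick_encoder_device` and its parameters are hypothetical names, and choosing the first idle device in configured order is an assumption.

```python
from typing import Sequence


def pick_encoder_device(
    session_device: str,
    devices: Sequence[str],
    idle_devices: Sequence[str],
    offload_enabled: bool = True,
) -> str:
    """Return the device an encoder node would run on (illustrative only)."""
    # Flag off, or only one generation device configured: behave as before.
    if not offload_enabled or len(devices) < 2:
        return session_device
    # Assumption: take the first idle device, in configured order, that is
    # not the session's own device.
    for device in devices:
        if device != session_device and device in idle_devices:
            return device
    # Full multi-GPU load: no idle GPU, so stay on the session's own device.
    return session_device
```

With two devices and `cuda:1` idle, a session on `cuda:0` would get `cuda:1`; with no idle device or the flag off it falls back to `cuda:0`.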

How it works

  • GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a per-device exclusive-use lock. A native session blocking-acquires its own GPU's lock for the whole run; an encoder node try-borrows an idle GPU's lock for the duration of that node. A borrowed encoder and a native session are therefore mutually exclusive on a GPU, and the design is deadlock-free (borrows are non-blocking try-acquires; a session only ever blocks on its own device).
  • DefaultSessionRunner temporarily re-pins the worker thread to the borrowed GPU for the whole encoder node. The encoder loads into and runs on that GPU; its conditioning is stored on the CPU (as encoder nodes already do) and the denoiser picks it up on its own GPU afterward — so the cross-GPU handoff needs no node changes.
  • Per-node opt-in via @invocation(idle_gpu_offloadable=True), mirroring the existing bottleneck ClassVar marker (no API-schema impact). Applied to the text/prompt encoder nodes: compel (+ SDXL/refiner), flux_text_encoder, sd3_text_encoder, qwen_image_text_encoder, anima_text_encoder, cogview4_text_encoder, flux2_klein_text_encoder, z_image_text_encoder, and flux_redux.
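
A minimal sketch of the locking scheme described in the first bullet, assuming one `threading.Lock` per device. `DevicePool`, `session`, and `borrow_idle` are hypothetical names and do not reflect the actual `GENERATION_DEVICE_POOL` interface.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, Optional, Sequence


class DevicePool:
    """Toy arbiter: one exclusive-use lock per device (hypothetical API)."""

    def __init__(self, devices: Sequence[str]) -> None:
        self._locks: Dict[str, threading.Lock] = {d: threading.Lock() for d in devices}

    @contextmanager
    def session(self, device: str) -> Iterator[None]:
        # A native session blocks until its own device is free,
        # then holds the lock for the whole run.
        with self._locks[device]:
            yield

    @contextmanager
    def borrow_idle(self, exclude: str) -> Iterator[Optional[str]]:
        # An encoder node tries every other device without blocking.
        borrowed: Optional[str] = None
        for device, lock in self._locks.items():
            if device != exclude and lock.acquire(blocking=False):
                borrowed = device
                break
        try:
            # None means no idle GPU; the caller stays on its own device.
            yield borrowed
        finally:
            if borrowed is not None:
                self._locks[borrowed].release()
```

Because a borrow never waits and a session only waits on its own device, no thread holds one lock while blocking on another, which is the property the deadlock-freedom argument relies on.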

Why the per-device lock

An earlier iteration routed the encoder into the idle GPU's cache without exclusivity. Because two sessions using the same model/prompt resolve to the same encoder cache key, they ended up sharing one model object and running concurrent forward passes + in-place LoRA patching on it — producing garbled images. The per-device lock makes a borrow and a native session mutually exclusive on a GPU, which fixes this; prevent_auto_evict from invoke-ai#9310 is intentionally not ported, so a borrowed encoder yields its GPU's VRAM (via normal LRU) the moment that GPU is claimed for a real session.
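
The sharing hazard can be reproduced with a toy cache. Everything here is a stand-in (a list of floats for the encoder weights, a dict for the cache); it only illustrates why an identical cache key plus in-place patching needs exclusivity.

```python
from typing import Dict, List

# Toy stand-in for a per-device encoder cache keyed by model identity.
cache: Dict[str, List[float]] = {}


def get_encoder(key: str) -> List[float]:
    # Same key -> same object; a new one is created only on a cache miss.
    return cache.setdefault(key, [1.0, 1.0])


def patch_in_place(weights: List[float], delta: float) -> None:
    # Stand-in for in-place LoRA patching: mutates the shared weights.
    for i in range(len(weights)):
        weights[i] += delta


enc_a = get_encoder("clip-l")  # session A
enc_b = get_encoder("clip-l")  # session B resolves to the same key
patch_in_place(enc_a, 0.5)     # A applies its patch
shared = enc_a is enc_b        # B now sees A's patched weights
```

Without a lock, session B could run a forward pass while A's patch is half-applied or not yet removed.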

Tests

  • tests/backend/util/test_device_pool.py — arbiter lock semantics (borrow exclusion, session/borrow mutual exclusion, startup-race ordering, deterministic selection) plus a multi-threaded regression test asserting a GPU is never used by a session and a borrow at the same time.
  • tests/app/services/session_processor/test_encoder_offload.py — the runner offload context manager (re-pin/restore, no-offload-when-busy, flag-off, restore-on-exception), the idle_gpu_offloadable marker wiring on real nodes, and a two-worker concurrency regression exercising the real offload path.

Docs

  • configuration/invokeai-yaml.mdx — documents offload_text_encoders_to_idle_gpus.
  • development/Guides/creating-nodes.mdx — explains how (and when) a node should set idle_gpu_offloadable=True.
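
For readers unfamiliar with the marker pattern, here is a self-contained toy version of a class-level opt-in flag set by a decorator. `toy_invocation`, `BaseNode`, and the node classes are hypothetical; the real `@invocation` decorator takes other arguments and is not reproduced here.

```python
from typing import Callable, ClassVar, Type, TypeVar


class BaseNode:
    # Class-level markers; offloading is off unless a node opts in.
    node_name: ClassVar[str] = ""
    idle_gpu_offloadable: ClassVar[bool] = False


T = TypeVar("T", bound=BaseNode)


def toy_invocation(name: str, idle_gpu_offloadable: bool = False) -> Callable[[Type[T]], Type[T]]:
    """Toy decorator that records the opt-in flag on the class."""

    def wrap(cls: Type[T]) -> Type[T]:
        cls.node_name = name
        cls.idle_gpu_offloadable = idle_gpu_offloadable
        return cls

    return wrap


@toy_invocation("toy_text_encoder", idle_gpu_offloadable=True)
class ToyTextEncoder(BaseNode):
    pass


@toy_invocation("toy_denoise")
class ToyDenoise(BaseNode):
    pass
```

A runner could then check `type(node).idle_gpu_offloadable` before attempting a borrow, without the flag appearing in any instance field or API schema.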

Verification

  • Full backend test suite: 2138 passed / 127 skipped. (One unrelated failure, test_torch_cuda_allocator.py::test_configure_torch_cuda_allocator_configures_backend, requires a working CUDA cudaMallocAsync allocator and fails on a CPU-only box; it touches none of this PR's code.)
  • ruff check and ruff format --check clean; openapi.json / schema.ts regenerated (only the new config field).
  • Manually verified on a dual-GPU machine: single-session offload, parallel sessions with the same model, and parallel sessions with two different models/encoders all produce correct images.

🤖 Generated with Claude Code

Adds `offload_text_encoders_to_idle_gpus` (default on): when more than one
generation device is configured and a GPU is idle, a session's text/prompt
encoder runs on the idle GPU instead of the one running its denoise pipeline.
This avoids evicting the denoise model from VRAM to make room for the encoder,
and lets a cached encoder be reused across generations. Under full load (no
idle GPU) behavior is unchanged.

Mechanism:
- New GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a
  per-device exclusive-use lock. A native session blocking-acquires its own
  device's lock for the whole run; an encoder node try-borrows an idle device's
  lock for the duration of the node. This makes a borrowed encoder and a native
  session mutually exclusive on a GPU -- preventing the shared-encoder
  corruption that produced garbled images -- and is deadlock-free (borrows are
  non-blocking; a session only ever blocks on its own device).
- DefaultSessionRunner re-pins the worker thread to the borrowed device for the
  whole encoder node; conditioning is stored on the CPU and the denoiser picks
  it up on its own GPU afterward.
- Nodes opt in via @invocation(idle_gpu_offloadable=True), mirroring the
  existing `bottleneck` ClassVar marker. Applied to the text/prompt encoder
  nodes (compel + sdxl/refiner, flux, sd3, qwen-image, anima, cogview4, flux2
  klein, z-image, flux_redux).

Inspired by invoke-ai#9310; supersedes it.

Tests: device-pool lock semantics, two concurrency regression tests asserting a
session and a borrow never use a GPU at the same time, the runner offload
context-manager behavior, and a marker-wiring check.

Docs: invokeai-yaml.mdx (config setting) and creating-nodes.mdx (how to support
the feature in a node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>