Add split GPU text encoder cache #9310

Open
Jacid23 wants to merge 1 commit into invoke-ai:main from Jacid23:codex/dual-gpu-text-encoder

Jacid23 commented Jun 28, 2026

Summary

  • Add an optional split-GPU text encoder mode for systems with multiple CUDA GPUs.
  • When enabled, selected text encoders are loaded on the secondary CUDA device while the main generation model stays on the primary execution device.
  • Add active load/unload sync endpoints so turning the toggle off releases the secondary GPU cache instead of leaving the encoder resident.
  • Add compact hardware/cache status in the UI and a model-cache sleep timer setting for idle cleanup.

Why

Text encoder loads can force the denoise model to unload/reload on single-device cache paths. On dual-GPU systems, keeping text encoders resident on the other CUDA device avoids that churn and makes repeated generation materially smoother.

Behavior

  • The UI control is only useful when at least two CUDA devices are available.
  • Disabling the toggle actively drops the split-GPU text encoder cache so that GPU can be used elsewhere.
  • CPU offload behavior is not changed.
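
The toggle behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `SplitEncoderCache`, its methods, and the `device_count` / `primary_index` fields are all hypothetical names, and the sketch assumes at least one CUDA device is visible and that the device count is already known.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SplitEncoderCache:
    """Hypothetical stand-in for a split-GPU text encoder cache."""

    device_count: int        # number of visible CUDA devices (assumed >= 1)
    primary_index: int = 0   # device that runs the main generation model
    enabled: bool = False    # state of the split-GPU toggle
    # Encoders currently resident on the secondary device: key -> device name.
    _resident: Dict[str, str] = field(default_factory=dict)

    def encoder_device(self) -> str:
        """Return the device a text encoder should be loaded on."""
        if self.enabled and self.device_count >= 2:
            # Use the first CUDA device that is not the primary one.
            secondary = next(
                i for i in range(self.device_count) if i != self.primary_index
            )
            return f"cuda:{secondary}"
        # With the toggle off, or with a single GPU, stay on the primary device.
        return f"cuda:{self.primary_index}"

    def load(self, key: str) -> str:
        """Record where an encoder is loaded and return that device."""
        device = self.encoder_device()
        if device != f"cuda:{self.primary_index}":
            self._resident[key] = device
        return device

    def set_enabled(self, enabled: bool) -> None:
        """Flip the toggle; turning it off drops the secondary-GPU cache."""
        was_enabled = self.enabled
        self.enabled = enabled
        if was_enabled and not enabled:
            self._resident.clear()

    def resident_keys(self) -> List[str]:
        return sorted(self._resident)
```

In this sketch, disabling the toggle clears the secondary-device entries immediately rather than waiting for eviction, which mirrors the "actively drops" behavior listed above.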

Verification

  • pnpm lint:prettier
  • pnpm lint:tsc
  • pnpm lint:knip
  • OpenAPI schema generated output matches checked-in openapi.json
  • Typegen output is stable after regeneration

Notes

This branch was prepared from upstream/main and squashed to one focused commit. It does not include local fork/runtime update scripts, batch-specific files, or unrelated compatibility work.

@github-actions github-actions Bot added the api, python, Root, backend, services, and frontend labels Jun 28, 2026
@Jacid23 Jacid23 marked this pull request as ready for review June 28, 2026 03:04
@lstein lstein self-assigned this Jun 28, 2026
@lstein lstein moved this to 6.14.x Theme: USER EXPERIENCE in Invoke - Community Roadmap Jun 28, 2026
@lstein lstein added the 6.14.x label Jun 28, 2026

lstein commented Jun 28, 2026

Collaborator

This is a great idea. Heads up that there is a generic multi-GPU PR coming down the pike (#5997), and this will need some adaptation to work with that scheme. I'll be working on an integration.

lstein added a commit to lstein/InvokeAI that referenced this pull request Jun 29, 2026
Adds `offload_text_encoders_to_idle_gpus` (default on): when more than one
generation device is configured and a GPU is idle, a session's text/prompt
encoder runs on the idle GPU instead of the one running its denoise pipeline.
This avoids evicting the denoise model from VRAM to make room for the encoder,
and lets a cached encoder be reused across generations. Under full load (no
idle GPU) behavior is unchanged.

Mechanism:
- New GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a
  per-device exclusive-use lock. A native session blocking-acquires its own
  device's lock for the whole run; an encoder node try-borrows an idle device's
  lock for the duration of the node. This makes a borrowed encoder and a native
  session mutually exclusive on a GPU -- preventing the shared-encoder
  corruption that produced garbled images -- and is deadlock-free (borrows are
  non-blocking; a session only ever blocks on its own device).
- DefaultSessionRunner re-pins the worker thread to the borrowed device for the
  whole encoder node; conditioning is stored on the CPU and the denoiser picks
  it up on its own GPU afterward.
- Nodes opt in via @invocation(idle_gpu_offloadable=True), mirroring the
  existing `bottleneck` ClassVar marker. Applied to the text/prompt encoder
  nodes (compel + sdxl/refiner, flux, sd3, qwen-image, anima, cogview4, flux2
  klein, z-image, flux_redux).
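
The locking scheme described in the first bullet can be illustrated with a small standard-library sketch. It is not the contents of backend/util/device_pool.py: the class `DevicePool` and the methods `native_session` and `borrow_idle` are hypothetical names chosen for this example, and device names are plain strings.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class DevicePool:
    """Per-device exclusive-use locks (illustrative sketch only)."""

    def __init__(self, devices: List[str]) -> None:
        self._locks: Dict[str, threading.Lock] = {
            device: threading.Lock() for device in devices
        }

    @contextmanager
    def native_session(self, device: str) -> Iterator[str]:
        # A session blocks on its own device's lock and holds it for the run.
        with self._locks[device]:
            yield device

    @contextmanager
    def borrow_idle(self, exclude: str) -> Iterator[Optional[str]]:
        # Try every other device without blocking; yield None if all are busy.
        borrowed: Optional[str] = None
        for device, lock in self._locks.items():
            if device != exclude and lock.acquire(blocking=False):
                borrowed = device
                break
        try:
            yield borrowed
        finally:
            if borrowed is not None:
                self._locks[borrowed].release()


# Usage: run the encoder on a borrowed GPU if one is idle, else on our own.
pool = DevicePool(["cuda:0", "cuda:1"])
with pool.native_session("cuda:0") as own:
    with pool.borrow_idle(exclude=own) as idle:
        encoder_device = idle if idle is not None else own
```

Because a borrow never waits and a session only ever waits on its own lock, no thread in this sketch can wait while holding a lock that another waiting thread needs, which is the deadlock-freedom argument made in the commit message.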

Inspired by invoke-ai#9310; supersedes it.

Tests: device-pool lock semantics, two concurrency regression tests asserting a
session and a borrow never use a GPU at the same time, the runner offload
context-manager behavior, and a marker-wiring check.

Docs: invokeai-yaml.mdx (config setting) and creating-nodes.mdx (how to support
the feature in a node).
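
As a rough picture of the opt-in marker mentioned under Mechanism, the sketch below uses a toy decorator that sets a class-level flag. The names `toy_invocation`, `BaseNode`, `may_offload`, and the two node classes are hypothetical; the real `@invocation` decorator takes other arguments that are not modeled here.

```python
from typing import Callable, ClassVar, Type, TypeVar


class BaseNode:
    """Toy base class; every node is non-offloadable unless it opts in."""

    node_type: ClassVar[str] = ""
    idle_gpu_offloadable: ClassVar[bool] = False


N = TypeVar("N", bound=BaseNode)


def toy_invocation(
    node_type: str, idle_gpu_offloadable: bool = False
) -> Callable[[Type[N]], Type[N]]:
    """Toy stand-in for a node decorator that records the opt-in flag."""

    def decorate(cls: Type[N]) -> Type[N]:
        cls.node_type = node_type
        cls.idle_gpu_offloadable = idle_gpu_offloadable
        return cls

    return decorate


@toy_invocation("text_encoder_sketch", idle_gpu_offloadable=True)
class TextEncoderNode(BaseNode):
    pass


@toy_invocation("denoise_sketch")
class DenoiseNode(BaseNode):
    pass


def may_offload(node: BaseNode) -> bool:
    # A runner could check the class-level marker before borrowing a GPU.
    return type(node).idle_gpu_offloadable
```

Keeping the flag on the class rather than the instance means a runner can decide from the node type alone, without instantiating or inspecting node inputs.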

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lstein commented Jun 29, 2026

Collaborator

Because #9263 is pending, this PR will conflict with it and can't be merged. However, the idea has been folded into a pending PR in my personal repository that will be posted here after #9263 goes in. It is in lstein#137 if you'd like to take a look. I will give full credit to @Jacid23 for the concept and initial implementation.
