Add split GPU text encoder cache by Jacid23 · Pull Request #9310 · invoke-ai/InvokeAI

Jacid23 · 2026-06-28T02:40:15Z

Summary

Add an optional split-GPU text encoder mode for systems with multiple CUDA GPUs.
When enabled, selected text encoders are loaded on the secondary CUDA device while the main generation model stays on the primary execution device.
Add active load/unload sync endpoints so turning the toggle off releases the secondary GPU cache instead of leaving the encoder resident.
Add compact hardware/cache status in the UI and a model-cache sleep timer setting for idle cleanup.
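
The device choice described above can be sketched as follows. This is a minimal illustration, not the PR's code: `SplitGpuConfig`, `split_text_encoder_enabled`, and `pick_text_encoder_device` are hypothetical names, and the real setting keys are not shown on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SplitGpuConfig:
    # Hypothetical setting names, chosen only for this sketch.
    split_text_encoder_enabled: bool
    execution_device: str = "cuda:0"


def pick_text_encoder_device(config: SplitGpuConfig, cuda_devices: List[str]) -> str:
    """Return the device a text encoder should be loaded on."""
    if not config.split_text_encoder_enabled:
        return config.execution_device
    # Any CUDA device other than the primary one counts as "secondary" here.
    secondaries = [d for d in cuda_devices if d != config.execution_device]
    if not secondaries:
        # Fewer than two CUDA devices: fall back to the primary device.
        return config.execution_device
    return secondaries[0]
```

With the toggle on and two devices listed, the encoder lands on the second device. With the toggle off, or with only one device, it stays on the execution device.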

Why

Text encoder loads can force the denoise model to unload/reload on single-device cache paths. On dual-GPU systems, keeping text encoders resident on the other CUDA device avoids that churn and makes repeated generation materially smoother.
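
A toy simulation may make the churn concrete. Everything here is an assumption for illustration: the `ToyVramCache` class, the 24 GB budget, and the 10 GB / 20 GB model sizes are invented, and the real model cache is more involved than a plain LRU.

```python
from collections import OrderedDict


class ToyVramCache:
    """LRU cache with a fixed size budget, standing in for one GPU's model cache."""

    def __init__(self, capacity_gb: float) -> None:
        self.capacity_gb = capacity_gb
        self.models: "OrderedDict[str, float]" = OrderedDict()
        self.loads = 0  # counts every load or reload onto this device

    def ensure_loaded(self, name: str, size_gb: float) -> None:
        if name in self.models:
            self.models.move_to_end(name)  # cache hit: mark as most recently used
            return
        # Evict least recently used models until the new one fits.
        while self.models and sum(self.models.values()) + size_gb > self.capacity_gb:
            self.models.popitem(last=False)
        self.models[name] = size_gb
        self.loads += 1


def count_loads(split: bool, generations: int) -> int:
    """Total model loads across both GPUs for a number of generations."""
    primary = ToyVramCache(capacity_gb=24)
    secondary = ToyVramCache(capacity_gb=24)
    encoder_cache = secondary if split else primary
    for _ in range(generations):
        encoder_cache.ensure_loaded("text_encoder", 10)  # made-up size
        primary.ensure_loaded("denoiser", 20)  # made-up size
    return primary.loads + secondary.loads
```

In this toy setup the single-device path reloads both models on every generation, while the split path loads each model once and then reuses it.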

Behavior

The UI control is only useful when at least two CUDA devices are available.
Disabling the toggle actively drops the split-GPU text encoder cache so that GPU can be used elsewhere.
CPU offload behavior is not changed.
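
The toggle semantics above could look roughly like this. `SplitEncoderCache` and its `sync` method are hypothetical stand-ins for the load/unload sync endpoints, which are not shown on this page; a real implementation would also have to release the model tensors, and would likely call `torch.cuda.empty_cache()` afterward.

```python
from typing import Dict, List, Optional


class SplitEncoderCache:
    """Toy stand-in for the secondary-GPU text encoder cache."""

    def __init__(self, secondary_device: Optional[str]) -> None:
        # None models a system with fewer than two CUDA devices.
        self.secondary_device = secondary_device
        self.resident: Dict[str, str] = {}  # encoder name -> device

    def sync(self, enabled: bool, encoders: List[str]) -> Dict[str, str]:
        """Bring the cache in line with the toggle and return what is resident."""
        if enabled and self.secondary_device is not None:
            for name in encoders:
                # Keep already-resident encoders where they are.
                self.resident.setdefault(name, self.secondary_device)
        else:
            # Toggle off (or no secondary GPU): actively drop everything.
            self.resident.clear()
        return dict(self.resident)
```

The point of the sketch is that turning the toggle off is an active operation that empties the cache, rather than a flag that only affects future loads.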

Verification

pnpm lint:prettier
pnpm lint:tsc
pnpm lint:knip
OpenAPI schema generated output matches checked-in openapi.json
Typegen output is stable after regeneration

Notes

This branch was prepared from upstream/main and squashed to one focused commit. It does not include local fork/runtime update scripts, batch-specific files, or unrelated compatibility work.

lstein · 2026-06-28T14:03:02Z

This is a great idea. Heads up that there is a generic multi-GPU PR coming down the pike (#5997), and this will need some adaptation to work with that scheme. I'll be working on an integration.

Adds `offload_text_encoders_to_idle_gpus` (default on): when more than one generation device is configured and a GPU is idle, a session's text/prompt encoder runs on the idle GPU instead of the one running its denoise pipeline. This avoids evicting the denoise model from VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Under full load (no idle GPU) behavior is unchanged.

Mechanism:

- New GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a per-device exclusive-use lock. A native session blocking-acquires its own device's lock for the whole run; an encoder node try-borrows an idle device's lock for the duration of the node. This makes a borrowed encoder and a native session mutually exclusive on a GPU -- preventing the shared-encoder corruption that produced garbled images -- and is deadlock-free (borrows are non-blocking; a session only ever blocks on its own device).
- DefaultSessionRunner re-pins the worker thread to the borrowed device for the whole encoder node; conditioning is stored on the CPU and the denoiser picks it up on its own GPU afterward.
- Nodes opt in via @invocation(idle_gpu_offloadable=True), mirroring the existing `bottleneck` ClassVar marker. Applied to the text/prompt encoder nodes (compel + sdxl/refiner, flux, sd3, qwen-image, anima, cogview4, flux2 klein, z-image, flux_redux).

Inspired by invoke-ai#9310; supersedes it.

Tests: device-pool lock semantics, two concurrency regression tests asserting a session and a borrow never use a GPU at the same time, the runner offload context-manager behavior, and a marker-wiring check.

Docs: invokeai-yaml.mdx (config setting) and creating-nodes.mdx (how to support the feature in a node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
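
The per-device locking scheme described above (blocking acquire for a session's own device, non-blocking try-borrow for an encoder) can be sketched with the standard library. This is an illustration under assumptions, not the contents of backend/util/device_pool.py: `DevicePool`, `session`, and `borrow_idle` are hypothetical names.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class DevicePool:
    """One exclusive-use lock per generation device."""

    def __init__(self, devices: List[str]) -> None:
        self._locks: Dict[str, threading.Lock] = {d: threading.Lock() for d in devices}

    @contextmanager
    def session(self, device: str) -> Iterator[str]:
        """A native session blocks until it owns its own device."""
        lock = self._locks[device]
        lock.acquire()  # blocking: a session only ever waits on its own device
        try:
            yield device
        finally:
            lock.release()

    @contextmanager
    def borrow_idle(self, exclude: str) -> Iterator[Optional[str]]:
        """Try to borrow any idle device other than `exclude`; never blocks."""
        borrowed = None
        for device, lock in self._locks.items():
            if device != exclude and lock.acquire(blocking=False):
                borrowed = (device, lock)
                break
        try:
            # None means every other device is busy: run on the home device.
            yield borrowed[0] if borrowed else None
        finally:
            if borrowed:
                borrowed[1].release()
```

A session on `cuda:0` would wrap its whole run in `pool.session("cuda:0")` and wrap each encoder node in `pool.borrow_idle(exclude="cuda:0")`, falling back to its own device when the borrow yields `None`. Because borrows never block, a thread can only wait on the lock for its own device, which is the property the description relies on for deadlock freedom.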

lstein · 2026-06-29T01:01:25Z

Because of pending #9263, this PR will be in conflict and can't be merged. However, the idea has been folded into a pending PR in my personal repository that will be posted here after #9263 goes in. It is in lstein#137 if you'd like to take a look. I will give full credit to @Jacid23 for the concept and initial implementation.

feat: add split gpu text encoder cache

3440563

github-actions bot added the api, python (PRs that change python files), Root, backend (PRs that change backend files), services (PRs that change app services), and frontend (PRs that change frontend files) labels Jun 28, 2026

Jacid23 marked this pull request as ready for review June 28, 2026 03:04

Jacid23 requested review from JPPhoto, Pfannkuchensack, blessedcoolant, dunkeroni and lstein as code owners June 28, 2026 03:04

lstein self-assigned this Jun 28, 2026

lstein added this to Invoke - Community Roadmap Jun 28, 2026

lstein moved this to 6.14.x Theme: USER EXPERIENCE in Invoke - Community Roadmap Jun 28, 2026

lstein added the 6.14.x label Jun 28, 2026

This was referenced Jun 28, 2026

feat(multi-gpu): offload text encoders to idle GPUs #9311

Closed

feat(multi-gpu): offload text encoders to idle GPUs lstein/InvokeAI#137

Open

lstein mentioned this pull request Jun 29, 2026

feat: multi-GPU parallel session execution #9263

Open

5 tasks

Jacid23 wants to merge 1 commit into invoke-ai:main from Jacid23:codex/dual-gpu-text-encoder
