feat(multi-gpu): offload text encoders to idle GPUs #9311

Closed
lstein wants to merge 7269 commits into lstein/feat/multi-gpu from lstein/feat/multi-gpu-use-idle

lstein (Collaborator) commented Jun 28, 2026

⚠️ Merge order

This PR depends on #9263 and targets its branch (lstein/feat/multi-gpu), not main. It should be reviewed and merged only after #9263 has been reviewed, accepted, and merged. Once #9263 lands, this PR's base can be retargeted to main.

It was inspired by #9310 (split-GPU text encoder) and supersedes that PR — it delivers the same idea, but reworked to compose with the multi-GPU parallel-generation architecture from #9263 and to reuse that branch's existing per-device caches, device-aware VRAM accounting, and shared CPU-weights store rather than re-adding them.

Summary

On a multi-GPU machine, #9263 runs one generation session per GPU. When fewer sessions are running than there are GPUs, the spare GPUs sit idle. This PR uses that idle capacity: a session's text/prompt encoder runs on a currently-idle GPU instead of the GPU running its denoise pipeline.

  • Avoids evicting the denoise model from VRAM just to make room for the encoder.
  • Lets a cached encoder be reused across generations, so repeated single-session generations avoid reloading it.
  • Purely a placement optimization — generated images are unchanged.

Controlled by a new offload_text_encoders_to_idle_gpus setting (default on). With a single device, or under full multi-GPU load (no idle GPU), encoders run on the session's own GPU exactly as before.

How it works

  • GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a per-device exclusive-use lock. A native session blocking-acquires its own GPU's lock for the whole run; an encoder node try-borrows an idle GPU's lock for the duration of that node. A borrowed encoder and a native session are therefore mutually exclusive on a GPU, and the design is deadlock-free (borrows are non-blocking try-acquires; a session only ever blocks on its own device).
  • DefaultSessionRunner temporarily re-pins the worker thread to the borrowed GPU for the whole encoder node. The encoder loads into and runs on that GPU; its conditioning is stored on the CPU (as encoder nodes already do) and the denoiser picks it up on its own GPU afterward — so the cross-GPU handoff needs no node changes.
  • Per-node opt-in via @invocation(idle_gpu_offloadable=True), mirroring the existing bottleneck ClassVar marker (no API-schema impact). Applied to the text/prompt encoder nodes: compel (+ SDXL/refiner), flux_text_encoder, sd3_text_encoder, qwen_image_text_encoder, anima_text_encoder, cogview4_text_encoder, flux2_klein_text_encoder, z_image_text_encoder, and flux_redux.
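
The locking rules above can be sketched as a small arbiter. This is a minimal illustration, not the code in backend/util/device_pool.py: the class name `DevicePool` and the methods `session` and `borrow_idle` are hypothetical, and the real arbiter may select devices differently.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, Optional


class DevicePool:
    """Hypothetical per-device exclusive-use arbiter (all names are illustrative)."""

    def __init__(self, devices: list[str]) -> None:
        self._devices = list(devices)
        self._locks = {d: threading.Lock() for d in devices}

    @contextmanager
    def session(self, device: str) -> Iterator[None]:
        # A native session blocks until its own GPU is free, then holds it for the run.
        with self._locks[device]:
            yield

    @contextmanager
    def borrow_idle(self, exclude: str) -> Iterator[Optional[str]]:
        # An encoder node tries every other GPU without blocking.
        # Devices are scanned in list order, so the choice is deterministic.
        borrowed = None
        for device in self._devices:
            if device != exclude and self._locks[device].acquire(blocking=False):
                borrowed = device
                break
        try:
            # None means "no idle GPU": the caller runs on the session's own device.
            yield borrowed
        finally:
            if borrowed is not None:
                self._locks[borrowed].release()
```

Because a borrow never waits and a session only ever waits on its own device's lock, no thread can hold one lock while blocking on another, which is the deadlock-freedom argument made above.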

Why the per-device lock

An earlier iteration routed the encoder into the idle GPU's cache without exclusivity. Because two sessions using the same model/prompt resolve to the same encoder cache key, they ended up sharing one model object and running concurrent forward passes + in-place LoRA patching on it — producing garbled images. The per-device lock makes a borrow and a native session mutually exclusive on a GPU, which fixes this; prevent_auto_evict from #9310 is intentionally not ported, so a borrowed encoder yields its GPU's VRAM (via normal LRU) the moment that GPU is claimed for a real session.

Tests

  • tests/backend/util/test_device_pool.py — arbiter lock semantics (borrow exclusion, session/borrow mutual exclusion, startup-race ordering, deterministic selection) plus a multi-threaded regression test asserting a GPU is never used by a session and a borrow at the same time.
  • tests/app/services/session_processor/test_encoder_offload.py — the runner offload context manager (re-pin/restore, no-offload-when-busy, flag-off, restore-on-exception), the idle_gpu_offloadable marker wiring on real nodes, and a two-worker concurrency regression exercising the real offload path.

Docs

  • configuration/invokeai-yaml.mdx — documents offload_text_encoders_to_idle_gpus.
  • development/Guides/creating-nodes.mdx — explains how (and when) a node should set idle_gpu_offloadable=True.

Verification

  • Full backend test suite: 2138 passed / 127 skipped. (One unrelated failure, test_torch_cuda_allocator.py::test_configure_torch_cuda_allocator_configures_backend, requires a working CUDA cudaMallocAsync allocator and fails on a CPU-only box; it touches none of this PR's code.)
  • ruff check and ruff format --check clean; openapi.json / schema.ts regenerated (only the new config field).
  • Manually verified on a dual-GPU machine: single-session offload, parallel sessions with the same model, and parallel sessions with two different models/encoders all produce correct images.

🤖 Generated with Claude Code

weblate and others added 30 commits February 28, 2026 15:09
* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2205 of 2250 strings)

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation files

Updated by "Remove blank strings" hook in Weblate.

Co-authored-by: Hosted Weblate <hosted@weblate.org>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2210 of 2259 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2224 of 2272 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2252 of 2295 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2264 of 2309 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Russian)

Currently translated at 60.7% (1419 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2290 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2319 of 2372 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

---------

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Co-authored-by: DustyShoe <warukeichi@gmail.com>
…hoami() (#8913)

`get_token_permission` is deprecated and will be removed in huggingface_hub 1.0.
Use `whoami()` to validate the token instead, as recommended by the deprecation warning.
Merged Z-Image checkpoints (e.g. models with LoRAs baked in) may bundle
text encoder weights (text_encoders.*) or other non-transformer keys
alongside the transformer weights. These cause load_state_dict() to fail
with strict=True. Instead of disabling strict mode, explicitly whitelist
valid ZImageTransformer2DModel key prefixes and discard everything else.

Also moves RAM allocation after filtering so it doesn't over-allocate
for discarded keys.
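
A minimal sketch of the prefix whitelist described in this commit message, using plain dictionaries in place of tensors. The function name and the example prefixes are hypothetical; the actual ZImageTransformer2DModel prefixes are not listed here.

```python
from typing import Any, Iterable, Mapping


def filter_state_dict(
    state_dict: Mapping[str, Any], allowed_prefixes: Iterable[str]
) -> dict[str, Any]:
    """Keep only keys that start with a whitelisted prefix and drop the rest."""
    prefixes = tuple(allowed_prefixes)  # str.startswith accepts a tuple of prefixes
    return {k: v for k, v in state_dict.items() if k.startswith(prefixes)}


# Hypothetical checkpoint with bundled text-encoder weights.
checkpoint = {
    "layers.0.attn.weight": [0.1],
    "final_layer.bias": [0.2],
    "text_encoders.qwen.embed": [0.3],
}
filtered = filter_state_dict(checkpoint, ["layers.", "final_layer."])
```

Sizing any RAM reservation from `filtered` rather than from `checkpoint` matches the commit's note about not over-allocating for discarded keys.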

Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
…art (#8932)

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
* feat(model_manager): add export/import for model settings

Add the ability to export model settings (default_settings, trigger_phrases,
cpu_only) as JSON and import them back. The model name is used as the
filename for exports.

https://claude.ai/code/session_01LXKjbRjfzcG3d3vzk3xRCh

* fix(ui): reset settings forms after import so updated values display immediately

The useForm defaultValues only apply on mount, so importing model settings
updated the backend but the forms kept showing stale values. Added useEffect
to reset forms when the underlying model config changes. Also fixed lint
errors (strict equality, missing React import).

* fix(ui): harden model settings export/import

Prevent cross-model-type import errors by filtering imported fields
against the target model's supported fields, showing clear warnings
for incompatible or partially compatible settings instead of raw
pydantic validation errors. Also fix falsy checks for empty arrays
and objects in export, disable export button when nothing to export,
add client-side validation and FileReader error handling on import.

* Chore pnpm fix

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
* Fix: Kill the server with one keyboard interrupt (#94)

* Initial plan

* Handle KeyboardInterrupt in run_app to allow single Ctrl+C shutdown

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* Force os._exit(0) on KeyboardInterrupt to avoid hanging on background threads

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

Fix graceful shutdown to wait for download/install worker threads (#102)

* Initial plan

* Replace os._exit(0) with ApiDependencies.shutdown() on KeyboardInterrupt

Instead of immediately force-exiting the process on CTRL+C, call
ApiDependencies.shutdown() to gracefully stop the download and install
manager services, allowing active work to complete or cancel cleanly
before the process exits.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* Make stop() idempotent in download and model install services

When CTRL+C is pressed, uvicorn's graceful shutdown triggers the FastAPI
lifespan which calls ApiDependencies.shutdown(), then a KeyboardInterrupt
propagates from run_until_complete() hitting the except block which tries
to call ApiDependencies.shutdown() a second time.

Change both stop() methods to return silently (instead of raising) when
the service is not running. This handles:
- Double-shutdown: lifespan already stopped the services
- Early interrupt: services were never fully started
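
An idempotent stop() of the kind described above might be sketched as follows. The `Service` class is a hypothetical stand-in for the download and model install services, not their actual code.

```python
import threading


class Service:
    """Hypothetical service whose stop() is safe to call any number of times."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running = False
        self.cleanups = 0  # counts how often real shutdown work ran

    def start(self) -> None:
        with self._lock:
            self._running = True

    def stop(self) -> None:
        with self._lock:
            if not self._running:
                # Already stopped or never started: return silently instead of raising.
                return
            self._running = False
        self.cleanups += 1  # placeholder for joining worker threads, etc.
```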

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

Fix shutdown hang on session processor thread lock (#108)

* Initial plan

* Fix shutdown hang: wake session processor thread on stop() and mark daemon

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* Fix: shut down asyncio executor on KeyboardInterrupt to prevent post-generation hang (#112)

Fix: cancel pending asyncio tasks before loop.close() to suppress destroyed-task warnings
Fix: suppress stack trace when dispatching events after event loop is closed on shutdown
Fix: cancel in-progress generation on stop() to prevent core dump during mid-flight Ctrl+C

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
…ions (#8920)

* Persist selected board and auto-select most recent image across browser sessions (#92)

* Persist selectedBoardId across browser sessions

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* fix(frontend): make appStarted listener async so image auto-selection works on startup

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* chore(frontend): remove unwanted package-lock.json

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
…nation directory (#104) (#8931)

* Initial plan

* Fix race condition in _do_download when scanning for .downloading files



* chore(backend): update copyright

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
* fix(prompt): add more punctuations, fixes attention hotkeys removing them from prompt.

* fix(prompt): improve numeric weighting calculation

* feat(prompts): add numeric attention preference toggle to settings

* feat(prompts): use attention style preference, rewrite to accommodate prompt functions

* fix(prompts): account for weirdness with quotes

account for mismatching quotes, missing quotes and other quote entities

* fix(prompts): add tests, qol improvements, code cleanup

* fix(prompts): test lint

* fix(prompts): remove unused exports

* fix(prompts): separator whitespace serialization

---------

Co-authored-by: joshistoast <me@joshcorbett.com>
Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
The reidentify endpoint overwrote the model's relative path with an
absolute path from the prober, and unconditionally accessed
trigger_phrases which doesn't exist on all config types (e.g. IP
Adapters), causing an AttributeError.

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
)

* perf(flux2): optimize model loading order to prevent cache eviction (fixes #7513)

* Update flux2_klein_text_encoder.py

* Update flux2_klein_text_encoder.py version

---------

Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
…ansformer-only keys (#8938)

LoRAs trained with musubi-tuner (and potentially other trainers) that
only target transformer blocks (double_blocks/single_blocks) without
embedding layers (txt_in/vector_in/context_embedder) were incorrectly
classified as Flux 1. Add fallback detection using attention projection
hidden_size and MLP ratio from transformer block tensors

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2205 of 2250 strings)

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation files

Updated by "Remove blank strings" hook in Weblate.

Co-authored-by: Hosted Weblate <hosted@weblate.org>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2210 of 2259 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2224 of 2272 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2252 of 2295 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2264 of 2309 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Russian)

Currently translated at 60.7% (1419 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2290 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2319 of 2372 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2327 of 2380 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

---------

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Co-authored-by: DustyShoe <warukeichi@gmail.com>
* Added SQL injection tests

* Updated tests after multi-user merge

* ruff:format

---------

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
* Add user management UI for admin and regular users (#106)

* Add user management UI and backend API endpoints

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

Fix user management feedback: cancel/back navigation, system user filter, tooltip fix

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

Make Back button on User Management page more prominent

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* chore(frontend): typegen

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

* Add Confirm Password field to My Profile password change form (#110)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
* fix(gallery): restore arrow-key browsing and extract shared prev/next navigation

* Added same behavior to Upscale mode and autofocus to gallery after using hotkeys Ctrl+Enter and Ctrl+Shift+Enter

* restore arrow navigation focus flow across viewer states

* fix(gallery): stabilize arrow-key browsing, remove viewer UI flicker, and optimize code

---------

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
* docs(multiuser): update multiuser mode documentation

* Update docs/multiuser/user_guide.md

Co-authored-by: dunkeroni <dunkeroni@gmail.com>

* Update docs/multiuser/user_guide.md

Co-authored-by: dunkeroni <dunkeroni@gmail.com>

* Update docs/multiuser/user_guide.md

Co-authored-by: dunkeroni <dunkeroni@gmail.com>

* slight wording change

* add info about the host interface binding option

---------

Co-authored-by: dunkeroni <dunkeroni@gmail.com>
Co-authored-by: Contributor <contributor@example.com>
* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2205 of 2250 strings)

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation files

Updated by "Remove blank strings" hook in Weblate.

Co-authored-by: Hosted Weblate <hosted@weblate.org>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2210 of 2259 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2224 of 2272 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2252 of 2295 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2264 of 2309 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Russian)

Currently translated at 60.7% (1419 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2290 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2319 of 2372 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2327 of 2380 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2328 of 2382 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.5% (2370 of 2429 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

---------

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Co-authored-by: DustyShoe <warukeichi@gmail.com>
* feat: add strict_password_checking config option to relax password requirements

- Add `strict_password_checking: bool = Field(default=False)` to InvokeAIAppConfig
- Add `get_password_strength()` function to password_utils.py (returns weak/moderate/strong)
- Add `strict_password_checking` field to SetupStatusResponse API endpoint
- Update users_base.py and users_default.py to accept `strict_password_checking` param
- Update auth.py router to pass config.strict_password_checking to all user service calls
- Create shared frontend utility passwordUtils.ts for password strength validation
- Update AdministratorSetup, UserProfile, UserManagement components to:
  - Fetch strict_password_checking from setup status endpoint
  - Show colored strength indicators (red/yellow/blue) in non-strict mode
  - Allow any non-empty password in non-strict mode
  - Maintain strict validation behavior when strict_password_checking=True
- Update SetupStatusResponse type in auth.ts endpoint
- Add passwordStrength and passwordHelperRelaxed translation keys to en.json
- Add tests for new get_password_strength() function

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* Changes before error encountered

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* chore(backend): docstrings

* chore(frontend): typegen

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
…ory (#8954)

When deleting a file-based model (e.g. LoRA), the previous logic used
rmtree on the parent directory, which would delete all files in that
folder — even unrelated ones. Now only the specific model file is
removed, and the parent directory is cleaned up only if empty afterward.
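
The safer deletion described here can be sketched with pathlib. The function name and the `models_root` guard (never remove the top-level models directory) are assumptions added for illustration.

```python
from pathlib import Path


def delete_model_file(model_path: Path, models_root: Path) -> None:
    """Remove one model file, then remove its parent directory only if it is empty."""
    model_path.unlink()
    parent = model_path.parent
    # Assumed guard: leave the models root in place even when it becomes empty.
    if parent != models_root and not any(parent.iterdir()):
        parent.rmdir()
```

Unlike `shutil.rmtree` on the parent, `Path.rmdir` only succeeds on an empty directory, so unrelated files in the same folder are left alone.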
…dels (#8960)

* fix(ui): resolve models by name+base+type when recalling metadata for reinstalled models

When a model (IP Adapter, ControlNet, etc.) is deleted and reinstalled,
it gets a new UUID key. Previously, metadata recall would fail because
it only looked up models by their stored UUID key. Now the recall falls
back to searching by name+base+type, allowing reinstalled models with
the same name to be correctly resolved.

https://claude.ai/code/session_01XYubzMK363BXGTvfJJqFnX

* Add hash-based model recall fallback for reinstalled models

When a model is deleted and reinstalled, it gets a new UUID key but
retains the same BLAKE3 content hash. This adds hash as a middle
fallback stage in model resolution (key → hash → name+base+type),
making recall more robust.
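
The three-stage fallback (key → hash → name+base+type) can be illustrated in Python as below. The real logic lives in the frontend's resolveModel and parseModelIdentifier; the `ModelConfig` dataclass and `resolve_model` signature here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelConfig:
    key: str   # UUID assigned at install time; changes on reinstall
    hash: str  # content hash; stable across reinstalls
    name: str
    base: str
    type: str


def resolve_model(installed: Sequence[ModelConfig], wanted: ModelConfig) -> Optional[ModelConfig]:
    """Try the stored key first, then the content hash, then name+base+type."""
    for m in installed:
        if m.key == wanted.key:
            return m
    for m in installed:
        if m.hash == wanted.hash:
            return m
    for m in installed:
        if (m.name, m.base, m.type) == (wanted.name, wanted.base, wanted.type):
            return m
    return None
```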

Changes:
- Add /api/v2/models/get_by_hash backend endpoint (uses existing
  search_by_hash from model records store)
- Add getModelConfigByHash RTK Query endpoint in frontend
- Add hash fallback to both resolveModel and parseModelIdentifier

https://claude.ai/code/session_01XYubzMK363BXGTvfJJqFnX

* Chore pnpm fix

* Chore typegen

---------

Co-authored-by: Claude <noreply@anthropic.com>
* Repair partially loaded Qwen models after cancel to avoid device mismatches

* ruff

* Repair CogView4 text encoder after canceled partial loads

* Avoid MPS CI crash in repair regression test

* Fix MPS device assertion in repair test
* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2205 of 2250 strings)

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation files

Updated by "Remove blank strings" hook in Weblate.

Co-authored-by: Hosted Weblate <hosted@weblate.org>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2210 of 2259 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2224 of 2272 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2252 of 2295 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2264 of 2309 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Russian)

Currently translated at 60.7% (1419 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2290 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2319 of 2372 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2327 of 2380 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2328 of 2382 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.5% (2370 of 2429 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Finnish)

Currently translated at 1.5% (37 of 2429 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/fi/

* translationBot(ui): update translation (Italian)

Currently translated at 97.5% (2373 of 2433 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

---------

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Co-authored-by: DustyShoe <warukeichi@gmail.com>
Co-authored-by: Ilmari Laakkonen <ilmarille@gmail.com>
* change submenu icon to phosphor

* Use PiIntersectSquareBold
* chore: bump version to 6.12.0

* chore: update What's New text
* Add chained collect node

* test(frontend): align parseSchema fixtures with collect v1.1 and normalize undefined fields in assertions

* fix(nodes): block collect-to-collect links when inferred item types differ

---------

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2205 of 2250 strings)

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation files

Updated by "Remove blank strings" hook in Weblate.

Co-authored-by: Hosted Weblate <hosted@weblate.org>
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/
Translation: InvokeAI/Web UI

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2210 of 2259 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.8% (2224 of 2272 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2252 of 2295 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 98.0% (2264 of 2309 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Russian)

Currently translated at 60.7% (1419 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/

* translationBot(ui): update translation (Italian)

Currently translated at 98.1% (2290 of 2334 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2319 of 2372 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2327 of 2380 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.7% (2328 of 2382 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Italian)

Currently translated at 97.5% (2370 of 2429 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Finnish)

Currently translated at 1.5% (37 of 2429 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/fi/

* translationBot(ui): update translation (Italian)

Currently translated at 97.5% (2373 of 2433 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

* translationBot(ui): update translation (Japanese)

Currently translated at 87.1% (2120 of 2433 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ja/

* translationBot(ui): update translation (Italian)

Currently translated at 97.5% (2374 of 2433 strings)

Translation: InvokeAI/Web UI
Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/

---------

Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com>
Co-authored-by: DustyShoe <warukeichi@gmail.com>
Co-authored-by: Ilmari Laakkonen <ilmarille@gmail.com>
Co-authored-by: 嶋田豪介 <shimada_gosuke@cyberagent.co.jp>
lstein and others added 13 commits June 25, 2026 21:40
- Apply ruff 0.11.2 formatting to the files flagged by `ruff format --check`.
- The new fail-fast guard in get_generation_devices() (reject a CUDA device that
  doesn't exist) made the pre-existing test_get_generation_devices_explicit_list_is_deduplicated
  fail on CPU-only CI runners, since it passes a cuda list with no CUDA present. Mock
  torch.cuda.is_available/device_count in that test (matching the existing pattern in this
  file) so it validates dedup on any runner.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three RAM fixes for multi-GPU (and one that helps single-GPU too), addressing
transient spikes to ~100% RAM and swapping during text-encode/transformer loads:

1. Cap the global RAM-cache budget at a safe fraction of system RAM. When
   max_cache_ram_gb is unset, the budget was the *sum* of the per-device cache
   heuristics, so N GPUs each claiming ~50% of RAM summed to ~N*50% and starved
   the OS. Now clamp the sum to ModelCache.calc_system_ram_headroom_bytes()
   (50% of RAM - 2GB baseline, floored at 4GB). Promote the sizing magic numbers
   to named constants shared by the per-device heuristic and the global cap.

2. Adopt already-resident CPU weights across devices at load time. When a second
   device loads a model another device already holds, deep-copy a registered
   meta-weight structural clone and assign the shared canonical weights, instead
   of re-reading the model from disk and materializing a full transient second
   copy. Loader-agnostic (one mechanism in ModelLoader, no per-loader code):
   works for diffusers, single-file checkpoint, GGUF and transformers models,
   and preserves registered hooks (e.g. fp8 layerwise-cast). Best-effort with a
   meta-tensor self-check and fallback to a normal disk load on any failure.
   Skipped on single-device installs.

3. Dequantize FLUX.2 FP8 checkpoints straight to bf16. _dequantize_fp8_weights
   materialized the whole model in float32 (~36GB for 9B) before a later cast to
   bf16; now the multiply is done in float32 but stored bf16 per-weight, so the
   model is never held in float32. Numerically identical; halves the cold-load
   transient (helps single-GPU too).
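
The clamp in fix 1 can be sketched as follows. This is a minimal illustration, not the code in ModelCache: the constant and function names are hypothetical, GB is taken to mean 2**30 bytes, and the formula is read as "50% of system RAM minus a 2GB baseline, never below 4GB".

```python
GIB = 2**30

# Hypothetical names: the commit only says the sizing numbers became named constants.
RAM_CACHE_FRACTION = 0.5
RAM_BASELINE_BYTES = 2 * GIB
RAM_CACHE_FLOOR_BYTES = 4 * GIB


def system_ram_headroom_bytes(total_ram_bytes: int) -> int:
    """50% of system RAM minus a 2 GB baseline, never below 4 GB (assumed reading)."""
    headroom = int(total_ram_bytes * RAM_CACHE_FRACTION) - RAM_BASELINE_BYTES
    return max(headroom, RAM_CACHE_FLOOR_BYTES)


def global_ram_cache_budget(per_device_budgets: list[int], total_ram_bytes: int) -> int:
    """Clamp the summed per-device budgets to the system-wide headroom."""
    return min(sum(per_device_budgets), system_ram_headroom_bytes(total_ram_bytes))


# Three GPUs each asking for 30 GiB on a 64 GiB machine would sum to 90 GiB;
# the clamp brings the global budget back to the 30 GiB headroom.
budget = global_ram_cache_budget([30 * GIB] * 3, 64 * GIB)
```

With a single device whose heuristic already sits below the headroom, `min()` leaves the budget untouched, so the cap only matters when the per-device sum overshoots.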

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Qwen Image VAE encode/decode invocations called model_on_device() without a
working-memory estimate, unlike every other VAE family (SD/SDXL/SD3/CogView4/FLUX).
So the model cache reserved only its small default working memory, never offloaded
a large resident transformer (the VAE weights themselves are tiny), and the VAE's
forward-pass activations then OOM'd VRAM — e.g. a ~40GB Qwen Image Edit transformer
left ~1GB free while decode needed ~5GB. Reproduces single-GPU; unrelated to the
multi-GPU RAM work.

Add estimate_vae_working_memory_qwen_image() (same per-output-pixel scaling as the
other estimators, handling the 5D Qwen latents) and pass it from both the i2l
(encode, used for reference images in Image Edit) and l2i (decode) nodes, so the
cache offloads the transformer before the VAE runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The FLUX.2 VAE encoder's mid-block self-attention scales quadratically with the
input's spatial size, and on ROCm scaled_dot_product_attention falls back to a
materialized attention matrix. Encoding a reference image (kontext) at full size
therefore allocated ~15GB in a single attention call at 1024px — and hundreds of
GB at the 2024px reference cap — OOMing VRAM regardless of how much other model
memory was freed.

Tile the reference-image encode to bound per-tile attention. The VAE's default
tile size equals its sample_size (1024), whose per-tile attention still OOMs, so
force a 512px tile (with a matching latent tile size derived from the config).
Save/restore the VAE's tiling config since it is a shared, cached instance, so the
final image decode does not inherit these settings.
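
The save/restore pattern described above might look roughly like this. It is a sketch under assumptions: `FakeVae` stands in for the shared cached VAE, the attribute names and the downscale factor of 8 are illustrative rather than taken from the FLUX.2 VAE, and `temporary_tiling` is a hypothetical helper name.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class FakeVae:
    """Stand-in for a shared, cached VAE instance (attribute names are assumed)."""

    use_tiling: bool = False
    tile_sample_min_size: int = 1024
    tile_latent_min_size: int = 128


@contextmanager
def temporary_tiling(vae, tile_px: int = 512, downscale: int = 8):
    """Force a smaller tile for one encode, then put the old settings back."""
    saved = (vae.use_tiling, vae.tile_sample_min_size, vae.tile_latent_min_size)
    vae.use_tiling = True
    vae.tile_sample_min_size = tile_px
    # Latent tile size derived from the pixel tile size and the VAE downscale factor.
    vae.tile_latent_min_size = tile_px // downscale
    try:
        yield vae
    finally:
        # Restore even if the encode raises, so later decodes see the original config.
        vae.use_tiling, vae.tile_sample_min_size, vae.tile_latent_min_size = saved


vae = FakeVae()
with temporary_tiling(vae) as tiled:
    inside = (tiled.use_tiling, tiled.tile_sample_min_size, tiled.tile_latent_min_size)
```

Restoring in a `finally` block matters here because the instance is shared: an exception during the reference-image encode would otherwise leave the 512px tiling active for the final decode.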

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ModelCache._get_vram_in_use() called torch.cuda.memory_allocated() with no device
argument, while _get_vram_available() reads memory_allocated(execution_device).
The formula relies on those two canceling. In multi-GPU mode each worker calls
torch.cuda.set_device for its own GPU, so the process-current device flips between
workers; the no-argument call can then read a different (e.g. idle) GPU's
allocation, breaking the cancellation and inflating "available" VRAM toward the
card total. The cache then believes there is room and never offloads, so VRAM
offloading effectively ignores device_working_mem_gb in multi-GPU. Single-GPU was
unaffected (current device always equals the execution device).

Query self._execution_device in both _get_vram_in_use() and the cache-state debug
log. Add a regression test asserting the per-cache execution device is used.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… decode peak

The Qwen Image VAE is a 3D-conv (video) VAE whose decode allocates large conv3d
feature maps. A ~1MP decode was measured to peak at ~17 GiB of VRAM — far above
what the generic 2200/1100 SD/FLUX constants reserved (~4.6 GiB), so the cache
concluded the decode "fit" alongside the resident 20GB transformer + 15GB text
encoder, never offloaded them, and OOMed. The offload only frees ~(working_mem -
free) bytes, so the reservation must both cover the real peak and be large enough
to trigger the offload of models the decode doesn't need.

Raise the Qwen decode/encode constants (13000/6500) to match the measured peak.
It's linear in output pixels, so it over-reserves past ~1.5MP (where the decode
can exceed the card even after offloading) — that case is covered by
force_tiled_decode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Qwen Image latents-to-image node hardcoded vae.disable_tiling(), ignoring the
global force_tiled_decode setting that the SD/SDXL l2i node honors. Wire it up the
same way so users can opt into tiled VAE decode for very large outputs that exceed
VRAM even after the transformer/text encoder are offloaded. Off by default, so
normal-size decodes are unchanged (full-frame, no tile blending).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The preview-panel progress circle re-renders on every InvocationProgressEvent. The
parent passes a fresh progressEvent object each event, so the CircularProgress
re-rendered constantly; during the indeterminate phases (everything except
denoising) that restarted its CSS spin animation each time, which looked like the
disk flashing. (Determinate denoising was unaffected because the value genuinely
changes per step.)

Split the circle into a memoized, ref-forwarding subcomponent keyed on its visual
props (isIndeterminate, value, device label) so message-only updates no longer
re-render it and the spin animation stays continuous. The Tooltip still anchors to
it via the forwarded ref.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cket double-emit) (#9288)

* feat(api): add append mode to recall reference images

POST /api/v1/recall/{queue_id}?append=true now asks the frontend to add
the recalled reference images (ip_adapters and model-free
reference_images) to its existing list instead of replacing it. The flag
rides inside the event's parameters dict so the generated client schema
needs no regeneration, and is injected after the persistence loop so it
is never stored as a recall parameter. Mutually exclusive with strict.

The frontend dispatches refImagesRecalled with replace:false in append
mode, and skips the dispatch entirely when nothing resolved so a failed
append can never clear the user's current reference images.
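
The backend ordering described above (persist first, inject the flag afterwards) can be illustrated with a small sketch. The function name, the `store` dict, and the use of `ValueError` are all assumptions made for illustration; the real endpoint responds with an HTTP 400 when both flags are set.

```python
def build_recall_event(parameters: dict, store: dict, *, strict: bool = False, append: bool = False) -> dict:
    """Persist recall parameters, then build the event payload (hypothetical helper)."""
    if strict and append:
        # Stand-in for the HTTP 400 the API returns for strict+append.
        raise ValueError("strict and append are mutually exclusive")

    # Persistence loop: only genuine recall parameters are stored.
    for key, value in parameters.items():
        store[key] = value

    # The flag is added to a copy after persistence, so it rides in the event only.
    event_parameters = dict(parameters)
    if append:
        event_parameters["append"] = True
    return event_parameters


store: dict = {}
event = build_recall_event({"reference_images": ["a.png"]}, store, append=True)
```

Because the flag is added to a copy after the loop, `store` never contains an `append` key, which is the property the commit message calls out.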

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(sockets): emit recall event once to owner+admin room union

RecallParametersUpdatedEvent was emitted in two separate socket.io
calls — one to the owner's user room, one to the admin room. A socket
that belongs to both (the "system" user in single-user mode is also an
admin, so it joins user:system AND admin) received the event twice.

That double delivery was invisible for the scalar/replace recall fields,
which are idempotent, but the append-mode reference-image recall pushes
rather than replaces — so each append showed up as two copies of the
same reference image in the InvokeAI canvas.

Emit once to the room union [user_room, "admin"] instead. python-socketio
deduplicates recipients across a room list, so a socket in both rooms is
delivered to exactly once, while genuinely distinct owner/admin sockets
still each receive it.
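
The difference between two emits and one emit to a room union can be modelled in a few lines. This is a toy model of room delivery, not python-socketio code: `deliveries` is a hypothetical helper, and the socket ids are made up. The room names follow the commit message.

```python
def deliveries(emits: list[list[str]], members: dict[str, set[str]]) -> dict[str, int]:
    """Count how many times each socket receives an event.

    `emits` is a list of emit calls, each addressed to a list of rooms.
    Within a single call, a socket that is in several target rooms is
    delivered to once (recipients are deduplicated per call).
    """
    counts: dict[str, int] = {}
    for rooms in emits:
        recipients: set[str] = set()
        for room in rooms:
            recipients |= members.get(room, set())
        for sid in recipients:
            counts[sid] = counts.get(sid, 0) + 1
    return counts


# In single-user mode the "system" user's socket is in both rooms.
members = {"user:system": {"sid-1"}, "admin": {"sid-1", "sid-2"}}

two_calls = deliveries([["user:system"], ["admin"]], members)  # old behavior
one_call = deliveries([["user:system", "admin"]], members)  # room union
```

In this model `sid-1` is counted twice under the old two-call scheme and once under the union, while the admin-only socket `sid-2` receives the event once either way.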

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(api): regenerate openapi.json + schema.ts for recall append param

Rebuilds the committed OpenAPI schema and generated TypeScript types so the
update_recall_parameters operation advertises the new append query
parameter. Generated via 'make frontend-openapi' / 'frontend-typegen'
equivalent; the only change is the added append param + its docstring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: document append query parameter for recall API

Documents the new append=true query parameter on POST /api/v1/recall/{queue_id}:
- new Query parameters subsection covering strict and append
- mutual exclusivity (strict+append -> 400) with error body
- append-mode cURL example
- updated WebSocket Events + frontend log sample for the merged reference-image list

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
* fix metadata overrides with empty string values

* chore(backend): ruff

---------

Co-authored-by: wunianze666-netizen <wunianze666@gmail.com>
Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
* fix(z-image): repair regional guidance forward after diffusers refactor

Z-Image Regional Guidance crashed with "split_with_sizes expects
split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The
regional-prompting patch was a hand-copied snapshot of an outdated
ZImageTransformer2DModel.forward. The installed diffusers version
changed _pad_with_ids so caption pos_ids are now longer than the
caption feature tensor, while the stale patch split RoPE embeddings by
feature lengths instead of pos_ids lengths.

Rewrite create_regional_forward to delegate to the model's own helpers
(patchify_and_embed, _prepare_sequence, _build_unified_sequence) and
only override the main-layer attention mask to inject the regional
mask. This keeps the patch in sync with upstream diffusers and stops
re-implementing the drift-prone patchify/RoPE/padding logic.

* fix(z-image): repair & realign regional guidance after diffusers refactor

Z-Image Regional Guidance crashed with "split_with_sizes expects
split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The
regional-prompting patch was a hand-copied snapshot of an outdated
ZImageTransformer2DModel.forward; the installed diffusers version
changed _pad_with_ids so caption pos_ids are longer than the caption
feature tensor, while the stale patch split RoPE embeddings by feature
lengths instead of pos_ids lengths.

Rewrite create_regional_forward to delegate to the model's own helpers
(patchify_and_embed, _prepare_sequence, _build_unified_sequence) so it
stays in sync with upstream diffusers, and only override the main-layer
attention mask.

Also fix two reasons regional guidance had no visible effect:
- Mask alignment: the unified sequence pads the image and caption
  blocks individually to a multiple of 32, so the real layout is
  [img_real | img_pad | txt_real | txt_pad]. Scatter the four regional
  sub-blocks into their padding-aware positions instead of assuming a
  contiguous top-left block (which only matched square 1024x1024).
- CFG pass: the patched forward also runs for the negative prompt; only
  apply the regional mask to passes whose caption length matches the
  positive prompt, otherwise fall back to the plain padding mask.
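
The padding-aware scatter from the mask-alignment fix can be sketched with plain lists. The assumptions here are loud ones: the regional mask is taken to be a square boolean matrix ordered [img_real | txt_real], padding positions are simply left False, and the helper names are hypothetical. The real patch works on tensors and combines the result with the padding mask.

```python
def pad_to(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple."""
    return -(-n // multiple) * multiple


def unified_positions(img_len: int, txt_len: int, multiple: int = 32) -> tuple[list[int], int]:
    """Positions of the real tokens in [img_real | img_pad | txt_real | txt_pad]."""
    img_padded = pad_to(img_len, multiple)
    txt_padded = pad_to(txt_len, multiple)
    positions = list(range(img_len)) + list(range(img_padded, img_padded + txt_len))
    return positions, img_padded + txt_padded


def scatter_regional_mask(regional: list[list[bool]], img_len: int, txt_len: int, multiple: int = 32):
    """Place the four regional sub-blocks at their padding-aware positions."""
    positions, total = unified_positions(img_len, txt_len, multiple)
    full = [[False] * total for _ in range(total)]
    for src_row, dst_row in enumerate(positions):
        for src_col, dst_col in enumerate(positions):
            full[dst_row][dst_col] = regional[src_row][src_col]
    return full


# Tiny example: 3 image tokens and 2 caption tokens, padded to multiples of 4.
regional = [[True] * 5 for _ in range(5)]
full = scatter_regional_mask(regional, img_len=3, txt_len=2, multiple=4)
```

When `img_len` is already a multiple of the padding size, the image padding block is empty and the layout collapses to the contiguous case the old code assumed.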

* Chore Ruff + Typegen

* fix(z-image): use identity to gate regional mask onto the positive pass

The regional attention patch ran for both the conditioned and negative/CFG
forward passes and distinguished them by comparing the padded caption length
against the positive prompt's expected length. Two short prompts that round up
to the same multiple of 32 collided, so the positive regional mask could be
injected into the unconditional prediction and silently corrupt CFG.

Discriminate the conditioned pass by tensor identity (cap_feats is the exact
positive_cap_feats the mask was built for) instead of a length heuristic, so
the positive and negative passes can never be confused. The context manager now
requires positive_cap_feats whenever a regional mask is provided, turning the
previously inferred invariant into an enforced one rather than a silent no-op.
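
The collision and the identity check can be shown side by side. This is an illustration only: the lists stand in for caption feature tensors, and both function names are hypothetical.

```python
def pad_to(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple."""
    return -(-n // multiple) * multiple


def is_positive_pass_by_length(cap_len: int, positive_len: int) -> bool:
    """Old heuristic: compare padded caption lengths."""
    return pad_to(cap_len) == pad_to(positive_len)


def is_positive_pass_by_identity(cap_feats, positive_cap_feats) -> bool:
    """New check: the pass is conditioned only if it received the exact same object."""
    return cap_feats is positive_cap_feats


# Two short prompts of different lengths both round up to 32 tokens.
collides = is_positive_pass_by_length(cap_len=10, positive_len=20)

positive_cap_feats = [0.1, 0.2, 0.3]
negative_cap_feats = [0.1, 0.2, 0.3]  # equal contents, different object
```

Identity comparison cannot be fooled by equal lengths or even equal values, which is why it turns the invariant into something enforced rather than inferred.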

Also build the (bsz, 1, S, S) float mask lazily: compute applied_regional from
cheap scalar checks first and skip materializing/cloning the full mask on passes
that never match (every negative pass), avoiding a ~33 MB bf16 clone per call.

---------

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
Adds `offload_text_encoders_to_idle_gpus` (default on): when more than one
generation device is configured and a GPU is idle, a session's text/prompt
encoder runs on the idle GPU instead of the one running its denoise pipeline.
This avoids evicting the denoise model from VRAM to make room for the encoder,
and lets a cached encoder be reused across generations. Under full load (no
idle GPU) behavior is unchanged.

Mechanism:
- New GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a
  per-device exclusive-use lock. A native session blocking-acquires its own
  device's lock for the whole run; an encoder node try-borrows an idle device's
  lock for the duration of the node. This makes a borrowed encoder and a native
  session mutually exclusive on a GPU -- preventing the shared-encoder
  corruption that produced garbled images -- and is deadlock-free (borrows are
  non-blocking; a session only ever blocks on its own device).
- DefaultSessionRunner re-pins the worker thread to the borrowed device for the
  whole encoder node; conditioning is stored on the CPU and the denoiser picks
  it up on its own GPU afterward.
- Nodes opt in via @invocation(idle_gpu_offloadable=True), mirroring the
  existing `bottleneck` ClassVar marker. Applied to the text/prompt encoder
  nodes (compel + sdxl/refiner, flux, sd3, qwen-image, anima, cogview4, flux2
  klein, z-image, flux_redux).
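
The locking scheme in the first bullet can be sketched with one `threading.Lock` per device. This is a simplified model under assumptions: `DevicePool`, `session`, and `borrow_idle` are hypothetical names, not the API of backend/util/device_pool.py, and thread re-pinning is left out.

```python
import threading
from contextlib import contextmanager


class DevicePool:
    """Toy arbiter: one exclusive-use lock per generation device."""

    def __init__(self, devices: list[str]) -> None:
        self._locks = {device: threading.Lock() for device in devices}

    @contextmanager
    def session(self, device: str):
        """A native session blocks on its own device's lock for the whole run."""
        with self._locks[device]:
            yield device

    @contextmanager
    def borrow_idle(self, own_device: str):
        """Try to borrow another device without blocking; yield None if none is idle."""
        borrowed = None
        for device, lock in self._locks.items():
            if device != own_device and lock.acquire(blocking=False):
                borrowed = device
                break
        try:
            yield borrowed
        finally:
            if borrowed is not None:
                self._locks[borrowed].release()

    def all_idle(self) -> bool:
        return not any(lock.locked() for lock in self._locks.values())


pool = DevicePool(["cuda:0", "cuda:1"])
with pool.session("cuda:0"):
    with pool.borrow_idle("cuda:0") as idle:
        borrowed_when_idle = idle  # cuda:1 is free, so the encoder can run there
    with pool.session("cuda:1"):
        with pool.borrow_idle("cuda:0") as idle:
            borrowed_under_load = idle  # no idle GPU: fall back to the session's own device
```

Because a borrow never waits, the only blocking acquire in the model is a session waiting for its own device, so no cycle of waiters can form.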

Inspired by #9310; supersedes it.

Tests: device-pool lock semantics, two concurrency regression tests asserting a
session and a borrow never use a GPU at the same time, the runner offload
context-manager behavior, and a marker-wiring check.

Docs: invokeai-yaml.mdx (config setting) and creating-nodes.mdx (how to support
the feature in a node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions bot added the CI-CD, docker, api, python, Root, invocations, backend, services, frontend-deps, frontend, installer, and docs labels Jun 28, 2026
@lstein

lstein commented Jun 28, 2026

Collaborator Author

Closing: this was opened against a stale same-named branch on the upstream repo, producing an incorrect diff. #9263's actual head lives on the fork (lstein/InvokeAI). The correctly-based, properly-stacked PR is lstein#137.

@lstein lstein closed this Jun 28, 2026
@lstein lstein deleted the lstein/feat/multi-gpu-use-idle branch June 28, 2026 23:34