feat(multi-gpu): offload text encoders to idle GPUs by lstein · Pull Request #9311 · invoke-ai/InvokeAI

lstein · 2026-06-28T23:29:49Z

⚠️ Merge order

This PR depends on #9263 and targets its branch (lstein/feat/multi-gpu), not main. It should be reviewed and merged only after #9263 has been reviewed, accepted, and merged. Once #9263 lands, this PR's base can be retargeted to main.

It was inspired by #9310 (split-GPU text encoder) and supersedes that PR. It implements the same idea, reworked to compose with the multi-GPU parallel-generation architecture from #9263 and to reuse that branch's existing per-device caches, device-aware VRAM accounting, and shared CPU-weights store instead of re-adding them.

Summary

On a multi-GPU machine, #9263 runs one generation session per GPU. When fewer sessions are running than there are GPUs, the spare GPUs sit idle. This PR uses that idle capacity: a session's text/prompt encoder runs on a currently-idle GPU instead of the GPU running its denoise pipeline.

Avoids evicting the denoise model from VRAM just to make room for the encoder.
Lets a cached encoder be reused across generations, making repeated single-session generations noticeably smoother.
Purely a placement optimization — generated images are unchanged.

Controlled by a new offload_text_encoders_to_idle_gpus setting (default on). With a single device, or under full multi-GPU load (no idle GPU), encoders run on the session's own GPU exactly as before.
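The placement rule in this paragraph can be sketched as a small pure function. This is an illustration under stated assumptions, not the PR's code: `pick_encoder_device`, its parameters, and the plain-string device names are hypothetical. Only the setting name `offload_text_encoders_to_idle_gpus` comes from this PR.

```python
from typing import Sequence, Set


def pick_encoder_device(
    own_device: str,
    all_devices: Sequence[str],
    busy_devices: Set[str],
    offload_text_encoders_to_idle_gpus: bool = True,
) -> str:
    """Choose where a text-encoder node runs (hypothetical helper)."""
    # Setting off, or a single-device install: behave exactly as before.
    if not offload_text_encoders_to_idle_gpus or len(all_devices) <= 1:
        return own_device
    # Deterministic selection: the first idle device, in configured order,
    # that is not the session's own GPU.
    for device in all_devices:
        if device != own_device and device not in busy_devices:
            return device
    # Full multi-GPU load (no idle GPU): run on the session's own device.
    return own_device
```

For example, a session on `cuda:0` with `cuda:1` idle gets `cuda:1`; once both GPUs are busy, it gets `cuda:0` back.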

How it works

GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a per-device exclusive-use lock. A native session blocking-acquires its own GPU's lock for the whole run; an encoder node try-borrows an idle GPU's lock for the duration of that node. A borrowed encoder and a native session are therefore mutually exclusive on a GPU, and the design is deadlock-free (borrows are non-blocking try-acquires; a session only ever blocks on its own device).
DefaultSessionRunner temporarily re-pins the worker thread to the borrowed GPU for the whole encoder node. The encoder loads into and runs on that GPU; its conditioning is stored on the CPU (as encoder nodes already do) and the denoiser picks it up on its own GPU afterward — so the cross-GPU handoff needs no node changes.
Per-node opt-in via @invocation(idle_gpu_offloadable=True), mirroring the existing bottleneck ClassVar marker (no API-schema impact). Applied to the text/prompt encoder nodes: compel (+ SDXL/refiner), flux_text_encoder, sd3_text_encoder, qwen_image_text_encoder, anima_text_encoder, cogview4_text_encoder, flux2_klein_text_encoder, z_image_text_encoder, and flux_redux.
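The three pieces above (a per-device lock, a blocking session acquire, a non-blocking borrow that re-pins the worker for one node) can be sketched with `threading.Lock` and `contextlib.contextmanager`. This is a simplified model, not the contents of `device_pool.py` or `DefaultSessionRunner`: `DevicePool`, `Worker`, `PromptEncoderNode` and `offload_node` are hypothetical names, and the `@invocation(idle_gpu_offloadable=True)` marker is modeled as a plain `ClassVar`.

```python
import threading
from contextlib import contextmanager
from typing import ClassVar, Iterator, Optional, Sequence


class DevicePool:
    """Per-device exclusive-use arbiter (hypothetical stand-in)."""

    def __init__(self, devices: Sequence[str]) -> None:
        self._order = list(devices)  # fixed order -> deterministic borrow choice
        self._locks = {d: threading.Lock() for d in self._order}

    @contextmanager
    def session(self, device: str) -> Iterator[str]:
        # A native session blocks, but only ever on its own device's lock.
        with self._locks[device]:
            yield device

    def try_borrow(self, exclude: str) -> Optional[str]:
        # Borrows never block, so sessions and borrowers cannot deadlock.
        for device in self._order:
            if device != exclude and self._locks[device].acquire(blocking=False):
                return device
        return None

    def release(self, device: str) -> None:
        self._locks[device].release()


class Worker:
    """Minimal stand-in for a session-runner thread pinned to one device."""

    def __init__(self, device: str) -> None:
        self.current_device = device


class PromptEncoderNode:
    # Models @invocation(idle_gpu_offloadable=True) as a class-level marker.
    idle_gpu_offloadable: ClassVar[bool] = True


@contextmanager
def offload_node(pool: DevicePool, worker: Worker, node: object, enabled: bool = True) -> Iterator[str]:
    """Re-pin `worker` to a borrowed idle GPU for one node, then restore it."""
    home = worker.current_device
    borrowed = None
    if enabled and getattr(type(node), "idle_gpu_offloadable", False):
        borrowed = pool.try_borrow(exclude=home)
    if borrowed is None:
        yield home  # flag off, node not marked, or no idle GPU: run in place
        return
    worker.current_device = borrowed
    try:
        yield borrowed
    finally:
        worker.current_device = home  # restored even if the node raises
        pool.release(borrowed)
```

While the encoder holds the borrowed GPU's lock, a native session that wants that GPU waits in `pool.session(...)` until the node finishes, which is the mutual exclusion described above.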

Why the per-device lock

An earlier iteration routed the encoder into the idle GPU's cache without exclusivity. Because two sessions using the same model/prompt resolve to the same encoder cache key, they ended up sharing one model object and running concurrent forward passes and in-place LoRA patching on it, which produced garbled images. The per-device lock fixes this by making a borrow and a native session mutually exclusive on a GPU. #9310's prevent_auto_evict is intentionally not ported, so a borrowed encoder yields its GPU's VRAM (via normal LRU eviction) as soon as that GPU is claimed for a real session.

Tests

tests/backend/util/test_device_pool.py — arbiter lock semantics (borrow exclusion, session/borrow mutual exclusion, startup-race ordering, deterministic selection) plus a multi-threaded regression test asserting a GPU is never used by a session and a borrow at the same time.
tests/app/services/session_processor/test_encoder_offload.py — the runner offload context manager (re-pin/restore, no-offload-when-busy, flag-off, restore-on-exception), the idle_gpu_offloadable marker wiring on real nodes, and a two-worker concurrency regression exercising the real offload path.

Docs

configuration/invokeai-yaml.mdx — documents offload_text_encoders_to_idle_gpus.
development/Guides/creating-nodes.mdx — explains how (and when) a node should set idle_gpu_offloadable=True.

Verification

Full backend test suite: 2138 passed / 127 skipped. (One unrelated failure, test_torch_cuda_allocator.py::test_configure_torch_cuda_allocator_configures_backend, requires a working CUDA cudaMallocAsync allocator and fails on a CPU-only box; it touches none of this PR's code.)
ruff check and ruff format --check clean; openapi.json / schema.ts regenerated (only the new config field).
Manually verified on a dual-GPU machine: single-session offload, parallel sessions with the same model, and parallel sessions with two different models/encoders all produce correct images.

🤖 Generated with Claude Code

* translationBot(ui): update translation (Italian) Currently translated at 98.0% (2205 of 2250 strings) Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ Translation: InvokeAI/Web UI * translationBot(ui): update translation files Updated by "Remove blank strings" hook in Weblate. Co-authored-by: Hosted Weblate <hosted@weblate.org> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ Translation: InvokeAI/Web UI * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2210 of 2259 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2224 of 2272 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2252 of 2295 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.0% (2264 of 2309 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Russian) Currently translated at 60.7% (1419 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2290 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2319 of 2372 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ --------- Co-authored-by: Riccardo Giovanetti 
<riccardo.giovanetti@gmail.com> Co-authored-by: DustyShoe <warukeichi@gmail.com>

…hoami() (#8913) `get_token_permission` is deprecated and will be removed in huggingface_hub 1.0. Use `whoami()` to validate the token instead, as recommended by the deprecation warning.
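A minimal sketch of the replacement check, assuming the top-level `huggingface_hub.whoami(token=...)` function. The helper name `is_token_valid` and its injectable `whoami` parameter (which lets the check run offline) are hypothetical. The exact exception class raised for a rejected token varies across `huggingface_hub` versions, so this sketch treats any failure, including a network error, as "not valid".

```python
from typing import Any, Callable, Optional


def is_token_valid(token: str, whoami: Optional[Callable[..., Any]] = None) -> bool:
    """Return True if the Hugging Face Hub accepts `token` (hypothetical helper)."""
    if whoami is None:
        # Imported lazily so the helper can be exercised without huggingface_hub.
        from huggingface_hub import whoami as hf_whoami

        whoami = hf_whoami
    try:
        whoami(token=token)  # raises if the token is rejected or the request fails
    except Exception:
        return False
    return True
```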

Merged Z-Image checkpoints (e.g. models with LoRAs baked in) may bundle text encoder weights (text_encoders.*) or other non-transformer keys alongside the transformer weights. These cause load_state_dict() to fail with strict=True. Instead of disabling strict mode, explicitly whitelist valid ZImageTransformer2DModel key prefixes and discard everything else. Also moves RAM allocation after filtering so it doesn't over-allocate for discarded keys. Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
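The whitelist approach can be sketched as a prefix filter applied before `load_state_dict(..., strict=True)`. The prefixes below are hypothetical placeholders, not the real top-level module names of `ZImageTransformer2DModel`; `filter_state_dict` is likewise an illustrative name.

```python
from typing import Dict, Iterable, TypeVar

V = TypeVar("V")

# Hypothetical prefixes for illustration only; the real list must match the
# transformer's own top-level modules.
ALLOWED_PREFIXES = ("layers.", "x_embedder.", "t_embedder.", "final_layer.")


def filter_state_dict(state_dict: Dict[str, V], allowed_prefixes: Iterable[str] = ALLOWED_PREFIXES) -> Dict[str, V]:
    """Keep transformer keys; drop bundled extras such as text_encoders.*."""
    prefixes = tuple(allowed_prefixes)  # str.startswith accepts a tuple
    return {k: v for k, v in state_dict.items() if k.startswith(prefixes)}
```

Sizing the RAM allocation from the filtered dict rather than the raw checkpoint is what avoids over-allocating for the discarded keys.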

…art (#8932) Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

* feat(model_manager): add export/import for model settings Add the ability to export model settings (default_settings, trigger_phrases, cpu_only) as JSON and import them back. The model name is used as the filename for exports. https://claude.ai/code/session_01LXKjbRjfzcG3d3vzk3xRCh * fix(ui): reset settings forms after import so updated values display immediately The useForm defaultValues only apply on mount, so importing model settings updated the backend but the forms kept showing stale values. Added useEffect to reset forms when the underlying model config changes. Also fixed lint errors (strict equality, missing React import). * fix(ui): harden model settings export/import Prevent cross-model-type import errors by filtering imported fields against the target model's supported fields, showing clear warnings for incompatible or partially compatible settings instead of raw pydantic validation errors. Also fix falsy checks for empty arrays and objects in export, disable export button when nothing to export, add client-side validation and FileReader error handling on import. * Chore pnpm fix --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
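The export/import rules described in this commit can be sketched in Python, although the actual change lives in the TypeScript UI. The field names come from the commit message; `export_settings`, `import_settings` and the choice to keep an explicit `False` are assumptions of this sketch.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

EXPORTABLE_FIELDS = ("default_settings", "trigger_phrases", "cpu_only")


def _is_empty(value: Any) -> bool:
    # Explicit emptiness test: [] and {} mean "nothing to export",
    # while a meaningful False (e.g. cpu_only) is still exported.
    return value is None or (isinstance(value, (list, dict)) and len(value) == 0)


def export_settings(config: Dict[str, Any]) -> str:
    settings = {f: config[f] for f in EXPORTABLE_FIELDS if f in config and not _is_empty(config[f])}
    return json.dumps(settings, indent=2)


def import_settings(payload: str, supported_fields: Iterable[str]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (fields to apply, fields the target model type does not support)."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("settings file must contain a JSON object")
    supported = set(supported_fields)
    accepted = {k: v for k, v in data.items() if k in supported}
    skipped = sorted(k for k in data if k not in supported)
    return accepted, skipped
```

Returning the skipped field names, instead of passing everything to validation, is what lets the UI show a clear "partially compatible" warning.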

* Fix: Kill the server with one keyboard interrupt (#94) * Initial plan * Handle KeyboardInterrupt in run_app to allow single Ctrl+C shutdown Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * Force os._exit(0) on KeyboardInterrupt to avoid hanging on background threads Co-authored-by: lstein <111189+lstein@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> Fix graceful shutdown to wait for download/install worker threads (#102) * Initial plan * Replace os._exit(0) with ApiDependencies.shutdown() on KeyboardInterrupt Instead of immediately force-exiting the process on CTRL+C, call ApiDependencies.shutdown() to gracefully stop the download and install manager services, allowing active work to complete or cancel cleanly before the process exits. Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * Make stop() idempotent in download and model install services When CTRL+C is pressed, uvicorn's graceful shutdown triggers the FastAPI lifespan which calls ApiDependencies.shutdown(), then a KeyboardInterrupt propagates from run_until_complete() hitting the except block which tries to call ApiDependencies.shutdown() a second time. Change both stop() methods to return silently (instead of raising) when the service is not running. 
This handles: - Double-shutdown: lifespan already stopped the services - Early interrupt: services were never fully started Co-authored-by: lstein <111189+lstein@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> Fix shutdown hang on session processor thread lock (#108) * Initial plan * Fix shutdown hang: wake session processor thread on stop() and mark daemon Co-authored-by: lstein <111189+lstein@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * Fix: shut down asyncio executor on KeyboardInterrupt to prevent post-generation hang (#112) Fix: cancel pending asyncio tasks before loop.close() to suppress destroyed-task warnings Fix: suppress stack trace when dispatching events after event loop is closed on shutdown Fix: cancel in-progress generation on stop() to prevent core dump during mid-flight Ctrl+C Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
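The idempotent-`stop()` pattern from these commits can be sketched with a generic background service. `BackgroundService` is a hypothetical class, not the download or model-install service itself. It shows the three fixes together: `stop()` returns silently when not running, the worker is woken via an event instead of being left blocked, and the thread is a daemon.

```python
import threading
from typing import Optional


class BackgroundService:
    """Service whose stop() is safe to call twice, or before start()."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running = False
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        with self._lock:
            if self._running:
                return
            self._stop_event.clear()
            # daemon=True so a stuck worker can never keep the process alive.
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._running = True
            self._thread.start()

    def _run(self) -> None:
        # Placeholder work loop: wake periodically until asked to stop.
        while not self._stop_event.wait(timeout=0.05):
            pass

    def stop(self) -> None:
        with self._lock:
            if not self._running:
                return  # double shutdown or early interrupt: nothing to do
            self._running = False
            thread = self._thread
        self._stop_event.set()  # wake the worker instead of leaving it blocked
        thread.join()
```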

…ions (#8920) * Persist selected board and auto-select most recent image across browser sessions (#92) * Persist selectedBoardId across browser sessions Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * fix(frontend): make appStarted listener async so image auto-selection works on startup Co-authored-by: lstein <111189+lstein@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * chore(frontend): remove unwanted package-lock.json --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

…nation directory (#104) (#8931) * Initial plan * Fix race condition in _do_download when scanning for .downloading files * chore(backend): update copyright --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* fix(prompt): add more punctuation, fixing attention hotkeys that removed it from the prompt. * fix(prompt): improve numeric weighting calculation * feat(prompts): add numeric attention preference toggle to settings * feat(prompts): use attention style preference, rewrite to accommodate prompt functions * fix(prompts): account for weirdness with quotes: account for mismatching quotes, missing quotes and other quote entities * fix(prompts): add tests, qol improvements, code cleanup * fix(prompts): test lint * fix(prompts): remove unused exports * fix(prompts): separator whitespace serialization --------- Co-authored-by: joshistoast <me@joshcorbett.com> Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

The reidentify endpoint overwrote the model's relative path with an absolute path from the prober, and unconditionally accessed trigger_phrases which doesn't exist on all config types (e.g. IP Adapters), causing an AttributeError. Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
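Both fixes can be sketched with dataclasses: keep the stored relative path instead of the prober's absolute one, and touch `trigger_phrases` only when the config type has it. `ModelConfig`, `LoRAConfig` and `reidentify` are hypothetical stand-ins, and carrying the old trigger phrases over to the new config is an assumption of this sketch.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class ModelConfig:  # hypothetical minimal config record
    key: str
    path: str  # stored relative to the models directory
    name: str


@dataclass
class LoRAConfig(ModelConfig):
    trigger_phrases: Optional[List[str]] = None


def reidentify(existing: ModelConfig, probed: ModelConfig) -> ModelConfig:
    # Take the prober's fresh result but keep the record's key and its
    # *relative* path, never the prober's absolute path.
    updated = replace(probed, key=existing.key, path=existing.path)
    # Not every config type has trigger_phrases (e.g. IP Adapters), so use
    # getattr/hasattr instead of unconditional attribute access.
    old_phrases = getattr(existing, "trigger_phrases", None)
    if old_phrases is not None and hasattr(updated, "trigger_phrases"):
        updated.trigger_phrases = old_phrases
    return updated
```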

) * perf(flux2): optimize model loading order to prevent cache eviction (fixes #7513) * Update flux2_klein_text_encoder.py * Update flux2_klein_text_encoder.py version --------- Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev> Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

…ansformer-only keys (#8938) LoRAs trained with musubi-tuner (and potentially other trainers) that only target transformer blocks (double_blocks/single_blocks) without embedding layers (txt_in/vector_in/context_embedder) were incorrectly classified as Flux 1. Add fallback detection using attention projection hidden_size and MLP ratio from transformer block tensors Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

* translationBot(ui): update translation (Italian) Currently translated at 98.0% (2205 of 2250 strings) Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ Translation: InvokeAI/Web UI * translationBot(ui): update translation files Updated by "Remove blank strings" hook in Weblate. Co-authored-by: Hosted Weblate <hosted@weblate.org> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ Translation: InvokeAI/Web UI * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2210 of 2259 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2224 of 2272 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2252 of 2295 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.0% (2264 of 2309 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Russian) Currently translated at 60.7% (1419 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2290 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2319 of 2372 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) 
Currently translated at 97.7% (2327 of 2380 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ --------- Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Co-authored-by: DustyShoe <warukeichi@gmail.com>

* Added SQL injection tests * Updated tests after multi-user merge * ruff:format --------- Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

* Add user management UI for admin and regular users (#106) * Add user management UI and backend API endpoints Co-authored-by: lstein <111189+lstein@users.noreply.github.com> Fix user management feedback: cancel/back navigation, system user filter, tooltip fix Co-authored-by: lstein <111189+lstein@users.noreply.github.com> Make Back button on User Management page more prominent Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * chore(frontend): typegen --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com> * Add Confirm Password field to My Profile password change form (#110) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>

* fix(gallery): restore arrow-key browsing and extract shared prev/next navigation * Added same behavior to Upscale mode and autofocus to gallery after using hotkeys Ctrl+Enter and Ctrl+Shift+Enter * restore arrow navigation focus flow across viewer states * fix(gallery): stabilize arrow-key browsing, remove viewer UI flicker, and optimize code --------- Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

* docs(multiuser): update multiuser mode documentation * Update docs/multiuser/user_guide.md Co-authored-by: dunkeroni <dunkeroni@gmail.com> * Update docs/multiuser/user_guide.md Co-authored-by: dunkeroni <dunkeroni@gmail.com> * Update docs/multiuser/user_guide.md Co-authored-by: dunkeroni <dunkeroni@gmail.com> * slight wording change * add info about the host interface binding option --------- Co-authored-by: dunkeroni <dunkeroni@gmail.com>

Co-authored-by: Contributor <contributor@example.com>

* translationBot(ui): update translation (Italian) Currently translated at 98.0% (2205 of 2250 strings) Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ Translation: InvokeAI/Web UI * translationBot(ui): update translation files Updated by "Remove blank strings" hook in Weblate. Co-authored-by: Hosted Weblate <hosted@weblate.org> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ Translation: InvokeAI/Web UI * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2210 of 2259 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2224 of 2272 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2252 of 2295 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.0% (2264 of 2309 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Russian) Currently translated at 60.7% (1419 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2290 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2319 of 2372 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) 
Currently translated at 97.7% (2327 of 2380 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2328 of 2382 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.5% (2370 of 2429 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ --------- Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Co-authored-by: DustyShoe <warukeichi@gmail.com>

* feat: add strict_password_checking config option to relax password requirements - Add `strict_password_checking: bool = Field(default=False)` to InvokeAIAppConfig - Add `get_password_strength()` function to password_utils.py (returns weak/moderate/strong) - Add `strict_password_checking` field to SetupStatusResponse API endpoint - Update users_base.py and users_default.py to accept `strict_password_checking` param - Update auth.py router to pass config.strict_password_checking to all user service calls - Create shared frontend utility passwordUtils.ts for password strength validation - Update AdministratorSetup, UserProfile, UserManagement components to: - Fetch strict_password_checking from setup status endpoint - Show colored strength indicators (red/yellow/blue) in non-strict mode - Allow any non-empty password in non-strict mode - Maintain strict validation behavior when strict_password_checking=True - Update SetupStatusResponse type in auth.ts endpoint - Add passwordStrength and passwordHelperRelaxed translation keys to en.json - Add tests for new get_password_strength() function Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * Changes before error encountered Co-authored-by: lstein <111189+lstein@users.noreply.github.com> * chore(backend): docstrings * chore(frontend): typegen --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: lstein <111189+lstein@users.noreply.github.com> Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
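A sketch of the relaxed-versus-strict behavior, assuming illustrative thresholds: the commit only states that `get_password_strength()` returns weak/moderate/strong and that non-strict mode accepts any non-empty password. The scoring rules and the strict-mode rule below, and the `is_password_acceptable` helper, are hypothetical.

```python
import string


def get_password_strength(password: str) -> str:
    """Return "weak", "moderate" or "strong" (thresholds are illustrative)."""
    groups = (string.ascii_lowercase, string.ascii_uppercase, string.digits, string.punctuation)
    variety = sum(any(ch in group for ch in password) for group in groups)
    if len(password) >= 12 and variety >= 3:
        return "strong"
    if len(password) >= 8 and variety >= 2:
        return "moderate"
    return "weak"


def is_password_acceptable(password: str, strict_password_checking: bool = False) -> bool:
    if not password:
        return False  # an empty password is never accepted
    if not strict_password_checking:
        return True  # relaxed mode: strength is shown as advice only
    # Hypothetical strict rule for this sketch.
    return get_password_strength(password) == "strong"
```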

…ory (#8954) When deleting a file-based model (e.g. LoRA), the previous logic used rmtree on the parent directory, which would delete all files in that folder — even unrelated ones. Now only the specific model file is removed, and the parent directory is cleaned up only if empty afterward.
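A minimal `pathlib` sketch of the new deletion rule. `delete_model_file` and the `models_root` guard (which keeps the models directory itself from ever being removed) are illustrative assumptions, not the installer's actual code.

```python
from pathlib import Path


def delete_model_file(model_path: Path, models_root: Path) -> None:
    """Remove one model file, then its folder only if it is now empty."""
    model_path.unlink()
    parent = model_path.parent
    # Unlike shutil.rmtree(parent), unrelated files in the folder survive.
    if parent != models_root and not any(parent.iterdir()):
        parent.rmdir()
```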

…dels (#8960) * fix(ui): resolve models by name+base+type when recalling metadata for reinstalled models When a model (IP Adapter, ControlNet, etc.) is deleted and reinstalled, it gets a new UUID key. Previously, metadata recall would fail because it only looked up models by their stored UUID key. Now the recall falls back to searching by name+base+type, allowing reinstalled models with the same name to be correctly resolved. https://claude.ai/code/session_01XYubzMK363BXGTvfJJqFnX * Add hash-based model recall fallback for reinstalled models When a model is deleted and reinstalled, it gets a new UUID key but retains the same BLAKE3 content hash. This adds hash as a middle fallback stage in model resolution (key → hash → name+base+type), making recall more robust. Changes: - Add /api/v2/models/get_by_hash backend endpoint (uses existing search_by_hash from model records store) - Add getModelConfigByHash RTK Query endpoint in frontend - Add hash fallback to both resolveModel and parseModelIdentifier https://claude.ai/code/session_01XYubzMK363BXGTvfJJqFnX * Chore pnpm fix * Chore typegen --------- Co-authored-by: Claude <noreply@anthropic.com>
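The key → hash → name+base+type lookup order can be sketched over an in-memory list. The real change is in the frontend's `resolveModel`/`parseModelIdentifier` plus the `/api/v2/models/get_by_hash` endpoint; `ModelRecord`, `resolve_model` and the hash string format here are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelRecord:  # hypothetical minimal record
    key: str
    hash: str
    name: str
    base: str
    type: str


def resolve_model(
    records: Iterable[ModelRecord], key: str, hash_: Optional[str], name: str, base: str, type_: str
) -> Optional[ModelRecord]:
    records = list(records)
    # 1. Exact key: works while the original install still exists.
    for r in records:
        if r.key == key:
            return r
    # 2. Content hash: survives delete + reinstall (new key, same hash).
    if hash_:
        for r in records:
            if r.hash == hash_:
                return r
    # 3. Identity triple: last resort when the file content also changed.
    for r in records:
        if (r.name, r.base, r.type) == (name, base, type_):
            return r
    return None
```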

* Repair partially loaded Qwen models after cancel to avoid device mismatches * ruff * Repair CogView4 text encoder after canceled partial loads * Avoid MPS CI crash in repair regression test * Fix MPS device assertion in repair test

* translationBot(ui): update translation (Italian) Currently translated at 98.0% (2205 of 2250 strings) Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ Translation: InvokeAI/Web UI * translationBot(ui): update translation files Updated by "Remove blank strings" hook in Weblate. Co-authored-by: Hosted Weblate <hosted@weblate.org> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ Translation: InvokeAI/Web UI * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2210 of 2259 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2224 of 2272 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2252 of 2295 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.0% (2264 of 2309 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Russian) Currently translated at 60.7% (1419 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2290 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2319 of 2372 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) 
Currently translated at 97.7% (2327 of 2380 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2328 of 2382 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.5% (2370 of 2429 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Finnish) Currently translated at 1.5% (37 of 2429 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/fi/ * translationBot(ui): update translation (Italian) Currently translated at 97.5% (2373 of 2433 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ --------- Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Co-authored-by: DustyShoe <warukeichi@gmail.com> Co-authored-by: Ilmari Laakkonen <ilmarille@gmail.com>

* change submenu icon to phosphor * Use PiIntersectSquareBold

* chore: bump version to 6.12.0 * chore: update What's New text

* Add chained collect node * test(frontend): align parseSchema fixtures with collect v1.1 and normalize undefined fields in assertions * fix(nodes): block collect-to-collect links when inferred item types differ --------- Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

* translationBot(ui): update translation (Italian) Currently translated at 98.0% (2205 of 2250 strings) Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ Translation: InvokeAI/Web UI * translationBot(ui): update translation files Updated by "Remove blank strings" hook in Weblate. Co-authored-by: Hosted Weblate <hosted@weblate.org> Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ Translation: InvokeAI/Web UI * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2210 of 2259 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.8% (2224 of 2272 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2252 of 2295 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 98.0% (2264 of 2309 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Russian) Currently translated at 60.7% (1419 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ru/ * translationBot(ui): update translation (Italian) Currently translated at 98.1% (2290 of 2334 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2319 of 2372 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) 
Currently translated at 97.7% (2327 of 2380 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.7% (2328 of 2382 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Italian) Currently translated at 97.5% (2370 of 2429 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Finnish) Currently translated at 1.5% (37 of 2429 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/fi/ * translationBot(ui): update translation (Italian) Currently translated at 97.5% (2373 of 2433 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ * translationBot(ui): update translation (Japanese) Currently translated at 87.1% (2120 of 2433 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/ja/ * translationBot(ui): update translation (Italian) Currently translated at 97.5% (2374 of 2433 strings) Translation: InvokeAI/Web UI Translate-URL: https://hosted.weblate.org/projects/invokeai/web-ui/it/ --------- Co-authored-by: Riccardo Giovanetti <riccardo.giovanetti@gmail.com> Co-authored-by: DustyShoe <warukeichi@gmail.com> Co-authored-by: Ilmari Laakkonen <ilmarille@gmail.com> Co-authored-by: 嶋田豪介 <shimada_gosuke@cyberagent.co.jp>

- Apply ruff 0.11.2 formatting to the files flagged by `ruff format --check`. - The new fail-fast guard in get_generation_devices() (reject a CUDA device that doesn't exist) made the pre-existing test_get_generation_devices_explicit_list_is_deduplicated fail on CPU-only CI runners, since it passes a cuda list with no CUDA present. Mock torch.cuda.is_available/device_count in that test (matching the existing pattern in this file) so it validates dedup on any runner. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
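To show why the dedup test broke on CPU-only runners, here is a sketch of a fail-fast guard with the CUDA probes passed in, so a test can simulate any machine. `validate_generation_devices` is a hypothetical name, not the real `get_generation_devices()`. With real torch, the equivalent in a test is `unittest.mock.patch("torch.cuda.is_available", return_value=True)` together with a similar patch of `torch.cuda.device_count`.

```python
from typing import Callable, List, Sequence


def validate_generation_devices(
    requested: Sequence[str],
    cuda_available: Callable[[], bool],
    cuda_device_count: Callable[[], int],
) -> List[str]:
    """Deduplicate devices (order kept) and reject CUDA devices that don't exist."""
    devices: List[str] = []
    for dev in requested:
        if dev not in devices:
            devices.append(dev)
    for dev in devices:
        if dev.startswith("cuda"):
            index = int(dev.split(":", 1)[1]) if ":" in dev else 0
            if not cuda_available() or index >= cuda_device_count():
                raise ValueError(f"CUDA device {dev!r} does not exist")
    return devices
```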

Three RAM fixes for multi-GPU (the third also helps single-GPU), addressing transient spikes to ~100% RAM and swapping during text-encode/transformer loads:

1. Cap the global RAM-cache budget at a safe fraction of system RAM. When max_cache_ram_gb is unset, the budget was the *sum* of the per-device cache heuristics, so N GPUs each claiming ~50% of RAM summed to ~N*50% and starved the OS. Now clamp the sum to ModelCache.calc_system_ram_headroom_bytes() (50% of RAM - 2GB baseline, floored at 4GB). Promote the sizing magic numbers to named constants shared by the per-device heuristic and the global cap.

2. Adopt already-resident CPU weights across devices at load time. When a second device loads a model another device already holds, deep-copy a registered meta-weight structural clone and assign the shared canonical weights, instead of re-reading the model from disk and materializing a full transient second copy. Loader-agnostic (one mechanism in ModelLoader, no per-loader code): works for diffusers, single-file checkpoint, GGUF and transformers models, and preserves registered hooks (e.g. fp8 layerwise-cast). Best-effort with a meta-tensor self-check and fallback to a normal disk load on any failure. Skipped on single-device installs.

3. Dequantize FLUX.2 FP8 checkpoints straight to bf16. _dequantize_fp8_weights materialized the whole model in float32 (~36GB for 9B) before a later cast to bf16; now the multiply is done in float32 but stored bf16 per-weight, so the model is never held in float32. Numerically identical; halves the cold-load transient (helps single-GPU too).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
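Fix 1 is plain arithmetic, so it can be sketched directly. This is a minimal sketch, assuming "50% of RAM - 2GB baseline, floored at 4GB" means max(0.5 * RAM - 2 GiB, 4 GiB) and that an explicit max_cache_ram_gb is used as-is; the constant and function names below are hypothetical, not the ones in ModelCache.

```python
from typing import List, Optional

GiB = 2**30  # assumption: the commit's "GB" figures are treated as GiB here

# Hypothetical names for the sizing numbers quoted in the commit message.
RAM_CACHE_FRACTION = 0.5         # at most half of system RAM...
RAM_BASELINE_BYTES = 2 * GiB     # ...minus a fixed 2 GB baseline...
RAM_CACHE_FLOOR_BYTES = 4 * GiB  # ...but never less than 4 GB


def system_ram_headroom_bytes(total_ram_bytes: int) -> int:
    """50% of RAM minus the baseline, floored at the minimum budget."""
    headroom = int(total_ram_bytes * RAM_CACHE_FRACTION) - RAM_BASELINE_BYTES
    return max(headroom, RAM_CACHE_FLOOR_BYTES)


def global_ram_cache_budget(
    per_device_budgets: List[int],
    total_ram_bytes: int,
    max_cache_ram_bytes: Optional[int] = None,
) -> int:
    """Global RAM-cache budget shared by all per-device caches."""
    if max_cache_ram_bytes is not None:
        # Assumption: an explicit user setting bypasses the clamp.
        return max_cache_ram_bytes
    # Previously this was just sum(per_device_budgets); clamping it stops
    # N devices that each claim ~50% of RAM from adding up to N * 50%.
    return min(sum(per_device_budgets), system_ram_headroom_bytes(total_ram_bytes))
```

For example, on a 64 GiB machine with four GPUs whose per-device heuristics each ask for 32 GiB, the unclamped sum is 128 GiB; the clamp brings it down to 64/2 - 2 = 30 GiB.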

The Qwen Image VAE encode/decode invocations called model_on_device() without a working-memory estimate, unlike every other VAE family (SD/SDXL/SD3/CogView4/FLUX). So the model cache reserved only its small default working memory and never offloaded a large resident transformer (the VAE weights themselves are tiny), and the VAE's forward-pass activations then OOM'd VRAM. For example, a ~40GB Qwen Image Edit transformer left ~1GB free while decode needed ~5GB. Reproduces on single-GPU; unrelated to the multi-GPU RAM work.

Add estimate_vae_working_memory_qwen_image() (same per-output-pixel scaling as the other estimators, handling the 5D Qwen latents) and pass it from both the i2l (encode, used for reference images in Image Edit) and l2i (decode) nodes, so the cache offloads the transformer before the VAE runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
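The shape handling is the part specific to Qwen: its latents carry an extra frame axis, so the spatial size has to be read from the last two dimensions. Below is a minimal sketch, assuming the estimate has the form pixels * bytes-per-element * per-operation constant and an 8x latent scale factor; the function names are hypothetical, and the default 2200/1100 constants are simply the generic SD/FLUX values quoted in the Qwen decode-peak commit.

```python
from typing import Sequence, Tuple

LATENT_SCALE_FACTOR = 8  # assumption: 8x spatial downsampling between pixels and latents


def spatial_size(shape: Sequence[int]) -> Tuple[int, int]:
    """Return (H, W) for 4D (B, C, H, W) or 5D (B, C, T, H, W) tensors."""
    if len(shape) not in (4, 5):
        raise ValueError(f"expected a 4D or 5D shape, got {len(shape)}D")
    # Qwen Image latents add a frame axis; H and W are always the last two dims.
    return shape[-2], shape[-1]


def estimate_vae_working_memory_sketch(
    operation: str,
    shape: Sequence[int],
    element_size: int = 2,        # bytes per element, e.g. 2 for bf16/fp16
    decode_constant: int = 2200,  # hypothetical per-pixel scaling constants
    encode_constant: int = 1100,
) -> int:
    h, w = spatial_size(shape)
    if operation == "decode":
        # Decode input is latents, so scale up to output pixels.
        pixels = (h * LATENT_SCALE_FACTOR) * (w * LATENT_SCALE_FACTOR)
        constant = decode_constant
    elif operation == "encode":
        # Encode input is the image itself.
        pixels = h * w
        constant = encode_constant
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    return pixels * element_size * constant
```

Because the estimate is linear in pixels, raising the per-operation constants (as the later commit does with 13000/6500) scales every reservation proportionally without changing the shape handling.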

The FLUX.2 VAE encoder's mid-block self-attention scales quadratically with the input's spatial size, and on ROCm scaled_dot_product_attention falls back to a materialized attention matrix. Encoding a reference image (kontext) at full size therefore allocated ~15GB in a single attention call at 1024px — and hundreds of GB at the 2024px reference cap — OOMing VRAM regardless of how much other model memory was freed. Tile the reference-image encode to bound per-tile attention. The VAE's default tile size equals its sample_size (1024), whose per-tile attention still OOMs, so force a 512px tile (with a matching latent tile size derived from the config). Save/restore the VAE's tiling config since it is a shared, cached instance, so the final image decode does not inherit these settings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
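Two ideas carry this fix: attention memory grows with the fourth power of the image side, and a shared VAE's tiling settings must be restored after use. A minimal sketch of both, assuming diffusers AutoencoderKL-style attributes (use_tiling, tile_sample_min_size, tile_latent_min_size, config.block_out_channels) on the FLUX.2 VAE; forced_vae_tiling, attention_memory_ratio and the stub VAE are hypothetical.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def forced_vae_tiling(vae, tile_sample_size: int = 512):
    """Temporarily force tiled encode on a shared, cached VAE, then restore its settings."""
    saved = (vae.use_tiling, vae.tile_sample_min_size, vae.tile_latent_min_size)
    # Derive the latent tile from the config: one 2x downsample per block after the first.
    num_downsamples = len(vae.config.block_out_channels) - 1
    try:
        vae.use_tiling = True
        vae.tile_sample_min_size = tile_sample_size
        vae.tile_latent_min_size = tile_sample_size // (2 ** num_downsamples)
        yield vae
    finally:
        # Restore even on error, so the final image decode never inherits these settings.
        vae.use_tiling, vae.tile_sample_min_size, vae.tile_latent_min_size = saved


def attention_memory_ratio(side_a: int, side_b: int) -> float:
    """Tokens grow with side**2 and the attention matrix with tokens**2, so memory grows with side**4."""
    return (side_b / side_a) ** 4


def make_stub_vae():
    # Stand-in object exposing the tiling attributes used above.
    config = SimpleNamespace(block_out_channels=(128, 256, 512, 512), sample_size=1024)
    return SimpleNamespace(config=config, use_tiling=False, tile_sample_min_size=1024, tile_latent_min_size=128)
```

By this scaling alone, a 512px tile needs 1/16 of the attention memory of a 1024px encode, while a 2024px reference needs roughly 15x as much, which is consistent with ~15GB at 1024px growing to hundreds of GB at the reference cap.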

ModelCache._get_vram_in_use() called torch.cuda.memory_allocated() with no device argument, while _get_vram_available() reads memory_allocated(execution_device). The formula relies on those two canceling. In multi-GPU mode each worker calls torch.cuda.set_device for its own GPU, so the process-current device flips between workers; the no-argument call can then read a different (e.g. idle) GPU's allocation, breaking the cancellation and inflating "available" VRAM toward the card total. The cache then believes there is room and never offloads, so VRAM offloading effectively ignores device_working_mem_gb in multi-GPU. Single-GPU was unaffected (current device always equals the execution device). Query self._execution_device in both _get_vram_in_use() and the cache-state debug log. Add a regression test asserting the per-cache execution device is used. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
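The bug comes down to which device a no-argument query reads. The sketch below models it without a GPU, assuming only that, like torch.cuda.memory_allocated(), a device=None call reads the process-current device; FakeCuda and CacheVramSketch are hypothetical stand-ins, not InvokeAI classes.

```python
class FakeCuda:
    """Stand-in for torch.cuda: per-device allocation counters and one process-wide current device."""

    def __init__(self, allocated_by_device):
        self.allocated_by_device = dict(allocated_by_device)
        self.current_device = 0

    def set_device(self, index: int) -> None:
        self.current_device = index

    def memory_allocated(self, device=None) -> int:
        # device=None means "whatever device is current for the process".
        index = self.current_device if device is None else device
        return self.allocated_by_device[index]


class CacheVramSketch:
    """Hypothetical per-device cache; only the VRAM query matters here."""

    def __init__(self, cuda: FakeCuda, execution_device: int):
        self._cuda = cuda
        self._execution_device = execution_device

    def vram_in_use_buggy(self) -> int:
        # Reads whichever device another worker last selected with set_device().
        return self._cuda.memory_allocated()

    def vram_in_use(self) -> int:
        # Fixed: always query this cache's own execution device.
        return self._cuda.memory_allocated(self._execution_device)
```

With GPU 0 holding a large model and an idle GPU 1, a worker pinning GPU 1 makes the buggy query on GPU 0's cache report near-zero usage, which is exactly the inflated "available" VRAM the commit describes.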

… decode peak

The Qwen Image VAE is a 3D-conv (video) VAE whose decode allocates large conv3d feature maps. A ~1MP decode was measured to peak at ~17 GiB of VRAM, far above what the generic 2200/1100 SD/FLUX constants reserved (~4.6 GiB), so the cache concluded the decode "fit" alongside the resident 20GB transformer + 15GB text encoder, never offloaded them, and OOMed. The offload only frees ~(working_mem - free) bytes, so the reservation must both cover the real peak and be large enough to trigger the offload of models the decode doesn't need.

Raise the Qwen decode/encode constants (13000/6500) to match the measured peak. The estimate is linear in output pixels, so it over-reserves past ~1.5MP (where the decode can exceed the card even after offloading); that case is covered by force_tiled_decode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
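The "only frees ~(working_mem - free) bytes" point explains why an undersized reservation OOMs even when plenty of model memory could be offloaded. A minimal sketch, assuming offloading frees exactly the shortfall and enough resident model weights exist to cover it; the 5 GiB free figure in the usage is illustrative only, and both function names are hypothetical.

```python
GiB = 2**30


def bytes_to_offload(working_mem_bytes: float, free_vram_bytes: float) -> float:
    """The cache frees only the shortfall between the reservation and the currently free VRAM."""
    return max(0, working_mem_bytes - free_vram_bytes)


def decode_fits(peak_bytes: float, working_mem_bytes: float, free_vram_bytes: float) -> bool:
    """Whether the real decode peak fits once the cache has offloaded for its reservation."""
    free_after = free_vram_bytes + bytes_to_offload(working_mem_bytes, free_vram_bytes)
    return peak_bytes <= free_after
```

With a hypothetical 5 GiB free, a ~4.6 GiB reservation triggers no offload at all and a ~17 GiB peak cannot fit; a 17 GiB reservation offloads 12 GiB and the same peak fits.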

The Qwen Image latents-to-image node hardcoded vae.disable_tiling(), ignoring the global force_tiled_decode setting that the SD/SDXL l2i node honors. Wire it up the same way so users can opt into tiled VAE decode for very large outputs that exceed VRAM even after the transformer/text encoder are offloaded. Off by default, so normal-size decodes are unchanged (full-frame, no tile blending). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
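The wiring replaces an unconditional disable_tiling() with a branch on the setting. A minimal sketch, assuming the app config exposes a boolean force_tiled_decode attribute and the VAE exposes diffusers-style enable_tiling()/disable_tiling(); StubVae and apply_tiling_setting are hypothetical.

```python
from types import SimpleNamespace


class StubVae:
    """Stand-in exposing the enable_tiling()/disable_tiling() pair that diffusers VAEs provide."""

    def __init__(self):
        self.use_tiling = False

    def enable_tiling(self):
        self.use_tiling = True

    def disable_tiling(self):
        self.use_tiling = False


def apply_tiling_setting(vae, app_config) -> None:
    # Honor the global setting instead of hardcoding disable_tiling().
    if app_config.force_tiled_decode:
        vae.enable_tiling()
    else:
        vae.disable_tiling()  # default: full-frame decode, no tile blending


vae = StubVae()
apply_tiling_setting(vae, SimpleNamespace(force_tiled_decode=True))  # user opted in
```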

The preview-panel progress circle re-renders on every InvocationProgressEvent. The parent passes a fresh progressEvent object each event, so the CircularProgress re-rendered constantly; during the indeterminate phases (everything except denoising) that restarted its CSS spin animation each time, which looked like the disk flashing. (Determinate denoising was unaffected because the value genuinely changes per step.) Split the circle into a memoized, ref-forwarding subcomponent keyed on its visual props (isIndeterminate, value, device label) so message-only updates no longer re-render it and the spin animation stays continuous. The Tooltip still anchors to it via the forwarded ref. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…cket double-emit) (#9288)

* feat(api): add append mode to recall reference images

  POST /api/v1/recall/{queue_id}?append=true now asks the frontend to add the recalled reference images (ip_adapters and model-free reference_images) to its existing list instead of replacing it. The flag rides inside the event's parameters dict so the generated client schema needs no regeneration, and is injected after the persistence loop so it is never stored as a recall parameter. Mutually exclusive with strict.

  The frontend dispatches refImagesRecalled with replace:false in append mode, and skips the dispatch entirely when nothing resolved so a failed append can never clear the user's current reference images.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(sockets): emit recall event once to owner+admin room union

  RecallParametersUpdatedEvent was emitted in two separate socket.io calls: one to the owner's user room, one to the admin room. A socket that belongs to both (the "system" user in single-user mode is also an admin, so it joins user:system AND admin) received the event twice. That double delivery was invisible for the scalar/replace recall fields, which are idempotent, but the append-mode reference-image recall pushes rather than replaces, so each append showed up as two copies of the same reference image in the InvokeAI canvas.

  Emit once to the room union [user_room, "admin"] instead. python-socketio deduplicates recipients across a room list, so a socket in both rooms is delivered to exactly once, while genuinely distinct owner/admin sockets still each receive it.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(api): regenerate openapi.json + schema.ts for recall append param

  Rebuilds the committed OpenAPI schema and generated TypeScript types so the update_recall_parameters operation advertises the new append query parameter. Generated via 'make frontend-openapi' / 'frontend-typegen' equivalent; the only change is the added append param + its docstring.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: document append query parameter for recall API

  Documents the new append=true query parameter on POST /api/v1/recall/{queue_id}:
  - new Query parameters subsection covering strict and append
  - mutual exclusivity (strict+append -> 400) with error body
  - append-mode cURL example
  - updated WebSocket Events + frontend log sample for the merged reference-image list

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>
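The double-emit fix is about deduplicating recipients across rooms before delivery. The toy model below, which is not the python-socketio API, contrasts one emit per room with one emit to the union of rooms, assuming room membership is a set of session ids per room; RoomRouter and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class RoomRouter:
    """Toy model of socket.io room delivery (not the python-socketio API)."""

    def __init__(self) -> None:
        self.rooms: Dict[str, Set[str]] = defaultdict(set)
        self.delivered: Dict[str, List[str]] = defaultdict(list)

    def join(self, sid: str, room: str) -> None:
        self.rooms[room].add(sid)

    def emit_per_room(self, event: str, rooms: Iterable[str]) -> None:
        # Old behaviour: one emit per room, so a socket in both rooms gets the event twice.
        for room in rooms:
            for sid in self.rooms[room]:
                self.delivered[sid].append(event)

    def emit_to_room_union(self, event: str, rooms: Iterable[str]) -> None:
        # New behaviour: deduplicate recipients across the room list, then deliver once each.
        recipients: Set[str] = set().union(*(self.rooms[room] for room in rooms))
        for sid in recipients:
            self.delivered[sid].append(event)
```

Idempotent "replace" events hide the difference; an append-style event, like the reference-image recall, makes the duplicate visible.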

* fix metadata overrides with empty string values
* chore(backend): ruff

---------

Co-authored-by: wunianze666-netizen <wunianze666@gmail.com>
Co-authored-by: Alexander Eichhorn <alex@eichhorn.dev>
Co-authored-by: Jonathan <34005131+JPPhoto@users.noreply.github.com>

* fix(z-image): repair regional guidance forward after diffusers refactor

  Z-Image Regional Guidance crashed with "split_with_sizes expects split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The regional-prompting patch was a hand-copied snapshot of an outdated ZImageTransformer2DModel.forward. The installed diffusers version changed _pad_with_ids so caption pos_ids are now longer than the caption feature tensor, while the stale patch split RoPE embeddings by feature lengths instead of pos_ids lengths.

  Rewrite create_regional_forward to delegate to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) and only override the main-layer attention mask to inject the regional mask. This keeps the patch in sync with upstream diffusers and stops re-implementing the drift-prone patchify/RoPE/padding logic.

* fix(z-image): repair & realign regional guidance after diffusers refactor

  Z-Image Regional Guidance crashed with "split_with_sizes expects split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The regional-prompting patch was a hand-copied snapshot of an outdated ZImageTransformer2DModel.forward; the installed diffusers version changed _pad_with_ids so caption pos_ids are longer than the caption feature tensor, while the stale patch split RoPE embeddings by feature lengths instead of pos_ids lengths.

  Rewrite create_regional_forward to delegate to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) so it stays in sync with upstream diffusers, and only override the main-layer attention mask.

  Also fix two reasons regional guidance had no visible effect:
  - Mask alignment: the unified sequence pads the image and caption blocks individually to a multiple of 32, so the real layout is [img_real | img_pad | txt_real | txt_pad]. Scatter the four regional sub-blocks into their padding-aware positions instead of assuming a contiguous top-left block (which only matched square 1024x1024).
  - CFG pass: the patched forward also runs for the negative prompt; only apply the regional mask to passes whose caption length matches the positive prompt, otherwise fall back to the plain padding mask.

* Chore Ruff + Typegen

* fix(z-image): use identity to gate regional mask onto the positive pass

  The regional attention patch ran for both the conditioned and negative/CFG forward passes and distinguished them by comparing the padded caption length against the positive prompt's expected length. Two short prompts that round up to the same multiple of 32 collided, so the positive regional mask could be injected into the unconditional prediction and silently corrupt CFG.

  Discriminate the conditioned pass by tensor identity (cap_feats is the exact positive_cap_feats the mask was built for) instead of a length heuristic, so the positive and negative passes can never be confused. The context manager now requires positive_cap_feats whenever a regional mask is provided, turning the previously inferred invariant into an enforced one rather than a silent no-op.

  Also build the (bsz, 1, S, S) float mask lazily: compute applied_regional from cheap scalar checks first and skip materializing/cloning the full mask on passes that never match (every negative pass), avoiding a ~33 MB bf16 clone per call.

---------

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>
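The mask-alignment and identity-gating fixes can both be shown on plain lists. A minimal sketch, assuming a compact boolean mask over [img_real | txt_real] and that padding rows and columns stay masked out; padded_layout, scatter_regional_mask and choose_attention_mask are hypothetical helpers, not the patched forward's code.

```python
from typing import List, Sequence, Tuple

Mask = List[List[bool]]


def round_up(n: int, multiple: int) -> int:
    return -(-n // multiple) * multiple


def padded_layout(img_len: int, txt_len: int, multiple: int = 32) -> Tuple[List[int], int]:
    """Map compact [img_real | txt_real] indices to [img_real | img_pad | txt_real | txt_pad] positions."""
    img_padded = round_up(img_len, multiple)
    txt_padded = round_up(txt_len, multiple)
    positions = list(range(img_len)) + list(range(img_padded, img_padded + txt_len))
    return positions, img_padded + txt_padded


def scatter_regional_mask(compact: Sequence[Sequence[bool]], img_len: int, txt_len: int, multiple: int = 32) -> Mask:
    """Place all four sub-blocks (img-img, img-txt, txt-img, txt-txt) at padding-aware positions."""
    positions, total = padded_layout(img_len, txt_len, multiple)
    full = [[False] * total for _ in range(total)]  # padding rows/columns stay masked out
    for i, row_pos in enumerate(positions):
        for j, col_pos in enumerate(positions):
            full[row_pos][col_pos] = bool(compact[i][j])
    return full


def choose_attention_mask(cap_feats, positive_cap_feats, regional_mask, padding_mask):
    # Identity, not length: two short prompts can pad to the same multiple of 32.
    return regional_mask if cap_feats is positive_cap_feats else padding_mask
```

Using a single index map for rows and columns scatters all four sub-blocks at once, which is why the old "contiguous top-left block" assumption only held when no image padding was needed.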

@invocation

Adds `offload_text_encoders_to_idle_gpus` (default on): when more than one generation device is configured and a GPU is idle, a session's text/prompt encoder runs on the idle GPU instead of the one running its denoise pipeline. This avoids evicting the denoise model from VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Under full load (no idle GPU) behavior is unchanged.

Mechanism:
- New GENERATION_DEVICE_POOL arbiter (backend/util/device_pool.py) with a per-device exclusive-use lock. A native session blocking-acquires its own device's lock for the whole run; an encoder node try-borrows an idle device's lock for the duration of the node. This makes a borrowed encoder and a native session mutually exclusive on a GPU (preventing the shared-encoder corruption that produced garbled images) and is deadlock-free (borrows are non-blocking; a session only ever blocks on its own device).
- DefaultSessionRunner re-pins the worker thread to the borrowed device for the whole encoder node; conditioning is stored on the CPU and the denoiser picks it up on its own GPU afterward.
- Nodes opt in via @invocation(idle_gpu_offloadable=True), mirroring the existing `bottleneck` ClassVar marker. Applied to the text/prompt encoder nodes (compel + sdxl/refiner, flux, sd3, qwen-image, anima, cogview4, flux2 klein, z-image, flux_redux).

Inspired by #9310; supersedes it.

Tests: device-pool lock semantics, two concurrency regression tests asserting a session and a borrow never use a GPU at the same time, the runner offload context-manager behavior, and a marker-wiring check.

Docs: invokeai-yaml.mdx (config setting) and creating-nodes.mdx (how to support the feature in a node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
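The locking rules above (sessions block only on their own device, borrows never block) can be sketched with one `threading.Lock` per device. This is a minimal sketch under that assumption; DevicePoolSketch, session, try_borrow_idle and run_node are hypothetical names, not the API in backend/util/device_pool.py, and the real runner also re-pins the worker thread to the borrowed GPU, which here is reduced to passing the device to `run`.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


class DevicePoolSketch:
    """Per-device exclusive-use locks (hypothetical stand-in for GENERATION_DEVICE_POOL)."""

    def __init__(self, devices: Iterable[str]) -> None:
        self._locks = {device: threading.Lock() for device in devices}

    @contextmanager
    def session(self, device: str) -> Iterator[str]:
        # A native session blocks, but only ever on its own device.
        with self._locks[device]:
            yield device

    @contextmanager
    def try_borrow_idle(self, own_device: str) -> Iterator[Optional[str]]:
        # Non-blocking try-acquire: a borrow never waits, so it cannot deadlock.
        for device, lock in self._locks.items():
            if device != own_device and lock.acquire(blocking=False):
                try:
                    yield device
                finally:
                    lock.release()
                return
        yield None  # no idle GPU: the caller falls back to its own device


def run_node(pool: DevicePoolSketch, own_device: str, offloadable: bool, run: Callable[[str], T]) -> T:
    """Run one node, holding a borrowed idle GPU for the whole node if the node opted in."""
    if not offloadable:
        return run(own_device)
    with pool.try_borrow_idle(own_device) as borrowed:
        return run(borrowed if borrowed is not None else own_device)
```

Because a borrow holds the idle device's lock for the whole node, a native session starting on that GPU simply waits for the encoder to finish, which is the mutual exclusion the commit relies on; under full load every try-acquire fails and the encoder runs on the session's own GPU.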

lstein · 2026-06-28T23:34:23Z

Closing: this was opened against a stale same-named branch on the upstream repo, producing an incorrect diff. #9263's actual head lives on the fork (lstein/InvokeAI). The correctly-based, properly-stacked PR is lstein#137.

weblate and others added 30 commits February 28, 2026 15:09

Fix: Replace deprecated huggingface_hub.get_token_permission() with whoami() (#8913)

ec46b5c

`get_token_permission` is deprecated and will be removed in huggingface_hub 1.0. Use `whoami()` to validate the token instead, as recommended by the deprecation warning.

Fix(MM): Fixed incorrect advertised model size for Z-Image Turbo (#8934)

445c6a3

fix(model-install): persist remote access_token for resume after restart (#8932)

6fe7910

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

Added SQL injection tests (#8873)

62b7c7a

* Added SQL injection tests
* Updated tests after multi-user merge
* ruff:format

---------

Co-authored-by: Lincoln Stein <lincoln.stein@gmail.com>

docs: Fix typo in README.md - 'easy' should be 'ease' (#8948)

2179d93

Co-authored-by: Contributor <contributor@example.com>

docs: Fix typo in contributing guide - remove extra 'the' (#8949)

f01cbd3

Co-authored-by: Contributor <contributor@example.com>

Fix/model cache Qwen/CogView4 cancel repair (#8959)

dc5007f

* Repair partially loaded Qwen models after cancel to avoid device mismatches
* ruff
* Repair CogView4 text encoder after canceled partial loads
* Avoid MPS CI crash in repair regression test
* Fix MPS device assertion in repair test

Fix(UI): Replace boolean submenu icon with PiIntersectSquareBold (#8962)

17da6bb

* change submenu icon to phosphor
* Use PiIntersectSquareBold

Chore: Bump version to 6.12.0 (#8981)

438515b

* chore: bump version to 6.12.0
* chore: update What's New text

lstein and others added 13 commits June 25, 2026 21:40

Merge branch 'main' into lstein/feat/multi-gpu

b275c38

lstein requested review from blessedcoolant and hipsterusername as code owners June 28, 2026 23:29

lstein closed this Jun 28, 2026

lstein deleted the lstein/feat/multi-gpu-use-idle branch June 28, 2026 23:34


feat(multi-gpu): offload text encoders to idle GPUs #9311

lstein wants to merge 7269 commits into lstein/feat/multi-gpu from lstein/feat/multi-gpu-use-idle


20 participants
