lstein · lstein · Jun 28, 2026
diff --git a/docs/src/content/docs/configuration/invokeai-yaml.mdx b/docs/src/content/docs/configuration/invokeai-yaml.mdx
@@ -147,6 +147,27 @@ Notes:
 
 During parallel generation, the progress display shows one progress bar per active session, stacked vertically, each disappearing as its session completes.
 
+#### Text Encoder Offload to Idle GPUs
+
+When more than one GPU is configured for generation but not all of them are busy, InvokeAI can run a session's text/prompt encoder on a currently-idle GPU instead of the GPU running its denoise pipeline. This avoids evicting the denoise model from VRAM just to make room for the encoder, and lets the cached encoder be reused across generations, so repeated generations spend less time reloading models.
+
+This is controlled by the `offload_text_encoders_to_idle_gpus` setting:
+
+```yaml
+offload_text_encoders_to_idle_gpus: true # default value
+```
+
+| Value   | Behavior                                                                                                          |
+| ------- | ---------------------------------------------------------------------------------------------------------------- |
+| `true`  | Run text encoders on an idle GPU when one is available. This is the default.                                      |
+| `false` | Always run text encoders on the same GPU as the rest of the pipeline (the behavior before this feature existed).  |
+
+Notes:
+
+- This has no effect unless at least two `generation_devices` are configured. On a single device — or when every GPU is already busy with its own session — encoders run on the session's own GPU, exactly as if the setting were `false`.
+- It is purely a placement optimization and does not change generated images.
+- A borrowed GPU is held exclusively for the encoder while it runs, so a generation session cannot run on that same GPU at the same time.
+
 #### Image Subfolder Strategy
 
 By default, generated images are stored in a single flat directory under `outputs/images/`. The `image_subfolder_strategy` setting lets you organize newly-created images into subfolders automatically. You can edit this setting in `invokeai.yaml` or, as an admin user, in the Settings panel.
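
Reviewer note: the placement rule described in the "Text Encoder Offload to Idle GPUs" notes above can be sketched as a small pure function. This is an illustration only, not InvokeAI code: `pick_encoder_device`, its parameters, and the use of plain strings for devices are all assumptions made for the example. The real logic lives in the session processor and also requires the session's own device to be a CUDA device.

```python
from typing import Sequence, Set


def pick_encoder_device(
    session_device: str,
    generation_devices: Sequence[str],
    busy_devices: Set[str],
    offload_enabled: bool = True,
) -> str:
    """Return the device a session's text encoder would run on (hypothetical helper)."""
    # With the setting off, or fewer than two generation devices, nothing changes.
    if not offload_enabled or len(generation_devices) < 2:
        return session_device
    # Otherwise prefer any other configured device that is not busy.
    for device in generation_devices:
        if device != session_device and device not in busy_devices:
            return device
    # Every other GPU is busy: fall back to the session's own GPU.
    return session_device


# One session on cuda:0 while cuda:1 is idle: the encoder moves to cuda:1.
print(pick_encoder_device("cuda:0", ["cuda:0", "cuda:1"], busy_devices={"cuda:0"}))
```

The fallback branches mirror the notes: a single device, a disabled setting, or a fully loaded machine all leave the encoder on the session's own GPU.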

diff --git a/docs/src/content/docs/development/Guides/creating-nodes.mdx b/docs/src/content/docs/development/Guides/creating-nodes.mdx
@@ -21,6 +21,39 @@ import { Steps, LinkCard } from '@astrojs/starlight/components';
   4. A maintainer will review the pull request and node. If the node is aligned with the direction of the project, you may be asked for permission to include it in the core project.
 </Steps>
 
+### Supporting multi-GPU text-encoder offload
+
+On a machine with more than one GPU, InvokeAI can run several generation sessions at once — one per GPU. When fewer sessions are running than there are GPUs, the spare GPUs sit idle. To put that capacity to use, InvokeAI can run a session's **prompt/text encoder** on a currently-idle GPU instead of on the GPU running the denoise pipeline. This avoids evicting the denoise model from VRAM just to make room for the encoder, and lets the cached encoder be reused across generations.
+
+This is controlled globally by the `offload_text_encoders_to_idle_gpus` config setting (enabled by default) and opted into **per node** via the `@invocation` decorator:
+
+```python
+from invokeai.app.invocations.baseinvocation import BaseInvocation, invocation
+
+
+@invocation(
+    "my_text_encoder",
+    title="Prompt - My Model",
+    category="conditioning",
+    version="1.0.0",
+    idle_gpu_offloadable=True,  # opt in to idle-GPU offload
+)
+class MyTextEncoderInvocation(BaseInvocation):
+    ...
+```
+
+When the feature is enabled and an idle GPU is available, the **entire node** is temporarily re-pinned to a borrowed idle GPU: any model it loads goes onto that GPU and runs there. If no idle GPU is free (e.g. every GPU is busy with its own session), the node simply runs on its own GPU, unchanged. The borrow holds the idle GPU exclusively for the duration of the node, so it can never run concurrently against a native session on that same GPU.
+
+Because the whole node is moved to another device, only mark a node `idle_gpu_offloadable=True` if **all** of the following hold:
+
+- **It is encoder-only.** Its sole GPU work is loading one or more encoder models and running their forward pass. It must not load or run the denoise/transformer or VAE, or do any other work tied to the session's own GPU.
+- **It stores its result on the CPU before returning.** Move output tensors to the CPU (`tensor.detach().to("cpu")`) and save them as conditioning/tensors. The denoiser picks them up and moves them onto its own GPU later — this is what makes the cross-GPU handoff safe and device-agnostic.
+- **It places inputs on the loaded model's device, not a fixed device.** Resolve the device from the model you just loaded (e.g. `get_effective_device(model)` from `invokeai.backend.model_manager.load.model_cache.utils`, or `TorchDevice.choose_torch_device()`), rather than hard-coding `cuda:0`. The built-in `flux_text_encoder` and `compel` nodes are good references.
+
+:::caution[Only mark encoder-only nodes]
+If a node that also runs the denoiser, VAE, or other session-GPU work is marked `idle_gpu_offloadable=True`, that work will be re-pinned to the wrong GPU and can misplace tensors or raise device-mismatch errors. When in doubt, leave it unset (the default is `False`) — the node will still work correctly, just without the offload optimization.
+:::
+
 ### Community Node Template
 
 Append the following template to your pull request and the [Community Nodes](../../../workflows/community-nodes) page when submitting a node to be added to the community nodes list:
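
Reviewer note: the per-node opt-in described in "Supporting multi-GPU text-encoder offload" above comes down to a class-level flag that defaults to `False` and is overwritten by the decorator. The sketch below shows that pattern in isolation. `BaseNode` and `node` are hypothetical stand-ins for `BaseInvocation` and `@invocation`, which do considerably more than this.

```python
from typing import Callable, ClassVar, Type, TypeVar


class BaseNode:
    """Hypothetical stand-in for BaseInvocation."""

    node_type: ClassVar[str] = ""
    # Off unless a node opts in through the decorator.
    idle_gpu_offloadable: ClassVar[bool] = False


T = TypeVar("T", bound=BaseNode)


def node(node_type: str, idle_gpu_offloadable: bool = False) -> Callable[[Type[T]], Type[T]]:
    """Hypothetical stand-in for @invocation: copy the arguments onto the class."""

    def wrapper(cls: Type[T]) -> Type[T]:
        cls.node_type = node_type
        cls.idle_gpu_offloadable = idle_gpu_offloadable
        return cls

    return wrapper


@node("my_text_encoder", idle_gpu_offloadable=True)
class MyTextEncoder(BaseNode):
    pass


@node("my_denoiser")  # flag left unset, so the node stays on its own GPU
class MyDenoiser(BaseNode):
    pass


print(MyTextEncoder.idle_gpu_offloadable, MyDenoiser.idle_gpu_offloadable)
```

Because the flag is a class attribute rather than an input field, the runner can check it on any node without inspecting the node's inputs.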

diff --git a/docs/src/generated/settings.json b/docs/src/generated/settings.json
@@ -501,6 +501,17 @@
       "type": "typing.Union[typing.Literal['auto'], list[str]]",
       "validation": {}
     },
+    {
+      "category": "DEVICE",
+      "default": true,
+      "description": "When running on multiple GPUs, load text encoders onto a currently-idle GPU instead of the one running the denoise pipeline. This avoids churning the denoise model in and out of VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Has no effect unless at least two `generation_devices` are configured and a GPU is idle; under full load encoders run on the session's own GPU as before.",
+      "env_var": "INVOKEAI_OFFLOAD_TEXT_ENCODERS_TO_IDLE_GPUS",
+      "literal_values": [],
+      "name": "offload_text_encoders_to_idle_gpus",
+      "required": false,
+      "type": "<class 'bool'>",
+      "validation": {}
+    },
     {
       "category": "DEVICE",
       "default": "auto",

@@ -59,6 +59,7 @@
     category="conditioning",
     version="1.4.0",
     classification=Classification.Prototype,
+    idle_gpu_offloadable=True,
 )
 class AnimaTextEncoderInvocation(BaseInvocation):
     """Encodes and preps a prompt for an Anima image.

@@ -271,6 +271,12 @@ def invoke_internal(self, context: InvocationContext, services: "InvocationServi
 
     bottleneck: ClassVar[Bottleneck]
 
+    idle_gpu_offloadable: ClassVar[bool] = False
+    """Whether this node's entire execution may be temporarily re-pinned to an idle GPU when
+    `offload_text_encoders_to_idle_gpus` is enabled in multi-GPU mode. Only set this to True on nodes
+    that exclusively load encoder model(s), run a forward pass, and store their result on the CPU —
+    i.e. nodes that do no work tied to the session's own GPU. Set via the `@invocation` decorator."""
+
     UIConfig: ClassVar[UIConfigBase]
 
     model_config = ConfigDict(
@@ -459,6 +465,7 @@ def get_output_for_type(cls, output_type: str) -> type[BaseInvocationOutput] | N
     "type",
     "workflow",
     "bottleneck",
+    "idle_gpu_offloadable",
 }
 
 RESERVED_INPUT_FIELD_NAMES = {"metadata", "board"}
@@ -643,6 +650,7 @@ def invocation(
     use_cache: Optional[bool] = True,
     classification: Classification = Classification.Stable,
     bottleneck: Bottleneck = Bottleneck.GPU,
+    idle_gpu_offloadable: bool = False,
 ) -> Callable[[Type[TBaseInvocation]], Type[TBaseInvocation]]:
     """
     Registers an invocation.
@@ -655,6 +663,7 @@ def invocation(
     :param Optional[bool] use_cache: Whether or not to use the invocation cache. Defaults to True. The user may override this in the workflow editor.
     :param Classification classification: The classification of the invocation. Defaults to FeatureClassification.Stable. Use Beta or Prototype if the invocation is unstable.
     :param Bottleneck bottleneck: The bottleneck of the invocation. Defaults to Bottleneck.GPU. Use Network if the invocation is network-bound.
+    :param bool idle_gpu_offloadable: Whether this node's whole execution may run on a borrowed idle GPU when `offload_text_encoders_to_idle_gpus` is enabled. Only set True for encoder-only nodes that store their result on the CPU and do no work on the session's own GPU. Defaults to False.
     """
 
     def wrapper(cls: Type[TBaseInvocation]) -> Type[TBaseInvocation]:
@@ -712,6 +721,7 @@ def wrapper(cls: Type[TBaseInvocation]) -> Type[TBaseInvocation]:
             cls.model_fields["use_cache"].default = use_cache
 
         cls.bottleneck = bottleneck
+        cls.idle_gpu_offloadable = idle_gpu_offloadable
 
         # Add the invocation type to the model.
 

@@ -23,6 +23,7 @@
     category="prompt",
     version="1.0.0",
     classification=Classification.Prototype,
+    idle_gpu_offloadable=True,
 )
 class CogView4TextEncoderInvocation(BaseInvocation):
     """Encodes and preps a prompt for a cogview4 image."""

@@ -45,6 +45,7 @@
     tags=["prompt", "compel"],
     category="prompt",
     version="1.2.1",
+    idle_gpu_offloadable=True,
 )
 class CompelInvocation(BaseInvocation):
     """Parse prompt using compel package to conditioning."""
@@ -250,6 +251,7 @@ def _lora_loader() -> Iterator[Tuple[ModelPatchRaw, float]]:
     tags=["sdxl", "compel", "prompt"],
     category="prompt",
     version="1.2.1",
+    idle_gpu_offloadable=True,
 )
 class SDXLCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase):
     """Parse prompt using compel package to conditioning."""
@@ -344,6 +346,7 @@ def invoke(self, context: InvocationContext) -> ConditioningOutput:
     tags=["sdxl", "compel", "prompt"],
     category="prompt",
     version="1.1.2",
+    idle_gpu_offloadable=True,
 )
 class SDXLRefinerCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase):
     """Parse prompt using compel package to conditioning."""

@@ -48,6 +48,7 @@
     category="prompt",
     version="1.1.1",
     classification=Classification.Prototype,
+    idle_gpu_offloadable=True,
 )
 class Flux2KleinTextEncoderInvocation(BaseInvocation):
     """Encodes and preps a prompt for Flux2 Klein image generation.

@@ -50,6 +50,7 @@ class FluxReduxOutput(BaseInvocationOutput):
     category="conditioning",
     version="2.1.0",
     classification=Classification.Beta,
+    idle_gpu_offloadable=True,
 )
 class FluxReduxInvocation(BaseInvocation):
     """Runs a FLUX Redux model to generate a conditioning tensor."""

@@ -30,6 +30,7 @@
     tags=["prompt", "conditioning", "flux"],
     category="prompt",
     version="1.1.2",
+    idle_gpu_offloadable=True,
 )
 class FluxTextEncoderInvocation(BaseInvocation):
     """Encodes and preps a prompt for a flux image."""

@@ -68,6 +68,7 @@ def _build_prompt(user_prompt: str, num_images: int) -> str:
     category="conditioning",
     version="1.2.0",
     classification=Classification.Prototype,
+    idle_gpu_offloadable=True,
 )
 class QwenImageTextEncoderInvocation(BaseInvocation):
     """Encodes text and reference images for Qwen Image using Qwen2.5-VL."""

@@ -33,6 +33,7 @@
     tags=["prompt", "conditioning", "sd3"],
     category="prompt",
     version="1.0.1",
+    idle_gpu_offloadable=True,
 )
 class Sd3TextEncoderInvocation(BaseInvocation):
     """Encodes and preps a prompt for a SD3 image."""

@@ -37,6 +37,7 @@
     category="prompt",
     version="1.1.0",
     classification=Classification.Prototype,
+    idle_gpu_offloadable=True,
 )
 class ZImageTextEncoderInvocation(BaseInvocation):
     """Encodes and preps a prompt for a Z-Image image.

@@ -206,6 +206,7 @@ class InvokeAIAppConfig(BaseSettings):
     # DEVICE
     device:                      str = Field(default="auto",                description="Preferred execution device. `auto` will choose the device depending on the hardware platform and the installed torch capabilities.<br>Valid values: `auto`, `cpu`, `cuda`, `mps`, `cuda:N` (where N is a device number)", pattern=r"^(auto|cpu|mps|cuda(:\d+)?)$")
     generation_devices: Union[Literal["auto"], list[str]] = Field(default="auto", description="Devices to use for parallel generation. `auto` (the default) uses every available GPU, running one generation session per GPU concurrently and distributing jobs fairly across users. Provide an explicit list (e.g. `[cuda:0, cuda:1]`) to use specific devices, or a single-device list (e.g. `[cuda:0]`) to run serially. On systems without a GPU, `auto` resolves to the single `cpu`/`mps` device.<br>Valid values: `auto`, or a list whose entries are each `cpu`, `cuda`, `mps`, or `cuda:N` (where N is a device number)")
+    offload_text_encoders_to_idle_gpus: bool = Field(default=True,          description="When running on multiple GPUs, load text encoders onto a currently-idle GPU instead of the one running the denoise pipeline. This avoids churning the denoise model in and out of VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Has no effect unless at least two `generation_devices` are configured and a GPU is idle; under full load encoders run on the session's own GPU as before.")
     precision:                PRECISION = Field(default="auto",             description="Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.")
 
     # GENERATION

@@ -1,9 +1,9 @@
 import gc
 import traceback
-from contextlib import suppress
+from contextlib import contextmanager, suppress
 from threading import BoundedSemaphore, Thread
 from threading import Event as ThreadEvent
-from typing import Optional
+from typing import Iterator, Optional
 
 import torch
 
@@ -33,6 +33,7 @@
 from invokeai.app.services.shared.graph import NodeInputError
 from invokeai.app.services.shared.invocation_context import InvocationContextData, build_invocation_context
 from invokeai.app.util.profiler import Profiler
+from invokeai.backend.util.device_pool import GENERATION_DEVICE_POOL
 from invokeai.backend.util.devices import TorchDevice
 
 
@@ -129,8 +130,9 @@ def run_node(self, invocation: BaseInvocation, queue_item: SessionQueueItem):
                     is_canceled=self._is_canceled,
                 )
 
-                # Invoke the node
-                output = invocation.invoke_internal(context=context, services=self._services)
+                # Invoke the node, optionally on a borrowed idle GPU (only nodes marked idle_gpu_offloadable).
+                with self._maybe_offload_to_idle_gpu(invocation):
+                    output = invocation.invoke_internal(context=context, services=self._services)
                 # Save output and history
                 queue_item.session.complete(invocation.id, output)
 
@@ -156,6 +158,45 @@ def run_node(self, invocation: BaseInvocation, queue_item: SessionQueueItem):
                 error_traceback=error_traceback,
             )
 
+    @contextmanager
+    def _maybe_offload_to_idle_gpu(self, invocation: BaseInvocation) -> Iterator[None]:
+        """Temporarily re-pin this worker thread to an idle GPU for a text-encoder node.
+
+        When ``offload_text_encoders_to_idle_gpus`` is enabled and an idle generation GPU can be
+        borrowed, the encoder model loads into that GPU's cache and its forward runs there (all
+        device-selecting code resolves to the pinned device), keeping the busy GPU's denoise model
+        resident. The conditioning output is stored on the CPU, so the denoiser picks it up on the
+        worker's own GPU after the pin is restored.
+
+        The borrow holds the idle device's exclusive-use lock for the whole node, so a native
+        session on that GPU can never run concurrently against the same cached encoder (which would
+        corrupt it). If no idle GPU is free, the node runs on the worker's own GPU unchanged.
+        """
+        native_device = TorchDevice.get_session_device()
+        if (
+            native_device is None
+            or native_device.type != "cuda"
+            or not invocation.idle_gpu_offloadable
+            or not self._services.configuration.offload_text_encoders_to_idle_gpus
+        ):
+            yield
+            return
+
+        borrowed_device = GENERATION_DEVICE_POOL.try_borrow(exclude=native_device)
+        if borrowed_device is None:
+            yield
+            return
+
+        self._services.logger.debug(
+            f"Running {invocation.get_type()} on idle device {borrowed_device} (session device {native_device})."
+        )
+        TorchDevice.set_session_device(borrowed_device)
+        try:
+            yield
+        finally:
+            TorchDevice.set_session_device(native_device)
+            GENERATION_DEVICE_POOL.release_borrow(borrowed_device)
+
     def _on_before_run_session(self, queue_item: SessionQueueItem) -> None:
         """Called before a session is run.
 
@@ -388,6 +429,10 @@ def start(self, invoker: Invoker) -> None:
 
         devices = self._resolve_devices()
 
+        # Register the generation devices so the model loader can discover idle GPUs to host text
+        # encoders on (see offload_text_encoders_to_idle_gpus). None means legacy single-device mode.
+        GENERATION_DEVICE_POOL.set_generation_devices([d for d in devices if d is not None])
+
         # If profiling is enabled, create a profiler. The same profiler will be used for all sessions. Internally,
         # the profiler will create a new profile for each session. Profiling uses a process-global cProfile, which
         # cannot cleanly attribute work when multiple sessions run concurrently, so it is disabled in multi-GPU mode.
@@ -582,8 +627,16 @@ def _process(
                         f"on {worker.label}"
                     )
 
-                    # Run the graph
-                    worker.runner.run(queue_item=worker.queue_item)
+                    # Run the graph. Hold this GPU's exclusive-use lock for the whole session so no
+                    # other worker can borrow it for text-encoder offload while we're running on it
+                    # (a borrow + concurrent native session on one GPU would corrupt the shared
+                    # cached encoder). Acquired here, after dequeue, so an idle worker doesn't hold
+                    # the lock and block borrows while waiting for work.
+                    GENERATION_DEVICE_POOL.acquire_session(worker.device)
+                    try:
+                        worker.runner.run(queue_item=worker.queue_item)
+                    finally:
+                        GENERATION_DEVICE_POOL.release_session(worker.device)
 
                 except Exception as e:
                     error_type = e.__class__.__name__
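
Reviewer note: this diff calls `GENERATION_DEVICE_POOL` but does not show `invokeai/backend/util/device_pool.py`. The sketch below is one way the calls used above (`set_generation_devices`, `acquire_session`, `release_session`, `try_borrow`, `release_borrow`) could fit together with one exclusive lock per device. It is an assumption for discussion, not the actual implementation: `SketchDevicePool` and `borrowed_device` are hypothetical names, and devices are plain strings rather than `torch.device` objects.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class SketchDevicePool:
    """Hypothetical pool: one exclusive-use lock per generation device."""

    def __init__(self) -> None:
        self._locks: Dict[str, threading.Lock] = {}

    def set_generation_devices(self, devices: List[str]) -> None:
        self._locks = {device: threading.Lock() for device in devices}

    def acquire_session(self, device: str) -> None:
        # Blocks until any borrow of this device has been released.
        self._locks[device].acquire()

    def release_session(self, device: str) -> None:
        self._locks[device].release()

    def try_borrow(self, exclude: str) -> Optional[str]:
        # Non-blocking: take the first other device whose lock is free.
        for device, lock in self._locks.items():
            if device != exclude and lock.acquire(blocking=False):
                return device
        return None

    def release_borrow(self, device: str) -> None:
        self._locks[device].release()


@contextmanager
def borrowed_device(pool: SketchDevicePool, native: str) -> Iterator[str]:
    """Yield a borrowed idle device, or the native one if none is free."""
    borrowed = pool.try_borrow(exclude=native)
    try:
        yield borrowed if borrowed is not None else native
    finally:
        if borrowed is not None:
            pool.release_borrow(borrowed)


pool = SketchDevicePool()
pool.set_generation_devices(["cuda:0", "cuda:1"])
pool.acquire_session("cuda:0")  # a session is running on cuda:0
with borrowed_device(pool, "cuda:0") as device:
    print(device)  # cuda:1 is idle, so the encoder node would run there
pool.release_session("cuda:0")
```

In this sketch a session acquires its lock only while it runs, which matches the comment in `_process` about acquiring after dequeue so that an idle worker does not block borrows.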