21 changes: 21 additions & 0 deletions docs/src/content/docs/configuration/invokeai-yaml.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,27 @@ Notes:

During parallel generation, the progress display shows one progress bar per active session, stacked vertically, each disappearing as its session completes.

#### Text Encoder Offload to Idle GPUs

When more than one GPU is configured for generation but not all of them are busy, InvokeAI can run a session's text/prompt encoder on a currently idle GPU instead of the GPU running its denoise pipeline. This avoids evicting the denoise model from VRAM just to make room for the encoder, and it lets the cached encoder be reused across generations, so repeated generations spend less time reloading models.

This is controlled by the `offload_text_encoders_to_idle_gpus` setting:

```yaml
offload_text_encoders_to_idle_gpus: true # default value
```

| Value | Behavior |
| ------- | ---------------------------------------------------------------------------------------------------------------- |
| `true` | Run text encoders on an idle GPU when one is available. This is the default. |
| `false` | Always run text encoders on the same GPU as the rest of the pipeline (the behavior before this feature existed). |

Notes:

- This has no effect unless at least two `generation_devices` are configured. On a single device — or when every GPU is already busy with its own session — encoders run on the session's own GPU, exactly as if the setting were `false`.
- It is purely a placement optimization and does not change generated images.
- A borrowed GPU is held exclusively for the encoder while it runs, so a generation session never runs on that GPU at the same time as the borrowed encoder.
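
The placement rule in the notes above can be summarized in a few lines of Python. This is only an illustrative sketch: `pick_encoder_device` and its parameters are hypothetical names, not part of InvokeAI's API, and the sketch ignores the locking that the real scheduler performs.

```python
from typing import Sequence, Set


def pick_encoder_device(
    session_device: str,
    generation_devices: Sequence[str],
    busy_devices: Set[str],
    offload_enabled: bool = True,
) -> str:
    """Return the device a session's text encoder would run on (hypothetical helper)."""
    # With the setting off, or fewer than two devices, the encoder stays put.
    if not offload_enabled or len(generation_devices) < 2:
        return session_device
    # Otherwise use the first GPU that is neither ours nor busy with a session.
    for device in generation_devices:
        if device != session_device and device not in busy_devices:
            return device
    # Every other GPU is busy: same behavior as offload_enabled=False.
    return session_device


devices = ["cuda:0", "cuda:1"]
idle_case = pick_encoder_device("cuda:0", devices, busy_devices={"cuda:0"})
full_load_case = pick_encoder_device("cuda:0", devices, busy_devices={"cuda:0", "cuda:1"})
print(idle_case, full_load_case)
```

In the first call `cuda:1` is idle and hosts the encoder; in the second call every GPU is busy, so the encoder stays on the session's own `cuda:0`.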

#### Image Subfolder Strategy

By default, generated images are stored in a single flat directory under `outputs/images/`. The `image_subfolder_strategy` setting lets you organize newly-created images into subfolders automatically. You can edit this setting in `invokeai.yaml` or, as an admin user, in the Settings panel.
Expand Down
33 changes: 33 additions & 0 deletions docs/src/content/docs/development/Guides/creating-nodes.mdx
Expand Up @@ -21,6 +21,39 @@ import { Steps, LinkCard } from '@astrojs/starlight/components';
4. A maintainer will review the pull request and node. If the node is aligned with the direction of the project, you may be asked for permission to include it in the core project.
</Steps>

### Supporting multi-GPU text-encoder offload

On a machine with more than one GPU, InvokeAI can run several generation sessions at once — one per GPU. When fewer sessions are running than there are GPUs, the spare GPUs sit idle. To put that capacity to use, InvokeAI can run a session's **prompt/text encoder** on a currently-idle GPU instead of on the GPU running the denoise pipeline. This avoids evicting the denoise model from VRAM just to make room for the encoder, and lets the cached encoder be reused across generations.

This is controlled globally by the `offload_text_encoders_to_idle_gpus` config setting (enabled by default) and opted into **per node** via the `@invocation` decorator:

```python
from invokeai.app.invocations.baseinvocation import BaseInvocation, invocation


@invocation(
    "my_text_encoder",
    title="Prompt - My Model",
    category="conditioning",
    version="1.0.0",
    idle_gpu_offloadable=True,  # opt in to idle-GPU offload
)
class MyTextEncoderInvocation(BaseInvocation):
    ...
```

When the feature is enabled and an idle GPU is available, the **entire node** is temporarily re-pinned to a borrowed idle GPU: any model it loads goes onto that GPU and runs there. If no idle GPU is free (e.g. every GPU is busy with its own session), the node simply runs on its own GPU, unchanged. The borrow holds the idle GPU exclusively for the duration of the node, so it can never run concurrently against a native session on that same GPU.

Because the whole node is moved to another device, only mark a node `idle_gpu_offloadable=True` if **all** of the following hold:

- **It is encoder-only.** Its sole GPU work is loading one or more encoder models and running their forward pass. It must not load or run the denoise/transformer or VAE, or do any other work tied to the session's own GPU.
- **It stores its result on the CPU before returning.** Move output tensors to the CPU (`tensor.detach().to("cpu")`) and save them as conditioning/tensors. The denoiser picks them up and moves them onto its own GPU later — this is what makes the cross-GPU handoff safe and device-agnostic.
- **It places inputs on the loaded model's device, not a fixed device.** Resolve the device from the model you just loaded (e.g. `get_effective_device(model)` from `invokeai.backend.model_manager.load.model_cache.utils`, or `TorchDevice.choose_torch_device()`), rather than hard-coding `cuda:0`. The built-in `flux_text_encoder` and `compel` nodes are good references.

:::caution[Only mark encoder-only nodes]
If a node that also runs the denoiser, VAE, or other session-GPU work is marked `idle_gpu_offloadable=True`, that work will be re-pinned to the wrong GPU and can misplace tensors or raise device-mismatch errors. When in doubt, leave it unset (the default is `False`) — the node will still work correctly, just without the offload optimization.
:::
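
The three requirements above can be illustrated with plain PyTorch, independent of InvokeAI's node API. In this sketch, `TinyEncoder` and `encode_to_cpu` are hypothetical stand-ins. A real node would load its encoder through the invocation context and could resolve the device with `get_effective_device(model)` as described above, whereas the sketch reads the device from the model's parameters.

```python
import torch


class TinyEncoder(torch.nn.Module):
    """Hypothetical stand-in for a text encoder."""

    def __init__(self, vocab_size: int = 16, dim: int = 4) -> None:
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)


def encode_to_cpu(model: torch.nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    # Resolve the device from the loaded model instead of hard-coding "cuda:0".
    device = next(model.parameters()).device
    with torch.no_grad():
        embeddings = model(token_ids.to(device))
    # Hand the result off on the CPU so the denoiser can move it to its own GPU.
    return embeddings.detach().to("cpu")


encoder = TinyEncoder()  # lives on the CPU here; the same code works on any device
conditioning = encode_to_cpu(encoder, torch.tensor([[1, 2, 3]]))
print(conditioning.device, tuple(conditioning.shape))
```

Because the function never names a device itself, it behaves the same whether the model was loaded on the session's GPU or on a borrowed one.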

### Community Node Template

Append the following template to your pull request and the [Community Nodes](../../../workflows/community-nodes) page when submitting a node to be added to the community nodes list:
Expand Down
11 changes: 11 additions & 0 deletions docs/src/generated/settings.json
Expand Up @@ -501,6 +501,17 @@
"type": "typing.Union[typing.Literal['auto'], list[str]]",
"validation": {}
},
{
"category": "DEVICE",
"default": true,
"description": "When running on multiple GPUs, load text encoders onto a currently-idle GPU instead of the one running the denoise pipeline. This avoids churning the denoise model in and out of VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Has no effect unless at least two `generation_devices` are configured and a GPU is idle; under full load encoders run on the session's own GPU as before.",
"env_var": "INVOKEAI_OFFLOAD_TEXT_ENCODERS_TO_IDLE_GPUS",
"literal_values": [],
"name": "offload_text_encoders_to_idle_gpus",
"required": false,
"type": "<class 'bool'>",
"validation": {}
},
{
"category": "DEVICE",
"default": "auto",
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/invocations/anima_text_encoder.py
Expand Up @@ -59,6 +59,7 @@
category="conditioning",
version="1.4.0",
classification=Classification.Prototype,
idle_gpu_offloadable=True,
)
class AnimaTextEncoderInvocation(BaseInvocation):
"""Encodes and preps a prompt for an Anima image.
Expand Down
10 changes: 10 additions & 0 deletions invokeai/app/invocations/baseinvocation.py
Expand Up @@ -271,6 +271,12 @@ def invoke_internal(self, context: InvocationContext, services: "InvocationServi

bottleneck: ClassVar[Bottleneck]

idle_gpu_offloadable: ClassVar[bool] = False
"""Whether this node's entire execution may be temporarily re-pinned to an idle GPU when
`offload_text_encoders_to_idle_gpus` is enabled in multi-GPU mode. Only set this to True on nodes
that exclusively load encoder model(s), run a forward pass, and store their result on the CPU —
i.e. nodes that do no work tied to the session's own GPU. Set via the `@invocation` decorator."""

UIConfig: ClassVar[UIConfigBase]

model_config = ConfigDict(
Expand Down Expand Up @@ -459,6 +465,7 @@ def get_output_for_type(cls, output_type: str) -> type[BaseInvocationOutput] | N
"type",
"workflow",
"bottleneck",
"idle_gpu_offloadable",
}

RESERVED_INPUT_FIELD_NAMES = {"metadata", "board"}
Expand Down Expand Up @@ -643,6 +650,7 @@ def invocation(
use_cache: Optional[bool] = True,
classification: Classification = Classification.Stable,
bottleneck: Bottleneck = Bottleneck.GPU,
idle_gpu_offloadable: bool = False,
) -> Callable[[Type[TBaseInvocation]], Type[TBaseInvocation]]:
"""
Registers an invocation.
Expand All @@ -655,6 +663,7 @@ def invocation(
:param Optional[bool] use_cache: Whether or not to use the invocation cache. Defaults to True. The user may override this in the workflow editor.
:param Classification classification: The classification of the invocation. Defaults to FeatureClassification.Stable. Use Beta or Prototype if the invocation is unstable.
:param Bottleneck bottleneck: The bottleneck of the invocation. Defaults to Bottleneck.GPU. Use Network if the invocation is network-bound.
:param bool idle_gpu_offloadable: Whether this node's whole execution may run on a borrowed idle GPU when `offload_text_encoders_to_idle_gpus` is enabled. Only set True for encoder-only nodes that store their result on the CPU and do no work on the session's own GPU. Defaults to False.
"""

def wrapper(cls: Type[TBaseInvocation]) -> Type[TBaseInvocation]:
Expand Down Expand Up @@ -712,6 +721,7 @@ def wrapper(cls: Type[TBaseInvocation]) -> Type[TBaseInvocation]:
cls.model_fields["use_cache"].default = use_cache

cls.bottleneck = bottleneck
cls.idle_gpu_offloadable = idle_gpu_offloadable

# Add the invocation type to the model.

Expand Down
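
The plumbing for the new flag in `baseinvocation.py` follows a common pattern: a class-level default, a decorator argument, and an assignment onto the decorated class. Below is a stripped-down sketch of that pattern. `Node` and `node` are hypothetical stand-ins for `BaseInvocation` and `@invocation`, and the rest of the real decorator's work is omitted.

```python
from typing import Callable, ClassVar, Type, TypeVar


class Node:
    """Hypothetical stand-in for BaseInvocation."""

    idle_gpu_offloadable: ClassVar[bool] = False  # safe default: never offload


TNode = TypeVar("TNode", bound=Node)


def node(idle_gpu_offloadable: bool = False) -> Callable[[Type[TNode]], Type[TNode]]:
    """Hypothetical stand-in for the @invocation decorator."""

    def wrapper(cls: Type[TNode]) -> Type[TNode]:
        # Set the flag on the decorated class only; other classes keep the default.
        cls.idle_gpu_offloadable = idle_gpu_offloadable
        return cls

    return wrapper


@node(idle_gpu_offloadable=True)
class TextEncoderNode(Node):
    pass


@node()
class DenoiseNode(Node):
    pass


print(TextEncoderNode.idle_gpu_offloadable, DenoiseNode.idle_gpu_offloadable)
```

Keeping the default at `False` means existing nodes, including community nodes that never mention the flag, are unaffected.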
1 change: 1 addition & 0 deletions invokeai/app/invocations/cogview4_text_encoder.py
Expand Up @@ -23,6 +23,7 @@
category="prompt",
version="1.0.0",
classification=Classification.Prototype,
idle_gpu_offloadable=True,
)
class CogView4TextEncoderInvocation(BaseInvocation):
"""Encodes and preps a prompt for a cogview4 image."""
Expand Down
3 changes: 3 additions & 0 deletions invokeai/app/invocations/compel.py
Expand Up @@ -45,6 +45,7 @@
tags=["prompt", "compel"],
category="prompt",
version="1.2.1",
idle_gpu_offloadable=True,
)
class CompelInvocation(BaseInvocation):
"""Parse prompt using compel package to conditioning."""
Expand Down Expand Up @@ -250,6 +251,7 @@ def _lora_loader() -> Iterator[Tuple[ModelPatchRaw, float]]:
tags=["sdxl", "compel", "prompt"],
category="prompt",
version="1.2.1",
idle_gpu_offloadable=True,
)
class SDXLCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase):
"""Parse prompt using compel package to conditioning."""
Expand Down Expand Up @@ -344,6 +346,7 @@ def invoke(self, context: InvocationContext) -> ConditioningOutput:
tags=["sdxl", "compel", "prompt"],
category="prompt",
version="1.1.2",
idle_gpu_offloadable=True,
)
class SDXLRefinerCompelPromptInvocation(BaseInvocation, SDXLPromptInvocationBase):
"""Parse prompt using compel package to conditioning."""
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/invocations/flux2_klein_text_encoder.py
Expand Up @@ -48,6 +48,7 @@
category="prompt",
version="1.1.1",
classification=Classification.Prototype,
idle_gpu_offloadable=True,
)
class Flux2KleinTextEncoderInvocation(BaseInvocation):
"""Encodes and preps a prompt for Flux2 Klein image generation.
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/invocations/flux_redux.py
Expand Up @@ -50,6 +50,7 @@ class FluxReduxOutput(BaseInvocationOutput):
category="conditioning",
version="2.1.0",
classification=Classification.Beta,
idle_gpu_offloadable=True,
)
class FluxReduxInvocation(BaseInvocation):
"""Runs a FLUX Redux model to generate a conditioning tensor."""
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/invocations/flux_text_encoder.py
Expand Up @@ -30,6 +30,7 @@
tags=["prompt", "conditioning", "flux"],
category="prompt",
version="1.1.2",
idle_gpu_offloadable=True,
)
class FluxTextEncoderInvocation(BaseInvocation):
"""Encodes and preps a prompt for a flux image."""
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/invocations/qwen_image_text_encoder.py
Expand Up @@ -68,6 +68,7 @@ def _build_prompt(user_prompt: str, num_images: int) -> str:
category="conditioning",
version="1.2.0",
classification=Classification.Prototype,
idle_gpu_offloadable=True,
)
class QwenImageTextEncoderInvocation(BaseInvocation):
"""Encodes text and reference images for Qwen Image using Qwen2.5-VL."""
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/invocations/sd3_text_encoder.py
Expand Up @@ -33,6 +33,7 @@
tags=["prompt", "conditioning", "sd3"],
category="prompt",
version="1.0.1",
idle_gpu_offloadable=True,
)
class Sd3TextEncoderInvocation(BaseInvocation):
"""Encodes and preps a prompt for a SD3 image."""
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/invocations/z_image_text_encoder.py
Expand Up @@ -37,6 +37,7 @@
category="prompt",
version="1.1.0",
classification=Classification.Prototype,
idle_gpu_offloadable=True,
)
class ZImageTextEncoderInvocation(BaseInvocation):
"""Encodes and preps a prompt for a Z-Image image.
Expand Down
1 change: 1 addition & 0 deletions invokeai/app/services/config/config_default.py
Expand Up @@ -206,6 +206,7 @@ class InvokeAIAppConfig(BaseSettings):
# DEVICE
device: str = Field(default="auto", description="Preferred execution device. `auto` will choose the device depending on the hardware platform and the installed torch capabilities.<br>Valid values: `auto`, `cpu`, `cuda`, `mps`, `cuda:N` (where N is a device number)", pattern=r"^(auto|cpu|mps|cuda(:\d+)?)$")
generation_devices: Union[Literal["auto"], list[str]] = Field(default="auto", description="Devices to use for parallel generation. `auto` (the default) uses every available GPU, running one generation session per GPU concurrently and distributing jobs fairly across users. Provide an explicit list (e.g. `[cuda:0, cuda:1]`) to use specific devices, or a single-device list (e.g. `[cuda:0]`) to run serially. On systems without a GPU, `auto` resolves to the single `cpu`/`mps` device.<br>Valid values: `auto`, or a list whose entries are each `cpu`, `cuda`, `mps`, or `cuda:N` (where N is a device number)")
offload_text_encoders_to_idle_gpus: bool = Field(default=True, description="When running on multiple GPUs, load text encoders onto a currently-idle GPU instead of the one running the denoise pipeline. This avoids churning the denoise model in and out of VRAM to make room for the encoder, and lets a cached encoder be reused across generations. Has no effect unless at least two `generation_devices` are configured and a GPU is idle; under full load encoders run on the session's own GPU as before.")
precision: PRECISION = Field(default="auto", description="Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.")

# GENERATION
Expand Down
@@ -1,9 +1,9 @@
import gc
import traceback
-from contextlib import suppress
+from contextlib import contextmanager, suppress
from threading import BoundedSemaphore, Thread
from threading import Event as ThreadEvent
-from typing import Optional
+from typing import Iterator, Optional

import torch

Expand Down Expand Up @@ -33,6 +33,7 @@
from invokeai.app.services.shared.graph import NodeInputError
from invokeai.app.services.shared.invocation_context import InvocationContextData, build_invocation_context
from invokeai.app.util.profiler import Profiler
from invokeai.backend.util.device_pool import GENERATION_DEVICE_POOL
from invokeai.backend.util.devices import TorchDevice


Expand Down Expand Up @@ -129,8 +130,9 @@ def run_node(self, invocation: BaseInvocation, queue_item: SessionQueueItem):
is_canceled=self._is_canceled,
)

- # Invoke the node
- output = invocation.invoke_internal(context=context, services=self._services)
+ # Invoke the node, optionally on a borrowed idle GPU (text encoders only).
+ with self._maybe_offload_to_idle_gpu(invocation):
+     output = invocation.invoke_internal(context=context, services=self._services)
# Save output and history
queue_item.session.complete(invocation.id, output)

Expand All @@ -156,6 +158,45 @@ def run_node(self, invocation: BaseInvocation, queue_item: SessionQueueItem):
error_traceback=error_traceback,
)

    @contextmanager
    def _maybe_offload_to_idle_gpu(self, invocation: BaseInvocation) -> Iterator[None]:
        """Temporarily re-pin this worker thread to an idle GPU for a text-encoder node.

        When ``offload_text_encoders_to_idle_gpus`` is enabled and an idle generation GPU can be
        borrowed, the encoder model loads into that GPU's cache and its forward runs there (all
        device-selecting code resolves to the pinned device), keeping the busy GPU's denoise model
        resident. The conditioning output is stored on the CPU, so the denoiser picks it up on the
        worker's own GPU after the pin is restored.

        The borrow holds the idle device's exclusive-use lock for the whole node, so a native
        session on that GPU can never run concurrently against the same cached encoder (which would
        corrupt it). If no idle GPU is free, the node runs on the worker's own GPU unchanged.
        """
        native_device = TorchDevice.get_session_device()
        if (
            native_device is None
            or native_device.type != "cuda"
            or not invocation.idle_gpu_offloadable
            or not self._services.configuration.offload_text_encoders_to_idle_gpus
        ):
            yield
            return

        borrowed_device = GENERATION_DEVICE_POOL.try_borrow(exclude=native_device)
        if borrowed_device is None:
            yield
            return

        self._services.logger.debug(
            f"Running {invocation.get_type()} on idle device {borrowed_device} (session device {native_device})."
        )
        TorchDevice.set_session_device(borrowed_device)
        try:
            yield
        finally:
            TorchDevice.set_session_device(native_device)
            GENERATION_DEVICE_POOL.release_borrow(borrowed_device)
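
The core of the method above is a pin, yield, restore pattern. A minimal standalone sketch of that pattern follows. It assumes the session device is stored per thread, which this diff does not show, and the `_session` object and the `pinned_to` helper are hypothetical names rather than InvokeAI code.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, Optional

_session = threading.local()  # hypothetical per-thread device pin


def get_session_device() -> Optional[str]:
    return getattr(_session, "device", None)


def set_session_device(device: Optional[str]) -> None:
    _session.device = device


@contextmanager
def pinned_to(device: str) -> Iterator[None]:
    """Temporarily pin the current thread to `device`, restoring the old pin on exit."""
    previous = get_session_device()
    set_session_device(device)
    try:
        yield
    finally:
        # Runs even if the body raises, so the worker never stays on the borrowed GPU.
        set_session_device(previous)


set_session_device("cuda:0")
with pinned_to("cuda:1"):
    inside = get_session_device()
print(inside, get_session_device())
```

Doing the restore in `finally` is what keeps a failed encoder node from leaving the worker pinned to the wrong device for the rest of the session.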

def _on_before_run_session(self, queue_item: SessionQueueItem) -> None:
"""Called before a session is run.

Expand Down Expand Up @@ -388,6 +429,10 @@ def start(self, invoker: Invoker) -> None:

devices = self._resolve_devices()

# Register the generation devices so the model loader can discover idle GPUs to host text
# encoders on (see offload_text_encoders_to_idle_gpus). None means legacy single-device mode.
GENERATION_DEVICE_POOL.set_generation_devices([d for d in devices if d is not None])

# If profiling is enabled, create a profiler. The same profiler will be used for all sessions. Internally,
# the profiler will create a new profile for each session. Profiling uses a process-global cProfile, which
# cannot cleanly attribute work when multiple sessions run concurrently, so it is disabled in multi-GPU mode.
Expand Down Expand Up @@ -582,8 +627,16 @@ def _process(
f"on {worker.label}"
)

- # Run the graph
- worker.runner.run(queue_item=worker.queue_item)
+ # Run the graph. Hold this GPU's exclusive-use lock for the whole session so no
+ # other worker can borrow it for text-encoder offload while we're running on it
+ # (a borrow + concurrent native session on one GPU would corrupt the shared
+ # cached encoder). Acquired here, after dequeue, so an idle worker doesn't hold
+ # the lock and block borrows while waiting for work.
+ GENERATION_DEVICE_POOL.acquire_session(worker.device)
+ try:
+     worker.runner.run(queue_item=worker.queue_item)
+ finally:
+     GENERATION_DEVICE_POOL.release_session(worker.device)

except Exception as e:
error_type = e.__class__.__name__
Expand Down
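
The implementation of `GENERATION_DEVICE_POOL` is not part of the hunks shown here. As a rough mental model only, the `acquire_session`/`release_session` and `try_borrow`/`release_borrow` calls are consistent with one exclusive lock per device. The `DevicePool` class below is a hypothetical sketch under that assumption. It uses strings in place of `torch.device` objects and is not the actual code.

```python
import threading
from typing import Dict, Iterable, Optional


class DevicePool:
    """Hypothetical pool with one exclusive-use lock per generation device."""

    def __init__(self) -> None:
        self._locks: Dict[str, threading.Lock] = {}

    def set_generation_devices(self, devices: Iterable[str]) -> None:
        self._locks = {device: threading.Lock() for device in devices}

    def acquire_session(self, device: str) -> None:
        # Blocks until any in-flight borrow of this device has finished.
        self._locks[device].acquire()

    def release_session(self, device: str) -> None:
        self._locks[device].release()

    def try_borrow(self, exclude: str) -> Optional[str]:
        # Non-blocking: take the first device whose lock is free, or give up.
        for device, lock in self._locks.items():
            if device != exclude and lock.acquire(blocking=False):
                return device
        return None

    def release_borrow(self, device: str) -> None:
        self._locks[device].release()


pool = DevicePool()
pool.set_generation_devices(["cuda:0", "cuda:1"])
pool.acquire_session("cuda:0")                # a session starts on cuda:0
borrowed = pool.try_borrow(exclude="cuda:0")  # cuda:1 is idle, so it can be borrowed
second = pool.try_borrow(exclude="cuda:0")    # nothing else is free
pool.release_borrow("cuda:1")
pool.release_session("cuda:0")
print(borrowed, second)
```

A blocking acquire for sessions paired with a non-blocking acquire for borrows matches the behavior described in the comments above: a session waits for a short-lived borrow to finish, while a borrower never waits and simply falls back to its own GPU.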