fix(qwen): estimate VAE working memory so the cache frees room before decode/encode #9305

Open
lstein wants to merge 1 commit into main from fix-qwen-vae-working-memory

@lstein lstein commented Jun 26, 2026

Collaborator

Summary

The Qwen Image qwen_image_l2i (decode) and qwen_image_i2l (encode) invocations called model_on_device() without a working_mem_bytes estimate — unlike the SD/SDXL l2i path, which calls estimate_vae_working_memory_sd15_sdxl(...). As a result, the model cache only reserved the default device_working_mem_gb and never evicted the resident transformer / text encoder before the VAE decode.

On a near-full card this OOMs. Reproduced with Qwen Image Edit 2511 (Q8_0) + the standard Qwen Image VAE on a 48 GB AMD W7900: with the transformer (~20.7 GB) and text encoder (~15.8 GB) resident, the autoencoder decode tried to allocate ~5 GiB into the fragmented ~8 GiB remainder and failed:

CUDA out of memory. Tried to allocate 5.01 GiB. GPU 0 has a total capacity of 44.98 GiB
of which 3.69 GiB is free. ... 2.48 GiB is reserved by PyTorch but unallocated.

Root cause

ModelCache._load_locked_model() computes vram_available = free_vram − working_mem and only evicts other models when that drops below what the locked model needs. The VAE is tiny (~242 MB) and already resident, so model_vram_needed ≈ 0 and nothing is ever evicted — the big transformer/text encoder stay put and the decode is squeezed into whatever fragmented VRAM is left.

Passing a realistic working_mem_bytes lets the cache make room (evicting other models) before the operation runs, which is exactly what the SD/SDXL path already does.
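
The eviction rule described above can be sketched as follows. This is a simplified model, not the actual `ModelCache` code: the function name `vram_to_free` is hypothetical, the 3 GiB default reserve is an assumed placeholder for `device_working_mem_gb`, and the real cache evicts whole models rather than an exact byte count.

```python
GIB = 1024**3


def vram_to_free(free_vram: int, working_mem: int, model_vram_needed: int) -> int:
    """Minimum bytes the cache must evict before running the locked model.

    Hypothetical helper mirroring the rule in the text:
    vram_available = free_vram - working_mem, and eviction only happens
    when vram_available is below what the locked model still needs.
    """
    vram_available = free_vram - working_mem
    return max(0, model_vram_needed - vram_available)


free = 8 * GIB            # roughly the remainder on the tight card
vae_needed = 0            # the VAE is already resident
default_reserve = 3 * GIB # assumed default reserve, for illustration only

# With only the default reserve, nothing is evicted.
print(vram_to_free(free, default_reserve, vae_needed) / GIB)

# With a realistic ~10.64 GiB working-memory estimate, the cache has to free room.
print(vram_to_free(free, int(10.64 * GIB), vae_needed) / GIB)
```

The first call returns zero because the reserve fits inside free VRAM; the second returns the shortfall between the estimate and free VRAM, which is what forces an eviction.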

Fix

  • Add estimate_vae_working_memory_qwen_image() in vae_working_memory.py.
  • Pass the estimate into model_on_device(working_mem_bytes=...) in both the decode and encode invocations.

Calibration

The estimate is calibrated against a measured decode on a W7900. At 1248×832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082); rounded up to 5500 for headroom. The current SD constant (2200) under-modeled this heavier video-style VAE by ~2.4×.

The constant intentionally tracks peak reserved (not just allocated) memory. The cache's guarantee is "if it doesn't evict, then free ≥ estimate," so the estimate must be ≥ the decode's true reserved footprint. This closes the danger zone where the cache would skip eviction yet the decode would still reserve more than the free VRAM:

  • Tight card (transformer + text encoder resident, ~8 GiB free): estimate ~10.6 GiB > free → cache evicts the text encoder → ~24 GiB clean headroom → decode succeeds.
  • Roomy card (free ≥ estimate): no eviction, but free already ≥ the decode's need → fits.
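
As a rough illustration of the arithmetic, the sketch below assumes the estimate has the form height × width × element size × constant, with a 2-byte element (fp16/bf16) and pixel-space dimensions. The function name `estimate_qwen_vae_working_memory` and its signature are hypothetical stand-ins, not the signature of the function added in `vae_working_memory.py`.

```python
GIB = 1024**3

DECODE_CONSTANT = 5500  # rounded up from the measured decode, per the calibration above
ENCODE_CONSTANT = 2750  # "half of decode" convention, not independently measured


def estimate_qwen_vae_working_memory(
    height: int, width: int, element_size: int = 2, operation: str = "decode"
) -> int:
    """Assumed linear model: bytes = height * width * element_size * constant."""
    if operation not in ("decode", "encode"):
        raise ValueError(f"unknown operation: {operation}")
    constant = DECODE_CONSTANT if operation == "decode" else ENCODE_CONSTANT
    return height * width * element_size * constant


decode_bytes = estimate_qwen_vae_working_memory(832, 1248)
encode_bytes = estimate_qwen_vae_working_memory(832, 1248, operation="encode")
print(f"decode: {decode_bytes / GIB:.2f} GiB, encode: {encode_bytes / GIB:.2f} GiB")

# Tight card: about 8 GiB free is less than the decode estimate, so eviction is needed.
print(decode_bytes > 8 * GIB)
```

Under these assumptions the decode estimate at 1248×832 is about 10.64 GiB, which exceeds the ~8 GiB free on the tight card and therefore triggers eviction.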

Testing

  • Reproduced the OOM, then confirmed a clean run at 1248×832 with the calibrated constant and device_working_mem_gb back at its default — the text encoder is offloaded just before the VAE decode and the generation completes.
  • ruff check / ruff format / compile all clean.

Notes / open question

  • The encode constant (2750) follows the SD-style "half of decode" convention and is not independently measured — a conservative default. Worth a follow-up measurement if encode-side OOMs surface (relevant for Qwen Image Edit, which encodes an input image).
  • Calibration was done on ROCm/W7900; expandable_segments:True did not resolve the fragmentation on that stack — eviction is what reliably works.

🤖 Generated with Claude Code

… decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a
`working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache
therefore only reserved the default `device_working_mem_gb` and never
evicted the resident transformer/text encoder before the VAE decode. On a
near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder
resident) the decode then OOMs trying to allocate its working set into the
fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the
decode and encode paths so the cache makes room (evicting other models when
needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at
1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied
constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved*
(not just allocated) memory so that whenever the cache declines to free room
(free >= estimate) the decode is still guaranteed to fit. Encode uses ~half,
matching the other estimators (not independently measured).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added the python, invocations, and backend labels Jun 26, 2026
@lstein lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap Jun 26, 2026
@lstein lstein added the 6.13.5 Library Updates label Jun 26, 2026
@Pfannkuchensack

Collaborator

Findings

Medium - Missing test coverage for the actual fix

Paths: invokeai/app/invocations/qwen_image_latents_to_image.py:45-52, invokeai/app/invocations/qwen_image_image_to_latents.py:48-54

Both the decode and encode paths now compute and pass working_mem_bytes, but the PR adds zero tests. The repo already has a directly-applicable pattern: tests/app/invocations/test_z_image_working_memory.py mocks the VAE/context and asserts model_on_device.assert_called_once_with(working_mem_bytes=expected_memory) plus estimate.assert_called_once().

Without an equivalent, a future refactor that drops the working_mem_bytes= argument, reverts to a bare model_on_device(), or wires the wrong operation/tensor into the estimator would silently reintroduce the original OOM regression with no CI signal. The entire value of this PR is "the estimate is actually passed to the cache," and nothing proves that automatically.

To expose this issue, add a test that constructs QwenImageLatentsToImageInvocation and QwenImageImageToLatentsInvocation with a mocked AutoencoderKLQwenImage, patches estimate_vae_working_memory_qwen_image, invokes them, and asserts model_on_device was called once with working_mem_bytes=<estimate> for both operation="decode" and operation="encode" (mirroring test_z_image_working_memory.py).
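
A minimal, self-contained sketch of the assertion pattern is shown below. It does not import InvokeAI: `decode_with_estimate` is a hypothetical stand-in for the invocation's decode path, and the mocks replace the VAE, the loaded-model handle, and the estimator. A real test would construct the actual invocation classes and patch `estimate_vae_working_memory_qwen_image` instead.

```python
from unittest.mock import MagicMock


def decode_with_estimate(vae_info, latents, estimate_fn):
    """Hypothetical stand-in for the decode invocation body."""
    working_mem = estimate_fn(operation="decode", latents=latents)
    # The estimate must reach the cache through working_mem_bytes.
    with vae_info.model_on_device(working_mem_bytes=working_mem) as (_, vae):
        return vae.decode(latents)


def test_decode_passes_working_memory():
    expected_memory = 1234
    estimate = MagicMock(return_value=expected_memory)

    vae = MagicMock()
    vae_info = MagicMock()
    # model_on_device() is used as a context manager yielding (state_dict, model).
    vae_info.model_on_device.return_value.__enter__.return_value = (None, vae)

    latents = object()
    decode_with_estimate(vae_info, latents, estimate)

    estimate.assert_called_once()
    vae_info.model_on_device.assert_called_once_with(working_mem_bytes=expected_memory)
    vae.decode.assert_called_once_with(latents)


test_decode_passes_working_memory()
```

If the `working_mem_bytes=` argument were dropped from the call, `assert_called_once_with` would raise, which is the CI signal the finding asks for. An encode variant would follow the same shape with `operation="encode"`.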

Low - Encode constant is unmeasured and the docstring is inconsistent

Path: invokeai/backend/util/vae_working_memory.py:117-118

The encode constant is hardcoded as 2750, which the PR description itself flags as "not independently measured" and merely "half of decode." The new docstring says "Encoding uses ~half the working memory of decoding," but the sibling estimators it claims to match document encode as "~45%" / "~50%" of decode (invokeai/backend/util/vae_working_memory.py:25, :65, :86, :138).

For Qwen Image Edit, which encodes a real input image, an under-modeled encode constant reproduces the exact failure mode this PR fixes (cache declines to evict because free >= estimate, yet the encode reserves more than free and OOMs). This is a residual correctness gap, not just a style nit, because the encode path is the one Qwen Image Edit actually exercises.

To expose this issue, add a test that calls estimate_vae_working_memory_qwen_image(operation="encode", ...) and asserts the returned value is at least the measured encode reserved footprint once an encode measurement exists; until then, treat the encode constant as unverified.

Open Questions

  • Single-point linear calibration for a video-style VAE. The decode constant 5500 (invokeai/backend/util/vae_working_memory.py:118) is extrapolated linearly in h*w from one calibration point (1248x832, ~10.06 GiB reserved). At that point the formula yields 1038336 * 2 * 5500 = 11,421,696,000 bytes, or ~10.64 GiB, only ~5.7% above the measured value. The estimator assumes peak working memory is strictly linear in spatial area. If the VAE has any super-linear (attention) memory term at higher resolutions, the estimate for a large decode (e.g. 2048x2048) could fall below the true footprint: the cache would skip eviction (free >= estimate), and the decode would still OOM. A second calibration point at a larger resolution would settle whether the linear model and the thin margin hold.
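
One way to check a second calibration point, once it exists, is to compute the margin of the linear estimate over the measured reserved footprint. The sketch below assumes the same linear form used in the arithmetic above (area × 2-byte element × 5500); the helper names `linear_estimate` and `headroom` are hypothetical.

```python
GIB = 1024**3


def linear_estimate(height: int, width: int, constant: int = 5500, element_size: int = 2) -> int:
    """Assumed linear model for decode working memory, in bytes."""
    return height * width * element_size * constant


def headroom(height: int, width: int, measured_reserved_gib: float) -> float:
    """Fractional margin of the estimate over a measured reserved footprint.

    A negative result means the estimate under-models the decode at that
    resolution, so the cache could skip eviction and the decode could still OOM.
    """
    return linear_estimate(height, width) / (measured_reserved_gib * GIB) - 1.0


# The only measured point available so far: 1248x832, ~10.06 GiB reserved.
print(f"{headroom(832, 1248, 10.06):.1%}")
```

Running `headroom` with a measurement taken at a larger resolution would show directly whether the margin shrinks or turns negative as area grows.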

