fix(qwen): estimate VAE working memory so the cache frees room before decode/encode #9305
lstein wants to merge 1 commit into
… decode/encode

The Qwen Image l2i/i2l invocations called `model_on_device()` without a `working_mem_bytes` estimate, unlike the SD/SDXL path. The model cache therefore only reserved the default `device_working_mem_gb` and never evicted the resident transformer/text encoder before the VAE decode. On a near-full card (e.g. Qwen Image Edit Q8_0 with transformer + text encoder resident) the decode then OOMs trying to allocate its working set into the fragmented remainder.

Add `estimate_vae_working_memory_qwen_image()` and pass it into both the decode and encode paths so the cache makes room (evicting other models when needed) before the operation runs.

The constant is calibrated against a measured decode on an AMD W7900: at 1248x832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082), rounded up to 5500 for headroom. It tracks peak *reserved* (not just allocated) memory so that whenever the cache declines to free room (free >= estimate) the decode is still guaranteed to fit. Encode uses ~half, matching the other estimators (not independently measured).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
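The cache behaviour this commit relies on can be sketched as below. This is a minimal illustration, not InvokeAI's actual cache code: `bytes_to_free` is a hypothetical helper, the 3 GiB default working memory is an assumed value, and the ~8 GiB free and ~10.06 GiB decode figures are the ones reported in this PR.

```python
GIB = 1024**3


def bytes_to_free(free_vram: int, working_mem: int, model_vram_needed: int) -> int:
    """Hypothetical helper: how many bytes of other models must be evicted.

    Mirrors the rule described in this PR: the cache computes
    vram_available = free_vram - working_mem and only evicts when that
    is smaller than what the locked model still needs.
    """
    vram_available = free_vram - working_mem
    return max(0, model_vram_needed - vram_available)


free_vram = 8 * GIB          # fragmented remainder reported in the PR
vae_needed = 0               # the VAE is small and already resident

# Assumed default working memory (hypothetical 3 GiB): no eviction happens,
# so the decode has to fit its ~10 GiB working set into ~8 GiB.
default_working_mem = 3 * GIB
print(bytes_to_free(free_vram, default_working_mem, vae_needed))

# With an estimate that covers the measured reserved growth, the cache
# is forced to evict other models before the decode runs.
estimated_working_mem = int(10.06 * GIB)
print(bytes_to_free(free_vram, estimated_working_mem, vae_needed) / GIB)
```

The same rule gives the guarantee the commit message mentions: when nothing needs to be freed and the locked model is already resident, `free_vram >= working_mem`, so an estimate that is at least the true reserved footprint means the decode fits.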
Findings

Medium - Missing test coverage for the actual fix
Paths:
Both the decode and encode paths now compute and pass
Without an equivalent, a future refactor that drops the

Low - Encode constant is unmeasured and the docstring is inconsistent
Path:
The encode constant is hardcoded as
For Qwen Image Edit, which encodes a real input image, an under-modeled encode constant reproduces the exact failure mode this PR fixes (cache declines to evict because
Open Questions
Summary
The Qwen Image `qwen_image_l2i` (decode) and `qwen_image_i2l` (encode) invocations called `model_on_device()` without a `working_mem_bytes` estimate, unlike the SD/SDXL `l2i` path, which calls `estimate_vae_working_memory_sd15_sdxl(...)`. As a result, the model cache only reserved the default `device_working_mem_gb` and never evicted the resident transformer / text encoder before the VAE decode.

On a near-full card this OOMs. Reproduced with Qwen Image Edit 2511 (Q8_0) + the standard Qwen Image VAE on a 48 GB AMD W7900: with the transformer (~20.7 GB) and text encoder (~15.8 GB) resident, the autoencoder decode tried to allocate ~5 GiB into the fragmented ~8 GiB remainder and failed:
Root cause
`ModelCache._load_locked_model()` computes `vram_available = free_vram − working_mem` and only evicts other models when that drops below what the locked model needs. The VAE is tiny (~242 MB) and already resident, so `model_vram_needed ≈ 0` and nothing is ever evicted: the big transformer/text encoder stay put and the decode is squeezed into whatever fragmented VRAM is left.

Passing a realistic `working_mem_bytes` lets the cache make room (evicting other models) before the operation runs, which is exactly what the SD/SDXL path already does.

Fix
- Add `estimate_vae_working_memory_qwen_image()` in `vae_working_memory.py`.
- Pass the estimate via `model_on_device(working_mem_bytes=...)` in both the decode and encode invocations.

Calibration
The estimate is calibrated against a measured decode on a W7900. At 1248×832 the decode grew CUDA reserved memory by ~10.06 GiB (implied constant ~5082); rounded up to 5500 for headroom. The current SD constant (`2200`) under-modeled this heavier video-style VAE by ~2.4×.

The constant intentionally tracks peak reserved (not just allocated) memory. The cache's guarantee is "if it doesn't evict, then `free ≥ estimate`," so the estimate must be ≥ the decode's true reserved footprint. This closes the danger zone where the cache would skip eviction yet the decode would still reserve more than the free VRAM:

Testing
- … `device_working_mem_gb` back at its default: the text encoder is offloaded just before the VAE decode and the generation completes.
- `ruff check` / `ruff format` / compile all clean.

Notes / open question
- The encode constant (`2750`) follows the SD-style "half of decode" convention and is not independently measured; it is a conservative default. Worth a follow-up measurement if encode-side OOMs surface (relevant for Qwen Image Edit, which encodes an input image).
- `expandable_segments:True` did not resolve the fragmentation on that stack; eviction is what reliably works.

🤖 Generated with Claude Code