Feat(model support): ideogram4 support #9303
Open
Pfannkuchensack wants to merge 20 commits into
Switches compel from PyPI 2.1.1 to invoke-ai/compel@main fork which supports transformers 5.x. Bumps transformers floor to 5.9.0. Removes the transformers>=5.1.0 uv override that was only needed to bypass compel 2.1.1's <5.0 constraint. NOTE: compel fork pulls notebook dep (full Jupyter stack); flag to maintainer for cleanup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s 5.x transformers 5.x no longer exposes rope_theta as a top-level attribute on Qwen3Config; the value is stored in the rope_parameters (and rope_scaling) dict instead. Read it from there with a getattr fallback so the inv_freq buffer is computed from the configured base (1e6 / 256) instead of raising AttributeError. Applies to both the safetensors and GGUF Qwen3 encoder paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
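A minimal sketch of the fallback described in this commit, not the PR's actual code. The helper names `resolve_rope_theta` and `inv_freq` and the default base of 10000.0 are assumptions for illustration; only the attribute and dict names (`rope_theta`, `rope_parameters`, `rope_scaling`) come from the commit message.

```python
from types import SimpleNamespace
from typing import Any

DEFAULT_ROPE_THETA = 10000.0  # assumption: illustrative default, not taken from the PR


def resolve_rope_theta(config: Any, default: float = DEFAULT_ROPE_THETA) -> float:
    """Return the RoPE base from a config, whichever layout it uses."""
    # Newer layout: the base is stored inside a dict on the config.
    for dict_name in ("rope_parameters", "rope_scaling"):
        params = getattr(config, dict_name, None)
        if isinstance(params, dict) and params.get("rope_theta") is not None:
            return float(params["rope_theta"])
    # Older layout: top-level attribute, read defensively so a missing
    # attribute does not raise AttributeError.
    theta = getattr(config, "rope_theta", None)
    return float(theta) if theta is not None else default


def inv_freq(theta: float, head_dim: int) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


# Stand-in config objects (hypothetical, for demonstration only).
new_style = SimpleNamespace(rope_parameters={"rope_theta": 1_000_000.0})
old_style = SimpleNamespace(rope_theta=1_000_000.0)
print(resolve_rope_theta(new_style), resolve_rope_theta(old_style))
```

Both layouts resolve to the same base, so the `inv_freq` buffer does not depend on which transformers version produced the config.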
…whoami huggingface_hub 1.x removed get_token_permission(). HFTokenHelper.get_status() now validates the token via whoami(), which returns user info for a valid token and raises HfHubHTTPError for an invalid one. Preserves the original three-way status: VALID on success, INVALID on HfHubHTTPError (e.g. 401), UNKNOWN on any other error (e.g. network failure). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
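The three-way mapping in this commit can be sketched roughly as follows. This is a dependency-free illustration: `HubHTTPError` is a stand-in for `HfHubHTTPError`, the `whoami` callable is injected rather than imported, and `TokenStatus` / `get_status` are hypothetical names, not the real `HFTokenHelper` code.

```python
from enum import Enum
from typing import Any, Callable


class TokenStatus(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNKNOWN = "unknown"


class HubHTTPError(Exception):
    """Stand-in for huggingface_hub's HfHubHTTPError (assumption for this sketch)."""


def get_status(whoami: Callable[[], Any]) -> TokenStatus:
    """Map the outcome of a whoami-style call onto the three statuses."""
    try:
        whoami()  # returns user info when the token is accepted
    except HubHTTPError:
        return TokenStatus.INVALID  # e.g. a 401 from the hub
    except Exception:
        return TokenStatus.UNKNOWN  # e.g. network failure
    return TokenStatus.VALID
```

Catching the HTTP error before the generic `Exception` keeps "token rejected" distinct from "could not reach the hub", which is the distinction the commit says it preserves.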
….9-compel-fork # Conflicts: # invokeai/app/api/routers/model_manager.py # invokeai/app/invocations/sd3_text_encoder.py # invokeai/backend/model_manager/metadata/fetch/huggingface.py # pyproject.toml # uv.lock
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The upstream merge left an unresolved conflict marker in _t5_encode and reintroduced T5TokenizerFast. Keep our v5 assertion (T5Tokenizer only) plus upstream's new t5_device logic, and drop the now-dead T5TokenizerFast monkeypatch in the test (the name no longer exists in the module). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- flux_text_encoder.py: drop unused typing.Union (F401) left by v5 import merge - huggingface.py: ruff format (wrap append(SimpleNamespace(...))) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
transformers 5.6 flattened CLIPTextModel (removed the self.text_model wrapper, hoisted embeddings/encoder/final_layer_norm to the top level). diffusers' single-file checkpoint loader (create_diffusers_clip_model_from_ldm) still assumes the nested layout, so loading SD1.5 .safetensors checkpoints fails on 5.6+ with 'CLIPTextModel object has no attribute text_model' and, once that read is shimmed, 'Cannot copy out of meta tensor' (weights never populate the flattened model). Pin to >=5.5,<5.6 (last pre-flattening release) which keeps both the single-file and from_pretrained paths working. The invoke-ai/compel fork accepts any 5.x. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
chore(deps): replace compel fork with official compel 2.4.0 compel 2.4.0 (released 2026-05-30) merges the transformers-5 support that the invoke-ai fork carried (both descend from upstream PR invoke-ai#129), plus the maintainer-reviewed padding rework and added diffusers/T5 smoke coverage. Switch from the git fork to the PyPI release. - pyproject: compel git+main -> compel>=2.4.0,<3 - uv.lock: compel 2.3.1 (git 8f404b45) -> 2.4.0 (pypi) - transformers stays 5.5.4 (satisfies compel >=5,<6 and our <5.6 pin) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Vendor the Apache-2.0 Ideogram 4 reference model (DiT, FLUX2-style VAE, logit-normal flow-match scheduler, nf4/fp8 quant loading) into invokeai/backend/ideogram4/, plus InvokeAI glue (Qwen3-VL text encoding, packed-input build, dual-branch Euler denoise loop). Register the model: BaseModelType.Ideogram4, Main_Diffusers_Ideogram4_Config (detected via the Ideogram4Pipeline class name in model_index.json), and the Ideogram4DiffusersModel loader that loads both transformers as one Ideogram4TransformerPair submodel plus the Qwen3-VL encoder and VAE. Text-to-image only.
… loading End-to-end text-to-image backend for Ideogram 4, validated through the real session runner. Vendors the Apache-2.0 reference model (DiT, FLUX2-style VAE, logit-normal flow-match scheduler) into invokeai/backend/ideogram4/ with InvokeAI glue. Registers BaseModelType.Ideogram4, Main_Diffusers_Ideogram4_Config, and the Ideogram4DiffusersModel loader (two transformers as one Ideogram4TransformerPair; Qwen3-VL encoder + VAE). Both transformers and the encoder load via InvokeLinearNF4 so they work with the partial-load cache. Adds Ideogram4ConditioningInfo/Field/Output and the model_loader/text_encoder/denoise/l2i invocations. Text-to-image only.
Wires Ideogram 4 into the canvas/generate UI. buildIdeogram4Prompt assembles the structured JSON caption from the global prompt + Canvas Regional Guidance layers (each region → an obj element with a 0–1000 bbox + desc), with raw-JSON passthrough and a plain-text fallback when there are no regions. Adds buildIdeogram4Graph (text-to-image only, no negative prompt) and the enqueue switch. Structured captions use a static string node + a decoy positive-prompt node so the linear batch can't clobber the assembled JSON; plain text uses the real node so dynamic prompts/batching still work. Registers the 'ideogram-4' base (enums, color, names, model picker, grid size 16), a sampler-preset param (V4_QUALITY_48/V4_DEFAULT_20/V4_TURBO_12) replacing the steps/CFG controls, ParamIdeogram4SamplerPreset, and metadata recall. Regenerates schema.ts.
Advanced accordion now shows only Ideogram 4-relevant controls. Adds optional overrides of the sampler preset — steps, guidance scale (overrides the main gw, preserves the preset's polish tail), and schedule shift (mu) — plus a color palette editor that injects style_description.color_palette into the auto-built JSON caption (uppercase #RRGGBB, max 16, ignored for raw-JSON prompts). All are nullable (null = use preset), recallable from metadata, and the irrelevant controls (VAE, CLIP skip, CFG rescale, seamless, color compensation) are hidden for Ideogram 4. Backend denoise gains steps/guidance_scale/mu fields; schema.ts regenerated.
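As an illustration of the palette rule in this commit (uppercase #RRGGBB, at most 16 entries, null means "use preset"), here is a small Python sketch. The real editor lives in the TypeScript frontend; the function names below and the choice to silently drop malformed entries are assumptions, and only the `style_description.color_palette` path comes from the commit message.

```python
import re
from typing import Any, Optional

HEX_COLOR = re.compile(r"^#[0-9A-F]{6}$")
MAX_PALETTE_COLORS = 16  # limit stated in the commit message


def normalize_palette(colors: list[str]) -> list[str]:
    """Uppercase each entry, keep only #RRGGBB values, cap the list at 16."""
    cleaned: list[str] = []
    for color in colors:
        candidate = color.strip().upper()
        # Assumption: malformed entries are dropped rather than raising.
        if HEX_COLOR.match(candidate):
            cleaned.append(candidate)
    return cleaned[:MAX_PALETTE_COLORS]


def inject_palette(caption: dict[str, Any], colors: Optional[list[str]]) -> dict[str, Any]:
    """Return a copy of the caption with style_description.color_palette set."""
    if not colors:  # None or empty list: leave the caption untouched
        return caption
    palette = normalize_palette(colors)
    if not palette:
        return caption
    style = {**caption.get("style_description", {}), "color_palette": palette}
    return {**caption, "style_description": style}
```

Returning a new dict instead of mutating the input keeps the auto-built caption reusable when the palette is later cleared.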
Summary
Adds first-class Ideogram 4 (text-to-image) support to InvokeAI — a new open-weight 9.3B single-stream DiT with a Qwen3-VL-8B text encoder and flow-matching sampler.
The defining trait of this model is that it is trained on a structured JSON prompt that describes the scene as a list of regions, each with a bounding box ([y_min, x_min, y_max, x_max], normalized 0–1000, origin top-left) and a text description. Plain text works but is markedly lower quality.
The headline feature here is that this JSON is auto-assembled on the frontend from the existing Canvas Regional Guidance layers: the global prompt becomes the overall description, and each enabled region contributes one element (its drawn rect → bbox, its prompt → description). Users can also paste raw JSON to drive the model directly.
Why this is purely frontend string assembly: Ideogram 4 does not use spatial attention masks for regions (unlike FLUX/SDXL/Z-Image regional guidance) — the region boxes are encoded as text inside the single JSON string fed to Qwen3-VL. So the backend only ever sees one prompt string; no mask-conditioning code is touched.
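To make the string assembly concrete, here is a rough Python sketch of the idea. The actual implementation is the TypeScript `buildIdeogram4Prompt`; the `Region` type, the function names, and the top-level `description` key are assumptions for illustration, while `elements`, `bbox`, and `desc` follow the names used elsewhere in this PR.

```python
import json
from dataclasses import dataclass


@dataclass
class Region:
    """A drawn rectangle in canvas pixels plus its regional prompt."""
    x: float
    y: float
    width: float
    height: float
    prompt: str


def _norm(value: float, size: float) -> int:
    """Scale a pixel coordinate to the 0-1000 range, rounded and clamped."""
    return max(0, min(1000, round(value / size * 1000)))


def region_bbox(region: Region, canvas_w: float, canvas_h: float) -> list[int]:
    """Return [y_min, x_min, y_max, x_max] with the origin at the top-left."""
    return [
        _norm(region.y, canvas_h),
        _norm(region.x, canvas_w),
        _norm(region.y + region.height, canvas_h),
        _norm(region.x + region.width, canvas_w),
    ]


def build_prompt(global_prompt: str, regions: list[Region],
                 canvas_w: float, canvas_h: float) -> str:
    text = global_prompt.strip()
    # Raw-JSON passthrough: a prompt that already parses as JSON is sent as-is.
    if text.startswith("{"):
        try:
            json.loads(text)
            return text
        except json.JSONDecodeError:
            pass
    # Plain-text fallback when no regions are drawn.
    if not regions:
        return global_prompt
    caption = {
        "description": global_prompt,  # hypothetical key name
        "elements": [
            {"bbox": region_bbox(r, canvas_w, canvas_h), "desc": r.prompt}
            for r in regions
        ],
    }
    return json.dumps(caption, ensure_ascii=False)
```

Because the result is a single string, the backend text encoder needs no region-specific inputs.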
How
Backend (invokeai/backend/ideogram4/, vendored from the Apache-2.0 reference, copyright headers retained):
- DiT (modeling_ideogram4.py), FLUX2-style KL VAE (autoencoder.py + latent_norm.py), logit-normal flow-match scheduler + presets (scheduler.py, sampler_configs.py), nf4/fp8 quantized loading, and InvokeAI-side denoise.py / text_encoding.py / sampling_utils.py.
- Dual-branch guidance: positive runs over [text]+[image] tokens, negative runs the unconditional transformer over image-only tokens with zeroed LLM features (v = gw·pos + (1−gw)·neg), so there is no negative prompt. A transformer_pair.py wrapper keeps both transformers co-resident through the loop so the cache doesn't swap them every step (nf4 ≈ 10 GB resident during denoise; fits 24 GB).
- Registration: BaseModelType.Ideogram4, a Qwen3-VL text-encoder type, config detector + diffusers config, and a loader mirroring Z-Image.
- Invocations: ideogram4_model_loader, ideogram4_text_encoder, ideogram4_denoise, ideogram4_latents_to_image; new Ideogram4ConditioningInfo + field/output; ideogram4_txt2img generation mode.
Frontend:
- buildIdeogram4Prompt.ts: Regions→JSON assembly (raw-JSON passthrough; stable key order; bbox clamped/rounded to 0–1000) with unit tests.
- buildIdeogram4Graph.ts: text2img-only graph builder + enqueue wiring. Uses a decoy string node for the assembled JSON so the linear-UI batch injector doesn't clobber it, while plain-text prompts still flow through the real prompt node (so dynamic prompts / batching keep working).
Dependencies: bumps transformers to >=5.5,<5.6 (Qwen3-VL landed in 4.57; the encoder needs it) and compel to >=2.4.0,<3, with the necessary adaptations to the FLUX / Z-Image loaders, the safety checker, the HF metadata fetcher and model_util.
Out of scope (v1): img2img / inpaint / outpaint, ControlNet / IP-Adapter / LoRA, and the optional local "Magic Prompt" plain-text→JSON expander (parked; see Merge Plan).
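The guidance formula from the Backend notes, v = gw·pos + (1−gw)·neg, combined with an Euler update, can be sketched in plain Python as below. This is not the vendored denoise loop: the function names are hypothetical, latents are shown as flat lists instead of tensors, and the update x + (t_next − t_cur)·v is the usual flow-matching Euler step, assumed here rather than confirmed by the PR.

```python
def blend_velocity(pos: list[float], neg: list[float], gw: float) -> list[float]:
    """Guidance blend: v = gw * pos + (1 - gw) * neg, applied element-wise."""
    return [gw * p + (1.0 - gw) * n for p, n in zip(pos, neg)]


def euler_step(x: list[float], v: list[float], t_cur: float, t_next: float) -> list[float]:
    """One Euler update along the predicted velocity (assumed convention)."""
    dt = t_next - t_cur
    return [xi + dt * vi for xi, vi in zip(x, v)]


# Toy values standing in for the two transformer outputs.
pos_velocity = [1.0, 2.0]   # branch that sees text + image tokens
neg_velocity = [0.5, 1.0]   # unconditional branch, image tokens only
v = blend_velocity(pos_velocity, neg_velocity, gw=2.0)
latents = euler_step([0.0, 0.0], v, t_cur=1.0, t_next=0.5)
```

With gw = 1 the blend reduces to the positive branch alone, which is why the negative branch only matters when the guidance weight differs from 1.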
Related Issues / Discussions
- transformers 5.x bump tracked in PR feat - Migrate to Transformers 5.5.4 #9248; this branch carries the same bump (>=5.5,<5.6) plus the cross-model adaptations it requires.
QA Instructions
Requires the gated weights (ideogram-ai/ideogram-4-nf4; nf4 is the 24 GB path, CUDA/bitsandbytes only) plus the Qwen3-VL encoder + VAE sub-dependencies.
- … elements[*].bbox (0–1000, [y_min, x_min, y_max, x_max]) matching where you drew the boxes, and that element placement in the image roughly matches.
- … style_description.color_palette (auto-build mode only; ignored for raw JSON).
Frontend gates (from invokeai/frontend/web/): pnpm lint and pnpm test:no-watch (includes buildIdeogram4Prompt.test.ts). tsc clean.
Merge Plan
- … transformers to >=5.5,<5.6 and compel to >=2.4.0,<3, which touches many models (FLUX, Z-Image, safety checker, HF metadata fetch). Coordinate with / sequence after PR feat - Migrate to Transformers 5.5.4 #9248 (the transformers 5.x bump) to avoid a double-bump conflict, and time it to not collide with a pending release. Broad regression QA across existing model types is warranted, not just Ideogram 4.
- … starter_models.py entries, and the optional local Magic Prompt node (blocked on the upstream system-prompts PR feat: add System Prompts library for Expand Prompt button #9152, whose migration_32 collides with main and must be renumbered to 33 first).
Checklist
- What's New copy (if doing a release after this PR)