Feat(model support): ideogram4 support#9303

Open
Pfannkuchensack wants to merge 20 commits into invoke-ai:main from Pfannkuchensack:feat/ideogram4-support

@Pfannkuchensack

Collaborator

Summary

Adds first-class Ideogram 4 (text-to-image) support to InvokeAI — a new open-weight 9.3B single-stream DiT with a Qwen3-VL-8B text encoder and flow-matching sampler.

The defining trait of this model is that it is trained on a structured JSON prompt that describes the scene as a list of regions, each with a bounding box ([y_min, x_min, y_max, x_max], normalized 0–1000, origin top-left) and a text description. Plain-text prompts still work but produce markedly lower-quality images.

The headline feature here is that this JSON is auto-assembled on the frontend from the existing Canvas Regional Guidance layers: the global prompt becomes the overall description, and each enabled region contributes one element (its drawn rect → bbox, its prompt → description). Users can also paste raw JSON to drive the model directly.

Why this is purely frontend string assembly: Ideogram 4 does not use spatial attention masks for regions (unlike FLUX/SDXL/Z-Image regional guidance) — the region boxes are encoded as text inside the single JSON string fed to Qwen3-VL. So the backend only ever sees one prompt string; no mask-conditioning code is touched.
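
A minimal Python sketch of the Regions→JSON assembly described above, for illustration only: the PR's actual implementation is the TypeScript module buildIdeogram4Prompt.ts. The `Region` shape, the function names, and the `description` key for the global prompt are assumptions; `elements`, `bbox` and `desc` follow the names used elsewhere in this PR.

```python
import json
from dataclasses import dataclass


@dataclass
class Region:
    """A drawn rectangle in canvas pixels plus its prompt (hypothetical shape)."""

    x: float
    y: float
    width: float
    height: float
    prompt: str
    enabled: bool = True


def to_bbox(region: Region, canvas_w: int, canvas_h: int) -> list[int]:
    """Convert a pixel rect to [y_min, x_min, y_max, x_max] on a 0-1000 grid."""

    def norm(value: float, size: int) -> int:
        # Scale to 0-1000, round to an integer, then clamp into range.
        return max(0, min(1000, round(value / size * 1000)))

    return [
        norm(region.y, canvas_h),
        norm(region.x, canvas_w),
        norm(region.y + region.height, canvas_h),
        norm(region.x + region.width, canvas_w),
    ]


def is_raw_json(prompt: str) -> bool:
    """True when the prompt already parses as a JSON object."""
    try:
        return isinstance(json.loads(prompt), dict)
    except json.JSONDecodeError:
        return False


def build_prompt(global_prompt: str, regions: list[Region], canvas_w: int, canvas_h: int) -> str:
    if is_raw_json(global_prompt):
        return global_prompt  # raw-JSON passthrough, sent unchanged
    active = [r for r in regions if r.enabled]
    if not active:
        return global_prompt  # plain-text fallback when there are no regions
    payload = {
        "description": global_prompt,  # assumed key name for the overall description
        "elements": [{"bbox": to_bbox(r, canvas_w, canvas_h), "desc": r.prompt} for r in active],
    }
    # Python dicts preserve insertion order, so the key order is stable.
    return json.dumps(payload, ensure_ascii=False)
```

Because the boxes end up as plain integers inside one string, nothing downstream of the prompt node needs to know that regions exist.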

How

Backend (invokeai/backend/ideogram4/, vendored from the Apache-2.0 reference, copyright headers retained):

  • DiT (modeling_ideogram4.py), FLUX2-style KL VAE (autoencoder.py + latent_norm.py), logit-normal flow-match scheduler + presets (scheduler.py, sampler_configs.py), nf4/fp8 quantized loading, and InvokeAI-side denoise.py / text_encoding.py / sampling_utils.py.
  • Dual-branch asymmetric CFG: the positive branch runs the conditional transformer over [text]+[image] tokens, and the negative branch runs the unconditional transformer over image-only tokens with zeroed LLM features (v = gw·pos + (1−gw)·neg), so there is no negative prompt. A transformer_pair.py wrapper keeps both transformers co-resident through the loop so the model cache doesn't swap them every step (nf4 ≈ 10 GB resident during denoise; fits in 24 GB).
  • Model-manager registration: new BaseModelType.Ideogram4, a Qwen3-VL text-encoder type, config detector + diffusers config, and a loader mirroring Z-Image.
  • Four invocations: ideogram4_model_loader, ideogram4_text_encoder, ideogram4_denoise, ideogram4_latents_to_image; new Ideogram4ConditioningInfo + field/output; ideogram4_txt2img generation mode.
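
The guidance blend in the CFG bullet above can be sketched in a few lines. This is an illustration on plain Python lists rather than the PR's tensor code; the function name is hypothetical, and `pos` / `neg` stand for the velocity predictions of the conditional and unconditional transformers.

```python
def cfg_velocity(pos: list[float], neg: list[float], gw: float) -> list[float]:
    """Blend two velocity predictions: v = gw * pos + (1 - gw) * neg."""
    if len(pos) != len(neg):
        raise ValueError("pos and neg must have the same length")
    return [gw * p + (1.0 - gw) * n for p, n in zip(pos, neg)]
```

With gw = 1 the result is the conditional prediction alone, and with gw > 1 the output is pushed away from the unconditional prediction. The formula is algebraically the same as neg + gw·(pos − neg).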

Frontend:

  • buildIdeogram4Prompt.ts — Regions→JSON assembly (raw-JSON passthrough; stable key order; bbox clamped/rounded to 0–1000) with unit tests.
  • buildIdeogram4Graph.ts — text2img-only graph builder + enqueue wiring. Uses a decoy string node for the assembled JSON so the linear-UI batch injector doesn't clobber it, while plain-text prompts still flow through the real prompt node (so dynamic prompts / batching keep working).
  • Params + UI: a Sampler Preset combobox (Quality 48 / Default 20 / Turbo 12) as the primary control, plus Advanced overrides that actually apply to this model — Steps, Guidance Scale, Schedule Shift (mu) and a Color Palette picker. The irrelevant Advanced controls (VAE, CLIP Skip, CFG Rescale, Seamless, Color Compensation) are hidden for Ideogram 4.
  • Metadata recall for all of the above.
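
As a rough sketch of how a nullable override can resolve against the active sampler preset (null means "use the preset"), assuming hypothetical helper names; only the three step counts are taken from this PR:

```python
from typing import Optional, TypeVar

T = TypeVar("T")

# Step counts per preset, as listed in this PR.
PRESET_STEPS = {"V4_QUALITY_48": 48, "V4_DEFAULT_20": 20, "V4_TURBO_12": 12}


def resolve(preset_value: T, override: Optional[T]) -> T:
    """Return the override when one is set, otherwise the preset's value."""
    # Compare against None explicitly so that 0 or 0.0 remain valid overrides.
    return preset_value if override is None else override


def resolve_steps(preset: str, override: Optional[int] = None) -> int:
    return resolve(PRESET_STEPS[preset], override)
```

The same `resolve` helper would apply to guidance scale and mu; their per-preset values are not stated here, so they are left out of the sketch.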

Dependencies: bumps transformers to >=5.5,<5.6 (Qwen3-VL landed in 4.57; the encoder needs it) and compel to >=2.4.0,<3, with the necessary adaptations to the FLUX / Z-Image loaders, the safety checker, the HF metadata fetcher and model_util.

Out of scope (v1): img2img / inpaint / outpaint, ControlNet / IP-Adapter / LoRA, and the optional local "Magic Prompt" plain-text→JSON expander (parked — see Merge Plan).

Related Issues / Discussions

QA Instructions

Requires the gated weights (ideogram-ai/ideogram-4-nf4 — nf4 is the 24 GB path, CUDA/bitsandbytes only) plus the Qwen3-VL encoder + VAE sub-dependencies.

  1. Install & select an Ideogram 4 model; open the Canvas/Generate tab. Confirm the model shows under its own group, dimensions default to 1024×1024 (multiples of 16), and the Generation settings show the Sampler Preset control instead of Scheduler/CFG.
  2. Regions → JSON: type an overall description, add 1–2 Regional Guidance layers each with a prompt + a drawn box, and Invoke. In the result's metadata, confirm the assembled JSON has the correct key order and elements[*].bbox (0–1000, [y_min, x_min, y_max, x_max]) matching where you drew the boxes, and that element placement in the image roughly matches.
  3. Raw-JSON passthrough: paste a hand-written JSON object into the prompt box → it is sent unchanged (no region wrapping).
  4. Plain text: with no regions and a plain prompt, confirm normal text2img still works and that dynamic prompts / batching still expand (decoy-node behavior).
  5. Sampler presets: Quality / Default / Turbo produce the expected step counts.
  6. Advanced overrides: Steps / Guidance Scale / mu show the active preset's value as the "auto" default and can be overridden + reset; Color Palette swatches inject style_description.color_palette (auto-build mode only — ignored for raw JSON).
  7. Metadata recall: load an Ideogram 4 image and recall — preset, overrides and palette restore (and are guarded to only recall onto an Ideogram 4 model).

Frontend gates (from invokeai/frontend/web/): pnpm lint and pnpm test:no-watch (includes buildIdeogram4Prompt.test.ts). tsc clean.

Merge Plan

  • Large PR + dependency bump. This raises transformers to >=5.5,<5.6 and compel to >=2.4.0,<3, which touches many models (FLUX, Z-Image, safety checker, HF metadata fetch). Coordinate with / sequence after PR feat - Migrate to Transformers 5.5.4 #9248 (the transformers 5.x bump) to avoid a double-bump conflict, and time it to not collide with a pending release. Broad regression QA across existing model types is warranted, not just Ideogram 4.
  • Follow-ups (not in this PR): GGUF loader for the custom DiT (more VRAM headroom), starter_models.py entries, and the optional local Magic Prompt node (blocked on the upstream system-prompts PR feat: add System Prompts library for Expand Prompt button #9152, whose migration_32 collides with main and must be renumbered to 33 first).

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable) — frontend unit tests for the Regions→JSON assembly
  • ❗Changes to a redux slice have a corresponding migration — new params fields use zod defaults; confirm if a migration is needed
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

Your Name and others added 20 commits February 6, 2026 19:58
Switches compel from PyPI 2.1.1 to invoke-ai/compel@main fork which supports
transformers 5.x. Bumps transformers floor to 5.9.0. Removes the
transformers>=5.1.0 uv override that was only needed to bypass compel 2.1.1's
<5.0 constraint.

NOTE: compel fork pulls notebook dep (full Jupyter stack); flag to maintainer for cleanup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s 5.x

transformers 5.x no longer exposes rope_theta as a top-level attribute on
Qwen3Config; the value is stored in the rope_parameters (and rope_scaling)
dict instead. Read it from there with a getattr fallback so the inv_freq
buffer is computed from the configured base (1e6 / 256) instead of raising
AttributeError. Applies to both the safetensors and GGUF Qwen3 encoder paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
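
A minimal sketch of the fallback read described in this commit message, using a stand-in config object instead of a real Qwen3Config. The helper names and the 1e6 default are assumptions, not the PR's code.

```python
from types import SimpleNamespace
from typing import Any


def get_rope_theta(config: Any, default: float = 1_000_000.0) -> float:
    """Read rope_theta from the nested dicts first, then the legacy attribute."""
    for attr in ("rope_parameters", "rope_scaling"):
        params = getattr(config, attr, None)
        if isinstance(params, dict) and params.get("rope_theta") is not None:
            return float(params["rope_theta"])
    # Older configs expose rope_theta at the top level; otherwise use the default.
    return float(getattr(config, "rope_theta", default))


def inv_freq(theta: float, head_dim: int) -> list[float]:
    """Standard RoPE inverse frequencies: 1 / theta^(i / head_dim) for even i."""
    return [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


# Stand-in objects for the two config layouts (illustrative only).
new_style = SimpleNamespace(rope_parameters={"rope_theta": 1e6, "rope_type": "default"})
old_style = SimpleNamespace(rope_theta=10000.0)
```

Both `get_rope_theta(new_style)` and `get_rope_theta(old_style)` return a value without raising AttributeError, which is the behavior the commit aims for.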
…whoami

huggingface_hub 1.x removed get_token_permission(). HFTokenHelper.get_status()
now validates the token via whoami(), which returns user info for a valid token
and raises HfHubHTTPError for an invalid one. Preserves the original three-way
status: VALID on success, INVALID on HfHubHTTPError (e.g. 401), UNKNOWN on any
other error (e.g. network failure).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
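
The three-way status mapping can be sketched as follows. To keep the example runnable offline, the whoami call and the HTTP error type are passed in as parameters; in real use they would presumably be `huggingface_hub.whoami` and `HfHubHTTPError`. The enum and function names are hypothetical.

```python
from enum import Enum
from typing import Callable


class TokenStatus(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNKNOWN = "unknown"


def get_status(whoami: Callable[[], object], http_error: type[BaseException]) -> TokenStatus:
    """Map the outcome of a whoami() call onto VALID / INVALID / UNKNOWN."""
    try:
        whoami()
    except http_error:
        # The hub rejected the token (for example a 401 response).
        return TokenStatus.INVALID
    except Exception:
        # Anything else (for example a network failure) says nothing about the token.
        return TokenStatus.UNKNOWN
    return TokenStatus.VALID
```

The specific `except` clause has to come before the generic one, otherwise every failure would be reported as UNKNOWN.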
….9-compel-fork

# Conflicts:
#	invokeai/app/api/routers/model_manager.py
#	invokeai/app/invocations/sd3_text_encoder.py
#	invokeai/backend/model_manager/metadata/fetch/huggingface.py
#	pyproject.toml
#	uv.lock
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The upstream merge left an unresolved conflict marker in _t5_encode and
reintroduced T5TokenizerFast. Keep our v5 assertion (T5Tokenizer only) plus
upstream's new t5_device logic, and drop the now-dead T5TokenizerFast
monkeypatch in the test (the name no longer exists in the module).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- flux_text_encoder.py: drop unused typing.Union (F401) left by v5 import merge
- huggingface.py: ruff format (wrap append(SimpleNamespace(...)))

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
transformers 5.6 flattened CLIPTextModel (removed the self.text_model wrapper,
hoisted embeddings/encoder/final_layer_norm to the top level). diffusers' single-file
checkpoint loader (create_diffusers_clip_model_from_ldm) still assumes the nested
layout, so loading SD1.5 .safetensors checkpoints fails on 5.6+ with
'CLIPTextModel object has no attribute text_model' and, once that read is shimmed,
'Cannot copy out of meta tensor' (weights never populate the flattened model).

Pin to >=5.5,<5.6 (last pre-flattening release) which keeps both the single-file
and from_pretrained paths working. The invoke-ai/compel fork accepts any 5.x.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
chore(deps): replace compel fork with official compel 2.4.0

compel 2.4.0 (released 2026-05-30) merges the transformers-5 support that
the invoke-ai fork carried (both descend from upstream PR invoke-ai#129), plus the
maintainer-reviewed padding rework and added diffusers/T5 smoke coverage.
Switch from the git fork to the PyPI release.

- pyproject: compel git+main -> compel>=2.4.0,<3
- uv.lock: compel 2.3.1 (git 8f404b45) -> 2.4.0 (pypi)
- transformers stays 5.5.4 (satisfies compel >=5,<6 and our <5.6 pin)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Vendor the Apache-2.0 Ideogram 4 reference model (DiT, FLUX2-style VAE,
logit-normal flow-match scheduler, nf4/fp8 quant loading) into
invokeai/backend/ideogram4/, plus InvokeAI glue (Qwen3-VL text encoding,
packed-input build, dual-branch Euler denoise loop). Register the model:
BaseModelType.Ideogram4, Main_Diffusers_Ideogram4_Config (detected via the
Ideogram4Pipeline class name in model_index.json), and the Ideogram4DiffusersModel
loader that loads both transformers as one Ideogram4TransformerPair submodel plus
the Qwen3-VL encoder and VAE. Text-to-image only.
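
A sketch of the class-name detection mentioned above, assuming the usual diffusers layout in which model_index.json stores the pipeline class under the `_class_name` key. The function name is hypothetical and this is not the PR's config detector.

```python
import json
from pathlib import Path
from typing import Union


def is_ideogram4_pipeline(model_dir: Union[str, Path]) -> bool:
    """True when model_index.json names Ideogram4Pipeline as the pipeline class."""
    index_path = Path(model_dir) / "model_index.json"
    if not index_path.is_file():
        return False
    try:
        index = json.loads(index_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(index, dict) and index.get("_class_name") == "Ideogram4Pipeline"
```

Returning False for a missing or malformed index lets a probe like this fall through to other model detectors instead of raising.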
… loading

End-to-end text-to-image backend for Ideogram 4, validated through the real
session runner. Vendors the Apache-2.0 reference model (DiT, FLUX2-style VAE,
logit-normal flow-match scheduler) into invokeai/backend/ideogram4/ with InvokeAI
glue. Registers BaseModelType.Ideogram4, Main_Diffusers_Ideogram4_Config, and the
Ideogram4DiffusersModel loader (two transformers as one Ideogram4TransformerPair;
Qwen3-VL encoder + VAE). Both transformers and the encoder load via InvokeLinearNF4
so they work with the partial-load cache. Adds Ideogram4ConditioningInfo/Field/Output
and the model_loader/text_encoder/denoise/l2i invocations. Text-to-image only.
Wires Ideogram 4 into the canvas/generate UI. buildIdeogram4Prompt assembles the
structured JSON caption from the global prompt + Canvas Regional Guidance layers
(each region → an obj element with a 0–1000 bbox + desc), with raw-JSON passthrough
and a plain-text fallback when there are no regions. Adds buildIdeogram4Graph
(text-to-image only, no negative prompt) and the enqueue switch. Structured captions
use a static string node + a decoy positive-prompt node so the linear batch can't
clobber the assembled JSON; plain text uses the real node so dynamic prompts/batching
still work.

Registers the 'ideogram-4' base (enums, color, names, model picker, grid size 16), a
sampler-preset param (V4_QUALITY_48/V4_DEFAULT_20/V4_TURBO_12) replacing the steps/CFG
controls, ParamIdeogram4SamplerPreset, and metadata recall. Regenerates schema.ts.
Advanced accordion now shows only Ideogram 4-relevant controls. Adds optional
overrides of the sampler preset — steps, guidance scale (overrides the main gw,
preserves the preset's polish tail), and schedule shift (mu) — plus a color
palette editor that injects style_description.color_palette into the auto-built
JSON caption (uppercase #RRGGBB, max 16, ignored for raw-JSON prompts). All are
nullable (null = use preset), recallable from metadata, and the irrelevant
controls (VAE, CLIP skip, CFG rescale, seamless, color compensation) are hidden
for Ideogram 4. Backend denoise gains steps/guidance_scale/mu fields; schema.ts
regenerated.
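
The palette rules in this commit message (uppercase #RRGGBB, at most 16 entries, injected under style_description.color_palette) can be sketched in Python as below. The real editor lives in the TypeScript frontend; the function names here are hypothetical, and skipping invalid entries and dropping duplicates are assumptions of this sketch.

```python
import re

_HEX_COLOR = re.compile(r"^#?([0-9a-fA-F]{6})$")
MAX_COLORS = 16


def normalize_palette(colors: list[str]) -> list[str]:
    """Return up to 16 unique colors formatted as uppercase #RRGGBB."""
    result: list[str] = []
    for color in colors:
        match = _HEX_COLOR.match(color.strip())
        if match is None:
            continue  # assumption: silently skip anything that is not a 6-digit hex color
        normalized = "#" + match.group(1).upper()
        if normalized not in result:
            result.append(normalized)
        if len(result) == MAX_COLORS:
            break
    return result


def inject_palette(caption: dict, colors: list[str]) -> dict:
    """Return a copy of the caption with style_description.color_palette set."""
    palette = normalize_palette(colors)
    if not palette:
        return caption
    style = dict(caption.get("style_description", {}))
    style["color_palette"] = palette
    return {**caption, "style_description": style}
```

For raw-JSON prompts the caller would simply not invoke `inject_palette`, which matches the "ignored for raw-JSON prompts" behavior described above.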
github-actions bot added the api, python, Root, invocations, backend and frontend labels on Jun 25, 2026
github-actions bot added the python-tests and python-deps labels on Jun 25, 2026