333 changes: 333 additions & 0 deletions multi-component-model-architecture-design.md

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions multi_comp_recipe/.gitignore
@@ -0,0 +1,3 @@
exported_pkg
exported_vlm_pkg
out
216 changes: 216 additions & 0 deletions multi_comp_recipe/README.md
@@ -0,0 +1,216 @@
# Multi-Component Model Optimization Recipes

These recipes demonstrate **Flow A — export first, then per-component optimization**: export a
multi-component model to ONNX once, then run a single Olive config whose `builds` apply a
**different pipeline to each component**.

The flow is two explicit steps:

1. **Export** the model to a directory of per-component ONNX subfolders using the Olive CLI with the
Mobius builder.
2. **Optimize** by pointing an Olive config at that directory; each component subfolder becomes a
selectable component that a `build` can target.

There is no need to memorize component names: each exported component lives in its own folder, and
Olive loads the export directory as a `CompositeModel` whose **component names are the subfolder
names**.
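
To see which component names an export actually produced, a minimal sketch such as the following lists them from disk. The helper name `list_components` is hypothetical (it is not part of Olive or Mobius), and it assumes each component folder contains a `model.onnx` file, as in the layouts shown in this README.

```python
from pathlib import Path


def list_components(export_dir: str) -> list[str]:
    """Return the names of subfolders that contain a model.onnx file.

    Hypothetical helper: it only inspects the directory layout and does not
    load anything through Olive.
    """
    root = Path(export_dir)
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and (p / "model.onnx").is_file()
    )
```

Calling `list_components("exported_pkg")` after the export step should return the names that a `build` can target.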

---

## Prerequisites

```
pip install olive-ai
pip install mobius-ai
```

Exporting a diffusion pipeline also needs `diffusers`/`transformers` and access to the model on
Hugging Face (Stable Diffusion 3 is a gated model — accept its license and `huggingface-cli login`
first).

---

## Recipe 1 — Stable Diffusion 3 (`sd3_optimize_components.json`)

### Step 1 — Export with the CLI

```
olive capture-onnx-graph --model_name_or_path stabilityai/stable-diffusion-3-medium-diffusers --use_mobius_builder --output_path exported_pkg
```

Mobius exports each neural-network component to its own subfolder:

```
exported_pkg/
text_encoder/model.onnx # CLIP-L text encoder
text_encoder_2/model.onnx # CLIP-G text encoder
text_encoder_3/model.onnx # T5-XXL text encoder
transformer/model.onnx # MMDiT denoising backbone
vae_encoder/model.onnx
vae_decoder/model.onnx
```

> **Note.** The exact subfolders depend on the pipeline; the optimize config below only
> needs `builds` for the components you actually want to optimize.

### Step 2 — Optimize each component

Run from the directory that contains `exported_pkg/`:

```
olive run --config sd3_optimize_components.json
```

This applies a different pipeline per component:

| component | pipeline | intent |
|------------------|--------------------|------------------------------------------|
| `transformer` | `dynamic_quant` | INT8-quantize the heavy denoising backbone |
| `text_encoder_3` | `to_fp16` | keep T5-XXL in FP16 |
| `vae_encoder` | `to_fp16` | keep the VAE in FP16 to preserve quality |
| `vae_decoder` | `to_fp16` | keep the VAE in FP16 to preserve quality |

Output:

```
out/transformer/ # INT8 transformer
out/text_encoder_3/ # FP16 T5-XXL
out/vae_encoder/ # FP16 VAE encoder
out/vae_decoder/ # FP16 VAE decoder
```

Each build writes one optimized component; components without a build stay as exported.
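
Because only some components are rewritten under `out/`, downstream code has to pick between the optimized and the exported copy of each component. One way to do that is sketched below. `resolve_component_dir` is a hypothetical helper, and the directory names `out` and `exported_pkg` are simply the ones used in this recipe. The file names inside each `out/<component>/` folder are not specified here, so the sketch resolves directories only.

```python
from pathlib import Path


def resolve_component_dir(
    name: str, optimized_root: str = "out", exported_root: str = "exported_pkg"
) -> Path:
    """Prefer the optimized copy of a component; fall back to the exported one."""
    optimized = Path(optimized_root) / name
    # Treat an existing, non-empty folder as "this component was optimized".
    if optimized.is_dir() and any(optimized.iterdir()):
        return optimized
    return Path(exported_root) / name
```

With the SD3 recipe, `resolve_component_dir("transformer")` would point at `out/transformer`, while `resolve_component_dir("text_encoder")` would fall back to `exported_pkg/text_encoder`.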

### Step 3 — Inference

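Run end-to-end image generation with the exported ONNX models. The script's default `--onnx_dir` is
`exported_sd3_full2`, so pass the directory created in Step 1 explicitly:

```
python sd3_inference.py --prompt "A photo of a cat sitting on a windowsill" --steps 28 --output result.png --onnx_dir exported_pkg
```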

The inference script (`sd3_inference.py`) uses:
- **Text encoding**: ONNX Runtime with exported CLIP-L, CLIP-G, and T5-XXL encoders (run once)
- **Denoising**: ONNX Runtime with the exported SD3 transformer (28 steps by default)
- **VAE decoding**: ONNX Runtime with the exported VAE decoder
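
The denoising stage is a plain Euler integration: at each step the transformer predicts a velocity, and the scheduler moves the latents by `(sigma_next - sigma) * velocity`. The sketch below reproduces that update in NumPy. It rests on simplifying assumptions: the sigma schedule is linear (the real scheduler applies a shift), and a toy velocity function stands in for the ONNX transformer. The function name `euler_flow_sample` is illustrative only.

```python
import numpy as np


def euler_flow_sample(velocity_fn, noise, sigmas):
    """Integrate from pure noise (sigma=1) toward a sample (sigma=0) with Euler steps."""
    x = noise.copy()
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)           # stands in for the transformer call
        x = x + (sigma_next - sigma) * v    # same update rule as one scheduler step
    return x


rng = np.random.default_rng(42)
target = rng.standard_normal((1, 16, 8, 8)).astype(np.float32)
noise = rng.standard_normal(target.shape).astype(np.float32)
sigmas = np.linspace(1.0, 0.0, 29)  # 29 sigma values give 28 steps

# Toy "model": the exact velocity of the straight path
# x_sigma = (1 - sigma) * target + sigma * noise, which is (noise - target).
result = euler_flow_sample(lambda x, s: noise - target, noise, sigmas)
```

With an exact velocity the loop lands on `target`. With a learned model the result is only approximate, which is why the step count affects image quality.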

Options:
```
--prompt TEXT Text prompt for image generation
--steps N Number of denoising steps (default: 28)
--seed N Random seed (default: 42)
--output PATH Output image path (default: sd3_output.png)
--onnx_dir DIR Path to exported model directory (default: exported_sd3_full2)
```

> **Note.** SD3 is a gated model — you need `huggingface-cli login` or set `HF_TOKEN` to export.
> The tokenizers (CLIP and T5) still run via the `transformers` library.

---

## Recipe 2 — Vision-Language Model (`vlm_optimize_components.json`)

Same two-step Flow A for a VLM, using `Qwen/Qwen3-VL-2B-Instruct`.

### Step 1 — Export

```
olive capture-onnx-graph --model_name_or_path Qwen/Qwen3-VL-2B-Instruct --use_mobius_builder --output_path exported_vlm_pkg
```

Mobius exports this model as three components, each in its own subfolder:

```
exported_vlm_pkg/
decoder/model.onnx
vision_encoder/model.onnx
embedding/model.onnx
```

### Step 2 — Optimize

```
olive run --config vlm_optimize_components.json
```

| component | pipeline | intent |
|------------------|-----------------|-------------------------------------|
| `decoder` | `dynamic_quant` | INT8-quantize the language decoder |
| `vision_encoder` | `to_fp16` | keep the vision tower in FP16 |
| `embedding` | `to_fp16` | keep the embedding in FP16 |

> The three component names (`decoder`, `vision_encoder`, `embedding`) are exactly what Mobius
> produces for `Qwen/Qwen3-VL-2B-Instruct`. For a different VLM, adjust the component names in the
> config to match the subfolder names your export actually produced.

### Step 3 — Inference with ORT GenAI

Run text generation with the exported ONNX models using **onnxruntime-genai**:

```bash
# Text-only
python vlm_inference.py --prompt "The capital of France is"

# With image input
python vlm_inference.py --prompt "Describe this image." --image photo.jpg

# Custom settings
python vlm_inference.py --model_dir exported_vlm_pkg --max_new_tokens 256
```

The inference script (`vlm_inference.py`) uses ORT GenAI, which handles:
- **Tokenization**: Built-in tokenizer from saved HF tokenizer files
- **Embedding**: ONNX `embedding/model.onnx` (token embed + image/audio feature mixing)
- **Vision encoding**: ONNX `vision_encoder/model.onnx` (when `--image` is provided)
- **Decoding**: ONNX `decoder/model.onnx` with KV cache (autoregressive generation)
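
The decoding stage can be pictured as the loop below: the prompt is fed once to fill the cache, then each new token is produced from the previous token plus the cache instead of re-reading the whole sequence. This is a conceptual sketch only. `toy_step` is a hypothetical stand-in for one decoder call, and none of this is the onnxruntime-genai API.

```python
def toy_step(token: int, cache: list[int], vocab_size: int = 10) -> tuple[int, list[int]]:
    """Hypothetical stand-in for one decoder call: consumes ONE new token plus the cache."""
    cache = cache + [token]               # a real decoder appends this token's keys/values
    next_token = sum(cache) % vocab_size  # toy rule replacing logits + argmax
    return next_token, cache


def greedy_generate(prompt: list[int], max_new_tokens: int, eos: int = 0) -> list[int]:
    """Greedy autoregressive generation with a cache carried between steps."""
    cache: list[int] = []
    next_token = eos
    # Prefill: run the prompt through the decoder to populate the cache.
    for tok in prompt:
        next_token, cache = toy_step(tok, cache)
    generated: list[int] = []
    # Decode: each iteration feeds only the newest token.
    for _ in range(max_new_tokens):
        if next_token == eos:
            break
        generated.append(next_token)
        next_token, cache = toy_step(next_token, cache)
    return generated
```

`--max_new_tokens` plays the role of `max_new_tokens` here: generation stops at that limit or at the end-of-sequence token, whichever comes first.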

Options:
```
--prompt TEXT Text prompt
--image PATH Optional image file for multimodal input
--max_new_tokens N Maximum tokens to generate (default: 128)
--model_dir DIR Path to exported model directory (default: exported_vlm_pkg)
```

#### Setup requirements

The export directory needs these files alongside the ONNX models:

```
exported_vlm_pkg/
genai_config.json # Model type, I/O mappings, search config
tokenizer.json # HF tokenizer
tokenizer_config.json
vision_processor.json # Vision preprocessing config
audio_processor.json # Audio preprocessing config (for Phi-4-multimodal)
decoder/model.onnx
vision_encoder/model.onnx
embedding/model.onnx
audio_encoder/model.onnx # Optional (Phi-4-multimodal)
```
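
Before running the script, it can help to confirm that the directory is complete. The sketch below is one possible check. `missing_files` is a hypothetical helper, and the `REQUIRED` list is copied from the layout above with the optional audio entries left out, so adjust it to your model.

```python
from pathlib import Path

# Files listed in the layout above, excluding the optional audio entries.
REQUIRED = [
    "genai_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vision_processor.json",
    "decoder/model.onnx",
    "vision_encoder/model.onnx",
    "embedding/model.onnx",
]


def missing_files(model_dir: str, required: list[str] = REQUIRED) -> list[str]:
    """Return the required relative paths that are absent from model_dir."""
    root = Path(model_dir)
    return [rel for rel in required if not (root / rel).is_file()]
```

An empty list from `missing_files("exported_vlm_pkg")` means every file named above is present. It does not validate the contents of those files.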

To save the tokenizer files into the export directory after export (use the same model ID that you
exported):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
tokenizer.save_pretrained("exported_vlm_pkg")
```

For the `genai_config.json` structure, see the
[Mobius phi4mm example](https://github.com/microsoft/mobius/blob/main/examples/phi4mm_ort_genai.py)
which writes the config automatically.

> **Note.** Install `onnxruntime-genai` (`pip install onnxruntime-genai`) to use this script.

---

## Notes

- The passes here (`OnnxFloatToFloat16`, `OnnxDynamicQuantization`) are **illustrative** and chosen
to run without calibration data. Swap in `OrtTransformersOptimization`, `OnnxStaticQuantization`
(with a `data_config`), or other ONNX passes for production-quality optimization.
- The recipes target the **CPU** EP so they run anywhere. For GPU deployment, change the
`execution_providers` to e.g. `["CUDAExecutionProvider"]` and the device to `"gpu"`.
- `builds.components` selects which exported components to optimize. Only the components with a build
are touched; the rest remain as exported.
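
To give a feel for the precision trade-off between the two illustrative pipelines, the sketch below compares the round-trip error of a simplified symmetric INT8 quantization with that of a plain FP16 cast on random weights. It is an arithmetic illustration under stated assumptions, not what `OnnxDynamicQuantization` or `OnnxFloatToFloat16` do internally. `quantize_int8` is a hypothetical helper.

```python
import numpy as np


def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization, so that x is approximately scale * q."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)

# INT8 round trip: quantize, then dequantize.
q, scale = quantize_int8(weights)
int8_error = np.abs(weights - q.astype(np.float32) * scale).max()

# FP16 round trip: cast down, then back up.
fp16_error = np.abs(weights - weights.astype(np.float16).astype(np.float32)).max()

print(f"scale={scale:.5f} int8_max_err={int8_error:.5f} fp16_max_err={fp16_error:.5f}")
```

The INT8 error is bounded by half the quantization step (`scale / 2`), which for weights like these is larger than the FP16 rounding error. That is consistent with the recipes quantizing only the heaviest component and keeping quality-sensitive ones in FP16.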
160 changes: 160 additions & 0 deletions multi_comp_recipe/sd3_inference.py
@@ -0,0 +1,160 @@
#!/usr/bin/env python
"""SD3 end-to-end inference using all ONNX components (text encoders + transformer + VAE).

Usage:
    python sd3_inference.py --prompt "A photo of a cat sitting on a windowsill"
    python sd3_inference.py --prompt "A futuristic city" --steps 50 --output city.png
"""

import argparse
import os

import numpy as np
import onnxruntime as ort
import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from PIL import Image
from transformers import CLIPTokenizer, T5TokenizerFast

MODEL_ID = "stabilityai/stable-diffusion-3-medium-diffusers"
ONNX_DIR = "exported_sd3_full2"


def encode_text(prompt: str, onnx_dir: str, model_id: str) -> tuple[np.ndarray, np.ndarray]:
    """Encode prompt using ONNX CLIP-L, CLIP-G, and T5-XXL text encoders.

    Returns:
        encoder_hidden_states: [1, 333, 4096]
        pooled_projections: [1, 2048]
    """
    # Load tokenizers (lightweight, no model weights)
    tokenizer_l = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
    tokenizer_g = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer_2")
    tokenizer_t5 = T5TokenizerFast.from_pretrained(model_id, subfolder="tokenizer_3")

    # Load ONNX sessions
    sess_l = ort.InferenceSession(os.path.join(onnx_dir, "text_encoder", "model.onnx"))
    sess_g = ort.InferenceSession(os.path.join(onnx_dir, "text_encoder_2", "model.onnx"))
    sess_t5 = ort.InferenceSession(os.path.join(onnx_dir, "text_encoder_3", "model.onnx"))

    # CLIP-L
    # NOTE: the diffusers SD3 pipeline conditions on the penultimate CLIP hidden state;
    # this code assumes output 0 of the exported encoders is the tensor to use.
    tokens_l = tokenizer_l(prompt, padding="max_length", max_length=77, return_tensors="np", truncation=True)
    out_l = sess_l.run(None, {
        "input_ids": tokens_l["input_ids"].astype(np.int64),
        "attention_mask": tokens_l["attention_mask"].astype(np.int64),
    })
    clip_l_hidden = out_l[0]  # last_hidden_state [1, 77, 768]
    clip_l_pooled = out_l[1]  # text_embeds [1, 768]

    # CLIP-G
    tokens_g = tokenizer_g(prompt, padding="max_length", max_length=77, return_tensors="np", truncation=True)
    out_g = sess_g.run(None, {
        "input_ids": tokens_g["input_ids"].astype(np.int64),
        "attention_mask": tokens_g["attention_mask"].astype(np.int64),
    })
    clip_g_hidden = out_g[0]  # last_hidden_state [1, 77, 1280]
    clip_g_pooled = out_g[1]  # text_embeds [1, 1280]

    # T5-XXL
    tokens_t5 = tokenizer_t5(prompt, padding="max_length", max_length=256, return_tensors="np", truncation=True)
    out_t5 = sess_t5.run(None, {"input_ids": tokens_t5["input_ids"].astype(np.int64)})
    t5_hidden = out_t5[0]  # last_hidden_state [1, 256, 4096]

    # Join the two CLIP outputs on the feature axis, zero-pad to the T5 width, then
    # append the T5 tokens on the sequence axis (the layout the diffusers SD3 pipeline uses).
    clip_hidden = np.concatenate([clip_l_hidden, clip_g_hidden], axis=-1)  # [1, 77, 2048]
    pad_width = t5_hidden.shape[-1] - clip_hidden.shape[-1]
    clip_padded = np.pad(clip_hidden, ((0, 0), (0, 0), (0, pad_width)))  # [1, 77, 4096]
    encoder_hidden_states = np.concatenate([clip_padded, t5_hidden], axis=1)  # [1, 333, 4096]
    pooled_projections = np.concatenate([clip_l_pooled, clip_g_pooled], axis=-1)  # [1, 2048]

    return encoder_hidden_states.astype(np.float32), pooled_projections.astype(np.float32)


def denoise(
    onnx_dir: str,
    encoder_hidden_states: np.ndarray,
    pooled_projections: np.ndarray,
    scheduler: FlowMatchEulerDiscreteScheduler,
    latent_shape: tuple = (1, 16, 64, 64),
    seed: int = 42,
) -> torch.Tensor:
    """Run the denoising loop using the ONNX transformer."""
    sess = ort.InferenceSession(os.path.join(onnx_dir, "transformer", "model.onnx"))

    torch.manual_seed(seed)
    latents = torch.randn(latent_shape)

    for i, t in enumerate(scheduler.timesteps):
        noise_pred = sess.run(
            None,
            {
                "sample": latents.numpy(),
                # NOTE: the int64 cast truncates fractional timesteps; use a float dtype
                # instead if the exported graph declares a float "timestep" input.
                "timestep": np.array([t.item()], dtype=np.int64),
                "encoder_hidden_states": encoder_hidden_states,
                "pooled_projections": pooled_projections,
            },
        )[0]
        latents = scheduler.step(torch.from_numpy(noise_pred), t, latents, return_dict=False)[0]
        if i % 7 == 0:
            print(f" Step {i}/{len(scheduler.timesteps)}, t={t.item():.1f}")

    return latents


def decode_latents(latents: torch.Tensor, onnx_dir: str) -> np.ndarray:
    """Decode latents to image using the ONNX VAE decoder."""
    sess = ort.InferenceSession(os.path.join(onnx_dir, "vae_decoder", "model.onnx"))

    # SD3 VAE scaling: latents / scaling_factor + shift_factor
    # SD3 defaults: scaling_factor=1.5305, shift_factor=0.0609
    scaling_factor = 1.5305
    shift_factor = 0.0609
    latents_scaled = latents / scaling_factor + shift_factor

    output = sess.run(None, {"latent_sample": latents_scaled.numpy()})[0]
    # output: [1, 3, H, W] in [-1, 1]
    image = (output / 2 + 0.5).clip(0, 1)
    image = np.transpose(image[0], (1, 2, 0))  # [H, W, 3]
    return (image * 255).astype(np.uint8)


def main():
    parser = argparse.ArgumentParser(description="SD3 all-ONNX inference")
    parser.add_argument("--prompt", default="A photo of a cat sitting on a windowsill")
    parser.add_argument("--steps", type=int, default=28)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", default="sd3_output.png")
    parser.add_argument("--model_id", default=MODEL_ID)
    parser.add_argument("--onnx_dir", default=ONNX_DIR)
    args = parser.parse_args()

    # Verify exported model exists; exit with a non-zero status if it does not
    transformer_path = os.path.join(args.onnx_dir, "transformer", "model.onnx")
    if not os.path.exists(transformer_path):
        print(f"Error: ONNX model not found at {transformer_path}")
        print(
            "Run: olive capture-onnx-graph --model_name_or_path "
            f"{args.model_id} --use_mobius_builder --output_path {args.onnx_dir}"
        )
        raise SystemExit(1)

    print(f"Prompt: {args.prompt}")
    print(f"Steps: {args.steps}, Seed: {args.seed}")
    print(f"ONNX dir: {args.onnx_dir}")

    print("\n1. Encoding text (ONNX CLIP-L + CLIP-G + T5-XXL)...")
    encoder_hidden_states, pooled_projections = encode_text(args.prompt, args.onnx_dir, args.model_id)
    print(f" encoder_hidden_states: {encoder_hidden_states.shape}")
    print(f" pooled_projections: {pooled_projections.shape}")

    print("\n2. Denoising (ONNX SD3 transformer)...")
    scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(args.model_id, subfolder="scheduler")
    scheduler.set_timesteps(args.steps)
    latents = denoise(args.onnx_dir, encoder_hidden_states, pooled_projections, scheduler, seed=args.seed)

    print("\n3. Decoding latents (ONNX VAE decoder)...")
    image = decode_latents(latents, args.onnx_dir)
    Image.fromarray(image).save(args.output)
    print(f"\nSaved: {args.output} ({image.shape[1]}x{image.shape[0]})")


if __name__ == "__main__":
    main()