microsoft · xiaoyu-work · May 28, 2026 · May 28, 2026 · Jun 2, 2026 · Jun 16, 2026
diff --git a/multi-component-model-architecture-design.md b/multi-component-model-architecture-design.md
diff --git a/multi_comp_recipe/.gitignore b/multi_comp_recipe/.gitignore
@@ -0,0 +1,3 @@
+exported_pkg
+exported_vlm_pkg
+out
diff --git a/multi_comp_recipe/README.md b/multi_comp_recipe/README.md
@@ -0,0 +1,216 @@
+# Multi-Component Model Optimization Recipes
+
+These recipes demonstrate **Flow A — export first, then per-component optimization**: export a
+multi-component model to ONNX once, then run a single Olive config whose `builds` apply a
+**different pipeline to each component**.
+
+The flow is two explicit steps:
+
+1. **Export** the model to a directory of per-component ONNX subfolders using the Olive CLI with the
+   Mobius builder.
+2. **Optimize** by pointing an Olive config at that directory; each component subfolder becomes a
+   selectable component that a `build` can target.
+
+There is no need to memorize component names: each exported component lives in its own folder, and
+Olive loads the export directory as a `CompositeModel` whose **component names are the subfolder
+names**.
+
+---
+
+## Prerequisites
+
+```
+pip install olive-ai
+pip install mobius-ai
+```
+
+Exporting a diffusion pipeline also needs `diffusers`/`transformers` and access to the model on
+Hugging Face. Stable Diffusion 3 is a gated model, so accept its license and run
+`huggingface-cli login` first.
+
+---
+
+## Recipe 1 — Stable Diffusion 3 (`sd3_optimize_components.json`)
+
+### Step 1 — Export with the CLI
+
+```
+olive capture-onnx-graph --model_name_or_path stabilityai/stable-diffusion-3-medium-diffusers --use_mobius_builder --output_path exported_pkg
+```
+
+Mobius exports each neural-network component to its own subfolder:
+
+```
+exported_pkg/
+  text_encoder/model.onnx    # CLIP-L text encoder
+  text_encoder_2/model.onnx  # CLIP-G text encoder
+  text_encoder_3/model.onnx  # T5-XXL text encoder
+  transformer/model.onnx     # MMDiT denoising backbone
+  vae_encoder/model.onnx
+  vae_decoder/model.onnx
+```
+
+> **Note.** The exact subfolders depend on the pipeline; the optimize config below only
+> needs `builds` for the components you actually want to optimize.
+
+### Step 2 — Optimize each component
+
+Run from the directory that contains `exported_pkg/`:
+
+```
+olive run --config sd3_optimize_components.json
+```
+
+This applies a different pipeline per component:
+
+| component        | pipeline           | intent                                   |
+|------------------|--------------------|------------------------------------------|
+| `transformer`    | `dynamic_quant`    | INT8-quantize the heavy denoising backbone |
+| `text_encoder_3` | `to_fp16`          | keep T5-XXL in FP16                     |
+| `vae_encoder`    | `to_fp16`          | keep the VAE in FP16 to preserve quality |
+| `vae_decoder`    | `to_fp16`          | keep the VAE in FP16 to preserve quality |
+
+Output:
+
+```
+out/transformer/    # INT8 transformer
+out/text_encoder_3/ # FP16 T5-XXL
+out/vae_encoder/    # FP16 VAE encoder
+out/vae_decoder/    # FP16 VAE decoder
+```
+
+Each build writes one optimized component; components without a build stay as exported.
+
+### Step 3 — Inference
+
+Run end-to-end image generation with the exported ONNX models:
+
+```
+python sd3_inference.py --onnx_dir exported_pkg --prompt "A photo of a cat sitting on a windowsill" --steps 28 --output result.png
+```
+
+The inference script (`sd3_inference.py`) uses:
+- **Text encoding**: ONNX Runtime with exported CLIP-L, CLIP-G, and T5-XXL encoders (run once)
+- **Denoising**: ONNX Runtime with the exported SD3 transformer (28 steps by default)
+- **VAE decoding**: ONNX Runtime with the exported VAE decoder
+
+Options:
+```
+--prompt TEXT       Text prompt for image generation
+--steps N           Number of denoising steps (default: 28)
+--seed N            Random seed (default: 42)
+--output PATH       Output image path (default: sd3_output.png)
+--onnx_dir DIR      Path to exported model directory (default: exported_sd3_full2)
+```
+
+> **Note.** SD3 is a gated model: run `huggingface-cli login` or set `HF_TOKEN` before exporting.
+> The tokenizers (CLIP and T5) and the scheduler config still load via `transformers` and `diffusers`.
+
+---
+
+## Recipe 2 — Vision-Language Model (`vlm_optimize_components.json`)
+
+Same two-step Flow A for a VLM, using `Qwen/Qwen3-VL-2B-Instruct`.
+
+### Step 1 — Export
+
+```
+olive capture-onnx-graph --model_name_or_path Qwen/Qwen3-VL-2B-Instruct --use_mobius_builder --output_path exported_vlm_pkg
+```
+
+Mobius exports this model as three components, each in its own subfolder:
+
+```
+exported_vlm_pkg/
+  decoder/model.onnx
+  vision_encoder/model.onnx
+  embedding/model.onnx
+```
+
+### Step 2 — Optimize
+
+```
+olive run --config vlm_optimize_components.json
+```
+
+| component        | pipeline        | intent                              |
+|------------------|-----------------|-------------------------------------|
+| `decoder`        | `dynamic_quant` | INT8-quantize the language decoder  |
+| `vision_encoder` | `to_fp16`       | keep the vision tower in FP16       |
+| `embedding`      | `to_fp16`       | keep the embedding in FP16          |
+
+> The three component names (`decoder`, `vision_encoder`, `embedding`) are exactly what Mobius
+> produces for `Qwen/Qwen3-VL-2B-Instruct`. For a different VLM, adjust the component names in the
+> config to match the subfolder names your export actually produced.
+
+### Step 3 — Inference with ORT GenAI
+
+Run text generation with the exported ONNX models using **onnxruntime-genai**:
+
+```bash
+# Text-only
+python vlm_inference.py --prompt "The capital of France is"
+
+# With image input
+python vlm_inference.py --prompt "Describe this image." --image photo.jpg
+
+# Custom settings
+python vlm_inference.py --model_dir exported_vlm_pkg --max_new_tokens 256
+```
+
+The inference script (`vlm_inference.py`) uses ORT GenAI, which handles:
+- **Tokenization**: Built-in tokenizer from saved HF tokenizer files
+- **Embedding**: ONNX `embedding/model.onnx` (token embed + image/audio feature mixing)
+- **Vision encoding**: ONNX `vision_encoder/model.onnx` (when `--image` is provided)
+- **Decoding**: ONNX `decoder/model.onnx` with KV cache (autoregressive generation)
+
+Options:
+```
+--prompt TEXT           Text prompt
+--image PATH            Optional image file for multimodal input
+--max_new_tokens N      Maximum tokens to generate (default: 128)
+--model_dir DIR         Path to exported model directory (default: exported_vlm_pkg)
+```
+
+#### Setup requirements
+
+The export directory needs these files alongside the ONNX models:
+
+```
+exported_vlm_pkg/
+  genai_config.json          # Model type, I/O mappings, search config
+  tokenizer.json             # HF tokenizer
+  tokenizer_config.json
+  vision_processor.json      # Vision preprocessing config
+  audio_processor.json       # Optional: audio preprocessing config (Phi-4-multimodal)
+  decoder/model.onnx
+  vision_encoder/model.onnx
+  embedding/model.onnx
+  audio_encoder/model.onnx   # Optional (Phi-4-multimodal)
+```
+
+To save the tokenizer files into the export directory after export:
+
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
+tokenizer.save_pretrained("exported_vlm_pkg")
+```
+
+For the `genai_config.json` structure, see the
+[Mobius phi4mm example](https://github.com/microsoft/mobius/blob/main/examples/phi4mm_ort_genai.py)
+which writes the config automatically.
+
+> **Note.** Install `onnxruntime-genai` (`pip install onnxruntime-genai`) to use this script.
+
+---
+
+## Notes
+
+- The passes here (`OnnxFloatToFloat16`, `OnnxDynamicQuantization`) are **illustrative** and chosen
+  to run without calibration data. Swap in `OrtTransformersOptimization`, `OnnxStaticQuantization`
+  (with a `data_config`), or other ONNX passes for production-quality optimization.
+- The recipes target the **CPU** EP so they run anywhere. For GPU deployment, change the
+  `execution_providers` to e.g. `["CUDAExecutionProvider"]` and the device to `"gpu"`.
+- `builds.components` selects which exported components to optimize. Only the components with a build
+  are touched; the rest remain as exported.
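
The README states that component names are the export directory's subfolder names. A minimal sketch of that naming rule, useful for checking which names a config can target before running `olive run`. It only mirrors the rule as described in the README and is not Olive's actual loader; `list_components` and `make_demo_export` are hypothetical helper names.

```python
import tempfile
from pathlib import Path


def list_components(export_dir: str) -> list[str]:
    """Return the names of subfolders that contain a model.onnx file."""
    root = Path(export_dir)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "model.onnx").is_file()
    )


def make_demo_export(names: list[str]) -> str:
    """Create a throwaway directory tree that mimics the exported layout."""
    root = Path(tempfile.mkdtemp())
    for name in names:
        (root / name).mkdir()
        (root / name / "model.onnx").touch()
    # Top-level files such as genai_config.json are not components.
    (root / "genai_config.json").touch()
    return str(root)


if __name__ == "__main__":
    demo = make_demo_export(["decoder", "vision_encoder", "embedding"])
    print(list_components(demo))
```

Pointing `list_components` at a real export directory such as `exported_vlm_pkg` should list the names that the config's builds are expected to match.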
diff --git a/multi_comp_recipe/sd3_inference.py b/multi_comp_recipe/sd3_inference.py
@@ -0,0 +1,160 @@
+#!/usr/bin/env python
+"""SD3 end-to-end inference using all ONNX components (text encoders + transformer + VAE).
+
+Usage:
+    python sd3_inference.py --prompt "A photo of a cat sitting on a windowsill"
+    python sd3_inference.py --prompt "A futuristic city" --steps 50 --output city.png
+"""
+
+import argparse
+import os
+
+import numpy as np
+import onnxruntime as ort
+import torch
+from diffusers import FlowMatchEulerDiscreteScheduler
+from PIL import Image
+from transformers import CLIPTokenizer, T5TokenizerFast
+
+MODEL_ID = "stabilityai/stable-diffusion-3-medium-diffusers"
+ONNX_DIR = "exported_sd3_full2"
+
+
+def encode_text(prompt: str, onnx_dir: str, model_id: str) -> tuple[np.ndarray, np.ndarray]:
+    """Encode prompt using ONNX CLIP-L, CLIP-G, and T5-XXL text encoders.
+
+    Returns:
+        encoder_hidden_states: [1, 333, 4096]
+        pooled_projections: [1, 2048]
+    """
+    # Load tokenizers (lightweight, no model weights)
+    tokenizer_l = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
+    tokenizer_g = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer_2")
+    tokenizer_t5 = T5TokenizerFast.from_pretrained(model_id, subfolder="tokenizer_3")
+
+    # Load ONNX sessions
+    sess_l = ort.InferenceSession(os.path.join(onnx_dir, "text_encoder", "model.onnx"))
+    sess_g = ort.InferenceSession(os.path.join(onnx_dir, "text_encoder_2", "model.onnx"))
+    sess_t5 = ort.InferenceSession(os.path.join(onnx_dir, "text_encoder_3", "model.onnx"))
+
+    # CLIP-L
+    tokens_l = tokenizer_l(prompt, padding="max_length", max_length=77, return_tensors="np", truncation=True)
+    out_l = sess_l.run(None, {
+        "input_ids": tokens_l["input_ids"].astype(np.int64),
+        "attention_mask": tokens_l["attention_mask"].astype(np.int64),
+    })
+    clip_l_hidden = out_l[0]  # last_hidden_state [1, 77, 768]
+    clip_l_pooled = out_l[1]  # text_embeds [1, 768]
+
+    # CLIP-G
+    tokens_g = tokenizer_g(prompt, padding="max_length", max_length=77, return_tensors="np", truncation=True)
+    out_g = sess_g.run(None, {
+        "input_ids": tokens_g["input_ids"].astype(np.int64),
+        "attention_mask": tokens_g["attention_mask"].astype(np.int64),
+    })
+    clip_g_hidden = out_g[0]  # last_hidden_state [1, 77, 1280]
+    clip_g_pooled = out_g[1]  # text_embeds [1, 1280]
+
+    # T5-XXL
+    tokens_t5 = tokenizer_t5(prompt, padding="max_length", max_length=256, return_tensors="np", truncation=True)
+    out_t5 = sess_t5.run(None, {"input_ids": tokens_t5["input_ids"].astype(np.int64)})
+    t5_hidden = out_t5[0]  # last_hidden_state [1, 256, 4096]
+
+    # Join CLIP features on the channel axis, zero-pad to the T5 width, then append T5 on the sequence axis
+    clip_hidden = np.concatenate([clip_l_hidden, clip_g_hidden], axis=-1)  # [1, 77, 2048]
+    clip_padded = np.pad(clip_hidden, ((0, 0), (0, 0), (0, t5_hidden.shape[-1] - clip_hidden.shape[-1])))  # [1, 77, 4096]
+    encoder_hidden_states = np.concatenate([clip_padded, t5_hidden], axis=1)  # [1, 333, 4096]
+    pooled_projections = np.concatenate([clip_l_pooled, clip_g_pooled], axis=-1)  # [1, 2048]
+
+    return encoder_hidden_states.astype(np.float32), pooled_projections.astype(np.float32)
+
+
+def denoise(
+    onnx_dir: str,
+    encoder_hidden_states: np.ndarray,
+    pooled_projections: np.ndarray,
+    scheduler: FlowMatchEulerDiscreteScheduler,
+    latent_shape: tuple = (1, 16, 64, 64),
+    seed: int = 42,
+) -> torch.Tensor:
+    """Run the denoising loop using the ONNX transformer."""
+    sess = ort.InferenceSession(os.path.join(onnx_dir, "transformer", "model.onnx"))
+
+    torch.manual_seed(seed)
+    latents = torch.randn(latent_shape)
+
+    for i, t in enumerate(scheduler.timesteps):
+        noise_pred = sess.run(
+            None,
+            {
+                "sample": latents.numpy(),
+                "timestep": np.array([t.item()], dtype=np.int64),
+                "encoder_hidden_states": encoder_hidden_states,
+                "pooled_projections": pooled_projections,
+            },
+        )[0]
+        latents = scheduler.step(torch.from_numpy(noise_pred), t, latents, return_dict=False)[0]
+        if i % 7 == 0:
+            print(f"  Step {i}/{len(scheduler.timesteps)}, t={t.item():.1f}")
+
+    return latents
+
+
+def decode_latents(latents: torch.Tensor, onnx_dir: str) -> np.ndarray:
+    """Decode latents to image using the ONNX VAE decoder."""
+    sess = ort.InferenceSession(os.path.join(onnx_dir, "vae_decoder", "model.onnx"))
+
+    # SD3 VAE scaling: latents / scaling_factor + shift_factor
+    # SD3 defaults: scaling_factor=1.5305, shift_factor=0.0609
+    scaling_factor = 1.5305
+    shift_factor = 0.0609
+    latents_scaled = latents / scaling_factor + shift_factor
+
+    output = sess.run(None, {"latent_sample": latents_scaled.numpy()})[0]
+    # output: [1, 3, H, W] in [-1, 1]
+    image = (output / 2 + 0.5).clip(0, 1)
+    image = np.transpose(image[0], (1, 2, 0))  # [H, W, 3]
+    return (image * 255).astype(np.uint8)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="SD3 all-ONNX inference")
+    parser.add_argument("--prompt", default="A photo of a cat sitting on a windowsill")
+    parser.add_argument("--steps", type=int, default=28)
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument("--output", default="sd3_output.png")
+    parser.add_argument("--model_id", default=MODEL_ID)
+    parser.add_argument("--onnx_dir", default=ONNX_DIR)
+    args = parser.parse_args()
+
+    # Verify the exported model exists; exit non-zero so callers can detect the failure
+    transformer_path = os.path.join(args.onnx_dir, "transformer", "model.onnx")
+    if not os.path.exists(transformer_path):
+        print(f"Error: ONNX model not found at {transformer_path}")
+        print("Run: olive capture-onnx-graph --model_name_or_path "
+              f"{args.model_id} "
+              f"--use_mobius_builder --output_path {args.onnx_dir}")
+        raise SystemExit(1)
+
+    print(f"Prompt: {args.prompt}")
+    print(f"Steps: {args.steps}, Seed: {args.seed}")
+    print(f"ONNX dir: {args.onnx_dir}")
+
+    print("\n1. Encoding text (ONNX CLIP-L + CLIP-G + T5-XXL)...")
+    encoder_hidden_states, pooled_projections = encode_text(args.prompt, args.onnx_dir, args.model_id)
+    print(f"   encoder_hidden_states: {encoder_hidden_states.shape}")
+    print(f"   pooled_projections: {pooled_projections.shape}")
+
+    print("\n2. Denoising (ONNX SD3 transformer)...")
+    scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(args.model_id, subfolder="scheduler")
+    scheduler.set_timesteps(args.steps)
+    latents = denoise(args.onnx_dir, encoder_hidden_states, pooled_projections, scheduler, seed=args.seed)
+
+    print("\n3. Decoding latents (ONNX VAE decoder)...")
+    image = decode_latents(latents, args.onnx_dir)
+    Image.fromarray(image).save(args.output)
+    print(f"\nSaved: {args.output} ({image.shape[1]}x{image.shape[0]})")
+
+
+if __name__ == "__main__":
+    main()
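
`sd3_inference.py` opens five ONNX components (`text_encoder`, `text_encoder_2`, `text_encoder_3`, `transformer`, `vae_decoder`) but only verifies the transformer up front. A minimal sketch of a fuller pre-flight check, assuming the `<onnx_dir>/<component>/model.onnx` layout shown in the README; `missing_components` and `REQUIRED_COMPONENTS` are hypothetical names, not part of the script.

```python
import os
import tempfile

# Components that sd3_inference.py opens with onnxruntime.
REQUIRED_COMPONENTS = (
    "text_encoder",
    "text_encoder_2",
    "text_encoder_3",
    "transformer",
    "vae_decoder",
)


def missing_components(onnx_dir: str, required=REQUIRED_COMPONENTS) -> list[str]:
    """Return required components whose <onnx_dir>/<name>/model.onnx is absent."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(onnx_dir, name, "model.onnx"))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate an export that only produced the transformer.
        os.makedirs(os.path.join(tmp, "transformer"))
        open(os.path.join(tmp, "transformer", "model.onnx"), "wb").close()
        print(missing_components(tmp))
```

Calling this before any session is created would report every absent component at once, rather than failing partway through text encoding.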