Add Gemma 4 12B Unified Olive recipe (mobius) #503
Open
justinchuby wants to merge 3 commits into
Add an Olive recipe for google/gemma-4-12B-it, the encoder-free Unified member of the Gemma 4 family. Mirrors the existing gemma-4-E2B-it recipe: exports to ONNX via the MobiusBuilder pass and optionally quantizes the decoder with K-Quant (Q4_K_M) INT4. Four configs cover CPU (fp32, int4) and CUDA (fp16, int4). Includes info.yml, requirements, README documenting the encoder-free 4-component pipeline (decoder, embedding, encoder-free vision/audio embedders), plus eval.py (MMLU Pro 77.2% / GPQA Diamond 78.8% reference scores) and a text inference script.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
Pull request overview
Adds a new Olive “mobius” recipe set for Gemma 4 12B Unified (google/gemma-4-12B-it), mirroring the existing Gemma 4 E2B recipe structure to export ORT GenAI components via MobiusBuilder and optionally apply K-Quant INT4.
Changes:
- Introduces CPU (fp32/int4) and CUDA (fp16/int4) Olive configs targeting Mobius export + optional K-Quant.
- Adds runnable helper scripts for ORT GenAI inference (inference.py) and lm-eval-harness evaluation (eval.py).
- Adds recipe metadata (info.yml), documentation (README.md), and dependency list (requirements.txt).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| google-gemma-4-12B-it/requirements.txt | Declares Python deps for running eval/inference helpers. |
| google-gemma-4-12B-it/README.md | Documents model/recipe intent, build steps, inference, and evaluation usage. |
| google-gemma-4-12B-it/LICENSE | Adds the recipe folder license text. |
| google-gemma-4-12B-it/info.yml | Registers recipe metadata for repo scanning and indexing. |
| google-gemma-4-12B-it/inference.py | Provides ORT GenAI text inference CLI for produced model packages. |
| google-gemma-4-12B-it/eval.py | Provides lm-eval-harness evaluation CLI for produced model packages. |
| google-gemma-4-12B-it/cpu/fp32/config.json | CPU fp32 MobiusBuilder config. |
| google-gemma-4-12B-it/cpu/int4/config.json | CPU fp32 MobiusBuilder + K-Quant INT4 config. |
| google-gemma-4-12B-it/cuda/fp16/config.json | CUDA fp16 MobiusBuilder config with CUDA EP target. |
| google-gemma-4-12B-it/cuda/int4/config.json | CUDA fp16 MobiusBuilder + K-Quant INT4 config with CUDA EP target. |
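The device/precision layout in the table above can be sketched as a small lookup. This is only an illustration of how the four configs relate to each other: the pass names (MobiusBuilder, OnnxKQuantQuantization) and folder paths come from this PR, while `RECIPES`, `EXPORT_PRECISION`, and `describe_recipe` are hypothetical names that do not exist in the recipe.

```python
from pathlib import Path

# Pass chain per (device, precision), as listed in this PR.
# The dictionary itself is an illustrative structure, not part of the recipe.
RECIPES = {
    ("cpu", "fp32"): ["MobiusBuilder"],
    ("cpu", "int4"): ["MobiusBuilder", "OnnxKQuantQuantization"],
    ("cuda", "fp16"): ["MobiusBuilder"],
    ("cuda", "int4"): ["MobiusBuilder", "OnnxKQuantQuantization"],
}

# Precision used for the MobiusBuilder export on each device (per the table).
EXPORT_PRECISION = {"cpu": "fp32", "cuda": "fp16"}


def describe_recipe(device, precision, root="google-gemma-4-12B-it"):
    """Return the config path and pass chain for one recipe variant."""
    passes = RECIPES[(device, precision)]
    return {
        "config": (Path(root) / device / precision / "config.json").as_posix(),
        "export_precision": EXPORT_PRECISION[device],
        "passes": passes,
        "quantized": "OnnxKQuantQuantization" in passes,
    }


if __name__ == "__main__":
    for key in RECIPES:
        print(describe_recipe(*key))
```

Note that the int4 variants keep the export precision of their device (fp32 on CPU, fp16 on CUDA) and add quantization as a second pass.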
- Resolve model dirs relative to __file__ in eval.py/inference.py so the scripts work from any working directory (not just the recipe folder).
- Align eval.py usage docstring with the default task key (leaderboard_mmlu_pro).
- README: use block_size=32 (matching the JSON configs) and drop the redundant explicit pip install in favor of requirements.txt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
    from pathlib import Path

    # Register Olive's ORT GenAI evaluator with lm-eval
    import olive.evaluator.lmeval_ort  # noqa: F401
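The commit above resolves model directories relative to `__file__` rather than the current working directory. A minimal sketch of that idea follows, assuming a helper named `resolve_model_dir` and a relative path like `cuda/int4/model`; both are hypothetical and may not match what eval.py and inference.py actually do.

```python
from pathlib import Path


def resolve_model_dir(model_dir, script_file):
    """Resolve model_dir against the folder containing script_file.

    Absolute paths are returned unchanged; relative paths are anchored at
    the script's own directory instead of the current working directory.
    """
    path = Path(model_dir)
    if path.is_absolute():
        return path
    return Path(script_file).resolve().parent / path


if __name__ == "__main__":
    # "cuda/int4/model" is a placeholder output directory, not taken from the PR.
    print(resolve_model_dir("cuda/int4/model", __file__))
```

With this pattern, running the script from the repository root or from inside the recipe folder yields the same model path.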
For the encoder-free gemma4_unified architecture, each of the vision and audio 'encoders' is a single projector MatMul that forms the entire image/audio embedding pathway. Quantizing it to INT4 injects disproportionate error (measured rel-L2 ~3.7% vision / ~9.2% audio) while the components are tiny (~76 MB / ~1.4 MB), so keeping them FP16 costs almost nothing. Exclude them via nodes_to_exclude: ['*/projector/*'].

The decoder (including lm_head) stays INT4, where the size savings live and INT4 has negligible impact on output tokens (top-1 logit agreement ~100%, KL ~0.004).

The glob form of nodes_to_exclude requires microsoft/Olive#2518; with older Olive the pattern matches nothing and projectors are quantized as before.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
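To illustrate the difference between glob and literal matching for nodes_to_exclude, here is a small sketch using Python's standard `fnmatch` module. It is not Olive's implementation: the node names are invented for the example, `split_nodes` is a hypothetical helper, and modelling the older behaviour as an exact string comparison is an assumption based only on the statement that the pattern "matches nothing" without microsoft/Olive#2518.

```python
from fnmatch import fnmatchcase


def split_nodes(node_names, patterns, use_glob=True):
    """Partition node names into (excluded, quantized) lists.

    use_glob=True  : shell-style glob matching (stand-in for the newer behaviour)
    use_glob=False : exact string matching (assumed stand-in for the older behaviour)
    """
    excluded, quantized = [], []
    for name in node_names:
        if use_glob:
            hit = any(fnmatchcase(name, pattern) for pattern in patterns)
        else:
            hit = name in patterns
        (excluded if hit else quantized).append(name)
    return excluded, quantized


# Hypothetical ONNX node names; the real graph may name its nodes differently.
NODES = [
    "/model/layers.0/attn/q_proj/MatMul",
    "/vision/projector/MatMul",
    "/audio/projector/MatMul",
    "/lm_head/MatMul",
]

if __name__ == "__main__":
    print(split_nodes(NODES, ["*/projector/*"], use_glob=True))
    print(split_nodes(NODES, ["*/projector/*"], use_glob=False))
```

Under glob matching only the two projector nodes are kept out of quantization; under exact matching the pattern string equals no node name, so every node, projectors included, would be quantized.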
Summary
Adds an Olive recipe for google/gemma-4-12B-it, the encoder-free Unified member of the Gemma 4 family, mirroring the existing google-gemma-4-E2B-it recipe.

The model is exported to ONNX via the MobiusBuilder pass and optionally quantized with K-Quant (Q4_K_M) INT4. Gemma 4 12B Unified projects raw image patches and audio waveform features directly into the decoder embedding space (no dedicated encoders), but the mobius pipeline still emits four ORT GenAI components: decoder, embedding, and encoder-free vision_encoder / audio_encoder embedders.

Recipes
- cpu/fp32/config.json: MobiusBuilder (fp32)
- cpu/int4/config.json: MobiusBuilder (fp32) → OnnxKQuantQuantization
- cuda/fp16/config.json: MobiusBuilder (fp16)
- cuda/int4/config.json: MobiusBuilder (fp16) → OnnxKQuantQuantization

Contents
- info.yml, requirements.txt, README.md
- eval.py (lm-eval; reference MMLU Pro 77.2%, GPQA Diamond 78.8%) and inference.py (ORT GenAI text inference)

Validation
- info.yml parse; the repo scanner groups the 4 recipes correctly by arch/ep/device.
- eval.py / inference.py byte-compile.
- […] (model_type: gemma4_unified, ungated); supported by mobius Gemma4UnifiedModel / Gemma4UnifiedTask.

Generated model artifacts and .olive-cache are intentionally not committed (gitignored), matching other large-model recipes (e.g. Qwen3-14B).

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com