feat(vllm): add Gemma 4 models, image, and ROCm serving recipes #144
coketaste wants to merge 8 commits into
coketaste commented on Apr 14, 2026
- Register pyt_vllm_gemma-4-26b-a4b-it and pyt_vllm_gemma-4-31b-it in models.json (gemma4 Docker stack).
- Add docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile from vllm/vllm-openai-rocm:gemma4 with transformers 5.5.0.
- Extend scripts/vllm/configs/default.yaml with Gemma 4 serving blocks (TRITON_ATTN, gfx942 float16; 26B MoE disables AITER fused MoE).
- Quote JSON-like extra_args in run_vllm.py (shlex) for --limit-mm-per-prompt with existing --flag YAML keys.
- Document Gemma 4 in benchmark/vllm/README.md.
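The quoting change matters because a JSON-like value contains spaces and double quotes that a shell would otherwise split and strip. The sketch below illustrates this with `shlex.split`, which approximates POSIX shell tokenization. The JSON value and the `MODEL` placeholder are hypothetical and are not taken from `default.yaml`.

```python
import shlex

# Hypothetical value for illustration; the real recipe values live in default.yaml.
flag = "--limit-mm-per-prompt"
value = '{"image": 2, "audio": 0}'

unquoted = f"vllm serve MODEL {flag} {value}"
quoted = f"vllm serve MODEL {flag} {shlex.quote(value)}"

# shlex.split approximates how a POSIX shell would tokenize each command string.
print(shlex.split(unquoted)[4:])  # the JSON is split into fragments and loses its quotes
print(shlex.split(quoted)[4:])    # the JSON arrives as a single, intact argument
```

With quoting, the value survives as one argument that the receiving process can parse as JSON; without it, the value is broken into several tokens.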
Pull request overview
Adds Gemma 4 (26B-A4B-it and 31B-it) vLLM serving support to the MAD benchmarking stack, including new model registrations, ROCm/Gemma4 Docker build plumbing, and documented serving recipes.
Changes:
- Registered two Gemma 4 vLLM models in `models.json` and documented them in `benchmark/vllm/README.md`.
- Added a Gemma4-specific AMD Ubuntu Dockerfile based on `vllm/vllm-openai-rocm:gemma4` and extended `scripts/vllm/configs/default.yaml` with Gemma 4 serving recipes/overrides.
- Updated `scripts/vllm/run_vllm.py` to shell-quote JSON-like/whitespace-containing `extra_args` values (notably `--limit-mm-per-prompt`).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/vllm/run_vllm.py | Adjusts extra_args formatting/quoting when composing the vLLM command line. |
| scripts/vllm/configs/default.yaml | Adds Gemma 4 serving benchmark blocks and gfx942 dtype overrides. |
| models.json | Registers Gemma 4 vLLM models and their MAD metadata/output CSV names. |
| docker/pyt_vllm_gemma4.ubuntu.amd.Dockerfile | Introduces a Gemma4-tagged base image Dockerfile and pins transformers. |
| benchmark/vllm/README.md | Documents Gemma 4 image tag usage, gating/token requirements, and recipe details. |
…rmers>=5.5.0 Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…artial) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Pull request overview
Adds Gemma 4 vLLM benchmark/serving support and hardens run_vllm.py extra-args handling to better support JSON-like and shell-metacharacter-containing values.
Changes:
- Registers Gemma 4 models in `models.json` and adds serving recipes to `scripts/vllm/configs/default.yaml`.
- Updates the shared vLLM AMD Dockerfile to a newer upstream vLLM base image and installs newer Transformers.
- Switches `run_vllm.py` to shell-quote extra arg values and adds a new test module + README updates describing Gemma 4 usage.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/vllm/test_run_vllm_extra_args.py | Adds tests intended to validate extra-args quoting behavior. |
| scripts/vllm/run_vllm.py | Quotes extra args via shlex.quote(str(v)) when building shell command strings. |
| scripts/vllm/configs/default.yaml | Adds Gemma 4 serving blocks (TRITON_ATTN, gfx942 float16 overrides; MoE AITER disable for 26B-A4B). |
| models.json | Registers pyt_vllm_gemma-4-26b-a4b-it and pyt_vllm_gemma-4-31b-it. |
| docker/pyt_vllm.ubuntu.amd.Dockerfile | Bumps vLLM base image version and installs newer Transformers in the shared vLLM stack. |
| benchmark/vllm/README.md | Documents Gemma 4 images/recipes and updates the available-models list. |
- Remove redundant pip install transformers (v0.20.0 ships with v5)
- Delete test_run_vllm_extra_args.py (duplicated inline logic)
- Remove --async-scheduling from Gemma 4 configs (on by default)
- Enable concurrency 32/128 for gemma-4-26B-A4B-it
- Update README to reflect v0.20.0 as the standard base image

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…helper
- Fix ENTRYPOINT [""] → ENTRYPOINT [] to properly clear upstream entrypoint
- Skip bool False flags instead of emitting them on the command line
- Extract build_extra_args_str() as importable module-level function
- Rewrite tests to import and exercise the real production code path

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
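A minimal sketch of what a helper like `build_extra_args_str()` could look like, based only on the behavior described in this PR (values passed through `shlex.quote(str(v))`, `False` flags skipped). The function body, the handling of `True` as a bare flag, and the example values are assumptions; the actual code in `scripts/vllm/run_vllm.py` may differ.

```python
import shlex


def build_extra_args_str(extra_args):
    """Sketch: render a {flag: value} mapping as a shell-safe argument string."""
    parts = []
    for flag, value in extra_args.items():
        if isinstance(value, bool):
            if value:
                # Assumption: True is emitted as a bare flag with no value.
                parts.append(flag)
            # False is skipped rather than emitted on the command line.
            continue
        # Non-boolean values are stringified and shell-quoted.
        parts.append(f"{flag} {shlex.quote(str(value))}")
    return " ".join(parts)


# Hypothetical config values; keys already carry the leading "--" as in the YAML.
example = {
    "--limit-mm-per-prompt": '{"image": 2}',
    "--max-model-len": 8192,
    "--enforce-eager": True,
    "--enable-prefix-caching": False,
}
print(build_extra_args_str(example))
# --limit-mm-per-prompt '{"image": 2}' --max-model-len 8192 --enforce-eager
```

Keeping the helper at module level, as the commit describes, lets the tests import it directly instead of duplicating the logic inline.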
Pull request overview
This PR extends the vLLM benchmarking/serving integration to support Google Gemma 4 models by registering new MAD model entries, adding serving recipes, tightening CLI extra-arg quoting in the vLLM runner, and updating the vLLM Docker base tag and documentation accordingly.
Changes:
- Register Gemma 4 models in `models.json` and document usage/requirements in `benchmark/vllm/README.md`.
- Add Gemma 4 serving recipes (including AITER/MoE and gfx942 overrides) to `scripts/vllm/configs/default.yaml`.
- Refactor/strengthen `extra_args` shell-quoting via `build_extra_args_str()` with unit tests; bump vLLM ROCm base image tag and clear the entrypoint in the shared vLLM Dockerfile.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/vllm/test_run_vllm_extra_args.py | Adds unit tests for build_extra_args_str() to validate quoting/flag behavior. |
| tests/vllm/__init__.py | Initializes the tests.vllm package (empty). |
| scripts/vllm/run_vllm.py | Introduces build_extra_args_str() using shlex.quote and switches main config processing to use it. |
| scripts/vllm/configs/default.yaml | Adds Gemma 4 serving config blocks with TRITON_ATTN, gfx942 float16 override, and MoE/AITER controls. |
| models.json | Registers pyt_vllm_gemma-4-26b-a4b-it and pyt_vllm_gemma-4-31b-it. |
| docker/pyt_vllm.ubuntu.amd.Dockerfile | Updates base image to v0.20.0 and clears the entrypoint via ENTRYPOINT []. |
| benchmark/vllm/README.md | Updates vLLM version/tag references and documents Gemma 4 models + recipes and required env vars. |
@@ -485,12 +499,7 @@ def main():
     env_vars = config.get("env", {})
     extra_args = config.get("extra_args", {})
     env_vars_str = " ".join(f"{k}={v}" for k, v in env_vars.items())
@@ -36,4 +36,4 @@ WORKDIR $WORKSPACE_DIR
 RUN pip3 list

 # Specify entrypoint to override upstream
-ENTRYPOINT [""]
+ENTRYPOINT []
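The difference between the two `ENTRYPOINT` forms can be seen by reading them as JSON arrays, which is how the Dockerfile exec form is written. The snippet below only illustrates the data difference. How a given container runtime treats a one-element array holding an empty string is not stated in this PR, so no runtime behavior is claimed here.

```python
import json

# The exec form of ENTRYPOINT is written as a JSON array.
old_form = json.loads('[""]')  # one element: an empty string
new_form = json.loads('[]')    # no elements at all

print(old_form, len(old_form))  # [''] 1
print(new_form, len(new_form))  # [] 0
```

The old form still names one (empty) executable, whereas the empty array carries no entrypoint at all, which matches the commit's stated goal of clearing the upstream entrypoint.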