Plumbing and core MoE logic for router replay by xuefgu · Pull Request #3881 · AI-Hypercomputer/maxtext

xuefgu · 2026-05-12T04:09:33Z

Description

This PR introduces forced expert routing in MaxText’s MoE blocks to support RL training (e.g., GRPO). In this pipeline, completions are generated on vLLM, and their exact expert routing selections must be strictly enforced during training on MaxText to prevent expert selection mismatch and policy gradient divergence.

The externally supplied 4D routing tensor ([bs, seq_len, num_layers, k]) enters via TunixMaxTextAdapter and the top-level Transformer model. It is then sliced layer-by-layer inside decoders.py (Linen) or nnx_decoders.py (NNX) and routed down through model-specific decoder layers (DeepSeek, Qwen3, Qwen3.5, Gemma4, and Mixtral) into the core MoE block (moe.py), where it overrides the model-determined expert selections.
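The per-layer slicing described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `slice_layer_routing` and the variable `forced_routing` are hypothetical, and only the `[bs, seq_len, num_layers, k]` layout comes from the description.

```python
import jax.numpy as jnp


def slice_layer_routing(forced_routing, layer_idx):
  """Returns the [bs, seq_len, k] expert indices for one decoder layer.

  `forced_routing` is assumed to have shape [bs, seq_len, num_layers, k],
  matching the layout given in the PR description. The function name is
  hypothetical and used only for illustration.
  """
  return forced_routing[:, :, layer_idx, :]


# Toy tensor: 2 sequences, 3 tokens, 4 layers, top-2 routing.
bs, seq_len, num_layers, k = 2, 3, 4, 2
forced_routing = jnp.arange(
    bs * seq_len * num_layers * k, dtype=jnp.int32
).reshape(bs, seq_len, num_layers, k)

# An unscanned decoder loop would hand each layer its own slice.
per_layer = [slice_layer_routing(forced_routing, i) for i in range(num_layers)]
```

Slicing on a Python integer like this only fits an unrolled layer loop, which matches the PR's note that `scan_layers=True` is left to a follow-up.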

The implementation also guards against out-of-bounds indices by mapping vLLM's padding sentinel values (-1) to valid, evenly distributed dummy expert indices before calling JAX bincount in permute(), which keeps the expert load balanced across hardware and avoids a performance penalty. It also preserves backpropagation to the router weights by manually extracting the gating logits for the forced selections.
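A rough sketch of both ideas, under stated assumptions: the function names `replace_padding` and `forced_gate_weights` are hypothetical, the round-robin rule for choosing dummy experts is one possible way to spread padding evenly, and the softmax over only the selected logits is an assumption (models differ in how they normalize gate weights). The PR's actual logic in moe.py may differ.

```python
import jax
import jax.numpy as jnp


def replace_padding(forced_idx, num_experts):
  """Maps -1 padding sentinels to valid expert ids in round-robin order.

  forced_idx: [bs, seq_len, k] int array, where -1 marks padding.
  Round-robin by flat position is an assumed scheme, chosen so that
  padded slots spread evenly over experts instead of piling onto one.
  """
  flat_pos = jnp.arange(forced_idx.size, dtype=forced_idx.dtype)
  dummy = flat_pos.reshape(forced_idx.shape) % num_experts
  return jnp.where(forced_idx < 0, dummy, forced_idx)


def forced_gate_weights(router_logits, safe_idx):
  """Gathers router logits at the forced experts and normalizes them.

  router_logits: [bs, seq_len, num_experts]; safe_idx: [bs, seq_len, k].
  The gather is differentiable with respect to router_logits, so
  gradients still reach the router even though the selection is fixed.
  """
  selected = jnp.take_along_axis(router_logits, safe_idx, axis=-1)
  return jax.nn.softmax(selected, axis=-1)


num_experts = 4
forced_idx = jnp.array([[[0, 3], [-1, -1]]], dtype=jnp.int32)  # token 1 is padding
safe_idx = replace_padding(forced_idx, num_experts)

# Every index is now in range, so a fixed-length bincount is well defined.
counts = jnp.bincount(safe_idx.reshape(-1), length=num_experts)

# Gradient of a toy loss with respect to the router logits.
router_logits = jnp.zeros((1, 2, num_experts))
grads = jax.grad(lambda l: forced_gate_weights(l, safe_idx)[..., 0].sum())(
    router_logits
)
```

In this toy case the gradient is nonzero only at the forced expert positions, which is the property the description relies on. The sketch does not mask the outputs of padded tokens; that is assumed to happen elsewhere in the training loss.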

This PR contains the core plumbing, edge case handling, and unscanned decoder loop support. A follow-up PR will support forced expert routing for scan_layers=True.

Tests

Unit tests.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-05-12T04:38:20Z

Codecov Report

❌ Patch coverage is 57.85124% with 51 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/layers/moe.py	66.66%	11 Missing and 6 partials ⚠️
src/maxtext/layers/nnx_decoders.py	30.43%	9 Missing and 7 partials ⚠️
src/maxtext/layers/decoders.py	31.81%	8 Missing and 7 partials ⚠️
...t/integration/vllm/maxtext_vllm_adapter/adapter.py	0.00%	1 Missing ⚠️
src/maxtext/layers/nnx_wrappers.py	93.33%	0 Missing and 1 partial ⚠️
src/maxtext/models/gemma4.py	0.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

shralex · 2026-05-12T15:42:00Z

@gemini-cli /investigate

github-actions · 2026-05-12T15:42:18Z

🤖 Hi @shralex, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-05-12T15:44:59Z

🤖 I'm sorry @shralex, but I was unable to process your request. Please see the logs for more details.

xuefgu force-pushed the xfgu-router-replay branch from 11a4a0c to 5877c95 Compare May 12, 2026 04:32

xuefgu force-pushed the xfgu-router-replay branch 10 times, most recently from e090d93 to 7c33eb9 Compare May 13, 2026 18:39

Plumbing and core MoE logic for router replay

13280fd

xuefgu force-pushed the xfgu-router-replay branch from 7c33eb9 to 13280fd Compare May 13, 2026 18:44

xuefgu marked this pull request as ready for review May 13, 2026 20:18

xuefgu requested review from RissyRan, bvandermoon, gagika, gobbleturk, jesselu-google, jiangjy1982, michelle-yooh, parambole, richjames0, shralex, shuningjin and suexu1025 as code owners May 13, 2026 20:18

xuefgu requested review from A9isha, NicoGrande, NuojCheng, SurbhiJainUSC, abhinavclemson, aireenmei, dipannita08, hengtaoguo, igorts-git, khatwanimohit and vipannalla as code owners May 13, 2026 20:18

xuefgu wants to merge 1 commit into main from xfgu-router-replay
