Support sharded target logits for EAGLE3 online training by Yukino256 · Pull Request #558 · sgl-project/SpecForge

Yukino256 · 2026-05-25T03:41:19Z

Motivation

This PR adds a memory-efficient target-logits path for EAGLE3 online training with the SGLang target backend.

Previously, online EAGLE3 training materialized full target logits on every TP rank after SGLang's tensor-parallel logits all-gather. For long-context training, this creates large redundant logits tensors and can lead to OOM. This becomes especially problematic when enabling sequence parallel training, where the draft model only needs the local sequence shard instead of full-sequence logits on every rank.
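To put rough numbers on the redundancy described above, here is a back-of-the-envelope estimate. It is only a sketch: the vocabulary size (129,280, taken as the DeepSeek-V3 tokenizer size), the 2-byte element width, and a batch of one sequence are assumptions, not values stated in this PR.

```python
# Rough size of the target-logits tensor for one long sequence.
# Assumptions (not stated in the PR): vocab size 129_280, bf16 logits
# (2 bytes per element), batch size 1.

def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_elem: int) -> int:
    """Bytes needed for a [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_elem

SEQ_LEN = 32768          # matches --max-length in the tested configuration
VOCAB_SIZE = 129_280     # assumption
BYTES_PER_ELEM = 2       # assumption: bf16
NUM_RANKS = 8            # matches --tp-size / --sp-ulysses-size in the test

# Full-sequence, full-vocab logits held on every rank (old behavior).
full = logits_bytes(SEQ_LEN, VOCAB_SIZE, BYTES_PER_ELEM)
# Local-sequence, full-vocab logits held per rank (sharded behavior).
local = logits_bytes(SEQ_LEN // NUM_RANKS, VOCAB_SIZE, BYTES_PER_ELEM)

print(f"full logits per rank:    {full / 2**30:.2f} GiB")   # 7.89 GiB
print(f"sharded logits per rank: {local / 2**30:.2f} GiB")  # 0.99 GiB
```

Under these assumptions the per-rank logits footprint drops by a factor equal to the number of sequence shards. Logits kept in fp32 would double both figures.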

Modifications

- Add --shard-target-logits to scripts/train_eagle3.py.
- For SGLang target models, optionally disable SGLang's target-logits TP all-gather and redistribute logits explicitly.
- Add two sharded target-logits redistribution paths:
  - Non-SP online training: redistribute from full-batch/local-vocab logits to local-batch/full-vocab logits.
  - USP online training: redistribute from full-sequence/local-vocab logits to local-sequence/full-vocab logits.
- Support online USP training with tp_size >= sp_size when tp_size % sp_size == 0.
- Align target hidden states, input IDs, attention masks, loss masks, and position IDs with the local USP sequence shard.
- Keep the original behavior unchanged unless --shard-target-logits is enabled.
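The USP redistribution in the list above (full-sequence/local-vocab to local-sequence/full-vocab) can be illustrated with a single-process simulation. This is a sketch only: it uses plain Python lists instead of torch.distributed, assumes the number of vocab shards equals the number of sequence shards, and the helper name simulate_redistribute is hypothetical rather than a function from this PR.

```python
# Single-process simulation of the layout change performed by an all-to-all:
# before, rank r holds every token's logits for its vocab slice;
# after,  rank r holds its token slice's logits for the full vocab.

def simulate_redistribute(vocab_shards):
    """vocab_shards[r] is rank r's [seq_len][local_vocab] logits (nested lists).

    Returns out where out[r] is rank r's [local_seq][full_vocab] logits.
    Assumes seq_len is divisible by the number of ranks.
    """
    world_size = len(vocab_shards)
    seq_len = len(vocab_shards[0])
    assert seq_len % world_size == 0, "sequence must split evenly across ranks"
    chunk = seq_len // world_size

    out = []
    for dst in range(world_size):
        rows = []
        for t in range(dst * chunk, (dst + 1) * chunk):
            # Concatenate this token's vocab slices from every source rank,
            # in rank order, to rebuild the full-vocab row.
            row = []
            for src in range(world_size):
                row.extend(vocab_shards[src][t])
            rows.append(row)
        out.append(rows)
    return out

# Toy example: 4 tokens, vocab of 6, 2 ranks. Entry value encodes (token, vocab id).
full_logits = [[t * 10 + v for v in range(6)] for t in range(4)]
shards = [
    [row[0:3] for row in full_logits],  # rank 0: all tokens, vocab ids 0..2
    [row[3:6] for row in full_logits],  # rank 1: all tokens, vocab ids 3..5
]
local = simulate_redistribute(shards)
```

After the call, local[0] holds tokens 0 and 1 with all six vocab entries, and local[1] holds tokens 2 and 3. The non-SP path has the same shape with the batch dimension in place of the sequence dimension.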

Related Issues

N/A

Accuracy Test

Tested EAGLE3 online training on DeepSeek-V3 with the SGLang target backend.

Configuration:

--target-model-backend sglang
--tp-size 8
--sp-ring-size 1
--sp-ulysses-size 8
--attention-backend usp
--max-length 32768
--sglang-mem-fraction-static 0.8
--shard-target-logits
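As a sanity check on this configuration, the tp_size / sp_size constraint listed under Modifications can be written as a small validation helper. This is a hypothetical sketch, not the check used in the PR. It assumes sp_size is the product of the ring and Ulysses sizes, which is consistent with the sp_size // sp_ring_size expression in the reviewed diff.

```python
# Hypothetical validation helper for the constraint
# "tp_size >= sp_size when tp_size % sp_size == 0".

def check_usp_layout(tp_size: int, sp_ring_size: int, sp_ulysses_size: int) -> int:
    """Validate the TP/SP layout and return the number of TP ranks per SP rank."""
    # Assumption: total sequence-parallel size is ring size times Ulysses size.
    sp_size = sp_ring_size * sp_ulysses_size
    if tp_size < sp_size or tp_size % sp_size != 0:
        raise ValueError(
            f"tp_size={tp_size} must be a multiple of sp_size={sp_size}"
        )
    return tp_size // sp_size

# The tested configuration: --tp-size 8, --sp-ring-size 1, --sp-ulysses-size 8.
ranks_per_shard = check_usp_layout(tp_size=8, sp_ring_size=1, sp_ulysses_size=8)
print(ranks_per_shard)  # 1
```

With tp_size=8 and sp_size=4 the helper returns 2, meaning two TP ranks would map to each sequence shard.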

The 32K training run proceeds without OOM. Training loss decreases normally and training accuracy increases normally.

Training loss / accuracy curves:

Downstream acceptance length on the related evaluation data: 2.53. This is roughly consistent with our previous non-SP training result.

SGLang Inference/evaluation Logs:

Benchmark & Profiling

This PR is primarily intended to reduce target-logits memory usage during EAGLE3 online training.

Observed behavior:

Before this change, target logits were materialized redundantly on every TP rank.
With --shard-target-logits, each rank only keeps the target logits needed by its local batch shard or local USP sequence shard.
In the tested DeepSeek-V3 32K online USP setup, training runs successfully with --sglang-mem-fraction-static 0.8.

No throughput benchmark is included yet.

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

gemini-code-assist

Code Review

This pull request implements support for sharding target logits and sequence parallelism (USP) during online data generation for EAGLE3 training. It introduces the --shard-target-logits argument and updates the SGLang backend to handle logit and hidden state sharding via all-to-all communication and sequence slicing. Review feedback identifies a critical error in the position_ids calculation for sequence parallel mode and highlights several performance bottlenecks in the sharding utility functions caused by inefficient tensor allocations and copying.

gemini-code-assist · 2026-05-25T03:44:08Z

+                    seq_len = input_id.shape[1]
+                    sp_ulysses_size = max(1, sp_size // sp_ring_size)
+                    usp_chunk_size = max(seq_len - ttt_length, 0)
+                    ring_chunk = usp_chunk_size * sp_ulysses_size
+                    ring_start = sp_ring_rank * ring_chunk
+                    kept_position_ids.append(
+                        torch.arange(
+                            ring_start,
+                            ring_start + ring_chunk,
+                            dtype=torch.long,
+                            device=input_id.device,
+                        ).unsqueeze(0)
+                    )


The logic for generating position_ids in sequence parallel mode is incorrect. It uses sp_ring_rank and a ring_chunk size that does not match the local sequence length seq_len. Specifically, ring_chunk is calculated as usp_chunk_size * sp_ulysses_size, which represents the length of the entire Ring chunk, but input_id is only a Ulysses shard of that chunk. This will cause a shape mismatch and incorrect position encodings during training. The position_ids should be based on the absolute offset start calculated from _sp_chunk_bounds using the rank's sp_rank and the local sequence length.

original_seq_len = input_ids.shape[1]
start, _ = _sp_chunk_bounds(original_seq_len, sp_rank, sp_size, ttt_length)
seq_len = input_id.shape[1]
kept_position_ids.append(
    torch.arange(
        start,
        start + seq_len,
        dtype=torch.long,
        device=input_id.device,
    ).unsqueeze(0)
)

Yukino256 · 2026-05-25

This is intentional and follows the existing offline USP preprocessing logic. In USP mode, position_ids are not local input slice positions. They need to match the Ulysses all-to-all expanded sequence length consumed by UspAdapter, i.e. (local_len - ttt_length) * sp_ulysses_size. The online path mirrors process_data_usp so that online and offline USP use consistent position_ids.
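For readers following this thread, the arithmetic in the reviewed diff can be restated without tensors. The function below is a plain-Python paraphrase of the quoted lines for illustration only. The name usp_position_ids is hypothetical, and the sketch does not settle which of the two readings above is correct.

```python
# Plain-Python restatement of the position_ids arithmetic from the diff.
# Returns the positions as a list instead of a torch tensor.

def usp_position_ids(local_seq_len, ttt_length, sp_size, sp_ring_size, sp_ring_rank):
    """Position ids covering one ring chunk after Ulysses all-to-all expansion."""
    sp_ulysses_size = max(1, sp_size // sp_ring_size)
    # Tokens in the local slice, excluding the trailing ttt_length tokens.
    usp_chunk_size = max(local_seq_len - ttt_length, 0)
    # Length after gathering the Ulysses shards of this ring chunk.
    ring_chunk = usp_chunk_size * sp_ulysses_size
    ring_start = sp_ring_rank * ring_chunk
    return list(range(ring_start, ring_start + ring_chunk))

# Toy values: local slice of 6 tokens, ttt_length 2, sp_size 4 split as
# ring 2 x Ulysses 2, evaluated on ring rank 1.
ids = usp_position_ids(local_seq_len=6, ttt_length=2, sp_size=4,
                       sp_ring_size=2, sp_ring_rank=1)
print(len(ids), ids[0], ids[-1])  # 8 8 15
```

With these toy values the result has length (6 - 2) * 2 = 8, which matches the expanded length (local_len - ttt_length) * sp_ulysses_size named in the reply, rather than the local slice length of 6.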

Yukino256 · 2026-05-25T03:52:26Z

Hi @jiapingW could you please help take a look?

This PR solves a similar issue as #524, but extends it to the online + USP + sharded target logits setting. It avoids redundant target logits across TP ranks and makes the SGLang target logits align with the local USP sequence shard.

I tested it on DeepSeek-V3 32K online training (tp=8, sp=8, mem_fraction_static=0.8). Training runs without OOM, loss/accuracy look normal, and the downstream acceptance length is 2.53, roughly consistent with our previous non-SP result.

Thanks!

jiapingW · 2026-05-26T00:12:55Z

Great. I’ll review it soon.

Yukino256 requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners May 25, 2026 03:41

gemini-code-assist Bot reviewed May 25, 2026

Yukino256 force-pushed the pr-online-usp-shard-target-logits branch from 226a1da to 64a52de · May 25, 2026 03:50

Support sharded target logits for EAGLE3 online training

44571b6

Yukino256 force-pushed the pr-online-usp-shard-target-logits branch from 64a52de to 44571b6 · May 25, 2026 03:51

Yukino256 wants to merge 1 commit into sgl-project:main from Yukino256:pr-online-usp-shard-target-logits
