fix: correct 9 critical/high bugs found via multi-agent audit by imitater-dou · Pull Request #563 · sgl-project/SpecForge

imitater-dou · 2026-05-26T11:03:28Z

Summary

This PR fixes 9 confirmed bugs discovered through a multi-agent static analysis of the codebase, all verified by direct source inspection before patching.

🔴 Runtime Crashes (will crash on first use)

specforge/core/loss.py — LogSoftmaxLoss.backward returned 5 values for 3 inputs (forward takes logits, target, position_mask). PyTorch autograd raises RuntimeError on every backward pass.
specforge/distributed.py — Gather.backward returned 6 values for 5 inputs (one extra trailing None). Crashes any distributed training run that uses sequence-parallel gather.

🔴 Silent Training Correctness Bugs

specforge/data/preprocessing.py — USP position_ids were identical for all Ulysses ranks within the same ring group (ring_start = ring_rank * ring_chunk ignores ulysses_rank). With sp_ulysses_size > 1, all Ulysses peers claimed the same global position ID range, producing wrong attention patterns. Fixed by offsetting by ulysses_rank * usp_chunk_size.
scripts/train_eagle3.py — draft_model.eval() was set at the start of the evaluation loop but never reversed. All training steps after the first eval checkpoint ran with dropout disabled. Also, save_interval was not scaled by draft_accumulation_steps (unlike log_interval and eval_interval), causing 4× more frequent saves than intended.
scripts/train_dflash.py — On resume, only the LR scheduler state was restored (optimizer.scheduler.load_state_dict); the AdamW optimizer's first/second moment estimates (exp_avg, exp_avg_sq) were silently discarded, producing a loss spike at every resume. Fixed by calling optimizer.load_state_dict(resume_state) which restores both.
specforge/lr_scheduler.py — WarmupDelayerScheduler.step() passed epoch - warmup_epochs to the after-scheduler, missing the delay_epochs subtraction. The cosine schedule started at a non-zero offset, causing a discontinuous LR jump after the warmup+delay phase.

🔴 Logic Bugs in Data Pipeline

specforge/data/parse.py — HarmonyParser used or instead of and in the first-message guard, making the condition always True. Every conversation unconditionally received the reasoning effort prefix, including those that already had a system message.
specforge/data/template.py — TemplateRegistry.register assert was logically inverted: assert (not override and ...) raised AssertionError whenever override=True was passed, making the override parameter non-functional.

🟠 High Severity

specforge/core/eagle3.py — _compute_metric_acc used loss_mask.sum() as the denominator while the numerator was masked by position_mask (which excludes out-of-vocabulary positions). This produced systematically deflated accuracy metrics throughout training. Fixed by using position_mask.sum() as denominator.
specforge/modeling/target/eagle3_target_model.py — padding(target_out, left=False) was called unconditionally even when target_out is None (the case when return_logits=False). This raised AttributeError on None.__getitem__. Fixed with a None guard.

Test plan

Verify LogSoftmaxLoss backward runs without RuntimeError on a minimal synthetic batch
Verify Gather.backward runs without RuntimeError in a 2-GPU USP setup
Verify process_data_usp produces distinct position_ids for each SP rank when sp_ulysses_size > 1
Verify training resumes correctly (loss continues from pre-resume level, no spike)
Verify draft_model.train() mode is active during training steps after an eval checkpoint
Verify TemplateRegistry.register(name, tpl, override=True) no longer raises
Verify HarmonyParser does not prepend reasoning effort prefix when the first message is a system message

gemini-code-assist · 2026-05-26T11:03:31Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Runtime crashes: - loss.py: LogSoftmaxLoss.backward returned 5 values for 3 inputs - distributed.py: Gather.backward returned 6 values for 5 inputs Training correctness: - preprocessing.py: USP position_ids were identical for all Ulysses ranks within a ring group; now offset by ulysses_rank to match each rank's distinct sequence slice - train_eagle3.py: draft_model.eval() was never reversed after evaluation loop, corrupting rest of epoch; also save_interval now scaled by draft_accumulation_steps to match log/eval interval semantics - train_dflash.py: resume now calls optimizer.load_state_dict() to restore both AdamW moment estimates and scheduler (was only restoring scheduler) - lr_scheduler.py: WarmupDelayerScheduler.step() subtracted only warmup_epochs; now subtracts both warmup_epochs and delay_epochs for correct after-scheduler offset Logic/data bugs: - parse.py: HarmonyParser condition was a tautology (or → and), unconditionally prepending reasoning effort prefix to every conversation - template.py: TemplateRegistry.register assert was inverted; override=True always raised AssertionError instead of allowing re-registration - eagle3.py: _compute_metric_acc denominator now uses position_mask.sum() instead of loss_mask.sum() to match the numerator's masked domain - eagle3_target_model.py: guard padding(target_out) when target_out is None

imitater-dou requested review from FlamingoPg, FrankLeeeee, shuaills, sleepcoo and zyksir as code owners May 26, 2026 11:03

imitater-dou force-pushed the main branch from bb52cfa to 42cb50b Compare May 26, 2026 11:44

imitater-dou force-pushed the main branch from 42cb50b to bcb2ec2 Compare May 26, 2026 14:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: correct 9 critical/high bugs found via multi-agent audit#563

fix: correct 9 critical/high bugs found via multi-agent audit#563
imitater-dou wants to merge 1 commit into
sgl-project:mainfrom
imitater-dou:main

imitater-dou commented May 26, 2026

Uh oh!

gemini-code-assist Bot commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

imitater-dou commented May 26, 2026

Summary

🔴 Runtime Crashes (will crash on first use)

🔴 Silent Training Correctness Bugs

🔴 Logic Bugs in Data Pipeline

🟠 High Severity

Test plan

Uh oh!

gemini-code-assist Bot commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant