Skip to content

fix: correct 9 critical/high bugs found via multi-agent audit#563

Open
imitater-dou wants to merge 1 commit into
sgl-project:mainfrom
imitater-dou:main
Open

fix: correct 9 critical/high bugs found via multi-agent audit#563
imitater-dou wants to merge 1 commit into
sgl-project:mainfrom
imitater-dou:main

Conversation

@imitater-dou
Copy link
Copy Markdown

Summary

This PR fixes 9 confirmed bugs discovered through a multi-agent static analysis of the codebase, all verified by direct source inspection before patching.

🔴 Runtime Crashes (will crash on first use)

  • specforge/core/loss.pyLogSoftmaxLoss.backward returned 5 values for 3 inputs (forward takes logits, target, position_mask). PyTorch autograd raises RuntimeError on every backward pass.

  • specforge/distributed.pyGather.backward returned 6 values for 5 inputs (one extra trailing None). Crashes any distributed training run that uses sequence-parallel gather.

🔴 Silent Training Correctness Bugs

  • specforge/data/preprocessing.py — USP position_ids were identical for all Ulysses ranks within the same ring group (ring_start = ring_rank * ring_chunk ignores ulysses_rank). With sp_ulysses_size > 1, all Ulysses peers claimed the same global position ID range, producing wrong attention patterns. Fixed by offsetting by ulysses_rank * usp_chunk_size.

  • scripts/train_eagle3.pydraft_model.eval() was set at the start of the evaluation loop but never reversed. All training steps after the first eval checkpoint ran with dropout disabled. Also, save_interval was not scaled by draft_accumulation_steps (unlike log_interval and eval_interval), causing 4× more frequent saves than intended.

  • scripts/train_dflash.py — On resume, only the LR scheduler state was restored (optimizer.scheduler.load_state_dict); the AdamW optimizer's first/second moment estimates (exp_avg, exp_avg_sq) were silently discarded, producing a loss spike at every resume. Fixed by calling optimizer.load_state_dict(resume_state) which restores both.

  • specforge/lr_scheduler.pyWarmupDelayerScheduler.step() passed epoch - warmup_epochs to the after-scheduler, missing the delay_epochs subtraction. The cosine schedule started at a non-zero offset, causing a discontinuous LR jump after the warmup+delay phase.

🔴 Logic Bugs in Data Pipeline

  • specforge/data/parse.pyHarmonyParser used or instead of and in the first-message guard, making the condition always True. Every conversation unconditionally received the reasoning effort prefix, including those that already had a system message.

  • specforge/data/template.pyTemplateRegistry.register assert was logically inverted: assert (not override and ...) raised AssertionError whenever override=True was passed, making the override parameter non-functional.

🟠 High Severity

  • specforge/core/eagle3.py_compute_metric_acc used loss_mask.sum() as the denominator while the numerator was masked by position_mask (which excludes out-of-vocabulary positions). This produced systematically deflated accuracy metrics throughout training. Fixed by using position_mask.sum() as denominator.

  • specforge/modeling/target/eagle3_target_model.pypadding(target_out, left=False) was called unconditionally even when target_out is None (the case when return_logits=False). This raised AttributeError on None.__getitem__. Fixed with a None guard.

Test plan

  • Verify LogSoftmaxLoss backward runs without RuntimeError on a minimal synthetic batch
  • Verify Gather.backward runs without RuntimeError in a 2-GPU USP setup
  • Verify process_data_usp produces distinct position_ids for each SP rank when sp_ulysses_size > 1
  • Verify training resumes correctly (loss continues from pre-resume level, no spike)
  • Verify draft_model.train() mode is active during training steps after an eval checkpoint
  • Verify TemplateRegistry.register(name, tpl, override=True) no longer raises
  • Verify HarmonyParser does not prepend reasoning effort prefix when the first message is a system message

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Runtime crashes:
- loss.py: LogSoftmaxLoss.backward returned 5 values for 3 inputs
- distributed.py: Gather.backward returned 6 values for 5 inputs

Training correctness:
- preprocessing.py: USP position_ids were identical for all Ulysses ranks within a
  ring group; now offset by ulysses_rank to match each rank's distinct sequence slice
- train_eagle3.py: draft_model.eval() was never reversed after evaluation loop,
  corrupting rest of epoch; also save_interval now scaled by draft_accumulation_steps
  to match log/eval interval semantics
- train_dflash.py: resume now calls optimizer.load_state_dict() to restore both
  AdamW moment estimates and scheduler (was only restoring scheduler)
- lr_scheduler.py: WarmupDelayerScheduler.step() subtracted only warmup_epochs;
  now subtracts both warmup_epochs and delay_epochs for correct after-scheduler offset

Logic/data bugs:
- parse.py: HarmonyParser condition was a tautology (or → and), unconditionally
  prepending reasoning effort prefix to every conversation
- template.py: TemplateRegistry.register assert was inverted; override=True always
  raised AssertionError instead of allowing re-registration
- eagle3.py: _compute_metric_acc denominator now uses position_mask.sum() instead
  of loss_mask.sum() to match the numerator's masked domain
- eagle3_target_model.py: guard padding(target_out) when target_out is None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant