fix: correct 9 critical/high bugs found via multi-agent audit#563
Open
imitater-dou wants to merge 1 commit into
Open
fix: correct 9 critical/high bugs found via multi-agent audit#563imitater-dou wants to merge 1 commit into
imitater-dou wants to merge 1 commit into
Conversation
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Runtime crashes: - loss.py: LogSoftmaxLoss.backward returned 5 values for 3 inputs - distributed.py: Gather.backward returned 6 values for 5 inputs Training correctness: - preprocessing.py: USP position_ids were identical for all Ulysses ranks within a ring group; now offset by ulysses_rank to match each rank's distinct sequence slice - train_eagle3.py: draft_model.eval() was never reversed after evaluation loop, corrupting rest of epoch; also save_interval now scaled by draft_accumulation_steps to match log/eval interval semantics - train_dflash.py: resume now calls optimizer.load_state_dict() to restore both AdamW moment estimates and scheduler (was only restoring scheduler) - lr_scheduler.py: WarmupDelayerScheduler.step() subtracted only warmup_epochs; now subtracts both warmup_epochs and delay_epochs for correct after-scheduler offset Logic/data bugs: - parse.py: HarmonyParser condition was a tautology (or → and), unconditionally prepending reasoning effort prefix to every conversation - template.py: TemplateRegistry.register assert was inverted; override=True always raised AssertionError instead of allowing re-registration - eagle3.py: _compute_metric_acc denominator now uses position_mask.sum() instead of loss_mask.sum() to match the numerator's masked domain - eagle3_target_model.py: guard padding(target_out) when target_out is None
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes 9 confirmed bugs discovered through a multi-agent static analysis of the codebase, all verified by direct source inspection before patching.
🔴 Runtime Crashes (will crash on first use)
specforge/core/loss.py—LogSoftmaxLoss.backwardreturned 5 values for 3 inputs (forwardtakeslogits, target, position_mask). PyTorch autograd raisesRuntimeErroron every backward pass.specforge/distributed.py—Gather.backwardreturned 6 values for 5 inputs (one extra trailingNone). Crashes any distributed training run that uses sequence-parallel gather.🔴 Silent Training Correctness Bugs
specforge/data/preprocessing.py— USPposition_idswere identical for all Ulysses ranks within the same ring group (ring_start = ring_rank * ring_chunkignoresulysses_rank). Withsp_ulysses_size > 1, all Ulysses peers claimed the same global position ID range, producing wrong attention patterns. Fixed by offsetting byulysses_rank * usp_chunk_size.scripts/train_eagle3.py—draft_model.eval()was set at the start of the evaluation loop but never reversed. All training steps after the first eval checkpoint ran with dropout disabled. Also,save_intervalwas not scaled bydraft_accumulation_steps(unlikelog_intervalandeval_interval), causing 4× more frequent saves than intended.scripts/train_dflash.py— On resume, only the LR scheduler state was restored (optimizer.scheduler.load_state_dict); the AdamW optimizer's first/second moment estimates (exp_avg,exp_avg_sq) were silently discarded, producing a loss spike at every resume. Fixed by callingoptimizer.load_state_dict(resume_state)which restores both.specforge/lr_scheduler.py—WarmupDelayerScheduler.step()passedepoch - warmup_epochsto the after-scheduler, missing thedelay_epochssubtraction. The cosine schedule started at a non-zero offset, causing a discontinuous LR jump after the warmup+delay phase.🔴 Logic Bugs in Data Pipeline
specforge/data/parse.py—HarmonyParserusedorinstead ofandin the first-message guard, making the condition alwaysTrue. Every conversation unconditionally received the reasoning effort prefix, including those that already had a system message.specforge/data/template.py—TemplateRegistry.registerassert was logically inverted:assert (not override and ...)raisedAssertionErrorwheneveroverride=Truewas passed, making the override parameter non-functional.🟠 High Severity
specforge/core/eagle3.py—_compute_metric_accusedloss_mask.sum()as the denominator while the numerator was masked byposition_mask(which excludes out-of-vocabulary positions). This produced systematically deflated accuracy metrics throughout training. Fixed by usingposition_mask.sum()as denominator.specforge/modeling/target/eagle3_target_model.py—padding(target_out, left=False)was called unconditionally even whentarget_out is None(the case whenreturn_logits=False). This raisedAttributeErroronNone.__getitem__. Fixed with aNoneguard.Test plan
LogSoftmaxLossbackward runs withoutRuntimeErroron a minimal synthetic batchGather.backwardruns withoutRuntimeErrorin a 2-GPU USP setupprocess_data_uspproduces distinctposition_idsfor each SP rank whensp_ulysses_size > 1draft_model.train()mode is active during training steps after an eval checkpointTemplateRegistry.register(name, tpl, override=True)no longer raisesHarmonyParserdoes not prepend reasoning effort prefix when the first message is a system message