feat(configs): add Qwen3.5-4B DFlash draft config and training example for Ascend NPU by curnane-lab · Pull Request #562 · sgl-project/SpecForge

curnane-lab · 2026-05-25T09:46:02Z

Motivation

The official SpecForge examples are GPU-centric: they assume CUDA availability, use CUDA_VISIBLE_DEVICES, default to flex_attention, and rely on the SGLang backend for large models. There is no reference configuration or script for running DFlash on Ascend NPU with small dense models.

Qwen3.5-4B is a dense (non-MoE) model with hybrid attention support. Unlike the existing MoE example (Qwen3.5-35B-A3B), which requires SGLang and multimodal language_model path handling, Qwen3.5-4B is a standard dense transformer that fits comfortably on a single NPU node with the HF backend and SDPA attention.

This PR provides the missing pieces for NPU users:

- A draft model config (qwen3.5-4b-dflash.json) tuned for the 4B dense architecture
- A reference training script that uses SDPA (NPU-compatible) and the HF backend (no SGLang service overhead)

Key differences from existing GPU examples:

- Dense vs. MoE: Qwen3.5-4B is dense (32 layers, hidden size 2560); the existing 35B-A3B example is MoE
- Hybrid attention: Qwen3.5-4B uses hybrid local/global attention patterns
- HF backend: the 4B model fits in single-node NPU memory without SGLang
- Standard embedding path: model.embed_tokens.weight (no language_model nesting)

Modifications

configs/qwen3.5-4b-dflash.json:
- Add DFlash draft config for Qwen3.5-4B dense architecture
- hidden_size=2560, vocab_size=248320, num_target_layers=32
- target_layer_ids=[1, 8, 15, 22, 29] (hybrid attention layer distribution)
- tie_word_embeddings=true
examples/run_qwen3.5_4b_dflash.sh:
- Reference training script for Qwen3.5-4B on Ascend NPU
- Uses HF backend (single-node friendly, no SGLang service required)
- Uses SDPA attention (NPU-compatible)
- Paper-aligned hyperparameters: max_length=3072, num_anchors=512, lr=6e-4, gamma=7.0
- Embedding key: model.embed_tokens.weight (standard dense path)
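As a quick sanity check on the config values listed above, the following minimal sketch verifies that target_layer_ids are strictly increasing, fall inside the 32 target layers, and are evenly spaced. The helper name check_target_layer_ids is hypothetical and not part of SpecForge, and the sketch assumes the ids are 0-indexed; it does not reproduce how SpecForge itself selects or consumes these layers.

```python
from typing import List, Optional

# Values copied from the PR description; the helper below is hypothetical.
NUM_TARGET_LAYERS = 32
TARGET_LAYER_IDS = [1, 8, 15, 22, 29]


def check_target_layer_ids(layer_ids: List[int], num_layers: int) -> Optional[int]:
    """Validate the ids and return their common stride, or None if uneven."""
    if not layer_ids:
        raise ValueError("target_layer_ids must not be empty")
    if layer_ids != sorted(set(layer_ids)):
        raise ValueError("target_layer_ids must be strictly increasing")
    # Assumption: ids are 0-indexed positions into the target model's layers.
    if layer_ids[0] < 0 or layer_ids[-1] >= num_layers:
        raise ValueError("target_layer_ids must lie in [0, num_layers)")
    strides = {b - a for a, b in zip(layer_ids, layer_ids[1:])}
    return strides.pop() if len(strides) == 1 else None


stride = check_target_layer_ids(TARGET_LAYER_IDS, NUM_TARGET_LAYERS)
print(stride)
```

For the ids in this PR the common stride is 7, and the last id (29) stays below num_target_layers=32.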

Related Issues

N/A (new feature)

Accuracy Test

Not applicable — config and example only.

Benchmark & Profiling

Architecture verified: dense Qwen3.5-4B with hybrid attention, compatible with DFlashDraftModel.
Hyperparameters aligned with the DFlash paper (block_size=16, lr=6e-4, gamma=7.0, max_length=3072).

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed.

gemini-code-assist

Code Review

This pull request introduces a configuration file and a training script for the Qwen3.5-4B DFlash draft model. The review feedback identifies a mathematical discrepancy in the model's architectural parameters (hidden size vs. attention heads), recommends disabling word embedding tying to align with the project's model loader, and notes that several defined variables in the shell script are not being passed to the training command.

gemini-code-assist · 2026-05-25T09:47:52Z

+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 2560,
+    "initializer_range": 0.02,
+    "intermediate_size": 9728,
+    "layer_types": [
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention"
+    ],
+    "max_position_embeddings": 262144,
+    "max_window_layers": 5,
+    "model_type": "qwen3",
+    "num_attention_heads": 32,


There is a discrepancy between hidden_size (2560), num_attention_heads (32), and head_dim (128). In standard transformer architectures like Qwen, hidden_size typically equals num_attention_heads * head_dim. Here, 32 * 128 = 4096, which does not match 2560. For a 4B model with hidden_size=2560, the number of heads is usually 20 (if head_dim=128) or the head_dim should be 80 (if num_heads=32). Please verify these parameters against the target model architecture.
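The arithmetic in this comment can be reproduced with a small sketch. The AttnShape class is hypothetical and exists only for illustration. One caveat worth checking before changing the config: as far as I know, the Hugging Face Qwen3 implementation treats head_dim as an explicit config field and projects queries from hidden_size to num_attention_heads * head_dim, so the two widths are not required to match. Whether that holds for this draft config should be confirmed against the target model's config.json.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    """Hypothetical container for the three fields discussed in the review."""

    hidden_size: int
    num_attention_heads: int
    head_dim: int

    @property
    def q_proj_out(self) -> int:
        # Width of the query projection output when head_dim is explicit.
        return self.num_attention_heads * self.head_dim

    @property
    def implied_head_dim(self) -> int:
        # Head size if it were derived from hidden_size instead.
        return self.hidden_size // self.num_attention_heads

    @property
    def is_coupled(self) -> bool:
        # True only when hidden_size == num_attention_heads * head_dim.
        return self.q_proj_out == self.hidden_size


shape = AttnShape(hidden_size=2560, num_attention_heads=32, head_dim=128)
print(shape.q_proj_out, shape.implied_head_dim, shape.is_coupled)
```

With the values from the diff, the query projection width is 4096, the head size implied by hidden_size alone would be 80, and the two are therefore decoupled. That is an error only if the model code derives head_dim from hidden_size.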

gemini-code-assist · 2026-05-25T09:47:52Z

+    "rope_scaling": null,
+    "rope_theta": 10000000,
+    "sliding_window": null,
+    "tie_word_embeddings": true,


The tie_word_embeddings parameter is set to true, but the repository's AutoDraftModelConfig loader (in specforge/modeling/auto.py) explicitly overrides this to false for draft models. It is recommended to set it to false here to maintain consistency with the project's model loading logic and avoid potential confusion.

Suggested change

-    "tie_word_embeddings": true,
+    "tie_word_embeddings": false,
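To make this kind of mismatch easy to spot, here is a minimal sketch that flags config keys whose on-disk value differs from a value the loader would force. It takes the comment's description of the loader at face value: the FORCED_OVERRIDES table and the find_overridden_keys helper are hypothetical and are not SpecForge code.

```python
import json

# Assumption (from the review comment, not verified against the repository):
# the draft-config loader forces this key to False regardless of the file.
FORCED_OVERRIDES = {"tie_word_embeddings": False}


def find_overridden_keys(config_text: str) -> dict:
    """Return {key: (value_in_file, forced_value)} for every mismatch."""
    cfg = json.loads(config_text)
    return {
        key: (cfg[key], forced)
        for key, forced in FORCED_OVERRIDES.items()
        if key in cfg and cfg[key] != forced
    }


sample = '{"hidden_size": 2560, "tie_word_embeddings": true}'
print(find_overridden_keys(sample))
```

An empty result means the file already agrees with what the loader would apply.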

gemini-code-assist · 2026-05-25T09:47:52Z

+torchrun \
+    --standalone \
+    --nproc_per_node $NUM_GPUS \
+    $ROOT_DIR/scripts/train_dflash.py \
+    --target-model-path PATH/TO/Qwen3.5-4B \
+    --draft-config-path $ROOT_DIR/configs/qwen3.5-4b-dflash.json \
+    --train-data-path $ROOT_DIR/cache/dataset/train_regen.jsonl \
+    --output-dir $ROOT_DIR/outputs/qwen3.5-4b-dflash \
+    --num-epochs 10 \
+    --batch-size 2 \
+    --accumulation-steps 4 \
+    --learning-rate 6e-4 \
+    --warmup-ratio 0.04 \
+    --max-grad-norm 1.0 \
+    --max-length 3072 \
+    --chat-template qwen3.5 \
+    --attention-backend $ATTENTION_BACKEND \
+    --num-anchors 512 \
+    --loss-decay-gamma 7.0 \
+    --log-interval 50 \
+    --save-interval 10000 \
+    --report-to tensorboard \
+    --target-model-backend hf \
+    --block-size 16 \
+    --trust-remote-code


The variables TP_SIZE and BUILD_DATASET_NUM_PROC are defined but not passed to the train_dflash.py script. They should be included in the torchrun command to ensure the configuration is correctly applied during training.

Suggested change

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_dflash.py \
    --target-model-path PATH/TO/Qwen3.5-4B \
    --draft-config-path $ROOT_DIR/configs/qwen3.5-4b-dflash.json \
    --train-data-path $ROOT_DIR/cache/dataset/train_regen.jsonl \
    --output-dir $ROOT_DIR/outputs/qwen3.5-4b-dflash \
    --num-epochs 10 \
    --batch-size 2 \
    --accumulation-steps 4 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 3072 \
    --chat-template qwen3.5 \
    --attention-backend $ATTENTION_BACKEND \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --log-interval 50 \
    --save-interval 10000 \
    --report-to tensorboard \
    --target-model-backend hf \
    --tp-size $TP_SIZE \
    --build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
    --block-size 16 \
    --trust-remote-code
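A check like the one behind this comment can be automated. The sketch below scans a shell script for variables that are assigned but never referenced as $NAME or ${NAME...}. It is a regex heuristic, not a shell parser, and the sample script is made up for illustration: the PR's actual variable definitions and values are not shown in this thread.

```python
import re
from typing import List

# Matches `NAME=` or `export NAME=` at the start of a line.
ASSIGN_RE = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=", re.MULTILINE)


def unused_shell_variables(script: str) -> List[str]:
    """List assigned variables that never appear as $NAME or ${NAME...}."""
    assigned = list(dict.fromkeys(ASSIGN_RE.findall(script)))  # keep order, dedupe
    unused = []
    for name in assigned:
        reference = re.compile(r"\$(?:\{" + name + r"\b|" + name + r"\b)")
        if not reference.search(script):
            unused.append(name)
    # Caveat: exported variables read by child processes are also reported.
    return unused


# Hypothetical script body; variable values are illustrative only.
SAMPLE = "\n".join(
    [
        "NUM_GPUS=8",
        "TP_SIZE=1",
        "BUILD_DATASET_NUM_PROC=64",
        "ATTENTION_BACKEND=sdpa",
        "torchrun --nproc_per_node $NUM_GPUS train.py \\",
        "    --attention-backend ${ATTENTION_BACKEND}",
    ]
)
print(unused_shell_variables(SAMPLE))
```

On the sample, the two variables named in the comment are reported while NUM_GPUS and ATTENTION_BACKEND are not. Whether the right fix is to pass them or to delete them depends on which flags train_dflash.py actually accepts.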

- Add configs/qwen3.5-4b-dflash.json with model architecture aligned to Qwen3.5-4B (hidden_size=2560, vocab_size=248320, target_layer_ids=[1,8,15,22,29]).
- Add examples/run_qwen3.5_4b_dflash_online_npu.sh as a reference training script using HF backend and paper-aligned hyperparameters: max_length=3072, num_anchors=512, lr=6e-4, gamma=7.0.
- The config is compatible with the existing DFlash training pipeline and uses the same block_size=16 and attention backends as other Qwen3/3.5 examples.

curnane-lab requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners May 25, 2026 09:46

gemini-code-assist Bot reviewed May 25, 2026


curnane-lab force-pushed the feat/add-qwen3.5-4b-config branch 5 times, most recently from 4f99c70 to 44c4be1, May 26, 2026 10:32

curnane-lab force-pushed the feat/add-qwen3.5-4b-config branch from 44c4be1 to 02ebc89, May 26, 2026 10:45

curnane-lab wants to merge 1 commit into sgl-project:main from curnane-lab:feat/add-qwen3.5-4b-config
