feat(configs): add Qwen3.5-4B DFlash draft config and training example for Ascend NPU #562
curnane-lab wants to merge 1 commit into
Code Review
This pull request introduces a configuration file and a training script for the Qwen3.5-4B DFlash draft model. The review feedback identifies a mathematical discrepancy in the model's architectural parameters (hidden size vs. attention heads), recommends disabling word embedding tying to align with the project's model loader, and notes that several defined variables in the shell script are not being passed to the training command.
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2560,
"initializer_range": 0.02,
"intermediate_size": 9728,
"layer_types": [
  "full_attention",
  "full_attention",
  "full_attention",
  "full_attention",
  "full_attention"
],
"max_position_embeddings": 262144,
"max_window_layers": 5,
"model_type": "qwen3",
"num_attention_heads": 32,
There is a discrepancy between hidden_size (2560), num_attention_heads (32), and head_dim (128). In standard transformer architectures like Qwen, hidden_size typically equals num_attention_heads * head_dim. Here, 32 * 128 = 4096, which does not match 2560. For a 4B model with hidden_size=2560, the number of heads is usually 20 (if head_dim=128) or the head_dim should be 80 (if num_heads=32). Please verify these parameters against the target model architecture.
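The arithmetic in this comment can be reproduced with a short sketch. The three values are copied from the config excerpt above; the variable names for the derived quantities are illustrative only.

```python
# Values copied from the config excerpt under review.
hidden_size = 2560
num_attention_heads = 32
head_dim = 128

# Combined width of all attention heads.
attn_width = num_attention_heads * head_dim

# The two alternatives the comment proposes so that
# hidden_size == num_attention_heads * head_dim holds.
heads_if_head_dim_fixed = hidden_size // head_dim
head_dim_if_heads_fixed = hidden_size // num_attention_heads

print(attn_width)               # 4096
print(heads_if_head_dim_fixed)  # 20
print(head_dim_if_heads_fixed)  # 80
```

A mismatch here is a prompt to compare against the target model's own config rather than proof of an error: some model configs set head_dim explicitly and size the attention projections from it, in which case the combined head width need not equal hidden_size.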
"rope_scaling": null,
"rope_theta": 10000000,
"sliding_window": null,
"tie_word_embeddings": true,
The tie_word_embeddings parameter is set to true, but the repository's AutoDraftModelConfig loader (in specforge/modeling/auto.py) explicitly overrides this to false for draft models. It is recommended to set it to false here to maintain consistency with the project's model loading logic and avoid potential confusion.
Suggested change:
- "tie_word_embeddings": true,
+ "tie_word_embeddings": false,
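To illustrate why a true value in the file can mislead, here is a minimal sketch of a loader that forces the flag off. The function normalize_draft_config is hypothetical and is not the code in specforge/modeling/auto.py; it only mimics the override behaviour the comment describes.

```python
import json

def normalize_draft_config(text):
    """Hypothetical helper: parse a draft config and force tie_word_embeddings off.

    Returns the config dict and a flag saying whether the file's value was overridden.
    """
    cfg = json.loads(text)
    overridden = bool(cfg.get("tie_word_embeddings", False))
    cfg["tie_word_embeddings"] = False
    return cfg, overridden

# Minimal stand-in for the config file, using two keys from the excerpt above.
example = '{"model_type": "qwen3", "tie_word_embeddings": true}'
cfg, overridden = normalize_draft_config(example)
print(cfg["tie_word_embeddings"], overridden)  # False True
```

With a loader of this shape, the value written in the JSON file never takes effect, so setting it to false in the file keeps the file and the loaded model in agreement.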
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
$ROOT_DIR/scripts/train_dflash.py \
--target-model-path PATH/TO/Qwen3.5-4B \
--draft-config-path $ROOT_DIR/configs/qwen3.5-4b-dflash.json \
--train-data-path $ROOT_DIR/cache/dataset/train_regen.jsonl \
--output-dir $ROOT_DIR/outputs/qwen3.5-4b-dflash \
--num-epochs 10 \
--batch-size 2 \
--accumulation-steps 4 \
--learning-rate 6e-4 \
--warmup-ratio 0.04 \
--max-grad-norm 1.0 \
--max-length 3072 \
--chat-template qwen3.5 \
--attention-backend $ATTENTION_BACKEND \
--num-anchors 512 \
--loss-decay-gamma 7.0 \
--log-interval 50 \
--save-interval 10000 \
--report-to tensorboard \
--target-model-backend hf \
--block-size 16 \
--trust-remote-code
The variables TP_SIZE and BUILD_DATASET_NUM_PROC are defined but not passed to the train_dflash.py script. They should be included in the torchrun command to ensure the configuration is correctly applied during training.
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
$ROOT_DIR/scripts/train_dflash.py \
--target-model-path PATH/TO/Qwen3.5-4B \
--draft-config-path $ROOT_DIR/configs/qwen3.5-4b-dflash.json \
--train-data-path $ROOT_DIR/cache/dataset/train_regen.jsonl \
--output-dir $ROOT_DIR/outputs/qwen3.5-4b-dflash \
--num-epochs 10 \
--batch-size 2 \
--accumulation-steps 4 \
--learning-rate 6e-4 \
--warmup-ratio 0.04 \
--max-grad-norm 1.0 \
--max-length 3072 \
--chat-template qwen3.5 \
--attention-backend $ATTENTION_BACKEND \
--num-anchors 512 \
--loss-decay-gamma 7.0 \
--log-interval 50 \
--save-interval 10000 \
--report-to tensorboard \
--target-model-backend hf \
--tp-size $TP_SIZE \
--build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
--block-size 16 \
--trust-remote-code
4f99c70 to 44c4be1
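The review comment about TP_SIZE and BUILD_DATASET_NUM_PROC describes a class of mistake that can be caught mechanically. The sketch below scans a shell script for variables that are assigned but never referenced. The helper unused_shell_vars and the sample script (including its variable values) are hypothetical and stand in for the real example script, which is not reproduced in full here.

```python
import re

def unused_shell_vars(script):
    """Return variables assigned at the start of a line but never used as $VAR or ${VAR}."""
    assigned = re.findall(
        r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=", script, flags=re.MULTILINE
    )
    unused = []
    for name in dict.fromkeys(assigned):  # de-duplicate while keeping order
        # \b stops $TP_SIZE from also matching a longer name such as $TP_SIZE_X.
        if not re.search(r"\$\{?" + name + r"\b", script):
            unused.append(name)
    return unused

# Hypothetical script: the assigned values are placeholders, not taken from the PR.
sample = """\
NUM_GPUS=8
TP_SIZE=1
BUILD_DATASET_NUM_PROC=64
torchrun --standalone --nproc_per_node $NUM_GPUS train_dflash.py --block-size 16
"""

print(unused_shell_vars(sample))  # ['TP_SIZE', 'BUILD_DATASET_NUM_PROC']
```

This is a textual check only: it does not understand quoting, subshells, or variables consumed by child processes through the environment, so a reported name still needs a manual look.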
- Add configs/qwen3.5-4b-dflash.json with model architecture aligned to Qwen3.5-4B (hidden_size=2560, vocab_size=248320, target_layer_ids=[1,8,15,22,29]).
- Add examples/run_qwen3.5_4b_dflash_online_npu.sh as a reference training script using HF backend and paper-aligned hyperparameters: max_length=3072, num_anchors=512, lr=6e-4, gamma=7.0.
- The config is compatible with the existing DFlash training pipeline and uses the same block_size=16 and attention backends as other Qwen3/3.5 examples.
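As a quick sanity check on the target_layer_ids listed in the commit message, the sketch below confirms that the ids are evenly spaced and fall inside the target model's layer range. It assumes num_target_layers=32, as stated in the PR description; it does not explain why these particular layers were chosen.

```python
target_layer_ids = [1, 8, 15, 22, 29]
num_target_layers = 32  # assumption: taken from the PR description

# Gap between each pair of consecutive tapped layers.
strides = [b - a for a, b in zip(target_layer_ids, target_layer_ids[1:])]

# Every id must be a valid zero-based layer index.
in_range = all(0 <= i < num_target_layers for i in target_layer_ids)

print(strides)   # [7, 7, 7, 7]
print(in_range)  # True
```

The list has five entries, the same count as the five full_attention entries in the layer_types field of the config excerpt reviewed above.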
44c4be1 to 02ebc89
Motivation
The official SpecForge examples are GPU-centric: they assume CUDA availability, use CUDA_VISIBLE_DEVICES, default to flex_attention, and rely on the SGLang backend for large models. There is no reference configuration or script for running DFlash on Ascend NPU with small dense models.
Qwen3.5-4B is a dense (non-MoE) model with hybrid attention support. Unlike the existing MoE example (Qwen3.5-35B-A3B), which requires SGLang and multimodal language_model path handling, Qwen3.5-4B is a standard dense transformer that fits comfortably on a single NPU node with the HF backend and SDPA attention.
This PR provides the missing pieces for NPU users:
- qwen3.5-4b-dflash.json) tuned for the 4B dense architecture
Key differences from existing GPU examples:
- model.embed_tokens.weight (no language_model nesting)
Modifications
configs/qwen3.5-4b-dflash.json:
- hidden_size=2560, vocab_size=248320, num_target_layers=32
- target_layer_ids=[1, 8, 15, 22, 29] (hybrid attention layer distribution)
- tie_word_embeddings=true
examples/run_qwen3.5_4b_dflash.sh:
- max_length=3072, num_anchors=512, lr=6e-4, gamma=7.0
- model.embed_tokens.weight (standard dense path)
Related Issues
N/A (new feature)
Accuracy Test
Benchmark & Profiling
DFlashDraftModel.
Checklist