
feat(configs): add Qwen3.5-4B DFlash draft config and training example for Ascend NPU#562

Open
curnane-lab wants to merge 1 commit into
sgl-project:mainfrom
curnane-lab:feat/add-qwen3.5-4b-config

curnane-lab commented May 25, 2026

Motivation

The official SpecForge examples are GPU-centric: they assume CUDA availability, use CUDA_VISIBLE_DEVICES, default to flex_attention, and rely on the SGLang backend for large models. There is no reference configuration or script for running DFlash on Ascend NPU with small dense models.

Qwen3.5-4B is a dense (non-MoE) model with hybrid attention support. Unlike the existing MoE example (Qwen3.5-35B-A3B), which requires SGLang and multimodal language_model path handling, Qwen3.5-4B is a standard dense transformer that fits comfortably on a single NPU node with the HF backend and SDPA attention.

This PR provides the missing pieces for NPU users:

  • A draft model config (qwen3.5-4b-dflash.json) tuned for the 4B dense architecture
  • A reference training script that uses SDPA (NPU-compatible) and the HF backend (no SGLang service overhead)

Key differences from existing GPU examples:

  • Dense vs MoE: Qwen3.5-4B is dense (32 layers, 2560 hidden size); the existing 35B-A3B example is MoE
  • Hybrid attention: Qwen3.5-4B uses hybrid local/global attention patterns
  • HF backend: 4B fits in single-node NPU memory without SGLang
  • Standard embedding path: model.embed_tokens.weight (no language_model nesting)

Modifications

  • configs/qwen3.5-4b-dflash.json:

    • Add DFlash draft config for Qwen3.5-4B dense architecture
    • hidden_size=2560, vocab_size=248320, num_target_layers=32
    • target_layer_ids=[1, 8, 15, 22, 29] (hybrid attention layer distribution)
    • tie_word_embeddings=true
  • examples/run_qwen3.5_4b_dflash.sh:

    • Reference training script for Qwen3.5-4B on Ascend NPU
    • Uses HF backend (single-node friendly, no SGLang service required)
    • Uses SDPA attention (NPU-compatible)
    • Paper-aligned hyperparameters: max_length=3072, num_anchors=512, lr=6e-4, gamma=7.0
    • Embedding key: model.embed_tokens.weight (standard dense path)
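The listed target_layer_ids are evenly spaced with a stride of 7 across the 32 target layers. The sketch below reproduces that spacing under an assumed rule (first tap at layer 1, last tap at num_target_layers - 3). The rule and the helper name evenly_spaced_layer_ids are illustrative assumptions; the PR itself only says the ids follow the hybrid attention layer distribution.

```python
def evenly_spaced_layer_ids(num_target_layers: int, num_taps: int) -> list[int]:
    """Spread num_taps layer indices between layer 1 and layer num_target_layers - 3.

    Assumed rule for illustration only; not confirmed by the PR.
    """
    if num_taps == 1:
        return [num_target_layers // 2]
    start, end = 1, num_target_layers - 3
    span = end - start
    return [round(start + i * span / (num_taps - 1)) for i in range(num_taps)]


layer_ids = evenly_spaced_layer_ids(num_target_layers=32, num_taps=5)
print(layer_ids)
```

If the rule holds, a different tap count or target depth can be derived the same way instead of hand-picking ids.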

Related Issues

N/A (new feature)

Accuracy Test

  • Not applicable: this PR adds only a config and an example script.

Benchmark & Profiling

  • Architecture verified: dense Qwen3.5-4B with hybrid attention, compatible with DFlashDraftModel.
  • Hyperparameters aligned with the DFlash paper (block_size=16, lr=6e-4, gamma=7.0, max_length=3072).
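To give a feel for what gamma=7.0 means with block_size=16, the sketch below computes per-position loss weights under an assumed exponential decay, w_k = exp(-(k - 1) / gamma). The formula and the helper name loss_decay_weights are assumptions made for illustration; the exact weighting should be checked against the training code and the paper.

```python
import math


def loss_decay_weights(block_size: int, gamma: float) -> list[float]:
    # Assumed form: w_k = exp(-(k - 1) / gamma) for k = 1..block_size.
    return [math.exp(-(k - 1) / gamma) for k in range(1, block_size + 1)]


weights = loss_decay_weights(block_size=16, gamma=7.0)
for k, w in enumerate(weights, start=1):
    print(f"position {k:2d}: weight {w:.3f}")
```

Under this assumption, later positions in a block contribute progressively less to the loss, and a larger gamma flattens the decay.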

Checklist

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a configuration file and a training script for the Qwen3.5-4B DFlash draft model. The review feedback identifies a mathematical discrepancy in the model's architectural parameters (hidden size vs. attention heads), recommends disabling word embedding tying to align with the project's model loader, and notes that several defined variables in the shell script are not being passed to the training command.

Comment on lines +18 to +33
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2560,
"initializer_range": 0.02,
"intermediate_size": 9728,
"layer_types": [
  "full_attention",
  "full_attention",
  "full_attention",
  "full_attention",
  "full_attention"
],
"max_position_embeddings": 262144,
"max_window_layers": 5,
"model_type": "qwen3",
"num_attention_heads": 32,

Priority: high

There is a discrepancy between hidden_size (2560), num_attention_heads (32), and head_dim (128). In standard transformer architectures like Qwen, hidden_size typically equals num_attention_heads * head_dim. Here, 32 * 128 = 4096, which does not match 2560. For a 4B model with hidden_size=2560, the number of heads is usually 20 (if head_dim=128) or the head_dim should be 80 (if num_heads=32). Please verify these parameters against the target model architecture.
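The arithmetic in this comment can be reproduced with a small check. This is a sketch, and check_attention_shape is a hypothetical helper, not part of SpecForge. A mismatch is not automatically an error: some model families declare head_dim explicitly, so the attention projection width can differ from hidden_size. The values should therefore be compared against the target model's published config.

```python
def check_attention_shape(hidden_size: int, num_attention_heads: int, head_dim: int) -> dict:
    """Compare the attention projection width with hidden_size."""
    projection_width = num_attention_heads * head_dim
    return {
        "projection_width": projection_width,
        "matches_hidden_size": projection_width == hidden_size,
        # Values that would make num_attention_heads * head_dim equal hidden_size.
        "heads_for_given_head_dim": hidden_size / head_dim,
        "head_dim_for_given_heads": hidden_size / num_attention_heads,
    }


report = check_attention_shape(hidden_size=2560, num_attention_heads=32, head_dim=128)
print(report)
```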

"rope_scaling": null,
"rope_theta": 10000000,
"sliding_window": null,
"tie_word_embeddings": true,

Priority: medium

The tie_word_embeddings parameter is set to true, but the repository's AutoDraftModelConfig loader (in specforge/modeling/auto.py) explicitly overrides this to false for draft models. It is recommended to set it to false here to maintain consistency with the project's model loading logic and avoid potential confusion.

Suggested change
- "tie_word_embeddings": true,
+ "tie_word_embeddings": false,
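The confusion described in this comment arises when a loader forces a field to a value that differs from what the file says. The sketch below shows how such shadowed fields could be reported. effective_config and the forced dictionary are hypothetical and stand in for whatever the loader in specforge/modeling/auto.py does, which is not verified here.

```python
def effective_config(file_config: dict, forced: dict) -> tuple[dict, list[str]]:
    """Apply loader-forced values and list keys whose file value is overridden."""
    merged = {**file_config, **forced}
    shadowed = [
        key for key, value in forced.items()
        if key in file_config and file_config[key] != value
    ]
    return merged, shadowed


# Subset of the draft config from this PR; the forced override is an assumption.
file_config = {"hidden_size": 2560, "tie_word_embeddings": True}
merged, shadowed = effective_config(file_config, forced={"tie_word_embeddings": False})
print(merged["tie_word_embeddings"], shadowed)
```

Setting the field to false in the JSON would leave the shadowed list empty, which is the consistency the suggestion aims for.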

Comment on lines +16 to +40
torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_dflash.py \
    --target-model-path PATH/TO/Qwen3.5-4B \
    --draft-config-path $ROOT_DIR/configs/qwen3.5-4b-dflash.json \
    --train-data-path $ROOT_DIR/cache/dataset/train_regen.jsonl \
    --output-dir $ROOT_DIR/outputs/qwen3.5-4b-dflash \
    --num-epochs 10 \
    --batch-size 2 \
    --accumulation-steps 4 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 3072 \
    --chat-template qwen3.5 \
    --attention-backend $ATTENTION_BACKEND \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --log-interval 50 \
    --save-interval 10000 \
    --report-to tensorboard \
    --target-model-backend hf \
    --block-size 16 \
    --trust-remote-code

Priority: medium

The variables TP_SIZE and BUILD_DATASET_NUM_PROC are defined but not passed to the train_dflash.py script. They should be included in the torchrun command to ensure the configuration is correctly applied during training.

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_dflash.py \
    --target-model-path PATH/TO/Qwen3.5-4B \
    --draft-config-path $ROOT_DIR/configs/qwen3.5-4b-dflash.json \
    --train-data-path $ROOT_DIR/cache/dataset/train_regen.jsonl \
    --output-dir $ROOT_DIR/outputs/qwen3.5-4b-dflash \
    --num-epochs 10 \
    --batch-size 2 \
    --accumulation-steps 4 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 3072 \
    --chat-template qwen3.5 \
    --attention-backend $ATTENTION_BACKEND \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --log-interval 50 \
    --save-interval 10000 \
    --report-to tensorboard \
    --target-model-backend hf \
    --tp-size $TP_SIZE \
    --build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
    --block-size 16 \
    --trust-remote-code
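A check like the one behind this comment can be automated. The following is a minimal sketch and not part of the PR: the helper unused_shell_variables and the sample script (including its variable values and the train.py name) are hypothetical, and the regexes only handle simple NAME=value assignments and $NAME or ${NAME} references.

```python
import re


def unused_shell_variables(script: str) -> list[str]:
    """Return variables that are assigned in the script but never referenced."""
    assigned = re.findall(
        r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=", script, flags=re.MULTILINE
    )
    unused = []
    for name in dict.fromkeys(assigned):  # de-duplicate, keep first-seen order
        # A reference looks like $NAME or ${NAME...}.
        pattern = r"\$(?:%s\b|\{%s\b)" % (re.escape(name), re.escape(name))
        if not re.search(pattern, script):
            unused.append(name)
    return unused


# Hypothetical script body; the values are placeholders, not taken from the PR.
sample = """\
NUM_GPUS=8
TP_SIZE=1
BUILD_DATASET_NUM_PROC=64
ATTENTION_BACKEND=sdpa
torchrun --nproc_per_node $NUM_GPUS train.py --attention-backend $ATTENTION_BACKEND
"""
print(unused_shell_variables(sample))
```

Whether an unused variable should be passed to the command or simply deleted depends on which flags the training script actually accepts.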

curnane-lab force-pushed the feat/add-qwen3.5-4b-config branch 5 times, most recently from 4f99c70 to 44c4be1, on May 26, 2026 10:32
- Add configs/qwen3.5-4b-dflash.json with model architecture aligned
  to Qwen3.5-4B (hidden_size=2560, vocab_size=248320,
  target_layer_ids=[1,8,15,22,29]).

- Add examples/run_qwen3.5_4b_dflash_online_npu.sh as a reference training script
  using HF backend and paper-aligned hyperparameters:
  max_length=3072, num_anchors=512, lr=6e-4, gamma=7.0.

- The config is compatible with the existing DFlash training pipeline
  and uses the same block_size=16 and attention backends as other
  Qwen3/3.5 examples.
curnane-lab force-pushed the feat/add-qwen3.5-4b-config branch from 44c4be1 to 02ebc89 on May 26, 2026 10:45