add tbo support for m3 #1373

Draft
zhuyuhua-v wants to merge 1 commit into main from yuhua/tbo-m3

@zhuyuhua-v

Collaborator

Motivation

Add TBO support for MiniMax-M3. The test plan below enables it for prefill with `--enable-tbo prefill`.

Test Plan

server:

model_path=/shared/data/amd_int/models/MiniMax-M3-MXFP8/
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export AITER_QUICK_REDUCE_CAST_BF16_TO_FP16=0
export ATOM_FORCE_ATTN_TRITON=1

DEFAULT_HF_OVERRIDES='{"use_index_cache": true, "index_topk_freq": 4}'
HF_OVERRIDES="${HF_OVERRIDES:-${DEFAULT_HF_OVERRIDES}}"
HF_OVERRIDE_ARGS=()
if [[ -n "${HF_OVERRIDES}" ]]; then
  HF_OVERRIDE_ARGS=(--hf-overrides "${HF_OVERRIDES}")
fi

python -m atom.entrypoints.openai_server \
  --model "$model_path" \
  --tensor-parallel-size 4 \
  --enable-dp-attention \
  --server-port 8013 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --block-size 128 \
  --max-model-len 32768 \
  --no-enable_prefix_caching \
  "${HF_OVERRIDE_ARGS[@]}" \
  --max-num-seqs 256 \
  --kv_cache_dtype fp8 \
  --enable-tbo prefill \
  --online_quant_config '{"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "model.embed_tokens", "vision_tower", "multi_modal_projector", "patch_merge_mlp", "*.gate.*", "*.block_sparse_moe.experts*"]}' \
  --max-num-batched-tokens 32768 2>&1 | tee m3-mxfp4-server-tp4.log

Test Result

benchmark
w/o tbo

============ Serving Benchmark Result ============
Successful requests:                     768       
Benchmark duration (s):                  119.98    
Total input tokens:                      6291456   
Total generated tokens:                  768       
Request throughput (req/s):              6.40      
Output token throughput (tok/s):         6.40      
Total Token throughput (tok/s):          52443.72  
---------------Time to First Token----------------
Mean TTFT (ms):                          33885.05  
Median TTFT (ms):                        39732.99  
P99 TTFT (ms):                           40736.94  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.16      
Median ITL (ms):                         0.03      
P99 ITL (ms):                            1.65      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          33885.21  
Median E2EL (ms):                        39733.02  
P99 E2EL (ms):                           40736.97  
==================================================

w/ tbo

============ Serving Benchmark Result ============
Successful requests:                     768       
Benchmark duration (s):                  101.15    
Total input tokens:                      6291456   
Total generated tokens:                  768       
Request throughput (req/s):              7.59      
Output token throughput (tok/s):         7.59      
Total Token throughput (tok/s):          62203.86  
---------------Time to First Token----------------
Mean TTFT (ms):                          28363.62  
Median TTFT (ms):                        32906.76  
P99 TTFT (ms):                           35336.72  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.15      
Median ITL (ms):                         0.03      
P99 ITL (ms):                            1.83      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          28363.77  
Median E2EL (ms):                        32907.76  
P99 E2EL (ms):                           35336.74  
==================================================
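
The two runs above can be compared with a few lines of arithmetic. The sketch below is not part of the PR. It only copies figures from the two result tables and computes relative changes; the variable and function names are made up for illustration. It also derives the workload shape: 8192 input tokens and a single generated token per request, which is why TPOT is reported as 0.00 and the numbers mostly reflect prefill.

```python
# Figures copied from the two "Serving Benchmark Result" tables above.
# All names here are illustrative and do not come from the benchmark tool.
baseline = {  # w/o tbo
    "duration_s": 119.98,
    "total_tok_s": 52443.72,
    "mean_ttft_ms": 33885.05,
    "p99_ttft_ms": 40736.94,
}
with_tbo = {  # w/ tbo
    "duration_s": 101.15,
    "total_tok_s": 62203.86,
    "mean_ttft_ms": 28363.62,
    "p99_ttft_ms": 35336.72,
}

successful_requests = 768
total_input_tokens = 6291456
total_generated_tokens = 768


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Workload shape shared by both runs.
input_tokens_per_request = total_input_tokens // successful_requests
output_tokens_per_request = total_generated_tokens // successful_requests

# Relative change for every metric present in both tables.
changes = {key: pct_change(baseline[key], with_tbo[key]) for key in baseline}

print(f"input tokens per request:  {input_tokens_per_request}")
print(f"output tokens per request: {output_tokens_per_request}")
for key, value in changes.items():
    print(f"{key:>14}: {value:+.2f}%")
```

On these numbers, total token throughput improves by roughly 18.6% with TBO, mean TTFT drops by roughly 16.3%, and P99 TTFT drops by roughly 13.3%. This is a single run per configuration, so run-to-run variance is not captured.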

accuracy
[two images of accuracy results; contents not preserved]

Submission Checklist

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>