Hi, I am trying to reproduce the numbers from the public table for Qwen3-4B-Instruct NVFP4.
Using llm_bench, I get much better prefill performance than with llm_inference, but I cannot reach the published decode rate of 90.2 tok/s (I get about 55 tok/s).
Could you please help check my command or share your commands that generated the numbers?
My environment:
- TensorRT-Edge-LLM version: v.0.70
- Model name: Qwen3-4B-Instruct-2507
- Precision: NVFP4 (including lm_head_quantization)
- --maxInputLen: 1024
- --maxKVCacheCapacity: 4096
- --maxBatchSize: 1
- --inputLen for the llm_bench command: 364 (aligned with your table)