Cannot reproduce decode rate for Qwen3-4B-Instruct NVFP4 on Jetson Thor #95

@yuchen-he

Hi, I am trying to reproduce the numbers from the public table for Qwen3-4B-Instruct NVFP4.
Using llm_bench, I get much better prefill performance than with llm_inference, but I cannot reach the reported decode rate of 90.2 tok/s (I get about 55 tok/s).
Could you please check my settings, or share the commands that generated the published numbers?

My environment:

  • TensorRT-Edge-LLM version: v.0.70
  • Model name: Qwen3-4B-Instruct-2507
  • Precision: NVFP4 (including lm_head_quantization)
  • --maxInputLen: 1024
  • --maxKVCacheCapacity: 4096
  • --maxBatchSize: 1
  • --inputLen for llm_bench command: 364 (to match your table)
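
To make the size of the gap concrete, here is a quick sketch of the arithmetic. The helper name `decode_gap` is hypothetical and not part of TensorRT-Edge-LLM; the 55 tok/s figure is my approximate measurement and 90.2 tok/s is the value from the public table.

```python
def decode_gap(measured_tok_s: float, reference_tok_s: float) -> dict:
    """Compare a measured decode rate against a reference rate.

    Hypothetical helper for illustration only; it just converts
    tokens/second into per-token latency and a ratio.
    """
    if measured_tok_s <= 0 or reference_tok_s <= 0:
        raise ValueError("rates must be positive")
    return {
        # Average time spent per generated token, in milliseconds.
        "measured_ms_per_token": 1000.0 / measured_tok_s,
        "reference_ms_per_token": 1000.0 / reference_tok_s,
        # Fraction of the reference throughput actually achieved.
        "fraction_of_reference": measured_tok_s / reference_tok_s,
    }


gap = decode_gap(55.0, 90.2)
for name, value in gap.items():
    print(f"{name}: {value:.2f}")
```

In other words, I am seeing roughly 18 ms per generated token versus the roughly 11 ms implied by the table, which is about 61% of the reported decode throughput.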
