Cannot reproduce decode rate for Qwen3-4B-Instruct NVFP4 on Jetson Thor #95

Hi, I am trying to reproduce the numbers from [the public table](/mnt/nas_server2/yuchen/code/TensorRT-Edge-LLM/experimental_results.md) for Qwen3-4B-Instruct NVFP4.
Using `llm_bench`, I get much better prefill performance than with `llm_inference`, but I cannot reach the reported decode rate of 90.2 tok/s (I get about 55 tok/s).
Could you please check my command, or share the commands that generated the numbers?

My environment:
- TensorRT-Edge-LLM version: v.0.70
- Model name: Qwen3-4B-Instruct-2507
- Precision: NVFP4 (including `lm_head_quantization`)
- `--maxInputLen`: 1024
- `--maxKVCacheCapacity`: 4096
- `--maxBatchSize`: 1
- `--inputLen` for the `llm_bench` command: 364 (aligned with your table)
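
To put the decode gap in per-token terms, here is a minimal sketch that converts the two rates mentioned above (90.2 tok/s from the table, roughly 55 tok/s measured) into milliseconds per token. The helper name `decode_gap` is hypothetical and not part of TensorRT-Edge-LLM, and the 55 tok/s input is only the approximate figure quoted in this issue.

```python
def decode_gap(measured_tok_s: float, reference_tok_s: float) -> dict:
    """Compare a measured decode rate against a reference rate.

    Hypothetical helper for illustration only; not a TensorRT-Edge-LLM API.
    """
    if measured_tok_s <= 0 or reference_tok_s <= 0:
        raise ValueError("decode rates must be positive")

    # Per-token latency is the reciprocal of throughput (batch size 1).
    measured_ms = 1000.0 / measured_tok_s
    reference_ms = 1000.0 / reference_tok_s

    return {
        "fraction_of_reference": measured_tok_s / reference_tok_s,
        "measured_ms_per_token": measured_ms,
        "reference_ms_per_token": reference_ms,
        "extra_ms_per_token": measured_ms - reference_ms,
    }


# Figures from this issue: ~55 tok/s measured vs. 90.2 tok/s in the table.
gap = decode_gap(55.0, 90.2)
for name, value in gap.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the measured rate is about 61% of the reference, which is roughly 7 ms of extra time per generated token.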
