Describe the bug
I re-exported the ONNX model through the new llm_loader.export_all_cli, which uses fuse_gdn_input_projections() to fuse the GEMMs before the GDN and then splits the output into mixed_qkb, a, b, and z. This improves GEMM performance; however, it introduces a large slice op that ultimately negates the gains from the fused GEMM.
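To make the pattern concrete, here is a minimal pure-Python sketch of the general idea: one wide projection followed by per-consumer slices, compared with separate projections. It is not the actual fuse_gdn_input_projections() implementation; the helper names, matrix sizes, and weights are all made up for illustration.

```python
# Toy illustration only: sizes and weights are made up, and this is not the
# actual fuse_gdn_input_projections() code.

def matmul(x, w):
    """Multiply x (rows x k) by w (k x cols), both given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def hconcat(*mats):
    """Concatenate matrices along the column axis."""
    return [sum((list(m[i]) for m in mats), []) for i in range(len(mats[0]))]

def col_slice(m, start, stop):
    """Copy columns [start, stop) out of m; this models the extra slice op."""
    return [row[start:stop] for row in m]

# Hypothetical hidden size 2 and projection widths 3 (qkv), 1 (a), 1 (b), 2 (z).
x = [[1.0, 2.0], [3.0, 4.0]]
w_qkv = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
w_a = [[0.5], [0.5]]
w_b = [[1.0], [-1.0]]
w_z = [[2.0, 0.0], [0.0, 2.0]]

# Unfused: four small GEMMs, each writing its own output tensor.
separate = [matmul(x, w) for w in (w_qkv, w_a, w_b, w_z)]

# Fused: one wider GEMM, then one slice (an extra copy) per consumer.
fused = matmul(x, hconcat(w_qkv, w_a, w_b, w_z))
widths = [3, 1, 1, 2]
offsets = [sum(widths[:i]) for i in range(len(widths))]
split = [col_slice(fused, o, o + w) for o, w in zip(offsets, widths)]

# Both paths produce the same tensors; only the op count and memory traffic differ.
print(split == separate)
```

The results are identical either way, so the trade-off is purely about cost: the fused path saves GEMM launches but pays for materializing each slice of the wide output.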
I tested Qwen3.5-2B-NVFP4 on my Thor and found that prefill is slower than in v0.7.0, while decode is slightly faster.
Performance of v0.7.0:
ISL: 931 OSL: 128
Prefill: 24.14ms
Decode: 9.03ms
Performance of v0.7.1:
ISL: 931 OSL: 128
Prefill: 33.44ms
Decode: 8.84ms
The Nsight report for the prefill stage shows the following change in kernel time: 78.496us (in_proj_qkv) + 80.712us (siluslice + in_proj_a/b/z + slice) --> 103.104us (fused in_proj_qkv/a/b/z) + 44.128us (slice) + 41.568us (siluslice). That is 159.208us before versus 188.800us after.
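The numbers above can be cross-checked with a few lines of Python. This is only arithmetic on the figures reported in this issue; the variable names are illustrative. It gives roughly a 38.5% prefill slowdown, a 2.1% decode speedup, and about 29.6us of extra time in the profiled projection kernels.

```python
# Figures copied from the measurements reported in this issue.
prefill_ms = {"v0.7.0": 24.14, "v0.7.1": 33.44}
decode_ms = {"v0.7.0": 9.03, "v0.7.1": 8.84}

# Nsight kernel times (microseconds) for the profiled projection block.
kernels_old_us = [78.496, 80.712]            # in_proj_qkv; siluslice + a/b/z + slice
kernels_new_us = [103.104, 44.128, 41.568]   # fused GEMM; slice; siluslice


def pct_change(old, new):
    """Relative change from old to new, in percent (positive means slower)."""
    return (new - old) / old * 100.0


prefill_delta = pct_change(prefill_ms["v0.7.0"], prefill_ms["v0.7.1"])
decode_delta = pct_change(decode_ms["v0.7.0"], decode_ms["v0.7.1"])
kernel_delta_us = sum(kernels_new_us) - sum(kernels_old_us)

print(f"prefill: {prefill_delta:+.1f}%")
print(f"decode:  {decode_delta:+.1f}%")
print(f"kernels: {sum(kernels_old_us):.3f}us -> {sum(kernels_new_us):.3f}us "
      f"({kernel_delta_us:+.3f}us)")
```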
Is there any plan to optimize this?
Steps/Code to reproduce bug
- Quantize Qwen/Qwen3.5-2B to NVFP4
- Export the quantized Qwen3.5-2B-NVFP4 to an ONNX graph
- Build the quantized Qwen3.5-2B-NVFP4 into a TensorRT engine
- Benchmark with examples/llm/llm_bench
Build configuration:
cmake .. -DCUDA_CTK_VERSION=13.0 -DCMAKE_CUDA_ARCHITECTURES=110a -DTRT_PACKAGE_DIR=/home/user/TensorRT-10.14.2.2/ -DAARCH64_BUILD=ON -DCUDA_DIR=/usr/local/cuda/targets/aarch64-linux -DCUDA_TARGET_DIR=/usr/local/cuda/thor/targets/aarch64-linux -DENABLE_CUTE_DSL=ALL
Runtime command used:
examples/llm/llm_bench --engineDir /home/user/trt_engines/Qwen3.5-2B-NVFP4/llm/ --mode prefill --inputLen 1024
examples/llm/llm_bench --engineDir /home/user/trt_engines/Qwen3.5-2B-NVFP4/llm/ --mode decode --pastKVLen 1024
Expected behavior
System information (Edge Device)
- Platform (e.g., NVIDIA Jetson Thor): Drive ThorX
- Software release (e.g., JetPack 7.1): DriveOS 7.0.2
- CPU architecture: ?
- GPU compute capability (e.g., SM110 for Jetson Thor): sm110a
- Total device memory: ?
- Build type (e.g., Release, Debug): ?
- Library versions:
- TensorRT Edge-LLM version or commit hash: v0.7.1
- CUDA: 13.0
- TensorRT: 10.14.2.2
- C++ compiler (e.g., GCC 11.4): ?
- CMake options used:
- CMAKE_TOOLCHAIN_FILE: ?
- EMBEDDED_TARGET: ?
- TRT_PACKAGE_DIR: ?
Any other details that may help: ?
Click to expand: Python script to automatically collect system information
import platform
import re
import subprocess


def get_cuda_version():
    """Return the CUDA release reported by nvcc, or "?" if unavailable."""
    try:
        nvcc_output = subprocess.check_output("nvcc --version", shell=True).decode("utf-8")
        match = re.search(r"release (\d+\.\d+)", nvcc_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    # Fall through on both "command failed" and "no match" so callers always get a string.
    return "?"


def get_tensorrt_version():
    """Return the TensorRT version found in the dpkg package list, or "?"."""
    try:
        dpkg_output = subprocess.check_output("dpkg -l | grep tensorrt", shell=True).decode("utf-8")
        match = re.search(r"(\d+\.\d+\.\d+)", dpkg_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    return "?"


def get_gcc_version():
    """Return the GCC version as "GCC x.y.z", or "?" if gcc is not available."""
    try:
        gcc_output = subprocess.check_output("gcc --version", shell=True).decode("utf-8")
        match = re.search(r"gcc.*?(\d+\.\d+\.\d+)", gcc_output)
        if match:
            return f"GCC {match.group(1)}"
    except Exception:
        pass
    return "?"


def get_gpu_compute_capability():
    """Return the GPU compute capability as "SMxy" from nvidia-smi, or "?"."""
    try:
        smi_output = subprocess.check_output(
            "nvidia-smi --query-gpu=compute_cap --format=csv,noheader",
            shell=True,
        ).decode("utf-8").strip()
        if smi_output:
            cap = smi_output.replace(".", "")
            return f"SM{cap}"
    except Exception:
        pass
    return "?"


def get_total_memory():
    """Return the total memory column from `free -h`, or "?"."""
    try:
        mem_output = subprocess.check_output("free -h | grep Mem", shell=True).decode("utf-8")
        # The first value after "Mem:" is the total; it may or may not have a decimal part.
        match = re.search(r"Mem:\s+(\d+(?:\.\d+)?[KMGT]i?)", mem_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    return "?"


# Get system info
cpu_arch = platform.machine()
platform_name = "?"
software_release = "?"
gpu_compute_cap = get_gpu_compute_capability()
total_memory = get_total_memory()
cuda_version = get_cuda_version()
tensorrt_version = get_tensorrt_version()
gcc_version = get_gcc_version()

# Print system information in the format required for the issue template
print("=" * 70)
print("## System information (Edge Device)")
print()
print("- Platform (e.g., NVIDIA Jetson Thor, NVIDIA DRIVE Thor): " + platform_name)
print("- Software release (e.g., JetPack 7.1, NVIDIA DRIVE OS 7): " + software_release)
print("- CPU architecture: " + cpu_arch)
print("- GPU compute capability (e.g., SM110 for Jetson Thor): " + gpu_compute_cap)
print("- Total device memory: " + total_memory)
print("- Build type (e.g., Release, Debug): " + "?")
print("- Library versions:")
print(" - TensorRT Edge-LLM version or commit hash: " + "?")
print(" - CUDA: " + cuda_version)
print(" - TensorRT: " + tensorrt_version)
print(" - C++ compiler (e.g., GCC 11.4): " + gcc_version)
print("- CMake options used:")
print(" - CMAKE_TOOLCHAIN_FILE: " + "?")
print(" - EMBEDDED_TARGET: " + "?")
print(" - TRT_PACKAGE_DIR: " + "?")
print("- Any other details that may help: " + "?")
print("=" * 70)