Worse performance for Qwen3.5-2B #96

@gesanqiu

Describe the bug

I re-exported the ONNX model through the new llm_loader.export_all_cli, which uses fuse_gdn_input_projections() to fuse the GEMMs before the GDN and then splits the output into mixed_qkv, a, b, and z. This improves GEMM performance, but it introduces a large slice op that ultimately negates the gains from the fused GEMM.
I tested Qwen3.5-2B-NVFP4 on my Thor and found that prefill is slower than in v0.7.0, while decode is slightly faster.
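To make the trade-off concrete, here is a toy sketch in plain Python of the two graph shapes described above: four separate projections versus one fused projection followed by a split. It is not the code of fuse_gdn_input_projections(); the sizes, weight values, and helper names (matmul, make_weight) are assumptions made up for illustration, and plain lists are used only to keep the sketch dependency-free.

```python
# Hypothetical sizes for illustration only; real GDN projection widths are far larger.
HIDDEN = 4
WIDTHS = {"mixed_qkv": 6, "a": 2, "b": 2, "z": 3}


def matmul(x, w):
    """Multiply x (tokens x in_dim) by w (in_dim x out_dim) using plain lists."""
    out_dim = len(w[0])
    return [
        [sum(xk * wk[j] for xk, wk in zip(row, w)) for j in range(out_dim)]
        for row in x
    ]


def make_weight(in_dim, out_dim, seed):
    """Build a deterministic toy integer weight matrix (values are arbitrary)."""
    return [
        [((seed + 3 * k + 5 * j) % 7) - 3 for j in range(out_dim)]
        for k in range(in_dim)
    ]


weights = {
    name: make_weight(HIDDEN, width, seed)
    for seed, (name, width) in enumerate(WIDTHS.items())
}
x = [[1, 2, 3, 4], [0, -1, 2, 1], [2, 0, 1, -2]]  # 3 tokens

# Unfused path: one GEMM per projection.
separate = {name: matmul(x, w) for name, w in weights.items()}

# Fused path: concatenate the weights along the output dimension,
# run a single wide GEMM, then slice the wide output back apart.
fused_weight = [sum((weights[name][k] for name in WIDTHS), []) for k in range(HIDDEN)]
fused_out = matmul(x, fused_weight)

split = {}
offset = 0
for name, width in WIDTHS.items():
    # This slice is the extra step that the fused graph has to pay for.
    split[name] = [row[offset:offset + width] for row in fused_out]
    offset += width

assert split == separate  # both paths produce identical projections
```

Both paths give the same numbers, so the fusion only pays off if the single wide GEMM saves more time than the split costs. On the GPU that split appears as separate slice work over the wide output, which is the overhead this issue is about.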
Performance of v0.7.0:
ISL: 931 OSL: 128
Prefill: 24.14ms
Decode: 9.03ms
Performance of v0.7.1:
ISL: 931 OSL: 128
Prefill: 33.44ms
Decode: 8.84ms
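As a quick check on the size of the change, the relative differences can be computed from the four numbers above. The helper name relative_change is only illustrative.

```python
# Latencies in milliseconds, copied from the measurements above.
prefill_ms = {"v0.7.0": 24.14, "v0.7.1": 33.44}
decode_ms = {"v0.7.0": 9.03, "v0.7.1": 8.84}


def relative_change(old, new):
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


prefill_change = relative_change(prefill_ms["v0.7.0"], prefill_ms["v0.7.1"])
decode_change = relative_change(decode_ms["v0.7.0"], decode_ms["v0.7.1"])

print(f"prefill: {prefill_change:+.1%}")  # prefill: +38.5%
print(f"decode:  {decode_change:+.1%}")   # decode:  -2.1%
```

So prefill is roughly 38.5% slower, while decode is only about 2.1% faster.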
The Nsight report for the prefill stage shows the change from v0.7.0 to v0.7.1: 78.496us (in_proj_qkv) + 80.712us (siluslice + in_proj_a/b/z + slice) --> 103.104us (fused in_proj_qkv/a/b/z) + 44.128us (slice) + 41.568us (siluslice).
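Summing the Nsight figures above makes the net effect explicit. The dictionary keys are just labels for the kernel groups named in the report.

```python
# Kernel times in microseconds, copied from the Nsight figures above.
v070 = {
    "in_proj_qkv": 78.496,
    "siluslice + in_proj_a/b/z + slice": 80.712,
}
v071 = {
    "in_proj_qkv/a/b/z (fused)": 103.104,
    "slice": 44.128,
    "siluslice": 41.568,
}

total_v070 = sum(v070.values())
total_v071 = sum(v071.values())
delta = total_v071 - total_v070

print(f"v0.7.0 total: {total_v070:.3f} us")  # v0.7.0 total: 159.208 us
print(f"v0.7.1 total: {total_v071:.3f} us")  # v0.7.1 total: 188.800 us
print(f"difference:   {delta:+.3f} us ({delta / total_v070:+.1%})")  # +29.592 us (+18.6%)
```

The fused path is about 29.6us slower for this group of kernels, even though the fused GEMM alone (103.104us) is faster than the original 159.208us total. The slice and siluslice kernels together add 85.696us, which is more than the GEMM fusion saves.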

Is there any plan to optimize this?

Steps/Code to reproduce bug

  1. Quantize Qwen/Qwen3.5-2B to NVFP4.
  2. Export the quantized Qwen3.5-2B-NVFP4 to an ONNX graph.
  3. Build the quantized Qwen3.5-2B-NVFP4 into a TensorRT engine.
  4. Benchmark with examples/llm/llm_bench.

Build configuration:

cmake .. -DCUDA_CTK_VERSION=13.0 -DCMAKE_CUDA_ARCHITECTURES=110a -DTRT_PACKAGE_DIR=/home/user/TensorRT-10.14.2.2/ -DAARCH64_BUILD=ON -DCUDA_DIR=/usr/local/cuda/targets/aarch64-linux -DCUDA_TARGET_DIR=/usr/local/cuda/thor/targets/aarch64-linux -DENABLE_CUTE_DSL=ALL

Runtime command used:

examples/llm/llm_bench --engineDir /home/user/trt_engines/Qwen3.5-2B-NVFP4/llm/ --mode prefill --inputLen 1024
examples/llm/llm_bench --engineDir /home/user/trt_engines/Qwen3.5-2B-NVFP4/llm/ --mode decode --pastKVLen 1024

Expected behavior

System information (Edge Device)

  • Platform (e.g., NVIDIA Jetson Thor): Drive ThorX
  • Software release (e.g., JetPack 7.1): DriveOS 7.0.2
  • CPU architecture: ?
  • GPU compute capability (e.g., SM110 for Jetson Thor): sm110a
  • Total device memory: ?
  • Build type (e.g., Release, Debug): ?
  • Library versions:
    • TensorRT Edge-LLM version or commit hash: v0.7.1
    • CUDA: 13.0
    • TensorRT: 10.14.2.2
    • C++ compiler (e.g., GCC 11.4): ?
  • CMake options used:
    • CMAKE_TOOLCHAIN_FILE: ?
    • EMBEDDED_TARGET: ?
    • TRT_PACKAGE_DIR: ?
  • Any other details that may help: ?
Python script to automatically collect system information:
import platform
import re
import subprocess


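def get_cuda_version():
    """Return the CUDA release reported by nvcc, or "?" if unavailable."""
    try:
        nvcc_output = subprocess.check_output("nvcc --version", shell=True).decode("utf-8")
        match = re.search(r"release (\d+\.\d+)", nvcc_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    # Fall through when nvcc is missing or its output does not match.
    return "?"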


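def get_tensorrt_version():
    """Return the first TensorRT version listed by dpkg, or "?" if unavailable."""
    try:
        dpkg_output = subprocess.check_output("dpkg -l | grep tensorrt", shell=True).decode("utf-8")
        match = re.search(r"(\d+\.\d+\.\d+)", dpkg_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    # Fall through when no package is installed or no version is found.
    return "?"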


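def get_gcc_version():
    """Return the GCC version as "GCC x.y.z", or "?" if unavailable."""
    try:
        gcc_output = subprocess.check_output("gcc --version", shell=True).decode("utf-8")
        match = re.search(r"gcc.*?(\d+\.\d+\.\d+)", gcc_output)
        if match:
            return f"GCC {match.group(1)}"
    except Exception:
        pass
    # Fall through when gcc is missing or its output does not match.
    return "?"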


def get_gpu_compute_capability():
    try:
        # Try to get GPU compute capability from nvidia-smi
        smi_output = subprocess.check_output(
            "nvidia-smi --query-gpu=compute_cap --format=csv,noheader",
            shell=True
        ).decode("utf-8").strip()
        if smi_output:
            cap = smi_output.replace(".", "")
            return f"SM{cap}"
    except Exception:
        pass
    return "?"


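def get_total_memory():
    """Return the total memory column from `free -h`, or "?" if unavailable."""
    try:
        mem_output = subprocess.check_output("free -h | grep Mem", shell=True).decode("utf-8")
        # The total is the first column after "Mem:"; it may lack a decimal part (e.g. "122Gi").
        match = re.search(r"Mem:\s+(\d+(?:\.\d+)?[KMGT]i?)", mem_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    return "?"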


# Get system info
cpu_arch = platform.machine()
platform_name = "?"
software_release = "?"
gpu_compute_cap = get_gpu_compute_capability()
total_memory = get_total_memory()
cuda_version = get_cuda_version()
tensorrt_version = get_tensorrt_version()
gcc_version = get_gcc_version()

# Print system information in the format required for the issue template
print("=" * 70)
print("## System information (Edge Device)")
print()
print("- Platform (e.g., NVIDIA Jetson Thor, NVIDIA DRIVE Thor): " + platform_name)
print("- Software release (e.g., JetPack 7.1, NVIDIA DRIVE OS 7): " + software_release)
print("- CPU architecture: " + cpu_arch)
print("- GPU compute capability (e.g., SM110 for Jetson Thor): " + gpu_compute_cap)
print("- Total device memory: " + total_memory)
print("- Build type (e.g., Release, Debug): " + "?")
print("- Library versions:")
print("  - TensorRT Edge-LLM version or commit hash: " + "?")
print("  - CUDA: " + cuda_version)
print("  - TensorRT: " + tensorrt_version)
print("  - C++ compiler (e.g., GCC 11.4): " + gcc_version)
print("- CMake options used:")
print("  - CMAKE_TOOLCHAIN_FILE: " + "?")
print("  - EMBEDDED_TARGET: " + "?")
print("  - TRT_PACKAGE_DIR: " + "?")
print("- Any other details that may help: " + "?")
print("=" * 70)

Labels: bug