Worse performance for Qwen3.5-2B

## Describe the bug


I re-exported the ONNX model with the new `llm_loader.export_all_cli`, which calls `fuse_gdn_input_projections()` to fuse the GEMMs in front of the GDN and then splits the output into `mixed_qkv`, `a`, `b`, and `z`. Fusing speeds up the GEMM itself, but it introduces a large slice op that ultimately negates the gain from the fused GEMM.
I tested Qwen3.5-2B-NVFP4 on my Thor and found that prefill is slower than in v0.7.0, while decode is slightly faster.
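
For readers unfamiliar with the pattern, here is a minimal NumPy sketch of a fuse-then-split projection. It is not the code behind `fuse_gdn_input_projections()`; the tensor sizes are made up for illustration and are not the real Qwen3.5-2B dimensions. It only shows that the fused GEMM gives the same results as separate GEMMs, and that each split output is a strided slice of one larger buffer, which is where an extra slice/copy step can come from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the real model dimensions).
tokens, hidden = 8, 16
dims = {"mixed_qkv": 12, "a": 2, "b": 2, "z": 4}

x = rng.standard_normal((tokens, hidden))
weights = {name: rng.standard_normal((hidden, d)) for name, d in dims.items()}

# Unfused: one GEMM per projection.
unfused = {name: x @ w for name, w in weights.items()}

# Fused: concatenate the weights along the output axis, run one GEMM, then split.
w_fused = np.concatenate(list(weights.values()), axis=1)
y_fused = x @ w_fused
split_points = np.cumsum(list(dims.values()))[:-1]
fused = dict(zip(dims, np.split(y_fused, split_points, axis=1)))

# Both paths produce the same projections.
for name in dims:
    assert np.allclose(unfused[name], fused[name])

# Each split is a column slice of a row-major buffer, so it is not contiguous;
# a consumer that needs a dense tensor has to copy it out first.
contiguous_flags = {name: part.flags["C_CONTIGUOUS"] for name, part in fused.items()}
print(contiguous_flags)
```

Whether the exported graph materializes these slices as copies depends on how TensorRT schedules them; the sketch only illustrates the data layout.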

| Version | ISL | OSL | Prefill | Decode |
|---|---|---|---|---|
| v0.7.0 | 931 | 128 | 24.14 ms | 9.03 ms |
| v0.7.1 | 931 | 128 | 33.44 ms | 8.84 ms |

The Nsight report for the prefill stage shows how the kernel time changed: 78.496 us (`in_proj_qkv`) + 80.712 us (siluslice + `in_proj_a/b/z` + slice) before the fusion, versus 103.104 us (fused `in_proj_qkv/a/b/z`) + 44.128 us (slice) + 41.568 us (siluslice) after it.
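
To make the comparison explicit, the short script below sums the kernel times and computes the relative change in end-to-end latency. It uses only the figures quoted in this report; the dictionary keys are just labels chosen here for readability.

```python
# Kernel times in microseconds, copied from the Nsight figures quoted above.
before_us = {"in_proj_qkv": 78.496, "siluslice + in_proj_a/b/z + slice": 80.712}
after_us = {"fused in_proj_qkv/a/b/z": 103.104, "slice": 44.128, "siluslice": 41.568}

total_before = sum(before_us.values())
total_after = sum(after_us.values())
delta = total_after - total_before

print(f"before fusion: {total_before:.3f} us")
print(f"after fusion:  {total_after:.3f} us")
print(f"difference:    {delta:+.3f} us ({delta / total_before:+.1%})")

# End-to-end latencies in milliseconds as (v0.7.0, v0.7.1).
latency_ms = {"prefill": (24.14, 33.44), "decode": (9.03, 8.84)}
changes = {stage: (new - old) / old for stage, (old, new) in latency_ms.items()}
for stage, change in changes.items():
    print(f"{stage}: {change:+.1%}")
```

By these numbers the fused path spends about 29.6 us more in these kernels (roughly 18.6%), prefill is about 38.5% slower, and decode is about 2.1% faster.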

Is there any plan to optimize this?

### Steps/Code to reproduce bug



1. Quantize Qwen/Qwen3.5-2B to NVFP4.
2. Export the quantized Qwen3.5-2B-NVFP4 to an ONNX graph.
3. Build the quantized Qwen3.5-2B-NVFP4 into a TensorRT engine.
4. Benchmark with `examples/llm/llm_bench`.

**Build configuration:**
```bash
cmake .. -DCUDA_CTK_VERSION=13.0 -DCMAKE_CUDA_ARCHITECTURES=110a -DTRT_PACKAGE_DIR=/home/user/TensorRT-10.14.2.2/ -DAARCH64_BUILD=ON -DCUDA_DIR=/usr/local/cuda/targets/aarch64-linux -DCUDA_TARGET_DIR=/usr/local/cuda/thor/targets/aarch64-linux -DENABLE_CUTE_DSL=ALL
```

**Runtime command used:**
```bash
examples/llm/llm_bench --engineDir /home/user/trt_engines/Qwen3.5-2B-NVFP4/llm/ --mode prefill --inputLen 1024
examples/llm/llm_bench --engineDir /home/user/trt_engines/Qwen3.5-2B-NVFP4/llm/ --mode decode --pastKVLen 1024
```

### Expected behavior

## System information (Edge Device)

- Platform (e.g., NVIDIA Jetson Thor): Drive ThorX
- Software release (e.g., JetPack 7.1): DriveOS 7.0.2
- CPU architecture: ? 
- GPU compute capability (e.g., SM110 for Jetson Thor): sm110a
- Total device memory: ?
- Build type (e.g., Release, Debug): ?
- Library versions:
    - TensorRT Edge-LLM version or commit hash: v0.7.1
    - CUDA: 13.0
    - TensorRT: 10.14.2.2
    - C++ compiler (e.g., GCC 11.4): ?
- CMake options used:
    - CMAKE_TOOLCHAIN_FILE: ?
    - EMBEDDED_TARGET: ?
    - TRT_PACKAGE_DIR: ?
- Any other details that may help: ?

<details>
<summary><b>Click to expand: Python script to automatically collect system information</b></summary>

```python
import platform
import re
import subprocess


def get_cuda_version():
    try:
        nvcc_output = subprocess.check_output("nvcc --version", shell=True).decode("utf-8")
        match = re.search(r"release (\d+\.\d+)", nvcc_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    # Fall back to "?" on errors and when the version pattern is not found.
    return "?"


def get_tensorrt_version():
    try:
        dpkg_output = subprocess.check_output("dpkg -l | grep tensorrt", shell=True).decode("utf-8")
        match = re.search(r"(\d+\.\d+\.\d+)", dpkg_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    # Fall back to "?" on errors and when the version pattern is not found.
    return "?"


def get_gcc_version():
    try:
        gcc_output = subprocess.check_output("gcc --version", shell=True).decode("utf-8")
        match = re.search(r"gcc.*?(\d+\.\d+\.\d+)", gcc_output)
        if match:
            return f"GCC {match.group(1)}"
    except Exception:
        pass
    # Fall back to "?" on errors and when the version pattern is not found.
    return "?"


def get_gpu_compute_capability():
    try:
        # Try to get GPU compute capability from nvidia-smi
        smi_output = subprocess.check_output(
            "nvidia-smi --query-gpu=compute_cap --format=csv,noheader",
            shell=True
        ).decode("utf-8").strip()
        if smi_output:
            cap = smi_output.replace(".", "")
            return f"SM{cap}"
    except Exception:
        pass
    return "?"


def get_total_memory():
    try:
        mem_output = subprocess.check_output("free -h | grep Mem", shell=True).decode("utf-8")
        match = re.search(r"\s+(\d+\.\d+[GM]i?)\s+", mem_output)
        if match:
            return match.group(1)
    except Exception:
        pass
    # Fall back to "?" on errors and when the memory pattern is not found.
    return "?"


# Get system info
cpu_arch = platform.machine()
platform_name = "?"
software_release = "?"
gpu_compute_cap = get_gpu_compute_capability()
total_memory = get_total_memory()
cuda_version = get_cuda_version()
tensorrt_version = get_tensorrt_version()
gcc_version = get_gcc_version()

# Print system information in the format required for the issue template
print("=" * 70)
print("## System information (Edge Device)")
print()
print("- Platform (e.g., NVIDIA Jetson Thor, NVIDIA DRIVE Thor): " + platform_name)
print("- Software release (e.g., JetPack 7.1, NVIDIA DRIVE OS 7): " + software_release)
print("- CPU architecture: " + cpu_arch)
print("- GPU compute capability (e.g., SM110 for Jetson Thor): " + gpu_compute_cap)
print("- Total device memory: " + total_memory)
print("- Build type (e.g., Release, Debug): " + "?")
print("- Library versions:")
print("  - TensorRT Edge-LLM version or commit hash: " + "?")
print("  - CUDA: " + cuda_version)
print("  - TensorRT: " + tensorrt_version)
print("  - C++ compiler (e.g., GCC 11.4): " + gcc_version)
print("- CMake options used:")
print("  - CMAKE_TOOLCHAIN_FILE: " + "?")
print("  - EMBEDDED_TARGET: " + "?")
print("  - TRT_PACKAGE_DIR: " + "?")
print("- Any other details that may help: " + "?")
print("=" * 70)
```

</details>
