Qwen3-30B-A3B-GPTQ-Int4 export failed #83

## Describe the bug
tensorrt-edgellm-export-llm --model_dir /workspace/models/Qwen3-30B-A3B-GPTQ-Int4 --output_dir $MODEL_NAME/onnx --device cpu
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
/workspace/venv/tesorrt-edge-llm/lib/python3.10/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 5.8.0 is not tested with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
ModelOpt save/restore enabled for `transformers` library.
ModelOpt save/restore enabled for `peft` library.
ModelOpt save/restore enabled for `transformers` library.
ModelOpt save/restore enabled for `peft` library.
Exporting standard model to ONNX format
Loading standard model from /workspace/models/Qwen3-30B-A3B-GPTQ-Int4
Loading GPTQ MoE model from /workspace/models/Qwen3-30B-A3B-GPTQ-Int4. You might see warnings saying 'Some weights of the model checkpoint at Qwen/Qwen3-30B-A3B-GPTQ-Int4 were not used when initializing Qwen3MoeForCausalLM', which is expected. The weights will be fixed automatically afterwards.
[transformers] `torch_dtype` is deprecated! Use `dtype` instead!

WARN  Python GIL is enabled: Multi-gpu quant acceleration for MoE models is sub-optimal and multi-core accelerated cpu packing is also disabled. We recommend Python >= 3.13.3t with Pytorch > 2.8 for mult-gpu quantization and multi-cpu packing with env `PYTHON_GIL=0`.
WARN  Feature `utils/Perplexity` requires Python < 3.14 and Python GIL enabled and Python >= 3.13.3T (T for Threading-Free edition of Python) plus Torch 2.8. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_ALLOC_CONF='expandable_segments:True,max_split_size_mb:256,garbage_collection_threshold:0.7' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO

_____/\\\\\\\\\\\\__/\\\\\\\\\\\\\____/\\\\\\\\\\\\\\\______________________/\\\________/\\\\____________/\\\\_______________________/\\\__________________/\\\\\\____
 ___/\\\//////////__\/\\\/////////\\\_\///////\\\/////____________________/\\\\/\\\\____\/\\\\\\________/\\\\\\______________________\/\\\_________________\////\\\____
  __/\\\_____________\/\\\_______\/\\\_______\/\\\_______________________/\\\//\////\\\__\/\\\//\\\____/\\\//\\\______________________\/\\\____________________\/\\\____
   _\/\\\____/\\\\\\\_\/\\\\\\\\\\\\\/________\/\\\________/\\\\\\\\\\\__/\\\______\//\\\_\/\\\\///\\\/\\\/_\/\\\_____/\\\\\___________\/\\\______/\\\\\\\\_____\/\\\____
    _\/\\\___\/////\\\_\/\\\/////////__________\/\\\_______\///////////__\//\\\______/\\\__\/\\\__\///\\\/___\/\\\___/\\\///\\\____/\\\\\\\\\____/\\\/////\\\____\/\\\____
     _\/\\\_______\/\\\_\/\\\___________________\/\\\______________________\///\\\\/\\\\/___\/\\\____\///_____\/\\\__/\\\__\//\\\__/\\\////\\\___/\\\\\\\\\\\_____\/\\\____
      _\/\\\_______\/\\\_\/\\\___________________\/\\\________________________\////\\\//_____\/\\\_____________\/\\\_\//\\\__/\\\__\/\\\__\/\\\__\//\\///////______\/\\\____
       _\//\\\\\\\\\\\\/__\/\\\___________________\/\\\___________________________\///\\\\\\__\/\\\_____________\/\\\__\///\\\\\/___\//\\\\\\\/\\__\//\\\\\\\\\\__/\\\\\\\\\_
        __\////////////____\///____________________\///______________________________\//////___\///______________\///_____\/////______\///////\//____\//////////__\/////////__

[W508 02:32:55.984626263 Context.cpp:424] Warning: torch.backends.cuda.preferred_linalg_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
/workspace/venv/tesorrt-edge-llm/lib/python3.10/site-packages/kernels/utils.py:401: FutureWarning: Future versions of `kernels` (>=0.15) will require specifying a kernel version or revision. See: https://huggingface.co/docs/kernels/migration
  revision = select_revision_or_version(repo_id, revision=revision, version=version)
'[Errno 101] Network is unreachable' thrown while requesting HEAD https://huggingface.co/kernels/kernels-community/quantization_gptq/resolve/main/kernel-status.toml
WARNING:huggingface_hub.utils._http:'[Errno 101] Network is unreachable' thrown while requesting HEAD https://huggingface.co/kernels/kernels-community/quantization_gptq/resolve/main/kernel-status.toml
Retrying in 1s [Retry 1/5].
WARNING:huggingface_hub.utils._http:Retrying in 1s [Retry 1/5].
Failed to load CPU gemm_4bit kernel: Cannot send a request, as the client has been closed.. Use fallback path.                         Please make sure you already `pip install kernels` and the kernels >= 0.11.1
INFO  Kernel: Auto-selection: adding candidate `TorchFusedQuantLinear`
INFO  Kernel: selected -> `TorchFusedQuantLinear`.
[transformers] `loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 963/963 [00:00<00:00, 1833.36it/s]
[transformers] Qwen3MoeForCausalLM LOAD REPORT from: /workspace/models/Qwen3-30B-A3B-GPTQ-Int4
Key                                                           | Status     | Details
--------------------------------------------------------------+------------+--------
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.g_idx   | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.scales    | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.qweight   | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.qweight | UNEXPECTED |
model.layers.{0...47}.mlp.gate.g_idx                          | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.qzeros  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.qzeros  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.scales  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.scales  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.qweight | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.qzeros    | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.g_idx   | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.g_idx     | UNEXPECTED |
model.layers.{0...47}.mlp.gate.qweight                        | UNEXPECTED |
model.layers.{0...47}.mlp.gate.scales                         | UNEXPECTED |
model.layers.{0...47}.mlp.gate.qzeros                         | UNEXPECTED |
model.layers.{0...47}.mlp.gate.weight                         | MISSING    |
model.layers.{0...47}.mlp.experts.down_proj                   | MISSING    |
model.layers.{0...47}.mlp.experts.gate_up_proj                | MISSING    |

Notes:
- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING:      those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
INFO  QuantizeConfig: offload_to_disk_path auto set to `./gptqmodel_offload/hqdpgrum-rkaakpxx/`
INFO  Format: Converting `checkpoint_format` from `gptq` to internal `gptq_v2`.
INFO  Format: Converting GPTQ v1 to v2
INFO  Optimize: `TorchFusedQuantLinear` compilation triggered.
INFO  gc.collect() reclaimed 10 objects in 0.226s
Warning: No gate weights found in checkpoint
GPTQ load dtype normalization: cast 0 params and 2 buffers to torch.float16; skipped 192 GPTQ quantized modules.
Warning: Loaded processor from /workspace/models/Qwen3-30B-A3B-GPTQ-Int4. The processor will skip image processing for images smaller than 128x28x28 or bigger than 2048x32x32 due to excessive memory usage during image quantization.

=== Exporting model ===
Detected MoE model, replacing MoE blocks with Int4MoePlugin
Registered ONNX symbolic functions for custom Int4MoePlugin
Error during LLM model export: 'Qwen3MoeSparseMoeBlock' object has no attribute 'num_experts'
Traceback:
Traceback (most recent call last):
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/scripts/export_llm.py", line 113, in main
    export_llm_model(model_dir=args.model_dir,
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/onnx_export/llm_export.py", line 1023, in export_llm_model
    model = replace_moe_blocks_with_plugin(model)
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/llm_models/layers/int4_moe_plugin.py", line 544, in replace_moe_blocks_with_plugin
    new_module = Int4MoePluginModule(module, group_size)
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/llm_models/layers/int4_moe_plugin.py", line 429, in __init__
    self.num_experts = moe_block.num_experts
  File "/workspace/venv/tesorrt-edge-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1965, in __getattr__
    raise AttributeError(
AttributeError: 'Qwen3MoeSparseMoeBlock' object has no attribute 'num_experts'
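
The traceback shows `Int4MoePluginModule.__init__` reading `moe_block.num_experts` directly, and the installed `Qwen3MoeSparseMoeBlock` has no such attribute. Below is a minimal sketch of a more defensive lookup. It assumes, without confirmation from this report, that the installed transformers release keeps the expert count on a sub-module such as `experts` or `gate`, or only on the model config. `resolve_num_experts` is a hypothetical helper, not part of TensorRT Edge-LLM.

```python
from typing import Any, Optional


def resolve_num_experts(moe_block: Any, config: Optional[Any] = None) -> int:
    """Best-effort lookup of the expert count for a sparse-MoE block.

    Hypothetical helper: the attribute locations probed here are assumptions,
    not a documented transformers or TensorRT Edge-LLM contract.
    """
    # Probe the block itself first (the layout the failing code expects),
    # then the sub-modules that may carry the count in other layouts.
    owners = (
        moe_block,
        getattr(moe_block, "experts", None),
        getattr(moe_block, "gate", None),
    )
    for owner in owners:
        value = getattr(owner, "num_experts", None)
        if isinstance(value, int):
            return value

    # Fall back to the model config, if the caller can supply one.
    for name in ("num_experts", "num_local_experts"):
        value = getattr(config, name, None)
        if isinstance(value, int):
            return value

    raise AttributeError(
        f"could not determine the expert count for {type(moe_block).__name__}"
    )
```

If this matches the actual module layout, the failing line could become `self.num_experts = resolve_num_experts(moe_block, config)`, provided a config object is reachable at that point (also an assumption).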

## Expected behavior

## System information (x86 Host with GPU)
TensorRT Edge-LLM: 0.7.0

Package                  Version
------------------------ -----------
accelerate               1.13.0
aiohappyeyeballs         2.6.1
aiohttp                  3.13.5
aiosignal                1.4.0
annotated-doc            0.0.4
annotated-types          0.7.0
anyio                    4.13.0
async-timeout            5.0.1
attrs                    26.1.0
audioread                3.1.0
backoff                  2.2.1
certifi                  2026.4.22
cffi                     2.0.0
charset-normalizer       3.4.7
click                    8.3.3
coloredlogs              15.0.1
cppimport                26.4.17
cuda-bindings            12.9.4
cuda-pathfinder          1.5.4
cupy-cuda12x             14.0.1
datasets                 4.4.2
decorator                5.2.1
Device-SMI               0.5.6
dill                     0.4.0
einops                   0.8.2
exceptiongroup           1.3.1
filelock                 3.29.0
flatbuffers              25.12.19
frozenlist               1.8.0
fsspec                   2025.10.0
GPTQModel                5.7.0
h11                      0.16.0
hf_transfer              0.1.9
hf-xet                   1.5.0
httpcore                 1.0.9
httpx                    0.28.1
huggingface_hub          1.14.0
humanfriendly            10.0
idna                     3.13
Jinja2                   3.1.6
joblib                   1.5.3
kernels                  0.14.0
kernels-data             0.14.0
lazy-loader              0.5
librosa                  0.11.0
llvmlite                 0.47.0
LogBar                   0.4.3
Mako                     1.3.12
markdown-it-py           4.1.0
MarkupSafe               3.0.3
maturin                  1.13.1
mdurl                    0.1.2
ml_dtypes                0.5.4
mpmath                   1.3.0
msgpack                  1.1.2
multidict                6.7.1
multiprocess             0.70.18
networkx                 3.4.2
ninja                    1.13.0
numba                    0.65.1
numpy                    2.2.6
nvidia-cublas-cu12       12.8.4.1
nvidia-cuda-cupti-cu12   12.8.90
nvidia-cuda-nvrtc-cu12   12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12        9.10.2.21
nvidia-cufft-cu12        11.3.3.83
nvidia-cufile-cu12       1.13.1.3
nvidia-curand-cu12       10.3.9.90
nvidia-cusolver-cu12     11.7.3.90
nvidia-cusparse-cu12     12.5.8.93
nvidia-cusparselt-cu12   0.7.1
nvidia-ml-py             13.595.45
nvidia-modelopt          0.39.0
nvidia-nccl-cu12         2.27.5
nvidia-nvjitlink-cu12    12.8.93
nvidia-nvshmem-cu12      3.4.5
nvidia-nvtx-cu12         12.8.90
onnx                     1.19.0
onnx_graphsurgeon        0.6.1
onnx-ir                  0.2.1
onnxconverter-common     1.16.0
onnxruntime-gpu          1.22.0
onnxscript               0.7.0
onnxsim                  0.6.3
optimum                  2.1.0
packaging                26.2
pandas                   2.3.3
peft                     0.18.1
pillow                   12.1.1
pip                      22.0.2
platformdirs             4.9.6
polygraphy               0.49.26
pooch                    1.9.0
propcache                0.4.1
protobuf                 7.34.1
psutil                   7.2.2
PuLP                     3.3.1
pyarrow                  24.0.0
pybind11                 3.0.4
pycparser                3.0
pydantic                 2.13.4
pydantic_core            2.46.4
Pygments                 2.20.0
PyPcre                   0.3.2
python-dateutil          2.9.0.post0
pytz                     2026.2
PyYAML                   6.0.3
regex                    2026.4.4
requests                 2.33.1
rich                     15.0.0
safetensors              0.7.0
scikit-learn             1.7.2
scipy                    1.15.3
sentencepiece            0.2.1
setuptools               59.6.0
shellingham              1.5.4
six                      1.17.0
soundfile                0.13.1
soxr                     1.1.0
sympy                    1.14.0
tensorrt-edgellm         0.7.0
threadpoolctl            3.6.0
tiktoken                 0.12.0
TokeNicer                0.0.13
tokenizers               0.22.2
tomli                    2.4.1
tomlkit                  0.14.0
torch                    2.10.0
torchao                  0.17.0
torchprofile             0.1.0
torchvision              0.25.0
tqdm                     4.67.3
transformers             5.3.0
triton                   3.6.0
typer                    0.25.1
typing_extensions        4.15.0
typing-inspection        0.4.2
tzdata                   2026.2
urllib3                  2.6.3
xxhash                   3.7.0
yarl                     1.23.0


</details>
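
Note that the export log reports `transformers version 5.8.0`, while the package list above shows `transformers 5.3.0`. A small sketch like the following, run with the same interpreter that launches `tensorrt-edgellm-export-llm`, could confirm which versions that environment actually resolves. The package names are taken from the list above; `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages mentioned in the warnings and the traceback of this report.
    names = ["torch", "transformers", "nvidia-modelopt", "GPTQModel",
             "kernels", "tensorrt-edgellm"]
    for name, ver in installed_versions(names).items():
        print(f"{name:20s} {ver or 'not installed'}")
```

A mismatch between this output and `pip list` would suggest the command and the list came from different environments.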