Qwen3-30B-A3B-GPTQ-Int4 export failed #83

@bigbighuang

Describe the bug

tensorrt-edgellm-export-llm --model_dir /workspace/models/Qwen3-30B-A3B-GPTQ-Int4 --output_dir $MODEL_NAME/onnx --device cpu
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
/workspace/venv/tesorrt-edge-llm/lib/python3.10/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 5.8.0 is not tested with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[transformers] Qwen2VLImageProcessorFast is deprecated. The Fast suffix for image processors has been removed; use Qwen2VLImageProcessor instead.
ModelOpt save/restore enabled for transformers library.
ModelOpt save/restore enabled for peft library.
ModelOpt save/restore enabled for transformers library.
ModelOpt save/restore enabled for peft library.
Exporting standard model to ONNX format
Loading standard model from /workspace/models/Qwen3-30B-A3B-GPTQ-Int4
Loading GPTQ MoE model from /workspace/models/Qwen3-30B-A3B-GPTQ-Int4. You might see warnings saying 'Some weights of the model checkpoint at Qwen/Qwen3-30B-A3B-GPTQ-Int4 were not used when initializing Qwen3MoeForCausalLM', which is expected. The weights will be fixed automatically afterwards.
[transformers] torch_dtype is deprecated! Use dtype instead!

WARN Python GIL is enabled: Multi-gpu quant acceleration for MoE models is sub-optimal and multi-core accelerated cpu packing is also disabled. We recommend Python >= 3.13.3t with Pytorch > 2.8 for mult-gpu quantization and multi-cpu packing with env PYTHON_GIL=0.
WARN Feature utils/Perplexity requires Python < 3.14 and Python GIL enabled and Python >= 3.13.3T (T for Threading-Free edition of Python) plus Torch 2.8. Feature is currently skipped/disabled.
INFO ENV: Auto setting PYTORCH_ALLOC_CONF='expandable_segments:True,max_split_size_mb:256,garbage_collection_threshold:0.7' for memory saving.
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.

[W508 02:32:55.984626263 Context.cpp:424] Warning: torch.backends.cuda.preferred_linalg_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
/workspace/venv/tesorrt-edge-llm/lib/python3.10/site-packages/kernels/utils.py:401: FutureWarning: Future versions of kernels (>=0.15) will require specifying a kernel version or revision. See: https://huggingface.co/docs/kernels/migration
  revision = select_revision_or_version(repo_id, revision=revision, version=version)
'[Errno 101] Network is unreachable' thrown while requesting HEAD https://huggingface.co/kernels/kernels-community/quantization_gptq/resolve/main/kernel-status.toml
WARNING:huggingface_hub.utils._http:'[Errno 101] Network is unreachable' thrown while requesting HEAD https://huggingface.co/kernels/kernels-community/quantization_gptq/resolve/main/kernel-status.toml
Retrying in 1s [Retry 1/5].
WARNING:huggingface_hub.utils._http:Retrying in 1s [Retry 1/5].
Failed to load CPU gemm_4bit kernel: Cannot send a request, as the client has been closed.. Use fallback path. Please make sure you already pip install kernels and the kernels >= 0.11.1
INFO Kernel: Auto-selection: adding candidate TorchFusedQuantLinear
INFO Kernel: selected -> TorchFusedQuantLinear.
[transformers] loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
Loading weights: 100%|██████████| 963/963 [00:00<00:00, 1833.36it/s]
[transformers] Qwen3MoeForCausalLM LOAD REPORT from: /workspace/models/Qwen3-30B-A3B-GPTQ-Int4
Key                                                           | Status     | Details
--------------------------------------------------------------+------------+--------
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.g_idx   | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.scales    | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.qweight   | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.qweight | UNEXPECTED |
model.layers.{0...47}.mlp.gate.g_idx                          | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.qzeros  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.qzeros  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.scales  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.scales  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.qweight | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.qzeros    | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.g_idx   | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.g_idx     | UNEXPECTED |
model.layers.{0...47}.mlp.gate.qweight                        | UNEXPECTED |
model.layers.{0...47}.mlp.gate.scales                         | UNEXPECTED |
model.layers.{0...47}.mlp.gate.qzeros                         | UNEXPECTED |
model.layers.{0...47}.mlp.gate.weight                         | MISSING    |
model.layers.{0...47}.mlp.experts.down_proj                   | MISSING    |
model.layers.{0...47}.mlp.experts.gate_up_proj                | MISSING    |

Notes:

  • UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
  • MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

INFO QuantizeConfig: offload_to_disk_path auto set to ./gptqmodel_offload/hqdpgrum-rkaakpxx/
INFO Format: Converting checkpoint_format from gptq to internal gptq_v2.
INFO Format: Converting GPTQ v1 to v2
INFO Optimize: TorchFusedQuantLinear compilation triggered.
INFO gc.collect() reclaimed 10 objects in 0.226s
Warning: No gate weights found in checkpoint
GPTQ load dtype normalization: cast 0 params and 2 buffers to torch.float16; skipped 192 GPTQ quantized modules.
Warning: Loaded processor from /workspace/models/Qwen3-30B-A3B-GPTQ-Int4. The processor will skip image processing for images smaller than 128x28x28 or bigger than 2048x32x32 due to excessive memory usage during image quantization.

=== Exporting model ===
Detected MoE model, replacing MoE blocks with Int4MoePlugin
Registered ONNX symbolic functions for custom Int4MoePlugin
Error during LLM model export: 'Qwen3MoeSparseMoeBlock' object has no attribute 'num_experts'
Traceback:
Traceback (most recent call last):
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/scripts/export_llm.py", line 113, in main
    export_llm_model(model_dir=args.model_dir,
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/onnx_export/llm_export.py", line 1023, in export_llm_model
    model = replace_moe_blocks_with_plugin(model)
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/llm_models/layers/int4_moe_plugin.py", line 544, in replace_moe_blocks_with_plugin
    new_module = Int4MoePluginModule(module, group_size)
  File "/workspace/TensorRT-Edge-LLM/tensorrt_edgellm/llm_models/layers/int4_moe_plugin.py", line 429, in __init__
    self.num_experts = moe_block.num_experts
  File "/workspace/venv/tesorrt-edge-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1965, in __getattr__
    raise AttributeError(
AttributeError: 'Qwen3MoeSparseMoeBlock' object has no attribute 'num_experts'
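
The load report above lists `mlp.experts.gate_up_proj` and `mlp.experts.down_proj` as MISSING, which suggests the installed transformers build uses a fused experts layout. That would be consistent with `Qwen3MoeSparseMoeBlock` no longer exposing `num_experts` directly, although I have not confirmed it. A minimal sketch of a more tolerant lookup follows. `resolve_num_experts` is a hypothetical helper, not part of TensorRT Edge-LLM, and the fallback locations (`experts.num_experts`, `config.num_experts`) are assumptions about where the count might live:

```python
from types import SimpleNamespace


def resolve_num_experts(moe_block):
    """Return the expert count of an MoE block, trying several locations.

    Hypothetical helper: the lookup order is an assumption, not the
    TensorRT Edge-LLM implementation.
    """
    # 1. Layout the exporter currently expects: attribute on the block itself.
    direct = getattr(moe_block, "num_experts", None)
    if direct is not None:
        return int(direct)

    experts = getattr(moe_block, "experts", None)

    # 2. Assumed fused layout: the experts submodule carries the count.
    fused = getattr(experts, "num_experts", None)
    if fused is not None:
        return int(fused)

    # 3. Fall back to a config object, if the block keeps a reference to one.
    config = getattr(moe_block, "config", None)
    from_config = getattr(config, "num_experts", None)
    if from_config is not None:
        return int(from_config)

    # 4. Last resort: a list-like container holding one module per expert.
    if experts is not None and hasattr(experts, "__len__"):
        return len(experts)

    raise AttributeError(
        f"{type(moe_block).__name__} exposes no expert count in any known location"
    )


# Stand-in objects only; real blocks would be torch.nn.Module instances.
old_style = SimpleNamespace(num_experts=128)
new_style = SimpleNamespace(experts=SimpleNamespace(num_experts=128))
print(resolve_num_experts(old_style), resolve_num_experts(new_style))
```

Even if the count is resolved this way, the plugin would presumably still have to read the expert weights from the fused tensors, so this only addresses the first failing line.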

Expected behavior

System information (x86 Host with GPU)

TensorRT Edge-LLM: 0.7.0

Package Version


accelerate 1.13.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.5
aiosignal 1.4.0
annotated-doc 0.0.4
annotated-types 0.7.0
anyio 4.13.0
async-timeout 5.0.1
attrs 26.1.0
audioread 3.1.0
backoff 2.2.1
certifi 2026.4.22
cffi 2.0.0
charset-normalizer 3.4.7
click 8.3.3
coloredlogs 15.0.1
cppimport 26.4.17
cuda-bindings 12.9.4
cuda-pathfinder 1.5.4
cupy-cuda12x 14.0.1
datasets 4.4.2
decorator 5.2.1
Device-SMI 0.5.6
dill 0.4.0
einops 0.8.2
exceptiongroup 1.3.1
filelock 3.29.0
flatbuffers 25.12.19
frozenlist 1.8.0
fsspec 2025.10.0
GPTQModel 5.7.0
h11 0.16.0
hf_transfer 0.1.9
hf-xet 1.5.0
httpcore 1.0.9
httpx 0.28.1
huggingface_hub 1.14.0
humanfriendly 10.0
idna 3.13
Jinja2 3.1.6
joblib 1.5.3
kernels 0.14.0
kernels-data 0.14.0
lazy-loader 0.5
librosa 0.11.0
llvmlite 0.47.0
LogBar 0.4.3
Mako 1.3.12
markdown-it-py 4.1.0
MarkupSafe 3.0.3
maturin 1.13.1
mdurl 0.1.2
ml_dtypes 0.5.4
mpmath 1.3.0
msgpack 1.1.2
multidict 6.7.1
multiprocess 0.70.18
networkx 3.4.2
ninja 1.13.0
numba 0.65.1
numpy 2.2.6
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 13.595.45
nvidia-modelopt 0.39.0
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.4.5
nvidia-nvtx-cu12 12.8.90
onnx 1.19.0
onnx_graphsurgeon 0.6.1
onnx-ir 0.2.1
onnxconverter-common 1.16.0
onnxruntime-gpu 1.22.0
onnxscript 0.7.0
onnxsim 0.6.3
optimum 2.1.0
packaging 26.2
pandas 2.3.3
peft 0.18.1
pillow 12.1.1
pip 22.0.2
platformdirs 4.9.6
polygraphy 0.49.26
pooch 1.9.0
propcache 0.4.1
protobuf 7.34.1
psutil 7.2.2
PuLP 3.3.1
pyarrow 24.0.0
pybind11 3.0.4
pycparser 3.0
pydantic 2.13.4
pydantic_core 2.46.4
Pygments 2.20.0
PyPcre 0.3.2
python-dateutil 2.9.0.post0
pytz 2026.2
PyYAML 6.0.3
regex 2026.4.4
requests 2.33.1
rich 15.0.0
safetensors 0.7.0
scikit-learn 1.7.2
scipy 1.15.3
sentencepiece 0.2.1
setuptools 59.6.0
shellingham 1.5.4
six 1.17.0
soundfile 0.13.1
soxr 1.1.0
sympy 1.14.0
tensorrt-edgellm 0.7.0
threadpoolctl 3.6.0
tiktoken 0.12.0
TokeNicer 0.0.13
tokenizers 0.22.2
tomli 2.4.1
tomlkit 0.14.0
torch 2.10.0
torchao 0.17.0
torchprofile 0.1.0
torchvision 0.25.0
tqdm 4.67.3
transformers 5.3.0
triton 3.6.0
typer 0.25.1
typing_extensions 4.15.0
typing-inspection 0.4.2
tzdata 2026.2
urllib3 2.6.3
xxhash 3.7.0
yarl 1.23.0
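
Note that the warning in the log reports transformers 5.8.0 while the list above shows transformers 5.3.0, so it may be worth confirming which versions the interpreter that runs the export actually sees. A small sketch using only the standard library; `installed_versions` is a hypothetical helper name and the package names passed to it are just the ones that look relevant to this issue:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Run with the same interpreter that launches tensorrt-edgellm-export-llm.
    relevant = ["transformers", "nvidia-modelopt", "torch", "gptqmodel", "tensorrt-edgellm"]
    for name, ver in installed_versions(relevant).items():
        print(f"{name}: {ver or 'not installed'}")
```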

Labels: bug (Something isn't working)