
Add ViT Attention Plugin Support for Qwen, Mllama, and SigLIP Visual Models #4241

Open

micwill755 wants to merge 4 commits into pytorch:main from micwill755:4108-vit

Conversation


micwill755 commented May 8, 2026

Summary

This PR adds ViT attention plugin integration and validation support to the TensorRT Dynamo examples/tooling path. It wires ViTAttentionPlugin conversion through the Torch-TensorRT/Dynamo flow, supports Qwen-style packed/windowed attention metadata via cu_seqlens and max_seq_len, and adds end-to-end visual model validation for Qwen2.5-VL, Llama 3.2 Vision/Mllama, and GR00T/Eagle/SigLIP-style models.
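
For illustration, the packed metadata mentioned above could be built like this (a minimal sketch in PyTorch; the sequence lengths are made up, and only the tensor names mirror the PR's description):

    import torch

    # Three packed patch sequences of lengths 64, 100, and 49 (made-up values).
    seq_lens = torch.tensor([64, 100, 49], dtype=torch.int32)

    # cu_seqlens holds the cumulative sequence boundaries, so the plugin can
    # attend within each sequence without materializing a dense mask.
    cu_seqlens = torch.cat(
        [torch.zeros(1, dtype=torch.int32), torch.cumsum(seq_lens, dim=0, dtype=torch.int32)]
    )
    # -> tensor([  0,  64, 164, 213], dtype=torch.int32)

    # max_seq_len bounds the plugin's per-sequence work (e.g. workspace sizing).
    max_seq_len = int(seq_lens.max())  # 100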

Changes

  • Added ViT attention plugin converter support.
  • Added mask_type and max_seq_len plugin field propagation (see the sketch after this list).
  • Added support for dense additive masks and packed cu_seqlens inputs.
  • Added Qwen2.5-VL visual attention handling with window attention metadata.
  • Added support for Qwen2.5-VL head_dim=80.
  • Updated plugin loading to locate the TensorRT-Edge-LLM plugin .so.
  • Updated standalone ViT attention plugin example for correctness validation.
  • Added/updated end-to-end ViT visual model example covering:
    • Qwen2.5-VL
    • Llama 3.2 Vision/Mllama
    • GR00T/Eagle/SigLIP-style visual model
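
As a rough sketch of the field propagation noted above, instantiating the plugin with those fields via the TensorRT Python API could look like the following (plugin name, version, and field encoding are assumptions, not the actual TensorRT-Edge-LLM schema):

    import numpy as np
    import tensorrt as trt

    # Look up the plugin creator in the registry; name/version are assumed.
    registry = trt.get_plugin_registry()
    creator = registry.get_plugin_creator("ViTAttentionPlugin", "1")

    fields = trt.PluginFieldCollection([
        # mask_type: e.g. 0 = dense additive mask, 1 = packed cu_seqlens
        # (this encoding is an assumption for illustration only)
        trt.PluginField("mask_type", np.array([1], dtype=np.int32), trt.PluginFieldType.INT32),
        trt.PluginField("max_seq_len", np.array([100], dtype=np.int32), trt.PluginFieldType.INT32),
    ])
    plugin = creator.create_plugin("vit_attention", fields)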

Testing

  • Ran ViT attention plugin correctness example.
  • Ran end-to-end ViT attention plugin visual benchmark.
  • Confirmed output shapes match PyTorch references across tested visual models.
  • Validated Qwen2.5-VL visual path through Torch-TensorRT/Dynamo with the custom ViT attention plugin.
  • Validated Llama 3.2 Vision/Mllama visual path through Torch-TensorRT/Dynamo.
  • Validated GR00T/Eagle/SigLIP-style visual path through Torch-TensorRT/Dynamo.

meta-cla bot added the cla signed label on May 8, 2026
narendasan (Collaborator) left a comment


Are we able to use: https://huggingface.co/docs/transformers/v5.5.0/en/serialization#exporting-to-production to avoid as much patching on the model side?

micwill755 changed the title from "model agnostic vit - tested Qwen, Groot, llama Vision on RTX 5090" to "Add ViT Attention Plugin Support for Qwen, Mllama, and SigLIP Visual Models" on May 11, 2026
narendasan (Collaborator) left a comment


How does the new plugin operator get inserted into the graph?

Comment thread on tools/llm/run_vlm.py:
position_ids = torch.arange(input_embeds.shape[1]).unsqueeze(0).to(device)

use_fp32_acc = False
use_explicit_typing = False
Collaborator


Enabled precision is deprecated in TRT 10.16 and will be removed in the next version, so we don't need this code path.
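
For reference, a hedged sketch of the strongly typed path that replaces those flags (module name and input shape are hypothetical; use_explicit_typing is assumed to be available in the installed Torch-TensorRT release):

    import torch
    import torch_tensorrt

    # With explicit (strong) typing, layer precisions follow the module's own
    # dtypes, replacing the deprecated enabled_precisions allow-list.
    trt_module = torch_tensorrt.compile(
        visual_model,  # hypothetical: the exported visual module from run_vlm.py
        inputs=[torch_tensorrt.Input((1, 3, 448, 448), dtype=torch.float16)],  # assumed shape
        use_explicit_typing=True,
    )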

Author


Will do. I'll clean this up by removing it.

micwill755 (Author)

How does the new plugin operator get inserted into the graph?

It follows the same pattern as the existing AttentionPlugin integration.

At a high level, we insert a Torch custom op into the Dynamo graph by wrapping/replacing the model attention module. That custom op is only a graph marker on the PyTorch side. During Torch-TensorRT conversion, the registered converter lowers that marker to the real TensorRT plugin layer by looking up the plugin creator and calling add_plugin_v2. (A sketch of both halves follows the outline below.)

  1. PyTorch model attention module
     → replace_attention_with_plugin(...) wraps/replaces it
  2. Wrapper calls a registered torch custom op
     → Dynamo captures that op in the FX graph
  3. Torch-TensorRT converter sees the custom op
     → converter creates the TensorRT AttentionPlugin
     → the TRT network gets a plugin layer via add_plugin_v2(...)
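
A minimal sketch of both halves of that flow, with the caveat that the op name, signature, and converter import path here are illustrative assumptions rather than the PR's actual code:

    import torch
    import tensorrt as trt
    # Assumed import path for the converter decorator; may differ across releases.
    from torch_tensorrt.dynamo.conversion import dynamo_tensorrt_converter

    # (1)+(2) Graph-side marker: a torch custom op that Dynamo records in the
    # FX graph when the wrapped attention module calls it.
    @torch.library.custom_op("vit::attention_plugin", mutates_args=())
    def vit_attention_plugin(qkv: torch.Tensor, cu_seqlens: torch.Tensor) -> torch.Tensor:
        # Placeholder eager implementation; the op exists only to be lowered.
        raise NotImplementedError("lowered to the TensorRT plugin at conversion time")

    @vit_attention_plugin.register_fake
    def _(qkv: torch.Tensor, cu_seqlens: torch.Tensor) -> torch.Tensor:
        # Fake kernel: gives the tracer an output shape without running attention.
        return torch.empty_like(qkv)

    # (3) Conversion-side lowering: the registered converter swaps the marker
    # for the real plugin layer.
    @dynamo_tensorrt_converter(torch.ops.vit.attention_plugin.default)
    def convert_vit_attention(ctx, target, args, kwargs, name):
        creator = trt.get_plugin_registry().get_plugin_creator(
            "ViTAttentionPlugin", "1"  # assumed plugin name/version
        )
        plugin = creator.create_plugin(name, trt.PluginFieldCollection([]))
        layer = ctx.net.add_plugin_v2(list(args), plugin)  # args are TRT ITensors here
        return layer.get_output(0)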
