[AIMIGRAPHX-1017] Skip Q/DQ for Attention Ops by eddieliao · Pull Request #4900 · ROCm/AMDMIGraphX

eddieliao · 2026-05-20T17:16:05Z

Motivation

Skip inserting quantize/dequantize (Q/DQ) pairs into attention patterns so that the whole attention region can be fused later.

Technical Details

Before:

2026-05-20 17:05:48.707049 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:1134] Running [ MIGraphX Version: 2.16.0.20250912-17-427-g854b494ae ]: ./build-develop/bin/driver perf attention_fp8_test.onnx --fp8
2026-05-20 17:05:48.707179 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:448] Reading: attention_fp8_test.onnx
2026-05-20 17:05:48.707908 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:754] Quantizing to fp8 ...
2026-05-20 17:05:53.643159 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:762] Compiling ...
module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}
@1 = hip::hip_allocate_memory[shape=int8_type, {295698432}, {1},id=main:scratch] -> int8_type, {295698432}, {1}
@2 = load[offset=18874368,end=85983232](@1) -> float_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}
k = @param:k -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
q = @param:q -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
@5 = gpu::code_object[code_object=16440,symbol_name=mlir_quantizelinear_transpose_quantizelinear_quant_dot,global=131072,local=256,output_arg=2,](q,k,@2) -> float_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}
@6 = load[offset=16777216,end=18874368](@1) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
v = @param:v -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
@8 = gpu::code_object[code_object=6504,symbol_name=quantizelinear_kernel,global=2097152,local=1024,](v,@6) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
@9 = load[offset=0,end=16777216](@1) -> fp8e4m3fn_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}
@10 = gpu::code_object[code_object=10472,symbol_name=dequantizelinear_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel,global=4194304,local=256,](@5,@9) -> fp8e4m3fn_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}
@11 = load[offset=287309824,end=295698432](@1) -> float_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
@12 = load[offset=18874368,end=287309824](@1) -> uint8_type, {268435456}, {1}
@13 = gpu::hip_quant_gemm[alpha=1,beta=0,trans_batch=0,solution_idx=0](@10,@8,@12,@11) -> float_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
main:#output_0 = @param:main:#output_0 -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
w_o = @param:w_o -> half_type, {1, 16, 128, 128}, {262144, 16384, 128, 1}
@16 = gpu::code_object[code_object=11112,symbol_name=mlir_dequantizelinear_quantizelinear_quantizelinear_quant_dot_dequantizelinear,global=65536,local=256,output_arg=2,](@13,w_o,main:#output_0) -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
@17 = @return(@16)


2026-05-20 17:05:57.213673 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:933] Allocating params ...
2026-05-20 17:05:57.238790 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:935] Running performance report ...
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}: 0.00036984ms, 1%
@1 = hip::hip_allocate_memory[shape=int8_type, {295698432}, {1},id=main:scratch] -> int8_type, {295698432}, {1}: 0.00039316ms, 1%
@2 = load[offset=18874368,end=85983232](@1) -> float_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}: 0.00046054ms, 1%
k = @param:k -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00031982ms, 1%
q = @param:q -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00032ms, 1%
@5 = gpu::code_object[code_object=16440,symbol_name=mlir_quantizelinear_transpose_quantizelinear_quant_dot,global=131072,local=256,output_arg=2,](q,k,@2) -> float_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}: 0.0294877ms, 19%
@6 = load[offset=16777216,end=18874368](@1) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00047354ms, 1%
v = @param:v -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00032ms, 1%
@8 = gpu::code_object[code_object=6504,symbol_name=quantizelinear_kernel,global=2097152,local=1024,](v,@6) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.0188608ms, 12%
@9 = load[offset=0,end=16777216](@1) -> fp8e4m3fn_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}: 0.0004709ms, 1%
@10 = gpu::code_object[code_object=10472,symbol_name=dequantizelinear_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel,global=4194304,local=256,](@5,@9) -> fp8e4m3fn_type, {1, 16, 1024, 1024}, {16777216, 1048576, 1024, 1}: 0.056494ms, 36%
@11 = load[offset=287309824,end=295698432](@1) -> float_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00047252ms, 1%
@12 = load[offset=18874368,end=287309824](@1) -> uint8_type, {268435456}, {1}: 0.00045208ms, 1%
@13 = gpu::hip_quant_gemm[alpha=1,beta=0,trans_batch=0,solution_idx=0](@10,@8,@12,@11) -> float_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.0317516ms, 20%
main:#output_0 = @param:main:#output_0 -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00035904ms, 1%
w_o = @param:w_o -> half_type, {1, 16, 128, 128}, {262144, 16384, 128, 1}: 0.00032ms, 1%
@16 = gpu::code_object[code_object=11112,symbol_name=mlir_dequantizelinear_quantizelinear_quantizelinear_quant_dot_dequantizelinear,global=65536,local=256,output_arg=2,](@13,w_o,main:#output_0) -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.0181806ms, 12%
@17 = @return(@16)
Summary:
gpu::code_object::dequantizelinear_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.056494ms / 1 = 0.056494ms, 36%
gpu::hip_quant_gemm: 0.0317516ms / 1 = 0.0317516ms, 20%
gpu::code_object::mlir_quantizelinear_transpose_quantizelinear_quant_dot: 0.0294877ms / 1 = 0.0294877ms, 19%
gpu::code_object::quantizelinear_kernel: 0.0188608ms / 1 = 0.0188608ms, 12%
gpu::code_object::mlir_dequantizelinear_quantizelinear_quantizelinear_quant_dot_dequantizelinear: 0.0181806ms / 1 = 0.0181806ms, 12%
load: 0.00232958ms / 5 = 0.000465916ms, 2%
@param: 0.00163886ms / 5 = 0.000327772ms, 2%
hip::hip_allocate_memory: 0.00039316ms / 1 = 0.00039316ms, 1%
check_context::migraphx::gpu::context: 0.00036984ms / 1 = 0.00036984ms, 1%

Batch size: 1
Rate: 9745.39 inferences/sec
Total time: 0.102613ms (Min: 0.09991ms, Max: 0.19147ms, Mean: 0.104161ms, Median: 0.1026ms)
Percentiles (90%, 95%, 99%): (0.10583ms, 0.10612ms, 0.15152ms)
Total instructions time: 0.159506ms
Overhead time: 0.00174ms, -0.0568935ms
Overhead: 2%, -55%
2026-05-20 17:05:57.336887 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:1143] MIGRAPHX_MLIR_USE_SPECIFIC_OPS=attention\
2026-05-20 17:05:57.336940 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:1155] [ MIGraphX Version: 2.16.0.20250912-17-427-g854b494ae ] Complete(8.62984s): ./build-develop/bin/driver perf attention_fp8_test.onnx --fp8

After:

2026-05-20 17:06:02.500580 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:1134] Running [ MIGraphX Version: 2.16.0.20250912-17-452-g2ae91710e ]: ./build/bin/driver perf attention_fp8_test.onnx --fp8
2026-05-20 17:06:02.500727 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:448] Reading: attention_fp8_test.onnx
2026-05-20 17:06:02.501537 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:754] Quantizing to fp8 ...
2026-05-20 17:06:04.842209 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:762] Compiling ...
module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}
@1 = hip::hip_allocate_memory[shape=int8_type, {2097152}, {1},id=main:scratch] -> int8_type, {2097152}, {1}
@2 = load[offset=0,end=2097152](@1) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
v = @param:v -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
k = @param:k -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
q = @param:q -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
@6 = gpu::code_object[code_object=11328,symbol_name=mlir_transpose_dot_convert_reshape_reduce_max_reshape_sub_exp_reshape_reduce_sum_reshape_div_convert_dot_quantizelinear,global=65536,local=256,output_arg=3,](q,k,v,@2) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
main:#output_0 = @param:main:#output_0 -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
w_o = @param:w_o -> half_type, {1, 16, 128, 128}, {262144, 16384, 128, 1}
@9 = gpu::code_object[code_object=7208,symbol_name=mlir_quantizelinear_quant_dot_dequantizelinear,global=131072,local=256,output_arg=2,](w_o,@6,main:#output_0) -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}
@10 = @return(@9)


2026-05-20 17:06:07.429231 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:933] Allocating params ...
2026-05-20 17:06:07.454508 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:935] Running performance report ...
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}: 0.0003777ms, 1%
@1 = hip::hip_allocate_memory[shape=int8_type, {2097152}, {1},id=main:scratch] -> int8_type, {2097152}, {1}: 0.00040294ms, 1%
@2 = load[offset=0,end=2097152](@1) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00045ms, 1%
v = @param:v -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.0003138ms, 1%
k = @param:k -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00029554ms, 1%
q = @param:q -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.00031052ms, 1%
@6 = gpu::code_object[code_object=11328,symbol_name=mlir_transpose_dot_convert_reshape_reduce_max_reshape_sub_exp_reshape_reduce_sum_reshape_div_convert_dot_quantizelinear,global=65536,local=256,output_arg=3,](q,k,v,@2) -> fp8e4m3fn_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.0427579ms, 70%
main:#output_0 = @param:main:#output_0 -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.0003567ms, 1%
w_o = @param:w_o -> half_type, {1, 16, 128, 128}, {262144, 16384, 128, 1}: 0.00030078ms, 1%
@9 = gpu::code_object[code_object=7208,symbol_name=mlir_quantizelinear_quant_dot_dequantizelinear,global=131072,local=256,output_arg=2,](w_o,@6,main:#output_0) -> half_type, {1, 16, 1024, 128}, {2097152, 131072, 128, 1}: 0.015936ms, 26%
@10 = @return(@9)
Summary:
gpu::code_object::mlir_transpose_dot_convert_reshape_reduce_max_reshape_sub_exp_reshape_reduce_sum_reshape_div_convert_dot_quantizelinear: 0.0427579ms / 1 = 0.0427579ms, 70%
gpu::code_object::mlir_quantizelinear_quant_dot_dequantizelinear: 0.015936ms / 1 = 0.015936ms, 26%
@param: 0.00157734ms / 5 = 0.000315468ms, 3%
load: 0.00045ms / 1 = 0.00045ms, 1%
hip::hip_allocate_memory: 0.00040294ms / 1 = 0.00040294ms, 1%
check_context::migraphx::gpu::context: 0.0003777ms / 1 = 0.0003777ms, 1%

Batch size: 1
Rate: 20371.4 inferences/sec
Total time: 0.0490884ms (Min: 0.04807ms, Max: 0.052309ms, Mean: 0.0491553ms, Median: 0.04904ms)
Percentiles (90%, 95%, 99%): (0.04982ms, 0.05042ms, 0.0521ms)
Total instructions time: 0.0615019ms
Overhead time: 0.000947ms, -0.0124135ms
Overhead: 2%, -25%
2026-05-20 17:06:07.471109 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:1143] MIGRAPHX_MLIR_USE_SPECIFIC_OPS=attention\
2026-05-20 17:06:07.471138 [INFO] [/code/AMDMIGraphX/src/driver/main.cpp:1155] [ MIGraphX Version: 2.16.0.20250912-17-452-g2ae91710e ] Complete(4.97051s): ./build/bin/driver perf attention_fp8_test.onnx --fp8
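For a quick comparison of the two reports above, the arithmetic can be sketched as follows. The constants are copied from the logs (the "Total time" lines and the size of the `main:scratch` allocation). The helper names `speedup`, `to_mib`, and `print_comparison` are made up for this illustration and are not part of MIGraphX.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Figures copied from the "Before" and "After" perf reports.
constexpr double before_total_ms            = 0.102613;
constexpr double after_total_ms             = 0.0490884;
constexpr std::int64_t before_scratch_bytes = 295698432;
constexpr std::int64_t after_scratch_bytes  = 2097152;

// Hypothetical helper: how many times faster the second measurement is.
constexpr double speedup(double before_ms, double after_ms) { return before_ms / after_ms; }

// Hypothetical helper: convert a byte count to MiB.
constexpr double to_mib(std::int64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Prints both comparisons; intended to be called from main().
inline void print_comparison()
{
    assert(after_total_ms > 0.0);
    std::printf("total time: %.4f ms -> %.4f ms (%.2fx faster)\n",
                before_total_ms,
                after_total_ms,
                speedup(before_total_ms, after_total_ms));
    std::printf("scratch:    %.0f MiB -> %.0f MiB\n",
                to_mib(before_scratch_bytes),
                to_mib(after_scratch_bytes));
}
```

With these inputs the speedup comes out at roughly 2.09x, consistent with the rate going from 9745.39 to 20371.4 inferences/sec, and the scratch buffer shrinks from 282 MiB to 2 MiB.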

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

- Added: New functionality.
- Changed: Changes to existing functionality.
- Removed: Functionality or support that has been removed. (Compared to a previous release)
- Optimized: Component performance that has been optimized or improved.
- Resolved Issues: Known issues from a previous version that have been resolved.
- Not Applicable: This PR is not to be included in the changelog.

eddieliao · 2026-05-20T17:16:52Z

PR currently in draft as I am looking for some feedback on the premise of this change before I go and work on/clean up the actual implementation.

Copilot

Pull request overview

This PR updates the FP8 quantization pipeline to avoid inserting Q/DQ pairs into dot -> softmax -> dot attention subgraphs so those regions can be fused later (improving performance and reducing scratch usage), and factors the attention-pattern matcher into a reusable header.

Changes:

- Extracts a reusable match::dot_softmax_dot matcher and uses it in GPU prefusion matching.
- Adds a skip_instructions mechanism to capture_arguments_pass and wires it through FP8 quantization.
- Detects attention regions in quantize_fp8() and skips capture/QDQ insertion for those instructions.
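The skip mechanism in the list above can be pictured with a small, self-contained sketch. It does not use MIGraphX types: `toy_instruction`, `insert_captures`, and the index-based skip set are hypothetical stand-ins for `instruction_ref`, the capture pass, and `skip_instructions`. It assumes the pass simply ignores any instruction found in the set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Toy stand-in for a graph instruction; not a MIGraphX type.
struct toy_instruction
{
    std::string name;
};

// Returns the indices of instructions that would receive capture ops
// (and therefore Q/DQ pairs later). Anything listed in `skip` is left alone.
inline std::vector<std::size_t>
insert_captures(const std::vector<toy_instruction>& prog,
                const std::unordered_set<std::string>& ins_names,
                const std::unordered_set<std::size_t>& skip)
{
    std::vector<std::size_t> captured;
    for(std::size_t i = 0; i < prog.size(); ++i)
    {
        if(ins_names.count(prog[i].name) == 0)
            continue; // not an op this pass quantizes
        if(skip.count(i) != 0)
            continue; // part of an attention region, keep it unquantized
        captured.push_back(i);
    }
    return captured;
}

// Toy program shaped like the test model: attention, then an output projection.
inline std::vector<toy_instruction> example_program()
{
    return {toy_instruction{"dot"},
            toy_instruction{"softmax"},
            toy_instruction{"dot"},
            toy_instruction{"dot"}};
}

// Usage check: with the attention region skipped, only the projection dot remains.
inline void check_example()
{
    const auto captured =
        insert_captures(example_program(), {"dot", "convolution"}, {0, 1, 2});
    assert(captured.size() == 1 && captured.front() == 3);
}
```

Keeping the skip set as plain data passed into the pass means the attention detection can live elsewhere, which is the split the file summary describes between `src/quantization.cpp` and `src/quantize_8bits.cpp`.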

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| src/targets/gpu/prefuse_ops.cpp | Switches attention prefusion matching to the new shared `dot_softmax_dot` matcher. |
| src/quantize_8bits.cpp | Skips inserting `capture` ops (and therefore Q/DQ) for a provided set of instructions. |
| src/quantization.cpp | Detects attention regions and passes them into the capture pass to skip Q/DQ insertion. |
| src/include/migraphx/quantize_8bits.hpp | Extends `capture_arguments_pass` API to carry a skip set. |
| src/include/migraphx/match/dot_softmax_dot.hpp | Introduces a reusable matcher for undecomposed attention (`dot -> softmax -> dot`). |

 struct MIGRAPHX_EXPORT capture_arguments_pass
 {
    std::unordered_set<std::string> ins_names = {"dot", "convolution"};
    std::function<void(std::size_t, std::vector<argument>)> f{};
    std::size_t* param_index = nullptr;
+    std::unordered_set<instruction_ref> skip_instructions{};
    std::string name() const { return "capture_arguments"; }


+/// Match the (undecomposed) `dot -> softmax -> dot` attention pattern, with
+/// optional `mul` (scale), `add` (bias), or `where` (mask) ops between the
+/// first dot and the softmax. This is the form before `rewrite_reduce`
+/// decomposes softmax into its `div(exp(sub(x, max)), sum(exp(...)))` chain.
+///
+/// `gemm_pred` is applied to both dot operations; pass `match::any()` to
+/// match any dot. `bias_pred` is applied to the optional `add` (bias) op.


codecov · 2026-05-20T17:30:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #4900   +/-   ##
========================================
  Coverage    92.88%   92.88%           
========================================
  Files          587      588    +1     
  Lines        30348    30365   +17     
========================================
+ Hits         28187    28204   +17     
  Misses        2161     2161

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/include/migraphx/match/dot_softmax_dot.hpp | `100.00% <100.00%> (ø)` |
| src/include/migraphx/quantize_8bits.hpp | `100.00% <ø> (ø)` |
| src/quantization.cpp | `86.42% <100.00%> (+1.09%)` ⬆️ |
| src/quantize_8bits.cpp | `100.00% <100.00%> (ø)` |


TedThemistokleous · 2026-05-27T21:08:23Z

+    std::unordered_set<instruction_ref> skip_instructions;
+    for(auto ins : iterator_for(*mm))
+    {
+        auto r = match::match_instruction(*mm, ins, match::dot_softmax_dot());


I'm assuming we always expect this to be in the dot_softmax_dot form at this point, so we can match on it and use it?

Yes, that is the assumption. Let me know if you think this may not always be true.
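
To make the assumption concrete, here is a toy version of the scan that builds the skip set. It is only a sketch: `toy_node`, `walk_to_first_dot`, `match_attention`, and `collect_skip_set` are hypothetical names working on an index-based graph, not the MIGraphX matcher API. It assumes the softmax feeds argument 0 of the second dot and that any number of `mul`, `add`, or `where` ops may sit between the first dot and the softmax.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Toy graph node: an op name plus the indices of its inputs. Not a MIGraphX type.
struct toy_node
{
    std::string name;
    std::vector<std::size_t> inputs;
};

// Walk back from `idx` through optional mul/add/where ops until a dot is found.
// Nodes are appended to `chain` only along the successful path.
inline bool walk_to_first_dot(const std::vector<toy_node>& g,
                              std::size_t idx,
                              std::vector<std::size_t>& chain)
{
    const auto& n = g[idx];
    if(n.name == "dot")
    {
        chain.push_back(idx);
        return true;
    }
    if(n.name != "mul" && n.name != "add" && n.name != "where")
        return false;
    for(auto in : n.inputs)
    {
        if(walk_to_first_dot(g, in, chain))
        {
            chain.push_back(idx);
            return true;
        }
    }
    return false;
}

// If g[last] is the second dot of a dot -> softmax -> dot chain, return every
// node index in the chain; otherwise return an empty vector.
inline std::vector<std::size_t> match_attention(const std::vector<toy_node>& g, std::size_t last)
{
    assert(last < g.size());
    const auto& second = g[last];
    if(second.name != "dot" || second.inputs.empty())
        return {};
    const std::size_t sm = second.inputs[0]; // softmax must feed argument 0
    if(g[sm].name != "softmax" || g[sm].inputs.empty())
        return {};
    std::vector<std::size_t> chain = {last, sm};
    if(!walk_to_first_dot(g, g[sm].inputs[0], chain))
        return {};
    return chain;
}

// Try the match at every node and collect all members of matched chains.
inline std::unordered_set<std::size_t> collect_skip_set(const std::vector<toy_node>& g)
{
    std::unordered_set<std::size_t> skip;
    for(std::size_t i = 0; i < g.size(); ++i)
    {
        for(auto idx : match_attention(g, i))
            skip.insert(idx);
    }
    return skip;
}
```

In this toy form, a trailing projection dot whose argument 0 is another dot rather than a softmax does not match, so it stays eligible for quantization. Any attention variant that does not reduce to this shape would also go unmatched and be quantized as before.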

TedThemistokleous

Some more questions about the matcher predicate for the no-input-args case.

The rest makes sense to me though - you're moving things to a separate file to reuse for the quant step. I'm just not sure match::any() is needed, as it's too broad compared with something a bit more specific.

If we're assuming we already have gemm->softmax->gemm set up here, then we can write something specific. Let me know if this is incorrect, though, and you're using any() as a way to just grab everything.

TedThemistokleous · 2026-05-27T21:09:15Z

+    return match::name("dot")(gemm_pred.bind("gemm2"))(match::arg(0)(softmax));
+}
+
+inline auto dot_softmax_dot() { return dot_softmax_dot(match::any(), match::any()); }


Does this have to be match::any()? This seems too broad here.

I'm a little confused about writing something more specific. What constraints can you put on gemm and bias pred here?

Prototype skip 8 bit quantization for attention

2ae9171

eddieliao requested a review from pfultz2 May 20, 2026 17:16

eddieliao self-assigned this May 20, 2026

eddieliao added the Matchers, FP8, and Perf Improve labels May 20, 2026

eddieliao requested a review from Copilot May 20, 2026 17:17

Copilot started reviewing on behalf of eddieliao May 20, 2026 17:17

Copilot AI reviewed May 20, 2026


eddieliao added 3 commits May 20, 2026 18:49

Move attention matcher to after optimize pass

0c1f4a6

Licensing

b87c404

Add test

4d25c32

eddieliao marked this pull request as ready for review May 27, 2026 19:38

eddieliao requested a review from causten as a code owner May 27, 2026 19:38

eddieliao requested review from TedThemistokleous and turneram May 27, 2026 19:38

TedThemistokleous reviewed May 27, 2026


kahmed10 requested a review from shivadbhavsar May 28, 2026 21:30

eddieliao wants to merge 4 commits into develop from skip_attention_qdq
