Quantized SDPA #1515

Closed
barronalex wants to merge 5 commits into main from q-sdpa

@barronalex

Contributor

First pass at adapting @angeloskath's flash attention to support quantized keys and values.

This still needs optimization work: writing the attention out as separate quantized_matmul calls is currently faster than using this fused version.

E.g., 4-bit on an M2 Ultra with L=32768:

Timing sdpa ... 2.51938 msec
Timing quant_sdpa ... 0.97137 msec
Timing attention ... 1.31419 msec
Timing quant_attention ... 0.92342 msec
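
For readers unfamiliar with the two code paths being compared, here is a minimal pure-Python sketch. It is not the MLX kernel or the MLX API: every function name below is hypothetical, and the group-wise affine quantization (one scale and one bias per group) is an assumed scheme chosen for illustration. The unfused function materializes all scores before the softmax, roughly the shape of the written-out path. The fused function streams over keys and values with an online softmax, which is the general flash-attention idea. The sketch only shows that the two agree numerically; it says nothing about speed.

```python
import math

def quantize_group(values, bits=4):
    """Affine-quantize one group so that value ~= code * scale + bias (assumed scheme)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def quantize_rows(matrix, group_size=4, bits=4):
    """Quantize each row in consecutive groups of `group_size` elements."""
    return [
        [quantize_group(row[i:i + group_size], bits)
         for i in range(0, len(row), group_size)]
        for row in matrix
    ]

def dequantize_row(groups):
    """Rebuild an approximate float row from its (codes, scale, bias) groups."""
    return [c * scale + bias for codes, scale, bias in groups for c in codes]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unfused_quant_attention(q, q_keys, q_values):
    """Two separate passes with the full score vector materialized in between."""
    sm_scale = 1.0 / math.sqrt(len(q))
    scores = [sm_scale * dot(q, dequantize_row(k)) for k in q_keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    values = [dequantize_row(v) for v in q_values]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def fused_quant_attention(q, q_keys, q_values):
    """Single streaming pass: dequantize one key/value pair at a time
    and keep a running max, denominator, and output (online softmax)."""
    sm_scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(dequantize_row(q_values[0]))
    for k_groups, v_groups in zip(q_keys, q_values):
        score = sm_scale * dot(q, dequantize_row(k_groups))
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # 0.0 on the first step
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        value = dequantize_row(v_groups)
        acc = [a * correction + weight * x for a, x in zip(acc, value)]
        running_max = new_max
    return [a / denom for a in acc]

# Tiny made-up inputs: one query, three keys/values, head dimension 4.
q = [0.5, -1.0, 0.25, 2.0]
K = [[0.1, 0.2, -0.3, 0.4], [1.0, -1.0, 0.5, 0.0], [-0.7, 0.3, 0.9, -0.2]]
V = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 0.5, 2.5], [2.0, 2.0, -2.0, 1.0]]
qK, qV = quantize_rows(K), quantize_rows(V)
out_unfused = unfused_quant_attention(q, qK, qV)
out_fused = fused_quant_attention(q, qK, qV)
```

Both functions read the same quantized keys and values, so any gap in the timings above comes from how the work is scheduled, not from what is computed.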

awni mentioned this pull request Apr 28, 2025

@bghira

bghira commented Sep 18, 2025

JFYI, I have working int8 and int4 quantised attention, MIT licensed.

CC-Yeh mentioned this pull request Jan 20, 2026 (7 tasks)

@zcbenz

zcbenz commented Jun 24, 2026

Collaborator

I'm closing this in favor of #3026.

zcbenz closed this Jun 24, 2026
zcbenz deleted the q-sdpa branch Jun 24, 2026 00:19

3 participants