Releases: MooreThreads/mate
Release v0.2.1
Highlights
This release expands FMHA compatibility, improves FP8 output support, adds MODEL1(DS V4) support for DSA, optimizes GDN performance, and introduces new mHC and DeepGEMM MQA Logits capabilities.
Starting from this release, wrapper versions are strictly aligned with the MATE version to prevent incompatible package combinations.
What's Changed
FMHA Updates
Added compatibility support for additional FMHA features:
- RoPE
- `k_leftpad`
- `kv_batch_idx`
FP8 SageAttention & FP8 DenseGEMM
Added support for FP8 output and quantization scale outputs.
- Added FP8 output support.
- Added quant scale output support.
DeepSeek Sparse Attention
Added MODEL1(DS V4) support for DeepSeek Sparse Attention.
- Added DSA Prefill support for MODEL1(DS V4).
- Added DSA Decode support for MODEL1(DS V4).
GDN Updates
Improved GDN performance and expanded Decode capability.
- Optimized Prefill performance.
- Optimized Decode performance.
- Added MTP support for Decode.
mHC
Added support for TF32 mHC pre-norm.
DeepGEMM MQA Logits
Improved paged MQA Logits support for larger batch sizes.
- Paged MQA Logits now supports larger batch sizes.
- The maximum supported batch size is limited only by shared memory capacity.
Wrapper Updates
Starting from v0.2.1, wrapper versions are strictly aligned with the MATE version to avoid incompatible package combinations.
- Use `mate check` to verify wrapper consistency, including version and commit information.
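The alignment rule can be illustrated with a plain version comparison. This is a minimal sketch, not a description of what `mate check` does internally. The distribution names `mate`, `mate-deep-gemm`, and `mate-flash-attention` are assumptions borrowed from the wrapper names in these notes, and both helper functions are hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_version_mismatches(core_version, wrapper_versions):
    """Return the wrappers whose version differs from the core version."""
    return {
        name: version
        for name, version in wrapper_versions.items()
        if version != core_version
    }


# Hypothetical distribution names, used only for illustration.
WRAPPERS = ["mate-deep-gemm", "mate-flash-attention"]

if __name__ == "__main__":
    core = installed_version("mate")
    found = {name: installed_version(name) for name in WRAPPERS}
    # Wrappers that are not installed are skipped rather than reported.
    found = {name: ver for name, ver in found.items() if ver is not None}
    print(find_version_mismatches(core, found))
```

An empty result means every installed wrapper reports the same version as the core package. Commit information, which `mate check` also covers, is outside the scope of this sketch.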
DeepGEMM Wrapper
Added new interfaces:
- `mHC`
- `bf16_gemm_nt`
FlashMLA Wrapper
Added support for MODEL1(DS V4) related input arguments.
Bug Fixes
Fixed the following issues:
- Fixed NaN outputs in Fused MoE Gate under certain scenarios.
- Fixed IMA issues in MQA Logits.
- Fixed incorrect FA3 backend selection for Softcap scenarios.
Release v0.2.0
Highlights
This release introduces major architecture and runtime improvements, including Torch decoupling via the TVM-FFI ABI and full JIT compilation support across all MATE operators. It also adds new attention implementations, expands FMHA capabilities, integrates FP8 SageAttention, and updates multiple compatibility wrappers.
What's Changed
Torch Decoupling
MATE is now decoupled from a single Torch version through the TVM-FFI ABI.
- Starting from `v0.2.0`, MATE can support multiple Torch versions at the same time.
- This improves package compatibility and reduces dependency constraints for downstream users.
Full JIT Compilation Support
All MATE operators now support JIT compilation.
- Enables more flexible runtime compilation.
- Improves compatibility across different deployment environments.
FMHA Updates
FMHA support has been enhanced with new functionality and bug fixes.
Added support for:
- AppendKV functionality.
Fixed issues in:
- JIT compilation failure with `HeadDim 192-192`.
- Incorrect SWA kernel selection in some scenarios.
- Scheduler metadata kernel JIT errors when `batch-size > 992`.
FP8 SageAttention Integration
Integrated the assembly-based SageAttention implementation.
Supported quantization modes:
- QK INT8 + PV FP8
- QK FP8 + PV FP8
Supported capabilities:
- Multiple quantization granularities.
- Configurable quantization precision and granularity through the wrapper interface.
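To make "quantization granularity" concrete, the sketch below applies symmetric INT8 quantization either with one scale for the whole input (per-tensor) or with one scale per block. It is a generic illustration of the idea, not the assembly kernels or the wrapper API in MATE. The function names and the `block_size` parameter are hypothetical.

```python
def quantize_int8(values, block_size=None):
    """Symmetric INT8 quantization.

    block_size=None uses a single scale for all values (per-tensor);
    otherwise each consecutive block of block_size values gets its own scale.
    Returns (quantized_ints, scales).
    """
    if not values:
        return [], []
    if block_size is None:
        block_size = len(values)
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the largest magnitude in the block to 127; avoid dividing by zero.
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        for v in block:
            q = round(v / scale)
            quantized.append(max(-127, min(127, q)))
    return quantized, scales


def dequantize_int8(quantized, scales, block_size=None):
    """Invert quantize_int8 using the same block_size."""
    if block_size is None:
        block_size = max(len(quantized), 1)
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

With inputs whose magnitudes vary widely, a finer granularity keeps small values from being crushed by a scale that was chosen for a distant large value, at the cost of storing more scales.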
DeepSeek Sparse Attention
Added DeepSeek Sparse Attention, also known as DSA.
- Added TileLangMUSA-based DSA Prefill implementation.
- Added TileLangMUSA-based DSA Decode implementation.
GDN Support
Added GDN support with a unified and stable API.
- Added TileLangMUSA-based GDN Prefill implementation.
- Added TileLangMUSA-based GDN Decode implementation.
Wrapper Updates
Updated and added multiple wrappers to improve compatibility with upstream projects and common usage patterns.
FlashAttention 3 Wrapper
Refactored the FlashAttention 3 wrapper.
- Strictly compatible with the FA3 package name and import style.
- Added export for the `flash_attn_func` interface.
FlashMLA Wrapper
Added a new FlashMLA wrapper compatible with the official FlashMLA repository.
Supported computation modes:
- Dense
- Sparse
Supported model scenarios:
- DS V1
- DS R1
- DS V3.2
- GLM5
Known limitation:
- `MODEL1` is not supported yet.
SageAttention Wrapper
Added a new SageAttention wrapper compatible with a subset of the official SageAttention repository's capabilities.
- Provides the `sageattn` interface.
- Uses QK INT8 + PV FP8 quantization by default.
- Supports specifying other quantization precisions and quantization granularities.
Release v0.1.3
Highlights
This release improves compatibility across FMHA Forward, DeepGEMM, extensions, wrappers, and CLI tooling. It expands FlashAttention 3 scenario coverage, adds more DeepGEMM API support, introduces additional MoE Fused Gate configurations, and provides new debugging utilities through the mate CLI.
What's Changed
FMHA Forward Compatibility
Enhanced FMHA Forward compatibility for broader FlashAttention 3 scenarios.
Supported QKV input modes:
- Normal
- Ragged
- Padded
- Paged (KV only)
Supported mask modes:
- None
- Causal
- Local
- Local w/ sink
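The Causal and Local mask modes can be described by a single visibility rule. The sketch below follows the bottom-right-aligned window convention used by upstream FlashAttention. That MATE follows the same convention is an assumption here, the function names are hypothetical, and the "Local w/ sink" mode is not modeled.

```python
def is_visible(q_idx, k_idx, seqlen_q, seqlen_k, left=None, right=None):
    """Return True if query q_idx may attend to key k_idx.

    left/right are window sizes on each side of the diagonal; None means
    unbounded. Causal is left=None, right=0. Local is a finite window.
    """
    # Align the last query with the last key (bottom-right alignment).
    diag = q_idx + seqlen_k - seqlen_q
    if left is not None and k_idx < diag - left:
        return False
    if right is not None and k_idx > diag + right:
        return False
    return True


def mask_row(q_idx, seqlen_q, seqlen_k, left=None, right=None):
    """Visibility of every key position for one query position."""
    return [
        is_visible(q_idx, k, seqlen_q, seqlen_k, left, right)
        for k in range(seqlen_k)
    ]
```

For example, with 2 queries and 5 keys, a causal mask lets the first query see keys 0 to 3 and the second query see all 5 keys.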
Supported score modes:
- None
- Softcap
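For reference, the Softcap score mode is commonly defined as a scaled `tanh` applied to each attention score before softmax. The sketch below shows that common definition. Whether the MATE kernels compute it in exactly this form is an assumption, and `apply_softcap` is a hypothetical helper name.

```python
import math


def apply_softcap(score, softcap):
    """Smoothly bound a score to the open interval (-softcap, softcap).

    A non-positive softcap is treated as "disabled" and returns the score
    unchanged.
    """
    if softcap <= 0:
        return score
    return softcap * math.tanh(score / softcap)
```

Scores much smaller than the cap pass through almost unchanged, while large scores saturate near the cap.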
Supported configurations:
- PageSize: arbitrary page size is supported; `64` is recommended.
- DataType: `bf16`, `fp16`.
- HeadDim: arbitrary head dimension up to `512`.
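The role of the page size in the Paged KV mode can be sketched with a block-table lookup: a logical token position is split into a page slot and an offset within that page. This is a generic illustration of paged KV addressing, with hypothetical names (`block_table`, `locate_token`, `pages_needed`). It does not describe MATE's internal layout.

```python
def locate_token(block_table, seq_idx, token_idx, page_size=64):
    """Map a logical token position to (physical_page_id, offset_in_page).

    block_table[seq_idx] lists the physical page ids of one sequence in
    logical order.
    """
    page_slot, offset = divmod(token_idx, page_size)
    return block_table[seq_idx][page_slot], offset


def pages_needed(seqlen, page_size=64):
    """Number of pages required to hold seqlen tokens (ceiling division)."""
    return -(-seqlen // page_size)
```

With the recommended page size of 64, token 70 of a sequence sits at offset 6 of that sequence's second page.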
Optimization knobs:
- SplitKV
- PackGQA
- SchedulerMetadata
Additional compatibility:
- ContextParallel: compatible with VLLM-style usage.
- Compile: JIT enabled.
DeepGEMM Compatibility
Enhanced DeepGEMM compatibility with additional API and edge-case support.
Added support for:
- `m_grouped_bf16_gemm_nt_*` APIs
- `m_grouped_fp8_gemm_nt_*` APIs
- `k_grouped_fp8_gemm_tn_contiguous`
- FP8 MQA Logits Prefill / Decode
- `NextN=4` scenarios for Decode
- `m/n/k = 0` edge cases
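The `m/n/k = 0` edge cases are easiest to see in a pure-Python reference for an NT GEMM, where `a` is `m x k`, `b` is `n x k`, and the result is `a` times the transpose of `b`. The sketch below only shows what the mathematical definition gives for empty dimensions. How MATE represents these outputs is not stated in these notes, and `gemm_nt` here is a hypothetical reference function rather than a MATE API.

```python
def gemm_nt(a, b):
    """Reference NT GEMM: d[i][j] = sum over k of a[i][k] * b[j][k].

    a is a list of m rows of length k; b is a list of n rows of length k.
    """
    return [
        [sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
        for row_a in a
    ]
```

With `k = 0` every dot product is an empty sum, so the result is an `m x n` matrix of zeros. With `m = 0` or `n = 0` the result has no elements at all.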
Extensions
Added more MoE Fused Gate expert configurations:
- `160` experts
- `384` experts
- `256` experts with `1 group`
Wrappers
Added compatibility wrappers to simplify migration and integration.
mate-deep-gemm
- Compatible with the `deep-gemm` import style.
- Compatible with existing DeepGEMM API usage patterns.
mate-flash-attention
- Compatible with the `flash-attention3` import style.
- Compatible with FlashAttention 3 API usage patterns.
- Extended compatibility for SGL / VLLM FA3 fork usage patterns.
MATE CLI
Introduced the mate CLI for environment inspection and debugging.
New commands:
- `mate show-config`
  - Displays environment status, commit ID, and related runtime/build information.
- `mate env`
  - Displays available MATE-related environment variables.
Debugging improvements:
- Added new environment variables for dumping input/output data during debugging.
For more details, refer to the repository documentation and `mate --help`.
Release v0.1.0
Initial release.
Added
- Partial support for `deep_gemm`.
- Partial support for `flash_attention`.
- Partial support for `flash_mla`.