Releases: MooreThreads/mate

Release v0.2.1

23 May 05:36

Highlights

This release expands FMHA compatibility, improves FP8 output support, adds MODEL1 (DS V4) support for DSA, optimizes GDN performance, and introduces new mHC and DeepGEMM MQA Logits capabilities.

Starting from this release, wrapper versions are strictly aligned with the MATE version to prevent incompatible package combinations.

What's Changed

FMHA Updates

Added compatibility support for additional FMHA features:

  • RoPE
  • k_leftpad
  • kv_batch_idx
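
As a rough illustration of what the last two options usually mean, the sketch below models a KV cache lookup in plain Python. It assumes semantics similar to the cache_batch_idx and cache_leftpad arguments of upstream FlashAttention's flash_attn_with_kvcache. The function name select_keys, the list-based cache layout, and the choice to let seqlens_k count valid keys only are hypothetical simplifications, not MATE's API.

```python
def select_keys(k_cache, seqlens_k, kv_batch_idx=None, k_leftpad=None):
    """Return, per batch entry, the slice of cached keys it attends to.

    k_cache:      list of cache rows, each a list of key tokens (hypothetical layout)
    seqlens_k:    number of valid keys for each batch entry (assumed convention)
    kv_batch_idx: optional map from batch entry to cache row (assumed semantics)
    k_leftpad:    optional count of padding tokens at the start of each row (assumed)
    """
    selected = []
    for b, length in enumerate(seqlens_k):
        row = kv_batch_idx[b] if kv_batch_idx is not None else b
        start = k_leftpad[b] if k_leftpad is not None else 0
        selected.append(k_cache[row][start:start + length])
    return selected


k_cache = [
    ["pad", "pad", "a0", "a1"],   # cache row 0
    ["b0", "b1", "b2", "b3"],     # cache row 1
    ["pad", "c0", "c1", "c2"],    # cache row 2
]
# Two requests that live in cache rows 2 and 0, with 1 and 2 left-pad tokens.
picked = select_keys(k_cache, seqlens_k=[3, 2], kv_batch_idx=[2, 0], k_leftpad=[1, 2])
print(picked)
```

Without kv_batch_idx and k_leftpad, the helper falls back to reading batch entry b from cache row b starting at position 0.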

FP8 SageAttention & FP8 DenseGEMM

Added support for FP8 output and quantization scale outputs.

  • Added FP8 output support.
  • Added quantization scale output support.
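
To show why a scale is returned alongside an FP8 output, here is a minimal per-tensor sketch in plain Python. It assumes the common E4M3 format, whose largest finite value is 448. Rounding to the actual FP8 grid is left out, and the function names are hypothetical rather than part of MATE.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (the format itself is an assumption)


def quantize_per_tensor(values):
    """Scale values into the FP8 range and return (scaled_values, scale).

    Rounding to the FP8 grid is omitted; only the scaling and clamping
    that the scale output is needed to undo are shown.
    """
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return scaled, scale


def dequantize(scaled, scale):
    """Recover approximate original values from the scaled output and its scale."""
    return [v * scale for v in scaled]


out = [0.5, -2.0, 8.0]
q, s = quantize_per_tensor(out)
restored = dequantize(q, s)
print(s, restored)
```

A consumer that receives only the scaled values cannot recover their magnitude, which is why the scale has to travel with the output.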

DeepSeek Sparse Attention

Added MODEL1 (DS V4) support for DeepSeek Sparse Attention.

  • Added DSA Prefill support for MODEL1 (DS V4).
  • Added DSA Decode support for MODEL1 (DS V4).

GDN Updates

Improved GDN performance and expanded Decode capability.

  • Optimized Prefill performance.
  • Optimized Decode performance.
  • Added MTP support for Decode.

mHC

Added support for TF32 mHC pre-norm.

DeepGEMM MQA Logits

Improved paged MQA Logits support for larger batch sizes.

  • Paged MQA Logits now supports larger batch sizes.
  • The maximum supported batch size is limited only by shared memory capacity.

Wrapper Updates

Starting from v0.2.1, wrapper versions are strictly aligned with the MATE version to avoid incompatible package combinations.

  • Use mate check to verify wrapper consistency, including version and commit information.
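
The alignment rule can also be illustrated from Python. The sketch below is not how mate check works internally and does not inspect commit information. It only compares installed distribution versions through the standard importlib.metadata module. The distribution names mate, mate-deep-gemm, and mate-flash-attention are assumptions for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(core_version, wrapper_versions):
    """Return wrappers whose installed version differs from the core version.

    Wrappers that are not installed (version None) are skipped.
    """
    return {
        name: version
        for name, version in wrapper_versions.items()
        if version is not None and version != core_version
    }


# Distribution names below are assumptions for illustration.
WRAPPERS = ["mate-deep-gemm", "mate-flash-attention"]

core = installed_version("mate")
wrappers = {name: installed_version(name) for name in WRAPPERS}
if core is not None:
    for name, version in find_mismatches(core, wrappers).items():
        print(f"{name} {version} does not match mate {core}")
```

For real environments, mate check remains the supported way to verify consistency.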

DeepGEMM Wrapper

Added new interfaces:

  • mHC
  • bf16_gemm_nt

FlashMLA Wrapper

Added support for MODEL1 (DS V4) related input arguments.

Bug Fixes

Fixed the following issues:

  • Fixed NaN outputs in Fused MoE Gate under certain scenarios.
  • Fixed IMA issues in MQA Logits.
  • Fixed incorrect FA3 backend selection for Softcap scenarios.

Release v0.2.0

23 May 05:31

Highlights

This release introduces major architecture and runtime improvements, including Torch decoupling via the TVM-FFI ABI and full JIT compilation support across all MATE operators. It also adds new attention implementations, expands FMHA capabilities, integrates FP8 SageAttention, and updates multiple compatibility wrappers.

What's Changed

Torch Decoupling

MATE is now decoupled from a single Torch version through the TVM-FFI ABI.

  • Starting from v0.2.0, MATE can support multiple Torch versions at the same time.
  • This improves package compatibility and reduces dependency constraints for downstream users.

Full JIT Compilation Support

All MATE operators now support JIT compilation.

  • Enables more flexible runtime compilation.
  • Improves compatibility across different deployment environments.

FMHA Updates

FMHA support has been enhanced with new functionality and bug fixes.

Added support for:

  • AppendKV functionality.

Fixed issues in:

  • JIT compilation failure with HeadDim 192-192.
  • Incorrect SWA kernel selection in some scenarios.
  • Scheduler metadata kernel JIT errors when the batch size exceeds 992.

FP8 SageAttention Integration

Integrated the assembly-based SageAttention implementation.

Supported quantization modes:

  • QK INT8 + PV FP8
  • QK FP8 + PV FP8

Supported capabilities:

  • Multiple quantization granularities.
  • Configurable quantization precision and granularity through the wrapper interface.
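
To make "quantization granularity" concrete, the sketch below computes scales for the same matrix at two granularities: one scale for the whole tensor, and one scale per block of rows. These notes do not list which granularities MATE offers, so the two shown here, the helper names, and the INT8 range of 127 are illustrative assumptions.

```python
def per_tensor_scale(matrix, qmax):
    """One scale for the whole matrix."""
    amax = max(abs(v) for row in matrix for v in row)
    return amax / qmax


def per_block_scales(matrix, qmax, block_rows):
    """One scale per group of `block_rows` consecutive rows."""
    scales = []
    for start in range(0, len(matrix), block_rows):
        block = matrix[start:start + block_rows]
        amax = max(abs(v) for row in block for v in row)
        scales.append(amax / qmax)
    return scales


INT8_MAX = 127.0
q = [
    [0.1, -0.2],
    [0.3, 0.05],
    [12.7, -6.0],   # a row containing an outlier
    [1.0, 2.0],
]
coarse = per_tensor_scale(q, INT8_MAX)
fine = per_block_scales(q, INT8_MAX, block_rows=2)
print(coarse, fine)
```

With a single scale, the outlier in the third row dictates the step size for every row. With per-block scales, the first block keeps a much smaller step, at the cost of storing more scales.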

DeepSeek Sparse Attention

Added DeepSeek Sparse Attention, also known as DSA.

  • Added TileLangMUSA-based DSA Prefill implementation.
  • Added TileLangMUSA-based DSA Decode implementation.

GDN Support

Added GDN support with a unified and stable API.

  • Added TileLangMUSA-based GDN Prefill implementation.
  • Added TileLangMUSA-based GDN Decode implementation.

Wrapper Updates

Updated and added multiple wrappers to improve compatibility with upstream projects and common usage patterns.

FlashAttention 3 Wrapper

Refactored the FlashAttention 3 wrapper.

  • Strictly compatible with the FA3 package name and import style.
  • Added export for the flash_attn_func interface.

FlashMLA Wrapper

Added a new FlashMLA wrapper compatible with the official FlashMLA repository.

Supported computation modes:

  • Dense
  • Sparse

Supported model scenarios:

  • DS V1
  • DS R1
  • DS V3.2
  • GLM5

Known limitation:

  • MODEL1 is not supported yet.

SageAttention Wrapper

Added a new SageAttention wrapper that is compatible with a subset of the official SageAttention repository's capabilities.

  • Provides the sageattn interface.
  • Uses QK INT8 + PV FP8 quantization by default.
  • Supports specifying other quantization precisions and quantization granularities.

Release v0.1.3

23 May 05:29

Highlights

This release improves compatibility across FMHA Forward, DeepGEMM, extensions, wrappers, and CLI tooling. It expands FlashAttention 3 scenario coverage, adds more DeepGEMM API support, introduces additional MoE Fused Gate configurations, and provides new debugging utilities through the mate CLI.

What's Changed

FMHA Forward Compatibility

Enhanced FMHA Forward compatibility for broader FlashAttention 3 scenarios.

Supported QKV input modes:

  • Normal
  • Ragged
  • Padded
  • Paged — KV only
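
The ragged and paged modes differ mainly in how a token position is turned into a storage location. The following sketch shows the usual indexing arithmetic in plain Python. The names cu_seqlens and page_table follow common FlashAttention-style terminology and are assumptions here, as are the helper functions.

```python
from itertools import accumulate


def cu_seqlens(seqlens):
    """Cumulative offsets for a ragged batch packed along the token axis."""
    return [0] + list(accumulate(seqlens))


def paged_slot(page_table, batch, token, page_size):
    """Map a logical KV position to (physical_page, offset_in_page)."""
    return page_table[batch][token // page_size], token % page_size


# Ragged: sequence b occupies rows offsets[b]:offsets[b + 1] of the packed tensor.
offsets = cu_seqlens([3, 5, 2])

# Paged: hypothetical physical page ids for two sequences, two pages each.
page_table = [[7, 2], [4, 9]]
slot = paged_slot(page_table, batch=1, token=70, page_size=64)
print(offsets, slot)
```

In the padded mode, by contrast, every sequence occupies a fixed-length row and a per-sequence length marks where the valid tokens end.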

Supported mask modes:

  • None
  • Causal
  • Local
  • Local w/ sink
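
The first three mask modes can be written as a visibility predicate over query and key positions. The sketch below assumes the (left, right) window convention and the bottom-right alignment used by upstream FlashAttention. MATE's exact semantics may differ. The sink variant is left out because these notes do not define it.

```python
def is_visible(i, j, seqlen_q, seqlen_k, causal=False, window=None):
    """Return True if query i may attend to key j.

    window is (left, right); a negative value means unbounded on that side.
    Queries are aligned to the end of the keys (bottom-right alignment),
    an assumption borrowed from upstream FlashAttention.
    """
    pos = i + (seqlen_k - seqlen_q)   # key position that lines up with query i
    if causal and j > pos:
        return False
    if window is not None:
        left, right = window
        if left >= 0 and j < pos - left:
            return False
        if right >= 0 and j > pos + right:
            return False
    return True


def mask_rows(seqlen_q, seqlen_k, **kwargs):
    """Render the mask as strings: '1' visible, '0' masked."""
    return [
        "".join("1" if is_visible(i, j, seqlen_q, seqlen_k, **kwargs) else "0"
                for j in range(seqlen_k))
        for i in range(seqlen_q)
    ]


causal = mask_rows(3, 3, causal=True)
local = mask_rows(4, 4, window=(1, 0))
print("\n".join(causal))
print("\n".join(local))
```

With neither causal nor window set, every key is visible, which corresponds to the None mode.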

Supported score modes:

  • None
  • Softcap
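
Softcap is commonly defined as a tanh squashing of the raw attention score before softmax. The sketch below uses that common definition, with a non-positive cap meaning "disabled" as in upstream FlashAttention. It is an assumption that MATE follows the same formula.

```python
import math


def softcap(score, cap):
    """Squash a raw attention score into (-cap, cap); cap <= 0 disables it."""
    if cap <= 0:
        return score
    return cap * math.tanh(score / cap)


raw = [0.5, 10.0, 200.0]
capped = [softcap(s, cap=30.0) for s in raw]
print(capped)
```

Small scores pass through almost unchanged, while very large scores saturate near the cap.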

Supported configurations:

  • PageSize: arbitrary page size is supported; 64 is recommended.
  • DataType: bf16, fp16.
  • HeadDim: arbitrary head dimension up to 512.

Optimization knobs:

  • SplitKV
  • PackGQA
  • SchedulerMetadata

Additional compatibility:

  • ContextParallel: compatible with vLLM-style usage.
  • Compile: JIT enabled.

DeepGEMM Compatibility

Enhanced DeepGEMM compatibility with additional API and edge-case support.

Added support for:

  • m_grouped_bf16_gemm_nt_* APIs
  • m_grouped_fp8_gemm_nt_* APIs
  • k_grouped_fp8_gemm_tn_contiguous
  • FP8 MQA Logits Prefill / Decode
  • NextN=4 scenarios for Decode
  • m/n/k = 0 edge cases
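
For the zero-size edge cases, a plain Python reference makes the expected results explicit. The sketch reads the nt suffix in the DeepGEMM sense, D = A x B^T with A of shape [m, k] and B of shape [n, k]. The function gemm_nt below is a hypothetical reference, not a MATE or DeepGEMM call, and it shows only the mathematically expected outputs, not how the kernels handle these cases internally.

```python
def gemm_nt(a, b):
    """Reference D = A @ B^T with A of shape [m, k] and B of shape [n, k]."""
    return [
        [sum(x * y for x, y in zip(a_row, b_row)) for b_row in b]
        for a_row in a
    ]


a = [[1.0, 2.0], [3.0, 4.0]]                # m = 2, k = 2
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # n = 3, k = 2
d = gemm_nt(a, b)

k_zero = gemm_nt([[], []], [[], [], []])    # k = 0: an all-zero [2, 3] result
m_zero = gemm_nt([], b)                     # m = 0: a result with no rows
n_zero = gemm_nt(a, [])                     # n = 0: two rows with no columns
print(d, k_zero, m_zero, n_zero)
```

The k = 0 case is the one most easily mishandled: the output is not empty, it is an m by n matrix of zeros, because each entry is an empty sum.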

Extensions

Added more MoE Fused Gate expert configurations:

  • 160 experts
  • 384 experts
  • 256 experts with 1 group

Wrappers

Added compatibility wrappers to simplify migration and integration.

mate-deep-gemm

  • Compatible with the deep-gemm import style.
  • Compatible with existing DeepGEMM API usage patterns.

mate-flash-attention

  • Compatible with the flash-attention3 import style.
  • Compatible with FlashAttention 3 API usage patterns.
  • Extended compatibility for SGL / vLLM FA3 fork usage patterns.

MATE CLI

Introduced the mate CLI for environment inspection and debugging.

New commands:

  • mate show-config
    • Displays environment status, commit ID, and related runtime/build information.
  • mate env
    • Displays available MATE-related environment variables.

Debugging improvements:

  • Added new environment variables for dumping input/output data during debugging.

For more details, please refer to the repository documentation and mate --help.

Release v0.1.0

23 May 05:25

Initial release.

Added

  • Partial support for deep_gemm.
  • Partial support for flash_attention.
  • Partial support for flash_mla.