Metal: FP8-packed compressed-KV cache + long-context memory optimizations #418

Open
aledesogusbusiness-hue wants to merge 1 commit into antirez:main from aledesogusbusiness-hue:main

@aledesogusbusiness-hue

Summary

Three memory optimizations for the Metal backend's compressed-KV (MLA latent) path that together cut context-dependent KV memory by ~2.3x with bit-identical output and speed-neutral decode throughput.

  • Packed FP8 comp cache (opt-in, DS4_METAL_FP8_KV_STORE=1): stores comp rows as e4m3 + ue8m0 scale + f16 rot = 584 B/row, versus 1024 B/row in f16. Dequantization uses a 128-entry LUT, avoiding a branch and an exp2 per element.
  • comp_mask stored as f16 (always on): binary −∞/0 mask fits exactly in f16, halving the mask buffer at all context lengths.
  • indexer_scores token-tiling (always on): DS4_INDEXER_SCORE_TILE=512 reduces the score working buffer from comp_cap × prefill_cap to comp_cap × 512 (~8x reduction).
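
To make the LUT-based dequantization concrete, here is a minimal CPU-side sketch in C. It assumes the standard OCP e4m3 encoding (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, magnitude code 0x7F reserved for NaN) and a ue8m0 scale equal to 2^(e-127). The names `e4m3_mag_lut`, `e4m3_lut_init`, `ue8m0_to_f32`, and `e4m3_dequant` are hypothetical and are not taken from the PR; the actual Metal kernel and its `ds4_e4m3_lut` table may be organized differently.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE-754 single-precision float. */
static float f32_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Exact 2^e for e in [-126, 127], built directly from the exponent field. */
static float pow2i(int e) {
    return f32_from_bits((uint32_t)(e + 127) << 23);
}

/* Magnitudes of the 128 sign-less e4m3 codes (hypothetical table name). */
static float e4m3_mag_lut[128];

static void e4m3_lut_init(void) {
    for (int code = 0; code < 128; code++) {
        int exp = code >> 3;   /* 4 exponent bits, bias 7 */
        int man = code & 7;    /* 3 mantissa bits */
        if (code == 0x7F) {
            e4m3_mag_lut[code] = f32_from_bits(0x7FC00000u);   /* NaN */
        } else if (exp == 0) {
            e4m3_mag_lut[code] = (float)man * pow2i(-9);       /* subnormal: man/8 * 2^-6 */
        } else {
            e4m3_mag_lut[code] = (1.0f + (float)man / 8.0f) * pow2i(exp - 7);
        }
    }
}

/* ue8m0 scale: unsigned 8-bit exponent, value 2^(e - 127), 0xFF is NaN. */
static float ue8m0_to_f32(uint8_t e) {
    if (e == 0xFF) return f32_from_bits(0x7FC00000u);
    if (e == 0)    return f32_from_bits(0x00400000u);   /* 2^-127, subnormal in f32 */
    return f32_from_bits((uint32_t)e << 23);            /* same exponent bias as f32 */
}

/* Dequantize one element: table lookup for the magnitude, arithmetic sign,
   then the block scale. No data-dependent branch and no exp2 call. */
static float e4m3_dequant(uint8_t code, float scale) {
    float sign = 1.0f - 2.0f * (float)(code >> 7);
    return sign * e4m3_mag_lut[code & 0x7F] * scale;
}
```

Only the 7 magnitude bits index the table, which is why 128 entries cover all 256 byte values; the sign bit is applied arithmetically afterwards.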

Results

| Context | KV cache reduction | Output diff |
|---|---|---|
| 8k–96k | ~2.3x | Bit-identical |

Decode speed: break-even (bandwidth saving offset by per-element e4m3 dequant).
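
For reference, the per-row sizes quoted in the summary can be reproduced with a few lines of C. The PR gives only the totals (584 B and 1024 B); the split used here (512 values per row, of which 64 are rotary values kept in f16, 448 stored as e4m3, plus 8 bytes of ue8m0 scales) is an assumption chosen because it adds up to those totals, and the constant and function names are hypothetical.

```c
#include <stddef.h>

/* Assumed row shape; only the 584 B and 1024 B totals come from the PR. */
enum {
    ROW_DIMS    = 512,  /* values per comp row (1024 B / 2 B per f16)   */
    ROT_DIMS    = 64,   /* assumed rotary part kept in f16               */
    SCALE_BYTES = 8     /* assumed ue8m0 scale bytes per row             */
};

/* Bytes per row when every value is stored as f16. */
static size_t row_bytes_f16(void) {
    return (size_t)ROW_DIMS * 2;
}

/* Bytes per row in the packed layout: e4m3 payload + scales + f16 rot. */
static size_t row_bytes_fp8_packed(void) {
    return (size_t)(ROW_DIMS - ROT_DIMS) + SCALE_BYTES + (size_t)ROT_DIMS * 2;
}

/* Total comp-cache bytes for a given number of cached rows. */
static size_t comp_cache_bytes(size_t rows, size_t row_bytes) {
    return rows * row_bytes;
}
```

Under these assumptions the packed rows alone shrink the comp cache by 1024/584, or about 1.75x.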

Controls

  • DS4_DISABLE_KV_OPTS=1 — reverts to pre-optimization layout for A/B comparison
  • DS4_METAL_FP8_KV_STORE=1 — enables packed FP8 (opt-in, precision change)
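
As an illustration of how the two switches might interact, the following C sketch resolves them into a settings struct. Only the variable names `DS4_DISABLE_KV_OPTS` and `DS4_METAL_FP8_KV_STORE` come from the PR. The `kv_opts` struct, the helper names, the rule that only the exact value "1" enables a flag, and the rule that the disable switch also overrides the FP8 opt-in are all assumptions.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical resolved settings; not a structure from the PR. */
typedef struct {
    bool fp8_kv_store;   /* packed FP8 comp cache (opt-in)  */
    bool f16_comp_mask;  /* comp_mask stored as f16         */
    bool score_tiling;   /* indexer_scores token-tiling     */
} kv_opts;

/* Assumption: a flag counts as set only when its value is exactly "1". */
static bool flag_is_set(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Pure resolution step, kept separate from getenv so it can be tested. */
static kv_opts kv_opts_resolve(const char *disable_value, const char *fp8_value) {
    bool disabled = flag_is_set(disable_value);
    kv_opts o;
    o.fp8_kv_store  = !disabled && flag_is_set(fp8_value);
    o.f16_comp_mask = !disabled;   /* always on unless disabled */
    o.score_tiling  = !disabled;   /* always on unless disabled */
    return o;
}

/* Read both switches from the process environment. */
static kv_opts kv_opts_from_env(void) {
    return kv_opts_resolve(getenv("DS4_DISABLE_KV_OPTS"),
                           getenv("DS4_METAL_FP8_KV_STORE"));
}
```

Keeping the resolution logic free of `getenv` makes the A/B behaviour easy to check without touching the process environment.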

Test environment

Apple M5 Max · DeepSeek-V4-Flash · macOS 27.0 · Metal 4.1 · 8k–96k context

🤖 Generated with Claude Code

…ions

Three memory optimizations for the Metal backend's compressed-KV (MLA latent) path:

1. Packed FP8 comp cache (opt-in, DS4_METAL_FP8_KV_STORE=1): stores comp rows
   as e4m3 + ue8m0 scale + f16 rot = 584 B/row vs 1024 B/row f16. Dequant uses
   a 128-entry LUT (ds4_e4m3_lut) avoiding branch+exp2. Validated bit-identical.

2. comp_mask stored as f16 (always on): binary -inf/0 mask fits exactly in f16,
   halving the mask buffer size at all context lengths.

3. indexer_scores token-tiling (always on): DS4_INDEXER_SCORE_TILE=512 reduces
   the score working buffer from comp_cap*prefill_cap to comp_cap*512 (~8x).
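
The buffer arithmetic behind the tiling change can be sketched in C as follows. The tile width 512 and the `comp_cap` and `prefill_cap` names come from the text above; the helper functions are hypothetical, and the actual kernel dispatch is not shown. Note that the quoted ~8x corresponds to a `prefill_cap` of about 4096, which is an inference from the ratio `prefill_cap / 512` and not a figure stated in the PR.

```c
#include <stddef.h>

enum { SCORE_TILE = 512 };  /* default of DS4_INDEXER_SCORE_TILE per the PR */

/* Elements in the score buffer when all prefill tokens are scored at once. */
static size_t score_elems_untiled(size_t comp_cap, size_t prefill_cap) {
    return comp_cap * prefill_cap;
}

/* Elements in the score buffer when tokens are scored one tile at a time. */
static size_t score_elems_tiled(size_t comp_cap, size_t prefill_cap, size_t tile) {
    size_t cols = prefill_cap < tile ? prefill_cap : tile;
    return comp_cap * cols;
}

/* Number of passes needed to cover n_tokens with the given tile width. */
static size_t score_tile_count(size_t n_tokens, size_t tile) {
    return (n_tokens + tile - 1) / tile;
}

/* Token range [start, start + len) handled by pass i; the last pass may be short. */
static void score_tile_range(size_t i, size_t n_tokens, size_t tile,
                             size_t *start, size_t *len) {
    *start = i * tile;
    size_t remaining = n_tokens - *start;
    *len = remaining < tile ? remaining : tile;
}
```

The working buffer is reused across passes, so its size depends on the tile width and not on the number of prefill tokens.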

Together: ~2.3x KV cache reduction at long context, bit-identical output,
speed-neutral decode (bandwidth saving offset by per-element dequant cost).

Revert with DS4_DISABLE_KV_OPTS=1. Tested on M5 Max, 8k-96k context.

Fixes: antirez#416

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fry69

fry69 commented Jun 16, 2026

Contributor

Is this the same as #416?
