Metal: FP8-packed compressed-KV cache + long-context memory optimizations by aledesogusbusiness-hue · Pull Request #418 · antirez/ds4

aledesogusbusiness-hue · 2026-06-16T00:35:57Z

Summary

Three memory optimizations for the Metal backend's compressed-KV (MLA latent) path that together cut context-dependent KV memory by ~2.3x with bit-identical output and speed-neutral decode throughput.

Packed FP8 comp cache (opt-in, DS4_METAL_FP8_KV_STORE=1): stores comp rows as e4m3 + ue8m0 scale + f16 rot = 584 B/row vs 1024 B/row f16. Dequant uses a 128-entry LUT avoiding branch+exp2.
comp_mask stored as f16 (always on): binary −∞/0 mask fits exactly in f16, halving the mask buffer at all context lengths.
indexer_scores token-tiling (always on): DS4_INDEXER_SCORE_TILE=512 reduces the score working buffer from comp_cap × prefill_cap to comp_cap × 512 (~8x reduction).
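For reviewers who want to sanity-check the two always-on items, here is a minimal sketch of the size arithmetic. It is not code from this PR. The capacities `comp_cap` and `prefill_cap` are hypothetical values picked so the ratio matches the quoted ~8x, the helper names are illustrative, and the pre-change mask is assumed to have been 4 bytes per entry (implied by "halving").

```python
import math
import struct

SCORE_TILE = 512  # value of DS4_INDEXER_SCORE_TILE quoted in the PR


def f16_roundtrip(x: float) -> float:
    """Encode x as IEEE 754 binary16 and decode it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mask_bytes(n_entries: int, bytes_per_entry: int) -> int:
    """Size of a mask buffer holding n_entries values."""
    return n_entries * bytes_per_entry


def score_buffer_elems(comp_cap: int, token_cols: int) -> int:
    """Element count of a comp_cap x token_cols score working buffer."""
    return comp_cap * token_cols


def tile_ranges(n_tokens: int, tile: int = SCORE_TILE):
    """Yield [start, end) token ranges so at most `tile` columns are live at once."""
    for start in range(0, n_tokens, tile):
        yield start, min(start + tile, n_tokens)


# Both mask values survive the f16 round trip unchanged.
assert f16_roundtrip(0.0) == 0.0
assert f16_roundtrip(-math.inf) == -math.inf

# Hypothetical capacities (assumptions, not taken from the PR).
comp_cap, prefill_cap = 24576, 4096
full = score_buffer_elems(comp_cap, prefill_cap)
tiled = score_buffer_elems(comp_cap, SCORE_TILE)
print(full // tiled)            # reduction factor for these assumed sizes
print(list(tile_ranges(1200)))  # tiles covering a 1200-token prefill
```

The reduction factor is simply `prefill_cap / 512`, so the ~8x figure holds only for a prefill capacity near 4096 tokens; other capacities scale accordingly.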

Results

Context	KV cache reduction	Output diff
8k–96k	~2.3x	Bit-identical

Decode speed: break-even (bandwidth saving offset by per-element e4m3 dequant).
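To make the per-element dequant cost concrete, the following is a rough sketch of a 128-entry lookup-table dequant. It is not the `ds4_e4m3_lut` from this PR. It assumes the OCP E4M3 encoding (exponent bias 7, no infinities, `0x7F` as NaN) and a ue8m0 scale interpreted as 2^(byte − 127); the function names are hypothetical.

```python
def build_e4m3_lut() -> list[float]:
    """Magnitudes for the 128 e4m3 codes with the sign bit cleared."""
    lut = []
    for code in range(128):
        exp = code >> 3   # 4 exponent bits
        man = code & 0x7  # 3 mantissa bits
        if code == 0x7F:
            lut.append(float("nan"))                  # NaN encoding (assumed OCP E4M3)
        elif exp == 0:
            lut.append(man / 8.0 * 2.0 ** -6)         # subnormal range
        else:
            lut.append((1.0 + man / 8.0) * 2.0 ** (exp - 7))
    return lut


E4M3_LUT = build_e4m3_lut()  # built once; lookups need no branch or exp2


def ue8m0_to_scale(byte: int) -> float:
    """Power-of-two block scale; 0xFF (NaN in the OCP spec) is not handled here."""
    return 2.0 ** (byte - 127)


def dequant(code: int, scale_byte: int) -> float:
    """Decode one e4m3 byte using the table, then apply the block scale."""
    magnitude = E4M3_LUT[code & 0x7F]
    sign = 1.0 - 2.0 * (code >> 7)  # +1.0 or -1.0 from the top bit
    return sign * magnitude * ue8m0_to_scale(scale_byte)


print(E4M3_LUT[0x38])       # 1.0
print(E4M3_LUT[0x7E])       # 448.0, the largest finite e4m3 value
print(dequant(0xB8, 128))   # -2.0: code for -1.0 with a scale of 2
```

Folding the subnormal case and the exponent into the table is what removes the branch and the `exp2` call from the inner loop; the remaining per-element work is one table load and two multiplies.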

Controls

DS4_DISABLE_KV_OPTS=1 — reverts to pre-optimization layout for A/B comparison
DS4_METAL_FP8_KV_STORE=1 — enables packed FP8 (opt-in, precision change)

Test environment

Apple M5 Max · DeepSeek-V4-Flash · macOS 27.0 · Metal 4.1 · 8k–96k context

🤖 Generated with Claude Code

…ions

Three memory optimizations for the Metal backend's compressed-KV (MLA latent) path:

1. Packed FP8 comp cache (opt-in, DS4_METAL_FP8_KV_STORE=1): stores comp rows as e4m3 + ue8m0 scale + f16 rot = 584 B/row vs 1024 B/row f16. Dequant uses a 128-entry LUT (ds4_e4m3_lut) avoiding branch+exp2. Validated bit-identical.
2. comp_mask stored as f16 (always on): binary -inf/0 mask fits exactly in f16, halving the mask buffer size at all context lengths.
3. indexer_scores token-tiling (always on): DS4_INDEXER_SCORE_TILE=512 reduces the score working buffer from comp_cap*prefill_cap to comp_cap*512 (~8x).

Together: ~2.3x KV cache reduction at long context, bit-identical output, speed-neutral decode (bandwidth saving offset by per-element dequant cost).

Revert with DS4_DISABLE_KV_OPTS=1. Tested on M5 Max, 8k-96k context.

Fixes: antirez#416
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fry69 · 2026-06-16T04:41:31Z

Is this the same as #416?

aledesogusbusiness-hue wants to merge 1 commit into antirez:main from aledesogusbusiness-hue:main
