[Feature] OFFLOAD: standalone LMCache CPU/NVMe KV-offload connector by yhl-amd · Pull Request #1318 · ROCm/ATOM

yhl-amd · 2026-06-23T02:55:11Z

Summary

Adds a standalone KV-offload subsystem that offloads the ATOM KV cache to
LMCache-backed CPU/NVMe storage and reloads it on cache hits, so evicted
prefixes are restored instead of recomputed.

Squashed into 3 logical commits:

- [Feature] OFFLOAD: add LMCache CPU/NVMe KV-offload subsystem: new
  atom/kv_transfer/offload package (LMCacheOffloadConnector, ATOM↔LMCache
  GPU connector, byte codec for ATOM's packed KV layout, Triton staging
  kernel, metadata/config) plus disaggregation base/factory/types/aggregator
  wiring.
- [Feature] OFFLOAD: integrate KV offload load/save into engine and scheduler:
  async load dispatch, worker transfer polling, and idle KV-transfer advance;
  deferred block free until the background D2H save completes; wake parked
  prefills for local recompute on load miss; chunked-prefill deferred-output
  handling across park/resume.
- [Frontend] Support max completion tokens in OpenAI API: honor
  max_completion_tokens / max_tokens so offload benchmarks can bound
  generation length.

Test plan

- pytest tests/test_lmcache_offload_connector.py (connector + byte-codec round-trip)
- pytest tests/test_scheduler.py
- pytest tests/entrypoints/test_protocol.py tests/entrypoints/test_api_server_helpers.py
- MI325X micro-bench: evicted 32K prefix reload (CPU ~0.32s / NVMe ~0.46s) vs recompute ~2.5s → 6–8× faster TTFT
- Reviewer: CI green on ROCm runners
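As a quick sanity check on the micro-bench item, the speedup is the recompute time divided by the reload time. The sketch below uses only the rounded timings quoted in the test plan; the variable and function names are illustrative and do not come from the PR.

```python
# Rounded timings quoted in the test plan (seconds); names are illustrative.
RECOMPUTE_S = 2.5
RELOAD_S = {"cpu": 0.32, "nvme": 0.46}


def speedup(recompute_s: float, reload_s: float) -> float:
    """Return how many times faster a reload is than a full recompute."""
    if reload_s <= 0:
        raise ValueError("reload time must be positive")
    return recompute_s / reload_s


speedups = {tier: speedup(RECOMPUTE_S, t) for tier, t in RELOAD_S.items()}

for tier, factor in speedups.items():
    print(f"{tier}: {factor:.1f}x faster TTFT than recompute")
```

With these rounded inputs the ratios come out at roughly 7.8× (CPU) and 5.4× (NVMe); the 6–8× range in the PR presumably reflects the unrounded measurements.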

🤖 Generated with Claude Code

Add a standalone KV-offload subsystem that offloads ATOM KV cache to LMCache-backed CPU/NVMe storage and reloads it on cache hits, avoiding prefill recompute for evicted prefixes.

- New atom/kv_transfer/offload package: LMCacheOffloadConnector, the ATOM<->LMCache GPU connector, a byte codec for ATOM's packed KV layout, a Triton staging kernel, plus metadata and config.
- Wire the connector into the disaggregation base/factory/types and aggregate per-worker finished/failed transfer states.
- Unit tests for the connector and byte-codec round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
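The per-worker aggregation mentioned in this commit can be pictured with a small sketch. It assumes a policy the commit message does not spell out: a transfer counts as finished only once every worker has reported it, and as failed as soon as any worker reports a failure. The class name, method names, and request IDs are hypothetical and are not the ones added by this PR.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TransferAggregator:
    """Hypothetical aggregator; the all-finish / any-fail policy is assumed."""

    world_size: int
    _finished_counts: Counter = field(default_factory=Counter)
    _failed: set = field(default_factory=set)

    def update(self, finished: set, failed: set) -> tuple:
        """Merge one worker's report; return (newly_done, newly_failed)."""
        newly_failed = set()
        for req_id in failed:
            if req_id not in self._failed:
                self._failed.add(req_id)
                newly_failed.add(req_id)
            # A failed request can no longer complete successfully.
            self._finished_counts.pop(req_id, None)

        newly_done = set()
        for req_id in finished:
            if req_id in self._failed:
                continue
            self._finished_counts[req_id] += 1
            if self._finished_counts[req_id] == self.world_size:
                del self._finished_counts[req_id]
                newly_done.add(req_id)
        return newly_done, newly_failed


agg = TransferAggregator(world_size=2)
first = agg.update({"req-a"}, set())        # only one of two workers is done
second = agg.update({"req-a"}, {"req-b"})   # second worker finishes req-a
```

After the second report, `req-a` is released as done and `req-b` is surfaced as failed, which is the point at which a scheduler could fall back to local recompute.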

…eduler

Drive offload load/save from the engine loop and scheduler:

- Dispatch async KV load after connector metadata, poll worker transfer status, and advance idle KV transfer when no forward batch runs.
- Defer block free until a background D2H save has read the KV, and wake parked prefills for local recompute on a load miss (failed_recving).
- Handle chunked-prefill deferred output across the offload park/resume boundary so stale sampled tokens are dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
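The deferred block free described in this commit can be sketched as a small bookkeeping structure: blocks belonging to a finished request are parked while a background D2H save may still read them, and only return to the free list once the save is reported complete. This is a minimal illustration under that assumption; `DeferredFreeTracker` and its methods are hypothetical names, not the scheduler code in the PR.

```python
from dataclasses import dataclass, field


@dataclass
class DeferredFreeTracker:
    """Hypothetical helper: holds blocks until their D2H save completes."""

    free_blocks: list = field(default_factory=list)
    _pending: dict = field(default_factory=dict)  # req_id -> parked block ids

    def request_finished(self, req_id: str, block_ids: list, saving: bool) -> None:
        if saving:
            # A background save still reads these blocks, so park them.
            self._pending[req_id] = list(block_ids)
        else:
            self.free_blocks.extend(block_ids)

    def save_completed(self, req_id: str) -> None:
        # The save has read the KV, so the blocks can be reused safely.
        self.free_blocks.extend(self._pending.pop(req_id, []))


tracker = DeferredFreeTracker()
tracker.request_finished("r1", [3, 4], saving=True)   # parked, not yet free
tracker.request_finished("r2", [7], saving=False)     # freed immediately
free_before_save = list(tracker.free_blocks)
tracker.save_completed("r1")                          # r1's blocks released
```

Freeing the blocks before the save completes would let a new request overwrite KV that the copy is still reading, which is why the free is tied to the completion signal rather than to request completion.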

Honor max_completion_tokens (and the max_tokens alias) in the OpenAI completion/chat protocol and server so offload benchmarks can bound generation length. Adds protocol and server-helper unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
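One way to resolve the two request fields into a single generation cap is sketched below. The precedence (max_completion_tokens first, then max_tokens, then a server default) is an assumption, as is the helper name `resolve_max_tokens`; the commit message only says both fields are honored.

```python
from typing import Optional


def resolve_max_tokens(
    max_completion_tokens: Optional[int],
    max_tokens: Optional[int],
    default: int,
) -> int:
    """Pick the generation cap; the precedence order here is an assumption."""
    for candidate in (max_completion_tokens, max_tokens):
        if candidate is not None:
            if candidate < 1:
                raise ValueError("token limit must be a positive integer")
            return candidate
    # Neither field was supplied, so fall back to the server default.
    return default
```

For a benchmark that sends `max_completion_tokens=64`, this helper would return 64 regardless of any `max_tokens` value in the same request.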

Document the ATOM standalone lmcache_offload connector: design, module map, scheduler/worker architecture, byte codec and AITER layout bridge, MemoryObj/segment layout, completion protocol, reload decision and chunk-alignment handoff, correctness/fp8/failure handling, the LMCache reuse-vs-override boundary, configuration, benchmarks, and tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>

ZhangLirong-amd · 2026-06-25T03:46:23Z

+        if not self.kv_transfer_enabled:
+            return
+        connector = getattr(self.scheduler, "kv_connector", None)
+        if connector is None or not getattr(connector, "is_offload", False):


We use the mooncake and moriio connectors for PD. Can we avoid checking "is_offload" on the connector here? We should add a new variable and plug it into the existing connectors.

The byte codec assumed a block-major KV layout (tensor.shape[0] == block count), which holds for MHA/GQA but not for MLA: MLA stores a single per-layer latent cache (kv_lora_rank + qk_rope_head_dim, e.g. 576) viewed token-major as (num_blocks * block_size, 1, latent), with no separate V or scale tensors. So shape[0] is the token count and the codec computed a per-token (not per-block) byte stride, corrupting the offloaded KV.

Both layouts share an identical contiguous byte layout (block b always starts at b * bytes_per_block), so instead of branching we take the physical block count explicitly and derive each segment's per-block stride as segment_bytes / num_blocks. The Triton fused staging kernel is byte-addressed and needs no change.

- ATOMKVByteCodec: accept explicit num_blocks; per-block bytes from it; require contiguous + numel divisible by num_blocks (replaces the "same shape[0]" check). Falls back to shape[0] when num_blocks is None, preserving non-MLA behaviour.
- Thread num_physical_kvcache_blocks: model_runner.allocate_kv_cache -> forward_context.set_kv_cache_data -> connector.register_kv_caches -> codec. Optional num_blocks kwarg added to the connector base + mooncake + moriio impls (ignored there).
- build_lmcache_metadata: emit an MLA-shaped kv_shape (latent dim) for bookkeeping; storage stays opaque BINARY so use_mla remains False.
- Tests: MLA token-major block accounting + byte-identical round-trip.

Validated on DeepSeek-V3-5layer (real MLA, TP=2) end-to-end: offload save + reload (cxs multi-round, round 2 hits cached:[~33k], no recompute).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
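The stride bug and its fix can be reproduced with plain shape arithmetic. The sketch below is not the ATOMKVByteCodec code: `bytes_per_block` is a hypothetical helper, and the block count (4), block size (16), MHA head shape, and 2-byte element size are made-up values chosen only to make the numbers easy to check. The latent width of 576 is the example given in the commit message.

```python
from math import prod
from typing import Optional


def bytes_per_block(shape: tuple, itemsize: int, num_blocks: Optional[int] = None) -> int:
    """Per-block byte stride of one contiguous KV segment.

    Falls back to shape[0] when num_blocks is None, matching the non-MLA
    behaviour the commit describes. Contiguity is assumed, not checked.
    """
    if num_blocks is None:
        num_blocks = shape[0]
    numel = prod(shape)
    if numel % num_blocks != 0:
        raise ValueError("numel must be divisible by num_blocks")
    return (numel // num_blocks) * itemsize


NUM_BLOCKS, BLOCK_SIZE, ITEMSIZE = 4, 16, 2   # hypothetical sizes

# Block-major MHA/GQA-style segment: shape[0] really is the block count.
mha_shape = (NUM_BLOCKS, BLOCK_SIZE, 8, 128)
mha_stride = bytes_per_block(mha_shape, ITEMSIZE)

# Token-major MLA latent cache: shape[0] is the token count.
mla_shape = (NUM_BLOCKS * BLOCK_SIZE, 1, 576)
mla_wrong = bytes_per_block(mla_shape, ITEMSIZE)               # per-token stride
mla_right = bytes_per_block(mla_shape, ITEMSIZE, NUM_BLOCKS)   # per-block stride

# Block b starts at b * bytes_per_block in both layouts.
mla_block_offsets = [b * mla_right for b in range(NUM_BLOCKS)]
```

Without the explicit block count the MLA stride is too small by a factor of `BLOCK_SIZE`, so every block after the first would be read from the wrong offset.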

Document MLA (DeepSeek R1/V3, Kimi) KV-offload support in the connector README: token-major latent cache layout, explicit num_blocks threading, and BINARY opaque storage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>

…_UNALIGNED_HANDOFF

Unaligned HBM-prefix loads now always take the handoff path (recompute the misaligned head up to the next chunk boundary, then load the aligned remainder from CPU) instead of being gated behind the OFFLOAD_UNALIGNED_HANDOFF env var. The env read and the gate check in _maybe_start_unaligned_handoff are removed; the min_load / boundary guards are unchanged.

- connector.py: drop _allow_unaligned_handoff + OFFLOAD_UNALIGNED_HANDOFF read; handoff is now unconditional.
- README.md / README.zh-CN.md: remove the env var from the tuning table and the example commands, add a "removed" note, and document the always-on behaviour. Also track the previously-untracked zh-CN README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>
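The handoff split can be illustrated as a pure function over token positions. This is a sketch under assumptions: `plan_unaligned_handoff`, `HandoffPlan`, and all parameter names are hypothetical, and the exact semantics of the min_load and boundary guards are guessed from their names rather than taken from connector.py.

```python
from typing import NamedTuple, Optional


class HandoffPlan(NamedTuple):
    recompute: range  # token positions recomputed locally
    load: range       # token positions loaded from the offload store


def plan_unaligned_handoff(
    hbm_prefix_len: int, hit_len: int, chunk_size: int, min_load: int
) -> Optional[HandoffPlan]:
    """Return a plan, or None when loading is not worthwhile (assumed guards)."""
    # Round the HBM-resident prefix up to the next chunk boundary.
    boundary = -(-hbm_prefix_len // chunk_size) * chunk_size
    if boundary >= hit_len:
        return None  # boundary guard: no aligned remainder left to load
    if hit_len - boundary < min_load:
        return None  # min_load guard: too few tokens to justify a load
    return HandoffPlan(range(hbm_prefix_len, boundary), range(boundary, hit_len))


# 300 tokens already in HBM, 1024 tokens available in the store, 256-token chunks.
plan = plan_unaligned_handoff(hbm_prefix_len=300, hit_len=1024, chunk_size=256, min_load=256)
```

In this example tokens 300 to 511 are recomputed and tokens 512 to 1023 are loaded; when the HBM prefix already sits on a chunk boundary, the recompute range is empty and the whole remainder is loaded.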

yhl-amd force-pushed the feature/lmcache-offload-merge branch from 7145965 to b9b7c24 Compare June 23, 2026 03:06

yhl-amd and others added 3 commits June 22, 2026 22:13

yhl-amd force-pushed the feature/lmcache-offload-merge branch from b9b7c24 to 6bb1fbc Compare June 23, 2026 03:14

ZhangLirong-amd reviewed Jun 25, 2026


zufayu requested review from amd-ruitang3 and removed request for amd-ruitang3 June 26, 2026 06:12

yhl-amd and others added 3 commits June 26, 2026 02:07

zufayu requested review from JiaoliangYu and amd-ruitang3 and removed request for amd-ruitang3 June 26, 2026 07:50

yhl-amd wants to merge 7 commits into ROCm:main from yhl-amd:feature/lmcache-offload-merge

2 participants
