
[Feature] OFFLOAD: standalone LMCache CPU/NVMe KV-offload connector#1318

Open
yhl-amd wants to merge 7 commits into ROCm:main from yhl-amd:feature/lmcache-offload-merge


yhl-amd commented Jun 23, 2026

Contributor

Summary

Adds a standalone KV-offload subsystem that offloads the ATOM KV cache to
LMCache-backed CPU/NVMe storage and reloads it on cache hits, so evicted
prefixes are restored instead of recomputed.

Squashed into 3 logical commits:

  1. [Feature] OFFLOAD: add LMCache CPU/NVMe KV-offload subsystem — new
    atom/kv_transfer/offload package (LMCacheOffloadConnector, ATOM↔LMCache
    GPU connector, byte codec for ATOM's packed KV layout, Triton staging
    kernel, metadata/config) + disaggregation base/factory/types/aggregator
    wiring.
  2. [Feature] OFFLOAD: integrate KV offload load/save into engine and scheduler
    — async load dispatch + worker transfer polling + idle KV-transfer advance;
    deferred block free until background D2H save completes; wake parked
    prefills for local recompute on load miss; chunked-prefill deferred-output
    handling across park/resume.
  3. [Frontend] Support max completion tokens in OpenAI API — honor
    max_completion_tokens / max_tokens so offload benchmarks can bound
    generation length.

Test plan

  • pytest tests/test_lmcache_offload_connector.py (connector + byte-codec round-trip)
  • pytest tests/test_scheduler.py
  • pytest tests/entrypoints/test_protocol.py tests/entrypoints/test_api_server_helpers.py
  • MI325X micro-bench: evicted 32K prefix reload (CPU ~0.32s / NVMe ~0.46s) vs recompute ~2.5s → 6–8× faster TTFT
  • Reviewer: CI green on ROCm runners
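
A quick check of the arithmetic behind the TTFT claim, using only the rounded timings quoted in the micro-bench bullet above. The helper name is illustrative. With these rounded inputs the CPU tier comes out near 7.8× and the NVMe tier near 5.4×, so the quoted 6–8× range presumably reflects the unrounded measurements.

```python
# Rounded timings quoted in the test plan (seconds).
RECOMPUTE_S = 2.5
RELOAD_S = {"cpu": 0.32, "nvme": 0.46}


def speedup(recompute_s: float, reload_s: float) -> float:
    """Return how many times faster a reload is than a recompute."""
    return recompute_s / reload_s


speedups = {tier: speedup(RECOMPUTE_S, t) for tier, t in RELOAD_S.items()}
for tier, ratio in speedups.items():
    print(f"{tier}: {ratio:.1f}x")
```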

🤖 Generated with Claude Code

yhl-amd force-pushed the feature/lmcache-offload-merge branch from 7145965 to b9b7c24 on June 23, 2026 03:06
yhl-amd and others added 3 commits June 22, 2026 22:13
Add a standalone KV-offload subsystem that offloads ATOM KV cache to
LMCache-backed CPU/NVMe storage and reloads it on cache hits, avoiding
prefill recompute for evicted prefixes.

- New atom/kv_transfer/offload package: LMCacheOffloadConnector, the
  ATOM<->LMCache GPU connector, a byte codec for ATOM's packed KV layout,
  a Triton staging kernel, plus metadata and config.
- Wire the connector into the disaggregation base/factory/types and
  aggregate per-worker finished/failed transfer states.
- Unit tests for the connector and byte-codec round-trip.
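
One way to picture the per-worker aggregation, as a minimal sketch rather than the code in this PR. `TransferAggregator`, `report`, and `drain` are hypothetical names, and the policy shown (a request is finished only once every worker reports it, and failed as soon as any worker reports a failure) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TransferAggregator:
    """Hypothetical aggregator: merges per-worker transfer reports."""

    world_size: int
    _finished: dict[str, set[int]] = field(default_factory=dict)
    _failed: set[str] = field(default_factory=set)

    def report(self, rank: int, finished: set[str], failed: set[str]) -> None:
        """Record which request ids one worker saw finish or fail."""
        for req_id in finished:
            self._finished.setdefault(req_id, set()).add(rank)
        self._failed |= failed

    def drain(self) -> tuple[set[str], set[str]]:
        """Return (finished, failed) request ids and forget them."""
        failed = set(self._failed)
        # Assumed policy: finished means every worker has reported it.
        done = {
            req_id
            for req_id, ranks in self._finished.items()
            if len(ranks) == self.world_size and req_id not in failed
        }
        for req_id in done | failed:
            self._finished.pop(req_id, None)
        self._failed.clear()
        return done, failed
```

With two workers, a request reported by rank 0 only stays pending until rank 1 reports it too.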

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eduler

Drive offload load/save from the engine loop and scheduler:
- Dispatch async KV load after connector metadata, poll worker transfer
  status, and advance idle KV transfer when no forward batch runs.
- Defer block free until a background D2H save has read the KV, and wake
  parked prefills for local recompute on a load miss (failed_recving).
- Handle chunked-prefill deferred output across the offload park/resume
  boundary so stale sampled tokens are dropped.
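
The deferred block free can be sketched as follows. This is an illustration under assumptions, not the scheduler code in this PR: `DeferredBlockFreer`, `release`, and `on_save_finished` are hypothetical names, and block ids are plain integers.

```python
class DeferredBlockFreer:
    """Hypothetical helper: holds blocks until their D2H save has finished."""

    def __init__(self) -> None:
        self.free_blocks: list[int] = []
        self._pending: dict[str, list[int]] = {}  # req_id -> blocks being saved

    def release(self, req_id: str, blocks: list[int], saving: bool) -> None:
        """Free immediately, or park the blocks while a save still reads them."""
        if saving:
            self._pending[req_id] = list(blocks)
        else:
            self.free_blocks.extend(blocks)

    def on_save_finished(self, req_id: str) -> None:
        """Called once the background copy no longer reads the blocks."""
        self.free_blocks.extend(self._pending.pop(req_id, []))
```

The point of the deferral is that a block returned to the free list can be reallocated and overwritten, so it must not be reusable while a background copy is still reading it.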

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Honor max_completion_tokens (and the max_tokens alias) in the OpenAI
completion/chat protocol and server so offload benchmarks can bound
generation length. Adds protocol and server-helper unit tests.
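
A minimal sketch of how the two fields might be resolved. `resolve_max_tokens` and its `default` argument are hypothetical, and preferring max_completion_tokens when both are set is an assumption, not something this commit message states.

```python
from typing import Optional


def resolve_max_tokens(
    max_completion_tokens: Optional[int],
    max_tokens: Optional[int],
    default: int,
) -> int:
    """Pick the generation cap; preferring max_completion_tokens is an assumption."""
    for value in (max_completion_tokens, max_tokens):
        if value is not None:
            return value
    return default
```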

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
yhl-amd force-pushed the feature/lmcache-offload-merge branch from b9b7c24 to 6bb1fbc on June 23, 2026 03:14
Document the ATOM standalone lmcache_offload connector: design, module
map, scheduler/worker architecture, byte codec and AITER layout bridge,
MemoryObj/segment layout, completion protocol, reload decision and
chunk-alignment handoff, correctness/fp8/failure handling, the LMCache
reuse-vs-override boundary, configuration, benchmarks, and tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>
if not self.kv_transfer_enabled:
    return
connector = getattr(self.scheduler, "kv_connector", None)
if connector is None or not getattr(connector, "is_offload", False):

Collaborator

We use the mooncake and moriio connectors in PD. Can we avoid checking "is_offload" on the connector here? We should use a new variable and plug it into the existing connectors.

@zufayu zufayu requested review from amd-ruitang3 and removed request for amd-ruitang3 June 26, 2026 06:12
yhl-amd and others added 3 commits June 26, 2026 02:07
The byte codec assumed a block-major KV layout (tensor.shape[0] == block
count), which holds for MHA/GQA but not for MLA: MLA stores a single
per-layer latent cache (kv_lora_rank + qk_rope_head_dim, e.g. 576) viewed
token-major as (num_blocks * block_size, 1, latent), with no separate V or
scale tensors. So shape[0] is the token count and the codec computed a
per-token (not per-block) byte stride, corrupting the offloaded KV.

Both layouts share an identical contiguous byte layout (block b always
starts at b * bytes_per_block), so instead of branching we take the physical
block count explicitly and derive each segment's per-block stride as
segment_bytes / num_blocks. The Triton fused staging kernel is byte-addressed
and needs no change.
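
The stride bug and the fix can be reproduced with shape arithmetic alone. The sketch below is not the ATOMKVByteCodec code: `bytes_per_block` is a hypothetical helper, and the block count, block size, and 2-byte element size are assumed values chosen for illustration. Only the latent width of 576 comes from the description above.

```python
from math import prod


def bytes_per_block(shape: tuple[int, ...], itemsize: int, num_blocks: int) -> int:
    """Per-block byte stride of one contiguous KV segment."""
    numel = prod(shape)
    if numel % num_blocks:
        raise ValueError("numel must be divisible by num_blocks")
    return numel * itemsize // num_blocks


NUM_BLOCKS, BLOCK_SIZE, LATENT = 8, 16, 576  # first two are assumed values

# MLA: token-major view, so shape[0] is the token count, not the block count.
mla_shape = (NUM_BLOCKS * BLOCK_SIZE, 1, LATENT)

# Explicit physical block count gives the real per-block stride.
correct = bytes_per_block(mla_shape, itemsize=2, num_blocks=NUM_BLOCKS)
# Treating shape[0] as the block count yields a per-token stride instead.
wrong = bytes_per_block(mla_shape, itemsize=2, num_blocks=mla_shape[0])

print(correct, wrong)
```

The wrong stride is smaller than the correct one by exactly a factor of BLOCK_SIZE, which is why block b would be read from b * bytes_per_token instead of b * bytes_per_block.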

- ATOMKVByteCodec: accept explicit num_blocks; per-block bytes from it;
  require contiguous + numel divisible by num_blocks (replaces the
  "same shape[0]" check). Falls back to shape[0] when num_blocks is None,
  preserving non-MLA behaviour.
- Thread num_physical_kvcache_blocks: model_runner.allocate_kv_cache ->
  forward_context.set_kv_cache_data -> connector.register_kv_caches ->
  codec. Optional num_blocks kwarg added to the connector base + mooncake +
  moriio impls (ignored there).
- build_lmcache_metadata: emit an MLA-shaped kv_shape (latent dim) for
  bookkeeping; storage stays opaque BINARY so use_mla remains False.
- Tests: MLA token-major block accounting + byte-identical round-trip.

Validated on DeepSeek-V3-5layer (real MLA, TP=2) end-to-end: offload save +
reload (cxs multi-round, round 2 hits cached:[~33k], no recompute).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document MLA (DeepSeek R1/V3, Kimi) KV-offload support in the connector
README: token-major latent cache layout, explicit num_blocks threading,
and BINARY opaque storage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>
…_UNALIGNED_HANDOFF

Unaligned HBM-prefix loads now always take the handoff path (recompute the
misaligned head up to the next chunk boundary, then load the aligned remainder
from CPU) instead of being gated behind the OFFLOAD_UNALIGNED_HANDOFF env var.
The env read and the gate check in _maybe_start_unaligned_handoff are removed;
the min_load / boundary guards are unchanged.
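
The handoff split can be sketched as below. This is an illustration, not the body of _maybe_start_unaligned_handoff: the function name, its parameters, and the exact form of the min_load and boundary guards are assumptions.

```python
from typing import Optional


def plan_unaligned_handoff(
    hbm_prefix: int, cached_len: int, chunk: int, min_load: int
) -> Optional[tuple[range, range]]:
    """Return (recompute_tokens, load_tokens), or None if no handoff applies."""
    if hbm_prefix % chunk == 0:
        return None  # already aligned: no misaligned head to recompute
    boundary = -(-hbm_prefix // chunk) * chunk  # round up to next chunk boundary
    # Assumed guards: skip when nothing, or too little, lies past the boundary.
    if boundary >= cached_len or cached_len - boundary < min_load:
        return None
    return range(hbm_prefix, boundary), range(boundary, cached_len)
```

For example, with a 300-token HBM prefix, 256-token chunks, and 1024 cached tokens, tokens 300 to 511 are recomputed and tokens 512 to 1023 are loaded.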

- connector.py: drop _allow_unaligned_handoff + OFFLOAD_UNALIGNED_HANDOFF read;
  handoff is now unconditional.
- README.md / README.zh-CN.md: remove the env var from the tuning table and the
  example commands, add a "removed" note, and document the always-on behaviour.
  Also track the previously-untracked zh-CN README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>
@zufayu zufayu requested review from JiaoliangYu and amd-ruitang3 and removed request for amd-ruitang3 June 26, 2026 07:50

2 participants