[Feature] OFFLOAD: standalone LMCache CPU/NVMe KV-offload connector#1318
Open
yhl-amd wants to merge 7 commits into
Force-pushed from 7145965 to b9b7c24.
Add a standalone KV-offload subsystem that offloads the ATOM KV cache to LMCache-backed CPU/NVMe storage and reloads it on cache hits, avoiding prefill recompute for evicted prefixes.

- New atom/kv_transfer/offload package: LMCacheOffloadConnector, the ATOM<->LMCache GPU connector, a byte codec for ATOM's packed KV layout, a Triton staging kernel, plus metadata and config.
- Wire the connector into the disaggregation base/factory/types and aggregate per-worker finished/failed transfer states.
- Unit tests for the connector and byte-codec round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eduler

Drive offload load/save from the engine loop and scheduler:

- Dispatch async KV load after connector metadata, poll worker transfer status, and advance idle KV transfer when no forward batch runs.
- Defer block free until a background D2H save has read the KV, and wake parked prefills for local recompute on a load miss (failed_recving).
- Handle chunked-prefill deferred output across the offload park/resume boundary so stale sampled tokens are dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
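The deferred block free described above can be pictured with a minimal sketch. The class name `DeferredBlockFree`, its methods, and the `save_in_flight` flag are hypothetical illustrations and are not taken from the PR; the sketch only shows the idea that blocks still being read by a background D2H save stay out of the free pool until the save reports completion.

```python
from typing import Dict, List


class DeferredBlockFree:
    """Hypothetical bookkeeping: hold blocks until their D2H save finishes."""

    def __init__(self) -> None:
        self.free_blocks: List[int] = []
        # request id -> block ids whose KV is still being copied to host
        self._pending: Dict[str, List[int]] = {}

    def release(self, req_id: str, block_ids: List[int], save_in_flight: bool) -> None:
        if save_in_flight:
            # The background save is still reading these blocks: park them.
            self._pending[req_id] = list(block_ids)
        else:
            self.free_blocks.extend(block_ids)

    def on_save_finished(self, req_id: str) -> None:
        # The save has read the KV, so the blocks can be reused safely.
        self.free_blocks.extend(self._pending.pop(req_id, []))


pool = DeferredBlockFree()
pool.release("req-a", [1, 2], save_in_flight=True)
pool.release("req-b", [3], save_in_flight=False)
pool.on_save_finished("req-a")
```

Freeing only on the completion callback avoids a race in which a new request overwrites a block before the save has copied it out.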
Honor max_completion_tokens (and the max_tokens alias) in the OpenAI completion/chat protocol and server so offload benchmarks can bound generation length. Adds protocol and server-helper unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
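As a rough illustration of honoring both fields, the sketch below resolves a single generation limit from a request. The helper name `resolve_max_tokens`, the `default` parameter, and the precedence (prefer `max_completion_tokens`, then fall back to `max_tokens`) are assumptions made for this example; the commit message does not state how the PR orders the two.

```python
from typing import Optional


def resolve_max_tokens(
    max_completion_tokens: Optional[int],
    max_tokens: Optional[int],
    default: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical helper: pick one generation limit from the two fields."""
    # Assumed precedence: the newer field wins, the alias is the fallback.
    for value in (max_completion_tokens, max_tokens):
        if value is not None:
            if value < 1:
                raise ValueError("token limit must be a positive integer")
            return value
    return default


limit = resolve_max_tokens(max_completion_tokens=None, max_tokens=128)
```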
Force-pushed from b9b7c24 to 6bb1fbc.
Document the ATOM standalone lmcache_offload connector: design, module map, scheduler/worker architecture, byte codec and AITER layout bridge, MemoryObj/segment layout, completion protocol, reload decision and chunk-alignment handoff, correctness/fp8/failure handling, the LMCache reuse-vs-override boundary, configuration, benchmarks, and tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>
    if not self.kv_transfer_enabled:
        return
    connector = getattr(self.scheduler, "kv_connector", None)
    if connector is None or not getattr(connector, "is_offload", False):
Collaborator
We use the mooncake and moriio connectors in PD. Can we avoid checking the connector's "is_offload" attribute here? We should use a new variable and plug it into the existing connectors.
The byte codec assumed a block-major KV layout (tensor.shape[0] == block count), which holds for MHA/GQA but not for MLA. MLA stores a single per-layer latent cache (kv_lora_rank + qk_rope_head_dim, e.g. 576) viewed token-major as (num_blocks * block_size, 1, latent), with no separate V or scale tensors. So shape[0] is the token count, and the codec computed a per-token (not per-block) byte stride, corrupting the offloaded KV.

Both layouts share an identical contiguous byte layout (block b always starts at b * bytes_per_block), so instead of branching we take the physical block count explicitly and derive each segment's per-block stride as segment_bytes / num_blocks. The Triton fused staging kernel is byte-addressed and needs no change.

- ATOMKVByteCodec: accept explicit num_blocks; derive per-block bytes from it; require contiguous tensors with numel divisible by num_blocks (replaces the "same shape[0]" check). Falls back to shape[0] when num_blocks is None, preserving non-MLA behaviour.
- Thread num_physical_kvcache_blocks: model_runner.allocate_kv_cache -> forward_context.set_kv_cache_data -> connector.register_kv_caches -> codec. Optional num_blocks kwarg added to the connector base + mooncake + moriio impls (ignored there).
- build_lmcache_metadata: emit an MLA-shaped kv_shape (latent dim) for bookkeeping; storage stays opaque BINARY so use_mla remains False.
- Tests: MLA token-major block accounting + byte-identical round-trip.

Validated on DeepSeek-V3-5layer (real MLA, TP=2) end-to-end: offload save + reload (cxs multi-round, round 2 hits cached:[~33k], no recompute).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
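The stride fix in this commit can be illustrated with a small shape-only sketch. The function names `bytes_per_block` and `block_byte_range` are hypothetical and do not come from ATOMKVByteCodec; the block size of 16, the 2-byte element size, and the MHA shape are also assumed purely for the arithmetic. Only the rule itself (segment bytes divided by the explicit block count, with a shape[0] fallback) follows the commit message.

```python
from math import prod
from typing import Optional, Sequence, Tuple


def bytes_per_block(
    shape: Sequence[int], itemsize: int, num_blocks: Optional[int] = None
) -> int:
    """Per-block byte stride of one contiguous KV segment (illustrative)."""
    if num_blocks is None:
        # Fallback for block-major layouts (MHA/GQA): shape[0] is the block count.
        num_blocks = shape[0]
    numel = prod(shape)
    if num_blocks <= 0 or numel % num_blocks != 0:
        raise ValueError("numel must be divisible by a positive num_blocks")
    return (numel // num_blocks) * itemsize


def block_byte_range(block_id: int, stride: int) -> Tuple[int, int]:
    """Block b always starts at b * bytes_per_block in the contiguous buffer."""
    start = block_id * stride
    return start, start + stride


# MLA latent cache viewed token-major: (num_blocks * block_size, 1, latent).
mla_shape = (4 * 16, 1, 576)
wrong = bytes_per_block(mla_shape, itemsize=2)                # per-token stride
right = bytes_per_block(mla_shape, itemsize=2, num_blocks=4)  # per-block stride

# Block-major MHA/GQA segment: both paths agree.
mha_shape = (4, 16, 8, 128)
same = bytes_per_block(mha_shape, itemsize=2) == bytes_per_block(
    mha_shape, itemsize=2, num_blocks=4
)
```

With these assumed sizes, the shape[0] fallback on the MLA view yields 1152 bytes (one token), while the explicit block count yields 18432 bytes (16 tokens), which is the per-block stride the codec needs.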
Document MLA (DeepSeek R1/V3, Kimi) KV-offload support in the connector README: token-major latent cache layout, explicit num_blocks threading, and BINARY opaque storage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>
…_UNALIGNED_HANDOFF

Unaligned HBM-prefix loads now always take the handoff path (recompute the misaligned head up to the next chunk boundary, then load the aligned remainder from CPU) instead of being gated behind the OFFLOAD_UNALIGNED_HANDOFF env var. The env read and the gate check in _maybe_start_unaligned_handoff are removed; the min_load / boundary guards are unchanged.

- connector.py: drop _allow_unaligned_handoff + OFFLOAD_UNALIGNED_HANDOFF read; handoff is now unconditional.
- README.md / README.zh-CN.md: remove the env var from the tuning table and the example commands, add a "removed" note, and document the always-on behaviour. Also track the previously-untracked zh-CN README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yihonglie <hyi@amd.com>
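A minimal sketch of the handoff arithmetic, under stated assumptions: the function `plan_unaligned_handoff`, its parameters, and the exact form of the two guards are hypothetical and are not taken from connector.py. It only shows the split described above, where the misaligned head is recomputed up to the next chunk boundary and the aligned remainder is loaded.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # half-open token range [start, end)


def plan_unaligned_handoff(
    hbm_prefix_tokens: int,
    cpu_hit_tokens: int,
    chunk_size: int,
    min_load: int = 0,
) -> Optional[Tuple[Range, Range]]:
    """Return (recompute_range, load_range), or None if no handoff applies."""
    if hbm_prefix_tokens % chunk_size == 0:
        return None  # prefix already ends on a chunk boundary
    # Round the HBM prefix length up to the next chunk boundary.
    boundary = -(-hbm_prefix_tokens // chunk_size) * chunk_size
    if boundary >= cpu_hit_tokens:
        return None  # assumed boundary guard: nothing aligned is left to load
    if cpu_hit_tokens - boundary < min_load:
        return None  # assumed min_load guard: remainder too small to be worth it
    return (hbm_prefix_tokens, boundary), (boundary, cpu_hit_tokens)


plan = plan_unaligned_handoff(hbm_prefix_tokens=100, cpu_hit_tokens=512, chunk_size=256)
```

In this example the first 100 tokens are already in HBM, tokens 100 to 256 are recomputed, and tokens 256 to 512 are loaded from CPU.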
Summary
Adds a standalone KV-offload subsystem that offloads the ATOM KV cache to
LMCache-backed CPU/NVMe storage and reloads it on cache hits, so evicted
prefixes are restored instead of recomputed.
Squashed into 3 logical commits:
- `[Feature] OFFLOAD: add LMCache CPU/NVMe KV-offload subsystem`: new `atom/kv_transfer/offload` package (`LMCacheOffloadConnector`, ATOM↔LMCache GPU connector, byte codec for ATOM's packed KV layout, Triton staging kernel, metadata/config) plus disaggregation base/factory/types/aggregator wiring.
- `[Feature] OFFLOAD: integrate KV offload load/save into engine and scheduler`: async load dispatch, worker transfer polling, and idle KV-transfer advance; deferred block free until the background D2H save completes; wake parked prefills for local recompute on a load miss; chunked-prefill deferred-output handling across park/resume.
- `[Frontend] Support max completion tokens in OpenAI API`: honor `max_completion_tokens` / `max_tokens` so offload benchmarks can bound generation length.
Test plan
- `pytest tests/test_lmcache_offload_connector.py` (connector + byte-codec round-trip)
- `pytest tests/test_scheduler.py`
- `pytest tests/entrypoints/test_protocol.py tests/entrypoints/test_api_server_helpers.py`

🤖 Generated with Claude Code