Optimize memory usage for KV cache #838

Open
hnwyllmm wants to merge 4 commits into master from task/2026051100116012737

hnwyllmm commented Jun 2, 2026 (Member)

Task Description

In small-memory configurations (e.g., the 2GB mini_mode) and during cold starts, the original KV Cache design statically pre-allocated large arrays and issued many small allocation requests. This caused severe memory fragmentation and a poor Used/Hold memory ratio, with Hold memory far higher than the memory actually in use.

This MR integrates two key optimization commits, refactoring the allocation model for three core memory labels of the KV Cache: CACHE_MAP_NODE, CACHE_MAP_BKT, and CACHE_MB_HANDLE. After optimization, the basic metadata startup overhead for the KV Cache in a 2GB memory specification dropped sharply from nearly 20MB to ~4MB. The Used/Hold ratio improved significantly, virtually eliminating memory fragmentation waste.

┌─────────────────────────┬──────────────────────────┬────────────────────────┬─────────────────────────────────────────────────┐
│ Memory Label            │ Pre-optimization Hold    │ Post-optimization Hold │ Optimization Effect / Savings                   │
├─────────────────────────┼──────────────────────────┼────────────────────────┼─────────────────────────────────────────────────┤
│ CACHE_MAP_NODE          │ ~832 KB (multi-instance) │ Shared global          │ Saves at least 832 KB base overhead             │
│ CACHE_MAP_BKT           │ ~8.0 MB                  │ ~2.0 MB                │ Reduced by 75%, Used/Hold ratio ~100%           │
│ CACHE_MB_HANDLE         │ ~8.0 MB (static full)    │ ~0 MB (on-demand)      │ Reduced ~100% (cold start), 8 KB increments     │
│ KvstCachWashStr         │ ~1.1 MB                  │ ~8000 B                │ Changed to temp variable, allocated when needed │
│ Total Metadata Overhead │ ~17 MB                   │ ~2.8 MB (cold start)   │ Cold start overhead reduced by 80%+             │
└─────────────────────────┴──────────────────────────┴────────────────────────┴─────────────────────────────────────────────────┘

Solution Description

2. Core Optimization Design & Implementation (Modifications & Design)

2.1 Shared Map Node Allocator (Target: CACHE_MAP_NODE)

  • Background & Problem:
    Originally, each cache instance (ObKVCacheInst, up to MAX_CACHE_NUM) held its own independent lock-free FIFO allocator ObLfFIFOAllocator node_allocator_. Since each allocator instance pre-allocated underlying buffers (blocks/chunks), this caused significant static memory waste as the number of registered cache instances increased.
  • Refactoring Solution:
    Refactored to use a global shared allocator. All ObKVCacheInst instances now share a single global node_allocator_ belonging to ObKVCacheMap.
  • Optimization Effect:
    • Directly saves at least 832KB of pre-allocated memory per tenant.
    • Prevents linear expansion of the allocator with increasing cache instance count.
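
To make the effect of sharing concrete, here is a minimal C++ sketch. It is not the actual change: `SketchNodeAllocator`, `PerInstanceCacheInst`, `SharedCacheInst`, `SketchCacheMap`, and the 16 KB reservation size are hypothetical stand-ins chosen for illustration. Only the idea comes from the description above: the map owns one allocator and every cache instance borrows it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative reservation size only; the real allocator's footprint is an assumption.
constexpr std::size_t kReservePerAllocator = 16 * 1024;

// Hypothetical stand-in for an allocator that reserves its buffer eagerly.
class SketchNodeAllocator {
public:
  SketchNodeAllocator() : buffer_(kReservePerAllocator) {}
  std::size_t hold_bytes() const { return buffer_.size(); }
private:
  std::vector<char> buffer_;  // mimics pre-allocated blocks/chunks
};

// Old layout: every cache instance embeds its own allocator.
struct PerInstanceCacheInst {
  SketchNodeAllocator node_allocator_;
};

// New layout: instances only hold a pointer to the allocator owned by the map.
struct SharedCacheInst {
  SketchNodeAllocator *node_allocator_;
};

struct SketchCacheMap {
  SketchNodeAllocator node_allocator_;  // single shared allocator
  std::vector<SharedCacheInst> insts_;

  explicit SketchCacheMap(std::size_t inst_count)
      : insts_(inst_count, SharedCacheInst{&node_allocator_}) {}
  // Copying would leave instances pointing at another map's allocator.
  SketchCacheMap(const SketchCacheMap &) = delete;
  SketchCacheMap &operator=(const SketchCacheMap &) = delete;

  std::size_t hold_bytes() const { return node_allocator_.hold_bytes(); }
};

// Total reserved memory when each of inst_count instances owns an allocator.
std::size_t per_instance_hold(std::size_t inst_count) {
  std::vector<PerInstanceCacheInst> insts(inst_count);
  std::size_t total = 0;
  for (const auto &inst : insts) total += inst.node_allocator_.hold_bytes();
  return total;
}
```

In this model the per-instance layout holds `inst_count * kReservePerAllocator` bytes, while the shared layout holds `kReservePerAllocator` bytes regardless of how many instances are registered.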

────────────────────────────────────────

2.2 Hash Bucket Pointer Contiguous Large Memory Allocation (Target: CACHE_MAP_BKT)

  • Background & Problem:
    In the old code, the bucket_cnt hash buckets in ObKVCacheMap were allocated via bucket_allocator_ through many small memory allocations (a loop of bucket_cnt times, each allocating only sizeof(Node*) * bucket_size_). Frequent small allocations came with huge allocator management metadata overhead, resulting in an extremely low Used/Hold ratio (in a 2GB spec, actual usage was ~1.6MB, but the allocator held ~8MB of physical memory).
  • Refactoring Solution:
    Changed to a single large allocation, split by offset. During init, one contiguous block (sizeof(Node*) * bucket_num_) is allocated via bucket_allocator_, and each bucket is assigned its slice of that block by pointer offset. Deallocation requires only a single free of the head pointer.
  • Optimization Effect:
    • Dramatic memory reduction: In a 2GB configuration, the Hold memory for CACHE_MAP_BKT dropped sharply from 8MB to ~2MB.
    • Fragmentation eliminated: With hundreds of small allocation requests removed, the Used/Hold ratio is now nearly 100%.
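
The single-allocation layout can be sketched as follows. This is a simplified illustration under stated assumptions, not the real ObKVCacheMap code: `SketchBucketTable`, `Node`, and the use of `std::malloc` in place of bucket_allocator_ are hypothetical, and the sketch assumes the total slot count equals `bucket_cnt * bucket_size`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct Node { Node *next; };  // hypothetical minimal hash node

class SketchBucketTable {
public:
  SketchBucketTable() = default;
  SketchBucketTable(const SketchBucketTable &) = delete;
  SketchBucketTable &operator=(const SketchBucketTable &) = delete;
  ~SketchBucketTable() { destroy(); }

  // One allocation covers every bucket; std::malloc stands in for the real allocator.
  bool init(std::size_t bucket_cnt, std::size_t bucket_size) {
    if (slots_ != nullptr || bucket_cnt == 0 || bucket_size == 0) return false;
    const std::size_t total = bucket_cnt * bucket_size;  // overflow check omitted in this sketch
    slots_ = static_cast<Node **>(std::malloc(sizeof(Node *) * total));
    if (slots_ == nullptr) return false;
    ++alloc_calls_;
    for (std::size_t i = 0; i < total; ++i) slots_[i] = nullptr;
    bucket_cnt_ = bucket_cnt;
    bucket_size_ = bucket_size;
    return true;
  }

  // Bucket i is just an offset into the contiguous block.
  Node **bucket(std::size_t i) { return slots_ + i * bucket_size_; }

  // A single free of the head pointer releases all buckets.
  void destroy() {
    std::free(slots_);
    slots_ = nullptr;
    bucket_cnt_ = 0;
    bucket_size_ = 0;
  }

  std::size_t alloc_calls() const { return alloc_calls_; }

private:
  Node **slots_ = nullptr;
  std::size_t bucket_cnt_ = 0;
  std::size_t bucket_size_ = 0;
  std::size_t alloc_calls_ = 0;
};
```

Because the allocator sees one request instead of `bucket_cnt` requests, per-allocation bookkeeping is paid once, which is the mechanism behind the improved Used/Hold ratio described above.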

────────────────────────────────────────

2.3 Two-Dimensional Segmented Array Dynamic On-Demand Allocation (Target: CACHE_MB_HANDLE)

  • Background & Problem:
    Previously, ObKVCacheStore would pre-allocate a flat, one-dimensional large array mb_handles_ at startup based on max_cache_size to hold all memory block handles. In small memory specs, due to constraints like Hazard Pointer retirement limits and thread count constants, the calculated max_mb_num remained large. This forced a mandatory pre-allocation of ~8MB memory during a cold start, even with no data. Switching to a purely dynamic pool allocation would impact the performance of global traversal in high-frequency background tasks like refresh_score and wash, and introduce concurrency risks.
  • Refactoring Solution:
    Introduced a 2D segmented array ObKVMBHandleArray implementing a dynamically expandable Block mechanism.
    1. Segmentation by Block: Uses the system's standard large block size OB_MALLOC_NORMAL_BLOCK_SIZE (8KB) as the physical allocation unit (BLOCK), replacing the flat large array. Handles are stored contiguously within a BLOCK, maximizing memory utilization.
    2. Zero Pre-allocation at Cold Start & On-Demand Growth: No BLOCK physical memory is allocated at startup. When try_supply_mb is triggered to supply a block, ensure_blocks is called on-demand to expand and initialize the corresponding BLOCK.
    3. Traversal Performance & Lock-Free Safety: Handle location uses an extremely lightweight 2D array index calculation (idx / HANDLE_BLOCK_SIZE and idx % HANDLE_BLOCK_SIZE), preserving the efficiency of the original traversal logic. Safety during concurrent BLOCK expansion by multiple threads is ensured via ATOMIC_LOAD and ATOMIC_BCAS.
    4. Decoupled Pool Pre-allocation: The mb_handles_pool_ was modified to be initialized via an allocator, removing its dependency on the physical memory continuity of the original one-dimensional array.
  • Optimization Effect:
    • The memory overhead for CACHE_MB_HANDLE during startup is reduced to nearly zero (only a small number of active blocks are allocated on demand).
    • Completely eliminates the 8MB memory waste during cold starts, achieving smooth, stepwise growth of handle memory as data volume increases (in 8KB increments).
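
The on-demand block scheme can be sketched roughly as below. This is an illustration only, not the ObKVMBHandleArray implementation: `SketchHandle`, `SketchHandleArray`, `kHandlesPerBlock`, and the use of `std::atomic` with `compare_exchange_strong` (standing in for ATOMIC_LOAD and ATOMIC_BCAS) are assumptions. The sketch also assumes a small table of block pointers is created up front while the 8 KB blocks themselves are allocated lazily.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical handle layout; the real handle has different fields.
struct SketchHandle {
  int status = 0;
  void *mem_block = nullptr;
};

constexpr std::size_t kBlockBytes = 8 * 1024;  // 8 KB allocation unit
constexpr std::size_t kHandlesPerBlock = kBlockBytes / sizeof(SketchHandle);

class SketchHandleArray {
public:
  // Only the pointer table is created here; no handle block is allocated yet.
  explicit SketchHandleArray(std::size_t max_handles)
      : blocks_((max_handles + kHandlesPerBlock - 1) / kHandlesPerBlock) {
    for (auto &slot : blocks_) slot.store(nullptr, std::memory_order_relaxed);
  }
  SketchHandleArray(const SketchHandleArray &) = delete;
  SketchHandleArray &operator=(const SketchHandleArray &) = delete;
  ~SketchHandleArray() {
    for (auto &slot : blocks_) delete[] slot.load(std::memory_order_relaxed);
  }

  // Make sure every block covering handles [0, handle_cnt) exists.
  bool ensure_blocks(std::size_t handle_cnt) {
    const std::size_t need = (handle_cnt + kHandlesPerBlock - 1) / kHandlesPerBlock;
    if (need > blocks_.size()) return false;
    for (std::size_t i = 0; i < need; ++i) {
      if (blocks_[i].load(std::memory_order_acquire) != nullptr) continue;
      SketchHandle *fresh = new (std::nothrow) SketchHandle[kHandlesPerBlock];
      if (fresh == nullptr) return false;
      SketchHandle *expected = nullptr;
      // If another thread published this block first, drop our copy.
      if (!blocks_[i].compare_exchange_strong(expected, fresh,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
        delete[] fresh;
      }
    }
    return true;
  }

  // Two-level lookup: block index, then offset inside the block.
  SketchHandle *at(std::size_t idx) {
    const std::size_t block_idx = idx / kHandlesPerBlock;
    if (block_idx >= blocks_.size()) return nullptr;
    SketchHandle *block = blocks_[block_idx].load(std::memory_order_acquire);
    return block != nullptr ? &block[idx % kHandlesPerBlock] : nullptr;
  }

  std::size_t allocated_blocks() const {
    std::size_t n = 0;
    for (const auto &slot : blocks_) {
      if (slot.load(std::memory_order_acquire) != nullptr) ++n;
    }
    return n;
  }

private:
  std::vector<std::atomic<SketchHandle *>> blocks_;
};
```

A traversal in this model walks indices `0 .. n-1` and calls `at(idx)`, so it keeps the flat-array iteration shape while memory grows one block at a time. Losing the compare-and-swap race costs one wasted allocation, which keeps the expansion path lock-free.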

Passed Regressions

Upgrade Compatibility

Other Information

  • Related links:
    • DIMA-2026051100116012737

Release Note

hnwyllmm commented Jun 2, 2026 (Member, Author)

The mapping Dima issue is detailed in the Optimization Analysis.

hnwyllmm commented Jun 2, 2026 (Member, Author)

Core Testing
Execution result: [FAILED] Task execution failed. GID: 4758000268 Details

hnwyllmm force-pushed the task/2026051100116012737 branch from 082d9f5 to cbd7570 on June 2, 2026 05:54
hnwyllmm changed the title from "Optimize memory usage for KV Cache" to "Optimize memory usage for KV cache" on Jun 2, 2026

hnwyllmm commented Jun 2, 2026 (Member, Author)

Core test
Execution process: [FAILED] Task run failed. GID: 4758000268 Details

hnwyllmm commented Jun 2, 2026 (Member, Author)

Core test
Execution process: [SUCCESS] Task ran successfully. GID: 4758000442 Details
