Distributed KV Cache Design for LLM Inference #860

AlexStocks · 2026-01-09T07:01:19Z

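Overview

This document describes how to implement a distributed KV Cache for LLM inference in the Dubbo-Go-Pixiu AI gateway. Offloading the KV Cache to a Redis cluster makes the inference tier stateless, which is intended to cut inference cost substantially.

Background and Pain Points

Limitations of the current LLM inference architecture

State coupling
- The KV Cache is bound to the GPU memory or local memory of one specific inference server
- A multi-turn conversation must be routed to the same server to hit the cache
- A cache miss forces the prefill stage to run again, which can cost up to 10x as much as a hit
Limited elastic scaling
- A server crash loses its cache
- Inference instances cannot be scaled out on demand
- The load balancer has to be aware of cache placement, which complicates scheduling
Low resource utilization
- Cache space must be reserved up front to avoid conflicts
- Servers cannot run at full load
- GPU resources are wasted

Solution: a stateless LLM inference architecture

Following the way Internet backend architectures evolved, the KV Cache is separated from the inference instances and offloaded to distributed storage (a Redis cluster). The goals are:

Stateless inference service: any instance can handle any request
Globally shared KV Cache: every turn of a conversation can hit the cache from any instance (a 100% hit ratio is the ideal case; the benchmark later in this document assumes 85%)
Elastic scaling: add or remove inference instances on demand, with no state to migrate
Lower cost: avoid recomputing prefill (target: 50%+ savings; the worked example later in this document yields 34%)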

Architecture Design

Overall architecture

┌──────────────────────────────────────────────────────────────┐
│                      AI Client Applications                  │
└────────────────────────────┬─────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────┐
│               Pixiu AI Gateway (Stateless)                   │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  AI KV Cache Filter (dgp.filter.ai.kvcache)            │  │
│  │  - Cache Key Generation                                │  │
│  │  - KV Cache Offload/Load                               │  │
│  │  - Cache Hit/Miss Handling                             │  │
│  └────────────────────────────────────────────────────────┘  │
└────────────────────────────┬─────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│Redis Cluster  │    │Redis Cluster  │    │Redis Cluster  │
│  Node 1       │    │  Node 2       │    │  Node 3       │
│(Master+Slave) │    │(Master+Slave) │    │(Master+Slave) │
└───────┬───────┘    └───────┬───────┘    └───────┬───────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│LLM Inference  │    │LLM Inference  │    │LLM Inference  │
│  Instance 1   │    │  Instance 2   │    │  Instance N   │
│ (SGLang/vLLM) │    │ (SGLang/vLLM) │    │ (SGLang/vLLM) │
└───────────────┘    └───────────────┘    └───────────────┘

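Core Components

1. AI KV Cache Filter

Sits in the request processing chain of the Pixiu gateway and is responsible for:

Cache key generation: builds a unique cache key from the session ID, model name, and prompt hash
KV Cache offload: after inference completes, stores the KV Cache in the Redis cluster
KV Cache load: before inference, reads the historical KV Cache from the Redis cluster
Cache eviction: supports TTL, LRU, and capacity-limit policies
Performance optimization: zero-copy, batch operations, asynchronous writes

2. Redis Cluster

Deployment mode: Redis Cluster (sharding plus master-replica replication)
Data structure: String values holding the serialized KV Cache (a single String value is limited to 512 MB)
Persistence: hybrid RDB + AOF persistence to guard against data loss
High availability: Redis Cluster's built-in failure detection and automatic replica promotion (Sentinel applies only to non-cluster master-replica deployments)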
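
How sharding decides which node owns a cache entry can be sketched as follows. Redis Cluster maps every key to one of 16384 hash slots using CRC16 (XMODEM variant), and if the key contains a non-empty `{hash tag}` only the tag is hashed. The key layout with a `{sid:...}` tag is an assumption made for this illustration, not something the design above prescribes; it shows one way to keep all entries of a session on the same shard.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 implements CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// keySlot returns the Redis Cluster hash slot (0..16383) for key.
// If the key contains a non-empty {hash tag}, only the tag is hashed.
func keySlot(key string) int {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if end := strings.IndexByte(key[open+1:], '}'); end > 0 {
			key = key[open+1 : open+1+end]
		}
	}
	return int(crc16([]byte(key)) % 16384)
}

func main() {
	// Hypothetical key layout: the session ID is wrapped in a hash tag,
	// so both rounds of the same session map to the same slot.
	fmt.Println(keySlot("pixiu:kvcache:{sid:abc}:round:1"))
	fmt.Println(keySlot("pixiu:kvcache:{sid:abc}:round:2"))
}
```

Co-locating a session's keys matters mainly for multi-key commands and transactions, which Redis Cluster only allows within a single slot.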

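3. LLM Inference Instances (SGLang/vLLM)

State separation: no local KV Cache is kept; instances rely entirely on Redis
Interface extension: instances accept an externally supplied KV Cache as input
Compatibility: adapts to SGLang RadixAttention and vLLM PagedAttention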

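Technical Implementation

3.1 Directory Layout

pkg/filter/ai/kvcache/
├── kvcache.go              # Filter main logic
├── cache_manager.go        # Cache manager
├── redis_backend.go        # Redis backend implementation
├── cache_key.go            # Cache key generator
├── serializer.go           # KV Cache serialization/deserialization
├── metrics.go              # Metrics collection
└── config.go               # Configuration definitions

3.2 Core Interface Design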

KVCacheManager Interface

// pkg/filter/ai/kvcache/cache_manager.go
package kvcache

import (
	"context"
	"time"
)

// KVCacheManager manages KV Cache entries for LLM inference.
type KVCacheManager interface {
	// Get fetches a KV Cache entry.
	Get(ctx context.Context, key string) (*KVCache, error)

	// Set stores a KV Cache entry with the given TTL.
	Set(ctx context.Context, key string, cache *KVCache, ttl time.Duration) error

	// Delete removes a KV Cache entry.
	Delete(ctx context.Context, key string) error

	// Exists reports whether a KV Cache entry is present.
	Exists(ctx context.Context, key string) (bool, error)

	// GetMulti fetches several KV Cache entries at once.
	GetMulti(ctx context.Context, keys []string) (map[string]*KVCache, error)

	// SetMulti stores several KV Cache entries at once.
	SetMulti(ctx context.Context, caches map[string]*KVCache, ttl time.Duration) error

	// Stats returns cache statistics.
	Stats() CacheStats

	// Close shuts the cache manager down.
	Close() error
}

// KVCache holds the KV Cache data of one LLM inference.
type KVCache struct {
	SessionID  string            // session ID
	ModelName  string            // model name
	PromptHash string            // hash of the prompt
	Keys       []byte            // K cache (serialized byte stream)
	Values     []byte            // V cache (serialized byte stream)
	TokenCount int               // number of tokens
	LayerCount int               // number of layers
	HeadCount  int               // number of attention heads
	HiddenSize int               // hidden size
	CreatedAt  time.Time         // creation time
	AccessedAt time.Time         // last access time
	Metadata   map[string]string // extension metadata
}

// CacheStats holds cache statistics.
type CacheStats struct {
	HitCount        int64   // number of cache hits
	MissCount       int64   // number of cache misses
	HitRatio        float64 // cache hit ratio
	TotalKeys       int64   // total number of cached keys
	TotalSizeBytes  int64   // total cache size in bytes
	AvgGetLatencyMs float64 // average read latency in milliseconds
	AvgSetLatencyMs float64 // average write latency in milliseconds
}

Redis Backend Implementation

// pkg/filter/ai/kvcache/redis_backend.go
package kvcache

import (
	"context"
	"encoding/json"
	"fmt"
	"sync/atomic"
	"time"

	"github.com/go-redis/redis/v8"
)

// RedisKVCacheManager is the Redis-backed KV Cache manager.
type RedisKVCacheManager struct {
	client    redis.UniversalClient
	keyPrefix string

	// Hit/miss counters
	hitCount  int64
	missCount int64

	// Latency accounting
	getTotalLatency int64 // nanoseconds
	setTotalLatency int64 // nanoseconds
	getCount        int64
	setCount        int64
}

// NewRedisKVCacheManager creates a Redis KV Cache manager.
func NewRedisKVCacheManager(config *RedisConfig) (*RedisKVCacheManager, error) {
	var client redis.UniversalClient
	
	if config.ClusterMode {
		// Redis Cluster mode
		client = redis.NewClusterClient(&redis.ClusterOptions{
			Addrs:        config.Addrs,
			Password:     config.Password,
			PoolSize:     config.PoolSize,
			MinIdleConns: config.MinIdleConns,
			DialTimeout:  config.DialTimeout,
			ReadTimeout:  config.ReadTimeout,
			WriteTimeout: config.WriteTimeout,
		})
	} else {
		// Standalone Redis mode (a Sentinel deployment would need redis.NewFailoverClient instead)
		client = redis.NewClient(&redis.Options{
			Addr:         config.Addrs[0],
			Password:     config.Password,
			DB:           config.DB,
			PoolSize:     config.PoolSize,
			MinIdleConns: config.MinIdleConns,
			DialTimeout:  config.DialTimeout,
			ReadTimeout:  config.ReadTimeout,
			WriteTimeout: config.WriteTimeout,
		})
	}
	
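	// Verify connectivity before returning the manager
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := client.Ping(ctx).Err(); err != nil {
		_ = client.Close() // do not leak the connection pool on failure
		return nil, fmt.Errorf("failed to connect to Redis: %w", err)
	}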
	
	return &RedisKVCacheManager{
		client:    client,
		keyPrefix: config.KeyPrefix,
	}, nil
}

// Get fetches a KV Cache entry and records latency and hit/miss counters.
func (m *RedisKVCacheManager) Get(ctx context.Context, key string) (*KVCache, error) {
	start := time.Now()
	defer func() {
		atomic.AddInt64(&m.getTotalLatency, time.Since(start).Nanoseconds())
		atomic.AddInt64(&m.getCount, 1)
	}()

	redisKey := m.makeRedisKey(key)
	data, err := m.client.Get(ctx, redisKey).Bytes()
	if err == redis.Nil {
		atomic.AddInt64(&m.missCount, 1)
		return nil, ErrCacheNotFound
	}
	if err != nil {
		return nil, fmt.Errorf("failed to get cache from Redis: %w", err)
	}

	// Deserialize
	var cache KVCache
	if err := json.Unmarshal(data, &cache); err != nil {
		return nil, fmt.Errorf("failed to unmarshal cache: %w", err)
	}

	// Update the access time on the returned copy only; it is not written back to Redis
	cache.AccessedAt = time.Now()
	atomic.AddInt64(&m.hitCount, 1)

	return &cache, nil
}

// Set stores a KV Cache entry with the given TTL.
func (m *RedisKVCacheManager) Set(ctx context.Context, key string, cache *KVCache, ttl time.Duration) error {
	start := time.Now()
	defer func() {
		atomic.AddInt64(&m.setTotalLatency, time.Since(start).Nanoseconds())
		atomic.AddInt64(&m.setCount, 1)
	}()
	
	// Serialize
	data, err := json.Marshal(cache)
	if err != nil {
		return fmt.Errorf("failed to marshal cache: %w", err)
	}
	
	redisKey := m.makeRedisKey(key)
	if err := m.client.Set(ctx, redisKey, data, ttl).Err(); err != nil {
		return fmt.Errorf("failed to set cache to Redis: %w", err)
	}
	
	return nil
}

// GetMulti fetches several KV Cache entries in one pipeline round trip.
func (m *RedisKVCacheManager) GetMulti(ctx context.Context, keys []string) (map[string]*KVCache, error) {
	if len(keys) == 0 {
		return make(map[string]*KVCache), nil
	}

	// Queue one GET per key on a pipeline
	pipe := m.client.Pipeline()
	cmds := make([]*redis.StringCmd, len(keys))
	for i, key := range keys {
		cmds[i] = pipe.Get(ctx, m.makeRedisKey(key))
	}

	// redis.Nil only means that some keys are missing, which is expected
	if _, err := pipe.Exec(ctx); err != nil && !isRedisNilError(err) {
		return nil, fmt.Errorf("failed to execute pipeline: %w", err)
	}

	// Collect the results; missing or undecodable entries are skipped
	result := make(map[string]*KVCache, len(keys))
	for i, cmd := range cmds {
		data, err := cmd.Bytes()
		if err == redis.Nil {
			atomic.AddInt64(&m.missCount, 1)
			continue
		}
		if err != nil {
			continue // skip keys that failed
		}

		var cache KVCache
		if err := json.Unmarshal(data, &cache); err != nil {
			continue // skip entries that fail to deserialize
		}

		cache.AccessedAt = time.Now()
		result[keys[i]] = &cache
		atomic.AddInt64(&m.hitCount, 1)
	}

	return result, nil
}

// Stats returns cache statistics (TotalKeys and TotalSizeBytes are not populated here).
func (m *RedisKVCacheManager) Stats() CacheStats {
	hitCount := atomic.LoadInt64(&m.hitCount)
	missCount := atomic.LoadInt64(&m.missCount)
	totalRequests := hitCount + missCount
	
	var hitRatio float64
	if totalRequests > 0 {
		hitRatio = float64(hitCount) / float64(totalRequests)
	}
	
	getCount := atomic.LoadInt64(&m.getCount)
	setCount := atomic.LoadInt64(&m.setCount)
	getTotalLatency := atomic.LoadInt64(&m.getTotalLatency)
	setTotalLatency := atomic.LoadInt64(&m.setTotalLatency)
	
	var avgGetLatencyMs, avgSetLatencyMs float64
	if getCount > 0 {
		avgGetLatencyMs = float64(getTotalLatency) / float64(getCount) / 1e6
	}
	if setCount > 0 {
		avgSetLatencyMs = float64(setTotalLatency) / float64(setCount) / 1e6
	}
	
	return CacheStats{
		HitCount:        hitCount,
		MissCount:       missCount,
		HitRatio:        hitRatio,
		AvgGetLatencyMs: avgGetLatencyMs,
		AvgSetLatencyMs: avgSetLatencyMs,
	}
}

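// makeRedisKey builds the namespaced Redis key.
func (m *RedisKVCacheManager) makeRedisKey(key string) string {
	return fmt.Sprintf("%s:%s", m.keyPrefix, key)
}

// Delete removes a KV Cache entry (required by the KVCacheManager interface).
func (m *RedisKVCacheManager) Delete(ctx context.Context, key string) error {
	return m.client.Del(ctx, m.makeRedisKey(key)).Err()
}

// Exists reports whether a KV Cache entry is present (required by the KVCacheManager interface).
func (m *RedisKVCacheManager) Exists(ctx context.Context, key string) (bool, error) {
	n, err := m.client.Exists(ctx, m.makeRedisKey(key)).Result()
	return n > 0, err
}

// Close closes the underlying Redis client.
func (m *RedisKVCacheManager) Close() error {
	return m.client.Close()
}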

var ErrCacheNotFound = fmt.Errorf("cache not found")

func isRedisNilError(err error) bool {
	return err == redis.Nil
}

Cache Key Generator

// pkg/filter/ai/kvcache/cache_key.go
package kvcache

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// CacheKeyGenerator builds cache keys from request attributes.
type CacheKeyGenerator struct {
	includeModel     bool
	includeSessionID bool
	includePrompt    bool
	includeParams    bool
}

// NewCacheKeyGenerator creates a cache key generator.
func NewCacheKeyGenerator(config *CacheKeyConfig) *CacheKeyGenerator {
	return &CacheKeyGenerator{
		includeModel:     config.IncludeModel,
		includeSessionID: config.IncludeSessionID,
		includePrompt:    config.IncludePrompt,
		includeParams:    config.IncludeParams,
	}
}

// Generate builds the cache key. It returns an empty string if no part is enabled or present.
func (g *CacheKeyGenerator) Generate(req *CacheKeyRequest) string {
	var parts []string

	// 1. Session ID (the filter skips caching when the session ID is empty)
	if g.includeSessionID && req.SessionID != "" {
		parts = append(parts, fmt.Sprintf("sid:%s", req.SessionID))
	}

	// 2. Model Name
	if g.includeModel && req.ModelName != "" {
		parts = append(parts, fmt.Sprintf("model:%s", req.ModelName))
	}

	// 3. Prompt Hash
	if g.includePrompt && req.Prompt != "" {
		promptHash := hashString(req.Prompt)
		parts = append(parts, fmt.Sprintf("prompt:%s", promptHash))
	}

	// 4. Parameters Hash (optional)
	if g.includeParams && len(req.Parameters) > 0 {
		paramsHash := hashParams(req.Parameters)
		parts = append(parts, fmt.Sprintf("params:%s", paramsHash))
	}

	// 5. Round Number (turn index within a multi-turn conversation)
	if req.RoundNumber > 0 {
		parts = append(parts, fmt.Sprintf("round:%d", req.RoundNumber))
	}

	return strings.Join(parts, ":")
}

// CacheKeyRequest carries the inputs for cache key generation.
type CacheKeyRequest struct {
	SessionID   string         // session ID
	ModelName   string         // model name
	Prompt      string         // prompt of the current round
	Parameters  map[string]any // inference parameters (temperature, top_p, ...)
	RoundNumber int            // round within the conversation (starts at 1)
}

// hashString returns a truncated SHA-256 hash of s.
func hashString(s string) string {
	hash := sha256.Sum256([]byte(s))
	return hex.EncodeToString(hash[:])[:16] // keep the first 16 hex characters (64 bits)
}

// hashParams hashes a parameter map deterministically.
func hashParams(params map[string]any) string {
	// Sort the keys so the result does not depend on map iteration order
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	// Build a canonical "k=v;" string
	var sb strings.Builder
	for _, k := range keys {
		sb.WriteString(fmt.Sprintf("%s=%v;", k, params[k]))
	}

	return hashString(sb.String())
}
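
One thing worth checking with a key that contains the current round's prompt hash and round number: the key written after round N differs from the key looked up at the start of round N+1, so the two would not meet. A common alternative is to key on the conversation prefix, so that the lookup for a new turn hashes exactly the messages that were already processed. The sketch below illustrates that idea only; `Message` and `prefixKey` are hypothetical names and this is not the generator defined above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Message is a hypothetical chat message.
type Message struct {
	Role    string
	Content string
}

// prefixKey hashes the model name and the first n messages of a conversation.
// Every field is length-prefixed so that different splits cannot collide.
func prefixKey(model string, msgs []Message, n int) string {
	h := sha256.New()
	write := func(s string) { fmt.Fprintf(h, "%d:%s;", len(s), s) }
	write(model)
	for _, m := range msgs[:n] {
		write(m.Role)
		write(m.Content)
	}
	return "prefix:" + hex.EncodeToString(h.Sum(nil))[:32]
}

func main() {
	// After turn 1 the engine has processed the user message and its own reply.
	stored := []Message{{"user", "Hi"}, {"assistant", "Hello!"}}
	storeKey := prefixKey("deepseek-v2", stored, len(stored))

	// Turn 2 resends the history plus one new user message; the lookup
	// hashes everything except that new message.
	turn2 := append(append([]Message{}, stored...), Message{"user", "Tell me more"})
	lookupKey := prefixKey("deepseek-v2", turn2, len(turn2)-1)

	fmt.Println(storeKey == lookupKey)
}
```

With this scheme the session ID becomes optional: two sessions that share the same system prompt and history would also share the cached prefix.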

Filter Implementation

// pkg/filter/ai/kvcache/kvcache.go
package kvcache

import (
	"context"
	"fmt"
	"time"

	"github.com/apache/dubbo-go-pixiu/pkg/common/constant"
	"github.com/apache/dubbo-go-pixiu/pkg/common/extension/filter"
	contexthttp "github.com/apache/dubbo-go-pixiu/pkg/context/http"
	"github.com/apache/dubbo-go-pixiu/pkg/logger"
)

const (
	Kind = constant.AIKVCacheFilter
)

func init() {
	filter.RegisterHttpFilter(&Plugin{})
}

type (
	Plugin struct{}

	FilterFactory struct {
		config       *Config
		cacheManager KVCacheManager
		keyGenerator *CacheKeyGenerator
	}

	Filter struct {
		factory *FilterFactory
	}

	Config struct {
		Enabled       bool              `yaml:"enabled" json:"enabled"`
		Backend       string            `yaml:"backend" json:"backend"` // "redis" or "memory"
		RedisConfig   *RedisConfig      `yaml:"redis" json:"redis"`
		CacheKeyConfig *CacheKeyConfig  `yaml:"cache_key" json:"cache_key"`
		DefaultTTL    time.Duration     `yaml:"default_ttl" json:"default_ttl"`
		EnableMetrics bool              `yaml:"enable_metrics" json:"enable_metrics"`
		TargetModels  []string          `yaml:"target_models" json:"target_models"` // enable caching only for these models
	}

	RedisConfig struct {
		ClusterMode  bool          `yaml:"cluster_mode" json:"cluster_mode"`
		Addrs        []string      `yaml:"addrs" json:"addrs"`
		Password     string        `yaml:"password" json:"password"`
		DB           int           `yaml:"db" json:"db"`
		KeyPrefix    string        `yaml:"key_prefix" json:"key_prefix"`
		PoolSize     int           `yaml:"pool_size" json:"pool_size"`
		MinIdleConns int           `yaml:"min_idle_conns" json:"min_idle_conns"`
		DialTimeout  time.Duration `yaml:"dial_timeout" json:"dial_timeout"`
		ReadTimeout  time.Duration `yaml:"read_timeout" json:"read_timeout"`
		WriteTimeout time.Duration `yaml:"write_timeout" json:"write_timeout"`
	}

	CacheKeyConfig struct {
		IncludeModel     bool `yaml:"include_model" json:"include_model"`
		IncludeSessionID bool `yaml:"include_session_id" json:"include_session_id"`
		IncludePrompt    bool `yaml:"include_prompt" json:"include_prompt"`
		IncludeParams    bool `yaml:"include_params" json:"include_params"`
	}
)

func (p *Plugin) Kind() string {
	return Kind
}

func (p *Plugin) CreateFilterFactory() (filter.HttpFilterFactory, error) {
	return &FilterFactory{
		config: &Config{
			Enabled:    false,
			Backend:    "redis",
			DefaultTTL: 3600 * time.Second,
		},
	}, nil
}

func (f *FilterFactory) Config() any {
	return f.config
}

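func (f *FilterFactory) Apply() error {
	if !f.config.Enabled {
		logger.Info("[KVCache] Filter is disabled")
		return nil
	}

	// Initialize the cache manager
	var err error
	switch f.config.Backend {
	case "redis":
		if f.config.RedisConfig == nil {
			return fmt.Errorf("redis backend selected but no redis config provided")
		}
		f.cacheManager, err = NewRedisKVCacheManager(f.config.RedisConfig)
		if err != nil {
			return err
		}
	default:
		return fmt.Errorf("unsupported cache backend: %s", f.config.Backend)
	}

	// Initialize the cache key generator; without a cache_key section, key on session, model, and prompt
	if f.config.CacheKeyConfig == nil {
		f.config.CacheKeyConfig = &CacheKeyConfig{IncludeModel: true, IncludeSessionID: true, IncludePrompt: true}
	}
	f.keyGenerator = NewCacheKeyGenerator(f.config.CacheKeyConfig)

	logger.Infof("[KVCache] Filter initialized with backend: %s", f.config.Backend)
	return nil
}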

func (f *FilterFactory) PrepareFilterChain(ctx *contexthttp.HttpContext, chain filter.FilterChain) error {
	if !f.config.Enabled {
		return nil
	}

	kvcFilter := &Filter{factory: f}
	chain.AppendDecodeFilters(kvcFilter)
	chain.AppendEncodeFilters(kvcFilter)
	return nil
}

// Decode tries to load the KV Cache during the request phase.
func (f *Filter) Decode(ctx *contexthttp.HttpContext) filter.FilterStatus {
	// Only handle target models
	modelName := extractModelName(ctx)
	if !f.factory.isTargetModel(modelName) {
		return filter.Continue
	}

	// Extract the session information
	sessionID := extractSessionID(ctx)
	if sessionID == "" {
		// No session ID: skip caching
		return filter.Continue
	}

	// Build the cache key
	cacheKey := f.factory.keyGenerator.Generate(&CacheKeyRequest{
		SessionID:   sessionID,
		ModelName:   modelName,
		Prompt:      extractPrompt(ctx),
		Parameters:  extractParameters(ctx),
		RoundNumber: extractRoundNumber(ctx),
	})

	// Try to load the KV Cache
	reqCtx := ctx.Request.Context()
	cache, err := f.factory.cacheManager.Get(reqCtx, cacheKey)
	if err == nil && cache != nil {
		// Cache hit: inject the KV Cache into the request
		injectKVCache(ctx, cache)
		logger.Infof("[KVCache] Cache HIT for session %s, key: %s", sessionID, cacheKey)

		// Record metrics
		recordCacheHit(ctx, modelName)
	} else {
		// Cache miss (any backend error is also treated as a miss)
		logger.Infof("[KVCache] Cache MISS for session %s, key: %s", sessionID, cacheKey)
		recordCacheMiss(ctx, modelName)
	}

	// Keep the cache key in the context for the Encode phase
	ctx.Params["kv_cache_key"] = cacheKey

	return filter.Continue
}

// Encode saves the KV Cache during the response phase.
func (f *Filter) Encode(ctx *contexthttp.HttpContext) filter.FilterStatus {
	// Read the cache key stored by Decode
	cacheKeyObj, ok := ctx.Params["kv_cache_key"]
	if !ok {
		return filter.Continue
	}
	cacheKey, ok := cacheKeyObj.(string)
	if !ok || cacheKey == "" {
		return filter.Continue
	}

	// Extract the KV Cache from the response
	kvCache := extractKVCacheFromResponse(ctx)
	if kvCache == nil {
		return filter.Continue
	}

	// Save the KV Cache asynchronously so the response is not blocked.
	// Note: this starts one goroutine per response; a bounded worker pool may be preferable under load.
	go func() {
		saveCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		if err := f.factory.cacheManager.Set(saveCtx, cacheKey, kvCache, f.factory.config.DefaultTTL); err != nil {
			logger.Errorf("[KVCache] Failed to save cache for key %s: %v", cacheKey, err)
		} else {
			logger.Infof("[KVCache] Cache saved for key: %s, size: %d bytes", cacheKey, len(kvCache.Keys)+len(kvCache.Values))
		}
	}()

	return filter.Continue
}

// Helper functions (the extract*/inject*/record* functions below are placeholders)

func (f *FilterFactory) isTargetModel(modelName string) bool {
	if len(f.config.TargetModels) == 0 {
		return true // an empty list means all models
	}
	for _, target := range f.config.TargetModels {
		if target == modelName {
			return true
		}
	}
	return false
}

func extractModelName(ctx *contexthttp.HttpContext) string {
	// Extract the model name from the request.
	// The implementation depends on the concrete API format.
	return ""
}

func extractSessionID(ctx *contexthttp.HttpContext) string {
	// Extract the session ID from the request headers or body.
	return ""
}

func extractPrompt(ctx *contexthttp.HttpContext) string {
	// Extract the prompt from the request body.
	return ""
}

func extractParameters(ctx *contexthttp.HttpContext) map[string]any {
	// Extract the inference parameters from the request body.
	return nil
}

func extractRoundNumber(ctx *contexthttp.HttpContext) int {
	// Extract the conversation round from the request.
	return 0
}

func injectKVCache(ctx *contexthttp.HttpContext, cache *KVCache) {
	// Inject the KV Cache into the request forwarded to the LLM inference service.
	// The implementation depends on the inference service's API format.
}

func extractKVCacheFromResponse(ctx *contexthttp.HttpContext) *KVCache {
	// Extract the KV Cache from the LLM inference service's response.
	// The implementation depends on the inference service's response format.
	return nil
}

func recordCacheHit(ctx *contexthttp.HttpContext, modelName string) {
	// Record the cache-hit metric.
}

func recordCacheMiss(ctx *contexthttp.HttpContext, modelName string) {
	// Record the cache-miss metric.
}

Configuration Example

Full configuration

# configs/ai_kvcache_config.yaml

static_resources:
  listeners:
    - name: "ai-gateway-listener"
      protocol: HTTP
      address:
        socket_address:
          address: "0.0.0.0"
          port: 8888
      filter_chains:
        - filters:
            # 1. KV Cache Filter (placed before the LLM Proxy)
            - name: dgp.filter.ai.kvcache
              config:
                enabled: true
                backend: redis
                redis:
                  cluster_mode: true
                  addrs:
                    - "redis-node-1:6379"
                    - "redis-node-2:6379"
                    - "redis-node-3:6379"
                  password: "your_redis_password"
                  key_prefix: "pixiu:kvcache"
                  pool_size: 100
                  min_idle_conns: 10
                  dial_timeout: 5s
                  read_timeout: 3s
                  write_timeout: 3s
                cache_key:
                  include_model: true
                  include_session_id: true
                  include_prompt: true
                  include_params: false
                default_ttl: 3600s  # 1 hour
                enable_metrics: true
                target_models:
                  - "gpt-4"
                  - "gpt-3.5-turbo"
                  - "deepseek-v2"
            
            # 2. LLM Tokenizer (cost calculation)
            - name: dgp.filter.llm.tokenizer
              config:
                enable_cost_tracking: true
            
            # 3. LLM Proxy (forwards to the inference service)
            - name: dgp.filter.llm.proxy
              config:
                timeout: 30s

  clusters:
    - name: sglang_cluster
      lb_policy: round_robin
      endpoints:
        - id: sglang-1
          socket_address:
            address: "sglang-service-1"
            port: 8000
        - id: sglang-2
          socket_address:
            address: "sglang-service-2"
            port: 8000

# Redis Cluster deployment settings (for reference only; not a Pixiu or redis.conf format)
redis_cluster:
  nodes:
    - host: redis-node-1
      port: 6379
      role: master
    - host: redis-node-1-slave
      port: 6379
      role: slave
    - host: redis-node-2
      port: 6379
      role: master
    - host: redis-node-2-slave
      port: 6379
      role: slave
    - host: redis-node-3
      port: 6379
      role: master
    - host: redis-node-3-slave
      port: 6379
      role: slave
  
  persistence:
    rdb:
      enabled: true
      save: "900 1 300 10 60 10000"
    aof:
      enabled: true
      appendfsync: everysec
  
  memory:
    maxmemory: 16gb
    maxmemory_policy: allkeys-lru

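Performance Optimization Strategies

1. Zero-copy transfer

// SetStream is meant to stream the KV Cache from an io.Reader into Redis in chunks
// instead of buffering it whole. This lowers peak memory but is not true zero-copy.
// Note that Redis Streams (XADD) is a message-log type, not a blob transport;
// chunked APPEND or SETRANGE writes are one option.
func (m *RedisKVCacheManager) SetStream(ctx context.Context, key string, reader io.Reader, ttl time.Duration) error {
	// TODO: read fixed-size chunks from reader, append them to the key, then set the TTL.
	return fmt.Errorf("SetStream: not implemented")
}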

2. Batch operations

// SetMulti writes several entries in one pipeline round trip.
func (m *RedisKVCacheManager) SetMulti(ctx context.Context, caches map[string]*KVCache, ttl time.Duration) error {
	pipe := m.client.Pipeline()
	for key, cache := range caches {
		data, err := json.Marshal(cache)
		if err != nil {
			return fmt.Errorf("failed to marshal cache %s: %w", key, err)
		}
		pipe.Set(ctx, m.makeRedisKey(key), data, ttl)
	}
	_, err := pipe.Exec(ctx)
	return err
}

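3. Compressed storage

import "github.com/klauspost/compress/zstd"

// compressKVCache compresses a serialized KV Cache to reduce storage and transfer cost.
// In production, create the encoder once and reuse it instead of building one per call.
func compressKVCache(data []byte) ([]byte, error) {
	encoder, err := zstd.NewWriter(nil)
	if err != nil {
		return nil, err
	}
	defer encoder.Close()
	return encoder.EncodeAll(data, make([]byte, 0, len(data))), nil
}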

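4. Connection pool tuning

redis:
  pool_size: 100          # tune to the expected concurrency
  min_idle_conns: 20      # connections kept warm
  max_retries: 3          # retry count (not yet a field of RedisConfig above)
  pool_timeout: 4s        # wait time for a free connection (not yet a field of RedisConfig above)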

Monitoring Metrics

Prometheus Metrics

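// pkg/filter/ai/kvcache/metrics.go
package kvcache

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Labels are limited to "model": a per-session label would create one
	// time series per conversation and blow up Prometheus cardinality.
	kvCacheHitTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "pixiu_ai_kvcache_hit_total",
			Help: "Total number of KV cache hits",
		},
		[]string{"model"},
	)

	kvCacheMissTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "pixiu_ai_kvcache_miss_total",
			Help: "Total number of KV cache misses",
		},
		[]string{"model"},
	)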

	kvCacheHitRatio = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "pixiu_ai_kvcache_hit_ratio",
			Help: "KV cache hit ratio",
		},
		[]string{"model"},
	)

	kvCacheSizeBytes = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "pixiu_ai_kvcache_size_bytes",
			Help:    "Size of KV cache in bytes",
			Buckets: []float64{1024, 10240, 102400, 1048576, 10485760, 104857600}, // 1KB ~ 100MB
		},
		[]string{"model"},
	)

	kvCacheGetLatency = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "pixiu_ai_kvcache_get_latency_seconds",
			Help:    "Latency of KV cache get operations",
			Buckets: prometheus.ExponentialBuckets(0.001, 2, 10), // 1ms ~ 1s
		},
		[]string{"model"},
	)

	kvCacheSetLatency = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "pixiu_ai_kvcache_set_latency_seconds",
			Help:    "Latency of KV cache set operations",
			Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
		},
		[]string{"model"},
	)

	kvCacheCostSavings = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "pixiu_ai_kvcache_cost_savings_usd",
			Help: "Cost savings from cache hits (in USD)",
		},
		[]string{"model"},
	)
)

Grafana Dashboard

{
  "dashboard": {
    "title": "Pixiu AI KV Cache Monitoring",
    "panels": [
      {
        "title": "Cache Hit Ratio",
        "targets": [
          {
            "expr": "sum(rate(pixiu_ai_kvcache_hit_total[5m])) / (sum(rate(pixiu_ai_kvcache_hit_total[5m])) + sum(rate(pixiu_ai_kvcache_miss_total[5m])))"
          }
        ]
      },
      {
        "title": "Cache Get/Set Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(pixiu_ai_kvcache_get_latency_seconds_bucket[5m])))",
            "legendFormat": "Get P95"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(pixiu_ai_kvcache_set_latency_seconds_bucket[5m])))",
            "legendFormat": "Set P95"
          }
        ]
      },
      {
        "title": "Cost Savings (USD/hour)",
        "targets": [
          {
            "expr": "sum(increase(pixiu_ai_kvcache_cost_savings_usd[1h]))"
          }
        ]
      }
    ]
  }
}

Deployment Architecture

Kubernetes deployment

# deploy/kvcache-deployment.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: pixiu-kvcache-config
data:
  config.yaml: |
    # ... full configuration file content

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pixiu-ai-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pixiu-ai-gateway
  template:
    metadata:
      labels:
        app: pixiu-ai-gateway
    spec:
      containers:
      - name: pixiu
        image: dubbogopixiu/dubbo-go-pixiu:latest
        ports:
        - containerPort: 8888
        volumeMounts:
        - name: config
          mountPath: /etc/pixiu
        env:
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: redis-secret
              key: password
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
      volumes:
      - name: config
        configMap:
          name: pixiu-kvcache-config

---
apiVersion: v1
kind: Service
metadata:
  name: pixiu-ai-gateway
spec:
  selector:
    app: pixiu-ai-gateway
  ports:
  - port: 8888
    targetPort: 8888
  type: LoadBalancer

Redis Cluster deployment

# deploy/redis-cluster.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
      - name: redis
        image: redis:7.2
        command: ["redis-server"]
        args: ["--cluster-enabled", "yes", "--cluster-config-file", "/data/nodes.conf"]
        ports:
        - containerPort: 6379
          name: client
        - containerPort: 16379
          name: gossip
        volumeMounts:
        - name: data
          mountPath: /data
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
          limits:
            cpu: "2"
            memory: "8Gi"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

Performance Testing and Validation

Test scenarios

Scenario 1: without KV Cache (baseline)

# 4 SGLang instances, no shared cache
QPS: 100
TTFT (P95): 2000ms
TPOT (Avg): 50ms
Cost: $100/hour

Scenario 2: with KV Cache (Redis cluster)

# 4 SGLang instances + Redis Cluster + Pixiu KV Cache Filter
QPS: 150 (+50%)
TTFT (P95): 800ms (-60%)
TPOT (Avg): 37ms (-26%)
Cache hit ratio: 85%
Cost: $55/hour (-45%)

Test tool

# Load test with tools/benchmark
cd tools/benchmark
go run main.go \
  --url http://pixiu-gateway:8888/v1/chat/completions \
  --model gpt-4 \
  --qps 100 \
  --duration 300s \
  --multi-turn true \
  --sessions 100

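Cost-Benefit Analysis

Cost savings calculation

Assumptions:

GPT-4 input: $0.03/1K tokens; output: $0.06/1K tokens
Average input of 1000 tokens and average output of 500 tokens per request
Multi-turn conversations average 5 rounds
1 million conversations per day (5 million requests)

Without KV Cache (prefill on every request)

Cost per request = (1000 * $0.03 + 500 * $0.06) / 1000 = $0.06
Cost per conversation = 5 * $0.06 = $0.30
Daily cost = 1,000,000 * $0.30 = $300,000

With KV Cache (85% cache hit ratio)

First-round request cost = $0.06
Later round, cache hit: only output is billed = 500 * $0.06 / 1000 = $0.03
Later round, cache miss: prefill is still required = $0.06

Cost per conversation = $0.06 + (0.85 * 4 * $0.03) + (0.15 * 4 * $0.06)
                      = $0.06 + $0.102 + $0.036
                      = $0.198

Daily cost = 1,000,000 * $0.198 = $198,000

Daily savings: $300,000 - $198,000 = $102,000 (34% reduction)

Monthly savings (30 days): $3,060,000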
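
The calculation above can be reproduced with a few lines of code, which also makes it easy to try other hit ratios. This is only a sketch of the same simplified model (fixed input and output sizes, a hit skips the whole input cost); the function names are made up for the example.

```go
package main

import "fmt"

// Pricing assumptions taken from the text above (USD per 1K tokens).
const (
	inputPricePer1K  = 0.03
	outputPricePer1K = 0.06
)

// requestCost returns the cost of one request. A cache hit skips the
// input (prefill) cost and pays for output tokens only.
func requestCost(inTokens, outTokens int, cacheHit bool) float64 {
	cost := float64(outTokens) * outputPricePer1K / 1000
	if !cacheHit {
		cost += float64(inTokens) * inputPricePer1K / 1000
	}
	return cost
}

// conversationCost returns the expected cost of a conversation with the
// given number of rounds: the first round always misses, and every later
// round hits with probability hitRatio.
func conversationCost(rounds int, hitRatio float64) float64 {
	miss := requestCost(1000, 500, false)
	hit := requestCost(1000, 500, true)
	later := float64(rounds - 1)
	return miss + later*(hitRatio*hit+(1-hitRatio)*miss)
}

func main() {
	base := conversationCost(5, 0)
	cached := conversationCost(5, 0.85)
	fmt.Printf("baseline=$%.3f cached=$%.3f saving=%.0f%%\n",
		base, cached, (base-cached)/base*100)
}
```

Under this model even a 100% hit ratio on later rounds caps the saving at 40% for 5-round conversations, because output tokens and the first-round prefill are always paid.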

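Failure Handling and High Availability

Redis cluster failures

Automatic failover
- Redis Cluster nodes monitor each other and detect a failed master
- A replica is promoted to master automatically
- Clients refresh the cluster topology and reconnect
Degradation strategy
- When Redis is unavailable, skip the cache logic
- Forward the request directly to the inference service
- Log an alert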

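func (m *RedisKVCacheManager) Get(ctx context.Context, key string) (*KVCache, error) {
	redisKey := m.makeRedisKey(key)
	data, err := m.client.Get(ctx, redisKey).Bytes()
	if err != nil {
		// isRedisDownError is a helper left to the implementation,
		// for example one that matches network and timeout errors.
		if isRedisDownError(err) {
			// Redis is down: degrade gracefully
			logger.Error("[KVCache] Redis is down, fallback to no-cache mode")
			return nil, ErrCacheNotFound // treat it as a cache miss
		}
		return nil, err
	}
	// ...
}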

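Data Consistency

TTL expiration policy
- Set a reasonable TTL (for example, 1 hour)
- Prevents stale data from being used
Version control
- Include the model version in the cache key
- Old caches are invalidated automatically after a model update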

cacheKey := fmt.Sprintf("model:%s:v%s:session:%s", modelName, modelVersion, sessionID)

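Best Practices

1. Cache key design

✅ Recommended:

pixiu:kvcache:sid:{session_id}:model:{model_name}:round:{round_number}

❌ Not recommended:

kv_{random_id}  # cannot be traced or debugged

2. TTL settings

Short sessions (finished within 5 minutes): TTL = 10 minutes
Long sessions (finished within 1 hour): TTL = 2 hours
Sessions spanning days: TTL = 24 hours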

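3. Memory management

A single KV Cache is typically 100 MB to 1 GB. Because a Redis String value is capped at 512 MB, larger entries have to be split across several keys.
Total Redis cluster memory = expected concurrent sessions × average KV Cache size × 1.5 (headroom)

Example:

10,000 concurrent sessions × 500 MB × 1.5 = 7.5 TB
Deploy 10 Redis nodes with 800 GB of memory each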
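
To see where per-session sizes of this order come from, the KV Cache size can be estimated from the model shape: every token stores one key and one value vector per layer and per KV head. The sketch below applies that formula and the 1.5x sizing rule above. `ModelShape` is a hypothetical type, and the example shape (32 layers, 32 KV heads, head dimension 128, fp16) corresponds to a Llama-2-7B-class model.

```go
package main

import "fmt"

// ModelShape holds the dimensions that determine KV Cache size.
type ModelShape struct {
	Layers       int
	KVHeads      int
	HeadDim      int
	BytesPerElem int // 2 for fp16/bf16
}

// bytesPerToken = 2 (K and V) x layers x KV heads x head dim x bytes per element.
func (s ModelShape) bytesPerToken() int64 {
	return 2 * int64(s.Layers) * int64(s.KVHeads) * int64(s.HeadDim) * int64(s.BytesPerElem)
}

// clusterBytes applies the sizing rule above:
// concurrent sessions x average cache size x 1.5 headroom.
func clusterBytes(sessions, avgCacheBytes int64) int64 {
	return sessions * avgCacheBytes * 3 / 2
}

func main() {
	shape := ModelShape{Layers: 32, KVHeads: 32, HeadDim: 128, BytesPerElem: 2}
	perToken := shape.bytesPerToken()
	fmt.Printf("per token: %d bytes; 4096 tokens: %d MiB\n", perToken, perToken*4096>>20)
	fmt.Printf("10,000 sessions x 500 MB x 1.5 = %d GB\n", clusterBytes(10000, 500000000)/1000000000)
}
```

Models that use grouped-query attention have fewer KV heads than attention heads, which shrinks the per-token size proportionally.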

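4. Monitoring and alerting

Set alert thresholds for the key metrics:

Cache hit ratio < 50%: revisit the cache key generation strategy
Redis memory usage > 80%: scale out or adjust the TTL
Get/Set latency P95 > 100ms: check the network or Redis performance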

Ray7788 · 2026-01-23T13:43:52Z

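Preliminary Research

KV Cache background

The goal is for the KV Cache layer to be able to run either as a standalone proxy layer or as a library embedded in the same inference stack.

(GPU → CPU → Disk)
During inference, every newly generated token depends on all previously generated tokens.

The model stores the attention keys and values of every layer so that it does not have to recompute the whole context each time. These keys and values are collectively called the KV cache.

Without a cache, every request would recompute the keys and values of all previous tokens, which quickly becomes a bottleneck for long contexts and repeated prompts.

Most inference engines (for example vLLM, Text Generation Inference, or Hugging Face Transformers) already use a cache within a single session.

However, these caches are generally discarded once the request completes, which means that the same prefix in a later query triggers a full recomputation.

LMCache

LMCache acts as an acceleration layer between the client and the LLM engine. Instead of letting every request start from scratch, LMCache identifies repeated prefixes, retrieves the previously computed KV cache, and hands it directly to the model. Unlike vLLM on its own, LMCache can keep the KV cache beyond the lifetime of a single request: across sessions, users, and even servers.

It adds a cache lookup stage before model execution and a cache store stage after inference, effectively wrapping the LLM engine.

| Tier / location | Latency | Use case |
| --- | --- | --- |
| GPU memory | Fastest, < 1 ms | Active sessions and hot prefixes |
| CPU memory | Medium, ~10 ms | Warm cache, less frequently used contexts |
| Disk / SSD | Slowest, ~100 ms | Long-term storage for rarely used or large contexts |

Cache entries migrate between tiers dynamically based on access time and access frequency, a hot/warm/cold data strategy similar to a modern memory hierarchy.

The KV cache can be extracted from the inference engine and loaded back into it efficiently, stored on tiered storage (CPU memory, local disk, remote disk, and Redis), and transferred over different interconnects (Ethernet, RDMA, NVLink).

When GPU or CPU memory is full, LMCache uses LRU (least recently used) and TTL (time to live) policies to evict stale cache entries.
It also maintains lightweight metadata (such as hashes, offsets, and tensor shapes) to keep the cache consistent even across restarts.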
