KlingAIResearch/DecMem

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

Zhenhao Yang1 Xiaoshi Wu2 Zhengyao Lv1 Xiaoyu Shi2,† Xintao Wang2 Pengfei Wan2 Kun Gai2 Kwan-Yee K. Wong1,†

1The University of Hong Kong   2Kling Team, Kuaishou Technology
†Corresponding author

📋 Table of Contents

Note: This open-source repository is a reference implementation built on Wan2.1-T2V-1.3B. The original model was trained on an internal pretrained model, so that codebase will not be open-sourced.

🔥 Updates

📷 Introduction

We systematically reveal the root cause of the limited long-horizon world generation capability of naïve dense-attention designs and propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation.

(a) Sparse Global Memory (SGM): Combines a block-level sparse retrieval module with a context-aware attention module to retrieve fine-grained long-term memory end to end, giving efficient access to global history at bounded cost.

(b) Anchored Local Memory (ALM): Keeps short-term transitions smooth and supports stable, high-quality extrapolation by anchoring generation on recent local context.

(c) Capability: DecMem enables minute-level controllable long video generation with high fidelity, consistency and efficiency.

Watch the video

⚙️ Code: DecMem + Wan2.1-T2V-1.3B

Requirements

We tested this repo on the following setup:

  • GPU: NVIDIA GPU with at least 24 GB VRAM.
    • LTM (long-term memory) via Video Sparse Attention requires SM 90a (H100 / H200 / H800). The block-sparse CUDA kernel is built for sm_90a and is not supported on A100 or consumer GPUs.
    • Dense-only training / inference (use_ltm_attn: false) can run on A100 and other CUDA GPUs, but the full DecMem model with LTM needs the hardware above.
  • RAM: 64 GB recommended.
  • CUDA: 12.x toolkit matching the PyTorch build.
  • Compiler: gcc / g++ ≥ 11 (C++20 required by the VSA kernel).
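
The hardware split above can be checked before choosing a config. The sketch below is an illustration, not part of the repository: `supports_ltm` is a hypothetical helper, and it assumes that a kernel built for sm_90a only runs on devices reporting compute capability 9.0, as the requirement above implies.

```python
def supports_ltm(capability):
    """Return True when a (major, minor) compute capability is SM 9.0 (Hopper)."""
    # Assumption: sm_90a binaries are architecture-specific, so only an exact
    # 9.0 match qualifies (A100 reports 8.0, RTX 4090 reports 8.9).
    return tuple(capability) == (9, 0)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to query the local GPU
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print("capability:", cap, "-> LTM kernel usable:", supports_ltm(cap))
    else:
        print("no CUDA device visible; cannot check LTM support here")
```

If the check fails, the dense-only path (use_ltm_attn: false) is the one the requirements list as supported on other CUDA GPUs.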

Installation

1. Core environment

Create a conda environment and install dependencies:

git clone https://github.com/KlingAIResearch/DecMem.git
cd DecMem
conda create -n decmem python=3.10 -y
conda activate decmem
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

2. Video Sparse Attention (VSA) kernel

Our LTM module is implemented with a modified VSA kernel, which is required when training or running inference with use_ltm_attn: true (e.g. configs/decmem.yaml). Install it from source:

cd video_sparse_attn
python setup.py install

Verify:

python -c "
import torch, vsa
from vsa.block_sparse_wrapper import block_sparse_attn_SM90
print('device cap:', torch.cuda.get_device_capability(0))
print('SM90 op:', block_sparse_attn_SM90)
"

Checkpoints

Download the Wan2.1 backbone (VAE + tokenizer weights used by the pipeline):

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
    --local-dir-use-symlinks False \
    --local-dir wan_models/Wan2.1-T2V-1.3B

Download DecMem trained checkpoints from HuggingFace:

huggingface-cli download KlingTeam/DecMem --local-dir checkpoints

Checkpoint layout expected by training / inference scripts:

checkpoints/
└── decmem.pt             # released weights
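
Before launching a script, it can help to confirm that both downloads landed where the commands above put them. A minimal sketch, assuming the paths are relative to the repository root; `missing_paths` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# Paths taken from the download commands above, relative to the repo root.
EXPECTED = (
    "wan_models/Wan2.1-T2V-1.3B",
    "checkpoints/decmem.pt",
)


def missing_paths(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    print("all checkpoints found" if not missing else f"missing: {missing}")
```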

Quick start

We provide example video-pose pairs for quick inference. Inference runs block-by-block causal denoising with a KV cache.

bash scripts/infer_example.sh
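
Block-by-block causal denoising can be pictured as a loop over groups of latent frames, where each group is denoised while attending to cached keys and values from earlier frames. The sketch below illustrates only the frame bookkeeping. The function name and `block_size` are hypothetical, and the repository's actual chunking may differ.

```python
def causal_blocks(total_frames, num_condition_frames, block_size):
    """Yield (start, end) latent-frame ranges to denoise, in causal order.

    Frames [0, num_condition_frames) form the clean prefix and are skipped.
    In a causal scheme each block would attend only to frames before its end,
    with earlier frames served from the KV cache.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    start = num_condition_frames
    while start < total_frames:
        end = min(start + block_size, total_frames)
        yield start, end
        start = end
```

Each yielded range is denoised to completion before the next one starts, which is what lets the keys and values of finished frames be cached instead of recomputed.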

Data

DecMem trains on WorldMem gameplay data: paired MP4 videos and NPZ action/pose files indexed by a CSV.

Download dataset

Download the Minecraft dataset from Hugging Face (see the WorldMem dataset section).

Place the dataset in the following directory structure:

data/
├── training/
├── validation/
└── test/

Each clip is an MP4 file with a matching NPZ file (same basename) in the same directory.

Build metadata CSV

After download, run the helper script to scan MP4/NPZ pairs and generate index files:

python scripts/build_worldmem_csv.py

This writes train.csv (from training/) and test.csv (from test/) under data/worldmem_minecraft_dataset/. Rows are shuffled with seed 42 by default; use --seed to change it.
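
For readers who want to see what such an index step involves, here is a minimal sketch. It is not the repository's scripts/build_worldmem_csv.py: `build_index` is a hypothetical function that assumes a flat split directory and uses the two column names documented in the CSV format section.

```python
import csv
import random
from pathlib import Path


def build_index(split_dir, out_csv, seed=42):
    """Write a shuffled video_path,action_path CSV for each MP4 with a same-named NPZ."""
    split_dir = Path(split_dir)
    rows = [
        (str(mp4), str(mp4.with_suffix(".npz")))
        for mp4 in sorted(split_dir.glob("*.mp4"))
        if mp4.with_suffix(".npz").exists()  # skip clips without an action file
    ]
    random.Random(seed).shuffle(rows)  # seeded so the row order is reproducible
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "action_path"])
        writer.writerows(rows)
    return len(rows)
```

Sorting before shuffling keeps the result independent of filesystem listing order, so the same seed always gives the same index.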

Directory layout

After download and indexing:

data/
├── train.csv          # training index (generated)
├── test.csv           # evaluation index (generated)
├── training/
│   └── *.mp4 + *.npz  # paired clips
├── validation/
│   └── *.mp4 + *.npz
└── test/
    └── *.mp4 + *.npz

CSV format

Each row has two columns (utils/dataset.py → WorldMemCSVDataset):

| Column | Description |
| --- | --- |
| video_path | Path to an MP4 video clip. |
| action_path | Path to the matching NPZ file. |

Example (data/minecraft/train.csv):

video_path,action_path
data/training/000001.mp4,data/training/000001.npz
data/training/000002.mp4,data/training/000002.npz
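
An index in this format can be sanity-checked before training. The sketch below is a hypothetical `check_index` helper, not part of the repository. It assumes the paths in the CSV are relative to a root directory you pass in, and it flags missing files and rows whose basenames do not match.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ("video_path", "action_path")


def check_index(csv_path, root="."):
    """Return a list of problems found in an index CSV (empty list means OK)."""
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        absent = [c for c in REQUIRED_COLUMNS if c not in header]
        if absent:
            return [f"missing columns: {absent}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            video = row["video_path"] or ""
            action = row["action_path"] or ""
            for col, value in (("video_path", video), ("action_path", action)):
                if not (Path(root) / value).is_file():
                    problems.append(f"line {line_no}: {col} not found: {value}")
            if Path(video).stem != Path(action).stem:
                problems.append(f"line {line_no}: basenames differ")
    return problems
```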

Train

Note:

  1. Update MASTER_ADDR (and NNODES for multi-node) in the launch scripts before distributed training.
  2. Dense training (use_ltm_attn: false) can run on A100; LTM training requires H100/H800/H200 with the block-sparse kernel installed.
  3. Set WANDB_KEY / wandb_entity in the config or disable wandb with --disable-wandb.
  4. This open-source version trains DecMem in three stages; each stage initializes from the checkpoint of the previous stage.

Stage 1: Dense causal diffusion (Short)

# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 dense_short
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 dense_short
# On node 2:...

Stage 2: Dense causal diffusion (Long)

Fine-tune from a dense checkpoint trained on short sequences (Stage 1); set generator_ckpt in the config:

# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 dense
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 dense
# On node 2:...

Stage 3: Decoupled memory (STM + LTM)

Fine-tune from a dense checkpoint trained on long sequences (Stage 2); set generator_ckpt in the config:

# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 decmem
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 decmem
# On node 2:...

You can set RESUME_CKPT in the training scripts to resume from a full training checkpoint (weights + optimizer + step).
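
The stage ordering can be written down as a small lookup, which is handy when scripting the three runs. This is only a sketch: the stage names are the second argument passed to the launch script in the commands above, while `init_stage` is a hypothetical helper.

```python
# Stage names as passed to scripts/train_worldmem_multinode_manual.sh, in order.
STAGES = ("dense_short", "dense", "decmem")


def init_stage(stage):
    """Return the stage whose checkpoint initializes `stage`, or None for stage 1."""
    idx = STAGES.index(stage)  # raises ValueError for unknown stage names
    return STAGES[idx - 1] if idx > 0 else None


if __name__ == "__main__":
    for name in STAGES:
        source = init_stage(name)
        print(f"{name}: " + (f"set generator_ckpt to a {source} checkpoint" if source else "train from the backbone"))
```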

Inference

Once training for a stage finishes, you can load the corresponding checkpoint and config for inference.

Common overrides:

EXP_NAME=decmem \
CHECKPOINT_PATH=checkpoints/decmem.pt \
T_LAT=160 \
NUM_CONDITION_FRAMES=56 \
GUIDANCE_SCALE=5.0 \
NUM_INFERENCE_STEPS=50 \
bash scripts/infer.sh

# Multi-GPU inference
NUM_GPUS=8 bash scripts/infer.sh

| Argument | Description |
| --- | --- |
| NUM_CONDITION_FRAMES | Number of clean prefix latent frames (conditioning video prefix). |
| T_LAT | Number of latent frames in the sequence (likely the total latent length, including the conditioning prefix; confirm in scripts/infer.sh). |
| GUIDANCE_SCALE | CFG scale on actions; ≤ 1.0 disables CFG. |
| N | Number of inference samples. |
| NUM_INFERENCE_STEPS | Number of denoising steps. |
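
The same overrides can be passed from Python when sweeping settings. A minimal sketch, assuming scripts/infer.sh reads these values from the environment as the shell example above suggests; `infer_env` and `run_inference` are hypothetical helpers.

```python
import os
import subprocess


def infer_env(overrides, base_env=None):
    """Merge overrides into a copy of the environment, converting values to strings."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def run_inference(overrides, script="scripts/infer.sh"):
    """Launch the inference script with the given environment overrides."""
    return subprocess.run(["bash", script], env=infer_env(overrides), check=True)
```

For example, `run_inference({"EXP_NAME": "decmem", "CHECKPOINT_PATH": "checkpoints/decmem.pt", "GUIDANCE_SCALE": 5.0})` mirrors part of the shell invocation above. It must be called from the repository root so the script path resolves.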

🤗 Acknowledgement

This codebase is built on top of:

🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

@misc{yang2026decmemminutelongconsistentworld,
      title={DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory}, 
      author={Zhenhao Yang and Xiaoshi Wu and Zhengyao Lv and Xiaoyu Shi and Xintao Wang and Pengfei Wan and Kun Gai and Kwan-Yee K. Wong},
      year={2026},
      eprint={2605.31336},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.31336}, 
}
