Zhenhao Yang¹, Xiaoshi Wu², Zhengyao Lv¹, Xiaoyu Shi²†, Xintao Wang², Pengfei Wan², Kun Gai², Kwan-Yee K. Wong¹†
¹The University of Hong Kong
²Kling Team, Kuaishou Technology
† Corresponding author
- Table of Contents
- Updates
- Introduction
- Code: DecMem + Wan2.1-T2V-1.3B
- Acknowledgement
- Citation
Note: This open-source repository is a reference implementation built on Wan2.1-T2V-1.3B. The original model was trained from an internal pretrained model, so that codebase will not be open-sourced.
- [2026.06.01]: Released the training and inference code and the checkpoints.
- [2026.06.01]: Released the project page and the arXiv version.
We systematically reveal the root cause of the limited long-horizon world generation capability of naïve dense-attention designs and propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation.
(a) Sparse Global Memory (SGM): Combines a block-level sparse retrieval module with a context-aware attention module for fine-grained, end-to-end retrieval from long-term memory, enabling efficient access to global history at bounded cost.
(b) Anchored Local Memory (ALM): Keeps short-term transitions smooth and supports stable, high-quality extrapolation by anchoring generation on recent local context.
(c) Capability: DecMem enables minute-level controllable long video generation with high fidelity, consistency and efficiency.
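To make the split between global and local memory concrete, here is a minimal sketch of block-level selection. It is a generic top-k illustration, not the DecMem implementation: the mean-pooled block summaries, the dot-product scoring, and the names `mean_pool`, `select_blocks`, `top_k`, and `num_local` are all assumptions made for this example.

```python
from typing import List, Sequence


def mean_pool(block: Sequence[Sequence[float]]) -> List[float]:
    """Summarize one block of key vectors by their element-wise mean."""
    dim = len(block[0])
    return [sum(vec[i] for vec in block) / len(block) for i in range(dim)]


def select_blocks(
    query: Sequence[float],
    key_blocks: Sequence[Sequence[Sequence[float]]],
    top_k: int,
    num_local: int,
) -> List[int]:
    """Return sorted indices of the history blocks the query may attend to.

    Hypothetical scheme: the last `num_local` blocks are always kept
    (local memory); among the older blocks, the `top_k` with the highest
    dot-product score against the query are retrieved (global memory).
    """
    n = len(key_blocks)
    first_local = max(0, n - num_local)
    local = list(range(first_local, n))
    # Score every older block by its pooled summary.
    scores = {
        i: sum(q * k for q, k in zip(query, mean_pool(key_blocks[i])))
        for i in range(first_local)
    }
    retrieved = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return sorted(set(retrieved) | set(local))
```

In this sketch the number of attended blocks never exceeds `top_k + num_local`, however long the history grows, which is the sense in which the cost stays bounded.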
We tested this repo on the following setup:
- GPU: NVIDIA GPU with at least 24 GB VRAM.
  - LTM (long-term memory) via Video Sparse Attention requires SM 90a (H100 / H200 / H800). The block-sparse CUDA kernel is built for `sm_90a` and is not supported on A100 or consumer GPUs.
  - Dense-only training / inference (`use_ltm_attn: false`) can run on A100 and other CUDA GPUs, but the full DecMem model with LTM needs the hardware above.
- RAM: 64 GB recommended.
- CUDA: 12.x toolkit matching the PyTorch build.
- Compiler: gcc / g++ ≥ 11 (C++20 required by the VSA kernel).
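The hardware rule above can be checked before choosing a config. The following is a minimal sketch, assuming the only requirement is compute capability 9.0; the helper names `supports_sm90a_kernel` and `recommended_use_ltm_attn` are hypothetical and not part of this repository.

```python
from typing import Tuple


def supports_sm90a_kernel(capability: Tuple[int, int]) -> bool:
    """True only for compute capability 9.0 (H100 / H200 / H800)."""
    return tuple(capability) == (9, 0)


def recommended_use_ltm_attn() -> bool:
    """Suggest a value for use_ltm_attn on the current machine."""
    try:
        import torch  # imported lazily so the pure check above has no dependencies
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return supports_sm90a_kernel(torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    print("use_ltm_attn:", recommended_use_ltm_attn())
```

The check is an exact match rather than "9.0 or newer" because a kernel built only for `sm_90a` targets that architecture specifically; A100 reports (8, 0) and falls back to dense attention.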
Create a conda environment and install dependencies:
git clone https://github.com/KlingAIResearch/DecMem.git
cd DecMem
conda create -n decmem python=3.10 -y
conda activate decmem
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Our LTM module is implemented with a modified VSA kernel, so the kernel must be installed when training or running inference with use_ltm_attn: true (e.g. configs/decmem.yaml). Install it from source:
cd video_sparse_attn
python setup.py install

Verify:
python -c "
import torch, vsa
from vsa.block_sparse_wrapper import block_sparse_attn_SM90
print('device cap:', torch.cuda.get_device_capability(0))
print('SM90 op:', block_sparse_attn_SM90)
"

Download the Wan2.1 backbone (VAE + tokenizer weights used by the pipeline):
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
--local-dir-use-symlinks False \
--local-dir wan_models/Wan2.1-T2V-1.3B

Download the trained DecMem checkpoints from Hugging Face:
huggingface-cli download KlingTeam/DecMem --local-dir checkpoints

Checkpoint layout expected by the training / inference scripts:
checkpoints/
└── decmem.pt   # released weights
We provide example video-pose pairs for quick inference. Inference runs as block-by-block causal denoising with a KV cache.
bash scripts/infer_example.sh

DecMem trains on WorldMem gameplay data: paired MP4 videos and NPZ action/pose files indexed by a CSV.
Download the Minecraft dataset from Hugging Face (see the WorldMem dataset section).
Place the dataset in the following directory structure:
data/
├── training/
├── validation/
└── test/
Each clip is an MP4 file with a matching NPZ file (same basename) in the same directory.
After download, run the helper script to scan MP4/NPZ pairs and generate index files:
python scripts/build_worldmem_csv.py

This writes train.csv (from training/) and test.csv (from test/) under data/worldmem_minecraft_dataset/. Rows are shuffled with seed 42 by default; use --seed to change it.
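For readers who want to see what such an indexing step involves, here is a minimal sketch of pairing clips by basename and writing a shuffled two-column index. It is not the code of scripts/build_worldmem_csv.py; the function names `find_pairs` and `write_index` and the choice to skip unmatched clips are assumptions.

```python
import csv
import random
from pathlib import Path
from typing import List, Tuple


def find_pairs(split_dir: Path) -> List[Tuple[str, str]]:
    """Collect (mp4, npz) paths that share a basename inside split_dir."""
    pairs = []
    for video in sorted(split_dir.glob("*.mp4")):
        action = video.with_suffix(".npz")
        if action.exists():  # skip clips without a matching action file
            pairs.append((video.as_posix(), action.as_posix()))
    return pairs


def write_index(split_dir: Path, csv_path: Path, seed: int = 42) -> int:
    """Shuffle the pairs deterministically and write a two-column CSV."""
    pairs = find_pairs(split_dir)
    random.Random(seed).shuffle(pairs)
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "action_path"])
        writer.writerows(pairs)
    return len(pairs)
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.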
After download and indexing:
data/
├── train.csv          # training index (generated)
├── test.csv           # evaluation index (generated)
├── training/
│   └── *.mp4 + *.npz  # paired clips
├── validation/
│   └── *.mp4 + *.npz
└── test/
    └── *.mp4 + *.npz
Each row has two columns (utils/dataset.py → WorldMemCSVDataset):
| Column | Description |
|---|---|
| `video_path` | Path to an MP4 video clip. |
| `action_path` | Path to the matching NPZ file. |
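A quick sanity check of an index file can catch mismatched rows before training starts. The sketch below only inspects the CSV text; it does not use `WorldMemCSVDataset`, and the function name `check_index` and the specific checks are assumptions for illustration.

```python
import csv
from pathlib import Path
from typing import List


def check_index(csv_path: str) -> List[str]:
    """Return a list of human-readable problems found in an index CSV."""
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["video_path", "action_path"]:
            return [f"unexpected header: {reader.fieldnames}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            # Missing cells come back as None; treat them as empty paths.
            video = Path(row["video_path"] or "")
            action = Path(row["action_path"] or "")
            if video.suffix != ".mp4" or action.suffix != ".npz":
                problems.append(f"line {line_no}: unexpected file extension")
            elif video.stem != action.stem:
                problems.append(f"line {line_no}: basenames differ")
    return problems
```

An empty return value means every row has an `.mp4` / `.npz` pair with the same basename, matching the layout described above.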
Example (data/minecraft/train.csv):
video_path,action_path
data/training/000001.mp4,data/training/000001.npz
data/training/000002.mp4,data/training/000002.npz

Note:
- Update `MASTER_ADDR` (and `NNODES` for multi-node) in the launch scripts before distributed training.
- Dense training (`use_ltm_attn: false`) can run on A100; LTM training requires H100/H800/H200 with the block-sparse kernel installed.
- Set `WANDB_KEY` / `wandb_entity` in the config or disable wandb with `--disable-wandb`.
- In this open-source version we train DecMem in three stages; each stage loads the checkpoint from the previous stage and continues training from it.
# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 dense_short
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 dense_short
# On node 2: ...

Fine-tune from the short-sequence dense checkpoint; set generator_ckpt in the config:
# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 dense
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 dense
# On node 2: ...

Fine-tune from the long-sequence dense checkpoint; set generator_ckpt in the config:
# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 decmem
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 decmem
# On node 2: ...

You can set RESUME_CKPT in the training scripts to resume from a full training checkpoint (weights + optimizer + step).
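The three stages use the same launch script with a different stage name, so the per-node commands can be generated rather than typed by hand. This is a small sketch that only prints the commands shown above; the helper `launch_commands` is hypothetical and not part of this repository.

```python
from typing import List

# Stage names as passed to the launch script in the commands above.
STAGES = ["dense_short", "dense", "decmem"]
SCRIPT = "scripts/train_worldmem_multinode_manual.sh"


def launch_commands(stage: str, num_nodes: int) -> List[str]:
    """Return the command to run on each node, indexed by node rank."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [f"bash {SCRIPT} {rank} {stage}" for rank in range(num_nodes)]


if __name__ == "__main__":
    for stage in STAGES:
        print(f"# stage: {stage}")
        for cmd in launch_commands(stage, num_nodes=2):
            print(cmd)
```

The sketch does not edit any config: generator_ckpt still has to point at the previous stage's checkpoint before the second and third stages are launched.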
Once training for a stage finishes, you can load the corresponding checkpoint and config for inference.
Common overrides:
EXP_NAME=decmem \
CHECKPOINT_PATH=checkpoints/decmem.pt \
T_LAT=160 \
NUM_CONDITION_FRAMES=56 \
GUIDANCE_SCALE=5.0 \
NUM_INFERENCE_STEPS=50 \
bash scripts/infer.sh
# Multi-GPU inference
NUM_GPUS=8 bash scripts/infer.sh

| Argument | Description |
|---|---|
| `NUM_CONDITION_FRAMES` | Number of clean prefix latent frames (conditioning video prefix). |
| `T_LAT` | Total length of the latent sequence, in latent frames. |
| `GUIDANCE_SCALE` | CFG scale on actions; ≤ 1.0 disables CFG. |
| `N` | Number of inference samples. |
| `NUM_INFERENCE_STEPS` | Number of denoising steps. |
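The effect of `GUIDANCE_SCALE` can be illustrated with the standard classifier-free guidance formula. This is a sketch under assumptions, not the repository's sampler: it assumes the unconditional branch is the prediction without action conditioning and that "disabled" means the conditional prediction is used as-is; the function name `apply_cfg` is hypothetical.

```python
from typing import List, Sequence


def apply_cfg(
    cond: Sequence[float], uncond: Sequence[float], guidance_scale: float
) -> List[float]:
    """Combine conditional and unconditional predictions.

    Standard CFG: uncond + scale * (cond - uncond). With scale <= 1.0 the
    conditional prediction is returned unchanged, i.e. CFG is disabled.
    """
    if guidance_scale <= 1.0:
        return list(cond)
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
```

When guidance is disabled the unconditional forward pass is not needed at all, so under this assumption a scale of 1.0 or below would also roughly halve the model evaluations per denoising step.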
This codebase is built on top of:
- WorldMem for action-pose data format and pose utilities.
- PRoPE for camera-relative positional encoding.
- Video Sparse Attention for the LTM sparse attention kernel.
- Self Forcing and CausVid for the causal diffusion training and teacher-forcing code.
- Wan2.1 as the base video diffusion backbone.
Please leave us a star 🌟 and cite our paper if you find our work helpful.
@misc{yang2026decmemminutelongconsistentworld,
title={DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory},
author={Zhenhao Yang and Xiaoshi Wu and Zhengyao Lv and Xiaoyu Shi and Xintao Wang and Pengfei Wan and Kun Gai and Kwan-Yee K. Wong},
year={2026},
eprint={2605.31336},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.31336},
}