DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

Zhenhao Yang¹ Xiaoshi Wu² Zhengyao Lv¹ Xiaoyu Shi²,† Xintao Wang² Pengfei Wan² Kun Gai² Kwan-Yee K. Wong¹,†

¹The University of Hong Kong ²Kling Team, Kuaishou Technology
†Corresponding author

📋 Table of Contents

🔥 Updates
📷 Introduction
⚙️ Code: DecMem + Wan2.1-T2V-1.3B
🤗 Acknowledgement
🌟 Citation

Note: This open-source repository is a reference implementation built on Wan2.1-T2V-1.3B. The original model was trained on an internal pretrained model, so we will not open-source that codebase.

🔥 Updates

[2026.06.01]: Released the training and inference code and the checkpoints.
[2026.06.01]: Released the project page and the arXiv version.

📷 Introduction

We systematically reveal the root cause of the limited long-horizon world generation capability of naïve dense-attention designs and propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation.

(a) Sparse Global Memory (SGM): Combines a block-level sparse retrieval module with a context-aware attention module to retrieve fine-grained long-term memory end to end, giving efficient access to global history at bounded cost.

(b) Anchored Local Memory (ALM): Keeps short-term transitions smooth and supports stable, high-quality extrapolation by anchoring generation on recent local context.

(c) Capability: DecMem enables minute-level controllable long video generation with high fidelity, consistency and efficiency.
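
To make the idea of block-level sparse retrieval in (a) concrete, here is a minimal sketch in plain Python. It is not the DecMem or VSA implementation: the function name `block_sparse_attention`, the mean-pooled block summary, and the top-k selection rule are all assumptions chosen for illustration. It scores each history block by its mean key, keeps the `top_k` best blocks, and runs ordinary softmax attention over only the tokens in those blocks.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(
    query: Vector,
    keys: List[Vector],
    values: List[Vector],
    block_size: int,
    top_k: int,
) -> Tuple[Vector, List[int]]:
    """Hypothetical single-query block-sparse attention (illustration only).

    Assumes a non-empty history. Returns the attended output and the
    indices of the selected blocks.
    """
    n_blocks = math.ceil(len(keys) / block_size)
    blocks = [
        range(b * block_size, min((b + 1) * block_size, len(keys)))
        for b in range(n_blocks)
    ]

    # Coarse stage: score each block by the query against its mean key.
    block_scores = []
    for idx in blocks:
        mean_key = [
            sum(keys[i][d] for i in idx) / len(idx) for d in range(len(query))
        ]
        block_scores.append(dot(query, mean_key))

    # Keep only the top_k highest-scoring blocks.
    ranked = sorted(range(n_blocks), key=lambda b: block_scores[b], reverse=True)
    selected = sorted(ranked[:top_k])
    token_ids = [i for b in selected for i in blocks[b]]

    # Fine stage: scaled dot-product attention over the selected tokens only.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in token_ids])
    out = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids))
        for d in range(len(values[0]))
    ]
    return out, selected
```

In this sketch the fine stage touches at most `top_k * block_size` tokens per query regardless of history length, which is one way to read "bounded cost"; the coarse stage still visits one summary per block.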

⚙️ Code: DecMem + Wan2.1-T2V-1.3B

Requirements

We tested this repo on the following setup:

GPU: NVIDIA GPU with at least 24 GB VRAM.
- LTM (long-term memory) via Video Sparse Attention requires SM 90a (H100 / H200 / H800). The block-sparse CUDA kernel is built for sm_90a and is not supported on A100 or consumer GPUs.
- Dense-only training / inference (use_ltm_attn: false) can run on A100 and other CUDA GPUs, but the full DecMem model with LTM needs the hardware above.
RAM: 64 GB recommended.
CUDA: 12.x toolkit matching the PyTorch build.
Compiler: gcc / g++ ≥ 11 (C++20 required by the VSA kernel).
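
The hardware rule above can be checked in code before picking a config. The sketch below is an illustration, not part of this repo: `supports_ltm_kernel` is a hypothetical helper that takes the `(major, minor)` tuple returned by `torch.cuda.get_device_capability` and reports whether it is SM 9.0, the only capability the README lists for the block-sparse kernel.

```python
from typing import Tuple


def supports_ltm_kernel(capability: Tuple[int, int]) -> bool:
    """Return True only for compute capability 9.0 (H100 / H200 / H800).

    Hypothetical helper: the kernel is built for sm_90a, so older
    (A100 = 8.0) and consumer (e.g. RTX 4090 = 8.9) GPUs are rejected.
    """
    return tuple(capability) == (9, 0)


def describe(capability: Tuple[int, int]) -> str:
    """Suggest a value for use_ltm_attn given a device capability."""
    if supports_ltm_kernel(capability):
        return "use_ltm_attn: true"
    return "use_ltm_attn: false"
```

With PyTorch installed, a call such as `describe(torch.cuda.get_device_capability(0))` would give the suggested setting for GPU 0.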

Installation

1. Core environment

Create a conda environment and install dependencies:

git clone https://github.com/KlingAIResearch/DecMem.git
cd DecMem
conda create -n decmem python=3.10 -y
conda activate decmem
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

2. Video Sparse Attention (VSA) kernel

Our LTM module is implemented with a modified VSA kernel, so the kernel is required when training or running inference with use_ltm_attn: true (e.g. configs/decmem.yaml). Install it from source:

cd video_sparse_attn
python setup.py install

Verify:

python -c "
import torch, vsa
from vsa.block_sparse_wrapper import block_sparse_attn_SM90
print('device cap:', torch.cuda.get_device_capability(0))
print('SM90 op:', block_sparse_attn_SM90)
"

Checkpoints

Download the Wan2.1 backbone (VAE + tokenizer weights used by the pipeline):

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
    --local-dir-use-symlinks False \
    --local-dir wan_models/Wan2.1-T2V-1.3B

Download the trained DecMem checkpoints from Hugging Face:

huggingface-cli download KlingTeam/DecMem --local-dir checkpoints

Checkpoint layout expected by training / inference scripts:

checkpoints/
└── decmem.pt             # released weights

Quick start

We provide example video-pose pairs for quick inference. Inference runs as block-by-block causal denoising with a KV cache.

bash scripts/infer_example.sh
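
For readers new to block-by-block causal denoising, the control flow can be sketched with a toy model. Everything here is hypothetical (`generate_blockwise`, `toy_denoise_step`, and the list-based `cache` are not names from this repo), and a real model caches per-layer key/value tensors rather than clean blocks. The point is only the loop structure: each block starts from noise, is denoised over several steps while reading the cached context, and is then appended to the cache for later blocks.

```python
import random
from typing import Callable, List

Block = List[float]


def generate_blockwise(
    denoise_step: Callable[[Block, int, List[Block]], Block],
    num_blocks: int,
    block_len: int,
    num_steps: int,
    prefix: List[Block],
    seed: int = 0,
) -> List[Block]:
    """Generate blocks one at a time, conditioning on all earlier blocks."""
    rng = random.Random(seed)
    cache = [list(b) for b in prefix]  # stand-in for the KV cache
    outputs = []
    for _ in range(num_blocks):
        x = [rng.gauss(0.0, 1.0) for _ in range(block_len)]  # start from noise
        for step in range(num_steps, 0, -1):
            x = denoise_step(x, step, cache)  # the cache is read-only here
        cache.append(x)  # the finished block becomes context for the next one
        outputs.append(x)
    return outputs


def toy_denoise_step(x: Block, step: int, cache: List[Block]) -> Block:
    """Toy denoiser: move each value towards the mean of the latest cached block."""
    target = sum(cache[-1]) / len(cache[-1]) if cache else 0.0
    return [v + (target - v) / step for v in x]
```

Because earlier blocks are never revisited, the cost of generating a new block depends on the cached context rather than on re-denoising the whole sequence.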

Data

DecMem trains on WorldMem gameplay data: paired MP4 videos and NPZ action/pose files indexed by a CSV.

Download dataset

Download the Minecraft dataset from Hugging Face (see the WorldMem dataset section).

Place the dataset in the following directory structure:

data/
├── training/
├── validation/
└── test/

Each clip is an MP4 file with a matching NPZ file (same basename) in the same directory.

Build metadata CSV

After download, run the helper script to scan MP4/NPZ pairs and generate index files:

python scripts/build_worldmem_csv.py

This writes train.csv (from training/) and test.csv (from test/) under data/worldmem_minecraft_dataset/. Rows are shuffled with seed 42 by default; use --seed to change it.
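
If you want to see what such an indexing step involves, the sketch below pairs MP4 and NPZ files by basename and writes a shuffled CSV. It is not the contents of scripts/build_worldmem_csv.py: the helper names `find_pairs` and `write_index` are hypothetical, and only the column names and the default seed of 42 come from this README.

```python
import csv
import random
from pathlib import Path
from typing import List, Tuple


def find_pairs(split_dir: Path) -> List[Tuple[str, str]]:
    """Collect (video, action) pairs that share a basename in split_dir."""
    pairs = []
    for mp4 in sorted(split_dir.glob("*.mp4")):
        npz = mp4.with_suffix(".npz")
        if npz.exists():  # skip clips without a matching action file
            pairs.append((mp4.as_posix(), npz.as_posix()))
    return pairs


def write_index(pairs: List[Tuple[str, str]], csv_path: Path, seed: int = 42) -> None:
    """Shuffle the pairs with a fixed seed and write them as a two-column CSV."""
    rows = list(pairs)
    random.Random(seed).shuffle(rows)
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "action_path"])
        writer.writerows(rows)
```

For example, `write_index(find_pairs(Path("data/training")), Path("data/train.csv"))` would index the training split under the directory layout described in this README.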

Directory layout

After download and indexing:

data/
├── train.csv          # training index (generated)
├── test.csv           # evaluation index (generated)
├── training/
│   └── *.mp4 + *.npz  # paired clips
├── validation/
│   └── *.mp4 + *.npz
└── test/
    └── *.mp4 + *.npz

CSV format

Each row has two columns (utils/dataset.py → WorldMemCSVDataset):

| Column | Description |
| --- | --- |
| `video_path` | Path to an MP4 video clip. |
| `action_path` | Path to the matching NPZ file. |

Example (data/minecraft/train.csv):

video_path,action_path
data/training/000001.mp4,data/training/000001.npz
data/training/000002.mp4,data/training/000002.npz
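
A quick sanity check of an index file can catch mismatched pairs before training. The following sketch is an assumption-laden illustration, not the logic of `WorldMemCSVDataset`: `parse_index` is a hypothetical helper that checks the header and verifies that each video and action path share a basename.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Tuple

EXPECTED_COLUMNS = ["video_path", "action_path"]


def parse_index(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse CSV lines and validate the header and the basename pairing."""
    reader = csv.DictReader(lines)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        video, action = row["video_path"], row["action_path"]
        if Path(video).stem != Path(action).stem:
            raise ValueError(f"basename mismatch: {video} vs {action}")
        rows.append((video, action))
    return rows
```

Since an open file iterates over its lines, `parse_index(open("data/train.csv", newline=""))` works on a file on disk as well as on a list of strings.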

Train

Note:

Update MASTER_ADDR (and NNODES for multi-node) in the launch scripts before distributed training.

Dense training (use_ltm_attn: false) can run on A100; LTM training requires H100/H800/H200 with the block-sparse kernel installed.

Set WANDB_KEY / wandb_entity in the config or disable wandb with --disable-wandb.

This open-source version trains DecMem in three stages; each stage resumes from the checkpoint produced by the previous stage.

Stage 1: Dense causal diffusion (Short)

# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 dense_short
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 dense_short
# On node 2:...

Stage 2: Dense causal diffusion (Long)

Fine-tune from the short-sequence dense checkpoint produced in Stage 1; set generator_ckpt in the config:

# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 dense
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 dense
# On node 2:...

Stage 3: Decoupled memory (STM + LTM)

Fine-tune from the long-sequence dense checkpoint produced in Stage 2; set generator_ckpt in the config:

# Edit NNODES, MASTER_ADDR, NPROC_PER_NODE in the script first.
# On node 0:
bash scripts/train_worldmem_multinode_manual.sh 0 decmem
# On node 1:
bash scripts/train_worldmem_multinode_manual.sh 1 decmem
# On node 2:...

You can set RESUME_CKPT in the training scripts to resume from a full training checkpoint (weights + optimizer + step).

Inference

Once you finish training a stage, you can load the corresponding checkpoint and config for inference.

Common overrides:

EXP_NAME=decmem \
CHECKPOINT_PATH=checkpoints/decmem.pt \
T_LAT=160 \
NUM_CONDITION_FRAMES=56 \
GUIDANCE_SCALE=5.0 \
NUM_INFERENCE_STEPS=50 \
bash scripts/infer.sh

# Multi-GPU inference
NUM_GPUS=8 bash scripts/infer.sh

| Argument | Description |
| --- | --- |
| `NUM_CONDITION_FRAMES` | Number of clean prefix latent frames (conditioning video prefix). |
| `T_LAT` | Total number of latent frames in the sequence. |
| `GUIDANCE_SCALE` | CFG scale on actions; `≤ 1.0` disables CFG. |
| `N` | Number of inference samples. |
| `NUM_INFERENCE_STEPS` | Number of denoising steps. |
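
To illustrate what a CFG scale on actions means and why `≤ 1.0` can disable it, here is a sketch of the standard classifier-free guidance combination. It assumes the usual formula `uncond + scale * (cond - uncond)`; `guided_prediction` and the `predict` callable are hypothetical names, and the exact logic in infer.py may differ.

```python
from typing import Callable, List, Optional

Prediction = List[float]


def guided_prediction(
    predict: Callable[[List[float], Optional[object]], Prediction],
    x: List[float],
    action: object,
    guidance_scale: float,
) -> Prediction:
    """Combine action-conditioned and unconditioned predictions with CFG."""
    cond = predict(x, action)
    if guidance_scale <= 1.0:
        # CFG disabled: skip the second forward pass entirely.
        return cond
    uncond = predict(x, None)  # None stands for "no action" in this sketch
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
```

With guidance enabled, each denoising step costs two forward passes instead of one, which is the practical reason to disable CFG when speed matters more than action adherence.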

🤗 Acknowledgement

This codebase is built on top of:

WorldMem for action-pose data format and pose utilities.
PRoPE for camera-relative positional encoding.
Video Sparse Attention for the LTM sparse attention kernel.
Self Forcing and CausVid for the causal diffusion training and teacher-forcing codebase.
Wan2.1 as the base video diffusion backbone.

🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

@misc{yang2026decmemminutelongconsistentworld,
      title={DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory}, 
      author={Zhenhao Yang and Xiaoshi Wu and Zhengyao Lv and Xiaoyu Shi and Xintao Wang and Pengfei Wan and Kun Gai and Kwan-Yee K. Wong},
      year={2026},
      eprint={2605.31336},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.31336}, 
}
