KlingAIResearch/UniVideo

UniVideo: Unified Understanding, Generation, and Editing for Videos

Cong Wei*¹,² · Quande Liu†² · Zixuan Ye² · Qiulin Wang² · Xintao Wang²

Pengfei Wan² · Kun Gai² · Wenhu Chen†¹

¹University of Waterloo    ²Kling Team, Kuaishou Technology
*Work done during an internship at Kling Team, Kuaishou Technology. †Corresponding author.

   

🚀 Supported Tasks

UniVideo accepts flexible input and output configurations and supports a wide range of multimodal tasks:

| Task | Input Type | Output | Task ID | Description |
| --- | --- | --- | --- | --- |
| Image/Video Understanding | Image🖼️ / Video🎬 + Text📝 | Text📝 | understanding | Multimodal analysis and captioning. |
| Text-to-Image | Text📝 | Image🖼️ | t2i | Generating images from text prompts. |
| Text-to-Video | Text📝 | Video🎬 | t2v | Generating videos from text prompts. |
| Image-to-Video | Image🖼️ + Text📝 | Video🎬 | i2v | Animating a static image into a video. |
| Image Editing | Image🖼️ + Text📝 | Image🖼️ | i2i_edit | Instruction-based image editing. |
| In-context Image Editing | Image🖼️ + Image🖼️ + Text📝 | Image🖼️ | i+i2i_edit | Editing an image based on a reference image. |
| In-context Generation | Image🖼️ × N + Text📝 | Image🖼️ / Video🎬 | multiid | Multi-subject generation. |
| Video Editing | Video🎬 + Text📝 | Video🎬 | v2v_edit | Instruction-based video manipulation and stylization. |
| In-context Video Editing | Image🖼️ + Video🎬 + Text📝 | Video🎬 | i+v2v_edit | Reference-based manipulation: addition, deletion, swapping, and stylization. |

🔔 News

  • [2026-06-03]: The training script and instructions are now available in TRAINING.md.
  • [2026-01-30]: UniVideo was accepted at ICLR 2026 🎉
  • [2026-01-07]: Released Code and Model.
  • [2025-10-09]: Released the arXiv preprint and the project page.

How to use

1. Installation

conda env create -f environment.yml
conda activate univideo

This environment is tested with:

  • Python 3.11
  • PyTorch 2.4.1 + CUDA 12.1
  • diffusers 0.34.0
  • transformers 4.51.3

If creating the environment from environment.yml fails, set it up manually:

conda create -n univideo python=3.11 -y
conda activate univideo
conda install pytorch==2.4.1 torchvision pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
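
After either install path, a short check can confirm that the installed packages match the tested versions listed above. The sketch below is not part of the repository; it assumes the packages were installed under their usual distribution names (`torch`, `diffusers`, `transformers`) and ignores local build tags such as `+cu121`.

```python
from importlib import metadata

# Versions listed above as tested. The distribution names are an assumption
# about how the packages were installed.
TESTED = {"torch": "2.4.1", "diffusers": "0.34.0", "transformers": "4.51.3"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(tested, lookup=installed_version):
    """Map package -> (tested, found) for every package that differs."""
    mismatches = {}
    for package, wanted in tested.items():
        found = lookup(package)
        # A local build tag such as "2.4.1+cu121" still counts as 2.4.1.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(TESTED).items():
        print(f"{package}: tested with {wanted}, found {found}")
```

An empty result means the three packages match the tested versions. A mismatch is not necessarily a problem, only a departure from the tested setup.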

2. Download Checkpoint

Download the UniVideo checkpoint to a local path, for example ckpts/:

python download_ckpt.py --variant hidden

We provide two UniVideo checkpoint variants, as described in Section 3.2 of the arXiv preprint:

  • Variant 1 (img, video, txt -> mllm -> last layer hidden -> mmdit)
    Image, video, and text inputs are processed by the MLLM, and the final hidden states are fed into the MMDiT backbone.

  • Variant 2 (img, video, txt, queries -> mllm -> txt + queries last layer hidden -> mmdit)
    Image, video, text, and queries are processed by the MLLM. The final hidden states of text and queries are used as inputs to MMDiT.
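
The difference between the two variants can be pictured as a choice of which final-layer MLLM hidden states reach MMDiT. The sketch below is a toy illustration that reads the two descriptions literally, not the repository's implementation. The function name, the token-kind labels, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
def select_mmdit_conditioning(hidden_states, token_kinds, variant):
    """Pick which final-layer MLLM hidden states are passed on to MMDiT.

    hidden_states: one entry per token (floats here for brevity).
    token_kinds:   parallel list of "image", "video", "text", or "query".
    variant:       "hidden" (Variant 1) or "queries" (Variant 2).
    """
    if len(hidden_states) != len(token_kinds):
        raise ValueError("hidden_states and token_kinds must align")
    if variant == "hidden":
        # Variant 1 (assumed reading): every final hidden state is forwarded.
        return list(hidden_states)
    if variant == "queries":
        # Variant 2: only text and query positions are forwarded.
        keep = {"text", "query"}
        return [h for h, k in zip(hidden_states, token_kinds) if k in keep]
    raise ValueError(f"unknown variant: {variant!r}")
```

For example, with token kinds `["image", "text", "video", "query"]`, the `"queries"` variant keeps only the second and fourth states, while the `"hidden"` variant keeps whatever states its input contains.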

Download the queries-based checkpoint with:

python download_ckpt.py --variant queries

Or download both variants without deleting either local directory:

python download_ckpt.py --variant all

3. Inference

We provide demo inference scripts that show how to load and run the UniVideo pipeline by setting pipeline_kwargs for different inputs. Feel free to adapt them to your own inputs and setup.

1. Basic Understanding & Generation

# Image/Video Captioning & Understanding
python univideo_inference.py --demo_task understanding --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Text-to-Video (T2V)
python univideo_inference.py --demo_task t2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Text-to-Image (T2I)
python univideo_inference.py --demo_task t2i --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Image-to-Video (I2V)
python univideo_inference.py --demo_task i2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

2. Instruction-based Editing

# Image Editing 
python univideo_inference.py --demo_task image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Video Editing
python univideo_inference.py --demo_task video_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Video Stylization
python univideo_inference.py --demo_task stylization --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

3. In-Context Tasks

# In context video generation
python univideo_inference.py --demo_task in_context_video_gen --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# In context image editing
python univideo_inference.py --demo_task in_context_image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# In context video editing
## addition
python univideo_inference.py --demo_task in_context_video_edit_addition --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## swap
python univideo_inference.py --demo_task in_context_video_edit_swap --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## style
python univideo_inference.py --demo_task in_context_video_edit_style --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
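
The commands above differ only in the value passed to `--demo_task`, so they can be generated rather than typed out. The helper below is a hypothetical convenience wrapper, not a script shipped with the repository. It uses only the `--demo_task` and `--config` flags and the task names shown above.

```python
import shlex

HIDDEN_CONFIG = "configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml"

# Demo task names exactly as they appear in the commands above.
DEMO_TASKS = [
    "understanding", "t2v", "t2i", "i2v",
    "image_edit", "video_edit", "stylization",
    "in_context_video_gen", "in_context_image_edit",
    "in_context_video_edit_addition", "in_context_video_edit_swap",
    "in_context_video_edit_style",
]


def build_command(demo_task, config=HIDDEN_CONFIG):
    """Return the argv list for one demo task."""
    if demo_task not in DEMO_TASKS:
        raise ValueError(f"unknown demo task: {demo_task!r}")
    return ["python", "univideo_inference.py",
            "--demo_task", demo_task, "--config", config]


if __name__ == "__main__":
    # Print the commands; pass each argv list to subprocess.run to execute.
    for task in DEMO_TASKS:
        print(shlex.join(build_command(task)))
```

Passing a different path as `config` switches every command to that configuration without touching the task list.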

4. Multi-GPU README Sweep

To run the README demo tasks across multiple local GPUs while keeping each task's default hyperparameters, use:

python scripts/run_readme_inference_sweep.py \
  --gpus 0,1,2 \
  --max-parallel 3 \
  --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml \
  --output-root outputs/readme-inference

The launcher writes one log per task under outputs/readme-inference/logs. Lower --max-parallel if checkpoint loading saturates local storage.
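
One common way to spread independent tasks over several GPUs is round-robin assignment. The sketch below illustrates that general pattern only. How `scripts/run_readme_inference_sweep.py` actually schedules tasks is not described here and may differ, and both function names are hypothetical.

```python
def parse_gpu_list(text):
    """Turn a --gpus style string such as "0,1,2" into [0, 1, 2]."""
    return [int(part) for part in text.split(",") if part.strip()]


def assign_round_robin(tasks, gpus):
    """Map each GPU id to the tasks it would run, in order."""
    if not gpus:
        raise ValueError("need at least one GPU id")
    plan = {gpu: [] for gpu in gpus}
    for index, task in enumerate(tasks):
        # Task i goes to GPU i modulo the number of GPUs.
        plan[gpus[index % len(gpus)]].append(task)
    return plan
```

In a scheme like this, each task's subprocess is typically pinned to its GPU by setting the `CUDA_VISIBLE_DEVICES` environment variable before launch.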

UniVideo variant 2

To use the queries-based variant of UniVideo, pass the queries config in place of the hidden one:

--config configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml

4. Training

We provide an example training setting using open-source data so users can run a small training job and verify the training pipeline. See TRAINING.md for the data schema, dataset preparation details, and full training options.

python download_ckpt.py --variant hidden
python -m pip install --target .deps/pyarrow pyarrow
bash scripts/prepare_smoke_data.sh
torchrun --standalone --nproc_per_node 8 \
  train/train_univideo.py configs/train_multitask_129f_hybrid_smoke.yaml

5. Evaluation

We provide scripts for evaluating UniVideo on the GenEval, ImgEdit, GEdit, and VBench benchmarks. See EVAL.md for details.

Acknowledgement

  • HunyuanVideo: the base video generation model used in this work. Thanks to the authors for their excellent contribution.
  • Qwen2.5-VL: the base VLM used in this work. Thanks to the authors for their excellent contribution.
  • MetaQueries: we adopt their query implementation. Thanks to the authors for their excellent contribution.

🌟 Citation

If you find UniVideo useful for your research and applications, please cite using this BibTeX:

@article{wei2025univideo,
  title={Univideo: Unified understanding, generation, and editing for videos},
  author={Wei, Cong and Liu, Quande and Ye, Zixuan and Wang, Qiulin and Wang, Xintao and Wan, Pengfei and Gai, Kun and Chen, Wenhu},
  journal={arXiv preprint arXiv:2510.08377},
  year={2025}
}

About

[ICLR 2026] UniVideo: Unified Understanding, Generation, and Editing for Videos
