Cong Wei*,1,2 Quande Liu†,2 Zixuan Ye2 Qiulin Wang2 Xintao Wang2
Pengfei Wan2 Kun Gai2 Wenhu Chen†,1
1University of Waterloo
2Kling Team, Kuaishou Technology
*Work done during an internship at Kling Team, Kuaishou Technology
†Corresponding author
UniVideo is flexible in its input and output configurations and supports a wide range of multimodal tasks.
- [2026-06-03]: The training script and instructions are now available in TRAINING.md.
- [2026-01-30]: UniVideo was accepted at ICLR 2026 🎉
- [2026-01-07]: Released Code and Model.
- [2025-10-09]: Released the arXiv preprint and the project page.
conda env create -f environment.yml
conda activate univideo
This environment is tested with:
- Python 3.11
- PyTorch 2.4.1 + CUDA 12.1
- diffusers 0.34.0
- transformers 4.51.3
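To confirm that an existing environment matches the tested versions, a small check along these lines may help. This is a minimal sketch, not part of the repository: the distribution names (`torch`, `diffusers`, `transformers`) are assumed to be the installed package names, and `matches` and `check_environment` are hypothetical helpers.

```python
from importlib import metadata

# Versions listed as tested above. The keys are assumed distribution names.
TESTED = {
    "torch": "2.4.1",
    "diffusers": "0.34.0",
    "transformers": "4.51.3",
}


def matches(installed: str, tested: str) -> bool:
    """True when `installed` equals `tested`, ignoring a local build tag such as '+cu121'."""
    return installed.split("+", 1)[0] == tested


def check_environment(tested=TESTED):
    """Return {package: (installed_version_or_None, ok)} for each tested package."""
    report = {}
    for name, want in tested.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        report[name] = (have, have is not None and matches(have, want))
    return report


if __name__ == "__main__":
    for name, (have, ok) in check_environment().items():
        status = "ok" if ok else "differs from tested version"
        print(f"{name}: installed={have} tested={TESTED[name]} -> {status}")
```

A mismatch does not necessarily mean the code will fail; it only indicates that the combination has not been tested here.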
If creating the environment from the YAML file fails, try these commands instead:
conda create -n univideo python=3.11 -y
conda activate univideo
conda install pytorch==2.4.1 torchvision pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
Download the UniVideo checkpoint to a local path, for example ckpts/:
python download_ckpt.py --variant hidden
We provide two UniVideo checkpoint variants, as described in Section 3.2 of the arXiv preprint:
- Variant 1 (img, video, txt -> mllm -> last layer hidden -> mmdit)
  Image, video, and text inputs are processed by the MLLM, and the final hidden states are fed into the MMDiT backbone.
- Variant 2 (img, video, txt, queries -> mllm -> txt + queries last layer hidden -> mmdit)
  Image, video, text, and queries are processed by the MLLM. The final hidden states of the text and queries are used as inputs to the MMDiT.
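The difference between the two variants comes down to which last-layer hidden states are handed to the MMDiT. The toy sketch below illustrates that selection with plain Python lists. It is one reading of the descriptions above, not the repository's implementation: the token-type labels, the function name, and the mapping of Variant 1 to `"hidden"` and Variant 2 to `"queries"` are assumptions made for illustration.

```python
# Hypothetical token-type labels for a toy MLLM input sequence.
IMAGE, VIDEO, TEXT, QUERY = "image", "video", "text", "query"


def select_condition_states(hidden_states, token_types, variant):
    """Pick which last-layer hidden states are passed on to the MMDiT.

    hidden_states: one vector (list of floats) per token.
    token_types:   one label per token, aligned with hidden_states.
    variant:       "hidden" keeps every token's state (assumed Variant 1);
                   "queries" keeps text and query tokens only (assumed Variant 2).
    """
    if len(hidden_states) != len(token_types):
        raise ValueError("hidden_states and token_types must have the same length")
    if variant == "hidden":
        return list(hidden_states)
    if variant == "queries":
        keep = {TEXT, QUERY}
        return [h for h, t in zip(hidden_states, token_types) if t in keep]
    raise ValueError(f"unknown variant: {variant}")


if __name__ == "__main__":
    states = [[0.0], [1.0], [2.0], [3.0]]
    types = [IMAGE, TEXT, TEXT, QUERY]
    print(len(select_condition_states(states, types, "hidden")))
    print(len(select_condition_states(states, types, "queries")))
```

In the queries-style selection the image and video tokens still influence the result, because the text and query states are computed with attention over the whole sequence; they are simply not forwarded themselves.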
Download the queries-based checkpoint with:
python download_ckpt.py --variant queries
Or download both variants without deleting either local directory:
python download_ckpt.py --variant all
We provide demo inference scripts that show how to load and run the UniVideo pipeline by setting pipeline_kwargs for different inputs. Feel free to adapt them to your own inputs and setup.
# Image/Video Captioning & Understanding
python univideo_inference.py --demo_task understanding --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Text-to-Video (T2V)
python univideo_inference.py --demo_task t2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Text-to-Image (T2I)
python univideo_inference.py --demo_task t2i --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Image-to-Video (I2V)
python univideo_inference.py --demo_task i2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Image Editing
python univideo_inference.py --demo_task image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Video Editing
python univideo_inference.py --demo_task video_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# Video Stylization
python univideo_inference.py --demo_task stylization --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# In context video generation
python univideo_inference.py --demo_task in_context_video_gen --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# In context image editing
python univideo_inference.py --demo_task in_context_image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
# In context video editing
## addition
python univideo_inference.py --demo_task in_context_video_edit_addition --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## swap
python univideo_inference.py --demo_task in_context_video_edit_swap --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## style
python univideo_inference.py --demo_task in_context_video_edit_style --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
To run the README demo tasks across multiple local GPUs while keeping each task's default hyperparameters, use:
python scripts/run_readme_inference_sweep.py \
--gpus 0,1,2 \
--max-parallel 3 \
--config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml \
--output-root outputs/readme-inference
The launcher writes one log per task under outputs/readme-inference/logs.
Lower --max-parallel if checkpoint loading saturates local storage.
To use the queries-based variant of UniVideo, point --config at the queries configuration in any of the commands above:
--config configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml
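When switching between variants often, it can be convenient to build the inference command programmatically. The sketch below only assembles the command line from the script name, flags, and config paths shown in this README; `CONFIGS` and `build_command` are hypothetical helpers, not part of the repository.

```python
import shlex

# Config paths as they appear in this README, keyed by checkpoint variant.
CONFIGS = {
    "hidden": "configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml",
    "queries": "configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml",
}


def build_command(demo_task, variant="hidden"):
    """Return the argv list for one demo task with the chosen variant's config."""
    if variant not in CONFIGS:
        raise ValueError(f"unknown variant: {variant!r}; expected one of {sorted(CONFIGS)}")
    return [
        "python", "univideo_inference.py",
        "--demo_task", demo_task,
        "--config", CONFIGS[variant],
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command("t2v", "queries")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root once the matching checkpoint has been downloaded.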
We provide an example training setting using open-source data so users can run a small training job and verify the training pipeline. See TRAINING.md for the data schema, dataset preparation details, and full training options.
python download_ckpt.py --variant hidden
python -m pip install --target .deps/pyarrow pyarrow
bash scripts/prepare_smoke_data.sh
torchrun --standalone --nproc_per_node 8 \
train/train_univideo.py configs/train_multitask_129f_hybrid_smoke.yaml
We provide scripts for evaluating UniVideo on the GenEval, ImgEdit, GEdit, and VBench benchmarks. See EVAL.md for details.
- HunyuanVideo: the base video generation model used in this work. Thanks to the authors for their excellent contribution.
- Qwen2.5-VL: the base VLM used in this work. Thanks to the authors for their excellent contribution.
- MetaQueries: we adopt their query implementation. Thanks to the authors for their excellent contribution.
If you find UniVideo useful for your research and applications, please cite using this BibTeX:
@article{wei2025univideo,
title={Univideo: Unified understanding, generation, and editing for videos},
author={Wei, Cong and Liu, Quande and Ye, Zixuan and Wang, Qiulin and Wang, Xintao and Wan, Pengfei and Gai, Kun and Chen, Wenhu},
journal={arXiv preprint arXiv:2510.08377},
year={2025}
}
















