UniVideo: Unified Understanding, Generation, and Editing for Videos

Cong Wei¹,²,* Quande Liu²,† Zixuan Ye² Qiulin Wang² Xintao Wang²

Pengfei Wan² Kun Gai² Wenhu Chen¹,†

¹University of Waterloo ²Kling Team, Kuaishou Technology
*Work done during an internship at Kling Team, Kuaishou Technology. †Corresponding author.

🚀 Supported Tasks

UniVideo accepts flexible input and output configurations and supports a wide range of multimodal tasks:

Task	Input	Output	Task ID	Description
Image/Video Understanding	Image🖼️ / Video🎬 + Text📝	Text📝	`understanding`	Multimodal analysis and captioning.
Text-to-Image	Text📝	Image🖼️	`t2i`	Generating images from text prompts.
Text-to-Video	Text📝	Video🎬	`t2v`	Generating videos from text prompts.
Image-to-Video	Image🖼️ + Text📝	Video🎬	`i2v`	Animating a static image into a video.
Image Editing	Image🖼️ + Text📝	Image🖼️	`i2i_edit`	Instruction-based image editing.
In-context Image Editing	Image🖼️ + Image🖼️ + Text📝	Image🖼️	`i+i2i_edit`	Editing an image based on a reference image.
In-context Generation	Image🖼️ × N + Text📝	Image🖼️ / Video🎬	`multiid`	Multi-subject generation.
Video Editing	Video🎬 + Text📝	Video🎬	`v2v_edit`	Instruction-based video manipulation and stylization.
In-context Video Editing	Image🖼️ + Video🎬 + Text📝	Video🎬	`i+v2v_edit`	Reference-based manipulation: addition, deletion, swapping, and stylization.

🔔 News

[2026-06-03]: The training script and instructions are now available in TRAINING.md.
[2026-01-30]: UniVideo was accepted at ICLR 2026 🎉
[2026-01-07]: Released the code and model.
[2025-10-09]: Released the arXiv preprint and the project page.

How to use

1. Installation

conda env create -f environment.yml
conda activate univideo

This environment is tested with:

Python 3.11
PyTorch 2.4.1 + CUDA 12.1
diffusers 0.34.0
transformers 4.51.3
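
To confirm that an existing environment matches the tested versions listed above, a small check along these lines may help. It is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the packages expose standard distribution metadata under the names torch, diffusers, and transformers (which is usually, but not always, the case for conda-installed builds).

```python
import sys
from importlib import metadata

# Versions listed above as tested; other versions may work but are untested here.
TESTED_PACKAGES = {
    "torch": "2.4.1",
    "diffusers": "0.34.0",
    "transformers": "4.51.3",
}
TESTED_PYTHON = (3, 11)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(lookup=installed_version, python_version=None):
    """List readable differences between this environment and the tested one."""
    if python_version is None:
        python_version = tuple(sys.version_info[:2])
    problems = []
    if tuple(python_version) != TESTED_PYTHON:
        found_py = f"{python_version[0]}.{python_version[1]}"
        problems.append(f"python: found {found_py}, tested 3.11")
    for package, tested in TESTED_PACKAGES.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package}: not installed, tested {tested}")
        # Local build tags such as "+cu121" are ignored when comparing.
        elif found.split("+")[0] != tested:
            problems.append(f"{package}: found {found}, tested {tested}")
    return problems


if __name__ == "__main__":
    for line in find_mismatches() or ["environment matches the tested versions"]:
        print(line)
```

A mismatch does not necessarily mean the code will fail; it only indicates that the combination differs from the one listed as tested.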

If creating the environment from environment.yml fails, set it up manually instead:

conda create -n univideo python=3.11 -y
conda activate univideo
conda install pytorch==2.4.1 torchvision pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt

2. Download Checkpoint

Download the UniVideo checkpoint to a local path, for example ckpts/:

python download_ckpt.py --variant hidden

We provide two UniVideo checkpoint variants, as described in Section 3.2 of the arXiv preprint:

Variant 1 (img, video, txt -> MLLM -> last-layer hidden states -> MMDiT)
Image, video, and text inputs are processed by the MLLM, and its final-layer hidden states are fed into the MMDiT backbone.
Variant 2 (img, video, txt, queries -> MLLM -> last-layer hidden states of txt + queries -> MMDiT)
Image, video, text, and query inputs are processed by the MLLM. The final-layer hidden states of the text and query tokens are used as inputs to the MMDiT.
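
The difference between the two variants comes down to which final-layer positions reach the MMDiT. The toy sketch below illustrates that selection on plain Python lists. It is not the repository's implementation: the function name, the token-type labels, and the two-element vectors are hypothetical, and it assumes that "final hidden states" in Variant 1 means every position in the sequence.

```python
def select_conditioning(hidden_states, variant):
    """Pick which final-layer hidden states are forwarded to the MMDiT.

    hidden_states: list of (token_type, vector) pairs, one per sequence
    position, where token_type is "image", "video", "text", or "query".
    These labels are illustrative and not the repository's data format.
    """
    if variant == "hidden":
        # Variant 1 (assumption): every final hidden state is forwarded.
        return [vec for _, vec in hidden_states]
    if variant == "queries":
        # Variant 2: only the text and query positions are forwarded.
        return [vec for kind, vec in hidden_states if kind in ("text", "query")]
    raise ValueError(f"unknown variant: {variant!r}")


# Toy sequences with made-up two-dimensional "hidden states".
without_queries = [("image", [0.1, 0.2]), ("video", [0.3, 0.4]), ("text", [0.5, 0.6])]
with_queries = without_queries + [("query", [0.7, 0.8])]

print(len(select_conditioning(without_queries, "hidden")))
print(select_conditioning(with_queries, "queries"))
```

In the second case the image and video positions still influence the result through attention inside the MLLM, but their own hidden states are not passed on.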

Download the queries-based checkpoint with:

python download_ckpt.py --variant queries

Or download both variants without deleting either local directory:

python download_ckpt.py --variant all

3. Inference

The demo inference scripts show how to load and run the UniVideo pipeline, with pipeline_kwargs set up for each type of input. Feel free to adapt them to your own inputs and setup.

1. Basic Understanding & Generation

# Image/Video Captioning & Understanding
python univideo_inference.py --demo_task understanding --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Text-to-Video (T2V)
python univideo_inference.py --demo_task t2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Text-to-Image (T2I)
python univideo_inference.py --demo_task t2i --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Image-to-Video (I2V)
python univideo_inference.py --demo_task i2v --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

2. Instruction-based Editing

# Image Editing 
python univideo_inference.py --demo_task image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Video Editing
python univideo_inference.py --demo_task video_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# Video Stylization
python univideo_inference.py --demo_task stylization --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

3. In-Context Tasks

# In-context video generation
python univideo_inference.py --demo_task in_context_video_gen --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# In-context image editing
python univideo_inference.py --demo_task in_context_image_edit --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml

# In-context video editing
## addition
python univideo_inference.py --demo_task in_context_video_edit_addition --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## swap
python univideo_inference.py --demo_task in_context_video_edit_swap --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
## style
python univideo_inference.py --demo_task in_context_video_edit_style --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml
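
All of the commands above share one shape and differ only in the --demo_task value, so they can be generated from a list. The sketch below builds the argument list for each demo task and prints the commands instead of running them. The helper and constant names are hypothetical; the task names and config path are copied from the commands above.

```python
import shlex
import sys

# Demo task names as they appear in the commands above.
DEMO_TASKS = [
    "understanding", "t2v", "t2i", "i2v",
    "image_edit", "video_edit", "stylization",
    "in_context_video_gen", "in_context_image_edit",
    "in_context_video_edit_addition",
    "in_context_video_edit_swap",
    "in_context_video_edit_style",
]
DEFAULT_CONFIG = "configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml"


def build_command(task, config=DEFAULT_CONFIG, python=sys.executable):
    """Return the argv list for one demo task, rejecting unknown task names."""
    if task not in DEMO_TASKS:
        raise ValueError(f"unknown demo task: {task!r}")
    return [python, "univideo_inference.py", "--demo_task", task, "--config", config]


if __name__ == "__main__":
    # Dry run: print each command. To execute one, pass the list to
    # subprocess.run(cmd, check=True) from the repository root.
    for task in DEMO_TASKS:
        print(shlex.join(build_command(task, python="python")))
```

Building an argument list rather than a shell string avoids quoting problems if a config path ever contains spaces.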

4. Multi-GPU README Sweep

To run the README demo tasks across multiple local GPUs while keeping each task's default hyperparameters, use:

python scripts/run_readme_inference_sweep.py \
  --gpus 0,1,2 \
  --max-parallel 3 \
  --config configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml \
  --output-root outputs/readme-inference

The launcher writes one log per task under outputs/readme-inference/logs. Lower --max-parallel if checkpoint loading saturates local storage.
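
To make the interaction between --gpus and --max-parallel concrete, the sketch below shows one simple way a launcher could spread tasks over GPUs: tasks run in waves of at most max_parallel jobs, each pinned to one GPU through the CUDA_VISIBLE_DEVICES environment variable. This is an illustration under stated assumptions, not the logic of scripts/run_readme_inference_sweep.py, which may schedule differently; all function names here are hypothetical.

```python
import os


def parse_gpus(spec):
    """Turn a comma-separated GPU list such as "0,1,2" into [0, 1, 2]."""
    return [int(part) for part in spec.split(",") if part.strip()]


def plan_waves(tasks, gpus, max_parallel):
    """Group tasks into waves of (task, gpu) pairs.

    Each wave holds at most min(max_parallel, len(gpus)) tasks, and no GPU
    is used twice within a wave.
    """
    width = min(max_parallel, len(gpus))
    if width < 1:
        raise ValueError("need at least one GPU and max_parallel >= 1")
    waves = []
    for start in range(0, len(tasks), width):
        chunk = tasks[start:start + width]
        waves.append(list(zip(chunk, gpus)))
    return waves


def env_for(gpu, base_env=None):
    """Copy the environment and restrict the child process to a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return env


if __name__ == "__main__":
    plan = plan_waves(["t2v", "t2i", "i2v", "understanding"], parse_gpus("0,1,2"), 3)
    for index, wave in enumerate(plan):
        print(index, wave)
```

With a lower max_parallel, fewer checkpoints load at the same time, which is the effect the note about storage saturation relies on.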

UniVideo Variant 2

To use the queries-based variant of UniVideo, pass its config file to any of the commands above:

--config configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml
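
When switching between variants in a script, a small lookup keeps the checkpoint variant and the config file in step. The sketch below assumes that the --variant names used by download_ckpt.py (hidden, queries) correspond to the two config files named in this README; the function name and the repo_root parameter are hypothetical.

```python
from pathlib import Path

# Config files named in this README, keyed by the download_ckpt.py variant name.
# The pairing of variant name and config file is an assumption based on the file names.
VARIANT_CONFIGS = {
    "hidden": "configs/univideo_qwen2p5vl7b_hidden_hunyuanvideo.yaml",
    "queries": "configs/univideo_qwen2p5vl7b_queries_hunyuanvideo.yaml",
}


def config_for_variant(variant, repo_root="."):
    """Return the config path for a variant, relative to the repository root."""
    try:
        relative = VARIANT_CONFIGS[variant]
    except KeyError:
        known = ", ".join(sorted(VARIANT_CONFIGS))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}") from None
    return Path(repo_root) / relative


if __name__ == "__main__":
    for name in VARIANT_CONFIGS:
        path = config_for_variant(name)
        status = "found" if path.exists() else "missing"
        print(f"{name}: {path} ({status})")
```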

4. Training

We provide an example training setting using open-source data so users can run a small training job and verify the training pipeline. See TRAINING.md for the data schema, dataset preparation details, and full training options.

python download_ckpt.py --variant hidden
python -m pip install --target .deps/pyarrow pyarrow
bash scripts/prepare_smoke_data.sh
torchrun --standalone --nproc_per_node 8 \
  train/train_univideo.py configs/train_multitask_129f_hybrid_smoke.yaml

5. Evaluation

We provide scripts for evaluating UniVideo on the GenEval, ImgEdit, GEdit, and VBench benchmarks. See EVAL.md for details.

Acknowledgement

HunyuanVideo: the base video generation model used in this work. Thanks to the authors for their excellent contribution.
Qwen2.5-VL: the base vision-language model used in this work. Thanks to the authors for their excellent contribution.
MetaQueries: we adopt their query implementation. Thanks to the authors for their excellent contribution.

🌟 Citation

If you find UniVideo useful for your research and applications, please cite using this BibTeX:

@article{wei2025univideo,
  title={Univideo: Unified understanding, generation, and editing for videos},
  author={Wei, Cong and Liu, Quande and Ye, Zixuan and Wang, Qiulin and Wang, Xintao and Wan, Pengfei and Gai, Kun and Chen, Wenhu},
  journal={arXiv preprint arXiv:2510.08377},
  year={2025}
}
