VideoRLVR

Introduction

VideoRLVR is a reinforcement learning (RL) recipe for training video reasoning models with verifiable rewards (RLVR). This repository contains the RLVR training code for Wan2.2-TI2V-5B on reasoning tasks including Maze, FlowFree, and Sokoban.

Resources

All released models and datasets are organized in the Hugging Face collection.

| Type | Resource |
| --- | --- |
| Checkpoints | SFT, RLVR |
| Datasets | Train & Test |
| Training backend | Vendored `diffsynth/` package with `training/train.py` |

Code Layout

diffsynth/                                      # vendored DiffSynth package
scripts/
├── _env.sh                                     # shared setup
├── train_sde_grpo_multitask.sh                 # multitask SDE-GRPO launcher
├── train_sde_grpo_multitask_pure_success.sh    # sparse pure-success multitask launcher
├── train_sde_grpo_maze.sh                      # maze-only SDE-GRPO launcher
├── train_sde_grpo_flowfree.sh                  # FlowFree-only SDE-GRPO launcher
├── train_sde_grpo_sokoban.sh                   # Sokoban-only SDE-GRPO launcher
├── inference_multitask.sh                      # multitask inference launcher
├── inference_vbvr_ood.sh                       # OOD inference launcher
├── eval_multitask.sh                           # multitask evaluation launcher
├── eval_maze.sh                                # maze evaluation launcher
└── eval_vbvr_ood.sh                            # OOD evaluation launcher
src/
├── inference_wan.py                            # Wan inference entrypoint
├── rewards/                                    # verifiable reward functions and reward factory
└── eval/                                       # maze, FlowFree, Sokoban, and multitask evaluation helpers
training/
└── train.py                                    # training entrypoint used by accelerate
environment.yml                                 # conda environment
requirements.txt                                # Python dependencies

Reward construction is dispatched through src/rewards/reasoning_reward.py::create_reward. The Wan video pipeline and diffusion utilities are provided by the vendored diffsynth/ package.
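
As a rough picture of what name-based reward dispatch can look like, here is a minimal sketch. It is not the contents of src/rewards/reasoning_reward.py: the registry, the `register_reward` decorator, the `create_reward_sketch` function, the `reached_goal` field, and the placeholder reward body are all hypothetical. Only the reward name `maze_success` comes from the launcher table in this README.

```python
from typing import Callable, Dict

RewardFn = Callable[[dict], float]

# Hypothetical registry: reward name -> reward function.
_REWARDS: Dict[str, RewardFn] = {}


def register_reward(name: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that files a reward function under a string name."""
    def decorator(fn: RewardFn) -> RewardFn:
        _REWARDS[name] = fn
        return fn
    return decorator


@register_reward("maze_success")
def maze_success(rollout: dict) -> float:
    # Placeholder check (assumption): a real verifier would inspect the video.
    return 1.0 if rollout.get("reached_goal") else 0.0


def create_reward_sketch(name: str) -> RewardFn:
    """Look up a reward by name, failing loudly on unknown names."""
    if name not in _REWARDS:
        raise ValueError(f"unknown reward {name!r}; known: {sorted(_REWARDS)}")
    return _REWARDS[name]
```

A registry of this kind keeps each launcher's reward choice down to a single string, which is presumably why the launchers differ only in metadata file and reward name.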

Environment Setup

Create the conda environment and install the shared runtime dependencies:

conda env create -f environment.yml
conda activate video-rl

The environment is intended for Wan2.2 VideoRLVR training with accelerate, the local DiffSynth runtime, verifiable reward computation, and OpenCV-based task evaluation.

Runtime Configuration

All shell launchers source scripts/_env.sh, which defines shared paths. Override these variables before launching a run when your data, models, or outputs live outside the repository defaults:

| Variable | What It Points To |
| --- | --- |
| `REPO_ROOT` | Repository root, inferred automatically from `scripts/_env.sh` |
| `DATA_ROOT` | Dataset root containing train/test folders, metadata CSVs, and referenced videos |
| `MODEL_ROOT` | Local model/checkpoint root |
| `OUTPUT_ROOT` | Root directory for checkpoints, inference outputs, and evaluation results |
| `SFT_CHECKPOINT` | SFT-initialized `.safetensors` checkpoint for Wan2.2-TI2V-5B |

Any paths shown in this README are examples only. Replace them with the real dataset roots, checkpoint paths, output directories, and logging settings for your own machine.
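
If you drive the launchers from Python rather than a shell, the overrides can be passed through the process environment. The sketch below assumes that scripts/_env.sh keeps any value already present in the environment, as the text above suggests. The helper names `build_launch_env` and `launch` are hypothetical, and the paths in the usage comment are placeholders.

```python
import os
import subprocess
from typing import Dict, Optional

# Variables from the table above that a user is expected to override.
# REPO_ROOT is left out because it is inferred automatically.
OVERRIDABLE = {"DATA_ROOT", "MODEL_ROOT", "OUTPUT_ROOT", "SFT_CHECKPOINT"}


def build_launch_env(overrides: Dict[str, str],
                     base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the given variables overridden."""
    unknown = set(overrides) - OVERRIDABLE
    if unknown:
        raise KeyError(f"unexpected variables: {sorted(unknown)}")
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def launch(script: str, overrides: Dict[str, str]) -> None:
    """Run a launcher script with the overridden environment."""
    subprocess.run(["bash", script], env=build_launch_env(overrides), check=True)


# Example call (placeholder paths, not run here):
# launch("scripts/train_sde_grpo_multitask.sh",
#        {"DATA_ROOT": "/path/to/data", "OUTPUT_ROOT": "/path/to/outputs"})
```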

Dataset Format

The expected dataset layout under ${DATA_ROOT} is:

${DATA_ROOT}/
├── train/
│   ├── metadata_all.csv
│   ├── metadata_maze_only.csv
│   ├── metadata_flowfree_only.csv
│   ├── metadata_sokoban_only.csv
│   └── ... per-sample training videos referenced by the CSVs ...
├── test/
│   ├── metadata_all.csv
│   └── ... per-sample test videos referenced by the CSVs ...
└── vbvr_ood/
    └── ... OOD metadata and videos ...

The CSV files follow the DiffSynth UnifiedDataset format. Each row contains the video path, prompt, task type, and task-specific metadata when needed. Sokoban rows may include sokoban_metadata so that process-aware success rewards can verify valid pushes, box movement, and final target state.

See training/train.py for the full set of parsed dataset fields.
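
Before a long run it can help to confirm that every video referenced by a metadata CSV exists on disk. The following is a minimal sketch, not part of the repository. It assumes the video path lives in a column named `video` and that paths are relative to the folder you pass in; both are assumptions, so check training/train.py for the actual field names.

```python
import csv
from pathlib import Path
from typing import List


def missing_videos(metadata_csv: str, base_dir: str,
                   video_column: str = "video") -> List[str]:
    """Return the video paths listed in the CSV that are not files under base_dir."""
    missing: List[str] = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Fail early if the assumed column name is wrong for this dataset.
        if not reader.fieldnames or video_column not in reader.fieldnames:
            raise ValueError(f"column {video_column!r} not found in {metadata_csv}")
        for row in reader:
            rel_path = row[video_column]
            if not (Path(base_dir) / rel_path).is_file():
                missing.append(rel_path)
    return missing
```

For instance, `missing_videos("train/metadata_all.csv", "train")` would return an empty list when the training split is complete under those assumptions.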

Run Training

All training launchers source scripts/_env.sh and call training/train.py through accelerate launch.

Run the default multitask RLVR recipe:

bash scripts/train_sde_grpo_multitask.sh

Available launchers:

| Script | Metadata | Reward | Output Suffix |
| --- | --- | --- | --- |
| `scripts/train_sde_grpo_multitask.sh` | `metadata_all.csv` | `combined_success` | `combined_success_G16_T20_L10_KL0.04_FULL` |
| `scripts/train_sde_grpo_multitask_pure_success.sh` | `metadata_all.csv` | `combined_pure_success` | `pure_success_G16_T20_L10_KL0.04_FULL` |
| `scripts/train_sde_grpo_maze.sh` | `metadata_maze_only.csv` | `maze_success` | `maze_success_G16_T20_L10_KL0.04_FULL` |
| `scripts/train_sde_grpo_flowfree.sh` | `metadata_flowfree_only.csv` | `flowfree_success` | `flowfree_success_G16_T20_L10_KL0.04_FULL` |
| `scripts/train_sde_grpo_sokoban.sh` | `metadata_sokoban_only.csv` | `sokoban_success` | `sokoban_success_G16_T20_L10_KL0.04_FULL` |

For example, to run only the maze subset:

bash scripts/train_sde_grpo_maze.sh
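
The output suffixes in the launcher table appear to encode the run configuration. Reading `G16_T20_L10_KL0.04` as group size 16, 20 inference steps, 10 loss steps, and a KL coefficient of 0.04 is an inference from the base configuration values, not something the README states. The sketch below parses a suffix under that assumed reading; `parse_output_suffix` is a hypothetical helper, and the trailing tag (`FULL`) is kept as an opaque string.

```python
import re
from typing import Dict, Union

# Assumed pattern: <reward>_G<group>_T<steps>_L<loss>_KL<kl>_<tag>
_SUFFIX = re.compile(
    r"^(?P<reward>.+)_G(?P<group>\d+)_T(?P<steps>\d+)"
    r"_L(?P<loss>\d+)_KL(?P<kl>\d+(?:\.\d+)?)_(?P<tag>[A-Za-z0-9]+)$"
)


def parse_output_suffix(suffix: str) -> Dict[str, Union[str, int, float]]:
    """Split an output suffix into its assumed configuration fields."""
    match = _SUFFIX.match(suffix)
    if match is None:
        raise ValueError(f"unrecognized output suffix: {suffix!r}")
    return {
        "reward": match["reward"],
        "group_size": int(match["group"]),
        "inference_steps": int(match["steps"]),
        "loss_steps": int(match["loss"]),
        "kl_coefficient": float(match["kl"]),
        "tag": match["tag"],
    }
```

This is mainly useful when grouping many output directories by configuration after several runs.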

All scripts use the same base training configuration unless edited directly:

| Setting | Value |
| --- | --- |
| Task | `sde_grpo` |
| Dataset root | `${DATA_ROOT}/diffsynth_multitask` |
| Base model | `Wan-AI/Wan2.2-TI2V-5B` |
| Trainable module | `dit` |
| Resolution | `480 x 832` |
| Frames | `81` |
| Learning rate | `5e-6` |
| Epochs | `1` |
| Group size | `16` |
| KL coefficient | `0.04` |
| Inference steps | `20` |
| SDE coefficient | `0.3` |
| SDE cutoff ratio | `0.5` |
| PPO clip epsilon | `0.2` |
| Loss steps | `10` |
| Micro batch size | `2` |
| CFG range | `0.6 1.4` |
| Checkpoint save interval | `100` steps |
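
For intuition about how the group size and the PPO clip epsilon enter a GRPO-style objective, here is a minimal sketch of the commonly used formulation: rewards are normalized within each group of rollouts, and the policy ratio is clipped. It is not taken from training/train.py. The function names are hypothetical, and the SDE sampling, the KL term, and the selection of loss steps are all omitted.

```python
from statistics import mean, pstdev
from typing import List

CLIP_EPS = 0.2  # "PPO clip epsilon" from the table above


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float,
                      clip_eps: float = CLIP_EPS) -> float:
    """PPO-style clipped objective for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# One group of 16 rollouts for the same prompt: 4 successes, 12 failures.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_advantages(rewards)
```

With binary success rewards, a group in which every rollout succeeds or every rollout fails yields all-zero advantages under this formulation, so only mixed groups contribute gradient signal.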

Evaluation

Task-aware evaluation utilities live under src/eval/:

| File | What It Contains |
| --- | --- |
| `src/eval/evaluate_multitask.py` | Unified maze, FlowFree, and Sokoban evaluation helpers |
| `src/eval/evaluate_maze.py` | Maze-specific evaluation helpers |
| `src/eval/evaluate_flowfree.py` | FlowFree endpoint, connectivity, fill-rate, and cell-F1 checks |
| `src/eval/evaluate_sokoban.py` | Sokoban state detection, action-path extraction, and process checks |
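
To make the FlowFree fill-rate and cell-F1 terms concrete, the sketch below shows one plausible definition on already-decoded grids. It is an illustration only and may differ from src/eval/evaluate_flowfree.py. The grid encoding (0 for an empty cell, positive integers for colors) and the function names are assumptions, and the step that recovers a grid from video frames is not shown.

```python
from typing import List, Set, Tuple

Grid = List[List[int]]  # assumed encoding: 0 = empty, positive int = color id


def _filled_cells(grid: Grid) -> Set[Tuple[int, int, int]]:
    """Collect (row, col, color) for every non-empty cell."""
    return {(r, c, v)
            for r, row in enumerate(grid)
            for c, v in enumerate(row) if v != 0}


def fill_rate(grid: Grid) -> float:
    """Fraction of cells that are non-empty."""
    total = sum(len(row) for row in grid)
    return len(_filled_cells(grid)) / total if total else 0.0


def cell_f1(pred: Grid, target: Grid) -> float:
    """F1 over cells, counting a cell as correct only if its color matches."""
    p, t = _filled_cells(pred), _filled_cells(target)
    true_pos = len(p & t)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(p)
    recall = true_pos / len(t)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction that colors three of four target cells correctly and leaves the fourth empty has precision 1.0 and recall 0.75.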

Citation

If this code or the VideoRLVR recipe is useful for your research, please cite the paper.

@article{zhu2026video,
  title={Video Models Can Reason with Verifiable Rewards}, 
  author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
  journal={arXiv preprint arXiv:2605.15458},
  year={2026}
}
