VideoRLVR is an RL recipe for training video reasoning models with verifiable rewards. This repository contains the training code for RLVR on Wan2.2-TI2V-5B with reasoning tasks including Maze, FlowFree, and Sokoban.
All released models and datasets are organized in the Hugging Face collection.
| Type | Resource |
|---|---|
| Checkpoints | SFT, RLVR |
| Datasets | Train & Test |
| Training backend | Vendored diffsynth/ package with training/train.py |
diffsynth/ # vendored DiffSynth package
scripts/
├── _env.sh # shared setup
├── train_sde_grpo_multitask.sh # multitask SDE-GRPO launcher
├── train_sde_grpo_multitask_pure_success.sh # sparse pure-success multitask launcher
├── train_sde_grpo_maze.sh # maze-only SDE-GRPO launcher
├── train_sde_grpo_flowfree.sh # FlowFree-only SDE-GRPO launcher
├── train_sde_grpo_sokoban.sh # Sokoban-only SDE-GRPO launcher
├── inference_multitask.sh # multitask inference launcher
├── inference_vbvr_ood.sh # OOD inference launcher
├── eval_multitask.sh # multitask evaluation launcher
├── eval_maze.sh # maze evaluation launcher
└── eval_vbvr_ood.sh # OOD evaluation launcher
src/
├── inference_wan.py # Wan inference entrypoint
├── rewards/ # verifiable reward functions and reward factory
└── eval/ # maze, FlowFree, Sokoban, and multitask evaluation helpers
training/
└── train.py # training entrypoint used by accelerate
environment.yml # conda environment
requirements.txt # Python dependencies
Reward construction is dispatched through src/rewards/reasoning_reward.py::create_reward.
The Wan video pipeline and diffusion utilities are provided by the vendored diffsynth/ package.
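As a rough illustration of what name-based reward dispatch can look like, the sketch below maps reward names such as `maze_success` and `sokoban_success` to callables. It is not the repository's implementation: the registry, the `register_reward` decorator, the placeholder function bodies, the sample dictionary keys, and the `create_reward` signature shown here are all assumptions. See src/rewards/reasoning_reward.py for the real code.

```python
from typing import Callable, Dict

# Hypothetical signature: a reward maps one verified sample (a dict here) to a float.
RewardFn = Callable[[dict], float]

_REGISTRY: Dict[str, RewardFn] = {}


def register_reward(name: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that records a reward function under a string name."""
    def decorator(fn: RewardFn) -> RewardFn:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register_reward("maze_success")
def maze_success(sample: dict) -> float:
    # Placeholder check on a made-up flag; a real verifier would inspect the video.
    return 1.0 if sample.get("reached_goal") else 0.0


@register_reward("sokoban_success")
def sokoban_success(sample: dict) -> float:
    # Placeholder check on a made-up flag.
    return 1.0 if sample.get("all_boxes_on_targets") else 0.0


def create_reward(name: str) -> RewardFn:
    """Look up a reward by name and fail loudly on unknown names."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown reward '{name}'. Known rewards: {known}") from None


reward_fn = create_reward("maze_success")
print(reward_fn({"reached_goal": True}))
```

A registry of this kind keeps launcher scripts declarative: each script only needs to pass a reward name, and adding a task means registering one more function.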
Create the conda environment and install the shared runtime dependencies:
conda env create -f environment.yml
conda activate video-rl
The environment is intended for Wan2.2 VideoRLVR training with accelerate, the local DiffSynth runtime, verifiable reward computation, and OpenCV-based task evaluation.
All shell launchers source scripts/_env.sh, which defines shared paths.
Override these variables before launching a run when your data, models, or outputs live outside the repository defaults:
| Variable | What It Points To |
|---|---|
| REPO_ROOT | Repository root, inferred automatically from scripts/_env.sh |
| DATA_ROOT | Dataset root containing train/test folders, metadata CSVs, and referenced videos |
| MODEL_ROOT | Local model/checkpoint root |
| OUTPUT_ROOT | Root directory for checkpoints, inference outputs, and evaluation results |
| SFT_CHECKPOINT | SFT-initialized .safetensors checkpoint for Wan2.2-TI2V-5B |
Any paths shown in this README are examples only. Replace them with the real dataset roots, checkpoint paths, output directories, and logging settings for your own machine.
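If runs are launched from Python rather than from an interactive shell, one option is to build a modified copy of the process environment and pass it to the launcher. This is a sketch only: `build_launch_env` and `launch` are hypothetical helpers, the paths are placeholders, and it assumes scripts/_env.sh respects variables that are already set when it is sourced.

```python
import os
import subprocess
from typing import Dict, Mapping

# Overridable variable names from the table above.
# REPO_ROOT is left out because it is inferred automatically.
KNOWN_VARS = ("DATA_ROOT", "MODEL_ROOT", "OUTPUT_ROOT", "SFT_CHECKPOINT")


def build_launch_env(overrides: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of os.environ with the given path overrides applied."""
    unknown = set(overrides) - set(KNOWN_VARS)
    if unknown:
        raise KeyError(f"Unexpected variables: {sorted(unknown)}")
    env = dict(os.environ)
    env.update(overrides)
    return env


def launch(script: str, overrides: Mapping[str, str]) -> int:
    """Run a launcher script with the overridden environment; return its exit code."""
    result = subprocess.run(["bash", script], env=build_launch_env(overrides))
    return result.returncode


env = build_launch_env({"DATA_ROOT": "/path/to/data", "OUTPUT_ROOT": "/path/to/outputs"})
print(env["DATA_ROOT"], env["OUTPUT_ROOT"])
# Example launch (placeholder path, not executed here):
# launch("scripts/train_sde_grpo_multitask.sh", {"DATA_ROOT": "/path/to/data"})
```

Rejecting unknown variable names catches typos such as `DATAROOT` before a long training job starts with the wrong defaults.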
The expected dataset layout under ${DATA_ROOT} is:
${DATA_ROOT}/
├── train/
│ ├── metadata_all.csv
│ ├── metadata_maze_only.csv
│ ├── metadata_flowfree_only.csv
│ ├── metadata_sokoban_only.csv
│ └── ... per-sample training videos referenced by the CSVs ...
├── test/
│ ├── metadata_all.csv
│ └── ... per-sample test videos referenced by the CSVs ...
└── vbvr_ood/
└── ... OOD metadata and videos ...
The CSV files follow the DiffSynth UnifiedDataset format.
Each row contains the video path, prompt, task type, and task-specific metadata when needed.
Sokoban rows may include sokoban_metadata so that process-aware success rewards can verify valid pushes, box movement, and final target state.
See training/train.py for the full set of parsed dataset fields.
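To sanity-check a metadata file before training, it can help to count rows per task. The sketch below does this with the standard library only. The column names `video`, `prompt`, and `task_type`, the `count_tasks` helper, and the sample rows are assumptions for illustration; check training/train.py for the actual field names.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_tasks(handle: TextIO, task_column: str = "task_type") -> Counter:
    """Count metadata rows per task type from an open CSV file handle."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or task_column not in reader.fieldnames:
        raise ValueError(f"Column '{task_column}' not found in {reader.fieldnames}")
    return Counter(row[task_column] for row in reader)


# Made-up rows standing in for a real metadata CSV.
SAMPLE = """video,prompt,task_type
train/maze_0001.mp4,Solve the maze,maze
train/maze_0002.mp4,Solve the maze,maze
train/flowfree_0001.mp4,Connect the matching dots,flowfree
train/sokoban_0001.mp4,Push the boxes onto the targets,sokoban
"""

counts = count_tasks(io.StringIO(SAMPLE))
for task, n in sorted(counts.items()):
    print(f"{task}: {n}")
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to `count_tasks`.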
All training launchers source scripts/_env.sh and call training/train.py through accelerate launch.
Run the default multitask RLVR recipe:
bash scripts/train_sde_grpo_multitask.sh
Available launchers:
| Script | Metadata | Reward | Output Suffix |
|---|---|---|---|
| scripts/train_sde_grpo_multitask.sh | metadata_all.csv | combined_success | combined_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_multitask_pure_success.sh | metadata_all.csv | combined_pure_success | pure_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_maze.sh | metadata_maze_only.csv | maze_success | maze_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_flowfree.sh | metadata_flowfree_only.csv | flowfree_success | flowfree_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_sokoban.sh | metadata_sokoban_only.csv | sokoban_success | sokoban_success_G16_T20_L10_KL0.04_FULL |
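The output suffixes above appear to encode key hyperparameters: `G16`, `T20`, `L10`, and `KL0.04` line up with a group size of 16, 20 inference steps, 10 loss steps, and a KL coefficient of 0.04. That reading is an inference from the values, not something the repository states. Under that assumption, a small hypothetical parser like the one below can recover the settings from an output directory name when comparing runs.

```python
import re
from typing import Dict, Union

# Assumed layout: <reward>_G<group>_T<steps>_L<loss steps>_KL<kl coef>_<tag>
_SUFFIX_RE = re.compile(
    r"^(?P<reward>.+)_G(?P<group>\d+)_T(?P<steps>\d+)"
    r"_L(?P<loss>\d+)_KL(?P<kl>\d+(?:\.\d+)?)_(?P<tag>[A-Za-z0-9]+)$"
)


def parse_suffix(suffix: str) -> Dict[str, Union[str, int, float]]:
    """Split an output suffix into its (assumed) hyperparameter fields."""
    match = _SUFFIX_RE.match(suffix)
    if match is None:
        raise ValueError(f"Suffix does not follow the assumed pattern: {suffix!r}")
    return {
        "reward": match["reward"],
        "group_size": int(match["group"]),
        "inference_steps": int(match["steps"]),
        "loss_steps": int(match["loss"]),
        "kl_coef": float(match["kl"]),
        "tag": match["tag"],
    }


info = parse_suffix("maze_success_G16_T20_L10_KL0.04_FULL")
print(info)
```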
For example, to run only the maze subset:
bash scripts/train_sde_grpo_maze.sh
All scripts use the same base training configuration unless edited directly:
| Setting | Value |
|---|---|
| Task | sde_grpo |
| Dataset root | ${DATA_ROOT}/diffsynth_multitask |
| Base model | Wan-AI/Wan2.2-TI2V-5B |
| Trainable module | dit |
| Resolution | 480 x 832 |
| Frames | 81 |
| Learning rate | 5e-6 |
| Epochs | 1 |
| Group size | 16 |
| KL coefficient | 0.04 |
| Inference steps | 20 |
| SDE coefficient | 0.3 |
| SDE cutoff ratio | 0.5 |
| PPO clip epsilon | 0.2 |
| Loss steps | 10 |
| Micro batch size | 2 |
| CFG range | 0.6 1.4 |
| Checkpoint save interval | 100 steps |
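For readers unfamiliar with GRPO-style objectives, the sketch below shows the generic recipe that settings such as group size, PPO clip epsilon, and KL coefficient typically plug into: rewards are normalized within a group of samples drawn for the same prompt, and a clipped surrogate is combined with a KL penalty. This is a textbook-style illustration in plain Python with made-up numbers, not the code in training/train.py. How this repository computes log-probabilities over SDE sampling steps, and the exact form of its KL term, are not shown here. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_loss(
    logp_new: float,
    logp_old: float,
    advantage: float,
    kl: float,
    clip_eps: float = 0.2,
    kl_coef: float = 0.04,
) -> float:
    """PPO-style clipped surrogate (negated for minimization) plus a KL penalty."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return -surrogate + kl_coef * kl


# One group of 16 rollouts for a single prompt; 4 of them pass the verifier.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages[:5]])

# Loss for one successful rollout whose likelihood rose under the new policy.
print(clipped_loss(logp_new=-1.0, logp_old=-1.3, advantage=advantages[0], kl=0.01))
```

Note that when every rollout in a group receives the same reward, the normalized advantages are all zero and that group contributes no policy-gradient signal, which is one reason sparse success-only rewards can be harder to train with.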
Task-aware evaluation utilities live under src/eval/:
| File | What It Contains |
|---|---|
| src/eval/evaluate_multitask.py | Unified maze, FlowFree, and Sokoban evaluation helpers |
| src/eval/evaluate_maze.py | Maze-specific evaluation helpers |
| src/eval/evaluate_flowfree.py | FlowFree endpoint, connectivity, fill-rate, and cell-F1 checks |
| src/eval/evaluate_sokoban.py | Sokoban state detection, action-path extraction, and process checks |
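As an illustration of the kind of metric listed for FlowFree, a cell-level F1 can be computed by comparing the cells a prediction fills with those in the reference solution. The sketch below is a generic F1 over (row, column, color) tuples. The cell representation, the `cell_f1` name, and the example grids are assumptions, and the repository's definition in src/eval/evaluate_flowfree.py may differ.

```python
from typing import Iterable, Tuple

# Hypothetical cell representation: (row, column, color label).
Cell = Tuple[int, int, str]


def cell_f1(predicted: Iterable[Cell], reference: Iterable[Cell]) -> float:
    """F1 between the sets of colored cells in a prediction and a reference."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # Both empty: treat as a perfect match.
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = [(0, 0, "red"), (0, 1, "red"), (1, 1, "blue")]
predicted = [(0, 0, "red"), (0, 1, "red"), (1, 1, "green")]  # one cell has the wrong color
print(round(cell_f1(predicted, reference), 3))
```

Including the color in each tuple means a cell filled with the wrong color counts as both a false positive and a false negative, which penalizes paths that cover the right cells but connect the wrong endpoints.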
If this code or the VideoRLVR recipe is useful for your research, please cite the paper.
@article{zhu2026video,
title={Video Models Can Reason with Verifiable Rewards},
author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
journal={arXiv preprint arXiv:2605.15458},
year={2026}
}