luka-group/VideoRLVR

VideoRLVR

Introduction

VideoRLVR is an RL recipe for training video reasoning models with verifiable rewards. This repository contains the training code for RLVR on Wan2.2-TI2V-5B with reasoning tasks including Maze, FlowFree, and Sokoban.

Resources

All released models and datasets are collected in the Hugging Face collection.

| Type | Resource |
| --- | --- |
| Checkpoints | SFT, RLVR |
| Datasets | Train & Test |
| Training backend | Vendored diffsynth/ package with training/train.py |

Code Layout

diffsynth/                                      # vendored DiffSynth package
scripts/
├── _env.sh                                     # shared setup
├── train_sde_grpo_multitask.sh                 # multitask SDE-GRPO launcher
├── train_sde_grpo_multitask_pure_success.sh    # sparse pure-success multitask launcher
├── train_sde_grpo_maze.sh                      # maze-only SDE-GRPO launcher
├── train_sde_grpo_flowfree.sh                  # FlowFree-only SDE-GRPO launcher
├── train_sde_grpo_sokoban.sh                   # Sokoban-only SDE-GRPO launcher
├── inference_multitask.sh                      # multitask inference launcher
├── inference_vbvr_ood.sh                       # OOD inference launcher
├── eval_multitask.sh                           # multitask evaluation launcher
├── eval_maze.sh                                # maze evaluation launcher
└── eval_vbvr_ood.sh                            # OOD evaluation launcher
src/
├── inference_wan.py                            # Wan inference entrypoint
├── rewards/                                    # verifiable reward functions and reward factory
└── eval/                                       # maze, FlowFree, Sokoban, and multitask evaluation helpers
training/
└── train.py                                    # training entrypoint used by accelerate
environment.yml                                 # conda environment
requirements.txt                                # Python dependencies

Reward construction is dispatched through src/rewards/reasoning_reward.py::create_reward. The Wan video pipeline and diffusion utilities are provided by the vendored diffsynth/ package.
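
As a rough illustration of what name-based reward dispatch can look like, here is a minimal registry sketch. It is not the code in src/rewards/reasoning_reward.py: the function name create_reward_sketch, the RewardFn signature, and the reached_goal field are all hypothetical, and the placeholder reward does no real video verification.

```python
from typing import Callable, Dict

# Hypothetical signature: a reward maps one decoded sample to a score in [0, 1].
RewardFn = Callable[[dict], float]

_REGISTRY: Dict[str, RewardFn] = {}


def register_reward(name: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that stores a reward function under a string key."""
    def decorator(fn: RewardFn) -> RewardFn:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register_reward("maze_success")
def maze_success(sample: dict) -> float:
    # Placeholder only: a real reward would verify the path in the generated video.
    return 1.0 if sample.get("reached_goal") else 0.0


def create_reward_sketch(name: str) -> RewardFn:
    """Look up a reward by name and fail loudly on unknown names."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = sorted(_REGISTRY)
        raise ValueError(f"unknown reward {name!r}; known: {known}") from None
```

A registry like this keeps each launcher's reward choice down to a single string, which matches how the launchers below differ only in metadata file and reward name.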

Environment Setup

Create the conda environment and install the shared runtime dependencies:

conda env create -f environment.yml
conda activate video-rl

The environment is intended for Wan2.2 VideoRLVR training with accelerate, the local DiffSynth runtime, verifiable reward computation, and OpenCV-based task evaluation.

Runtime Configuration

All shell launchers source scripts/_env.sh, which defines shared paths. Override these variables before launching a run when your data, models, or outputs live outside the repository defaults:

| Variable | What It Points To |
| --- | --- |
| REPO_ROOT | Repository root, inferred automatically from scripts/_env.sh |
| DATA_ROOT | Dataset root containing train/test folders, metadata CSVs, and referenced videos |
| MODEL_ROOT | Local model/checkpoint root |
| OUTPUT_ROOT | Root directory for checkpoints, inference outputs, and evaluation results |
| SFT_CHECKPOINT | SFT-initialized .safetensors checkpoint for Wan2.2-TI2V-5B |

Any paths shown in this README are examples only. Replace them with the real dataset roots, checkpoint paths, output directories, and logging settings for your own machine.
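
If you drive the launchers from Python rather than a shell, one way to apply these overrides is to build the child environment explicitly. The sketch below assumes that scripts/_env.sh keeps any value already set in the environment, which this README implies but does not spell out. The helper name launcher_env and the /path/to/... values are placeholders.

```python
import os
from typing import Dict, Mapping, Optional

# Variables this README lists as overridable. REPO_ROOT is left out
# because it is inferred automatically.
OVERRIDABLE = ("DATA_ROOT", "MODEL_ROOT", "OUTPUT_ROOT", "SFT_CHECKPOINT")


def launcher_env(
    overrides: Mapping[str, str],
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the given path overrides applied."""
    unknown = sorted(set(overrides) - set(OVERRIDABLE))
    if unknown:
        raise KeyError(f"not an overridable variable: {unknown}")
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


env = launcher_env(
    {"DATA_ROOT": "/path/to/data", "OUTPUT_ROOT": "/path/to/outputs"},
    base={"PATH": "/usr/bin"},
)
```

Passing the result as env= to subprocess.run(["bash", "scripts/train_sde_grpo_multitask.sh"], env=env, check=True) would then start a run with those paths. Rejecting unknown names catches typos such as DATA_ROOTS before a long job starts.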

Dataset Format

The expected dataset layout under ${DATA_ROOT} is:

${DATA_ROOT}/
├── train/
│   ├── metadata_all.csv
│   ├── metadata_maze_only.csv
│   ├── metadata_flowfree_only.csv
│   ├── metadata_sokoban_only.csv
│   └── ... per-sample training videos referenced by the CSVs ...
├── test/
│   ├── metadata_all.csv
│   └── ... per-sample test videos referenced by the CSVs ...
└── vbvr_ood/
    └── ... OOD metadata and videos ...

The CSV files follow the DiffSynth UnifiedDataset format. Each row contains the video path, prompt, task type, and task-specific metadata when needed. Sokoban rows may include sokoban_metadata so that process-aware success rewards can verify valid pushes, box movement, and final target state.
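
To make the idea of a process-aware Sokoban check concrete, the sketch below replays an action sequence on a grid and accepts it only if every move is legal and every box ends on a target. It is an illustration under assumptions, not the repository's reward: the state encoding, the U/D/L/R action alphabet, and the function name verify_sokoban are hypothetical, and the actual fields of sokoban_metadata are not documented here.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # (row, column)
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def verify_sokoban(
    walls: Set[Cell],
    boxes: Set[Cell],
    targets: Set[Cell],
    player: Cell,
    actions: Iterable[str],
) -> bool:
    """Replay actions; return True only for a legal run that solves the level."""
    boxes = set(boxes)  # copy so the caller's set is not mutated
    for action in actions:
        if action not in MOVES:
            return False
        dr, dc = MOVES[action]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return False  # walking into a wall is illegal
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return False  # a box cannot be pushed into a wall or another box
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt
    return boxes == set(targets)
```

Checking each step, rather than only the final frame, is what separates a process-aware reward from one that could be satisfied by boxes teleporting onto targets.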

See training/train.py for the full set of parsed dataset fields.
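
Before a long run it can help to confirm that every video referenced by a metadata CSV actually exists. The following sketch assumes the path column is named video and that paths are relative to the folder containing the CSV. Both are assumptions, so adjust them to match the fields parsed in training/train.py.

```python
import csv
from pathlib import Path
from typing import List


def missing_videos(metadata_csv: Path, video_column: str = "video") -> List[str]:
    """Return the referenced video paths that do not exist on disk."""
    base = metadata_csv.parent  # assumption: paths are relative to the CSV folder
    missing: List[str] = []
    with metadata_csv.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if video_column not in (reader.fieldnames or []):
            raise KeyError(f"column {video_column!r} not found in {metadata_csv}")
        for row in reader:
            if not (base / row[video_column]).is_file():
                missing.append(row[video_column])
    return missing
```

For example, missing_videos(Path("/path/to/data/train/metadata_all.csv")) returns an empty list when the split is complete.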

Run Training

All training launchers source scripts/_env.sh and call training/train.py through accelerate launch.

Run the default multitask RLVR recipe:

bash scripts/train_sde_grpo_multitask.sh

Available launchers:

| Script | Metadata | Reward | Output Suffix |
| --- | --- | --- | --- |
| scripts/train_sde_grpo_multitask.sh | metadata_all.csv | combined_success | combined_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_multitask_pure_success.sh | metadata_all.csv | combined_pure_success | pure_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_maze.sh | metadata_maze_only.csv | maze_success | maze_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_flowfree.sh | metadata_flowfree_only.csv | flowfree_success | flowfree_success_G16_T20_L10_KL0.04_FULL |
| scripts/train_sde_grpo_sokoban.sh | metadata_sokoban_only.csv | sokoban_success | sokoban_success_G16_T20_L10_KL0.04_FULL |
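
The output suffixes appear to encode the run configuration: G16 lines up with the group size, T20 with the inference steps, L10 with the loss steps, and KL0.04 with the KL coefficient. That reading is inferred from the default settings rather than stated by the launchers, and the meaning of the trailing FULL tag is not documented here. Under that assumption, a small parser for run directory names could look like this (parse_output_suffix is a hypothetical helper):

```python
import re
from typing import Dict, Union

_SUFFIX = re.compile(
    r"^(?P<label>.+)_G(?P<group>\d+)_T(?P<steps>\d+)"
    r"_L(?P<loss>\d+)_KL(?P<kl>[\d.]+)_(?P<tag>[^_]+)$"
)


def parse_output_suffix(suffix: str) -> Dict[str, Union[str, int, float]]:
    """Split a run suffix into its label and (assumed) hyperparameters."""
    match = _SUFFIX.match(suffix)
    if match is None:
        raise ValueError(f"unrecognized suffix: {suffix!r}")
    return {
        "label": match["label"],
        "group_size": int(match["group"]),
        "inference_steps": int(match["steps"]),
        "loss_steps": int(match["loss"]),
        "kl_coef": float(match["kl"]),
        "tag": match["tag"],
    }


parsed = parse_output_suffix("combined_success_G16_T20_L10_KL0.04_FULL")
```

Note that the label is not always the reward name: the pure-success launcher uses the reward combined_pure_success but the label pure_success.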

For example, to run only the maze subset:

bash scripts/train_sde_grpo_maze.sh

All scripts use the same base training configuration unless edited directly:

| Setting | Value |
| --- | --- |
| Task | sde_grpo |
| Dataset root | ${DATA_ROOT}/diffsynth_multitask |
| Base model | Wan-AI/Wan2.2-TI2V-5B |
| Trainable module | dit |
| Resolution | 480 x 832 |
| Frames | 81 |
| Learning rate | 5e-6 |
| Epochs | 1 |
| Group size | 16 |
| KL coefficient | 0.04 |
| Inference steps | 20 |
| SDE coefficient | 0.3 |
| SDE cutoff ratio | 0.5 |
| PPO clip epsilon | 0.2 |
| Loss steps | 10 |
| Micro batch size | 2 |
| CFG range | 0.6 1.4 |
| Checkpoint save interval | 100 steps |
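
For readers unfamiliar with how group size and PPO clip epsilon interact, the sketch below shows the commonly used GRPO formulation: rewards within a group of samples for the same prompt are normalized into advantages, and each policy-ratio term is clipped. This is a generic illustration, not a description of training/train.py. Whether the repository normalizes by standard deviation, and how the KL and SDE settings enter the loss, is not stated in this README.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample (to be maximized)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Four samples for one prompt, two of which succeeded.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary success rewards, a group in which every sample succeeds or every sample fails yields all-zero advantages, so only mixed groups contribute a learning signal in this formulation.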

Evaluation

Task-aware evaluation utilities live under src/eval/:

| File | What It Contains |
| --- | --- |
| src/eval/evaluate_multitask.py | Unified maze, FlowFree, and Sokoban evaluation helpers |
| src/eval/evaluate_maze.py | Maze-specific evaluation helpers |
| src/eval/evaluate_flowfree.py | FlowFree endpoint, connectivity, fill-rate, and cell-F1 checks |
| src/eval/evaluate_sokoban.py | Sokoban state detection, action-path extraction, and process checks |
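
As an illustration of what a cell-level F1 check can measure, the sketch below compares the set of grid cells filled in a prediction against a reference solution. It is a generic F1 over cell sets. How src/eval/evaluate_flowfree.py actually extracts cells from video frames, or whether it scores per color, is not documented here, and cell_f1 is a hypothetical name.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, column)


def cell_f1(predicted: Set[Cell], reference: Set[Cell]) -> float:
    """Harmonic mean of precision and recall over filled grid cells."""
    if not predicted and not reference:
        return 1.0  # nothing to fill and nothing filled
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Unlike a binary success check, a score of this kind gives partial credit for a path that covers most of the correct cells.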

Citation

If this code or the VideoRLVR recipe is useful for your research, please cite the paper.

@article{zhu2026video,
  title={Video Models Can Reason with Verifiable Rewards}, 
  author={Tinghui Zhu and Sheng Zhang and James Y. Huang and Selena Song and Xiaofei Wen and Yuankai Li and Hoifung Poon and Muhao Chen},
  journal={arXiv preprint arXiv:2605.15458},
  year={2026}
}

About

Code for paper "Video Models Can Reason with Verifiable Rewards"
