RoboRenForce is a unified framework that covers the full robotics RL pipeline: classic locomotion control (PPO/SAC on Isaac Lab, MJLab, Gymnasium), vision-language-action model training (Qwen2-VL, Qwen3-VL, OpenPI, GR00T), and multi-stage learning (pretrain → SFT → RL fine-tuning). Everything is driven by a composable @configclass system and a consistent wrapper chain across simulators.
┌──────────────────────────────── RoboRenForce ────────────────────────────────┐
│ │
│ System 2 (VLM Backbone) System 1 (Action Expert / Psi0) │
│ ┌──────────────────────┐ ┌──────────────────────────────┐ │
│ │ Qwen2-VL / Qwen3-VL │──feats──▶ │ Psi0 — Regression / │ │
│ │ OpenPI / GR00T │ │ Diffusion / Flow-Match head │ │
│ │ (frozen / LoRA) │ │ chunked high-level actions │ │
│ └──────────────────────┘ └──────────────────────────────┘ │
│ ▲ │ │
│ obs (image+lang) high-level target / EE pose │
│ │ ▼ │
│ │ System 0 (Whole-Body Loco Policy) │
│ │ ┌──────────────────────────────┐ │
│ │ │ AMO · Sonic · custom RL/MPC │ │
│ │ │ joint-level torques / dq │ │
│ │ └──────────────────────────────┘ │
│ │ │ │
│ ┌───────────┴───────────────────────────────────────┴─────────────────────┐ │
│ │ Environment Wrapper Chain │ │
│ │ Isaac Lab ─┐ │ │
│ │ MJLab ─┤─▶ VecEnv ─▶ DynamicEnv ─▶ GroupVecWrapper ─▶ MultiModal │ │
│ │ RoboTwin ─┤ │ │
│ │ Gymnasium ─┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
System 2 reasons over vision + language and emits latent features.
System 1 (Psi0) consumes those features and outputs chunked high-level actions
(joint targets / EE poses) via a regression, diffusion, or flow-match head.
System 0 is a low-level whole-body locomotion policy (AMO, Sonic, …) trained
with PPO/SAC that tracks System 1's targets at simulator rate.
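The three-tier split described above can be pictured as nested loops running at different rates. The sketch below is only an illustration under stated assumptions, not the framework's API: `system2_features`, `system1_chunk`, `system0_track`, the chunk length, the steps-per-target count, and the proportional tracking rule are all hypothetical stand-ins for the real VLM backbone, action expert, and locomotion policy.

```python
from typing import List

CHUNK_LEN = 4        # assumed number of high-level actions per System 1 chunk
LOW_LEVEL_STEPS = 5  # assumed simulator steps spent tracking each target


def system2_features(observation: float) -> float:
    """Stand-in for the VLM backbone: maps an observation to a latent feature."""
    return 0.5 * observation


def system1_chunk(feature: float, chunk_len: int = CHUNK_LEN) -> List[float]:
    """Stand-in for the action expert: emits a chunk of high-level targets."""
    return [feature + 0.1 * k for k in range(chunk_len)]


def system0_track(state: float, target: float, gain: float = 0.5) -> float:
    """Stand-in for the low-level policy: one proportional step toward the target."""
    return state + gain * (target - state)


def run_episode(initial_state: float, num_chunks: int) -> List[float]:
    state = initial_state
    trace = [state]
    for _ in range(num_chunks):
        feature = system2_features(state)      # slowest loop: once per chunk
        for target in system1_chunk(feature):  # one high-level target at a time
            for _ in range(LOW_LEVEL_STEPS):   # fastest loop: simulator rate
                state = system0_track(state, target)
            trace.append(state)
    return trace


trace = run_episode(initial_state=1.0, num_chunks=2)
print(len(trace))  # 9: the initial state plus 2 chunks * 4 targets
```

The point of the structure is that the expensive model (System 2) is queried least often, while the cheap tracking policy (System 0) runs at every simulator step.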
Planning a GR00T finetune? Read docs/SETUP_GUIDE.md first: it walks through the system packages (python3.10-dev, ffmpeg, libaio-dev, git-lfs), downloading the gated nvidia/GR00T-N1.7-3B + nvidia/Cosmos-Reason2-2B weights, disk-space planning, and a verified end-to-end finetune on the bundled SO101 demo. For training data, see docs/DATA_DOWNLOAD.md (bundled demos, NVIDIA GR00T-flavored datasets, and converting community LeRobot v3 datasets).
git clone <repository-url>
cd RoboRenForce
# Core framework
pip install -e source/RoboRenForce
# Task packages (install what you need)
pip install -e source/tasks/RRF_isaaclab # Isaac Lab locomotion/manipulation
pip install -e source/tasks/RRF_mjlab # MJLab (MuJoCo Warp) locomotion
pip install -e source/tasks/RRF_robotwin # RoboTwin manipulation
pip install -e source/tasks/RRF_humanoid_psi0 # Humanoid offline tasks
# Role-distributed VLA RL training (opt-in — see "Parallelism" section)
pip install -e source/RRF_orchestra
# External setup (robot assets, etc.)
bash scripts/setup_ext.sh

VLA model setup (optional)
# Download VLM weights
bash scripts/models/setup_models.sh qwen2vl # Qwen2-VL 2B (~4.2GB)
bash scripts/models/setup_models.sh qwen3vl # Qwen3-VL 2B
bash scripts/models/setup_models.sh openpi # OpenPI pi0.5 4B
bash scripts/models/setup_models.sh groot # GR00T N1.7 3B
# Or install dependencies only
pip install "transformers>=4.37" qwen-vl-utils accelerate peft

See docs/models.md for per-model details, VRAM requirements, and usage examples.
MJLab simulator setup
# Clone and install MJLab
git clone https://github.com/mujocolab/mjlab.git
pip install -e mjlab
# Requires: mujoco>=3.7.0, mujoco-warp>=3.7.0.1, warp-lang>=1.12.0

# Go1 quadruped on flat terrain (PPO)
python scripts/renforce/train_mjlab.py \
--task Mjlab-Velocity-Flat-Unitree-Go1 \
--num_envs 4096 --device cuda:0
# G1 humanoid on rough terrain
python scripts/renforce/train_mjlab.py \
--task Mjlab-Velocity-Rough-Unitree-G1 \
--num_envs 2048 --max_iterations 30000

python scripts/renforce/train_lab.py \
--task RoboRenForce-AFR-UnitreeGo1Flat-PPO \
--num_envs 4096 --headless

# Single GPU
python scripts/vla/pretrain/train_single_gpu.py \
--model_name Qwen/Qwen2-VL-2B-Instruct \
--dataset_path data/my_dataset --epochs 20
# Multi-GPU DDP
torchrun --nproc_per_node=4 scripts/vla/pretrain/train_ddp.py \
--model_name Qwen/Qwen2-VL-2B-Instruct \
--dataset_path data/my_dataset --epochs 20

python scripts/vla/rl/train_robotwin_grpo.py \
--task close_laptop_lid --algo grpo --num_envs 32

More examples: SFT, evaluation, data pipeline
Supervised Fine-tuning (SFT)
# Single GPU
python scripts/vla/post_train/train_sft.py \
--model_name Qwen/Qwen2-VL-2B-Instruct \
--dataset_path data/my_dataset --epochs 5
# Multi-GPU
torchrun --nproc_per_node=2 scripts/vla/post_train/train_sft_ddp.py \
--model_name Qwen/Qwen2-VL-2B-Instruct \
--dataset_path data/my_dataset

Evaluation / Playback
# MJLab
python scripts/renforce/play_mjlab.py \
--target logs/RFRL/mjlab_go1/model_5000.pt --num_envs 64
# Isaac Lab
python scripts/renforce/play_lab.py \
--target logs/RFRL/go1_ppo/model_5000.pt --video

Data Conversion
# Download LeRobot dataset
python scripts/data/download_lerobot_dataset.py --repo lerobot/aloha_sim
# Isaac Lab trajectories → LeRobot format
python scripts/data/isaaclab_to_lerobot.py --input traj/ --output data/lerobot/
# RLDS → LeRobot
python scripts/data/rlds_to_lerobot.py --input rlds_data/ --output data/lerobot/

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| PPO | On-policy, GAE | State | OnPolicyRunner | ppo.py |
| CAPS-PPO | On-policy, smooth | State | OnPolicyRunner | CAPS.py |
| L2C2-PPO | On-policy, smooth | State | OnPolicyRunner | L2C2.py |
| Lips-PPO | On-policy, Lipschitz | State | OnPolicyRunner | Lips.py |
| SAPG-PPO | On-policy, self-adaptive | State | SAPGOnPolicyRunner | sapg/ |
| EPO | On-policy, exploration | State | EPOOnPolicyRunner | epo/ |
| SAC | Off-policy, entropy-reg | State | OffPolicyRunner | sac/ |
| SAC-Seq | Off-policy, sequential | State | OffPolicyRunner | sac_seq.py |
| SAC-Trans | Off-policy, transformer | State | OffPolicyRunner | sac_trans.py |
| DSAC / DSACT | Off-policy, distributional | State | OffPolicyRunner | dsac/ |

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| IQL | Implicit Q-Learning | State | OfflineRunnerBase | iql.py |

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| VLA Pretrain | Behavior cloning (L1/MSE) | Image + Language + State | VLAPretrainRunner | pretrain_algorithm.py |
| SFT | Supervised fine-tuning (KL reg.) | Image + Language + State | VLASFTRunner | sft.py |
| DAgger | Online imitation + expert intervention | Image + Language + State | n/a | dagger.py |

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| GRPO | Group Relative Policy Opt | Image + Language + State | VLAGRPORunner | grpo.py |
| VLA-PPO | PPO over VLM backbone | Image + Language + State | VLAPPORunner | ppo.py |
| VLA-SAC | SAC over VLM backbone | Image + Language + State | n/a | sac.py |

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| MBPO | Model-Based Policy Opt | State | MBPOOnPolicyRunner | mbpo/ |
| System Dynamics (MLP) | Forward model f(s,a)→s' | State | OfflineRunnerBase | system_dynamics_mlp.py |
| System Dynamics (Transformer) | Forward model f(s,a)→s' | State | OfflineRunnerBase | system_dynamics_transformer.py |
| TD-MPC / TD-MPC2 | Latent dynamics + planning | State | NNModelBasedRunner | tdmpcs/ |
| Belief Flow Model | Belief state dynamics | State | FlowModelRunner | belief_flow_model/ |

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| GAIL + PPO | Adversarial IL | State | OnPolicyRunner | gail_ppo.py |
| AMP + PPO | Adversarial Motion Priors | State | OnPolicyRunner | amp_ppo.py |
| Distillation | Knowledge transfer | State | n/a | distillation.py |

| Model | Params | Output Dim | Action Heads | License |
|---|---|---|---|---|
| Qwen2-VL | 2B / 7B | 1536 | Regression, Diffusion | Apache 2.0 |
| Qwen3-VL | 2B / 8B | 2048 | Regression, Diffusion | Apache 2.0 |
| OpenPI (pi0.5) | 4B | 2048 | Flow Matching | Apache 2.0 + Gemma |
| GR00T N1.7 | 3B | 2048 | DiT | Apache 2.0 |
| MLP Baseline | ~1M | 64 | Regression | Built-in |
See docs/models.md for full details.
RoboRenForce covers four training paradigms:
- Online RL — agent interacts with simulator, maximizes reward (PPO, SAC)
- Offline RL — learns from fixed dataset via Q-learning, no env interaction (IQL, CQL)
- VLA Pretrain (SL) — supervised learning on demonstration data with VLM backbone
- VLA RL Fine-tune — online RL with a VLA policy (GRPO, PPO over VLM)
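For the VLA RL fine-tuning paradigm, the core idea behind GRPO is to score each rollout relative to the other rollouts sampled for the same task, instead of against a learned value baseline. The snippet below is a minimal sketch of that normalization only; `group_relative_advantages` and `eps` are illustrative names, and the repository's own GRPO implementation may differ (for example in how it handles clipping or KL regularization).

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts collected for the same task."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(var)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of one task: two succeeded (reward 1), two failed (reward 0).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Successful rollouts receive positive advantages and failed ones negative, and a group in which every rollout earns the same reward contributes no learning signal.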
| Platform | Package | Robots / Tasks | Paradigm | Script |
|---|---|---|---|---|
| MJLab (MuJoCo Warp) | RRF_mjlab | Go1, G1: velocity tracking (flat/rough) | Online RL | train_mjlab.py |
| Isaac Lab (Isaac Sim) | RRF_isaaclab | A1, Go1, Go2, Anymal B/C/D, H1, G1 | Online RL | train_lab.py |
| Gymnasium | Built-in | Classic control, MuJoCo | Online RL | train_gym.py |
| D4RL | RRF_d4rl | Walker2d, Hopper, HalfCheetah | Offline RL | OfflineRunnerBase |
| LIBERO | RRF_libero | Franka: 10/90/130 manipulation tasks | VLA Pretrain / VLA RL | train_vla_benchmark.py |
| ManiSkill | RRF_maniskill | Franka: GPU-accelerated manipulation | VLA Pretrain / VLA RL | train_vla_benchmark.py |
| CALVIN | RRF_calvin | Franka: 5-subtask long-horizon eval | VLA Pretrain / VLA RL | train_vla_benchmark.py |
| RoboTwin (SAPIEN3) | RRF_robotwin | Piper, ALOHA: 60+ manipulation tasks | VLA RL | train_robotwin_grpo.py |
| Humanoid Psi0 | RRF_humanoid_psi0 | G1 Dex3: 14 pick-and-place demos | VLA Pretrain | train_single_gpu.py |
All components are configured via @configclass — a decorator that extends Python dataclasses with type validation, serialization, and factory construction.
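To make the three ingredients (type validation, serialization, factory construction) concrete, here is a toy decorator built on the standard library's `dataclasses`. It is a hypothetical stand-in, not the framework's `@configclass`: the name `simple_configclass`, the `to_dict` / `from_dict` methods, and the `OptimCfg` class are all assumptions made for illustration, and unlike the real decorator this sketch requires every field to carry a type annotation.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


def simple_configclass(cls):
    """Hypothetical stand-in for @configclass (not the framework's implementation)."""

    def validate(self) -> None:
        # Type validation: check each field against its annotated type.
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(f.type, type) and not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
                )

    def to_dict(self) -> Dict[str, Any]:
        # Serialization: nested dataclasses become nested dicts.
        return asdict(self)

    def from_dict(klass, data: Dict[str, Any]):
        # Factory construction from a plain dict (flat configs only in this sketch).
        return klass(**data)

    # __post_init__ must exist before dataclass() generates __init__.
    cls.__post_init__ = validate
    cls.to_dict = to_dict
    cls.from_dict = classmethod(from_dict)
    return dataclass(cls)


@simple_configclass
class OptimCfg:
    learning_rate: float = 1.0e-3
    num_mini_batches: int = 4


cfg = OptimCfg(learning_rate=3.0e-4)
restored = OptimCfg.from_dict(cfg.to_dict())
print(restored == cfg)  # True: the config survives a dict round-trip
```

Passing a value of the wrong type, such as `OptimCfg(num_mini_batches="4")`, raises `TypeError` at construction time rather than deep inside a training run.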
Example: defining a custom PPO config
from RoboRenForce import configclass
from RoboRenForce import runners, algorithms, components, networks


@configclass
class MyLocoPPOCfg(runners.OnPolicyRunnerCfg):
    seed = 42
    num_steps_per_env = 24
    max_iterations = 10000
    experiment_name = "my_experiment"
    policy = components.ActorCriticPackCfg(
        actor_cfg=components.StateIndStdActorCfg(
            backbone_cfg=networks.MLPCfg(
                hidden_features=[512, 256, 128],
                activations=[[('ELU', {})]] * 3 + [[]]
            ),
            use_log_std=False
        ),
        critic_cfg=components.VNetworkCfg(
            backbone_cfg=networks.MLPCfg(
                hidden_features=[512, 256, 128],
                activations=[[('ELU', {})]] * 3 + [[]]
            )
        )
    )
    algorithm = algorithms.PPOCfg(
        clip_param=0.2,
        entropy_coef=0.01,
        num_learning_epochs=5,
        num_mini_batches=4,
        learning_rate=1.0e-3,
        schedule="adaptive",
        gamma=0.99,
        lam=0.95,
    )

Example: registering a task with gymnasium
import gymnasium as gym

gym.register(
    id="RoboRenForce-MyTask-PPO",
    entry_point="mjlab.envs:ManagerBasedRlEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": my_env_cfg,
        "RoboRenForce_entry_point": MyLocoPPOCfg(),
    },
)

Example: environment wrapper chain
from mjlab.envs import ManagerBasedRlEnv
from RRF_mjlab_tasks.mjlab_utils import (
    RoboRenForceMJLabEnvWrapper,  # Base: obs remapping, step adaptation
    MJLabDynamicEnvWrapper,       # + reward/command extraction, dim_params
    MJLabGroupVecWrapper,         # + train/eval env partitioning
)

env = ManagerBasedRlEnv(cfg=my_cfg, device="cuda:0")
wrapped = MJLabDynamicEnvWrapper(env)
print(wrapped.num_envs)        # 4096
print(wrapped.dim_params)      # {'policy_dim': 48, 'critic_dim': 72, ...}
obs, extras = wrapped.reset()  # obs: (4096, 48)

RoboRenForce/
├── source/
│ ├── RoboRenForce/ # Core framework
│ │ └── RoboRenForce/
│ │ ├── algorithms/ # RL algorithms
│ │ │ ├── on_policy/ # PPO, MBPO, SAPG, smooth variants
│ │ │ ├── off_policy/ # SAC, DSAC
│ │ │ ├── vla_training/ # Pretrain, SFT, GRPO, PPO, IQL, DAgger
│ │ │ └── nn_model_trainer/
│ │ ├── runners/ # Training loops
│ │ │ ├── on_policy/ # OnPolicyRunner, SAPG, EPO
│ │ │ ├── off_policy/ # OffPolicyRunner
│ │ │ ├── vla/ # Pretrain, SFT, GRPO (+ DDP variants)
│ │ │ └── nn_model_based/ # MBPO, Flow model
│ │ ├── networks/ # Neural network modules
│ │ │ ├── vlm/ # Qwen2-VL, Qwen3-VL, OpenPI, GR00T
│ │ │ ├── transformer/ # Transformer backbone
│ │ │ └── mlp.py, vae/, moe.py, fft_filter.py
│ │ ├── components/ # Actors, critics, normalizers
│ │ │ ├── actor/ # Gaussian, SAC, Lipschitz, VLA actors
│ │ │ ├── critic/ # V-net, Q-net, distributional
│ │ │ └── normalizer/ # Empirical normalizer
│ │ ├── buffer/ # Replay buffers & rollout storage
│ │ └── utils/ # Config system, env wrappers, tools
│ │ ├── configclass/ # @configclass decorator
│ │ └── env_wrapper/ # Lab, Gym, VLA wrapper chains
│ ├── RRF_orchestra/ # Role-distributed runner (opt-in)
│ │ └── RRF_orchestra/ # rollout / inference / learner role-split
│ │ ├── protocol/ # Wire protocol (messages, channels, shared_tensor)
│ │ ├── workers/ # Env / Inference workers + batchers
│ │ ├── orchestrator/ # Topology, Supervisor, OrchestraVLARunner
│ │ ├── adapters/ # Algorithm + policy adapters
│ │ └── examples/ # hello_world_orchestra
│ └── tasks/ # Task packages
│ ├── RRF_isaaclab/ # Isaac Lab locomotion & manipulation
│ ├── RRF_mjlab/ # MJLab (MuJoCo Warp) locomotion
│ ├── RRF_robotwin/ # RoboTwin 60+ manipulation tasks
│ └── RRF_humanoid_psi0/ # Humanoid offline datasets
├── scripts/
│ ├── renforce/ # Standard RL training & evaluation
│ │ ├── train_lab.py # Isaac Lab training
│ │ ├── train_mjlab.py # MJLab training
│ │ ├── train_gym.py # Gymnasium training
│ │ ├── play_lab.py # Isaac Lab evaluation
│ │ └── play_mjlab.py # MJLab evaluation
│ ├── vla/ # VLA model training
│ │ ├── pretrain/ # Single-GPU & DDP pretraining
│ │ ├── post_train/ # SFT (single & DDP)
│ │ └── rl/ # GRPO / PPO fine-tuning
│ └── data/ # Dataset tools
│ ├── download_lerobot_dataset.py
│ ├── isaaclab_to_lerobot.py
│ └── rlds_to_lerobot.py
├── docs/
│ ├── models.md # VLM backbone details & setup
│ └── BENCHMARK_PLAN.md # Benchmark experiment specs
└── tests/ # Test suites
RoboRenForce ships with two orthogonal parallelism mechanisms — they solve different problems and can be composed.
| Axis | Where it lives | Problem solved | Typical use |
|---|---|---|---|
| Data-parallel (DDP) | RoboRenForce/runners/vla/{pretrain,post_train}/*_runner_distributed.py + scripts/vla/{pretrain,post_train}/train_*_ddp.py | Single GPU is too small / too slow for the desired batch. Same role on every rank, gradients all-reduced via NCCL. | VLA pretrain / SFT on N GPUs, launched with torchrun --nproc_per_node=N |
| Role-distributed (Orchestra) | source/RRF_orchestra/ (opt-in package) | CPU sim and GPU inference idle waiting on each other. One process per role: EnvWorker × N, InferenceWorker × 1, Learner × 1. Tensors moved over torch.multiprocessing.Queue with shared-memory zero-copy. | VLA RL fine-tuning on a single machine; VLA model dominates step time and you want batched inference across N parallel envs |
The two axes compose cleanly — e.g. role-distributed rollout + DDP learner — though v1 keeps them decoupled.
# VLA Pretraining — 4 GPUs
torchrun --nproc_per_node=4 scripts/vla/pretrain/train_ddp.py \
--model_name Qwen/Qwen2-VL-2B-Instruct \
--dataset_path data/humanoid_psi0 --epochs 20
# VLA SFT — 2 GPUs
torchrun --nproc_per_node=2 scripts/vla/post_train/train_sft_ddp.py \
--model_name Qwen/Qwen2-VL-2B-Instruct \
--dataset_path data/my_dataset

DDP implementation details
- Serialized model loading (rank 0 first, then barrier) to avoid HuggingFace cache races
- 30-minute NCCL timeout for large model initialization
- device_map_auto=False for DDP compatibility (no model sharding)
- Unwrapped model for validation (avoids DDP deadlock on single-rank validation)
- DistributedSampler with proper epoch shuffling
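The first bullet above (rank 0 loads first, everyone else after a barrier) follows a common pattern that can be sketched without any GPU. In this illustration `load_serialized` is a hypothetical helper, and `load_fn` and `barrier` are injected callables so the ordering can be observed in a single process; in a real DDP job the barrier would be `torch.distributed.barrier` and `load_fn` would wrap the HuggingFace model load.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def load_serialized(rank: int, load_fn: Callable[[], T], barrier: Callable[[], None]) -> T:
    """Rank 0 loads first (filling any on-disk cache); other ranks load after the barrier."""
    model: Optional[T] = None
    if rank == 0:
        model = load_fn()
    barrier()  # every rank waits here until rank 0 has finished loading
    if rank != 0:
        model = load_fn()  # cache is now warm, so concurrent reads are safe
    return model


def record_order(rank: int) -> List[str]:
    """Run the helper with fake callables and return the order of events."""
    events: List[str] = []

    def fake_load() -> str:
        events.append("load")
        return "model"

    load_serialized(rank, fake_load, lambda: events.append("barrier"))
    return events


print(record_order(0))  # ['load', 'barrier']
print(record_order(1))  # ['barrier', 'load']
```

For the last bullet, PyTorch's `DistributedSampler` only reshuffles between epochs if `sampler.set_epoch(epoch)` is called at the start of each epoch; without it every epoch sees the same ordering.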
RRF_orchestra runs each role (env / inference / learner) in its own process. Large tensors travel through shared memory (zero-copy); only small handles and metadata go through mp.Queue. The OrchestraVLARunner is a drop-in replacement for the single-process VLA RL runner.
EnvWorker × N InferenceWorker × 1 Learner × 1
┌───────────┐ ObsBatch ┌─────────────────┐ ┌─────────┐
│ env.step │ ─────────▶ │ batched VLA inf │ ──────▶ │ collect │
│ │ ◀───────── │ ActionBatch │ │ traj │
└─────┬─────┘ └────────▲────────┘ │ grad │
│ Trajectory │ WeightUpdate │ update │
▼ └─────────────────┤ │
┌──────────────────────────────────────────────────────┘
│ channels (mp.Queue + shared-mem tensors)
└────────────────────────────────────────────────────────
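The "handles through the queue, payload through shared memory" idea can be tried with the standard library alone. This sketch swaps in `multiprocessing.shared_memory` and an in-process `queue.Queue` for the package's torch shared-memory tensors and `mp.Queue`, so it shows the pattern rather than the actual wire protocol; `publish`, `consume`, and the handle fields are hypothetical names.

```python
import queue
import struct
from multiprocessing import shared_memory
from typing import List


def publish(values: List[float], channel: "queue.Queue") -> shared_memory.SharedMemory:
    """Write float32 values into shared memory and send only a small handle."""
    payload = struct.pack(f"{len(values)}f", *values)
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    # The queue carries metadata only: block name and element count.
    channel.put({"name": shm.name, "count": len(values)})
    return shm  # the producer keeps ownership and unlinks the block later


def consume(channel: "queue.Queue") -> List[float]:
    """Attach to the block named in the handle and read the values back."""
    handle = channel.get()
    shm = shared_memory.SharedMemory(name=handle["name"])
    n_bytes = handle["count"] * struct.calcsize("f")
    # Copied out here for simplicity; a real consumer would view the buffer in place.
    values = list(struct.unpack(f"{handle['count']}f", bytes(shm.buf[:n_bytes])))
    shm.close()
    return values


channel: "queue.Queue" = queue.Queue()  # in-process stand-in for mp.Queue
owner = publish([0.5, 1.5, -2.0], channel)
received = consume(channel)
owner.close()
owner.unlink()  # release the shared block once all readers are done
print(received)  # [0.5, 1.5, -2.0]
```

The handle stays a few dozen bytes no matter how large the payload is, which is why queue traffic remains cheap even when observations include images.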
# Install (opt-in — keeps core repo light for single-process users)
pip install -e source/RRF_orchestra
# End-to-end smoke (2 env workers + 1 inference + 1 learner, 5 iterations)
python -m RRF_orchestra.examples.hello_world_orchestra
# 20 unit / integration tests (mp + shared-memory + orchestrator round-trip)
pytest tests/RRF_orchestra/ -v

See source/RRF_orchestra/README.md for the user guide and docs/PLAN-task3-orchestra-package.md for the full design (wire protocol, topology, per-task EnvWorker integration).
Naming note: this package was previously called RRF_distributed. It was renamed to RRF_orchestra to disambiguate from the data-parallel *_runner_distributed.py files (PyTorch DDP). The two are independent axes.
See docs/BENCHMARK_PLAN.md for the full experiment matrix.
| Experiment | VLM | Head | GPUs | Train Loss | Val Loss | Throughput | Notes |
|---|---|---|---|---|---|---|---|
| MLP Baseline | MockVLM | Regression | 1 × H100 | 0.1705 | 0.3899 | 3.34 batch/s | Sanity-check run |
| Qwen2-VL DDP | Qwen2-VL-2B | Regression | 3 × H100 | 0.0323 → 0.0228 | 0.0228 | ~67 samples/s | 4 epochs, AMP enabled |

| Task | Algorithm | Envs | Reward (start → end) | Steps/s | Hardware | Iters |
|---|---|---|---|---|---|---|
| Go1 Flat | PPO (GAE) | 256 | −5.77 → −0.31 | 1,100 | 1 × H100 | 20 |

| Paradigm | Algorithm | Status |
|---|---|---|
| Pretrain (SL) | VLAPretrainAlgorithm | ✅ Verified (single + 3-GPU DDP) |
| SFT | SFTAlgorithm (KL reg.) | ✅ Verified (single + 2-GPU DDP) |
| GRPO | GRPOAlgorithm | ✅ Verified |
| PPO (GAE) | PPOAlgorithm | ✅ Verified |
| Locomotion PPO | PPO (MJLab Go1) | ✅ Verified (H100, 1100 steps/s) |
| Script | Purpose | Docs |
|---|---|---|
| train_mjlab.py | Train on MJLab environments | --help for all options |
| train_lab.py | Train on Isaac Lab environments | Requires Isaac Sim |
| train_gym.py | Train on Gymnasium/MuJoCo | Standard envs |
| play_mjlab.py | Evaluate MJLab checkpoint | |
| play_lab.py | Evaluate Isaac Lab checkpoint | Video recording |
| train_single_gpu.py | VLA pretraining (1 GPU) | |
| train_ddp.py | VLA pretraining (multi-GPU) | Use with torchrun |
| train_sft.py | VLA supervised fine-tuning | |
| train_robotwin_grpo.py | VLA RL (GRPO/PPO) on RoboTwin | --algo grpo/ppo |
| train_vla_benchmark.py | VLA on LIBERO/ManiSkill/CALVIN | --benchmark libero |
- Built with alignment to rsl_rl conventions
- Isaac Lab integration via Isaac Lab
- MJLab integration via MJLab (MuJoCo Warp)
- VLM backbones from HuggingFace ecosystem
- Data format compatible with LeRobot
BSD-3-Clause. See LICENSE for details.
- Maintainer: Ziang Zheng — ziang_zheng@foxmail.com