Renforce-Dynamics/RoboRenForce

RoboRenForce

Modular RL & VLA Framework for Robotics — from Locomotion to Vision-Language-Action

Python 3.10+ License Isaac Lab MJLab

RoboRenForce is a unified framework that covers the full robotics RL pipeline: classic locomotion control (PPO/SAC on Isaac Lab, MJLab, Gymnasium), vision-language-action model training (Qwen2-VL, Qwen3-VL, OpenPI, GR00T), and multi-stage learning (pretrain → SFT → RL fine-tuning). Everything is driven by a composable @configclass system and a consistent wrapper chain across simulators.


Architecture

┌──────────────────────────────── RoboRenForce ────────────────────────────────┐
│                                                                              │
│   System 2 (VLM Backbone)            System 1 (Action Expert / Psi0)         │
│   ┌──────────────────────┐           ┌──────────────────────────────┐        │
│   │ Qwen2-VL / Qwen3-VL  │──feats──▶ │ Psi0 — Regression /          │        │
│   │ OpenPI / GR00T       │           │  Diffusion / Flow-Match head │        │
│   │ (frozen / LoRA)      │           │ chunked high-level actions   │        │
│   └──────────────────────┘           └──────────────────────────────┘        │
│              ▲                                       │                       │
│       obs (image+lang)                  high-level target / EE pose          │
│              │                                       ▼                       │
│              │                       System 0 (Whole-Body Loco Policy)       │
│              │                       ┌──────────────────────────────┐        │
│              │                       │ AMO · Sonic · custom RL/MPC  │        │
│              │                       │  joint-level torques / dq    │        │
│              │                       └──────────────────────────────┘        │
│              │                                       │                       │
│  ┌───────────┴───────────────────────────────────────┴─────────────────────┐ │
│  │                    Environment Wrapper Chain                            │ │
│  │  Isaac Lab ─┐                                                           │ │
│  │  MJLab     ─┤─▶ VecEnv ─▶ DynamicEnv ─▶ GroupVecWrapper ─▶ MultiModal  │ │
│  │  RoboTwin  ─┤                                                           │ │
│  │  Gymnasium ─┘                                                           │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘

System 2 reasons over vision + language and emits latent features.
System 1 (Psi0) consumes those features and outputs chunked high-level actions
(joint targets / EE poses) via a regression, diffusion, or flow-match head.
System 0 is a low-level whole-body locomotion policy (AMO, Sonic, …) trained
with PPO/SAC that tracks System 1's targets at simulator rate.
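
The division of labor between the three systems can be pictured as a nested loop running at three rates. The sketch below is illustrative only: `vlm_features`, `action_expert`, `low_level_policy`, the chunk length of 4 and the 5 simulator steps per target are hypothetical stand-ins chosen for the example, not RoboRenForce APIs or defaults.

```python
from typing import List, Tuple

CHUNK_LEN = 4   # hypothetical: high-level targets emitted per System 1 call
SIM_STEPS = 5   # hypothetical: simulator steps spent tracking each target


def vlm_features(image: List[float], instruction: str) -> List[float]:
    """System 2 stand-in: fuse vision and language into a feature vector."""
    return [sum(image) / len(image), float(len(instruction.split()))]


def action_expert(features: List[float]) -> List[float]:
    """System 1 stand-in: map features to a chunk of high-level targets."""
    return [features[0] + 0.1 * k for k in range(CHUNK_LEN)]


def low_level_policy(state: float, target: float) -> float:
    """System 0 stand-in: a proportional step toward the target (not a learned policy)."""
    return 0.5 * (target - state)


def run_episode(image: List[float], instruction: str,
                state: float = 0.0) -> List[Tuple[float, float]]:
    feats = vlm_features(image, instruction)   # slowest loop: once per chunk
    chunk = action_expert(feats)               # CHUNK_LEN high-level targets
    trace = []
    for target in chunk:                       # medium rate: one target at a time
        for _ in range(SIM_STEPS):             # fastest loop: simulator rate
            state += low_level_policy(state, target)
        trace.append((target, state))
    return trace
```

Calling `run_episode([1.0, 1.0], "pick up the cube")` returns one `(target, state)` pair per chunk entry; the point is only that the expensive VLM call is amortized over many low-level steps.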

Installation

Planning a GR00T finetune? Read docs/SETUP_GUIDE.md first — it walks through the system packages (python3.10-dev, ffmpeg, libaio-dev, git-lfs), downloading the gated nvidia/GR00T-N1.7-3B + nvidia/Cosmos-Reason2-2B weights, disk-space planning, and a verified end-to-end finetune on the bundled SO101 demo. For training data, see docs/DATA_DOWNLOAD.md (bundled demos, NVIDIA GR00T-flavored datasets, and converting community LeRobot v3 datasets).

git clone <repository-url>
cd RoboRenForce

# Core framework
pip install -e source/RoboRenForce

# Task packages (install what you need)
pip install -e source/tasks/RRF_isaaclab    # Isaac Lab locomotion/manipulation
pip install -e source/tasks/RRF_mjlab       # MJLab (MuJoCo Warp) locomotion
pip install -e source/tasks/RRF_robotwin    # RoboTwin manipulation
pip install -e source/tasks/RRF_humanoid_psi0  # Humanoid offline tasks

# Role-distributed VLA RL training (opt-in — see "Parallelism" section)
pip install -e source/RRF_orchestra

# External setup (robot assets, etc.)
bash scripts/setup_ext.sh

VLA model setup (optional)

# Download VLM weights
bash scripts/models/setup_models.sh qwen2vl    # Qwen2-VL 2B (~4.2GB)
bash scripts/models/setup_models.sh qwen3vl    # Qwen3-VL 2B
bash scripts/models/setup_models.sh openpi     # OpenPI pi0.5 4B
bash scripts/models/setup_models.sh groot      # GR00T N1.7 3B

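# Or install dependencies only
# (Qwen2-VL model classes first shipped in transformers 4.45, so expect to need a release newer than the 4.37 floor)
pip install "transformers>=4.37" qwen-vl-utils accelerate peft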

See docs/models.md for per-model details, VRAM requirements, and usage examples.

MJLab simulator setup

# Clone and install MJLab
git clone https://github.com/mujocolab/mjlab.git
pip install -e mjlab

# Requires: mujoco>=3.7.0, mujoco-warp>=3.7.0.1, warp-lang>=1.12.0

Quick Start

Locomotion Training (MJLab)

# Go1 quadruped on flat terrain — PPO
python scripts/renforce/train_mjlab.py \
    --task Mjlab-Velocity-Flat-Unitree-Go1 \
    --num_envs 4096 --device cuda:0

# G1 humanoid on rough terrain
python scripts/renforce/train_mjlab.py \
    --task Mjlab-Velocity-Rough-Unitree-G1 \
    --num_envs 2048 --max_iterations 30000

Locomotion Training (Isaac Lab)

python scripts/renforce/train_lab.py \
    --task RoboRenForce-AFR-UnitreeGo1Flat-PPO \
    --num_envs 4096 --headless

VLA Pretraining

# Single GPU
python scripts/vla/pretrain/train_single_gpu.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset --epochs 20

# Multi-GPU DDP
torchrun --nproc_per_node=4 scripts/vla/pretrain/train_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset --epochs 20

VLA RL Fine-tuning (GRPO / PPO)

python scripts/vla/rl/train_robotwin_grpo.py \
    --task close_laptop_lid --algo grpo --num_envs 32

More examples: SFT, evaluation, data pipeline

Supervised Fine-tuning (SFT)

# Single GPU
python scripts/vla/post_train/train_sft.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset --epochs 5

# Multi-GPU
torchrun --nproc_per_node=2 scripts/vla/post_train/train_sft_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset

Evaluation / Playback

# MJLab
python scripts/renforce/play_mjlab.py \
    --target logs/RFRL/mjlab_go1/model_5000.pt --num_envs 64

# Isaac Lab
python scripts/renforce/play_lab.py \
    --target logs/RFRL/go1_ppo/model_5000.pt --video

Data Conversion

# Download LeRobot dataset
python scripts/data/download_lerobot_dataset.py --repo lerobot/aloha_sim

# Isaac Lab trajectories → LeRobot format
python scripts/data/isaaclab_to_lerobot.py --input traj/ --output data/lerobot/

# RLDS → LeRobot
python scripts/data/rlds_to_lerobot.py --input rlds_data/ --output data/lerobot/

Supported Algorithms

Online RL — agent interacts with simulator, maximizes reward

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| PPO | On-policy, GAE | State | OnPolicyRunner | ppo.py |
| CAPS-PPO | On-policy, smooth | State | OnPolicyRunner | CAPS.py |
| L2C2-PPO | On-policy, smooth | State | OnPolicyRunner | L2C2.py |
| Lips-PPO | On-policy, Lipschitz | State | OnPolicyRunner | Lips.py |
| SAPG-PPO | On-policy, self-adaptive | State | SAPGOnPolicyRunner | sapg/ |
| EPO | On-policy, exploration | State | EPOOnPolicyRunner | epo/ |
| SAC | Off-policy, entropy-reg | State | OffPolicyRunner | sac/ |
| SAC-Seq | Off-policy, sequential | State | OffPolicyRunner | sac_seq.py |
| SAC-Trans | Off-policy, transformer | State | OffPolicyRunner | sac_trans.py |
| DSAC / DSACT | Off-policy, distributional | State | OffPolicyRunner | dsac/ |

Offline RL — learns from fixed dataset, no environment interaction

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| IQL | Implicit Q-Learning | State | OfflineRunnerBase | iql.py |

VLA Pretrain (SL) — supervised learning on demonstration data

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| VLA Pretrain | Behavior cloning (L1/MSE) | Image + Language + State | VLAPretrainRunner | pretrain_algorithm.py |
| SFT | Supervised fine-tuning (KL reg.) | Image + Language + State | VLASFTRunner | sft.py |
| DAgger | Online imitation + expert intervention | Image + Language + State | | dagger.py |

VLA RL Fine-tune — online RL with VLM policy backbone

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| GRPO | Group Relative Policy Opt | Image + Language + State | VLAGRPORunner | grpo.py |
| VLA-PPO | PPO over VLM backbone | Image + Language + State | VLAPPORunner | ppo.py |
| VLA-SAC | SAC over VLM backbone | Image + Language + State | | sac.py |
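
GRPO differs from PPO mainly in how it obtains advantages: instead of a learned critic, each rollout is scored against the other rollouts sampled for the same task. The snippet below is a minimal sketch of that group-relative normalization in plain Python. The function name and the `eps` default are assumptions for illustration, and the actual `grpo.py` may normalize differently.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each rollout relative to its own group (rollouts of the same task)."""
    mu = mean(rewards)        # group baseline replaces a learned value function
    sigma = pstdev(rewards)   # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of four rollouts with rewards `[1.0, 0.0, 0.0, 1.0]`, the two successful rollouts receive positive advantages and the two failures negative ones; if every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient.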

NN Model-Based — learn dynamics model, plan or augment policy

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| MBPO | Model-Based Policy Opt | State | MBPOOnPolicyRunner | mbpo/ |
| System Dynamics (MLP) | Forward model f(s,a)→s' | State | OfflineRunnerBase | system_dynamics_mlp.py |
| System Dynamics (Transformer) | Forward model f(s,a)→s' | State | OfflineRunnerBase | system_dynamics_transformer.py |
| TD-MPC / TD-MPC2 | Latent dynamics + planning | State | NNModelBasedRunner | tdmpcs/ |
| Belief Flow Model | Belief state dynamics | State | FlowModelRunner | belief_flow_model/ |

Imitation Learning — learn from expert demonstrations or motions

| Algorithm | Type | Input | Runner | Reference |
|---|---|---|---|---|
| GAIL + PPO | Adversarial IL | State | OnPolicyRunner | gail_ppo.py |
| AMP + PPO | Adversarial Motion Priors | State | OnPolicyRunner | amp_ppo.py |
| Distillation | Knowledge transfer | State | | distillation.py |

Supported VLM Backbones

| Model | Params | Output Dim | Action Heads | License |
|---|---|---|---|---|
| Qwen2-VL | 2B / 7B | 1536 | Regression, Diffusion | Apache 2.0 |
| Qwen3-VL | 2B / 8B | 2048 | Regression, Diffusion | Apache 2.0 |
| OpenPI (pi0.5) | 4B | 2048 | Flow Matching | Apache 2.0 + Gemma |
| GR00T N1.7 | 3B | 2048 | DiT | Apache 2.0 |
| MLP Baseline | ~1M | 64 | Regression | Built-in |

See docs/models.md for full details.


Supported Environments

RoboRenForce covers four training paradigms:

  • Online RL — agent interacts with simulator, maximizes reward (PPO, SAC)
  • Offline RL — learns from fixed dataset via Q-learning, no env interaction (IQL, CQL)
  • VLA Pretrain (SL) — supervised learning on demonstration data with VLM backbone
  • VLA RL Fine-tune — online RL with a VLA policy (GRPO, PPO over VLM)

| Platform | Package | Robots / Tasks | Paradigm | Script |
|---|---|---|---|---|
| MJLab (MuJoCo Warp) | RRF_mjlab | Go1, G1: velocity tracking (flat/rough) | Online RL | train_mjlab.py |
| Isaac Lab (Isaac Sim) | RRF_isaaclab | A1, Go1, Go2, Anymal B/C/D, H1, G1 | Online RL | train_lab.py |
| Gymnasium | Built-in | Classic control, MuJoCo | Online RL | train_gym.py |
| D4RL | RRF_d4rl | Walker2d, Hopper, HalfCheetah | Offline RL | OfflineRunnerBase |
| LIBERO | RRF_libero | Franka: 10/90/130 manipulation tasks | VLA Pretrain / VLA RL | train_vla_benchmark.py |
| ManiSkill | RRF_maniskill | Franka: GPU-accelerated manipulation | VLA Pretrain / VLA RL | train_vla_benchmark.py |
| CALVIN | RRF_calvin | Franka: 5-subtask long-horizon eval | VLA Pretrain / VLA RL | train_vla_benchmark.py |
| RoboTwin (SAPIEN3) | RRF_robotwin | Piper, ALOHA: 60+ manipulation tasks | VLA RL | train_robotwin_grpo.py |
| Humanoid Psi0 | RRF_humanoid_psi0 | G1 Dex3: 14 pick-and-place demos | VLA Pretrain | train_single_gpu.py |

Configuration System

All components are configured via @configclass — a decorator that extends Python dataclasses with type validation, serialization, and factory construction.
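
To make "dataclass plus type validation, serialization, and factory construction" concrete, here is a minimal sketch of a decorator with those three features. It is not the RoboRenForce implementation: `sketch_configclass`, `validate`, `to_dict`, `from_dict` and the two example classes are hypothetical names, and unlike the real `@configclass` this sketch requires annotated fields.

```python
from dataclasses import asdict, dataclass, field, fields, is_dataclass
from typing import Any, Dict


def sketch_configclass(cls):
    """Hypothetical stand-in: dataclass + type checks + dict round-tripping."""
    cls = dataclass(cls)

    def validate(self) -> None:
        # Only plain class annotations (int, str, nested configs) are checked here.
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(f.type, type) and not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
                )

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)  # recurses into nested dataclasses

    @classmethod
    def from_dict(klass, data: Dict[str, Any]):
        kwargs = {}
        for f in fields(klass):
            if f.name not in data:
                continue  # fall back to the field default
            value = data[f.name]
            if is_dataclass(f.type) and isinstance(value, dict):
                value = f.type.from_dict(value)  # rebuild nested configs
            kwargs[f.name] = value
        obj = klass(**kwargs)
        obj.validate()
        return obj

    cls.validate = validate
    cls.to_dict = to_dict
    cls.from_dict = from_dict
    return cls


@sketch_configclass
class PPOSketchCfg:
    clip_param: float = 0.2
    gamma: float = 0.99


@sketch_configclass
class RunnerSketchCfg:
    seed: int = 42
    experiment_name: str = "my_experiment"
    algorithm: PPOSketchCfg = field(default_factory=PPOSketchCfg)
```

With this sketch, `RunnerSketchCfg.from_dict(RunnerSketchCfg(seed=7).to_dict())` rebuilds an equal config, while `RunnerSketchCfg.from_dict({"seed": "seven"})` raises `TypeError`.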

Example: defining a custom PPO config

from RoboRenForce import configclass
from RoboRenForce import runners, algorithms, components, networks

@configclass
class MyLocoPPOCfg(runners.OnPolicyRunnerCfg):
    seed = 42
    num_steps_per_env = 24
    max_iterations = 10000
    experiment_name = "my_experiment"

    policy = components.ActorCriticPackCfg(
        actor_cfg=components.StateIndStdActorCfg(
            backbone_cfg=networks.MLPCfg(
                hidden_features=[512, 256, 128],
                activations=[[('ELU', {})]] * 3 + [[]]
            ),
            use_log_std=False
        ),
        critic_cfg=components.VNetworkCfg(
            backbone_cfg=networks.MLPCfg(
                hidden_features=[512, 256, 128],
                activations=[[('ELU', {})]] * 3 + [[]]
            )
        )
    )

    algorithm = algorithms.PPOCfg(
        clip_param=0.2,
        entropy_coef=0.01,
        num_learning_epochs=5,
        num_mini_batches=4,
        learning_rate=1.0e-3,
        schedule="adaptive",
        gamma=0.99,
        lam=0.95,
    )
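
The `gamma` and `lam` fields in the PPO config above are the discount factor and the GAE mixing parameter. As a reference for what they control, here is a small single-environment sketch of Generalized Advantage Estimation in plain Python; the function name and list-based signature are assumptions for illustration, and the framework's own rollout storage is presumably batched and tensor-based.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    dones: Sequence[bool],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) for one rollout of a single environment."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value for the state after the last step
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=1.0` recovers discounted Monte Carlo returns minus the value baseline, while `lam=0.0` reduces to the one-step TD error; values in between trade variance against bias.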
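
Example: registering a task with gymnasium

import gymnasium as gym

# `my_env_cfg` is your environment config and `MyLocoPPOCfg` the runner config
# from the previous example; both must be defined or imported before this call.
gym.register(
    id="RoboRenForce-MyTask-PPO",
    entry_point="mjlab.envs:ManagerBasedRlEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": my_env_cfg,
        "RoboRenForce_entry_point": MyLocoPPOCfg(),
    },
)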

Example: environment wrapper chain

from mjlab.envs import ManagerBasedRlEnv
from RRF_mjlab_tasks.mjlab_utils import (
    RoboRenForceMJLabEnvWrapper,   # Base: obs remapping, step adaptation
    MJLabDynamicEnvWrapper,         # + reward/command extraction, dim_params
    MJLabGroupVecWrapper,           # + train/eval env partitioning
)

env = ManagerBasedRlEnv(cfg=my_cfg, device="cuda:0")
wrapped = MJLabDynamicEnvWrapper(env)

print(wrapped.num_envs)       # 4096
print(wrapped.dim_params)     # {'policy_dim': 48, 'critic_dim': 72, ...}
obs, extras = wrapped.reset() # obs: (4096, 48)
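
The wrapper chain follows the usual decorator pattern: each layer adds one concern and delegates everything else to the layer below. The toy sketch below shows that pattern without any simulator. Every class in it (`ToyVecEnv`, `Wrapper`, `ObsRemapWrapper`, `DimParamsWrapper`, `TrainEvalSplitWrapper`) is a hypothetical stand-in, not one of the RoboRenForce wrappers.

```python
class ToyVecEnv:
    """Hypothetical vectorized env returning dict observations."""

    def __init__(self, num_envs: int, obs_dim: int):
        self.num_envs = num_envs
        self.obs_dim = obs_dim

    def reset(self):
        obs = {"policy": [[0.0] * self.obs_dim for _ in range(self.num_envs)]}
        return obs, {}


class Wrapper:
    """Pass-through base: unknown attributes are looked up on the wrapped env."""

    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        return getattr(self.env, name)


class ObsRemapWrapper(Wrapper):
    """Layer 1: flatten dict observations to the policy observation."""

    def reset(self):
        obs, extras = self.env.reset()
        return obs["policy"], extras


class DimParamsWrapper(Wrapper):
    """Layer 2: expose the dimensions a runner needs to build networks."""

    @property
    def dim_params(self):
        return {"policy_dim": self.env.obs_dim}


class TrainEvalSplitWrapper(Wrapper):
    """Layer 3: reserve the last `num_eval_envs` environments for evaluation."""

    def __init__(self, env, num_eval_envs: int):
        super().__init__(env)
        split = env.num_envs - num_eval_envs
        self.train_ids = list(range(split))
        self.eval_ids = list(range(split, env.num_envs))


wrapped = TrainEvalSplitWrapper(
    DimParamsWrapper(ObsRemapWrapper(ToyVecEnv(num_envs=8, obs_dim=3))),
    num_eval_envs=2,
)
```

Because each layer only overrides what it changes, `wrapped.num_envs`, `wrapped.dim_params` and `wrapped.reset()` all resolve through the chain regardless of which layer defines them.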

Project Structure

RoboRenForce/
├── source/
│   ├── RoboRenForce/                   # Core framework
│   │   └── RoboRenForce/
│   │       ├── algorithms/             # RL algorithms
│   │       │   ├── on_policy/          #   PPO, MBPO, SAPG, smooth variants
│   │       │   ├── off_policy/         #   SAC, DSAC
│   │       │   ├── vla_training/       #   Pretrain, SFT, GRPO, PPO, IQL, DAgger
│   │       │   └── nn_model_trainer/
│   │       ├── runners/                # Training loops
│   │       │   ├── on_policy/          #   OnPolicyRunner, SAPG, EPO
│   │       │   ├── off_policy/         #   OffPolicyRunner
│   │       │   ├── vla/               #   Pretrain, SFT, GRPO (+ DDP variants)
│   │       │   └── nn_model_based/   #   MBPO, Flow model
│   │       ├── networks/               # Neural network modules
│   │       │   ├── vlm/               #   Qwen2-VL, Qwen3-VL, OpenPI, GR00T
│   │       │   ├── transformer/       #   Transformer backbone
│   │       │   └── mlp.py, vae/, moe.py, fft_filter.py
│   │       ├── components/             # Actors, critics, normalizers
│   │       │   ├── actor/             #   Gaussian, SAC, Lipschitz, VLA actors
│   │       │   ├── critic/            #   V-net, Q-net, distributional
│   │       │   └── normalizer/        #   Empirical normalizer
│   │       ├── buffer/                 # Replay buffers & rollout storage
│   │       └── utils/                  # Config system, env wrappers, tools
│   │           ├── configclass/       #   @configclass decorator
│   │           └── env_wrapper/       #   Lab, Gym, VLA wrapper chains
│   ├── RRF_orchestra/                  # Role-distributed runner (opt-in)
│   │   └── RRF_orchestra/              #   rollout / inference / learner role-split
│   │       ├── protocol/              #     Wire protocol (messages, channels, shared_tensor)
│   │       ├── workers/               #     Env / Inference workers + batchers
│   │       ├── orchestrator/          #     Topology, Supervisor, OrchestraVLARunner
│   │       ├── adapters/              #     Algorithm + policy adapters
│   │       └── examples/              #     hello_world_orchestra
│   └── tasks/                          # Task packages
│       ├── RRF_isaaclab/              #   Isaac Lab locomotion & manipulation
│       ├── RRF_mjlab/                 #   MJLab (MuJoCo Warp) locomotion
│       ├── RRF_robotwin/              #   RoboTwin 60+ manipulation tasks
│       └── RRF_humanoid_psi0/         #   Humanoid offline datasets
├── scripts/
│   ├── renforce/                       # Standard RL training & evaluation
│   │   ├── train_lab.py               #   Isaac Lab training
│   │   ├── train_mjlab.py             #   MJLab training
│   │   ├── train_gym.py               #   Gymnasium training
│   │   ├── play_lab.py                #   Isaac Lab evaluation
│   │   └── play_mjlab.py              #   MJLab evaluation
│   ├── vla/                            # VLA model training
│   │   ├── pretrain/                  #   Single-GPU & DDP pretraining
│   │   ├── post_train/                #   SFT (single & DDP)
│   │   └── rl/                        #   GRPO / PPO fine-tuning
│   └── data/                           # Dataset tools
│       ├── download_lerobot_dataset.py
│       ├── isaaclab_to_lerobot.py
│       └── rlds_to_lerobot.py
├── docs/
│   ├── models.md                       # VLM backbone details & setup
│   └── BENCHMARK_PLAN.md              # Benchmark experiment specs
└── tests/                              # Test suites

Parallelism: Two Independent Axes

RoboRenForce ships with two orthogonal parallelism mechanisms — they solve different problems and can be composed.

| Axis | Where it lives | Problem solved | Typical use |
|---|---|---|---|
| Data-parallel (DDP) | RoboRenForce/runners/vla/{pretrain,post_train}/*_runner_distributed.py + scripts/vla/{pretrain,post_train}/train_*_ddp.py | Single GPU is too small / too slow for the desired batch. Same role on every rank, gradients all-reduced via NCCL. | VLA pretrain / SFT on N GPUs, launched with torchrun --nproc_per_node=N |
| Role-distributed (Orchestra) | source/RRF_orchestra/ (opt-in package) | CPU sim and GPU inference idle waiting on each other. One process per role: EnvWorker × N, InferenceWorker × 1, Learner × 1. Tensors moved over torch.multiprocessing.Queue with shared-memory zero-copy. | VLA RL fine-tuning on a single machine; VLA model dominates step time and you want batched inference across N parallel envs |

The two axes are designed to compose (e.g. role-distributed rollout with a DDP learner), but v1 keeps them decoupled.

Data-parallel (DDP)

# VLA Pretraining — 4 GPUs
torchrun --nproc_per_node=4 scripts/vla/pretrain/train_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/humanoid_psi0 --epochs 20

# VLA SFT — 2 GPUs
torchrun --nproc_per_node=2 scripts/vla/post_train/train_sft_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset

DDP implementation details

  • Serialized model loading (rank 0 first, then barrier) to avoid HuggingFace cache races
  • 30-minute NCCL timeout for large model initialization
  • device_map_auto=False for DDP compatibility (no model sharding)
  • Unwrapped model for validation (avoids DDP deadlock on single-rank validation)
  • DistributedSampler with proper epoch shuffling
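
"Proper epoch shuffling" in the last item refers to the fact that PyTorch's `DistributedSampler` only reshuffles when `set_epoch(epoch)` is called at the start of each epoch; without it every epoch replays the same order. The pure-Python sketch below mimics the sharding logic to show why. `shard_indices` is a hypothetical helper written for this illustration, it assumes the dataset has at least as many items as there are ranks, and it does not reproduce PyTorch's exact permutations.

```python
import random
from typing import List


def shard_indices(dataset_len: int, num_replicas: int, rank: int,
                  epoch: int, seed: int = 0) -> List[int]:
    """Return the dataset indices one rank would visit in a given epoch."""
    rng = random.Random(seed + epoch)  # same seed on every rank, new one per epoch
    indices = list(range(dataset_len))
    rng.shuffle(indices)
    # Pad by repeating leading indices so every rank gets the same count.
    per_rank = -(-dataset_len // num_replicas)  # ceiling division
    total = per_rank * num_replicas
    indices += indices[: total - len(indices)]
    # Strided slice: rank r takes positions r, r + num_replicas, ...
    return indices[rank:total:num_replicas]
```

All ranks derive the same permutation from `seed + epoch` and then take disjoint strided slices, so together they cover the whole dataset; advancing `epoch` changes the permutation on every rank in lockstep.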

Role-distributed (Orchestra)

RRF_orchestra runs each role (env / inference / learner) in its own process. Large tensors travel through shared memory (zero-copy); only small handles and metadata go through mp.Queue. The OrchestraVLARunner is a drop-in replacement for the single-process VLA RL runner.

   EnvWorker × N           InferenceWorker × 1         Learner × 1
   ┌───────────┐  ObsBatch  ┌─────────────────┐         ┌─────────┐
   │ env.step  │ ─────────▶ │ batched VLA inf │ ──────▶ │ collect │
   │           │ ◀───────── │ ActionBatch     │         │ traj    │
   └─────┬─────┘            └────────▲────────┘         │ grad    │
         │ Trajectory                 │ WeightUpdate    │ update  │
         ▼                            └─────────────────┤         │
   ┌──────────────────────────────────────────────────────┘
   │              channels (mp.Queue + shared-mem tensors)
   └────────────────────────────────────────────────────────

# Install (opt-in; keeps the core repo light for single-process users)
pip install -e source/RRF_orchestra

# End-to-end smoke (2 env workers + 1 inference + 1 learner, 5 iterations)
python -m RRF_orchestra.examples.hello_world_orchestra

# 20 unit / integration tests (mp + shared-memory + orchestrator round-trip)
pytest tests/RRF_orchestra/ -v
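
The "payload in shared memory, handle in the queue" idea used by the channels can be illustrated with the standard library alone. This is a simplified analogue, not the RRF_orchestra wire protocol: `publish`, `consume` and the handle layout are hypothetical, and the real package works with torch tensors rather than packed floats.

```python
import multiprocessing as mp
import struct
from multiprocessing import shared_memory
from typing import List, Sequence


def publish(values: Sequence[float], queue) -> shared_memory.SharedMemory:
    """Write float32 values into a shared block and enqueue only a small handle."""
    fmt = f"{len(values)}f"
    shm = shared_memory.SharedMemory(create=True, size=struct.calcsize(fmt))
    struct.pack_into(fmt, shm.buf, 0, *values)
    queue.put({"name": shm.name, "fmt": fmt})  # metadata only, no payload bytes
    return shm  # the producer keeps ownership and unlinks the block later


def consume(queue) -> List[float]:
    """Attach to the block named in the handle and read the payload."""
    handle = queue.get(timeout=5)
    shm = shared_memory.SharedMemory(name=handle["name"])
    try:
        return list(struct.unpack_from(handle["fmt"], shm.buf, 0))
    finally:
        shm.close()  # detach; the producer is responsible for unlink()
```

In a real deployment `publish` and `consume` would run in different processes connected by an `mp.Queue`; the message size stays constant no matter how large the payload is, which is what keeps the queue off the critical path.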

See source/RRF_orchestra/README.md for the user guide and docs/PLAN-task3-orchestra-package.md for the full design (wire protocol, topology, per-task EnvWorker integration).

Naming note: this package was previously called RRF_distributed. It was renamed to RRF_orchestra to disambiguate from the data-parallel *_runner_distributed.py files (PyTorch DDP). The two are independent axes.


Benchmark Results

See docs/BENCHMARK_PLAN.md for the full experiment matrix.

VLA Pretraining

| Experiment | VLM | Head | GPUs | Train Loss | Val Loss | Throughput | Notes |
|---|---|---|---|---|---|---|---|
| MLP Baseline | MockVLM | Regression | 1 × H100 | 0.1705 | 0.3899 | 3.34 batch/s | Sanity-check run |
| Qwen2-VL DDP | Qwen2-VL-2B | Regression | 3 × H100 | 0.0323 → 0.0228 | 0.0228 | ~67 samples/s | 4 epochs, AMP enabled |

Locomotion (MJLab)

| Task | Algorithm | Envs | Reward (start → end) | Steps/s | Hardware | Iters |
|---|---|---|---|---|---|---|
| Go1 Flat | PPO (GAE) | 256 | −5.77 → −0.31 | 1,100 | 1 × H100 | 20 |

Algorithm Verification

| Paradigm | Algorithm | Status |
|---|---|---|
| Pretrain (SL) | VLAPretrainAlgorithm | ✅ Verified (single + 3-GPU DDP) |
| SFT | SFTAlgorithm (KL reg.) | ✅ Verified (single + 2-GPU DDP) |
| GRPO | GRPOAlgorithm | ✅ Verified |
| PPO (GAE) | PPOAlgorithm | ✅ Verified |
| Locomotion PPO | PPO (MJLab Go1) | ✅ Verified (H100, 1100 steps/s) |

Scripts Reference

| Script | Purpose | Docs |
|---|---|---|
| train_mjlab.py | Train on MJLab environments | --help for all options |
| train_lab.py | Train on Isaac Lab environments | Requires Isaac Sim |
| train_gym.py | Train on Gymnasium/MuJoCo | Standard envs |
| play_mjlab.py | Evaluate MJLab checkpoint | |
| play_lab.py | Evaluate Isaac Lab checkpoint | Video recording |
| train_single_gpu.py | VLA pretraining (1 GPU) | |
| train_ddp.py | VLA pretraining (multi-GPU) | Use with torchrun |
| train_sft.py | VLA supervised fine-tuning | |
| train_robotwin_grpo.py | VLA RL (GRPO/PPO) on RoboTwin | --algo grpo/ppo |
| train_vla_benchmark.py | VLA on LIBERO/ManiSkill/CALVIN | --benchmark libero |

Acknowledgments

  • Follows rsl_rl conventions
  • Isaac Lab integration builds on Isaac Lab
  • MJLab integration via MJLab (MuJoCo Warp)
  • VLM backbones from HuggingFace ecosystem
  • Data format compatible with LeRobot

License

BSD-3-Clause. See LICENSE for details.

Contact
