RoboRenForce

Modular RL & VLA Framework for Robotics — from Locomotion to Vision-Language-Action

RoboRenForce is a unified framework that covers the full robotics RL pipeline: classic locomotion control (PPO/SAC on Isaac Lab, MJLab, Gymnasium), vision-language-action model training (Qwen2-VL, Qwen3-VL, OpenPI, GR00T), and multi-stage learning (pretrain → SFT → RL fine-tuning). Everything is driven by a composable @configclass system and a consistent wrapper chain across simulators.

Architecture

┌──────────────────────────────── RoboRenForce ────────────────────────────────┐
│                                                                              │
│   System 2 (VLM Backbone)            System 1 (Action Expert / Psi0)         │
│   ┌──────────────────────┐           ┌──────────────────────────────┐        │
│   │ Qwen2-VL / Qwen3-VL  │──feats──▶ │ Psi0 — Regression /          │        │
│   │ OpenPI / GR00T       │           │  Diffusion / Flow-Match head │        │
│   │ (frozen / LoRA)      │           │ chunked high-level actions   │        │
│   └──────────────────────┘           └──────────────────────────────┘        │
│              ▲                                       │                       │
│       obs (image+lang)                  high-level target / EE pose          │
│              │                                       ▼                       │
│              │                       System 0 (Whole-Body Loco Policy)       │
│              │                       ┌──────────────────────────────┐        │
│              │                       │ AMO · Sonic · custom RL/MPC  │        │
│              │                       │  joint-level torques / dq    │        │
│              │                       └──────────────────────────────┘        │
│              │                                       │                       │
│  ┌───────────┴───────────────────────────────────────┴─────────────────────┐ │
│  │                    Environment Wrapper Chain                            │ │
│  │  Isaac Lab ─┐                                                           │ │
│  │  MJLab     ─┤─▶ VecEnv ─▶ DynamicEnv ─▶ GroupVecWrapper ─▶ MultiModal  │ │
│  │  RoboTwin  ─┤                                                           │ │
│  │  Gymnasium ─┘                                                           │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘

System 2 reasons over vision + language and emits latent features.
System 1 (Psi0) consumes those features and outputs chunked high-level actions
(joint targets / EE poses) via a regression, diffusion, or flow-match head.
System 0 is a low-level whole-body locomotion policy (AMO, Sonic, …) trained
with PPO/SAC that tracks System 1's targets at simulator rate.
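
The division of labour between the three systems can be pictured as a nested control loop. The snippet below is a toy, stdlib-only illustration: the function names (`system2_features`, `system1_action_chunk`, `system0_track`), the chunk length, and the proportional tracking rule are hypothetical stand-ins, not RoboRenForce APIs.

```python
from typing import List

CHUNK_LEN = 4             # hypothetical: high-level targets per System 1 call
SIM_STEPS_PER_TARGET = 5  # hypothetical: System 0 steps spent on each target


def system2_features(image: List[float], instruction: str) -> List[float]:
    """Stand-in for the VLM backbone: observation + language -> features."""
    return [sum(image) / len(image), float(len(instruction))]


def system1_action_chunk(features: List[float]) -> List[float]:
    """Stand-in for the action expert: emits a chunk of high-level targets."""
    base = features[0]
    return [base + 0.1 * k for k in range(CHUNK_LEN)]


def system0_track(joint_pos: float, target: float, gain: float = 0.5) -> float:
    """Stand-in for the low-level policy: one simulator step toward the target."""
    return joint_pos + gain * (target - joint_pos)


def run_episode(image: List[float], instruction: str, joint_pos: float = 0.0) -> float:
    features = system2_features(image, instruction)  # slowest loop: once per chunk
    for target in system1_action_chunk(features):    # one high-level target at a time
        for _ in range(SIM_STEPS_PER_TARGET):        # fastest loop: simulator rate
            joint_pos = system0_track(joint_pos, target)
    return joint_pos
```

The point of the nesting is that the expensive model runs least often, while the cheap tracking policy runs at every simulator step.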

Installation

Planning a GR00T finetune? Read docs/SETUP_GUIDE.md first — it walks through the system packages (python3.10-dev, ffmpeg, libaio-dev, git-lfs), downloading the gated nvidia/GR00T-N1.7-3B + nvidia/Cosmos-Reason2-2B weights, disk-space planning, and a verified end-to-end finetune on the bundled SO101 demo. For training data, see docs/DATA_DOWNLOAD.md (bundled demos, NVIDIA GR00T-flavored datasets, and converting community LeRobot v3 datasets).

git clone <repository-url>
cd RoboRenForce

# Core framework
pip install -e source/RoboRenForce

# Task packages (install what you need)
pip install -e source/tasks/RRF_isaaclab    # Isaac Lab locomotion/manipulation
pip install -e source/tasks/RRF_mjlab       # MJLab (MuJoCo Warp) locomotion
pip install -e source/tasks/RRF_robotwin    # RoboTwin manipulation
pip install -e source/tasks/RRF_humanoid_psi0  # Humanoid offline tasks

# Role-distributed VLA RL training (opt-in — see "Parallelism" section)
pip install -e source/RRF_orchestra

# External setup (robot assets, etc.)
bash scripts/setup_ext.sh

VLA model setup (optional)

# Download VLM weights
bash scripts/models/setup_models.sh qwen2vl    # Qwen2-VL 2B (~4.2GB)
bash scripts/models/setup_models.sh qwen3vl    # Qwen3-VL 2B
bash scripts/models/setup_models.sh openpi     # OpenPI pi0.5 4B
bash scripts/models/setup_models.sh groot      # GR00T N1.7 3B

# Or install dependencies only
pip install "transformers>=4.37" qwen-vl-utils accelerate peft

See docs/models.md for per-model details, VRAM requirements, and usage examples.

MJLab simulator setup

# Clone and install MJLab
git clone https://github.com/mujocolab/mjlab.git
pip install -e mjlab

# Requires: mujoco>=3.7.0, mujoco-warp>=3.7.0.1, warp-lang>=1.12.0

Quick Start

Locomotion Training (MJLab)

# Go1 quadruped on flat terrain — PPO
python scripts/renforce/train_mjlab.py \
    --task Mjlab-Velocity-Flat-Unitree-Go1 \
    --num_envs 4096 --device cuda:0

# G1 humanoid on rough terrain
python scripts/renforce/train_mjlab.py \
    --task Mjlab-Velocity-Rough-Unitree-G1 \
    --num_envs 2048 --max_iterations 30000

Locomotion Training (Isaac Lab)

python scripts/renforce/train_lab.py \
    --task RoboRenForce-AFR-UnitreeGo1Flat-PPO \
    --num_envs 4096 --headless

VLA Pretraining

# Single GPU
python scripts/vla/pretrain/train_single_gpu.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset --epochs 20

# Multi-GPU DDP
torchrun --nproc_per_node=4 scripts/vla/pretrain/train_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset --epochs 20

VLA RL Fine-tuning (GRPO / PPO)

python scripts/vla/rl/train_robotwin_grpo.py \
    --task close_laptop_lid --algo grpo --num_envs 32

More examples: SFT, evaluation, data pipeline

Supervised Fine-tuning (SFT)

# Single GPU
python scripts/vla/post_train/train_sft.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset --epochs 5

# Multi-GPU
torchrun --nproc_per_node=2 scripts/vla/post_train/train_sft_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset

Evaluation / Playback

# MJLab
python scripts/renforce/play_mjlab.py \
    --target logs/RFRL/mjlab_go1/model_5000.pt --num_envs 64

# Isaac Lab
python scripts/renforce/play_lab.py \
    --target logs/RFRL/go1_ppo/model_5000.pt --video

Data Conversion

# Download LeRobot dataset
python scripts/data/download_lerobot_dataset.py --repo lerobot/aloha_sim

# Isaac Lab trajectories → LeRobot format
python scripts/data/isaaclab_to_lerobot.py --input traj/ --output data/lerobot/

# RLDS → LeRobot
python scripts/data/rlds_to_lerobot.py --input rlds_data/ --output data/lerobot/

Supported Algorithms

Online RL — agent interacts with simulator, maximizes reward

Algorithm	Type	Input	Runner	Reference
PPO	On-policy, GAE	State	`OnPolicyRunner`	ppo.py
CAPS-PPO	On-policy, smooth	State	`OnPolicyRunner`	CAPS.py
L2C2-PPO	On-policy, smooth	State	`OnPolicyRunner`	L2C2.py
Lips-PPO	On-policy, Lipschitz	State	`OnPolicyRunner`	Lips.py
SAPG-PPO	On-policy, self-adaptive	State	`SAPGOnPolicyRunner`	sapg/
EPO	On-policy, exploration	State	`EPOOnPolicyRunner`	epo/
SAC	Off-policy, entropy-reg	State	`OffPolicyRunner`	sac/
SAC-Seq	Off-policy, sequential	State	`OffPolicyRunner`	sac_seq.py
SAC-Trans	Off-policy, transformer	State	`OffPolicyRunner`	sac_trans.py
DSAC / DSACT	Off-policy, distributional	State	`OffPolicyRunner`	dsac/

Offline RL — learns from fixed dataset, no environment interaction

Algorithm	Type	Input	Runner	Reference
IQL	Implicit Q-Learning	State	`OfflineRunnerBase`	iql.py

VLA Pretrain (SL) — supervised learning on demonstration data

Algorithm	Type	Input	Runner	Reference
VLA Pretrain	Behavior cloning (L1/MSE)	Image + Language + State	`VLAPretrainRunner`	pretrain_algorithm.py
SFT	Supervised fine-tuning (KL reg.)	Image + Language + State	`VLASFTRunner`	sft.py
DAgger	Online imitation + expert intervention	Image + Language + State	—	dagger.py

VLA RL Fine-tune — online RL with VLM policy backbone

Algorithm	Type	Input	Runner	Reference
GRPO	Group Relative Policy Opt	Image + Language + State	`VLAGRPORunner`	grpo.py
VLA-PPO	PPO over VLM backbone	Image + Language + State	`VLAPPORunner`	ppo.py
VLA-SAC	SAC over VLM backbone	Image + Language + State	—	sac.py
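
The "group relative" part of GRPO refers to how advantages are computed: several rollouts of the same task are scored, and each rollout's reward is normalized against the mean and standard deviation of its group instead of a learned critic. A minimal sketch of that normalization, assuming scalar episode returns and a hypothetical helper name `group_relative_advantages` (this is the textbook formulation, not necessarily what `grpo.py` does):

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every rollout got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, `group_relative_advantages([1.0, 0.0, 0.0, 1.0])` gives roughly `[1, -1, -1, 1]`: successful rollouts are pushed up and failed ones pushed down, with no value network involved.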

NN Model-Based — learn dynamics model, plan or augment policy

Algorithm	Type	Input	Runner	Reference
MBPO	Model-Based Policy Opt	State	`MBPOOnPolicyRunner`	mbpo/
System Dynamics (MLP)	Forward model f(s,a)→s'	State	`OfflineRunnerBase`	system_dynamics_mlp.py
System Dynamics (Transformer)	Forward model f(s,a)→s'	State	`OfflineRunnerBase`	system_dynamics_transformer.py
TD-MPC / TD-MPC2	Latent dynamics + planning	State	`NNModelBasedRunner`	tdmpcs/
Belief Flow Model	Belief state dynamics	State	`FlowModelRunner`	belief_flow_model/

Imitation Learning — learn from expert demonstrations or motions

Algorithm	Type	Input	Runner	Reference
GAIL + PPO	Adversarial IL	State	`OnPolicyRunner`	gail_ppo.py
AMP + PPO	Adversarial Motion Priors	State	`OnPolicyRunner`	amp_ppo.py
Distillation	Knowledge transfer	State	—	distillation.py

Supported VLM Backbones

Model	Params	Output Dim	Action Heads	License
Qwen2-VL	2B / 7B	1536	Regression, Diffusion	Apache 2.0
Qwen3-VL	2B / 8B	2048	Regression, Diffusion	Apache 2.0
OpenPI (pi0.5)	4B	2048	Flow Matching	Apache 2.0 + Gemma
GR00T N1.7	3B	2048	DiT	Apache 2.0
MLP Baseline	~1M	64	Regression	Built-in

See docs/models.md for full details.

Supported Environments

RoboRenForce covers four training paradigms:

Online RL — agent interacts with simulator, maximizes reward (PPO, SAC)
Offline RL — learns from fixed dataset via Q-learning, no env interaction (IQL, CQL)
VLA Pretrain (SL) — supervised learning on demonstration data with VLM backbone
VLA RL Fine-tune — online RL with a VLA policy (GRPO, PPO over VLM)

Platform	Package	Robots / Tasks	Paradigm	Script
MJLab (MuJoCo Warp)	`RRF_mjlab`	Go1, G1 — velocity tracking (flat/rough)	Online RL	`train_mjlab.py`
Isaac Lab (Isaac Sim)	`RRF_isaaclab`	A1, Go1, Go2, Anymal B/C/D, H1, G1	Online RL	`train_lab.py`
Gymnasium	Built-in	Classic control, MuJoCo	Online RL	`train_gym.py`
D4RL	`RRF_d4rl`	Walker2d, Hopper, HalfCheetah	Offline RL	`OfflineRunnerBase`
LIBERO	`RRF_libero`	Franka — 10/90/130 manipulation tasks	VLA Pretrain / VLA RL	`train_vla_benchmark.py`
ManiSkill	`RRF_maniskill`	Franka — GPU-accelerated manipulation	VLA Pretrain / VLA RL	`train_vla_benchmark.py`
CALVIN	`RRF_calvin`	Franka — 5-subtask long-horizon eval	VLA Pretrain / VLA RL	`train_vla_benchmark.py`
RoboTwin (SAPIEN3)	`RRF_robotwin`	Piper, ALOHA — 60+ manipulation tasks	VLA RL	`train_robotwin_grpo.py`
Humanoid Psi0	`RRF_humanoid_psi0`	G1 Dex3 — 14 pick-and-place demos	VLA Pretrain	`train_single_gpu.py`

Configuration System

All components are configured via @configclass — a decorator that extends Python dataclasses with type validation, serialization, and factory construction.
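
To make "dataclass plus validation, serialization, and factory construction" concrete, here is a stdlib-only toy that mimics those three features. It is not the RoboRenForce implementation: `toy_configclass`, `validate`, `to_dict`, `build`, and `class_type` are names invented for this sketch, and the real `@configclass` API may differ.

```python
from dataclasses import asdict, dataclass, fields


def toy_configclass(cls):
    """Hypothetical stand-in for @configclass: a dataclass plus three helpers."""
    cls = dataclass(cls)

    def validate(self):
        # Type validation: check each field value against its annotation.
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(f.type, type) and not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} expected {f.type.__name__}, got {type(value).__name__}"
                )
        return self

    def to_dict(self):
        # Serialization: a plain dict that could be dumped to YAML or JSON.
        return asdict(self)

    def build(self):
        # Factory construction: instantiate the class named by `class_type`.
        return self.class_type(**self.to_dict())

    cls.validate, cls.to_dict, cls.build = validate, to_dict, build
    return cls


class ToyPPO:
    """Toy algorithm object that the config knows how to construct."""

    def __init__(self, clip_param, num_mini_batches, schedule):
        self.clip_param = clip_param
        self.num_mini_batches = num_mini_batches
        self.schedule = schedule


@toy_configclass
class ToyPPOCfg:
    class_type = ToyPPO  # unannotated, so it is not a dataclass field
    clip_param: float = 0.2
    num_mini_batches: int = 4
    schedule: str = "adaptive"
```

With this toy, `ToyPPOCfg(clip_param=0.3).validate().build()` returns a `ToyPPO` instance, while `ToyPPOCfg(num_mini_batches="4").validate()` raises `TypeError`.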

Example: defining a custom PPO config

from RoboRenForce import configclass
from RoboRenForce import runners, algorithms, components, networks

@configclass
class MyLocoPPOCfg(runners.OnPolicyRunnerCfg):
    seed = 42
    num_steps_per_env = 24
    max_iterations = 10000
    experiment_name = "my_experiment"

    policy = components.ActorCriticPackCfg(
        actor_cfg=components.StateIndStdActorCfg(
            backbone_cfg=networks.MLPCfg(
                hidden_features=[512, 256, 128],
                activations=[[('ELU', {})]] * 3 + [[]]
            ),
            use_log_std=False
        ),
        critic_cfg=components.VNetworkCfg(
            backbone_cfg=networks.MLPCfg(
                hidden_features=[512, 256, 128],
                activations=[[('ELU', {})]] * 3 + [[]]
            )
        )
    )

    algorithm = algorithms.PPOCfg(
        clip_param=0.2,
        entropy_coef=0.01,
        num_learning_epochs=5,
        num_mini_batches=4,
        learning_rate=1.0e-3,
        schedule="adaptive",
        gamma=0.99,
        lam=0.95,
    )
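
For readers unfamiliar with the last two fields: `gamma` is the discount factor and `lam` is the GAE(λ) parameter that trades bias against variance in the advantage estimate. The sketch below is the standard GAE recursion in plain Python, assuming a single environment and a hypothetical function name `compute_gae`. It is meant to show where the two hyperparameters enter, not to mirror the framework's `ppo.py`.

```python
from typing import List, Tuple


def compute_gae(
    rewards: List[float],
    values: List[float],
    last_value: float,
    dones: List[bool],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) for one rollout of a single environment."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value for the state after the rollout
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # no bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=0` reduces the estimate to the one-step TD error, and `lam=1` recovers the discounted Monte Carlo return minus the value baseline.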

Example: registering a task with gymnasium

import gymnasium as gym

gym.register(
    id="RoboRenForce-MyTask-PPO",
    entry_point="mjlab.envs:ManagerBasedRlEnv",
    disable_env_checker=True,
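    kwargs={
        # `my_env_cfg` is your environment config, defined elsewhere.
        "env_cfg_entry_point": my_env_cfg,
        # Runner config from the PPO example above.
        "RoboRenForce_entry_point": MyLocoPPOCfg(),
    },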
)

Example: environment wrapper chain

from mjlab.envs import ManagerBasedRlEnv
from RRF_mjlab_tasks.mjlab_utils import (
    RoboRenForceMJLabEnvWrapper,   # Base: obs remapping, step adaptation
    MJLabDynamicEnvWrapper,         # + reward/command extraction, dim_params
    MJLabGroupVecWrapper,           # + train/eval env partitioning
)

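# `my_cfg` is your MJLab environment config, defined elsewhere.
env = ManagerBasedRlEnv(cfg=my_cfg, device="cuda:0")
wrapped = MJLabDynamicEnvWrapper(env)

# Values in the comments are illustrative and depend on the task config.
print(wrapped.num_envs)       # e.g. 4096
print(wrapped.dim_params)     # e.g. {'policy_dim': 48, 'critic_dim': 72, ...}
obs, extras = wrapped.reset() # obs shape, e.g. (4096, 48)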

Project Structure

RoboRenForce/
├── source/
│   ├── RoboRenForce/                   # Core framework
│   │   └── RoboRenForce/
│   │       ├── algorithms/             # RL algorithms
│   │       │   ├── on_policy/          #   PPO, MBPO, SAPG, smooth variants
│   │       │   ├── off_policy/         #   SAC, DSAC
│   │       │   ├── vla_training/       #   Pretrain, SFT, GRPO, PPO, IQL, DAgger
│   │       │   └── nn_model_trainer/
│   │       ├── runners/                # Training loops
│   │       │   ├── on_policy/          #   OnPolicyRunner, SAPG, EPO
│   │       │   ├── off_policy/         #   OffPolicyRunner
│   │       │   ├── vla/               #   Pretrain, SFT, GRPO (+ DDP variants)
│   │       │   └── nn_model_based/   #   MBPO, Flow model
│   │       ├── networks/               # Neural network modules
│   │       │   ├── vlm/               #   Qwen2-VL, Qwen3-VL, OpenPI, GR00T
│   │       │   ├── transformer/       #   Transformer backbone
│   │       │   └── mlp.py, vae/, moe.py, fft_filter.py
│   │       ├── components/             # Actors, critics, normalizers
│   │       │   ├── actor/             #   Gaussian, SAC, Lipschitz, VLA actors
│   │       │   ├── critic/            #   V-net, Q-net, distributional
│   │       │   └── normalizer/        #   Empirical normalizer
│   │       ├── buffer/                 # Replay buffers & rollout storage
│   │       └── utils/                  # Config system, env wrappers, tools
│   │           ├── configclass/       #   @configclass decorator
│   │           └── env_wrapper/       #   Lab, Gym, VLA wrapper chains
│   ├── RRF_orchestra/                  # Role-distributed runner (opt-in)
│   │   └── RRF_orchestra/              #   rollout / inference / learner role-split
│   │       ├── protocol/              #     Wire protocol (messages, channels, shared_tensor)
│   │       ├── workers/               #     Env / Inference workers + batchers
│   │       ├── orchestrator/          #     Topology, Supervisor, OrchestraVLARunner
│   │       ├── adapters/              #     Algorithm + policy adapters
│   │       └── examples/              #     hello_world_orchestra
│   └── tasks/                          # Task packages
│       ├── RRF_isaaclab/              #   Isaac Lab locomotion & manipulation
│       ├── RRF_mjlab/                 #   MJLab (MuJoCo Warp) locomotion
│       ├── RRF_robotwin/              #   RoboTwin 60+ manipulation tasks
│       └── RRF_humanoid_psi0/         #   Humanoid offline datasets
├── scripts/
│   ├── renforce/                       # Standard RL training & evaluation
│   │   ├── train_lab.py               #   Isaac Lab training
│   │   ├── train_mjlab.py             #   MJLab training
│   │   ├── train_gym.py               #   Gymnasium training
│   │   ├── play_lab.py                #   Isaac Lab evaluation
│   │   └── play_mjlab.py              #   MJLab evaluation
│   ├── vla/                            # VLA model training
│   │   ├── pretrain/                  #   Single-GPU & DDP pretraining
│   │   ├── post_train/                #   SFT (single & DDP)
│   │   └── rl/                        #   GRPO / PPO fine-tuning
│   └── data/                           # Dataset tools
│       ├── download_lerobot_dataset.py
│       ├── isaaclab_to_lerobot.py
│       └── rlds_to_lerobot.py
├── docs/
│   ├── models.md                       # VLM backbone details & setup
│   └── BENCHMARK_PLAN.md              # Benchmark experiment specs
└── tests/                              # Test suites

Parallelism: Two Independent Axes

RoboRenForce ships with two orthogonal parallelism mechanisms — they solve different problems and can be composed.

Axis	Where it lives	Problem solved	Typical use
Data-parallel (DDP)	`RoboRenForce/runners/vla/{pretrain,post_train}/*_runner_distributed.py` + `scripts/vla/pretrain/train_ddp.py`, `scripts/vla/post_train/train_sft_ddp.py`	A single GPU is too small or too slow for the desired batch. Every rank runs the same role, and gradients are all-reduced via NCCL.	VLA pretrain / SFT on N GPUs, launched with `torchrun --nproc_per_node=N`
Role-distributed (Orchestra)	`source/RRF_orchestra/` — opt-in package	CPU sim and GPU inference idle waiting on each other. One process per role: `EnvWorker × N`, `InferenceWorker × 1`, `Learner × 1`. Tensors moved over `torch.multiprocessing.Queue` with shared-memory zero-copy.	VLA RL fine-tuning on a single machine; VLA model dominates step time and you want batched inference across N parallel envs

The two axes compose cleanly — e.g. role-distributed rollout + DDP learner — though v1 keeps them decoupled.

Data-parallel (DDP)

# VLA Pretraining — 4 GPUs
torchrun --nproc_per_node=4 scripts/vla/pretrain/train_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/humanoid_psi0 --epochs 20

# VLA SFT — 2 GPUs
torchrun --nproc_per_node=2 scripts/vla/post_train/train_sft_ddp.py \
    --model_name Qwen/Qwen2-VL-2B-Instruct \
    --dataset_path data/my_dataset

DDP implementation details

Serialized model loading (rank 0 first, then barrier) to avoid HuggingFace cache races
30-minute NCCL timeout for large model initialization
device_map_auto=False for DDP compatibility (no model sharding)
Unwrapped model for validation (avoids DDP deadlock on single-rank validation)
DistributedSampler with proper epoch shuffling
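
The last item is easy to get wrong: every rank must shuffle with the same seed within an epoch (so the shards stay disjoint), and the seed must change between epochs (so the order actually varies). With PyTorch's `DistributedSampler`, calling `set_epoch(epoch)` before each epoch is what achieves this. The stdlib-only sketch below imitates that behaviour to show the idea; `shard_indices` is a hypothetical helper, not part of PyTorch or RoboRenForce.

```python
import random
from typing import List


def shard_indices(
    dataset_len: int, num_replicas: int, rank: int, epoch: int, seed: int = 0
) -> List[int]:
    """Indices that `rank` should read during `epoch`."""
    indices = list(range(dataset_len))
    # Same seed on every rank, different seed on every epoch.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating samples so all ranks get the same number of items.
    per_rank = -(-dataset_len // num_replicas)  # ceiling division
    total = per_rank * num_replicas
    padded = (indices * (total // dataset_len + 1))[:total]
    # Strided split: rank r takes positions r, r + num_replicas, ...
    return padded[rank:total:num_replicas]
```

If the epoch were left out of the seed, every epoch would replay the same order on every rank, which is the failure mode that forgetting `set_epoch` produces.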

Role-distributed (Orchestra)

RRF_orchestra runs each role (env / inference / learner) in its own process. Large tensors travel through shared memory (zero-copy); only small handles and metadata go through mp.Queue. The OrchestraVLARunner is a drop-in replacement for the single-process VLA RL runner.

   EnvWorker × N           InferenceWorker × 1         Learner × 1
   ┌───────────┐  ObsBatch  ┌─────────────────┐         ┌─────────┐
   │ env.step  │ ─────────▶ │ batched VLA inf │ ──────▶ │ collect │
   │           │ ◀───────── │ ActionBatch     │         │ traj    │
   └─────┬─────┘            └────────▲────────┘         │ grad    │
         │ Trajectory                 │ WeightUpdate    │ update  │
         ▼                            └─────────────────┤         │
   ┌──────────────────────────────────────────────────────┘
   │              channels (mp.Queue + shared-mem tensors)
   └────────────────────────────────────────────────────────
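
The "small handles through the queue, payload through shared memory" pattern can be illustrated with the standard library alone. In the sketch below, `publish` and `consume` are hypothetical helpers, a flat float32 buffer stands in for an observation tensor, and both ends run in one process for brevity. In RRF_orchestra the two ends would be separate worker processes, and the actual wire protocol is the one defined in its `protocol/` package.

```python
from array import array
from multiprocessing import Queue, shared_memory
from typing import List


def publish(values: List[float], queue) -> shared_memory.SharedMemory:
    """Write the payload into shared memory and send only a small handle."""
    payload = array("f", values)  # float32 buffer standing in for a tensor
    nbytes = payload.itemsize * len(payload)
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    shm.buf[:nbytes] = payload.tobytes()
    # Only the segment name and size go through the queue, not the data.
    queue.put({"name": shm.name, "nbytes": nbytes})
    return shm  # the producer keeps ownership and unlinks it later


def consume(queue) -> List[float]:
    """Attach to the segment named in the handle and read the payload in place."""
    handle = queue.get(timeout=5)
    shm = shared_memory.SharedMemory(name=handle["name"])
    # The cast view reads the shared buffer directly; tolist() is only for the demo.
    with shm.buf[: handle["nbytes"]].cast("f") as view:
        values = view.tolist()
    shm.close()
    return values
```

A round trip looks like `q = Queue(); owner = publish([1.0, 2.5, -3.0], q); consume(q)`, followed by `owner.close()` and `owner.unlink()` once the consumer is done with the segment.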

# Install (opt-in — keeps core repo light for single-process users)
pip install -e source/RRF_orchestra

# End-to-end smoke (2 env workers + 1 inference + 1 learner, 5 iterations)
python -m RRF_orchestra.examples.hello_world_orchestra

# 20 unit / integration tests (mp + shared-memory + orchestrator round-trip)
pytest tests/RRF_orchestra/ -v

See source/RRF_orchestra/README.md for the user guide and docs/PLAN-task3-orchestra-package.md for the full design (wire protocol, topology, per-task EnvWorker integration).

Naming note: this package was previously called RRF_distributed. It was renamed to RRF_orchestra to disambiguate from the data-parallel *_runner_distributed.py files (PyTorch DDP). The two are independent axes.

Benchmark Results

See docs/BENCHMARK_PLAN.md for the full experiment matrix.

VLA Pretraining

Experiment	VLM	Head	GPUs	Train Loss	Val Loss	Throughput	Notes
MLP Baseline	MockVLM	Regression	1 × H100	0.1705	0.3899	3.34 batch/s	Sanity-check run
Qwen2-VL DDP	Qwen2-VL-2B	Regression	3 × H100	0.0323 → 0.0228	0.0228	~67 samples/s	4 epochs, AMP enabled

Locomotion (MJLab)

Task	Algorithm	Envs	Reward (start → end)	Steps/s	Hardware	Iters
Go1 Flat	PPO (GAE)	256	−5.77 → −0.31	1,100	1 × H100	20

Algorithm Verification

Paradigm	Algorithm	Status
Pretrain (SL)	VLAPretrainAlgorithm	✅ Verified (single + 3-GPU DDP)
SFT	SFTAlgorithm (KL reg.)	✅ Verified (single + 2-GPU DDP)
GRPO	GRPOAlgorithm	✅ Verified
PPO (GAE)	PPOAlgorithm	✅ Verified
Locomotion PPO	PPO (MJLab Go1)	✅ Verified (H100, 1100 steps/s)

Scripts Reference

Script	Purpose	Docs
`train_mjlab.py`	Train on MJLab environments	`--help` for all options
`train_lab.py`	Train on Isaac Lab environments	Requires Isaac Sim
`train_gym.py`	Train on Gymnasium/MuJoCo	Standard envs
`play_mjlab.py`	Evaluate MJLab checkpoint
`play_lab.py`	Evaluate Isaac Lab checkpoint	Video recording
`train_single_gpu.py`	VLA pretraining (1 GPU)
`train_ddp.py`	VLA pretraining (multi-GPU)	Use with `torchrun`
`train_sft.py`	VLA supervised fine-tuning
`train_robotwin_grpo.py`	VLA RL (GRPO/PPO) on RoboTwin	`--algo grpo/ppo`
`train_vla_benchmark.py`	VLA on LIBERO/ManiSkill/CALVIN	`--benchmark libero`

Acknowledgments

Follows rsl_rl conventions
Isaac Lab integration via Isaac Lab
MJLab integration via MJLab (MuJoCo Warp)
VLM backbones from HuggingFace ecosystem
Data format compatible with LeRobot

License

BSD-3-Clause. See LICENSE for details.

Contact

Maintainer: Ziang Zheng — ziang_zheng@foxmail.com
