# Training with PPO

Train policies on SO101-Nexus environments using the included CleanRL-based PPO script.
SO101-Nexus includes a PPO training script at `examples/ppo.py`. It is a CleanRL-based implementation with an MLP actor-critic that serves as a baseline for RL training. The script auto-detects the backend from the environment ID prefix.
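Prefix-based backend detection can be sketched as below. This is a hypothetical helper for illustration; the actual logic in `examples/ppo.py` may differ.

```python
# Hypothetical sketch of prefix-based backend detection; the real
# examples/ppo.py may implement this differently.
def detect_backend(env_id: str) -> str:
    if env_id.startswith("MuJoCo"):
        return "mujoco"
    if env_id.startswith("ManiSkill"):
        return "maniskill"
    raise ValueError(f"Unrecognized env ID prefix: {env_id}")

print(detect_backend("MuJoCoPickLift-v1"))  # mujoco
```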
## Quick start

### MuJoCo

```sh
uv run python examples/ppo.py \
  --env-id MuJoCoPickLift-v1 \
  --total-timesteps 200000
```

### ManiSkill
ManiSkill environments require the `--prerelease=allow` flag due to package constraints:

```sh
uv run --package so101-nexus-maniskill --prerelease=allow \
  python examples/ppo.py \
  --env-id ManiSkillPickLiftSO101-v1 \
  --total-timesteps 200000
```

## CLI arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `--env-id` | str | required | Gymnasium environment ID |
| `--total-timesteps` | int | 10000000 | Total training timesteps |
| `--num-envs` | int | 8 | Number of parallel environments |
| `--num-steps` | int | 128 | Rollout length per environment per update |
| `--learning-rate` | float | 2.5e-4 | Optimizer learning rate |
| `--seed` | int | -- | Random seed for reproducibility |
| `--track` | flag | off | Enable Weights and Biases logging |
| `--capture-video` | flag | off | Record evaluation videos |
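With the defaults above, each PPO update collects `num_envs × num_steps` transitions, which determines how many updates a run performs. A quick back-of-the-envelope check:

```python
num_envs = 8                   # --num-envs default
num_steps = 128                # --num-steps default
total_timesteps = 10_000_000   # --total-timesteps default

batch_size = num_envs * num_steps           # transitions collected per update
num_updates = total_timesteps // batch_size # updates performed over a full run

print(batch_size)   # 1024
print(num_updates)  # 9765
```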
## Vectorized environments

The `--num-envs` flag controls how many environments run in parallel. More environments increase throughput but require more memory.

For MuJoCo, environments are vectorized using Gymnasium's built-in `SyncVectorEnv`. ManiSkill environments use ManiSkill's native GPU-accelerated vectorization when available.
A reasonable starting point for local training:

```sh
uv run python examples/ppo.py \
  --env-id MuJoCoPickLift-v1 \
  --num-envs 4 \
  --num-steps 128 \
  --total-timesteps 500000
```

## Tracking with Weights and Biases
Pass `--track` to log training metrics to W&B:

```sh
uv run python examples/ppo.py \
  --env-id MuJoCoPickLift-v1 \
  --total-timesteps 1000000 \
  --track
```

## Recording videos
Pass `--capture-video` to save periodic evaluation rollouts:

```sh
uv run python examples/ppo.py \
  --env-id MuJoCoPickLift-v1 \
  --total-timesteps 500000 \
  --capture-video
```

## Notes
This script is a training baseline, not a tuned solution. For production use, you will likely need to adjust hyperparameters, reward weights (see Customizing Environments), and the network architecture. The MLP actor-critic is intentionally simple to make the code easy to read and modify.
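If you do want to modify the network, a simplified sketch of the usual CleanRL-style MLP actor-critic is below. Layer sizes, the Gaussian policy head, and the orthogonal initialization are illustrative assumptions, not necessarily what `examples/ppo.py` uses.

```python
import torch
import torch.nn as nn

def layer_init(layer, std=2**0.5):
    # Orthogonal weight init, as in CleanRL's PPO reference code.
    nn.init.orthogonal_(layer.weight, std)
    nn.init.zeros_(layer.bias)
    return layer

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        # Value head: obs -> scalar state value.
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, hidden)), nn.Tanh(),
            layer_init(nn.Linear(hidden, hidden)), nn.Tanh(),
            layer_init(nn.Linear(hidden, 1), std=1.0),
        )
        # Policy head: mean of a Gaussian over continuous actions.
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, hidden)), nn.Tanh(),
            layer_init(nn.Linear(hidden, hidden)), nn.Tanh(),
            layer_init(nn.Linear(hidden, act_dim), std=0.01),
        )
        # State-independent log-std, a common PPO choice.
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def forward(self, obs):
        return self.actor_mean(obs), self.critic(obs)

net = ActorCritic(obs_dim=12, act_dim=6)
mean, value = net(torch.zeros(8, 12))
print(mean.shape, value.shape)  # torch.Size([8, 6]) torch.Size([8, 1])
```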