Training with PPO
Train policies on SO101-Nexus environments using the included CleanRL-based PPO script.
SO101-Nexus includes a PPO training script at examples/ppo.py. It is a CleanRL-based implementation with an MLP actor-critic that serves as a baseline for RL training.
Quick start
uv run --extra train python examples/ppo.py \
--env-id MuJoCoPickLift-v1 \
--total-timesteps 200000
CLI arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --env-id | str | required | Gymnasium environment ID |
| --total-timesteps | int | 10000000 | Total training timesteps |
| --num-envs | int | 8 | Number of parallel environments |
| --num-steps | int | 128 | Rollout length per environment per update |
| --learning-rate | float | 2.5e-4 | Optimizer learning rate |
| --seed | int | -- | Random seed for reproducibility |
| --track | flag | off | Enable Weights and Biases logging |
| --capture-video | flag | off | Record evaluation videos |
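
To show how the flags in the table map onto a parser, here is a minimal sketch using the standard-library argparse module. It is an illustration only: examples/ppo.py may define its arguments differently (CleanRL scripts often use a dataclass-based parser), and the build_parser helper name is hypothetical. The table gives no default for --seed, so None is used as a stand-in.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table (not the script's real code)."""
    parser = argparse.ArgumentParser(description="PPO baseline (illustrative sketch)")
    parser.add_argument("--env-id", type=str, required=True,
                        help="Gymnasium environment ID")
    parser.add_argument("--total-timesteps", type=int, default=10_000_000,
                        help="Total training timesteps")
    parser.add_argument("--num-envs", type=int, default=8,
                        help="Number of parallel environments")
    parser.add_argument("--num-steps", type=int, default=128,
                        help="Rollout length per environment per update")
    parser.add_argument("--learning-rate", type=float, default=2.5e-4,
                        help="Optimizer learning rate")
    # Assumption: the table lists no default for --seed, so None stands in here.
    parser.add_argument("--seed", type=int, default=None,
                        help="Random seed for reproducibility")
    parser.add_argument("--track", action="store_true",
                        help="Enable Weights and Biases logging")
    parser.add_argument("--capture-video", action="store_true",
                        help="Record evaluation videos")
    return parser


# Parse the quick-start command's flags; unspecified flags fall back to defaults.
args = build_parser().parse_args(
    ["--env-id", "MuJoCoPickLift-v1", "--total-timesteps", "200000"]
)
print(args.env_id, args.total_timesteps, args.num_envs, args.track)
```

argparse converts dashes to underscores, so --env-id is read back as args.env_id.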
Vectorized environments
The --num-envs flag controls how many environments run in parallel. More environments increase throughput but require more memory.
Environments are vectorized using Gymnasium's built-in SyncVectorEnv, which steps each environment in turn within a single process and returns the results as one batch.
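
To make the effect of --num-envs and --num-steps concrete, the arithmetic below follows the usual CleanRL convention, in which each policy update consumes num_envs × num_steps transitions. This is a sketch under that assumption; the rollout_budget helper is hypothetical and the script may compute these values differently.

```python
def rollout_budget(total_timesteps: int, num_envs: int, num_steps: int) -> tuple[int, int]:
    """Return (batch_size, num_updates) under the assumed CleanRL-style convention."""
    # Transitions collected across all environments before each policy update.
    batch_size = num_envs * num_steps
    # Whole updates that fit in the timestep budget (any remainder is dropped).
    num_updates = total_timesteps // batch_size
    return batch_size, num_updates


# Defaults from the argument table: 8 environments, 128 steps, 10,000,000 timesteps.
print(rollout_budget(10_000_000, 8, 128))  # (1024, 9765)

# Quick-start budget of 200,000 timesteps with the default 8 x 128 rollout.
print(rollout_budget(200_000, 8, 128))  # (1024, 195)
```

Under this convention, doubling --num-envs doubles the batch collected per update and halves the number of updates for a fixed timestep budget.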
A reasonable starting point for local training:
uv run --extra train python examples/ppo.py \
--env-id MuJoCoPickLift-v1 \
--num-envs 4 \
--num-steps 128 \
--total-timesteps 500000
Tracking with Weights and Biases
Pass --track to log training metrics to W&B:
uv run --extra train python examples/ppo.py \
--env-id MuJoCoPickLift-v1 \
--total-timesteps 1000000 \
--track
Recording videos
Pass --capture-video to save periodic evaluation rollouts:
uv run --extra train python examples/ppo.py \
--env-id MuJoCoPickLift-v1 \
--total-timesteps 500000 \
--capture-video
GPU-parallel training with Warp
For large-scale RL, the Warp backend runs thousands of environments in parallel on the GPU. Install the warp extra and use examples/ppo_warp.py, which trains directly on the batched Warp*-v1 vector environments:
uv run --extra warp python examples/ppo_warp.py \
--env-id WarpTouch-v1 \
--num-envs 4096 \
--device cuda
Notes
This script is a training baseline, not a tuned solution. For production use, you will likely need to adjust hyperparameters, reward weights (see Customizing Environments), and the network architecture. The MLP actor-critic is intentionally simple to make the code easy to read and modify.
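
One low-effort way to start adjusting hyperparameters is to run the baseline over a few learning rates and seeds. The sketch below only builds the command lines, using flags from the argument table. The sweep_commands helper and the specific values are illustrative assumptions, not recommended settings.

```python
import itertools
import shlex


def sweep_commands(env_id, learning_rates, seeds, total_timesteps=500_000):
    """Build one examples/ppo.py command per (learning rate, seed) pair.

    Hypothetical helper: it only assembles argument lists and launches nothing.
    """
    base = ["uv", "run", "--extra", "train", "python", "examples/ppo.py"]
    commands = []
    for lr, seed in itertools.product(learning_rates, seeds):
        commands.append(
            base
            + ["--env-id", env_id]
            + ["--total-timesteps", str(total_timesteps)]
            + ["--learning-rate", str(lr)]
            + ["--seed", str(seed)]
        )
    return commands


# Two learning rates x two seeds gives four runs; the values are illustrative only.
for cmd in sweep_commands("MuJoCoPickLift-v1", [1e-4, 2.5e-4], [1, 2]):
    print(shlex.join(cmd))
```

Each returned list can be passed to subprocess.run to launch the corresponding run.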