
Training with PPO

Train policies on SO101-Nexus environments using the included CleanRL-based PPO script.

SO101-Nexus includes a PPO training script at examples/ppo.py. It is a CleanRL-based implementation with an MLP actor-critic that serves as a baseline for RL training. The script auto-detects the backend from the environment ID prefix.
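The backend detection is essentially a prefix check on the environment ID. A minimal sketch of the idea (the function name and return values here are illustrative, not the script's actual code):

```python
def detect_backend(env_id: str) -> str:
    """Illustrative sketch: choose a simulation backend from the env ID prefix.

    Assumes IDs follow the naming shown in this guide,
    e.g. MuJoCoPickLift-v1 or ManiSkillPickLiftSO101-v1.
    """
    if env_id.startswith("ManiSkill"):
        return "maniskill"
    if env_id.startswith("MuJoCo"):
        return "mujoco"
    raise ValueError(f"Cannot detect backend for env ID: {env_id}")
```

In practice this means you never pass a backend flag explicitly; choosing the env ID chooses the backend.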

Quick start

MuJoCo

uv run python examples/ppo.py \
    --env-id MuJoCoPickLift-v1 \
    --total-timesteps 200000

ManiSkill

ManiSkill environments require the --prerelease=allow flag due to package constraints:

uv run --package so101-nexus-maniskill --prerelease=allow \
    python examples/ppo.py \
    --env-id ManiSkillPickLiftSO101-v1 \
    --total-timesteps 200000

CLI arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `--env-id` | str | required | Gymnasium environment ID |
| `--total-timesteps` | int | 10000000 | Total training timesteps |
| `--num-envs` | int | 8 | Number of parallel environments |
| `--num-steps` | int | 128 | Rollout length per environment per update |
| `--learning-rate` | float | 2.5e-4 | Optimizer learning rate |
| `--seed` | int | -- | Random seed for reproducibility |
| `--track` | flag | off | Enable Weights & Biases logging |
| `--capture-video` | flag | off | Record evaluation videos |

Vectorized environments

The --num-envs flag controls how many environments run in parallel. More environments increase throughput but require more memory.

For MuJoCo, environments are vectorized using Gymnasium's built-in SyncVectorEnv. ManiSkill environments use ManiSkill's native GPU-accelerated vectorization when available.
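Conceptually, a synchronous vector env steps every sub-environment in sequence and stacks the results, auto-resetting any environment that terminates. A dependency-free toy sketch of that idea (`ToyEnv` and `sync_step` are illustrative stand-ins, not Gymnasium's implementation):

```python
class ToyEnv:
    """Minimal stand-in for an env: counts steps up to a fixed horizon."""
    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # observation

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        return self.t, 1.0, terminated  # obs, reward, done


def sync_step(envs, actions):
    """Step each env in order, resetting finished ones (SyncVectorEnv-style)."""
    obs, rewards, dones = [], [], []
    for env, action in zip(envs, actions):
        o, r, d = env.step(action)
        if d:
            o = env.reset()  # auto-reset so the batch always stays full
        obs.append(o)
        rewards.append(r)
        dones.append(d)
    return obs, rewards, dones


envs = [ToyEnv() for _ in range(4)]
for env in envs:
    env.reset()
obs, rewards, dones = sync_step(envs, actions=[0, 0, 0, 0])
```

Because the loop is sequential, synchronous vectorization scales with CPU speed; ManiSkill's GPU-accelerated vectorization avoids this loop entirely by batching the physics on device.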

A reasonable starting point for local training:

uv run python examples/ppo.py \
    --env-id MuJoCoPickLift-v1 \
    --num-envs 4 \
    --num-steps 128 \
    --total-timesteps 500000
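Memory and throughput scale with the PPO batch size, which is `num_envs * num_steps`. For the command above, the bookkeeping works out as follows (plain arithmetic, following standard CleanRL-style PPO accounting):

```python
num_envs = 4
num_steps = 128
total_timesteps = 500_000

# Transitions collected per PPO update (one rollout across all envs).
batch_size = num_envs * num_steps

# Number of full PPO updates performed over the run.
num_updates = total_timesteps // batch_size

print(batch_size, num_updates)  # 512 976
```

Doubling `--num-envs` doubles the batch size and halves the number of updates for a fixed `--total-timesteps`, which can change learning dynamics, not just speed.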

Tracking with Weights & Biases

Pass --track to log training metrics to W&B:

uv run python examples/ppo.py \
    --env-id MuJoCoPickLift-v1 \
    --total-timesteps 1000000 \
    --track

Recording videos

Pass --capture-video to save periodic evaluation rollouts:

uv run python examples/ppo.py \
    --env-id MuJoCoPickLift-v1 \
    --total-timesteps 500000 \
    --capture-video

Notes

This script is a training baseline, not a tuned solution. For production use, you will likely need to adjust hyperparameters, reward weights (see Customizing Environments), and the network architecture. The MLP actor-critic is intentionally simple to make the code easy to read and modify.
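For orientation, a CleanRL-style MLP actor-critic for continuous control typically looks something like the sketch below. The layer widths, Tanh activations, and state-independent log-std are common CleanRL conventions used here as illustrative assumptions, not necessarily the exact architecture in examples/ppo.py:

```python
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """CleanRL-style MLP actor-critic for continuous actions (illustrative)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        # Separate value head: obs -> scalar state value.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        # Policy head: obs -> mean of a diagonal Gaussian over actions.
        self.actor_mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # State-independent log standard deviation, one per action dim.
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def forward(self, obs):
        mean = self.actor_mean(obs)
        std = self.actor_logstd.expand_as(mean).exp()
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        logprob = dist.log_prob(action).sum(-1)
        value = self.critic(obs).squeeze(-1)
        return action, logprob, value


net = ActorCritic(obs_dim=12, act_dim=6)
action, logprob, value = net(torch.randn(8, 12))
```

Swapping in a larger network, observation normalization, or a different action distribution is a matter of editing this one module, which is the main reason the baseline keeps it small.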
