SO101-Nexus

Every SO101-Nexus environment builds its observation from a list of observation components. Components are lightweight descriptor classes that tell the environment which data to include. State components contribute slices of a flat vector; camera components add image tensors to a dictionary observation.

Observation Components

State Components

State components produce fixed-size slices of the observation vector.

Component	Dimensions	Description
`JointPositions`	6	Current angle of each robot joint
`EndEffectorPose`	7	TCP position (3) + quaternion orientation (4)
`TargetOffset`	3	Vector from gripper tip to goal position
`GazeDirection`	3	Unit vector from gripper toward the target object
`GraspState`	1	Binary grasp flag (1.0 = grasping, 0.0 = not)
`ObjectPose`	7	Target object position (3) + quaternion orientation (4)
`ObjectOffset`	3	Vector from gripper tip to target object
`TargetPosition`	3	Absolute goal position (x, y, z)
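
Because every state component has a fixed size, the position of each slice in the flat vector follows from the component order. The sketch below is plain Python and does not use the so101_nexus API. The `SIZES` table is copied from the table above; `state_layout` is a hypothetical helper, and it assumes slices are concatenated in the order the components are listed.

```python
# Sizes copied from the State Components table.
SIZES = {
    "JointPositions": 6,
    "EndEffectorPose": 7,
    "TargetOffset": 3,
    "GazeDirection": 3,
    "GraspState": 1,
    "ObjectPose": 7,
    "ObjectOffset": 3,
    "TargetPosition": 3,
}


def state_layout(component_names):
    """Map each component name to its (start, stop) range in the flat vector.

    Hypothetical helper: assumes slices are concatenated in list order.
    Returns the layout dict and the total state dimension.
    """
    layout = {}
    offset = 0
    for name in component_names:
        size = SIZES[name]
        layout[name] = (offset, offset + size)
        offset += size
    return layout, offset


layout, total = state_layout(
    ["EndEffectorPose", "GraspState", "ObjectPose", "ObjectOffset"]
)
print(total)                 # 18
print(layout["ObjectPose"])  # (8, 15)
```

Under that ordering assumption, `state[8:15]` would hold the object pose for this component list.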

Camera Components

Camera components add image tensors to a dict-style observation space. They contribute nothing to the state vector (their state size is 0).

Component	Key	Description
`WristCamera`	`"wrist_camera"`	RGB image from the camera mounted on the robot's wrist
`OverheadCamera`	`"overhead_camera"`	RGB image from a stationary camera above the workspace

Camera components accept width and height parameters (default 640x480). WristCamera also supports domain randomization parameters for FOV and pitch.

Both backends render camera observations. The MuJoCo backend renders on the CPU through per-environment OpenGL renderers. The MuJoCo Warp backend renders all worlds in one batched ray-tracing pass on the simulation device and returns uint8 image tensors of shape (num_envs, height, width, 3), with wrist-camera domain randomization applied per world. Because the two renderers differ (OpenGL rasterizer versus ray tracer), pixel values are not identical across backends, but observation keys, shapes, and dtypes match. Warp camera observations do not depend on the Gymnasium render_mode; render_mode visualization is available only with the MuJoCo backend. See Backend Support.

from so101_nexus import WristCamera, OverheadCamera

# Default resolution
wrist = WristCamera()

# Custom resolution with FOV randomization
wrist = WristCamera(width=224, height=224, fov_deg_range=(60.0, 90.0))

# Overhead camera
overhead = OverheadCamera(width=320, height=240, fov_deg=45.0)
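
Note the axis order: the constructors take width first, but the image shape described above is height first. The helper below is illustrative only (`expected_image_shape` is hypothetical, not part of so101_nexus). The shape without a leading batch axis is taken from the single-environment example under Inspecting Observations.

```python
def expected_image_shape(width, height, num_envs=None):
    """Return the expected uint8 RGB image shape for a camera component.

    Hypothetical helper. Images are (height, width, 3); a leading
    num_envs axis is added for batched rendering.
    """
    shape = (height, width, 3)
    if num_envs is not None:
        shape = (num_envs,) + shape
    return shape


print(expected_image_shape(width=320, height=240))               # (240, 320, 3)
print(expected_image_shape(width=224, height=224, num_envs=16))  # (16, 224, 224, 3)
```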

Composing Observations

Pass a list of components via the observations parameter on any config. Each task provides sensible defaults when observations is not specified.

from so101_nexus import (
    PickConfig, JointPositions, EndEffectorPose, GraspState,
    ObjectPose, ObjectOffset, WristCamera,
)

# State-only observations (default for PickLift)
config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
])

# Add a wrist camera to the observation
config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
    WristCamera(width=224, height=224),
])

When the observation list contains only state components, the observation is a flat NumPy array. When it contains one or more camera components, the observation becomes a dictionary with a "state" key (flat vector from all state components) plus one key per camera (e.g. "wrist_camera", "overhead_camera").
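
The rule above can be sketched in plain Python. `assemble_observation` is a hypothetical stand-in for what the environment does internally; it uses lists in place of NumPy arrays and image tensors so the snippet runs without dependencies.

```python
def assemble_observation(state_slices, camera_images):
    """Combine state slices and camera images into one observation.

    state_slices: list of lists of floats, one per state component.
    camera_images: dict mapping camera key -> image (any object here).
    """
    # Concatenate all state slices into one flat vector.
    state = [value for piece in state_slices for value in piece]
    if not camera_images:
        return state            # state-only: flat vector
    obs = {"state": state}      # with cameras: dictionary observation
    obs.update(camera_images)
    return obs


flat = assemble_observation([[0.1, 0.2, 0.3], [1.0]], {})
mixed = assemble_observation([[0.1, 0.2, 0.3], [1.0]], {"wrist_camera": "image"})
print(flat)           # [0.1, 0.2, 0.3, 1.0]
print(sorted(mixed))  # ['state', 'wrist_camera']
```

Downstream code that must handle both cases can branch on `isinstance(obs, dict)`.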

Default Observations by Task

Each task config fills in observations with the following defaults if you don't provide a list:

Task	Default Components	State Dimensions
PickLift	EndEffectorPose, GraspState, ObjectPose, ObjectOffset	18
PickAndPlace	EndEffectorPose, GraspState, TargetPosition, ObjectPose, ObjectOffset, TargetOffset	24
Touch	EndEffectorPose, GraspState, ObjectPose, ObjectOffset	18
LookAt	JointPositions	6
Move	JointPositions	6

Observation Modes

The obs_mode config parameter declares whether the observation is intended for a state-based or a vision-based policy:

`obs_mode="state"` (default)

The observation contains whatever components are listed in observations. This is useful for state-based reinforcement learning where the policy has access to ground-truth information.

`obs_mode="visual"`

Designed for vision-based policies. Requires at least one camera component (e.g. WristCamera() or OverheadCamera()) in the observations list. Construction raises an error if no camera component is present.

from so101_nexus import PickConfig, JointPositions, WristCamera

config = PickConfig(
    obs_mode="visual",
    observations=[JointPositions(), WristCamera(width=224, height=224)],
)
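
The construction-time check can be pictured as follows. This is a sketch, not the library's code: `check_obs_mode` is hypothetical, it works on component class names rather than instances, and `ValueError` is an assumed exception type, since the text above only says that construction raises an error.

```python
CAMERA_COMPONENTS = {"WristCamera", "OverheadCamera"}


def check_obs_mode(obs_mode, component_names):
    """Raise if obs_mode="visual" is requested without a camera component."""
    has_camera = any(name in CAMERA_COMPONENTS for name in component_names)
    if obs_mode == "visual" and not has_camera:
        # Assumed exception type; the real library may raise something else.
        raise ValueError('obs_mode="visual" requires at least one camera component')


check_obs_mode("visual", ["JointPositions", "WristCamera"])  # passes
check_obs_mode("state", ["JointPositions"])                   # passes

try:
    check_obs_mode("visual", ["JointPositions"])
except ValueError as err:
    print(err)
```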

Units: Degrees vs Radians

Configuration APIs (REST_POSE, *_deg fields, spawn ranges, leader-arm helpers) use degrees, per the library convention. Runtime values exposed by the env (obs["state"], env.step(action), env.action_space.low/high) use radians, which is the native unit of the underlying MuJoCo engine (and of the MuJoCo Warp backend) and the unit the integrator consumes when writing ctrl. The env therefore stays in radians at runtime: converting on every step would add work without changing the simulation result.

Code that crosses into LeRobot territory converts at the boundary:

The teleop recorder (so101_nexus.teleop) stores actions and states in degrees.
The MolmoAct adapter (so101_nexus.policy_adapters) reads obs["state"] and converts to degrees before calling select_action, then converts the policy's degree-valued action back to radians before env.step.

This matches LeRobot's use_degrees=true convention. For background on why LeRobot uses degrees, mid-range zero, and no wrap-around, see Backward compatibility.
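
A minimal sketch of converting at the boundary, using only the standard library. `step_with_degree_policy` and `nudge_policy` are hypothetical names, not the so101_nexus.policy_adapters API, and the sketch treats every element of the state as a joint angle, which holds for a JointPositions-only observation but not for states that also contain positions or quaternions.

```python
import math


def step_with_degree_policy(state_rad, policy_fn):
    """Call a degree-based policy with a radian-valued state.

    Returns the policy's action converted back to radians.
    """
    state_deg = [math.degrees(x) for x in state_rad]  # env -> policy
    action_deg = policy_fn(state_deg)                 # policy works in degrees
    return [math.radians(a) for a in action_deg]      # policy -> env


def nudge_policy(state_deg):
    """Toy policy: command the current joint angles plus 10 degrees."""
    return [angle + 10.0 for angle in state_deg]


action_rad = step_with_degree_policy([0.0, math.pi / 2], nudge_policy)
print([round(a, 4) for a in action_rad])  # [0.1745, 1.7453]
```

Keeping both conversions in one function means the env and the policy each see only their own unit.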

Inspecting Observations

import gymnasium as gym
import so101_nexus.mujoco

# State-only (default)
env = gym.make("MuJoCoPickLift-v1")
obs, info = env.reset()
print(f"Observation shape: {obs.shape}")  # (18,)
env.close()

# With camera
from so101_nexus import (
    PickConfig, EndEffectorPose, GraspState, ObjectPose, ObjectOffset, WristCamera,
)

config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
    WristCamera(width=224, height=224),
])
env = gym.make("MuJoCoPickLift-v1", config=config)
obs, info = env.reset()
print(obs["state"].shape)          # (18,): state components
print(obs["wrist_camera"].shape)   # (224, 224, 3): camera image
env.close()

Choosing the Right Setup

Use Case	`obs_mode`	Components	Why
State-based RL training	`"state"`	Default (no cameras)	Policy uses privileged state directly
Vision-based RL training	`"visual"`	JointPositions + WristCamera	Policy learns from camera images plus joint angles; no privileged object state
Multi-view vision	`"visual"`	JointPositions + WristCamera + OverheadCamera	Policy fuses multiple camera views
Debugging / visualization	`"state"`	Default + WristCamera	Full state + camera for analysis
