Observations

Composable observation components, observation modes, and how to inspect them.

Every SO101-Nexus environment builds its observation from a list of observation components. Components are lightweight descriptor classes that tell the environment which data to include. State components contribute slices of a flat vector; camera components add image tensors to a dictionary observation.

Observation Components

State Components

State components produce fixed-size slices of the observation vector.

| Component | Dimensions | Description |
|---|---|---|
| JointPositions | 6 | Current angle of each robot joint |
| EndEffectorPose | 7 | TCP position (3) + quaternion orientation (4) |
| TargetOffset | 3 | Vector from gripper tip to goal position |
| GazeDirection | 3 | Unit vector from gripper toward the target object |
| GraspState | 1 | Binary grasp flag (1.0 = grasping, 0.0 = not) |
| ObjectPose | 7 | Target object position (3) + quaternion orientation (4) |
| ObjectOffset | 3 | Vector from gripper tip to target object |
| TargetPosition | 3 | Absolute goal position (x, y, z) |
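
To make the "fixed-size slices" idea concrete, the following sketch computes where each component would land in the flat state vector. It does not use the library: STATE_DIMS is copied by hand from the table above, state_layout is a hypothetical helper, and the sketch assumes that slices are concatenated in the order the components are listed, which the text above does not state explicitly.

```python
# Dimensions copied by hand from the State Components table.
STATE_DIMS = {
    "JointPositions": 6,
    "EndEffectorPose": 7,
    "TargetOffset": 3,
    "GazeDirection": 3,
    "GraspState": 1,
    "ObjectPose": 7,
    "ObjectOffset": 3,
    "TargetPosition": 3,
}


def state_layout(component_names):
    """Map each component name to its slice of the flat state vector.

    Assumption: slices are concatenated in list order.
    Returns (layout, total_size).
    """
    layout = {}
    offset = 0
    for name in component_names:
        size = STATE_DIMS[name]
        layout[name] = slice(offset, offset + size)
        offset += size
    return layout, offset


layout, total = state_layout(
    ["EndEffectorPose", "GraspState", "ObjectPose", "ObjectOffset"]
)
print(total)                 # 18
print(layout["ObjectPose"])  # slice(8, 15, None)
```

Under the same ordering assumption, state[layout["ObjectPose"]] would pull the seven object-pose values out of an 18-element state vector.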

Camera Components

Camera components add image tensors to a dict-style observation space. They contribute nothing to the state vector (size = 0).

| Component | Key | Description |
|---|---|---|
| WristCamera | "wrist_camera" | RGB image from the camera mounted on the robot's wrist |
| OverheadCamera | "overhead_camera" | RGB image from a stationary camera above the workspace |

Camera components accept width and height parameters (default 640x480). WristCamera also supports domain randomization parameters for FOV and pitch.

Both backends render camera observations. The MuJoCo backend renders on the CPU with one OpenGL renderer per environment. The MuJoCo Warp backend renders all worlds in a single batched ray-tracing pass on the simulation device and returns uint8 image tensors of shape (num_envs, height, width, 3), with wrist-camera domain randomization applied per world. Because the two renderers differ (OpenGL rasterizer vs. ray tracer), pixel values are not identical across backends, but observation keys, shapes, and dtypes match. Warp camera observations do not depend on the Gymnasium render_mode; render_mode visualization is available only on the MuJoCo backend. See Backend Support.

from so101_nexus import WristCamera, OverheadCamera

# Default resolution
wrist = WristCamera()

# Custom resolution with FOV randomization
wrist = WristCamera(width=224, height=224, fov_deg_range=(60.0, 90.0))

# Overhead camera
overhead = OverheadCamera(width=320, height=240, fov_deg=45.0)

Composing Observations

Pass a list of components via the observations parameter on any config. Each task provides sensible defaults when observations is not specified.

from so101_nexus import (
    PickConfig, EndEffectorPose, GraspState,
    ObjectPose, ObjectOffset, WristCamera,
)

# State-only observations (default for PickLift)
config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
])

# Add a wrist camera to the observation
config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
    WristCamera(width=224, height=224),
])

When the observation list contains only state components, the observation is a flat NumPy array. When it contains one or more camera components, the observation becomes a dictionary with a "state" key (flat vector from all state components) plus one key per camera (e.g. "wrist_camera", "overhead_camera").
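
Code that has to work with both layouts can normalize them first. The helper below is a minimal sketch, not part of the library: split_observation is a hypothetical name, and it relies only on the rule described above (a flat array for state-only observations, a dictionary with a "state" key plus one key per camera otherwise). Plain lists and strings stand in for the real arrays.

```python
def split_observation(obs):
    """Return (state, images) for either observation layout.

    - State-only: obs is the flat vector itself; images is empty.
    - With cameras: obs is a dict holding "state" plus one key per camera.
    """
    if isinstance(obs, dict):
        images = {key: value for key, value in obs.items() if key != "state"}
        return obs["state"], images
    return obs, {}


# Stand-in values; a real environment would return NumPy arrays.
flat_obs = [0.0] * 18
dict_obs = {"state": [0.0] * 18, "wrist_camera": "<image placeholder>"}

state, images = split_observation(flat_obs)
print(len(state), sorted(images))  # 18 []

state, images = split_observation(dict_obs)
print(len(state), sorted(images))  # 18 ['wrist_camera']
```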

Default Observations by Task

Each task config auto-populates observations if you don't provide one:

| Task | Default Components | State Dimensions |
|---|---|---|
| PickLift | EndEffectorPose, GraspState, ObjectPose, ObjectOffset | 18 |
| PickAndPlace | EndEffectorPose, GraspState, TargetPosition, ObjectPose, ObjectOffset, TargetOffset | 24 |
| Touch | EndEffectorPose, GraspState, ObjectPose, ObjectOffset | 18 |
| LookAt | JointPositions | 6 |
| Move | JointPositions | 6 |

Observation Modes

The obs_mode config parameter declares the intended use of the observation:

obs_mode="state" (default)

The observation contains whatever components are listed in observations. This is useful for state-based reinforcement learning where the policy has access to ground-truth information.

obs_mode="visual"

Designed for vision-based policies. Requires at least one camera component (e.g. WristCamera() or OverheadCamera()) in the observations list. Construction raises an error if no camera component is present.

from so101_nexus import PickConfig, JointPositions, WristCamera

config = PickConfig(
    obs_mode="visual",
    observations=[JointPositions(), WristCamera(width=224, height=224)],
)
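
The rule for obs_mode="visual" can be summarized as a small check. This is an illustrative sketch only, not the library's implementation: check_obs_mode and CAMERA_COMPONENTS are hypothetical names, components are represented by their class names as strings, and the use of ValueError is an assumption, since the text only says that construction raises an error.

```python
# Hypothetical stand-in: camera components identified by class name.
CAMERA_COMPONENTS = {"WristCamera", "OverheadCamera"}


def check_obs_mode(obs_mode, component_names):
    """Raise if obs_mode="visual" is requested without any camera component.

    Illustrative only; the real check and exception type may differ.
    """
    has_camera = any(name in CAMERA_COMPONENTS for name in component_names)
    if obs_mode == "visual" and not has_camera:
        raise ValueError(
            "obs_mode='visual' requires at least one camera component"
        )


check_obs_mode("visual", ["JointPositions", "WristCamera"])  # passes
check_obs_mode("state", ["JointPositions"])                   # passes

try:
    check_obs_mode("visual", ["JointPositions"])
except ValueError as err:
    print(err)
```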

Units: Degrees vs Radians

Configuration APIs (REST_POSE, *_deg fields, spawn ranges, leader-arm helpers) use degrees, following the library convention. Runtime values exposed by the env (obs["state"], env.step(action), env.action_space.low/high) use radians, which is the native unit of the underlying MuJoCo engine (and of the MuJoCo Warp backend) and the unit the integrator consumes when writing ctrl. Converting on every step would add runtime work without changing the simulation result.

Code that crosses into LeRobot territory converts at the boundary:

  • The teleop recorder (so101_nexus.teleop) stores actions and states in degrees.
  • The MolmoAct adapter (so101_nexus.policy_adapters) reads obs["state"] and converts to degrees before calling select_action, then converts the policy's degree-valued action back to radians before env.step.

This matches LeRobot's use_degrees=true convention. For background on why LeRobot uses degrees, mid-range zero, and no wrap-around, see Backward compatibility.
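
The boundary conversion described above can be sketched in a few lines. This is not the adapter's actual code: step_with_degree_policy and the select_action callable are hypothetical stand-ins, and the sketch assumes the state vector holds joint angles only (for example, a JointPositions-only observation), since other components such as positions or quaternions must not be converted.

```python
import math


def radians_to_degrees(values):
    """Convert a sequence of angles from radians to degrees."""
    return [math.degrees(v) for v in values]


def degrees_to_radians(values):
    """Convert a sequence of angles from degrees to radians."""
    return [math.radians(v) for v in values]


def step_with_degree_policy(joint_state_rad, select_action):
    """Convert at the boundary: radians from the env, degrees for the policy.

    select_action is a hypothetical callable that takes and returns
    degree-valued joint angles. The returned action is in radians and
    could be passed to env.step.
    """
    state_deg = radians_to_degrees(joint_state_rad)
    action_deg = select_action(state_deg)
    return degrees_to_radians(action_deg)


# Toy policy: ask every joint to move 10 degrees from its current angle.
action_rad = step_with_degree_policy(
    [0.0, math.pi / 2],
    lambda state_deg: [angle + 10.0 for angle in state_deg],
)
print([round(math.degrees(a), 6) for a in action_rad])  # [10.0, 100.0]
```

Keeping both conversions in one wrapper means the environment and the policy each see only their own unit.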

Inspecting Observations

import gymnasium as gym
import so101_nexus.mujoco

# State-only (default)
env = gym.make("MuJoCoPickLift-v1")
obs, info = env.reset()
print(f"Observation shape: {obs.shape}")  # (18,)
env.close()

# With camera
from so101_nexus import (
    PickConfig, EndEffectorPose, GraspState, ObjectPose, ObjectOffset, WristCamera,
)

config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
    WristCamera(width=224, height=224),
])
env = gym.make("MuJoCoPickLift-v1", config=config)
obs, info = env.reset()
print(obs["state"].shape)          # (18,): state components
print(obs["wrist_camera"].shape)   # (224, 224, 3): camera image
env.close()

Choosing the Right Setup

| Use Case | obs_mode | Components | Why |
|---|---|---|---|
| State-based RL training | "state" | Default (no cameras) | Policy uses privileged state directly |
| Vision-based RL training | "visual" | JointPositions + WristCamera | Policy learns from camera images; no ground-truth state |
| Multi-view vision | "visual" | JointPositions + WristCamera + OverheadCamera | Policy fuses multiple camera views |
| Debugging / visualization | "state" | Default + WristCamera | Full state + camera for analysis |
