Observations
Composable observation components, observation modes, and how to inspect them.
Every SO101-Nexus environment builds its observation from a list of observation components. Components are lightweight descriptor classes that tell the environment which data to include. State components contribute slices of a flat vector; camera components add image tensors to a dictionary observation.
Observation Components
State Components
State components produce fixed-size slices of the observation vector.
| Component | Dimensions | Description |
|---|---|---|
| JointPositions | 6 | Current angle of each robot joint |
| EndEffectorPose | 7 | TCP position (3) + quaternion orientation (4) |
| TargetOffset | 3 | Vector from gripper tip to goal position |
| GazeDirection | 3 | Unit vector from gripper toward the target object |
| GraspState | 1 | Binary grasp flag (1.0 = grasping, 0.0 = not) |
| ObjectPose | 7 | Target object position (3) + quaternion orientation (4) |
| ObjectOffset | 3 | Vector from gripper tip to target object |
| TargetPosition | 3 | Absolute goal position (x, y, z) |
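To make the "slices of a flat vector" idea concrete, here is a minimal sketch that computes where each state component would land in the vector. It assumes components are concatenated in list order, which this page does not state explicitly. `STATE_DIMS` copies the table above, and `state_layout` is a hypothetical helper, not part of so101_nexus.

```python
# Dimensions copied from the table above. Keys are plain strings here,
# not the real so101_nexus component classes.
STATE_DIMS = {
    "JointPositions": 6,
    "EndEffectorPose": 7,
    "TargetOffset": 3,
    "GazeDirection": 3,
    "GraspState": 1,
    "ObjectPose": 7,
    "ObjectOffset": 3,
    "TargetPosition": 3,
}


def state_layout(component_names):
    """Return {name: (start, stop)} index ranges, assuming list-order concatenation."""
    layout = {}
    offset = 0
    for name in component_names:
        size = STATE_DIMS[name]
        layout[name] = (offset, offset + size)
        offset += size
    return layout


layout = state_layout(["EndEffectorPose", "GraspState", "ObjectPose", "ObjectOffset"])
for name, (start, stop) in layout.items():
    print(f"{name}: state[{start}:{stop}]")
```

Under that ordering assumption, the last `stop` index equals the total state dimension, which is a quick way to sanity-check a custom component list.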
Camera Components
Camera components add image tensors to a dict-style observation space. They contribute no elements to the state vector (size = 0).
| Component | Key | Description |
|---|---|---|
| WristCamera | "wrist_camera" | RGB image from the camera mounted on the robot's wrist |
| OverheadCamera | "overhead_camera" | RGB image from a stationary camera above the workspace |
Camera components accept width and height parameters (default 640x480). WristCamera also supports domain randomization parameters for FOV and pitch.
Both backends render camera observations. The MuJoCo backend renders on CPU through per-env OpenGL renderers; the MuJoCo Warp backend renders all worlds in one batched ray-tracing pass on the simulation device, returning uint8 image tensors of shape (num_envs, height, width, 3) (with per-world wrist-camera domain randomization). The two renderers differ (OpenGL rasterizer vs ray tracer), so pixel values are not identical across backends, but observation keys, shapes, and dtypes match. Warp camera observations are independent of Gymnasium render_mode; render_mode visualization is MuJoCo-only. See Backend Support.
from so101_nexus import WristCamera, OverheadCamera
# Default resolution
wrist = WristCamera()
# Custom resolution with FOV randomization
wrist = WristCamera(width=224, height=224, fov_deg_range=(60.0, 90.0))
# Overhead camera
overhead = OverheadCamera(width=320, height=240, fov_deg=45.0)
Composing Observations
Pass a list of components via the observations parameter on any config. Each task provides sensible defaults when observations is not specified.
from so101_nexus import (
    PickConfig, JointPositions, EndEffectorPose, GraspState,
    ObjectPose, ObjectOffset, WristCamera,
)
# State-only observations (default for PickLift)
config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
])
# Add a wrist camera to the observation
config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
    WristCamera(width=224, height=224),
])
When the observation list contains only state components, the observation is a flat NumPy array. When it contains one or more camera components, the observation becomes a dictionary with a "state" key (flat vector from all state components) plus one key per camera (e.g. "wrist_camera", "overhead_camera").
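The flat-versus-dictionary rule can be sketched in a few lines. This is an illustration only: `assemble_observation` is a hypothetical helper, plain Python lists stand in for the NumPy arrays the environment returns, and concatenation in list order is an assumption.

```python
def assemble_observation(state_parts, camera_frames):
    """Hypothetical helper mirroring the rule described above.

    state_parts: list of per-component float lists (stand-ins for NumPy slices).
    camera_frames: dict mapping a camera key to its image (any object here).
    """
    # Flatten all state slices into one vector, in list order (an assumption).
    state = [value for part in state_parts for value in part]
    if not camera_frames:
        # State-only: the observation is just the flat vector.
        return state
    # With cameras: a dict with "state" plus one entry per camera key.
    return {"state": state, **camera_frames}


grasp = [1.0]
offset = [0.1, 0.0, -0.05]

flat_obs = assemble_observation([grasp, offset], {})
dict_obs = assemble_observation([grasp, offset], {"wrist_camera": "<image placeholder>"})

print(flat_obs)
print(sorted(dict_obs))
```

Policy code that should work with both layouts can branch on `isinstance(obs, dict)` and read `obs["state"]` only in the dictionary case.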
Default Observations by Task
Each task config populates observations automatically when you don't provide a list:
| Task | Default Components | State Dimensions |
|---|---|---|
| PickLift | EndEffectorPose, GraspState, ObjectPose, ObjectOffset | 18 |
| PickAndPlace | EndEffectorPose, GraspState, TargetPosition, ObjectPose, ObjectOffset, TargetOffset | 24 |
| Touch | EndEffectorPose, GraspState, ObjectPose, ObjectOffset | 18 |
| LookAt | JointPositions | 6 |
| Move | JointPositions | 6 |
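The State Dimensions column is the sum of the component sizes in each row. The following sketch reproduces that arithmetic. The sizes and default lists are copied from the tables on this page, while the dictionary names are illustrative and not part of so101_nexus.

```python
# Per-component sizes, taken from the State Components table.
COMPONENT_DIMS = {
    "JointPositions": 6,
    "EndEffectorPose": 7,
    "GraspState": 1,
    "ObjectPose": 7,
    "ObjectOffset": 3,
    "TargetPosition": 3,
    "TargetOffset": 3,
}

# Default component lists, taken from the table above.
TASK_DEFAULTS = {
    "PickLift": ["EndEffectorPose", "GraspState", "ObjectPose", "ObjectOffset"],
    "PickAndPlace": [
        "EndEffectorPose", "GraspState", "TargetPosition",
        "ObjectPose", "ObjectOffset", "TargetOffset",
    ],
    "Touch": ["EndEffectorPose", "GraspState", "ObjectPose", "ObjectOffset"],
    "LookAt": ["JointPositions"],
    "Move": ["JointPositions"],
}

# Total state dimension per task is the sum of its component sizes.
state_dims = {
    task: sum(COMPONENT_DIMS[name] for name in names)
    for task, names in TASK_DEFAULTS.items()
}
print(state_dims)
```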
Observation Modes
The obs_mode config parameter controls the semantic intent of the observation:
obs_mode="state" (default)
The observation contains whatever components are listed in observations. This is useful for state-based reinforcement learning where the policy has access to ground-truth information.
obs_mode="visual"
Designed for vision-based policies. Requires at least one camera component (e.g. WristCamera() or OverheadCamera()) in the observations list. Construction raises an error if no camera component is present.
from so101_nexus import PickConfig, JointPositions, WristCamera
config = PickConfig(
    obs_mode="visual",
    observations=[JointPositions(), WristCamera(width=224, height=224)],
)
Units: Degrees vs Radians
Configuration APIs (REST_POSE, *_deg fields, spawn ranges, leader-arm helpers) use degrees, per the library convention. Runtime values exposed by the env (obs["state"], env.step(action), env.action_space.low/high) use radians, the native unit of the underlying MuJoCo engine (and of the MuJoCo Warp backend) and the unit the integrator consumes when writing ctrl. Converting on every step would add runtime work without changing the simulation result.
Code that crosses into LeRobot territory converts at the boundary:
- The teleop recorder (so101_nexus.teleop) stores actions and states in degrees.
- The MolmoAct adapter (so101_nexus.policy_adapters) reads obs["state"] and converts to degrees before calling select_action, then converts the policy's degree-valued action back to radians before env.step.
This matches LeRobot's use_degrees=true convention. For background on why LeRobot uses degrees, mid-range zero, and no wrap-around, see Backward compatibility.
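The boundary conversion described above can be sketched as follows. This is not the actual adapter code: `step_with_degree_policy` and `toy_policy` are hypothetical names, and the sketch assumes the state it receives holds joint angles only.

```python
import math


def radians_to_degrees(values):
    """Convert a list of angles from radians to degrees."""
    return [math.degrees(v) for v in values]


def degrees_to_radians(values):
    """Convert a list of angles from degrees to radians."""
    return [math.radians(v) for v in values]


def step_with_degree_policy(joint_state_rad, select_action):
    """Hypothetical boundary wrapper.

    joint_state_rad: joint angles in radians, as exposed by the env.
    select_action: a callable that takes and returns degree-valued lists.
    Returns the action in radians, ready to pass to env.step.
    """
    state_deg = radians_to_degrees(joint_state_rad)
    action_deg = select_action(state_deg)
    return degrees_to_radians(action_deg)


def toy_policy(state_deg):
    """Toy policy: ask every joint to move 10 degrees further."""
    return [angle + 10.0 for angle in state_deg]


action_rad = step_with_degree_policy([0.0, math.pi / 2], toy_policy)
print(action_rad)
```

Keeping both conversions in one wrapper means the environment and the policy each see only their own unit, and a missed conversion shows up in a single place.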
Inspecting Observations
import gymnasium as gym
import so101_nexus.mujoco
# State-only (default)
env = gym.make("MuJoCoPickLift-v1")
obs, info = env.reset()
print(f"Observation shape: {obs.shape}") # (18,)
env.close()
# With camera
from so101_nexus import (
    PickConfig, EndEffectorPose, GraspState, ObjectPose, ObjectOffset, WristCamera,
)
config = PickConfig(observations=[
    EndEffectorPose(),
    GraspState(),
    ObjectPose(),
    ObjectOffset(),
    WristCamera(width=224, height=224),
])
env = gym.make("MuJoCoPickLift-v1", config=config)
obs, info = env.reset()
print(obs["state"].shape)  # (18,): state components
print(obs["wrist_camera"].shape)  # (224, 224, 3): camera image
env.close()
Choosing the Right Setup
| Use Case | obs_mode | Components | Why |
|---|---|---|---|
| State-based RL training | "state" | Default (no cameras) | Policy uses privileged state directly |
| Vision-based RL training | "visual" | JointPositions + WristCamera | Policy learns from camera images plus joint angles; no privileged object state |
| Multi-view vision | "visual" | JointPositions + WristCamera + OverheadCamera | Policy fuses multiple camera views |
| Debugging / visualization | "state" | Default + WristCamera | Full state + camera for analysis |