
Reward Optimization

We do not use any labeled (task-related) rewards during training. Reward inference (optimization) happens at test time: users can prompt with any reward function defined over the robot states, and the policy outputs the optimized skill zero-shot, without retraining. ($R$ below denotes the reward.)
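As a rough illustration of how such a prompt could be turned into a skill, here is a minimal sketch, not the exact BFM-Zero interface: the reward-weighted embedding, the `state_embedding` function, and the state field names are assumptions.

```python
import numpy as np

def infer_latent(reward_fn, states, state_embedding):
    """Zero-shot reward inference (illustrative sketch).

    reward_fn:       user-provided reward prompt, maps one state to a scalar
    states:          a batch of states sampled from the unlabeled replay buffer
    state_embedding: assumed interface mapping the batch to (N, d) features B(s)
    """
    rewards = np.array([reward_fn(s) for s in states])      # label the buffer with the prompt
    features = state_embedding(states)                      # B(s), shape (N, d)
    z = (features * rewards[:, None]).mean(axis=0)          # reward-weighted embedding
    return z / (np.linalg.norm(z) + 1e-8)                   # keep the latent on the unit sphere

# Example prompt matching the "stand still" reward below:
def stand_still_reward(state):
    return float(np.exp(-10.0 * (state["head_height"] - 1.2) ** 2
                        - 10.0 * np.linalg.norm(state["base_vel"]) ** 2))

# z = infer_latent(stand_still_reward, buffer_states, state_embedding)
# Actions then come from the frozen latent-conditioned policy pi(a | s, z), with no retraining.
```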

Basic Locomotion

We enable the robot to perform basic locomotion tasks, including standing still, walking forward/backward/sideways, and turning left/right.

Maintains stable standing posture without movement.

$$R = (\mathrm{head\_height} = 1.2\mathrm{m}) \wedge (\mathrm{base\_vel} = 0\mathrm{m/s})$$

Forward walking at 0.7m/s

$$R = (\mathrm{head\_height} = 1.2\mathrm{m}) \wedge (\mathrm{base\_vel\_forward} = 0.7\mathrm{m/s})$$

Sideways movement to the left at 0.3m/s

$$R = (\mathrm{head\_height} = 1.2\mathrm{m}) \wedge (\mathrm{base\_vel\_left} = 0.3\mathrm{m/s})$$

Backward walking at 0.3m/s

$$R = (\mathrm{head\_height} = 1.2\mathrm{m}) \wedge (\mathrm{base\_vel\_backward} = 0.3\mathrm{m/s})$$

Sideways movement to the right at 0.3m/s

$$R = (\mathrm{head\_height} = 1.2\mathrm{m}) \wedge (\mathrm{base\_vel\_right} = 0.3\mathrm{m/s})$$

Anticlockwise turning at 5.0 rad/s

$$R = (\mathrm{base\_height} > 0.5\mathrm{m}) \wedge (\mathrm{base\_ang\_vel\_z} = 5.0\mathrm{rad/s})$$

Clockwise turning at 5.0 rad/s

$$R = (\mathrm{base\_height} > 0.5\mathrm{m}) \wedge (\mathrm{base\_ang\_vel\_z} = -5.0\mathrm{rad/s})$$
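In practice, the equality and threshold predicates above would be expressed as scalar reward functions of the state. Below is a minimal sketch of how such prompts could be written; the kernels and tolerance widths are assumptions for illustration, not the exact functions used here.

```python
import numpy as np

def near(value, target, tol=0.1):
    """Smooth stand-in for the predicate value = target (Gaussian tolerance kernel)."""
    return float(np.exp(-0.5 * ((value - target) / tol) ** 2))

def above(value, threshold, scale=0.05):
    """Smooth stand-in for the predicate value > threshold (sigmoid)."""
    return float(1.0 / (1.0 + np.exp(-(value - threshold) / scale)))

# R = (head_height = 1.2 m) AND (base_vel_forward = 0.7 m/s)
def walk_forward_reward(state):
    return near(state["head_height"], 1.2) * near(state["base_vel_forward"], 0.7)

# R = (base_height > 0.5 m) AND (base_ang_vel_z = 5.0 rad/s)
def turn_anticlockwise_reward(state):
    return above(state["base_height"], 0.5) * near(state["base_ang_vel_z"], 5.0, tol=0.5)
```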

Arm Control

Put down the arm (low) or raise the arm (high)

$$R_{\text{low}} = 1 - \min\{|\mathrm{wrist\_height} - 0.7\mathrm{m}| - 0.1\mathrm{m},\ 1\} \approx (\mathrm{wrist\_height} \in [0.6, 0.8]\mathrm{m})$$
$$R_{\text{high}} = \min\{\mathrm{wrist\_height} - 1.0\mathrm{m},\ 1\} \approx (\mathrm{wrist\_height} > 1.0\mathrm{m})$$

The overall reward is the conjunction of a right-wrist term and a left-wrist term, and each wrist can be prompted as low or high independently: $R = R_{\text{right wrist}} \wedge R_{\text{left wrist}}$.


Base Height Control

  • Low-height Forward Motion: Base Height = 0.6m & Go Forward
  • Seated Crouch: Base Height = 0m
  • Supported Crouch: Base Height = 0.25m & Higher Knee
  • Grounded Crouch: Base Height = 0m & Higher Knee

Behavior Diversity

By sampling different sub-buffers from the replay buffer, we can find different behaviors even with the same reward function.

$$R = (\mathrm{left\_wrist\_height} \in [0.6, 0.8]\mathrm{m}) \wedge (\mathrm{right\_wrist\_height} > 1\mathrm{m})$$

Observations:

  • All five poses satisfy the reward function.
  • The lower-body postures differ somewhat.
  • The upper-body postures, especially the right arm in the last pose, differ significantly.
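One way to picture this sub-buffer sampling, as an illustrative sketch only (the sampling scheme and interfaces below are assumptions, reusing the reward-weighted embedding idea from the sketch above):

```python
import numpy as np

def diverse_latents(reward_fn, buffer_states, state_embedding,
                    n_latents=5, subset_size=10_000, seed=0):
    """Infer several latents for the same reward prompt from different sub-buffers."""
    rng = np.random.default_rng(seed)
    latents = []
    for _ in range(n_latents):
        idx = rng.choice(len(buffer_states), size=subset_size, replace=False)
        sub = [buffer_states[i] for i in idx]              # a different sub-buffer each round
        rewards = np.array([reward_fn(s) for s in sub])
        features = state_embedding(sub)                    # assumed B(s) interface, (subset_size, d)
        z = (features * rewards[:, None]).mean(axis=0)
        latents.append(z / (np.linalg.norm(z) + 1e-8))
    return latents        # same reward function, potentially different behaviors
```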

Skill Composition

Taking arm control and basic locomotion as examples, we can combine them to form new skills.

$$R = w_{\text{arm}} \cdot R_{\text{arm}} + w_{\text{loco}} \cdot R_{\text{loco}} $$

Here, $w$ is the weight of the corresponding reward term. We abbreviate low as "l" and high as "h"; for example, "arm-l-h" means the right wrist is low and the left wrist is high.
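As a sketch of such a composition (the weights are illustrative, and the `near`/`above` helpers come from the locomotion sketch above, not from the deployed prompt):

```python
# "arm-l-h" (right wrist low, left wrist high) combined with forward walking.
def composed_reward(state, w_arm=0.5, w_loco=0.5):
    r_arm = near(state["right_wrist_height"], 0.7) * above(state["left_wrist_height"], 1.0)
    r_loco = near(state["head_height"], 1.2) * near(state["base_vel_forward"], 0.7)
    return w_arm * r_arm + w_loco * r_loco
```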

Note: All demos are from one continuous video shoot with the same policy.

Note: Right/left is relative to the robot. The reward functions above are for illustration purposes; at inference time, we also include some soft constraints and regularization terms.

Natural Recovery from Large Disturbance

We demonstrate the robustness and flexibility of BFM-Zero: The policy enables the humanoid robot to recover gracefully from various disturbances, including heavy pushes, torso kicks, ground pulls, and leg kicks.

Highlight: Natural recovery after being pulled to the ground

Highlight: Emergent behavior (running) from heavy pushes

Few-shot Adaptation

We demonstrate BFM-Zero's few-shot adaptation capability: the smooth structure of our latent space enables efficient search-based optimization in simulation to discover, within a short time, a better latent than the one obtained from direct zero-shot inference.
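A minimal sketch of such a search is below: a simple random search around the zero-shot latent. The `rollout_return` interface, population size, and noise scale are assumptions, and a more sample-efficient optimizer could be substituted.

```python
import numpy as np

def adapt_latent(z0, rollout_return, n_iters=20, pop_size=32, sigma=0.2, seed=0):
    """Few-shot adaptation sketch: search the latent space around the zero-shot latent z0.

    rollout_return(z) is assumed to run the frozen policy conditioned on z in simulation
    (e.g., with the payload attached) and return a score such as single-leg-standing duration.
    """
    rng = np.random.default_rng(seed)
    best_z, best_score = z0, rollout_return(z0)
    for _ in range(n_iters):
        noise = rng.normal(scale=sigma, size=(pop_size, z0.shape[0]))
        candidates = best_z + noise
        candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)   # stay on the unit sphere
        scores = [rollout_return(z) for z in candidates]
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best_z, best_score = candidates[i], scores[i]
    return best_z
```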

Adaptation Setting

When the robot carries a 4kg payload on its torso, we can perform adaptation so that the robot maintains single-leg standing for a longer time.

Before Adaptation

Adaptation in Sim (less than 2 minutes)

After Adaptation

Space Interpolation

The structured nature of the learned space enables smooth interpolation between latent representations. We leverage spherical linear interpolation (Slerp) to generate intermediate latent vectors along the geodesic arc between the two endpoints.

$$z_{t} := \frac{\sin((1-t)\theta)}{\sin \theta}z_0 + \frac{\sin(t\theta)}{\sin \theta}z_1, \quad \theta := \arccos\left(\langle z_0, z_1 \rangle\right), \ z_0\ne z_1, t\in[0,1].$$
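A minimal implementation of this interpolation (assuming unit-norm latent vectors, so the inner product is the cosine of the angle between them):

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two unit-norm latent vectors."""
    dot = np.clip(np.dot(z0, z1), -1.0, 1.0)
    theta = np.arccos(dot)
    if np.isclose(theta, 0.0):          # nearly identical endpoints: nothing to interpolate
        return z0
    return (np.sin((1.0 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)

# Sweep from the "strafe-left" latent to the "strafe-right" latent:
# intermediate = [slerp(z_left, z_right, t) for t in np.linspace(0.0, 1.0, 11)]
```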

We can see that simple interpolation yields meaningful semantic-level changes.

$z_0: \text{strafe-left}, z_1: \text{strafe-right}$

$z_0: \text{arms-low-low}, z_1: \text{arms-low-high}$