Universal Visual Decomposer:
Long-Horizon Manipulation Made Easy
Finalist for the Best Paper Award in Robot Vision, ICRA 2024
Zichen Zhang1, Yunshuang Li2, Osbert Bastani2, Abhishek Gupta3, Dinesh Jayaraman2, Yecheng Jason Ma2‡, Luca Weihs1‡
Abstract
Real-world robotic tasks stretch over extended horizons and encompass multiple stages. Learning long-horizon manipulation tasks, however, is a long-standing challenge, and demands decomposing the overarching task into several manageable subtasks to facilitate policy learning and generalization to unseen tasks. Prior task decomposition methods require task-specific knowledge, are computationally intensive, and cannot readily be applied to new tasks. To address these shortcomings, we propose Universal Visual Decomposer (UVD), an off-the-shelf task decomposition method for visual long-horizon manipulation using pre-trained visual representations designed for robotic control. At a high level, UVD discovers subgoals by detecting phase shifts in the embedding space of the pre-trained representation. Operating purely on visual demonstrations without auxiliary information, UVD can effectively extract visual subgoals embedded in the videos, while incurring zero additional training cost on top of standard visuomotor policy training. Goal-conditioned policies learned with UVD-discovered subgoals exhibit significantly improved compositional generalization at test time to unseen tasks. Furthermore, UVD-discovered subgoals can be used to construct goal-based reward shaping that jump-starts temporally extended exploration for reinforcement learning. We extensively evaluate UVD on both simulation and real-world tasks, and in all cases, UVD substantially outperforms baselines across imitation and reinforcement learning settings on in-domain and out-of-domain task sequences alike, validating the clear advantage of automated visual task decomposition within the simple, compact UVD framework.
Try our UVD decomposition demo hosted with Gradio below! Note: due to limited memory, only the VIP preprocessor is supported for now. If the demo is down, please contact the authors.
Methods
Our goal is to derive a general-purpose subgoal decomposition method that operates purely from raw visual inputs on a per-trajectory basis. The key intuition behind UVD is that, conditioned on a goal frame, the frames immediately preceding it must visually approach that goal; once we discover the first frame of this goal-reaching segment, the frame that precedes it is taken as the next subgoal. By conditioning on this new subgoal, we can apply the procedure recursively until the full sequence is exhausted. We show the low-level and high-level pseudocode together with a visualization of the recursive decomposition process below.
UVD low-level pseudocode in Python
from typing import Callable

import numpy as np
import torch
from numpy.linalg import norm
from scipy.signal import argrelextrema


def UVD(
    embeddings: np.ndarray | torch.Tensor,  # (L, d) frame embeddings of one demo
    smooth_fn: Callable[[np.ndarray], np.ndarray],
    min_interval: int = 15,
) -> list[int]:
    # work on a numpy copy (also handles CPU torch tensors)
    embeddings = np.array(embeddings, dtype=np.float64)
    # the last frame is always the last subgoal
    cur_goal_idx = len(embeddings) - 1
    # subgoal indices (timesteps), collected in reverse chronological order
    goal_indices = [cur_goal_idx]
    cur_emb = embeddings  # (L, d), truncated after every discovered subgoal
    while cur_goal_idx > min_interval:
        # smoothed curve of embedding distances to the current goal frame
        d = norm(cur_emb - cur_emb[-1], axis=-1)
        d = smooth_fn(d)
        # monotonicity breaks, i.e. local maxima of the distance curve
        extremas = argrelextrema(d, np.greater)[0]
        extremas = [
            e for e in extremas
            if cur_goal_idx - e > min_interval
        ]
        if extremas:
            # the frame preceding the last break is the previous subgoal, Eq. (3)
            cur_goal_idx = extremas[-1] - 1
            goal_indices.append(cur_goal_idx)
            cur_emb = embeddings[:cur_goal_idx + 1]
        else:
            break
    # return subgoal timesteps in chronological order
    return goal_indices[::-1]
Formally, the subgoal update (Eq. 3) selects, for the current goal frame $o_t$, the latest preceding frame at which the embedding distance to $o_t$ stops decreasing monotonically: $o_{t-n-1} := \arg \max_{o_h} \{\, d_\phi(o_h; o_t) < d_\phi(o_{h+1}; o_t),\ h < t \,\}$. The recursion then continues with $t \leftarrow t - n - 1$ until the beginning of the trajectory is reached.
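As a concrete reference, the sketch below shows how the decomposer above could be invoked on a single demonstration. The `frozen_encoder` callable, the `decompose_video` wrapper, and the Savitzky-Golay smoothing settings are illustrative assumptions rather than the exact choices used in our experiments; any frozen, control-aware visual backbone (e.g., VIP or R3M) and any reasonable smoothing function can be substituted.

```python
# Minimal usage sketch (assumptions: `frozen_encoder` maps an (L, H, W, 3) frame
# array to (L, d) features; Savitzky-Golay smoothing with these settings is one
# reasonable choice of `smooth_fn`, not necessarily the paper's exact setting).
from functools import partial

import numpy as np
from scipy.signal import savgol_filter


def decompose_video(frames: np.ndarray, frozen_encoder) -> list[int]:
    """Return chronological subgoal timesteps for one raw video demonstration."""
    embeddings = frozen_encoder(frames)  # (L, d) frozen features, no training
    smooth_fn = partial(savgol_filter, window_length=9, polyorder=3)
    return UVD(embeddings, smooth_fn, min_interval=15)


# subgoal_indices = decompose_video(frames, frozen_encoder)
# subgoal_frames = frames[subgoal_indices]  # RGB subgoals used for conditioning
```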
Visualization of UVD recursive decomposition
UVD in the wild
UVD is not limited to robotic settings—it's also highly effective in household scenarios on human videos. Here are some examples of how UVD can decompose subgoals in the wild:
Open a cabinet and rearrange
Open a drawer and charge
Unlock a computer
Wash hands in bathroom
Activities in kitchen
Experiments
Simulation Results
In-domain and out-of-domain IL results on FrankaKitchen. We report the mean and standard deviation of the success rate (full-stage completion) and the completion percentage (out of 4 stages) for GCBC policies trained on top of diverse existing pretrained visual representations, each over three seeds. Highlighted scores indicate improvements in out-of-domain evaluations and in-domain gains exceeding 0.01.
| Representation | Method | InD success | InD completion | OoD success | OoD completion |
|---|---|---|---|---|---|
| VIP (ResNet 50) | GCBC | 0.736 (0.011) | 0.898 (0.006) | 0.035 (0.014) | 0.236 (0.057) |
| | GCBC + Ours | 0.737 (0.012) | 0.903 (0.009) | 0.188 (0.024) | 0.566 (0.020) |
| R3M (ResNet 50) | GCBC | 0.742 (0.026) | 0.856 (0.006) | 0.014 (0.007) | 0.223 (0.029) |
| | GCBC + Ours | 0.738 (0.024) | 0.879 (0.000) | 0.084 (0.045) | 0.427 (0.002) |
| LIV (ResNet 50) | GCBC | 0.608 (0.068) | 0.816 (0.046) | 0.008 (0.008) | 0.116 (0.082) |
| | GCBC + Ours | 0.649 (0.013) | 0.868 (0.007) | 0.066 (0.025) | 0.496 (0.033) |
| CLIP (ResNet 50) | GCBC | 0.391 (0.017) | 0.692 (0.008) | 0.005 (0.001) | 0.119 (0.017) |
| | GCBC + Ours | 0.394 (0.036) | 0.701 (0.012) | 0.073 (0.003) | 0.403 (0.01) |
| DINO-v2 (ViT-Large) | GCBC | 0.329 (0.025) | 0.654 (0.019) | 0.012 (0.01) | 0.261 (0.213) |
| | GCBC + Ours | 0.322 (0.053) | 0.669 (0.037) | 0.055 (0.025) | 0.446 (0.034) |
| VIP (ResNet 50) | GCBC-GPT | 0.702 (0.029) | 0.841 (0.02) | 0.039 (0.027) | 0.302 (0.028) |
| | GCBC-GPT + Ours | 0.708 (0.056) | 0.897 (0.024) | 0.213 (0.054) | 0.600 (0.038) |
Next, we visualize the qualitative results for one of the tasks in FrankaKitchen: open the microwave, turn on the bottom burner, toggle the light switch, and slide the cabinet. We compare the decomposition results across different frozen visual backbones, together with 3D t-SNE visualizations (points colored by subgoal segment). Representations pretrained with temporal objectives, such as VIP and R3M, yield smoother, more continuous, and more monotone clusters in feature space than the others, whereas the ResNet trained for supervised classification on ImageNet-1k produces the sparsest embeddings.
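For readers who want to reproduce this kind of visualization, the hedged sketch below projects the per-frame frozen embeddings to 3D with t-SNE and labels each frame by the UVD subgoal segment it belongs to. The inputs are assumed to come from the decomposition code above, and the t-SNE hyperparameters are illustrative rather than the settings used for our figures.

```python
# Hedged visualization sketch: 3D t-SNE of per-frame embeddings, colored by the
# UVD segment each frame belongs to (hyperparameters are illustrative).
import numpy as np
from sklearn.manifold import TSNE


def tsne_by_subgoal(embeddings: np.ndarray, subgoal_indices: list[int]):
    """Return 3D t-SNE coordinates and a per-frame subgoal-segment label."""
    coords = TSNE(n_components=3, perplexity=30, init="pca").fit_transform(embeddings)
    # label each frame with the index of the first subgoal at or after it
    labels = np.searchsorted(np.asarray(subgoal_indices), np.arange(len(embeddings)))
    return coords, labels
```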
UVD Decomposition Results in Simulation
GCBC ❌
GCBC + UVD
GCRL ❌
Real Robot Results
In-domain evaluation. For real-world evaluation, we test UVD on three multistage tasks: placing an apple in an oven and closing the oven ($\texttt{Apple-in-Oven}$), pouring fries and then placing on a rack ($\texttt{Fries-and-Rack}$), and folding a cloth ($\texttt{Fold-Cloth}$). The corresponding videos show how UVD breaks these tasks down into semantically meaningful subgoals, along with two successful and one failed rollout for each task. All real-robot videos are played at 2x speed.
$\texttt{Apple-in-Oven}$
UVD Decomposition Results
❌
$\texttt{Fries-and-Rack}$
UVD Decomposition Results
❌
$\texttt{Fold-Cloth}$
UVD Decomposition Results
❌
Compositional Generalization. We evaluate UVD's ability to generalize compositionally by introducing unseen initial states for these tasks. While methods like GCBC fail (first row) under these circumstances, GCBC + UVD (second row) successfully adapts.
GCBC ❌
GCBC + UVD
Robustness with Human Involvement. We further demonstrate that UVD can recover from, or continue the task despite, human interference. In the $\texttt{Apple-in-Oven}$ and $\texttt{Fries-and-Rack}$ tasks, a human either resets the scene by returning the apple to its initial position or accomplishes an intermediate step on the robot's behalf. Our method remains robust in both cases.
Reset the apple to initial position
Accomplish intermediate step "pushing"
Accomplish intermediate step "pouring"
Implementation Details
Model: To underscore that UVD serves as an off-the-shelf method applicable to different policy classes, we ablate with a Multilayer Perceptron (MLP)-based single-step policy and a GPT-like causal transformer policy.
The MLP ingests the frozen visual embeddings of the step-wise RGB observation and the goal image, passed through a 1D BatchNorm, together with the 9D proprioceptive data encoded by a single linear layer followed by a LayerNorm.
Our GPT policy removes the BatchNorm and replaces the MLP trunk with causal self-attention blocks of 8 layers, 8 heads, and an embedding dimension of 768. We use an attention dropout rate of 0.1 and a context length of 10.
We replace the conventional LayerNorm with Root Mean Square Layer Normalization (RMSNorm) and enhance the transformer with rotary position embeddings (RoPE). Actions are predicted via a linear head. At inference time, we cache the keys and values of the self-attention at every step, so decoding does not become a bottleneck as the context length grows. Nevertheless, on the FrankaKitchen tasks we observed that longer context lengths tend to overfit and degrade performance, so we consistently use a context length of 10 for all experiments. More details can be found in the appendix.
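To make the MLP variant concrete, the sketch below wires up a single-step goal-conditioned policy consistent with the description above and the hyperparameter tables that follow. The layer wiring, `visual_dim`, and `action_dim` are assumptions for illustration (parameter counts will not necessarily match the table), not the released implementation.

```python
import torch
import torch.nn as nn


class MLPGoalPolicy(nn.Module):
    """Hedged sketch of the single-step MLP policy described above.

    `visual_dim` is the frozen backbone's feature size (e.g., 2048 for a
    ResNet-50 based VIP); `action_dim` depends on the robot.
    """

    def __init__(self, visual_dim: int, proprio_dim: int = 9, action_dim: int = 9):
        super().__init__()
        # observation and goal embeddings are concatenated, then normalized
        self.visual_norm = nn.BatchNorm1d(2 * visual_dim)
        # 9D proprioception -> single linear layer + LayerNorm + Tanh
        self.proprio_enc = nn.Sequential(
            nn.Linear(proprio_dim, 512), nn.LayerNorm(512), nn.Tanh()
        )
        # MLP trunk with hidden dims [1024, 512, 256], ReLU, Tanh action head
        self.trunk = nn.Sequential(
            nn.Linear(2 * visual_dim + 512, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, obs_emb, goal_emb, proprio):
        x = self.visual_norm(torch.cat([obs_emb, goal_emb], dim=-1))
        return self.trunk(torch.cat([x, self.proprio_enc(proprio)], dim=-1))
```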
| Hyperparameter | MLP-Policy | GPT-Policy |
|---|---|---|
| Optimizer | AdamW | AdamW |
| Learning Rate | 3e-4 | 3e-4 |
| LR Schedule | cos decay | cos decay |
| Warmup Steps | 0 | 1000 |
| Decay Steps | 150k | 200k |
| Weight Decay | 0.01 | 0.1 |
| Betas | [0.9, 0.999] | [0.9, 0.99] |
| Max Gradient Norm | 1.0 | 1.0 |
| Batch Size | 512 | 128 |
IL training hyperparameters
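The sketch below instantiates the MLP-policy column of the table above as a PyTorch training setup (AdamW, lr 3e-4, betas (0.9, 0.999), weight decay 0.01, cosine decay over 150k steps, no warmup, gradient norm clipped to 1.0). The scheduler class, the MSE behavior-cloning loss, and the loop structure are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn as nn

policy = nn.Linear(4096 + 512, 9)  # placeholder for the actual MLP policy
optimizer = torch.optim.AdamW(
    policy.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.01
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150_000)


def training_step(batch_inputs, batch_actions):
    # one imitation-learning update: regress demo actions, clip gradients
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(policy(batch_inputs), batch_actions)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```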
| Hyperparameter | Value |
|---|---|
| Hidden Dim. | [1024, 512, 256] |
| Activation | ReLU |
| Proprio. Hidden dim. | 512 |
| Proprio. Activation | Tanh |
| Visual Norm. | Batchnorm1d |
| Proprio. Norm. | LayerNorm |
| Action Activation | Tanh |
| Trainable Parameters | 3.3M |
MLP policy hyperparameters
| Hyperparameter | Value |
|---|---|
| Context Length | 10 |
| Embedding Dim. | 768 |
| Layers | 8 |
| Heads | 8 |
| Embedding Dropout | 0.0 |
| Attention Dropout | 0.1 |
| Normalization | RMSNorm |
| Action Activation | Tanh |
| Trainable Parameters | 58.6M |
GPT policy hyperparameters
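Since RMSNorm may be less familiar than LayerNorm, here is a minimal sketch of the normalization used in the GPT policy. The epsilon value and the learnable per-channel scale are standard choices, assumed rather than taken from our implementation.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (sketch): rescale by the RMS of the
    feature vector instead of subtracting the mean as in LayerNorm."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms
```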
Reinforcement Learning: All RL experiments are trained with the Proximal Policy Optimization (PPO) algorithm implemented in the AllenAct RL training framework. In our RL setting, the configurations for training and inference are identical, analogous to the IL inference detailed in Appendix A2. Specifically, the task is again specified by an unlabeled video trajectory $\tau$. Given the initial observation $o_0$ and UVD subgoal $g_0\in\tau_{goal}$, the agent continuously predicts and executes actions conditioned on subgoal $g_0$ using the online policy with frozen visual encoder $\phi$, until the condition $d_\phi(o_t;g_0) < \epsilon$ is satisfied for some timestep $t$ and positive threshold $\epsilon$.

As shown in Eq. 4, we provide progressive rewards defined as the difference of goal-embedding distances with respect to UVD subgoals. Because the distance between consecutive subgoals can vary, we employ the normalized distance function $\bar{d}_\phi(o_t;g_i) := d_\phi(o_t;g_i) / d_\phi(g_{i-1}; g_i)$, which ensures that $\bar{d}_\phi(o_{t-h};g_i) \approx 1$ at the timestep $t-h$ when the subgoal transitioned from $g_{i-1}$ to $g_i$. Additionally, we provide modest discrete rewards to encourage (chronological) subgoal transitions, and a larger terminal reward for the full completion of all task sub-stages, i.e., when the embedding distance between the observation and the final subgoal becomes sufficiently small. In summary, at timestep $t$ the agent receives the weighted reward \begin{equation}\label{eq:literal_rl_rewards} \begin{aligned} R_t &= \alpha\cdot\left(\bar{d}_\phi (o_{t-1};g_i) - \bar{d}_\phi(o_t;g_i) \right) \\ &+ \beta \cdot \mathbf{1}_{\bar{d}_\phi(o_t;g_i) < \epsilon} \\ &+ \gamma \cdot \mathbf{1}_{\bar{d}_\phi(o_t;g_m) < \epsilon} \end{aligned} \end{equation} based on the RGB observations $o_t, o_{t-1}$, the current UVD subgoal $g_i\in\tau_{goal}$, and the final subgoal $g_m\in\tau_{goal}$. While similar reward formulations appear in prior work, we are the first to deliver optimally monotonic implicit rewards, obtained without supervision via UVD and derived directly from RGB features. In our experiments, we use $\alpha=5, \beta=3, \gamma=6, \epsilon=0.2$, and clip the first term to the range $[-\alpha, \alpha]$ to guard against edge cases in feature space. The final-goal-conditioned RL baseline is equivalent to setting $g_i = g_m = o_{T}\in \tau = \{o_0, \cdots, o_T\}$ and $\beta=0$ in the equation above.

Tab. II shows that simply incorporating UVD rewards greatly enhances performance. We also compare evaluation rewards between GCRL and GCRL augmented with our UVD rewards, using the R3M and VIP backbones, as seen in Fig. 2. This highlights the capability of UVD to provide more streamlined progressive rewards, which is pivotal for the agent to handle the challenging, multi-stage tasks in FrankaKitchen. To the best of our knowledge, ours is the first work to achieve such a high success rate on FrankaKitchen without human reward engineering or additional training. Notably, our RL agent, trained with the optimally monotonic UVD reward, can complete 4 sequential tasks in as few as 90 steps, in stark contrast to the over 200 steps observed in human-teleoperated demonstrations. This further illustrates the UVD reward's potential to encourage agents to accomplish multi-stage goals more efficiently. Videos of the rollouts can be found on our website.
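To make the reward concrete, the sketch below evaluates one step of the reward above from precomputed normalized embedding distances. The function name, argument layout, and defaults mirror the constants quoted in the text, but the exact interface of our training code is not implied.

```python
import numpy as np


def uvd_step_reward(
    d_prev: float,    # normalized distance to subgoal g_i at the previous step
    d_curr: float,    # normalized distance to subgoal g_i at the current step
    d_final: float,   # normalized distance to the final subgoal g_m
    alpha: float = 5.0,
    beta: float = 3.0,
    gamma: float = 6.0,
    eps: float = 0.2,
) -> float:
    """Hedged sketch of the per-step UVD reward (Eq. above)."""
    # progressive term: decrease of the normalized embedding distance,
    # with the whole term clipped to [-alpha, alpha] against feature-space outliers
    progress = float(np.clip(alpha * (d_prev - d_curr), -alpha, alpha))
    # modest bonus once the current subgoal is reached
    subgoal_bonus = beta if d_curr < eps else 0.0
    # larger terminal bonus once the final subgoal is reached
    terminal_bonus = gamma if d_final < eps else 0.0
    return progress + subgoal_bonus + terminal_bonus
```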
Inference: We now describe how UVD subgoals are applied in a multi-task setting at inference time. Recall that given a video demonstration $\tau = (o_0, \cdots, o_T)$ and UVD-identified subgoals $\tau_{goal} = (g_0, \cdots, g_m)$, we can extract a goal-augmented trajectory $\tau_{aug} = \{(o_0, a_0, g_0), \cdots, (o_T, a_T, g_m)\}$, which is used for goal-conditioned policy training as discussed in Sec. IV-B. For inference, we can similarly produce an augmented offline trajectory without requiring ground-truth actions, i.e., $\tau_{aug,infer} = \{(o_0, g_0), \cdots, (o_T, g_m)\}$. In the online rollout, after resetting the environment to $o_0$, the agent continuously predicts and executes actions conditioned on subgoal $g_0$ using the trained policy, until the embedding distance between the current observation and the subgoal falls below a pre-set positive threshold $\epsilon$ at some timestep $i$, i.e. $d_\phi(o_i; g_0) < \epsilon$, where $\phi$ is the same frozen visual backbone used in decomposition and training. The policy is then conditioned on the next subgoal, and this goal relaying continues until success or failure.

In practice, this straightforward goal-relaying inference can accumulate errors across multiple subgoal transitions, especially under noise from online rollouts. However, when an agent is explicitly guided by the task depicted in a video, incorporating the duration dedicated to each subgoal helps reduce this vulnerability. Concretely, since subgoals are aligned with observations in the video, each subgoal is also associated with the timestep of its corresponding observation. We denote the subgoal budget of subgoal $g_i = o_t$ as $\mathcal{B}_{g_i} := n + 1$, where $g_{i-1} = o_{t-n-1}$ according to Eq. 3. Building on this, we propose a secondary criterion for switching subgoals: check whether the number of steps spent on the current subgoal lies in a small neighborhood of its budget. This ensures timely transitions, avoiding both premature switching before a sub-stage is completed and delayed switching after the sub-stage has already been accomplished in the environment. To sum up, given the current observation $o_t$ and subgoal $g_i$ at timestep $t$, with the preceding subgoal $g_{i-1}$ having been active until timestep $t - h$, the subgoal transitions to $g_{i+1}$ if \begin{equation} d_\phi(o_t; g_i) < \epsilon \quad \text{and} \quad |h - \mathcal{B}_{g_i}| < \delta. \end{equation} We use $\epsilon=0.2$ and $\delta=2$ steps for all of our experiments, except in baseline tests conditioned solely on final goals.
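The sketch below expresses this dual switching test in code. It assumes $d_\phi$ is the L2 distance in the frozen embedding space and that the rollout loop tracks how many steps have been spent on the current subgoal; the function and argument names are illustrative, not our released implementation.

```python
import numpy as np


def should_switch_subgoal(
    obs_emb: np.ndarray,    # phi(o_t): frozen embedding of the current observation
    goal_emb: np.ndarray,   # phi(g_i): frozen embedding of the current subgoal
    steps_on_subgoal: int,  # h: steps elapsed since switching to g_i
    subgoal_budget: int,    # B_{g_i}: steps the demonstration spent reaching g_i
    eps: float = 0.2,
    delta: int = 2,
) -> bool:
    """Dual criterion: close enough to the subgoal AND roughly on schedule."""
    close_enough = float(np.linalg.norm(obs_emb - goal_emb)) < eps
    on_schedule = abs(steps_on_subgoal - subgoal_budget) < delta
    return close_enough and on_schedule
```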
BibTeX
@inproceedings{zhang2024universal,
title={Universal visual decomposer: Long-horizon manipulation made easy},
author={Zhang, Zichen and Li, Yunshuang and Bastani, Osbert and Gupta, Abhishek and Jayaraman, Dinesh and Ma, Yecheng Jason and Weihs, Luca},
booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
pages={6973--6980},
year={2024},
organization={IEEE}
}