Methods

UVD low-level pseudocode in Python

from typing import Callable

import numpy as np
from numpy.linalg import norm
from scipy.signal import argrelextrema

def UVD(
    embeddings: np.ndarray,  # (L, d); convert torch tensors to NumPy first
    smooth_fn: Callable[[np.ndarray], np.ndarray],
    min_interval: int = 15,
) -> np.ndarray:
    # last frame as the last subgoal (positive index so the loop below runs)
    cur_goal_idx = len(embeddings) - 1
    # saving (reversed) subgoal indices (timesteps)
    goal_indices = [cur_goal_idx]
    cur_emb = embeddings.copy()  # L, d
    while cur_goal_idx > min_interval:
        # smoothed embedding distance curve (L,)
        d = norm(cur_emb - cur_emb[-1], axis=-1)
        d = smooth_fn(d)
        # monotonicity breaks (e.g. maxima)
        extremas = argrelextrema(d, np.greater)[0]
        # keep only breaks far enough from the current subgoal
        extremas = [
            e for e in extremas
            if cur_goal_idx - e > min_interval
        ]
        if extremas:
            # update subgoal by Eq.(3)
            cur_goal_idx = extremas[-1] - 1
            goal_indices.append(cur_goal_idx)
            cur_emb = embeddings[:cur_goal_idx + 1]
        else:
            break
    # embeddings of the subgoal frames, shape (num_subgoals, d)
    return embeddings[
        goal_indices[::-1]  # chronological
    ]
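
The `smooth_fn` argument is left abstract in the pseudocode above. The following is a minimal sketch of one possible choice, assuming a Gaussian-kernel (Nadaraya-Watson style) smoother. The names `gaussian_smooth` and `local_maxima`, the bandwidth, and the toy trajectory are illustrative assumptions and do not come from the page. It shows how smoothing removes spurious maxima from the distance-to-goal curve before monotonicity breaks are extracted.

```python
import numpy as np


def gaussian_smooth(d: np.ndarray, bandwidth: float = 3.0) -> np.ndarray:
    """Kernel-regression smoothing of a 1D curve (hypothetical helper)."""
    t = np.arange(len(d), dtype=float)
    # weights[i, j]: influence of sample j on the smoothed value at frame i
    weights = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ d


def local_maxima(d: np.ndarray) -> np.ndarray:
    """Interior strict local maxima, like argrelextrema(d, np.greater)[0]."""
    return np.flatnonzero((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])) + 1


# Toy trajectory: move away from the final state for 40 frames, then return
# to it over 40 frames, with a deterministic wiggle standing in for noise.
frames = np.arange(80)
x = np.concatenate([np.linspace(5.0, 15.0, 40), np.linspace(15.0, 0.0, 41)[1:]])
x = x + 0.3 * np.sin(1.7 * frames)
embeddings = np.stack([x, np.zeros_like(x)], axis=1)  # (L, d) with d = 2

# Distance of every frame to the last frame, before and after smoothing
raw = np.linalg.norm(embeddings - embeddings[-1], axis=-1)
smooth = gaussian_smooth(raw)

print("raw maxima:     ", local_maxima(raw))
print("smoothed maxima:", local_maxima(smooth))
```

A function like `gaussian_smooth` could be passed as `smooth_fn`; a larger bandwidth suppresses more spurious breaks at the cost of blurring the location of the true turning point.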

Universal Visual Decomposer (UVD)

Init: frozen visual encoder $\phi$, $\tau = \{o_0, \cdots, o_T\}$

Init: set of subgoals $\tau_{goal} = \{\}$, $t = T$

While $t$ not small enough:

$\tau_{goal} = \tau_{goal} \cup \{o_{t}\}$
$o_{t-n-1} := \arg\max_{o_h} d_\phi(o_h; o_t) < d_\phi(o_{h+1}; o_t), \; h < t$ (Eq. 3)
$t = t - n - 1$

End
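
As a worked example of the loop above, here is a minimal pure-Python sketch on scalar "embeddings", assuming $d_\phi(o_h; o_t) = |o_h - o_t|$ and no smoothing. The function names and the `min_len` threshold (standing in for "$t$ not small enough") are hypothetical. Each iteration adds $o_t$ to the subgoal set and then jumps to the latest $h < t$ at which the distance to $o_t$ stops decreasing monotonically.

```python
from typing import Optional


def latest_break(dist_to_goal: list[float]) -> Optional[int]:
    """Largest h < t with d[h] < d[h + 1], where t is the last index (Eq. 3)."""
    t = len(dist_to_goal) - 1
    for h in range(t - 1, -1, -1):
        if dist_to_goal[h] < dist_to_goal[h + 1]:
            return h
    return None  # distance is monotone: no earlier subgoal


def decompose(obs: list[float], min_len: int = 1) -> list[int]:
    """Return subgoal timesteps in chronological order."""
    t = len(obs) - 1
    goals = []
    while t > min_len:  # "t not small enough"
        goals.append(t)
        # distances of all frames up to t, measured to the current subgoal o_t
        dist = [abs(o - obs[t]) for o in obs[: t + 1]]
        h = latest_break(dist)
        if h is None:
            break
        t = h
    return goals[::-1]


# The scalar state rises to 3, dips to 1, then rises to the final value 4.
subgoals = decompose([0, 1, 2, 3, 2, 1, 2, 3, 4])
print(subgoals)
```

In this toy trajectory the recovered subgoals sit one frame before each turning point, which matches the `extremas[-1] - 1` update in the Python pseudocode.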

Visualization of UVD recursive decomposition

Experiments

Simulation Results

In-domain and out-of-domain IL results on FrankaKitchen. We report the mean and standard deviation (in parentheses) of the success rate (full-stage completion) and of the completion percentage (out of 4 stages) for GCBC policies trained on a range of existing pretrained visual representations, over three seeds. Highlighted scores represent improvements in out-of-domain evaluations and in-domain results with gains exceeding 0.01.


Representation	Method	InD success	InD completion	OoD success	OoD completion
VIP (ResNet 50)	GCBC	0.736 (0.011)	0.898 (0.006)	0.035 (0.014)	0.236 (0.057)
VIP (ResNet 50)	GCBC + Ours	0.737 (0.012)	0.903 (0.009)	0.188 (0.024)	0.566 (0.020)
R3M (ResNet 50)	GCBC	0.742 (0.026)	0.856 (0.006)	0.014 (0.007)	0.223 (0.029)
R3M (ResNet 50)	GCBC + Ours	0.738 (0.024)	0.879 (0.000)	0.084 (0.045)	0.427 (0.002)
LIV (ResNet 50)	GCBC	0.608 (0.068)	0.816 (0.046)	0.008 (0.008)	0.116 (0.082)
LIV (ResNet 50)	GCBC + Ours	0.649 (0.013)	0.868 (0.007)	0.066 (0.025)	0.496 (0.033)
CLIP (ResNet 50)	GCBC	0.391 (0.017)	0.692 (0.008)	0.005 (0.001)	0.119 (0.017)
CLIP (ResNet 50)	GCBC + Ours	0.394 (0.036)	0.701 (0.012)	0.073 (0.003)	0.403 (0.01)
DINO-v2 (ViT-Large)	GCBC	0.329 (0.025)	0.654 (0.019)	0.012 (0.01)	0.261 (0.213)
DINO-v2 (ViT-Large)	GCBC + Ours	0.322 (0.053)	0.669 (0.037)	0.055 (0.025)	0.446 (0.034)
VIP (ResNet 50)	GCBC-GPT	0.702 (0.029)	0.841 (0.02)	0.039 (0.027)	0.302 (0.028)
VIP (ResNet 50)	GCBC-GPT + Ours	0.708 (0.056)	0.897 (0.024)	0.213 (0.054)	0.600 (0.038)
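
To make the out-of-domain gap concrete, the snippet below computes the absolute gain in mean OoD success for each row pair, using only the mean values copied from the table above (standard deviations are ignored). The dictionary layout and variable names are illustrative.

```python
# (mean OoD success without UVD, with UVD), copied from the table above
ood_success = {
    "VIP": (0.035, 0.188),
    "R3M": (0.014, 0.084),
    "LIV": (0.008, 0.066),
    "CLIP": (0.005, 0.073),
    "DINO-v2": (0.012, 0.055),
    "VIP (GCBC-GPT)": (0.039, 0.213),
}

# Absolute gain in mean OoD success from adding UVD
gains = {
    name: round(with_uvd - without_uvd, 3)
    for name, (without_uvd, with_uvd) in ood_success.items()
}

for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} +{gain:.3f}")
```

Every representation in the table shows a positive gain in mean OoD success; the largest is for the GPT policy on VIP.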

Next, we visualize the qualitative results for one of the tasks in FrankaKitchen: open the microwave, turn on the bottom burner, toggle the light switch, and slide the cabinet. We compare the decomposition results across different frozen visual backbones, along with 3D t-SNE visualizations (colored by subgoal). Representations pretrained with temporal objectives, such as VIP and R3M, yield smoother, more continuous, and more monotone clusters in feature space than the others, whereas the ResNet trained for supervised classification on ImageNet-1k yields the sparsest embeddings.

UVD Decomposition Results in Simulation

GCBC ❌

GCBC + UVD

GCRL ❌

GCRL + UVD

Real Robot Results

In-domain evaluation. For real-world applications, we tested UVD on three multistage tasks: placing an apple in an oven and closing the oven ($\texttt{Apple-in-Oven}$), pouring fries, then placing on a rack ($\texttt{Fries-and-Rack}$), and folding a cloth ($\texttt{Fold-Cloth}$). The corresponding videos show how we break these tasks down into semantically meaningful subgoals. For each of the three tasks we show two successful rollouts and one failed rollout. All real-robot videos are played at 2x speed.

$\texttt{Apple-in-Oven}$

UVD Decomposition Results

❌

$\texttt{Fries-and-Rack}$

UVD Decomposition Results

❌

$\texttt{Fold-Cloth}$

UVD Decomposition Results

❌

Compositional Generalization. We evaluate UVD's ability to generalize compositionally by introducing unseen initial states for these tasks. While methods like GCBC fail (first row) under these circumstances, GCBC + UVD (second row) successfully adapts.

GCBC ❌

GCBC + UVD

Robustness with Human Involvement. We further demonstrate that UVD can recover, or continue to complete the task, under human interference. In the $\texttt{Apple-in-Oven}$ and $\texttt{Fries-and-Rack}$ tasks, a human either resets the scene by putting the apple back in its initial position or completes an intermediate step so that the policy has to skip it. Our method remains robust in these cases.

Reset the apple to initial position

Accomplish intermediate step "pushing"

Accomplish intermediate step "pouring"

Implementation Details

Policies

Model: To underscore that our method is an off-the-shelf component applicable to different policies, we ablate with a Multilayer Perceptron (MLP) based single-step policy and a GPT-like causal transformer policy. The MLP ingests the frozen visual embeddings of the step-wise RGB observations and goal images, followed by a 1D BatchNorm, together with the 9D proprioceptive data encoded through a single layer followed by a LayerNorm. Our GPT policy removes the BatchNorm and replaces the MLP with causal self-attention blocks consisting of 8 layers, 8 heads, and an embedding dimension of 768. We set an attention dropout rate of 0.1 and a context length of 10. We replace the conventional LayerNorm with Root Mean Square Layer Normalization (RMSNorm) and add rotary position embedding (RoPE) to the transformer. Actions are predicted via a linear layer. At inference time, we cache the keys and values of the self-attention at every step, so that inference does not become a bottleneck as the context length scales up. Nevertheless, in the FrankaKitchen tasks, we observed that a longer context length tends to cause overfitting and a drop in performance. Therefore, we consistently use a context length of 10 for all experiments. More details can be found in the appendix.
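
For readers unfamiliar with RMSNorm, the following is a minimal NumPy sketch of the standard formulation: it rescales each token by the root mean square of its features and applies a learned per-channel gain, without the mean subtraction or bias of LayerNorm. The function name, the `eps` value, and the random input are assumptions for illustration; the page does not show the policy's actual implementation.

```python
import numpy as np


def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis: divide by the RMS, then apply a gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight


# One context window of token features: (context length, embedding dim)
tokens = np.random.default_rng(0).standard_normal((10, 768))
gain = np.ones(768)  # learned per-channel gain, initialised to 1
normed = rms_norm(tokens, gain)

# With unit gain, every token ends up with a mean square close to 1
print(np.mean(normed ** 2, axis=-1))
```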

Hyperparameter	MLP-Policy	GPT-Policy
Optimizer	AdamW	AdamW
Learning Rate	3e-4	3e-4
LR Schedule	cos decay	cos decay
Warmup Steps	0	1000
Decay Steps	150k	200k
Weight Decay	0.01	0.1
Betas	[0.9, 0.999]	[0.9, 0.99]
Max Gradient Norm	1.0	1.0
Batch Size	512	128

IL training hyperparameters
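
The learning-rate schedule in the table above can be sketched as follows. The table only says "cos decay" with a number of warmup and decay steps, so several details here are assumptions: the warmup is taken to be linear, the decay steps are counted after the warmup ends, and the final learning rate is taken to be 0. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(
    step: int,
    base_lr: float = 3e-4,
    warmup_steps: int = 1000,
    decay_steps: int = 200_000,
    final_lr: float = 0.0,  # assumption: not given in the table
) -> float:
    """Linear warmup followed by cosine decay (illustrative sketch)."""
    if step < warmup_steps:
        # linear ramp from 0 to base_lr (assumed warmup shape)
        return base_lr * step / warmup_steps
    # fraction of the decay phase completed, clipped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine


# GPT policy: 1000 warmup steps, 200k decay steps
print(lr_at(500), lr_at(1000), lr_at(101_000), lr_at(201_000))
# MLP policy: no warmup, 150k decay steps
print(lr_at(0, warmup_steps=0, decay_steps=150_000))
```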

Hyperparameter	Value
Hidden Dim.	[1024, 512, 256]
Activation	ReLU
Proprio. Hidden dim.	512
Proprio. Activation	Tanh
Visual Norm.	Batchnorm1d
Proprio. Norm.	LayerNorm
Action Activation	Tanh
Trainable Parameters	3.3M

MLP policy hyperparameters

Hyperparameter	Value
Context Length	10
Embedding Dim.	768
Layers	8
Heads	8
Embedding Dropout	0.0
Attention Dropout	0.1
Normalization	RMSNorm
Action Activation	Tanh
Trainable Parameters	58.6M

GPT policy hyperparameters
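
As a rough sanity check on the reported 58.6M trainable parameters, the sketch below applies the common back-of-the-envelope count for a transformer block: four `d x d` attention projections plus a two-layer feed-forward network. The feed-forward expansion ratio of 4 is an assumption (the page does not state it), and biases, normalization gains, input projections, and the action head are ignored.

```python
def transformer_block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one block (biases and norms ignored)."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    feed_forward = 2 * mlp_ratio * d_model * d_model  # up and down projections
    return attention + feed_forward


layers, d_model = 8, 768  # from the GPT policy table above
estimate = layers * transformer_block_params(d_model)
print(f"{estimate / 1e6:.1f}M parameters in the transformer blocks")
```

Under these assumptions the blocks alone account for most of the reported total; this sketch does not model where the remainder comes from.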

Training

Inference

UVD in the wild

In closing, we've found that UVD is not limited to robotic settings: it is also highly effective on human videos in household scenarios. Here are some examples of how UVD decomposes such videos into subgoals:

Open a cabinet and rearrange

Universal Visual Decomposer:
Long-Horizon Manipulation Made Easy

Open a drawer and charge

Unlock a computer

Wash hands in bathroom
