from typing import Callable

import numpy as np
import torch
from numpy.linalg import norm
from scipy.signal import argrelextrema


def UVD(
    embeddings: np.ndarray | torch.Tensor,
    smooth_fn: Callable,
    min_interval: int = 15,
) -> np.ndarray:
    # work on a numpy copy of the per-frame embeddings, shape (L, d)
    if isinstance(embeddings, torch.Tensor):
        embeddings = embeddings.detach().cpu().numpy()
    # last frame as the last subgoal
    cur_goal_idx = len(embeddings) - 1
    # saving (reversed) subgoal indices (timesteps)
    goal_indices = [cur_goal_idx]
    cur_emb = embeddings.copy()  # (L, d)
    while cur_goal_idx > min_interval:
        # smoothed embedding-distance curve to the current subgoal, shape (L,)
        d = norm(cur_emb - cur_emb[-1], axis=-1)
        d = smooth_fn(d)
        # monotonicity breaks (e.g. local maxima)
        extremas = argrelextrema(d, np.greater)[0]
        extremas = [
            e for e in extremas
            if cur_goal_idx - e > min_interval
        ]
        if extremas:
            # update subgoal by Eq.(3)
            cur_goal_idx = extremas[-1] - 1
            goal_indices.append(cur_goal_idx)
            cur_emb = embeddings[:cur_goal_idx + 1]
        else:
            break
    # subgoal frames in chronological order
    return embeddings[goal_indices[::-1]]
In-domain and out-of-domain IL results on FrankaKitchen. We report the mean and standard deviation of the success rate (full-stage completion) and the percentage of stages completed (out of 4), evaluated across diverse existing pretrained visual representations trained with GCBC over three seeds. Highlighted scores mark improvements in out-of-domain evaluations and in-domain improvements with gains exceeding 0.01.
Next, we visualize the qualitative results for one of the tasks in FrankaKitchen: open the microwave, turn on the bottom burner, toggle the light switch, and slide the cabinet. We compare the decomposition results across different frozen visual backbones, along with 3D t-SNE visualizations (points colored by subgoal). Representations pretrained with temporal objectives, such as VIP and R3M, yield smoother, more continuous, and more monotone clusters in feature space than the others, whereas the ResNet trained for supervised classification on ImageNet-1k yields the sparsest embeddings.
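As a rough sketch of this kind of visualization (assuming scikit-learn's TSNE and matplotlib; plot_subgoal_tsne and its goal_indices argument, the chronological subgoal timesteps, are hypothetical names, not from the paper's code), one could project the per-frame embeddings to 3D and color each frame by the subgoal segment it falls into:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE


def plot_subgoal_tsne(embeddings: np.ndarray, goal_indices: list[int]) -> None:
    # Hypothetical helper: 3D t-SNE of per-frame embeddings, colored by the
    # subgoal segment each frame belongs to (goal_indices sorted ascending).
    points = TSNE(n_components=3, init="pca", random_state=0).fit_transform(embeddings)
    labels = np.searchsorted(goal_indices, np.arange(len(embeddings)))
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=labels, cmap="tab10", s=5)
    plt.show()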
UVD Decomposition Results in Simulation
[Videos] Rollout panels: GCBC ❌, GCBC + UVD, GCRL ❌.
In-domain evaluation. For real-world applications, we test UVD on three multistage tasks: placing an apple in an oven and closing the oven ($\texttt{Apple-in-Oven}$), pouring fries and then placing them on a rack ($\texttt{Fries-and-Rack}$), and folding a cloth ($\texttt{Fold-Cloth}$). The corresponding videos show how we break these tasks down into semantically meaningful subgoals. Below are two successful and one failed rollout on these three tasks. All real-robot videos are played at 2x speed.
[Videos] UVD decomposition results for each task; ❌ marks the failed rollout.
Compositional Generalization. We evaluate UVD's ability to generalize compositionally by introducing unseen initial states for these tasks. While methods like GCBC fail under these circumstances (first row), GCBC + UVD (second row) successfully adapts.
Robustness with Human Involvement. We further demonstrate that UVD can recover from human interference, or continue through it, and still complete the task. In the $\texttt{Apple-in-Oven}$ and $\texttt{Fries-and-Rack}$ tasks, a human either resets the scene by returning the apple to its initial position or skips an intermediate step. Our method remains robust in both cases.
In closing, we find that UVD is not limited to robotic settings; it is also highly effective on human videos of household scenarios. Here are some examples of how UVD decomposes subgoals: