Results
RT-Sketch is a sketch-to-action behavior-cloning agent that is:
(1) On par with image-conditioned and language-conditioned agents for tabletop / countertop manipulation
(2) Compatible with sketches of varied detail
(3) Robust to visual distractors
(4) Unaffected by semantic ambiguities
Tabletop / Countertop Manipulation
For straightforward manipulation tasks such as those in the RT-1 benchmark, RT-Sketch performs on par with language-conditioned and image-conditioned agents for nearly all skills:
Move Near Skill
Pick Drawer
Drawer Open
Drawer Close
Knock
Upright
Robustness to Sketch Detail
RT-Sketch also accepts input sketches with varied levels of detail, ranging from free-hand sketches to colorized sketches, without a performance drop relative to upper-bound representations such as edge-detected images.
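To make "edge-detected image" concrete, here is a minimal sketch of how such a representation could be derived from a grayscale goal image. It uses a plain Sobel filter as a stand-in, since the text does not say which edge detector was used; the function name `sobel_edges`, the threshold value, and the toy image are all assumptions made for illustration.

```python
def sobel_edges(image, threshold):
    """Return a binary edge map for a grayscale image given as a list of rows.

    Illustrative stand-in only: the edge detector behind the reported
    results is not specified in the text.
    """
    height, width = len(image), len(image[0])
    gx_kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient
    gy_kernel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient
    edges = [[0] * width for _ in range(height)]
    # Border pixels are left at 0 because the 3x3 window does not fit there.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += gx_kernel[dy][dx] * pixel
                    gy += gy_kernel[dy][dx] * pixel
            # Mark the pixel as an edge if the gradient magnitude is large.
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


# Toy 6x6 image (hypothetical): dark left half, bright right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
edge_map = sobel_edges(image, threshold=128)
for row in edge_map:
    print(row)
```

The edge map keeps only the boundary between the dark and bright regions. It discards color and texture while preserving geometry, which is the sense in which an edge-detected image sits at the detailed end of the sketch spectrum.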
Move Near Skill
Drawer Open Skill
Emergent Capabilities: Robustness to Visual Distractors
Although RT-Sketch is only trained on distractor-free settings, we find that it is able to handle visual distractors in the scene well, while goal-image conditioned policies are easily thrown out of distribution and fail to make task progress. This is likely due to the minimal nature of sketches, which inherently helps the policy attend to only task-relevant objects.
In terms of perceived semantic and spatial alignment, rated on a 1-7 scale, RT-Sketch achieves 1.5X and 1.6X improvements, respectively, over a goal-image-conditioned policy.
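One plausible reading of these "X" factors is a ratio of mean ratings on the 1-7 scale, though the text does not spell out the exact computation. The sketch below works through that arithmetic under that assumption. The function name `improvement_factor` and every rating in it are invented for illustration and are not data from the study.

```python
from statistics import mean


def improvement_factor(method_ratings, baseline_ratings):
    """Ratio of mean Likert ratings, method over baseline (assumed definition)."""
    for rating in list(method_ratings) + list(baseline_ratings):
        if not 1 <= rating <= 7:
            raise ValueError(f"rating {rating} is outside the 1-7 scale")
    return mean(method_ratings) / mean(baseline_ratings)


# Hypothetical ratings, invented for illustration only (not from the study).
sketch_semantic = [6, 7, 5, 6]      # mean 6
baseline_semantic = [3, 2, 4, 3]    # mean 3
factor = improvement_factor(sketch_semantic, baseline_semantic)
print(f"{factor:.1f}X")
```

Because the scale starts at 1 rather than 0, a ratio of means can never exceed 7X, so factors of this kind are best read alongside the underlying mean ratings.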
Goal Image
Rollout
RT-Goal Image
Goal Sketch
Rollout
RT-Sketch
Semantic Ambiguity
While convenient, language instructions can often be underspecified, ambiguous, or may require lengthy descriptions to communicate task goals effectively. These issues do not arise with sketches, which offer a minimal yet expressive means of conveying goals. We find that RT-Sketch is performant in scenarios where language can be ambiguous or too out-of-distribution for policies like RT-1 to handle.
In terms of perceived semantic and spatial alignment, rated on a 1-7 scale, RT-Sketch achieves 2.4X and 2.8X improvements, respectively, over RT-1.
Language Goal + Rollout
RT-1
Goal Sketch
Rollout
RT-Sketch
Failure Modes
RT-Sketch's main failure modes are imprecision and moving the wrong object. The first typically appears when RT-Sketch positions an object correctly but fails to reorient it (common in the upright task). The second is most apparent in the presence of visual distractor objects, where RT-Sketch mistakenly picks up the wrong object and puts it in the appropriate place. We posit that both failures stem from RT-Sketch being trained on GAN-generated sketches, which occasionally do not preserve geometric details well, leading the policy to not pay close attention to objects or object orientations.
Imprecision
Coke can moved to correct location, but not upright
Pepsi can moved to correct location, but not upright
Wrong Object
Apple moved instead of Coke can
Coke can moved instead of fruit