Learning to Design and Use Tools for Robotic Manipulation
Stanford University
Framework
Solving a task using learned designer and controller policies. During the design phase, the designer policy outputs the parameters for a tool that will help solve the given task. In the control phase, the controller policy outputs motor commands given the tool structure, task specification, and environment observation.
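To make the two-phase rollout concrete, here is a minimal sketch of a design phase followed by a control phase. The class names (DesignerPolicy, ControllerPolicy) and the environment interface (env.reset, env.step) are illustrative placeholders, not the paper's actual API.

```python
# Sketch of the design-then-control rollout; all names are hypothetical.
import numpy as np


class DesignerPolicy:
    def sample_tool_params(self, task_spec: np.ndarray) -> np.ndarray:
        # Placeholder: a real policy would map the task specification
        # (e.g. a goal pose) to tool parameters such as link lengths/angles.
        return np.zeros(6)


class ControllerPolicy:
    def act(self, tool_params, task_spec, obs) -> np.ndarray:
        # Placeholder: a real policy conditions on the designed tool,
        # the task specification, and the current observation.
        return np.zeros(3)


def rollout(env, designer: DesignerPolicy, controller: ControllerPolicy, task_spec):
    # Design phase: choose a tool once, before any interaction.
    tool_params = designer.sample_tool_params(task_spec)
    obs = env.reset(tool_params)  # environment instantiates the designed tool

    # Control phase: issue motor commands with the designed tool.
    total_reward, done = 0.0, False
    while not done:
        action = controller.act(tool_params, task_spec, obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward
```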
Sample Efficiency
Tasks (figure panels): Push, Catch Balls, Scoop, Fetch Cube, Lift Cup, 3D Scoop.
Learning curves for our framework, prior methods, and baselines. Across all tasks, our framework achieves improved performance and sample efficiency. Shaded areas indicate standard error across 6 random seeds for all methods, except on the 3D Scoop task, where we use 3 seeds due to computational constraints.
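For reference, shaded bands of this kind can be computed from per-seed return curves as mean ± standard error. The sketch below assumes the returns are stored in a NumPy array of shape (num_seeds, num_eval_points); it is not the paper's plotting code.

```python
# Mean and standard error across seeds at each evaluation point.
import numpy as np


def mean_and_standard_error(returns: np.ndarray):
    mean = returns.mean(axis=0)
    # Standard error = sample std (ddof=1) / sqrt(number of seeds).
    sem = returns.std(axis=0, ddof=1) / np.sqrt(returns.shape[0])
    return mean, sem


# Example with synthetic data: 6 seeds, 100 evaluation points.
rng = np.random.default_rng(0)
returns = rng.normal(loc=np.linspace(0.0, 1.0, 100), scale=0.1, size=(6, 100))
mean, sem = mean_and_standard_error(returns)
lower, upper = mean - sem, mean + sem  # bounds of the shaded band
```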
Design-Control Tradeoff
Figure panels: control cost / design cost ratio as a function of α, with example tools generated at α = 0.0, 0.3, 0.7, and 1.0.
Qualitative examples of tools generated by setting our tradeoff parameter α to different values. As α increases, the designer policy creates tools with shorter links at the left and right sides to decrease material usage. At low α values, the designer produces larger tools, which spares the control policy from having to move the tool far.
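One plausible reading of the tradeoff parameter is as a convex weighting between the design (material) cost and the control (motion) cost; the paper's exact objective may differ, so the sketch below is an illustrative assumption only.

```python
# Assumed form of the design-control tradeoff: alpha weights material usage
# against control effort. This is an illustration, not the paper's objective.
def combined_cost(design_cost: float, control_cost: float, alpha: float) -> float:
    """Weighted cost the agent is assumed to minimize.

    alpha = 1.0 -> only material usage matters (small tools, more motion).
    alpha = 0.0 -> only control effort matters (large tools, little motion).
    """
    return alpha * design_cost + (1.0 - alpha) * control_cost


# Example: a large tool costs more material but needs less motion.
large_tool = combined_cost(design_cost=2.0, control_cost=0.5, alpha=0.3)  # 0.95
small_tool = combined_cost(design_cost=0.5, control_cost=2.0, alpha=0.3)  # 1.55
# With low alpha the large tool is preferred; raising alpha flips the choice.
```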
Generalization
(a) Initialization ranges and zero-shot performance when cutting out 60% of the area of the entire possible training region.
(b) Returns for policies trained with varying relative cutout region area.
(c) Fine-tuning performance compared to learning from scratch across 4 target goals.
Interpolation results on the pushing task. In (a), we plot the success (light blue) and failure (dark blue) goal regions. Areas within the dotted yellow borders denote unseen cutout regions (interpolation). The region within the teal border (but outside the cutout regions) is the training region, and the area outside the teal border is unseen during training (extrapolation). (b) and (c) show return curves averaged over 3 runs; shaded regions denote standard error. In (b), we observe that policies trained with small cutouts perform nearly as well as a policy trained on all goals. In (c), we show that even for goal poses far from the initial training region, our policies learn to solve the task within a handful of gradient steps and are much more effective than learning from scratch.
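As an illustration of the fine-tuning comparison in (c), the sketch below contrasts adapting a pretrained goal-conditioned policy with training the same architecture from scratch. Here make_policy, pretrained_weights, and policy_gradient_step are hypothetical stand-ins, not the paper's code; the point is only that fine-tuning starts from the weights learned on the original goal region.

```python
# Sketch: fine-tune a pretrained policy on a new goal vs. train from scratch.
import copy

import torch


def adapt_to_new_goal(make_policy, pretrained_weights, new_goal_batch,
                      policy_gradient_step, num_steps=10, lr=3e-4):
    # Fine-tuning: initialize from the pretrained goal-conditioned policy.
    finetuned = make_policy()
    finetuned.load_state_dict(copy.deepcopy(pretrained_weights))

    # Baseline: same architecture, random initialization.
    scratch = make_policy()

    opt_ft = torch.optim.Adam(finetuned.parameters(), lr=lr)
    opt_sc = torch.optim.Adam(scratch.parameters(), lr=lr)

    for _ in range(num_steps):  # "a handful of gradient steps"
        policy_gradient_step(finetuned, opt_ft, new_goal_batch)
        policy_gradient_step(scratch, opt_sc, new_goal_batch)
    return finetuned, scratch
```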
Goal-Conditioned Design and Control
