Touch begins where vision ends: Generalizable policies for contact-rich manipulation
Abstract
Data-driven approaches struggle with precise manipulation: imitation learning requires many hard-to-obtain demonstrations, while reinforcement learning yields brittle, non-generalizable policies. We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks by decomposing them into two phases: a reaching phase, where a vision-language model (VLM) enables scene-level reasoning to localize the object of interest, and a local interaction phase, where a reusable, scene-agnostic ViTaL policy performs contact-rich manipulation using egocentric vision and tactile sensing. This approach is motivated by the observation that while scene context varies, the low-level interaction remains consistent across task instances. Local policies trained once in a canonical setting can therefore generalize via a localize-then-execute strategy. ViTaL achieves ~90% success on contact-rich tasks in unseen environments and is robust to distractors. ViTaL's effectiveness stems from three key insights: (1) foundation models for segmentation enable training robust visual encoders via behavior cloning; (2) these encoders improve the generalizability of policies learned using residual RL; and (3) tactile sensing significantly boosts performance in contact-rich tasks. Ablation studies validate each of these insights, and we demonstrate that ViTaL integrates well with high-level VLMs, enabling robust, reusable low-level skills.
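The sketch below illustrates the localize-then-execute decomposition described in the abstract: a scene-level localizer drives the reaching phase, after which a scene-agnostic visuotactile policy (a behavior-cloned base plus an RL-learned residual) takes over using only egocentric vision and touch. It is a minimal, assumption-laden illustration: all class and function names (VLMLocalizer, LocalVisuoTactilePolicy, run_episode, the robot interface) are hypothetical placeholders, not the authors' implementation.

import numpy as np


class VLMLocalizer:
    """Scene-level reasoning (hypothetical interface): a VLM plus a segmentation
    foundation model return a rough target pose for the object of interest."""

    def localize(self, scene_image: np.ndarray, task_prompt: str) -> np.ndarray:
        # Placeholder: in practice this would query a VLM / open-vocabulary
        # segmentation model and back-project the mask to a 3D target pose.
        return np.zeros(3)


class LocalVisuoTactilePolicy:
    """Scene-agnostic local skill (hypothetical interface): a BC base policy
    with an RL-trained residual, conditioned on egocentric vision and touch."""

    def act(self, wrist_image: np.ndarray, tactile: np.ndarray) -> np.ndarray:
        base_action = self._bc_policy(wrist_image, tactile)
        residual = self._residual_policy(wrist_image, tactile)
        return base_action + residual

    def _bc_policy(self, wrist_image, tactile):
        return np.zeros(6)  # placeholder 6-DoF end-effector delta

    def _residual_policy(self, wrist_image, tactile):
        return np.zeros(6)  # placeholder residual correction


def run_episode(robot, localizer, local_policy, task_prompt, max_local_steps=200):
    # Phase 1 (reaching): localize the object from the scene view and move
    # the end effector to a canonical pre-contact pose near it.
    target_pose = localizer.localize(robot.get_scene_image(), task_prompt)
    robot.move_to(target_pose)

    # Phase 2 (local interaction): hand control to the reusable visuotactile
    # policy, which operates on egocentric observations only.
    for _ in range(max_local_steps):
        action = local_policy.act(robot.get_wrist_image(), robot.get_tactile())
        robot.apply_action(action)
        if robot.task_done():
            break

Because the local policy never sees the full scene, it can be trained once in a canonical setting and reused wherever the localizer can place the end effector near the target.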
ViTaL Policy Learning
ViTaL Method
Policy Learning for 4 Precise Tasks
The following videos show learned ViTaL policy rollouts executed on the robot at 1x speed.
Plug in Socket
USB Insertion
Card Swiping
Key in Lock
ViTaL deployment with Reaching and Manipulation
The following videos show rollouts executed on the robot that combine reaching and precise manipulation under spatial and environmental variations.
Plug in Socket
USB Insertion
Card Swiping
Key in Lock
ViTaL deployment with Input Modalities
The following videos show rollouts with the fisheye view (bottom-left) and tactile readings (top-right) displayed simultaneously.
Plug in Socket
USB Insertion
Card Swiping
Key in Lock
ViTaL deployment with Perturbation
The following videos show rollouts of the robot performing precise manipulation under human perturbations.
ViTaL deployment on a New Robot
The following videos show rollouts of precise manipulation with policies trained on a different robot and environment setup.
Experimental Results
We run 10 evaluations per seed across 3 seeds, on held-out, unseen target object positions for each task.
Policy Performance for In-Domain Experiments
Scene and Spatial Generalization of ViTaL
Ablations and Design Choices
VLM Navigation for Spatial Generalization Results