CARVIEW

MOTORHOMES

Select Language

HTTP/2 200 server: GitHub.com content-type: text/html; charset=utf-8 last-modified: Mon, 04 Aug 2025 07:51:39 GMT access-control-allow-origin: * strict-transport-security: max-age=31556952 etag: W/"6890668b-a6cc" expires: Sun, 28 Dec 2025 18:21:26 GMT cache-control: max-age=600 content-encoding: gzip x-proxy-cache: MISS x-github-request-id: 2945:2D64E0:7D43B9:8C943D:695172CD accept-ranges: bytes age: 0 date: Sun, 28 Dec 2025 18:11:26 GMT via: 1.1 varnish x-served-by: cache-bom-vanm7210073-BOM x-cache: MISS x-cache-hits: 0 x-timer: S1766945486.059712,VS0,VE224 vary: Accept-Encoding x-fastly-request-id: 8f5006fc1873d8d501f9e6a5417e709588dc6ea7 content-length: 5459 AdpaTac-Dex

Adaptive Visuo-Tactile Fusion with Predictive Force
Attention for Dexterous Manipulation

IROS 2025

Jinzhou Li^{1,2 *}, Tianhao Wu^{1,2,3 *}, Jiyao Zhang^{1,2,3 †}, Zeyuan Chen^{1,2 †}, Haotian Jin², Mingdong Wu^1,2,
Yujun Shen⁵, Yaodong Yang⁴, Hao Dong^{1,2,3 ✉}

¹Center on Frontiers of Computing Studies, School of Computer Science, Peking University ²PKU-Agibot Lab, School of Computer Science, Peking University ³National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University ⁴Institute for Artificial Intelligence, Peking University; ⁵Ant Research
^{* and †} Equal contribution, ^✉ Corresponding author

ArXiv Video Hardware Policy Teleoperation

AdpaTac-Dex extracts visual and tactile features to predict future force and combines it with the observed force to
adaptively adjust the attention of different modalities at different stages of dexterous manipulation.

Abstract

Effectively utilizing multi-sensory data is important for robots to generalize across diverse tasks. However, the heterogeneous nature of these modalities makes fusion challenging. Existing methods propose strategies to obtain comprehensively fused features but often ignore the fact that each modality requires different levels of attention at different manipulation stages. To address this, we propose a force-guided attention fusion module that adaptively adjusts the weights of visual and tactile features without human labeling. We also introduce a self-supervised future force prediction auxiliary task to reinforce the tactile modality, improve data imbalance, and encourage proper adjustment. Our method achieves an average success rate of 93% across three fine-grained, contact-rich tasks in real-world experiments. Further analysis shows that our policy appropriately adjusts attention to each modality at different manipulation stages.

(a) We use a pre-trained tactile encoder to encode 3D tactile signals. (b) We use a sparse encoder to encode the point cloud data. (c) The encoded visual and tactile features are used to predict the future net force. (d) The predicted future net force is combined with the observed net force to guide visual-tactile fusion through an attention mechanism. (e) The fused action feature is used as a condition for learning the dexterous manipulation policy.

Video

Force Variations in Expert Demonstrations

We randomly selected 10 expert demonstrations from three tasks to illustrate the variations in net force values during task execution. The data shows that due to environmental noise (especially in the flip task) and task variations, pre-defining precise contact force thresholds is extremely challenging. Setting appropriate force thresholds is difficult in practical robotic applications, as force values inevitably change during manipulation, even within the same task. Additionally, various environmental noises directly affect the effectiveness of thresholds, making statically predetermined thresholds inadequate for the dynamic requirements of real-world operations.

Flip