GenSim2:
Scaling Robot Data Generation
with Multi-modal and Reasoning LLMs

Pu Hua¹*, Minghuan Liu²,³*, Annabella Macaluso²*,

Yunfeng Lin³, Weinan Zhang³, Huazhe Xu¹, Lirui Wang⁴†

Tsinghua University¹, UCSD², Shanghai Jiao Tong University³, MIT CSAIL⁴

* equal contribution. † project lead.

Conference on Robot Learning, 2024


TL;DR: GenSim2 uses multimodal LLMs to generate vast amounts of articulated, 6-DoF robotic tasks in simulation for pre-training generalist 3D multi-task policies. The framework "amplifies" limited real-world tasks and trajectories with foundation models.

Abstract

Robotic simulation today remains challenging to scale up due to the human effort required to create diverse simulation tasks and scenes. Simulation-trained policies also face scalability issues, as many sim-to-real methods focus on a single task. To address these challenges, this work proposes GenSim2, a scalable framework that leverages coding LLMs with multi-modal and reasoning capabilities for complex and realistic simulation task creation, including long-horizon tasks with articulated objects. To automatically generate demonstration data for these tasks at scale, we propose planning and RL solvers that generalize within object categories. The pipeline can generate data for up to 100 articulated tasks with 200 objects and reduces the human effort required.

To utilize such data, we propose an effective multi-task language-conditioned policy architecture, dubbed proprioceptive point-cloud transformer (PPT), that learns from the generated demonstrations and exhibits strong sim-to-real zero-shot transfer. Combining the proposed pipeline and the policy architecture, we show a promising use of GenSim2: the generated data can be used for zero-shot transfer or for co-training with real-world data, which improves policy performance by 20% compared with training exclusively on limited real data.

Generated Task Library

Primitive Tasks

Long-horizon Tasks

Real-Robot Experiments

Real Only

Sim+Real

Compared to using only 10 real-world trajectories, incorporating generated simulation data enhances the generalization of real-world policies across multiple tasks. Tasks shown here are executed using a multi-task policy.
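
One common way to co-train on simulated and real trajectories is to draw every batch from both pools at a fixed ratio. The sketch below illustrates only that idea. The function name `mixed_batch`, the `real_fraction` parameter, and the placeholder trajectory labels are assumptions made for illustration; they are not taken from the GenSim2 codebase, and the ratio shown is not a value reported by the authors.

```python
import random

def mixed_batch(sim_data, real_data, batch_size, real_fraction, rng):
    """Draw one training batch with a fixed share of real trajectories.

    Hypothetical helper: real_fraction is an assumed hyperparameter,
    not a value reported for GenSim2.
    """
    n_real = round(batch_size * real_fraction)
    n_sim = batch_size - n_real
    # Sample with replacement so that a small real set (e.g. 10 trajectories)
    # can still fill its share of every batch.
    batch = rng.choices(real_data, k=n_real) + rng.choices(sim_data, k=n_sim)
    rng.shuffle(batch)
    return batch

sim = [("sim", i) for i in range(1000)]   # placeholder simulated trajectories
real = [("real", i) for i in range(10)]   # placeholder real trajectories
batch = mixed_batch(sim, real, batch_size=32, real_fraction=0.25,
                    rng=random.Random(0))
n_real_in_batch = sum(1 for source, _ in batch if source == "real")
```

Sampling with replacement is a deliberate choice here: it keeps the scarce real data present in every batch instead of letting the much larger simulated set dominate.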

GenSim2 Framework

The GenSim2 framework consists of (1) task proposal, (2) solver creation, (3) multi-task training, and (4) generalization and sim-to-real transfer.

GenSim2 Solver Generation Pipeline

Multi-modal task solver generation pipeline that utilizes GPT-4 and optimization configurations for scalable manipulation task solutions.

Planner Overview

We demonstrate how to leverage the keypoint planner to solve the OpenBox task. Initially, constraints are defined to ensure the gripper contacts the box lid. Based on this actuation pose, specific motions are assigned to complete the task of opening the box.
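
The two steps described above, a contact constraint followed by an assigned motion, can be illustrated with a small geometric sketch. It assumes the lid rotates about a known hinge axis and that the gripper simply tracks one keypoint on the lid. The hinge location, keypoint coordinates, tolerance, and function names are hypothetical and do not come from the GenSim2 planner.

```python
import math

def contact_satisfied(gripper_tip, lid_keypoint, tol=1e-3):
    """Constraint step: the gripper tip must coincide with the lid keypoint."""
    return math.dist(gripper_tip, lid_keypoint) <= tol

def rotate_about_axis(point, origin, axis, angle):
    """Rotate `point` about the line through `origin` along the unit vector
    `axis`, using Rodrigues' rotation formula."""
    p = [point[i] - origin[i] for i in range(3)]
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(axis[i] * p[i] for i in range(3))
    cross = [axis[1] * p[2] - axis[2] * p[1],
             axis[2] * p[0] - axis[0] * p[2],
             axis[0] * p[1] - axis[1] * p[0]]
    return [origin[i] + p[i] * c + cross[i] * s + axis[i] * dot * (1 - c)
            for i in range(3)]

def lid_opening_waypoints(contact, hinge_origin, hinge_axis, open_angle, steps):
    """Motion step: sweep the contact keypoint along the arc the lid follows."""
    return [rotate_about_axis(contact, hinge_origin, hinge_axis,
                              open_angle * k / steps)
            for k in range(steps + 1)]

# Assumed geometry: hinge along -y at height 0.2 m, lid edge 0.3 m in front of it.
hinge_origin = [0.0, 0.0, 0.2]
hinge_axis = [0.0, -1.0, 0.0]
lid_keypoint = [0.3, 0.0, 0.2]
gripper_tip = [0.3, 0.0, 0.2]  # pose assumed to come from the constraint solve

if contact_satisfied(gripper_tip, lid_keypoint):
    waypoints = lid_opening_waypoints(lid_keypoint, hinge_origin, hinge_axis,
                                      open_angle=math.pi / 2, steps=10)
```

Every waypoint stays at the same distance from the hinge, so a gripper that tracks them keeps contact with the lid while it swings open by the assumed 90 degrees.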

Proprioceptive Point-cloud Transformer

The proposed proprioceptive point-cloud transformer (PPT) policy architecture maps language, point cloud, and proprioception inputs into a shared latent space for action prediction.
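
As a rough illustration of what mapping several modalities into a shared latent space can look like, the sketch below projects each input into tokens of one common width and concatenates them into a single sequence. It uses plain Python with random, untrained weights. The dimensions, the `embed` helper, and the omission of the transformer trunk and action head are all simplifying assumptions; this is not the PPT implementation.

```python
import random

LATENT_DIM = 8  # assumed shared token width

def make_projection(in_dim, out_dim, rng):
    """Random, untrained weight matrix of shape (out_dim, in_dim)."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

def embed(vector, weights):
    """Linear map of one input vector into the shared latent width."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

rng = random.Random(0)
lang_proj = make_projection(16, LATENT_DIM, rng)   # assumed 16-d language feature
point_proj = make_projection(3, LATENT_DIM, rng)   # xyz coordinates per point
prop_proj = make_projection(7, LATENT_DIM, rng)    # assumed 7 joint positions

# Placeholder inputs standing in for real observations.
language_feature = [rng.random() for _ in range(16)]
point_cloud = [[rng.random() for _ in range(3)] for _ in range(5)]
proprioception = [rng.random() for _ in range(7)]

# One token for language, one per point, one for proprioception.
tokens = (
    [embed(language_feature, lang_proj)]
    + [embed(point, point_proj) for point in point_cloud]
    + [embed(proprioception, prop_proj)]
)
```

Because every token has the same width, a single sequence model could attend across all three modalities; in a full policy that sequence would feed a transformer and an action head, which this sketch leaves out.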

Experiments

▶ Task Generation Ablation

▶ Real-World Experiments

BibTeX

      
@inproceedings{huagensim2,
  title={GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs},
  author={Hua, Pu and Liu, Minghuan and Macaluso, Annabella and Lin, Yunfeng and Zhang, Weinan and Xu, Huazhe and Wang, Lirui},
  booktitle={8th Annual Conference on Robot Learning},
  year={2024}
}

Acknowledgement

We would like to thank Professor Xiaolong Wang for his kind support and discussion of this project. We thank Yuzhe Qin and Fanbo Xiang for their generous help with SAPIEN development. We thank Mazeyu Ji for his help with real-world experiments. Many ideas are inspired by GenSim.
