SYNERGAI: Perception Alignment for Human-Robot Collaboration
Abstract
Recently, large language models (LLMs) have shown strong potential in facilitating human-robot interaction and collaboration. However, existing LLM-based systems often overlook the misalignment between human and robot perception, which hinders effective communication and real-world robot deployment. To address this issue, we introduce SYNERGAI, a unified system designed to achieve both perceptual alignment and human-robot collaboration. At its core, SYNERGAI employs a 3D Scene Graph (3DSG) as its explicit and innate representation. This enables the system to leverage an LLM to break down complex tasks and invoke appropriate tools at intermediate steps to extract relevant information from the 3DSG, modify its structure, or generate responses. Importantly, SYNERGAI incorporates an automatic mechanism that corrects perceptual misalignment with users by updating its 3DSG through online interaction. SYNERGAI achieves performance comparable to data-driven models on ScanQA in a zero-shot manner. Through comprehensive experiments across 10 real-world scenes, SYNERGAI demonstrates its effectiveness in establishing common ground with humans, realizing a success rate of 61.9% in alignment tasks. It also significantly improves the success rate on novel tasks from 3.7% to 45.68% by transferring the knowledge acquired during alignment.
Overview
Leveraging the 3DSG as its representation, SYNERGAI decomposes complex tasks with LLMs and acts with our designed tools at intermediate steps. It interacts with humans through natural language and non-verbal mouse clicks to disambiguate object references, facilitating human-robot collaboration and perceptual alignment by automatically modifying the data stored in the 3DSG.
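The automatic correction mechanism can be illustrated with a minimal sketch: when the user corrects a mislabeled object, the corresponding 3DSG node is updated and the correction persists for later tasks. All names here (`SceneGraph`, `correct_label`, the node IDs) are illustrative assumptions, not the actual SYNERGAI API.

```python
class SceneGraph:
    """Toy 3D scene graph: nodes are objects, edges are spatial relations."""

    def __init__(self):
        self.nodes = {}   # node_id -> attribute dict
        self.edges = []   # (src_id, relation, dst_id)

    def add_object(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def correct_label(self, node_id, new_label):
        """Apply a user-provided correction, persisting it in the graph."""
        old = self.nodes[node_id].get("label")
        self.nodes[node_id]["label"] = new_label
        self.nodes[node_id]["user_verified"] = True  # mark as human-checked
        return old

# Usage mirroring the running example: the robot mislabels the object
# on the blue box as a "book"; the user clicks it and corrects the label.
sg = SceneGraph()
sg.add_object("obj_12", label="book", color="red")
sg.add_object("obj_7", label="blue box")
sg.edges.append(("obj_12", "on", "obj_7"))

previous = sg.correct_label("obj_12", "notebook")
print(previous, "->", sg.nodes["obj_12"]["label"])  # book -> notebook
```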
System Design
SYNERGAI represents a 3D scene with a 3DSG and leverages LLMs to respond to user inputs. It is first prompted to generate a plan that decomposes the input task into sub-tasks to be solved sequentially. At each step, SYNERGAI selects a tool as its action based on the current observation, which contains the results of previous actions. In this example, the system identifies the correct object for the relationship “on the blue box” but incorrectly recognizes it as a book, revealing a perceptual misalignment.
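The plan-then-act loop described above can be sketched as follows; the function names, the plan format, and the stub tools are illustrative assumptions rather than the system's exact interfaces.

```python
def run_task(plan, tools):
    """Execute an LLM-generated tool plan sequentially.

    plan:  list of (tool_name, argument) pairs produced by the planner.
    tools: dict mapping tool name -> callable(argument, observations).
           Each step can inspect the observations accumulated so far.
    """
    observations = []
    for tool_name, arg in plan:
        result = tools[tool_name](arg, observations)
        observations.append((tool_name, result))  # feeds the next action
    return observations

# Toy run mirroring the example: locate the object "on the blue box",
# then answer the user based on the query result.
tools = {
    "query_relation": lambda rel, obs: "obj_12",  # stub 3DSG lookup
    "respond": lambda _, obs: f"The object is {obs[-1][1]}.",
}
plan = [("query_relation", "on the blue box"), ("respond", None)]
print(run_task(plan, tools)[-1][1])  # The object is obj_12.
```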
Tool Design
The tools developed for SYNERGAI support accessing relevant information from the 3DSG (the top five), modifying its data (the middle four), and generating responses to the user (the last two).
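One common way to expose such a tool set to an LLM is a registry whose doc-strings double as the tool descriptions in the prompt; the sketch below assumes this pattern, and all tool names and categories are illustrative, not the paper's actual tool list.

```python
TOOLS = {}

def tool(category):
    """Decorator that registers a function as an LLM-callable tool."""
    def register(fn):
        TOOLS[fn.__name__] = {"fn": fn, "category": category,
                              "doc": fn.__doc__.strip()}
        return fn
    return register

@tool("access")
def find_object(name):
    """Retrieve 3DSG nodes whose label matches `name`."""
    ...

@tool("modify")
def update_label(node_id, label):
    """Overwrite the label of a 3DSG node, e.g. after a user correction."""
    ...

@tool("respond")
def answer(text):
    """Return a natural-language answer to the user."""
    ...

def tool_prompt():
    """Render the registry as the tool section of the system prompt."""
    return "\n".join(f"- {name} ({t['category']}): {t['doc']}"
                     for name, t in TOOLS.items())

print(tool_prompt())
```

Keeping the prompt text next to the implementation this way means adding a tool automatically updates what the LLM sees.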
Human-Robot Collaboration
Qualitative results for various 3D reasoning tasks.
Zero-shot testing on the ScanQA dataset.
Human-Robot Alignment
Quantitative results of human-robot alignment. “SR” denotes the success rate, “RR” the rate of reasonable responses, and “QR” the query ratio of the 3DSG. Our system can establish common ground with humans, realizing a success rate of 61.9% in alignment tasks.
Statistics of alignment experiments.
Knowledge transfer to novel tasks, reported in success rate (%) as measured by an LLM. This demonstrates the capability to transfer knowledge acquired during alignment to novel tasks.
Alignment Task List and Examples
Examples of EASY and HARD alignment tasks. We devise 42 tasks, related to perceptual concepts and posed as question answering, to assess human-robot alignment.
A screenshot for our interface. The left part consists of the reconstructed 3D scene, the local 3DSG for the object of interest (bottom left), and object segmentation (bottom middle). The user can chat with our system and also select an object by clicking it in the reconstruction panel or the 3DSG. In this example, the user marks the door.
Full list of alignment tasks designed for the ScanNet dataset, Part 1/2.
Full list of alignment tasks designed for the ScanNet dataset, Part 2/2.
The novel tasks designed to test if the knowledge acquired during alignment is transferable.
Scene Reconstruction and 3DSG
Qualitative results of 3D reconstruction and segmentation. The figure shows that different methods reveal limitations and failures in both reconstruction and segmentation.
Relationships captured in the 3DSG. The 3DSG is defined as a hierarchical graph, where each node represents one 3D object and the edges represent spatial relationships between nodes. We instantiate the nodes from the instance segmentation and traverse all node pairs to determine their spatial relationships. Each object's visual and physical properties are obtained through BLIP2, which generates information about the object's color, shape, material, affordance, etc.
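The pairwise spatial-relationship traversal can be sketched with axis-aligned bounding boxes; the helpers and the contact threshold below are illustrative assumptions, not the system's actual geometric rules.

```python
def aabb(points):
    """Axis-aligned bounding box of an Nx3 point list: (min_xyz, max_xyz)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def is_on(top_box, bottom_box, touch_tol=0.05):
    """True if `top_box` rests on `bottom_box`: the top object's bottom
    face touches the bottom object's top face (within `touch_tol` metres)
    and their horizontal footprints overlap."""
    (tx0, ty0, tz0), (tx1, ty1, _) = top_box
    (bx0, by0, _), (bx1, by1, bz1) = bottom_box
    vertical_contact = abs(tz0 - bz1) <= touch_tol
    overlap_x = tx0 < bx1 and bx0 < tx1
    overlap_y = ty0 < by1 and by0 < ty1
    return vertical_contact and overlap_x and overlap_y

# Usage: a small object sitting on top of a larger box.
book = ((0.1, 0.1, 1.0), (0.3, 0.3, 1.2))
box = ((0.0, 0.0, 0.0), (0.5, 0.5, 1.0))
print(is_on(book, box))  # True
print(is_on(box, book))  # False
```

Running such a check over all node pairs (with analogous tests for "next to", "above", etc.) yields the relation edges of the graph.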
Prompts and Doc-strings
The major system prompts in SYNERGAI.
