M3C

Enabling Chatbots with Eyes and Ears:
An Immersive Multimodal Conversation System for Dynamic Interactions

Jihyoung Jang¹*, Minwook Bae²*, Minji Kim¹, Dilek Hakkani-Tür³, Hyounghun Kim¹

¹POSTECH, ²UNIST, ³UIUC, *Equal contribution.
Proceedings of ACL 2025


Abstract

As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation (M³C), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on M³C, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model’s strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.

A sample of our M³C. The main speaker (Alex) engages in conversation with two different partners per session, where all speakers simultaneously experience the provided multimodal inputs in a shared environment. At the end of each session, the main speaker collects each partner’s memory from their own perspective and utilizes this information to guide the conversation in subsequent sessions. In later sessions, Alex can encounter new partners and continue the interaction. The memories referenced when generating utterances are marked with symbols, where shared or connected memories are indicated by the same symbol.

Dataset Comparison

We propose M³C, a machine-generated multimodal conversation dataset uniquely designed for multi-session, multi-speaker, and multimodal (image & audio) interactions. Unlike previous datasets that typically support only single-session or two-speaker conversations, M³C features:

Three sessions per episode, capturing temporal continuity,
Four speakers per episode, enabling diverse partner interactions,
Both image and audio modalities, grounding conversations in shared perceptual context.
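The structure listed above can be pictured as a small data model. The following is only a minimal sketch, not the released dataset schema: the class names, the field names (`image_id`, `audio_id`), and the partner names other than Alex are hypothetical, chosen for illustration. It encodes three sessions per episode, two partners per session, and four distinct speakers per episode.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Utterance:
    speaker: str
    text: str
    image_id: Optional[str] = None  # hypothetical reference to a shared image
    audio_id: Optional[str] = None  # hypothetical reference to a shared audio clip


@dataclass
class Session:
    partners: List[str]  # the two partners the main speaker talks to
    utterances: List[Utterance] = field(default_factory=list)


@dataclass
class Episode:
    main_speaker: str
    sessions: List[Session]

    def speakers(self) -> Set[str]:
        """Return all distinct speakers, including the main speaker."""
        names = {self.main_speaker}
        for session in self.sessions:
            names.update(session.partners)
        return names

    def validate(self) -> None:
        """Raise ValueError if the episode breaks the assumed structure."""
        if len(self.sessions) != 3:
            raise ValueError("expected three sessions per episode")
        for session in self.sessions:
            if len(set(session.partners)) != 2:
                raise ValueError("expected two distinct partners per session")
            if self.main_speaker in session.partners:
                raise ValueError("main speaker cannot be their own partner")
        if len(self.speakers()) != 4:
            raise ValueError("expected four speakers per episode")


# Illustrative episode: Alex meets a different pair of partners in each session.
episode = Episode(
    main_speaker="Alex",
    sessions=[
        Session(partners=["Blake", "Casey"]),
        Session(partners=["Blake", "Dana"]),
        Session(partners=["Casey", "Dana"]),
    ],
)
episode.validate()
```

Keeping the checks in `validate` rather than in the constructor lets partially built episodes exist while data is still being loaded.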

Our dataset contains 54K episodes and 2.5M dialogue turns, significantly expanding the scale and depth of multimodal conversational benchmarks. The comparison below highlights how M³C differs from existing datasets across structure, modality, and scale.

Dataset	Type	Multiple Sessions	Multiple Speakers	Image	# of Images	Audio	# of Audios	# of Sessions	# of Turns
AMI	Open-Domain	❌	✔	❌		✔	-	279	-
VisDial	Modality-QA	❌	❌	✔	(120K)	❌		123K	2.4M
MELD	Open-Domain	❌	✔	✔	-	✔	-	1.4K	13K
ImageChat	Modality-Centric	❌	❌	✔	(202K)	❌		202K	401K
MMConv	Modality-Centric	❌	❌	✔	(114K)	❌		5.1K	39.8K
PhotoChat	Open-Domain	❌	❌	✔	(10.9K)	❌		12K	156K
MMDD	Modality-Centric	❌	❌	✔	(13K)	❌		17K	-
MMDialog	Modality-Centric	❌	❌	✔	(1.53M)	❌		1.08M	4.92M
MPCHAT	Modality-Centric	❌	❌	✔	(153K)	❌		15K	42.5K
Audio Dialogues	Modality-QA	❌	❌	❌		✔	-	163K	-
MiSC	Open-Domain	✔	✔	❌		❌		51K	-
DialogCC	Open-Domain	❌	❌	✔	(129.8K)	❌		83K	-
LOCOMO	Open-Domain	✔	❌	✔	(2K)	❌		1.7K	-
Stark	Open-Domain	✔	❌	✔	(900K)	❌		500K	-
Ours	Open-Domain	✔	✔	✔	(24K)	✔	(73K)	16K	2.5M

Type: Modality-QA = question-answering, Modality-Centric = modality-centered (e.g., image/audio), Open-Domain = general conversation.
Note: '-' means unreported data.

Model Architecture

We propose a multimodal, multi-session, multi-party conversation model that perceives both images and audio, enabling the system to engage in conversations as if it had "eyes and ears." The model is designed to maintain coherence across sessions while interacting with different speakers in a shared environment.
Our architecture consists of two main modules:

Dialogue Module: Generates responses grounded in the current multimodal context, constructs session memories, and links past interactions to maintain dialogue consistency.
Retriever Module: Retrieves relevant multimodal memories—spanning image, audio, and text modalities—from prior sessions to inform ongoing conversations.

By integrating these modules, the model ensures temporally aware, partner-sensitive, and modality-grounded interactions.
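To make the retrieval step concrete, here is a minimal sketch of memory retrieval over mixed-modality entries. It assumes that every memory, whatever its modality, has already been mapped into one shared embedding space, and that relevance is scored by cosine similarity. Both are illustrative assumptions rather than details stated on this page. The `Memory` class, the `retrieve` function, the toy three-dimensional vectors, and the example memories are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    session: int                # session in which the memory was collected
    speaker: str                # partner the memory belongs to
    modality: str               # "text", "image", or "audio"
    content: str                # memory text, or a description of the image/audio
    embedding: Sequence[float]  # toy vector standing in for a learned encoder output


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float], memories: List[Memory], top_k: int = 2) -> List[Memory]:
    """Return the top_k memories most similar to the query embedding."""
    ranked = sorted(memories, key=lambda m: cosine(query, m.embedding), reverse=True)
    return ranked[:top_k]


# Made-up memories from earlier sessions, one per modality.
memories = [
    Memory(1, "Blake", "text", "Blake adopted a puppy", [0.9, 0.1, 0.0]),
    Memory(1, "Casey", "audio", "dog barking nearby", [0.8, 0.3, 0.1]),
    Memory(2, "Dana", "image", "photo of a rainy street", [0.0, 0.2, 0.9]),
]

# Made-up embedding of the current dialogue context.
query = [1.0, 0.2, 0.0]
top = retrieve(query, memories, top_k=2)
for memory in top:
    print(memory.session, memory.speaker, memory.modality, memory.content)
```

In a full system the retrieved memories would be passed to the dialogue module as extra context for the next response. That hand-off is omitted here.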

Examples of datasets and our model's responses

Each episode in the dataset consists of three sessions

Live human chat examples comparing model responses to various modalities

More details coming soon!
