
Model Architecture

We propose a multimodal, multi-session, multi-party conversation model that perceives both images and audio, enabling the system to converse as if it had "eyes and ears." The model is designed to stay coherent across sessions while interacting with different speakers in a shared environment.
Our architecture consists of two main modules:
  • Dialogue Module: Generates responses grounded in the current multimodal context, constructs session memories, and links past interactions to maintain dialogue consistency.
  • Retriever Module: Retrieves relevant multimodal memories—spanning image, audio, and text modalities—from prior sessions to inform ongoing conversations.
By integrating these modules, the model supports temporally aware, partner-sensitive, and modality-grounded interactions.
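The interplay between the two modules can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the names `Memory`, `Retriever`, and `DialogueModule` are hypothetical, each image or audio memory is represented by a short text description, and bag-of-words cosine similarity stands in for whatever learned multimodal retrieval the actual system uses.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Memory:
    """One stored memory (hypothetical structure, for illustration only)."""
    session_id: int
    speaker: str
    modality: str      # "image", "audio", or "text"
    description: str   # textual stand-in for the memory content


def _bow(text):
    # Lowercased bag-of-words vector.
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Retriever:
    """Toy retriever: ranks stored memories by word overlap with the query."""

    def __init__(self):
        self.memories = []

    def add(self, memory):
        self.memories.append(memory)

    def retrieve(self, query, speaker=None, top_k=2):
        q = _bow(query)
        # Optionally restrict to memories tied to the current partner.
        scored = [
            (_cosine(q, _bow(m.description)), m)
            for m in self.memories
            if speaker is None or m.speaker == speaker
        ]
        # Keep only memories with some overlap, best match first.
        scored = [(s, m) for s, m in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for _, m in scored[:top_k]]


class DialogueModule:
    """Toy dialogue module: recalls memories, replies, stores a new memory."""

    def __init__(self, retriever):
        self.retriever = retriever

    def respond(self, session_id, speaker, utterance):
        recalled = self.retriever.retrieve(utterance, speaker=speaker)
        # A real system would condition a generative model on the recalled
        # memories; here we only build a placeholder string.
        if recalled:
            context = "; ".join(
                f"[{m.modality}, session {m.session_id}] {m.description}"
                for m in recalled
            )
            reply = f"(to {speaker}) recalled: {context}"
        else:
            reply = f"(to {speaker}) no prior memory"
        # Store the current turn so later sessions can refer back to it.
        self.retriever.add(Memory(session_id, speaker, "text", utterance))
        return reply
```

In this sketch, filtering by `speaker` is one simple way to make retrieval partner-sensitive, and storing `session_id` with each memory lets a reply refer back to when something was seen or heard.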
Teaser

Examples of datasets and our model's responses

Each episode in the dataset consists of three sessions

Live human chat examples comparing model responses to various modalities

More details coming soon!

Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions
Jihyoung Jang, Minwook Bae, Minji Kim, Dilek Hakkani-Tür, Hyounghun Kim
Proceedings of ACL 2025