Tutorials
Tutorial 1
Title: Multimodal Learning for Spatio-Temporal Data Mining
Presenter: Yuxuan Liang, Siru Zhong, Xixuan Hao (Hong Kong University of Science and Technology, Guangzhou), Hao Miao (Aalborg University), Yan Zhao (University of Electronic Science and Technology of China), Qingsong Wen (Squirrel AI, USA), Roger Zimmermann (National University of Singapore)
Date: 27.10.2025
Abstract: Spatio-temporal data mining has become a critical research area in multimedia, driven by the increasing availability of multimodal data from diverse sources, such as remote sensing satellites, radar systems, IoT sensors, and multimedia content (e.g., social media, street-level imagery, and video surveillance). While traditional approaches based on single-modal spatio-temporal data have achieved notable success, they often struggle to capture the full complexity of real-world scenarios. Integrating multiple data modalities enhances spatio-temporal data mining by providing richer, more accurate insights that traditional methods cannot achieve. This half-day tutorial, titled Multimodal Learning for Spatio-Temporal Data Mining, will explore how multimodal learning can transform spatio-temporal analysis. Topics will include fundamentals of spatio-temporal mining, challenges of integrating heterogeneous data, state-of-the-art multimodal modeling techniques, and emerging research trends in this field. Attendees will be equipped with the knowledge and tools to develop scalable, robust solutions for spatio-temporal data mining. All materials will be available online.
Tutorial 2
Title: Perceptually Inspired Visual Quality Assessment in Multimedia Communication
Presenter: Wei Zhou (Cardiff University, UK), Hadi Amirpour (Klagenfurt University, Austria)
Date: 27.10.2025
Abstract: As multimedia services like video streaming, video conferencing, virtual reality (VR), and online gaming continue to expand, ensuring high perceptual quality becomes a priority for maintaining user satisfaction and competitiveness. However, during acquisition, compression, transmission, and storage, multimedia content undergoes various distortions, causing degradation in experienced quality. Thus, perceptual quality assessment, which focuses on evaluating the quality of multimedia content based on human perception, is essential for optimizing user experiences in advanced communication systems. Several challenges are involved in the quality assessment process, including diverse characteristics of multimedia content such as image, video, VR, point cloud, mesh, multimodality, etc., and complex distortion scenarios as well as viewing conditions. The tutorial first presents a detailed overview of principles and methods for perceptually inspired visual quality assessment. This includes both subjective methods, where users directly rate their experience, and objective methods, where algorithms predict human perception based on measurable factors such as bitrate, frame rate, and compression levels. Based on the basics of perceptually inspired visual quality assessment, metrics for different multimedia data are then introduced. Apart from the traditional image and video, immersive multimedia and AI-generated content will also be involved.
Tutorial 3 (Cancelled)
Title: Reasoning and Planning for Multimodal Large Language Models: A Multilingual and Cross-Domain Exploration
Presenter: Akash Ghosh, Sriparna Saha (Indian Institute of Technology Patna, India), Koustava Goswami, Joseph K J (Adobe Research Bangalore, India)
Date: 27.10.2025
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly improved reasoning and decision-making across text, vision, and audio. This tutorial explores the core principles, methodologies, and real-world applications of MLLM reasoning, with a focus on multilingual and cross-domain capabilities. We cover key challenges such as modality alignment and fusion strategies, demonstrating how MLLMs enhance reasoning and planning in cross-lingual and cross-domain scenarios. A live demonstration will showcase MLLMs in action across diverse applications. This session is designed for researchers and practitioners, equipping them with essential tools to effectively integrate MLLMs into their work.
Tutorial 4
Title: Combating Online Misinformation Videos: Characterization, Detection, and Prevention
Presenter: Qiang Sheng, Yuyan Bu, Tianyun Yang, Juan Cao (Institute of Computing Technology, China), Peng Qi, Wynne Hsu, Mong Li Lee (National University of Singapore)
Date: 27.10.2025
Abstract: Recent progress in generative AI and the popularity of short-form video-sharing platforms have raised new risks around misinformation videos, posing a potential threat to online multimedia ecosystems. With the aid of generative AI tools, producing and spreading vivid, persuasive misinformation videos has become easier, while detecting and preventing them has become harder. Our half-day tutorial explores how to characterize, detect, and prevent misinformation videos, and consists of three technical parts: 1) Characterization of AI-generated and human-edited misinformation videos; 2) Detection approaches, covering those tailored for fully generated, manipulated, and human-edited videos; and 3) Prevention strategies, including those effective for the creation and spread phases. The tutorial concludes by discussing the status quo and ongoing challenges and highlighting promising directions for future research. We expect to bring broader attention to misinformation video issues, gather and communicate with interested researchers, and facilitate the engagement of those who are new to this field. Participants will have access to all materials and gain insights into combating misinformation videos.
Tutorial 5
Title: Video Question Answering and Beyond
Presenter: Yicong Li, Junbin Xiao, Angela Yao, Tat Seng Chua (National University of Singapore)
Date: 28.10.2025
Abstract: The proliferation of video content across various platforms has dramatically increased the need for intelligent systems that can interpret and understand videos. Video Question Answering (VideoQA) is a crucial task in this context, as it involves comprehensively analyzing video content to provide accurate and contextually relevant answers to user queries. The complexity of VideoQA lies in its requirement to integrate multiple modalities such as vision, language, and temporal dynamics, making it a rich area of research with numerous practical applications. Despite significant advances, VideoQA remains a challenging problem due to the inherent difficulties in video understanding, the need for robust and scalable models, the necessity of effective temporal modeling, and the practical techniques for cross-modal learning. A tutorial dedicated to VideoQA will provide an invaluable opportunity to bridge the gap between the current state-of-the-art techniques and the broader multimedia community, fostering collaboration and innovation.
Tutorial 6
Title: AI-based Multimedia Data Compression: Perception Utility Optimization and Standardization
Presenter: Wei Gao, Ge Li (Peking University, China)
Date: 28.10.2025
Abstract: Diverse 2D and 3D visual media data are widely used in applications such as UHDTV broadcasting, mobile phones, digital entertainment, autonomous driving, robots, and UAVs. The large-scale data volume has driven research and development in 2D image and video coding as well as 3D point cloud coding. Compression technologies can efficiently reduce the data size to relieve the burden of communication and storage. Traditional, non-learning-based techniques have achieved significant success in improving compression efficiency by devising dedicated tools to remove redundancies among 2D and 3D visual signals. However, such approaches have recently become less efficient due to their large computation cost and diminishing coding gains. Fortunately, we have witnessed great progress in deep learning theories, methods, and applications, and deep learning-based image, video, and point cloud coding methods have been shown to deliver better coding performance than traditional methods. By adopting effective neural network structures and training strategies, end-to-end methods can significantly outperform non-learning methods, and they have therefore become a mainstream research direction in visual data coding. Along with emerging scenarios, the applications of 2D and 3D visual media data can generally be divided into two types, human perception and machine perception, for which compression algorithms can be optimized. The development of imaging sensors and emerging applications makes data storage and transmission much more difficult to handle. This tutorial will discuss diverse aspects of multimedia data compression, including 2D and 3D visual data acquisition, perception models, deep learning-based coding methods, AI-based standards, open-source projects, and future research.
