I am a graduate student in the Graduate School of Artificial Intelligence integrated M.S. & Ph.D. program at POSTECH. I am a member of the Computer Vision Lab at POSTECH, working with Minsu Cho.
Currently, I am a visiting researcher in the IMAGINE group at École des Ponts ParisTech (ENPC), working with Gül Varol.
Previously, I completed my B.S. in Convergence IT Engineering at POSTECH.
My research interests lie in computer vision and deep learning, especially in understanding temporal and semantic relations between actions in long-term videos.
I’ve worked on long-term action anticipation.
If you are interested in my research projects, please feel free to contact me by clicking one of the icons below.
News
Jun 25, 2025
📝 Our paper ‘Generic Event Boundary Detection via Denoising Diffusion’ is accepted to ICCV 2025.
Feb 27, 2025
📝 Our paper ‘Video Summarization with Large Language Models’ is accepted to CVPR 2025.
Sep 27, 2024
📝 Our paper ‘ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation’ is accepted to NeurIPS 2024. See you in Vancouver 🇨🇦!
Generic Event Boundary Detection via Denoising Diffusion
International Conference on Computer Vision (ICCV), 2025
Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries, conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled during denoising. In addition, we introduce a new evaluation metric to assess the quality of predictions considering both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, TAPOS and Kinetics-GEBD, generating diverse and plausible event boundaries.
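For readers curious how the two ingredients above fit together, here is a minimal sketch of a temporal self-similarity encoding and one classifier-free-guided denoising step; the `model` interface, tensor shapes, and guidance scale are illustrative assumptions, not the actual DiffGEBD implementation.

```python
import torch
import torch.nn.functional as F

def temporal_self_similarity(frame_features):
    """Frame-to-frame cosine similarity matrix, (T, D) -> (T, T)."""
    normed = F.normalize(frame_features, dim=-1)
    return normed @ normed.T

def cfg_denoise_step(model, noisy_boundaries, t, cond, guidance_scale=2.0):
    """One denoising step with classifier-free guidance over boundary estimates."""
    pred_cond = model(noisy_boundaries, t, cond)                      # conditioned on the video
    pred_uncond = model(noisy_boundaries, t, torch.zeros_like(cond))  # null condition
    # Larger guidance pushes toward the condition (fidelity); smaller keeps more diversity.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```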
Video Summarization with Large Language Models
Min Jung Lee,
Dayoung Gong, and Minsu Cho
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
The exponential increase in video content poses significant challenges for efficient navigation, search, and retrieval, calling for advanced video summarization techniques. Existing video summarization methods, which rely heavily on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle this challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes.
Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using an image caption model and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.
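A minimal sketch of this local-then-global scoring idea is given below; `llm_score` is a hypothetical placeholder for the LLM call, and the refiner is a generic attention module, not the actual LLMVS components.

```python
import torch
import torch.nn as nn

def local_llm_scores(captions, llm_score, window=5):
    """Score each frame's importance from captions in a sliding local window."""
    scores = []
    for i in range(len(captions)):
        lo, hi = max(0, i - window // 2), min(len(captions), i + window // 2 + 1)
        scores.append(llm_score(captions[lo:hi], center=i - lo))  # hypothetical LLM call
    return torch.tensor(scores, dtype=torch.float32)

class GlobalRefiner(nn.Module):
    """Refine local scores with self-attention over the whole caption sequence."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, caption_embeddings, local_scores):
        # caption_embeddings: (1, T, dim); local_scores: (T,)
        ctx, _ = self.attn(caption_embeddings, caption_embeddings, caption_embeddings)
        return local_scores + self.head(ctx).squeeze(-1).squeeze(0)
```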
ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation
Dayoung Gong, Suha Kwak, and Minsu Cho
Conference on Neural Information Processing Systems (NeurIPS), 2024
Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos.
Despite their apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion.
The key idea to unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner;
the visible part is for temporal segmentation, and the invisible part is for future anticipation.
To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future.
Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation.
ActFusion achieves state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models on both tasks with a single unified model through joint learning.
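Below is a minimal sketch of the anticipative masking idea, assuming pre-extracted frame features of shape (T, D); the visible ratio and the rest of the training pipeline are illustrative assumptions and differ from the actual ActFusion code.

```python
import torch
import torch.nn as nn

class AnticipativeMasking(nn.Module):
    """Replace a late span of frame features with a learnable token for anticipation."""
    def __init__(self, dim):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, frame_features, visible_ratio=0.5):
        # frame_features: (T, D); the early part stays visible for segmentation,
        # the late part is hidden so the model must anticipate it.
        T = frame_features.size(0)
        n_visible = int(T * visible_ratio)
        masked = frame_features.clone()
        masked[n_visible:] = self.mask_token  # learnable stand-in for unseen future frames
        return masked, n_visible
```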
Activity Grammars for Temporal Action Segmentation
Dayoung Gong*, Joonseok Lee*, Deunsol Jung, Suha Kwak, and Minsu Cho
Conference on Neural Information Processing Systems (NeurIPS), 2023
LPVL Workshop @ CVPR, 2024
BK21 Best Paper Award Winner, 2024
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason.
This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation.
We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules.
Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure.
Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
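As a toy stand-in for grammar-guided decoding (not the generalized parser from the paper), the sketch below assumes candidate action transcripts have already been generated from an induced grammar and aligns each to frame-level log-probabilities with dynamic programming.

```python
import numpy as np

def best_alignment_score(frame_logprobs, transcript):
    """Best log-prob of segmenting T frames into the given ordered action transcript
    (each action must cover at least one frame), via dynamic programming."""
    T = frame_logprobs.shape[0]
    K = len(transcript)
    dp = np.full((T + 1, K + 1), -np.inf)
    dp[0, 0] = 0.0
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            emit = frame_logprobs[t - 1, transcript[k - 1]]
            # Frame t-1 either extends segment k or opens it right after segment k-1.
            dp[t, k] = emit + max(dp[t - 1, k], dp[t - 1, k - 1])
    return dp[T, K]

def grammar_guided_decode(frame_logprobs, candidate_transcripts):
    """Pick the grammar-licensed transcript that best explains the frame-level scores."""
    return max(candidate_transcripts,
               key=lambda tr: best_alignment_score(frame_logprobs, tr))
```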
Future Transformer for Long-term Action Anticipation
Dayoung Gong, Joonseok Lee, Manjin Kim, Seong Jong Ha, and Minsu Cho
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
The task of predicting future actions from a video is crucial for a real-world agent interacting with others.
When anticipating actions in the distant future, humans typically consider long-term relations over the whole sequence of actions, i.e., not only observed actions in the past but also potential actions in the future.
In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict a minutes-long sequence of future actions.
Unlike previous autoregressive models, the proposed method learns to predict the whole sequence of future actions in parallel decoding, enabling more accurate and faster inference for long-term anticipation.
We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads, achieving state-of-the-art results.
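A minimal sketch of parallel decoding with learned future queries is shown below, using a standard transformer decoder as a stand-in; the dimensions, number of queries, and output heads are assumptions and differ from the actual FUTR model.

```python
import torch
import torch.nn as nn

class ParallelFutureDecoder(nn.Module):
    """Predict all future action segments at once from learned query tokens."""
    def __init__(self, dim=256, num_queries=8, num_actions=48, heads=4, layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.action_head = nn.Linear(dim, num_actions)  # action class per future segment
        self.duration_head = nn.Linear(dim, 1)          # relative duration per segment

    def forward(self, observed_features):
        # observed_features: (B, T_obs, dim), encoded past frames.
        B = observed_features.size(0)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Cross-attention over the whole observation; no autoregressive loop.
        decoded = self.decoder(queries, observed_features)
        return self.action_head(decoded), self.duration_head(decoded).squeeze(-1)
```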
Professional Services
Technical Chair
Asian Conference on Computer Vision (ACCV), 2024
Workshop Organizer
Women in Computer Vision in Asian Conference on Computer Vision (ACCV), 2024
Conference Reviewer
Conference on Neural Information Processing Systems (NeurIPS), 2024
European Conference on Computer Vision (ECCV), 2024
IEEE/CVF International Conference on Computer Vision (ICCV), 2023, 2025
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022-2024
Journal Reviewer
IEEE Transactions on Multimedia (TMM), 2024-2025
International Journal of Computer Vision (IJCV), 2025
Honors and Awards
BK21 Best Paper Award (Excellence Award), POSTECH GSAI, 2023, 2024, 2025