We evaluate A2Summ on two multimodal summarization datasets (CNN, Daily_Mail) and two standard video summarization datasets (SumMe, TVSum).
We also collect a large-scale multimodal summarization dataset, BLiSS, which consists of livestream videos and transcripts with annotated summaries.
Before running the code, please download the pre-processed datasets from the Google Drive link.
Unzip it under the data/ folder and make sure the data structure matches the layout below.
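A plausible layout after unzipping is sketched below, assuming one subfolder per dataset; the exact subfolder and file names may differ from the released archives, so follow the structure shipped in the download.

data/
├── SumMe/
├── TVSum/
├── CNN/
├── Daily_Mail/
└── BLiSS/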
For the BLiSS dataset, due to copyright issues, we only provide the extracted video/thumbnail features instead of the original videos/thumbnails. If you need access to the original videos, please email me (bohe@umd.edu) for the public URLs of each video.
Running
Training
We train the model on a single GTX 1080 Ti GPU. To train the model on a different dataset, execute the following command.
python train.py --dataset ${dataset}
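For example, to train on SumMe (assuming the --dataset argument accepts the dataset names listed above; check train.py for the exact accepted values):

python train.py --dataset SumMe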
Testing
First, download the checkpoints into the saved_model/ directory and pass the checkpoint path through the checkpoint flag.
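A minimal sketch of a test command is shown below; the script name and --ckpt_path flag are illustrative assumptions, so check the repository's testing script for the exact interface.

python test.py --dataset ${dataset} --ckpt_path saved_model/${dataset}.pt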
If you find our code or paper useful for your research, please star this repo and cite the following paper:
@inproceedings{he2023a2summ,
  title     = {Align and Attend: Multimodal Summarization with Dual Contrastive Losses},
  author    = {He, Bo and Wang, Jun and Qiu, Jielin and Bui, Trung and Shrivastava, Abhinav and Wang, Zhaowen},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}