
Downloads

Dataset publicly available for research purposes

Data download

Extracted features:

Annotations (train, val, and test sets): available for download on GitHub.

We provide video-level annotations for the training set, and both video-level and event-level annotations for the validation and testing sets. The annotation files are stored in CSV format; a minimal loading sketch follows the file list below.

  • Training set
    train_audio_weakly.csv: video-level audio annotations of the training set
    train_visual_weakly.csv: video-level visual annotations of the training set
    train_weakly.csv: video-level annotations (union of the video-level audio and visual annotations) of the training set

  • Validation set
    val_audio_weakly.csv: video-level audio annotations of the validation set
    val_visual_weakly.csv: video-level visual annotations of the validation set
    val_weakly_av.csv: video-level annotations (union of the video-level audio and visual annotations) of the validation set
    val_audio.csv: event-level audio annotations of the validation set
    val_visual.csv: event-level visual annotations of the validation set

  • Testing set
    test_audio_weakly.csv: video-level audio annotations of the testing set
    test_visual_weakly.csv: video-level visual annotations of the testing set
    test_weakly_av.csv: video-level annotations (union of the video-level audio and visual annotations) of the testing set
    test_audio.csv: event-level audio annotations of the testing set
    test_visual.csv: event-level visual annotations of the testing set
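
A minimal loading sketch (not an official loader): it assumes only the file names listed above and makes no assumption about the column layout of each CSV.

    # Read one of the released LFAV annotation CSVs and inspect its columns.
    import pandas as pd

    def load_annotations(csv_path: str) -> pd.DataFrame:
        """Load an LFAV annotation CSV into a pandas DataFrame."""
        df = pd.read_csv(csv_path)
        print(f"{csv_path}: {len(df)} rows, columns = {list(df.columns)}")
        return df

    if __name__ == "__main__":
        # Example: video-level annotations of the training set.
        train_weak = load_annotations("train_weakly.csv")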


Publication(s)

If you find our work useful in your research, please cite our paper.

        
    @article{hou2023towards,
      title={Towards Long Form Audio-visual Video Understanding},
      author={Hou, Wenxuan and Li, Guangyao and Tian, Yapeng and Hu, Di},
      journal={ACM Transactions on Multimedia Computing, Communications and Applications},
      year={2023},
      publisher={ACM New York, NY}
    }
        
        

Disclaimer

The released LFAV dataset is curated and may therefore contain unintended correlations between instruments and geographical areas. This issue warrants further research and consideration.


Copyright Creative Commons License

All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.

An event-centric framework

We propose an event-centric framework that consists of three phases: snippet prediction, event extraction, and event interaction. First, we propose a pyramid multimodal transformer model to capture events of different temporal lengths, in which the audio and visual snippet features interact with each other within multi-scale temporal windows. Second, we model the video as structured event graphs according to the snippet predictions, based on which we refine the event-aware snippet-level features and aggregate them into event features. Finally, we study event relations by modeling the influence among multiple aggregated audio and visual events and then refining the event features. The three phases progressively achieve a comprehensive understanding of the video content as well as the event relations, and are jointly optimized with video-level event labels in an end-to-end fashion. We want to highlight that the inherent relations among multiple events are essential for understanding the temporal structure and dynamic semantics of long-form audio-visual videos, which has not been sufficiently considered in previous event localization works. More details can be found in the paper.
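In the event extraction phase described above, event-aware snippet features are aggregated into event features. The sketch below is only a rough illustration of such attention-weighted aggregation under assumed shapes and module names; it is not the released model, and the mask standing in for snippet predictions is hypothetical.

    # Illustrative only: pool the snippet features assigned to one event into a
    # single event feature using learned attention weights.
    import torch
    import torch.nn as nn

    class SnippetToEvent(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # per-snippet attention score

        def forward(self, snippets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            # snippets: (T, dim) snippet features of one video
            # mask:     (T,) 1.0 for snippets predicted to belong to the event, else 0.0
            logits = self.score(snippets).squeeze(-1)           # (T,)
            logits = logits.masked_fill(mask == 0, float("-inf"))
            weights = torch.softmax(logits, dim=0)              # attention over event snippets
            return (weights.unsqueeze(-1) * snippets).sum(0)    # (dim,) event feature

    event_pool = SnippetToEvent(dim=512)
    feats = torch.randn(200, 512)               # 200 snippets, 512-d features
    mask = (torch.rand(200) > 0.7).float()      # hypothetical snippet-to-event assignment
    event_feature = event_pool(feats, mask)
    print(event_feature.shape)                  # torch.Size([512])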

An illustration of our event-centric framework. Top: In the first phase, snippet prediction, we propose a pyramid multimodal transformer to generate the snippet features as well as their category predictions. Middle left: In the second phase, event extraction, we build an event-aware graph to refine the snippet features and then aggregate the event-aware snippet features into event features. Middle right: In the third phase, event interaction, we model event relations in both intra-modal and cross-modal scenarios and then refine each event feature by referring to its relations with other events. Bottom left: The architecture of temporal attention pooling, which outputs both snippet-level and video-level predictions; its internal attention weights are used to obtain event features in the event extraction phase. Bottom right: An equivalent form of window attention in the PMT layer, showing how the window operates in the first phase. The window splits the feature sequence into several sub-sequences, and attention is then performed within each sub-sequence. Here we show an example of self-attention; the operation in cross-modal attention is similar.
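The window attention mentioned in the caption above (bottom right) can be illustrated with a short sketch: the snippet sequence is split into fixed-size temporal windows, and self-attention runs inside each window. This is an assumption-laden illustration, not the released PMT layer; the window size, feature dimensions, and divisibility assumption are for the example only, and cross-modal attention would simply take keys and values from the other modality.

    # Split a snippet sequence into temporal windows and attend within each window.
    import torch
    import torch.nn as nn

    def window_self_attention(x: torch.Tensor, attn: nn.MultiheadAttention,
                              window: int) -> torch.Tensor:
        # x: (B, T, D); T is assumed to be divisible by `window` for simplicity
        B, T, D = x.shape
        xw = x.reshape(B * (T // window), window, D)  # split into sub-sequences
        out, _ = attn(xw, xw, xw)                     # attention within each window
        return out.reshape(B, T, D)

    attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
    snippets = torch.randn(2, 80, 256)                # 2 videos, 80 snippets each
    out = window_self_attention(snippets, attn, window=8)
    print(out.shape)                                  # torch.Size([2, 80, 256])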

Experiments and Analysis

Main experiment results

To validate the superiority of our proposed framework, we choose 16 related methods for comparison, including weakly supervised temporal action localization methods: STPN, RSKP; long sequence modeling methods: Longformer, Transformer-LS, ActionFormer, FLatten Transformer; audio-visual learning methods: AVE, AVSlowFast, HAN, PSP, DHHN, CMPAE, CM-PIE; video classification methods: SlowFast, MViT, and MeMViT.

Comparison to Other Methods. First, the temporal action localization and long sequence modeling methods aim to localize action events in untrimmed videos or to model long sequences effectively, but they ignore the valuable cooperation between the audio and visual modalities, which is important for achieving more comprehensive video event understanding. Second, although some methods take the audio signal into account, they are consistently worse than our method. This is likely because they mainly target trimmed short videos, resulting in limited modeling of long-range dependencies and event interactions. Third, our proposed method clearly outperforms all the comparison methods. Although some recent video classification methods achieve slightly better results on visual mAP, their overall performance still lags clearly behind ours, showing that our event-centric framework better localizes both audio and visual events in long-form audio-visual videos.

Effectiveness of Three Phases. Our full method consists of three progressive phases. The performance of the snippet prediction phase already surpasses most comparison methods, and the subsequent phases further improve localization performance. Results are shown in the last three rows of the table above, which indicates the potential importance of decoupling a long-form audio-visual video into multiple uni-modal events of different lengths and modeling their inherent relations in both uni-modal and cross-modal scenarios.


Visualization results

We visualize the event-level localization results in the videos; two examples are shown in the figure above. Compared with the audio-visual video parsing method HAN, our proposed method achieves better localization results. In some situations (e.g., the guitar event in both the audio and visual modalities of video 01, and the speech event in the audio modality of video 02), HAN tends to localize sparse and short video clips instead of a long and complete event, which shows that HAN has some limitations in understanding long-form videos. A possible reason is that HAN cannot learn long-range dependencies well.

We also notice that, although our proposed event-centric method achieves the best performance among all methods, there still exist some failure cases in the shown examples (red and black boxes in the figure). The multisensory events vary greatly in length and occur in dynamic long-range scenes, which makes multisensory temporal event localization a very challenging task, especially when only video-level labels are available for training. More experimental results and analysis can be found in the paper.

The Team

We are a group of researchers working on multimodal learning and computer vision from Renmin University of China and the University of Texas at Dallas.


Wenxuan Hou

Ph.D. Candidate
(Sep 2022 - )
Renmin University of China

Guangyao Li

Ph.D.
(Sep 2020 - Jun 2024)
Renmin University of China

Yapeng Tian

Assistant Professor
University of Texas at Dallas

Di Hu

Associate Professor
Renmin University of China

Acknowledgement

  • This research was supported by the National Natural Science Foundation of China (No. 62106272) and the Public Computing Cloud, Renmin University of China.
  • This web-page design is inspired by the official website of Music-AVQA.