AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

Abstract
Multi-modal learning, which combines multiple input modalities to improve model performance, is widely used in video recognition. While traditional multi-modal learning achieves excellent recognition results, its computational expense limits its use in many real-world applications. In this paper, we propose an adaptive multi-modal learning framework, called AdaMML, that selects on the fly the optimal modalities for each video segment, conditioned on the input, for efficient video recognition. Specifically, given a video segment, a multi-modal policy network decides which modalities the recognition model should process, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on four challenging, diverse datasets demonstrate that our adaptive approach yields a 35%–55% reduction in computation compared to a traditional baseline that uses all modalities irrespective of the input, while also achieving consistent improvements in accuracy over state-of-the-art methods.
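
The abstract leaves the training mechanics implicit: per-segment modality choices are discrete, yet the policy network is trained jointly with the recognizer via standard back-propagation. One common way to reconcile the two is a straight-through Gumbel-Softmax relaxation, sketched below in PyTorch. This is a minimal illustration under that assumption, not the authors' implementation; `PolicyNet`, `joint_loss`, the modality count, and `flops_per_modality` are all hypothetical names.

```python
# Minimal sketch of per-segment modality selection, assuming a
# straight-through Gumbel-Softmax relaxation keeps the discrete on/off
# decisions differentiable. All names here are illustrative, not from
# the AdaMML codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MODALITIES = 3  # e.g. RGB, flow, audio (illustrative)


class PolicyNet(nn.Module):
    """Predicts, per video segment, an on/off decision for each modality."""

    def __init__(self, feat_dim: int, num_modalities: int = NUM_MODALITIES):
        super().__init__()
        # Two logits per modality: index 0 = "skip", index 1 = "use".
        self.head = nn.Linear(feat_dim, num_modalities * 2)
        self.num_modalities = num_modalities

    def forward(self, segment_feats: torch.Tensor, tau: float = 1.0):
        # segment_feats: (batch, num_segments, feat_dim) cheap per-segment features
        b, s, _ = segment_feats.shape
        logits = self.head(segment_feats).view(b, s, self.num_modalities, 2)
        # Straight-through Gumbel-Softmax: hard 0/1 decisions in the
        # forward pass, soft gradients in the backward pass.
        decisions = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1]
        return decisions  # (batch, num_segments, num_modalities), values in {0, 1}


def joint_loss(clip_logits, labels, decisions, flops_per_modality, lam=0.1):
    """Recognition loss plus a penalty on the compute the policy selects."""
    ce = F.cross_entropy(clip_logits, labels)
    # Average cost of the chosen modalities, normalized to [0, 1].
    cost = (decisions * flops_per_modality).sum(-1).mean() / flops_per_modality.sum()
    return ce + lam * cost


if __name__ == "__main__":
    feats = torch.randn(4, 8, 256)                   # 4 clips x 8 segments
    policy = PolicyNet(feat_dim=256)
    decisions = policy(feats)                        # (4, 8, 3), hard 0/1
    # Hypothetical per-modality cost (relative GFLOPs): RGB > flow > audio.
    flops = torch.tensor([4.0, 2.0, 1.0])
    logits = torch.randn(4, 10, requires_grad=True)  # stand-in recognizer output
    labels = torch.randint(0, 10, (4,))
    loss = joint_loss(logits, labels, decisions, flops)
    loss.backward()                                  # gradients reach the policy
```

The straight-through trick keeps the forward pass hard (each modality is either processed or skipped, which is where the compute savings come from), while the backward pass sees soft probabilities, so the whole pipeline trains with ordinary SGD.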
Qualitative Results

Qualitative examples showing the effectiveness of AdaMML in selecting the right modalities per video segment (marked by green borders).
Paper & Code

Rameswar Panda*, Chun-Fu (Richard) Chen*, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris. "AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition." International Conference on Computer Vision (ICCV), 2021. [PDF] [Supp] [Poster] [Slides] [Code]
*: Equal Contribution

