Christoph Feichtenhofer
Director, Research Scientist
Meta Superintelligence Labs
feichtenhofer _at_ meta.com
Recent technical reports · Google Scholar
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Radle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollar, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer
Technical report, arXiv, November 2025
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 delivers a 2× gain over existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.


Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer
Advances in Neural Information Processing Systems (NeurIPS) 2025, Oral
We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
Advances in Neural Information Processing Systems (NeurIPS) 2025, Spotlight
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM–VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.
News & Highlights
I am a Lead Area Chair for CVPR 2026 and Area Chair for ICLR & ICML 2026
SAM 2 received the ICLR Best Paper Honorable Mention Award
We have released the Perception Encoder & Perception Language Model
I am a Lead Area Chair for CVPR & ICCV 2025 and Area Chair for ICLR & NeurIPS 2025
At ECCV 2024, I am a speaker and panelist at the Visual object tracking and segmentation challenge VOTS2024 workshop, a speaker at the OmniLabel Workshop, and a Mentor at the Doctoral Consortium
At CVPR 2024, I am a speaker and panelist at the workshop on Computer Vision with Humans in the Loop, a speaker at the Efficient Large Vision Models workshop, and a Mentor at the Doctoral Consortium
I am honored to have received the PAMI Young Researcher Award 2023
We organized a Video AI Symposium at DeepMind London
I am a Senior Area Chair for ECCV 2024, Area Chair for CVPR and ICLR 2024
At ICCV 2023, we organize a tutorial on Self-Supervised Representation Learning in Computer Vision and I am an invited speaker at the New Ideas in Vision Transformers Workshop
1/1, 3/3 and 1/1 submissions accepted to ICML, ICCV, and NeurIPS 2023
At ICML 2023, we organize a tutorial on Self-Supervised Learning in Vision: from Research Advances to Best Practices
I am an Area Chair for CVPR, ICCV, BMVC and NeurIPS in 2023
At ECCV 2022, we organize a tutorial on Self-Supervised Representation Learning in Computer Vision, and I am an invited speaker at the Ego4D Workshop and a Mentor at the LatinX in CV Workshop.
2/2 submissions accepted to NeurIPS 2022 and I will serve as an Area Chair for BMVC 2022
7/7 submissions accepted to CVPR 2022 and I will serve as an Area Chair for ECCV 2022
At ICCV 2021, we organize a tutorial on Efficient Video Understanding: State-of-the-art, Challenges, and Opportunities
and I'll be a Mentor at the Doctoral Consortium & Social Events
We released PyTorchVideo - a deep learning library for video research and applications
I will serve as an Area Chair of CVPR 2021 and BMVC 2021
We organized a tutorial on Visual Recognition for Images, Video, and 3D at ECCV 2020
At CVPR 2020, we organize a tutorial on Images, Video, and 3D, and I will serve as an invited speaker for the Large Scale Holistic Video Understanding Tutorial, the International Challenge on Activity Recognition (ActivityNet), and the Language & Vision with applications to Video Understanding workshop.
We organized a tutorial on Images, Video, and 3D research and code at ICCV 2019
SlowFast has been covered in a VentureBeat article: "Facebook’s SlowFast video classifier AI was inspired by primate eyes"
PySlowFast has been released! A codebase supporting video research and applications in PyTorch
Winner of the AVA video activity detection challenge at the International Challenge on Activity Recognition (ActivityNet)
Our entry based on SlowFast achieved 34.3 mAP which corresponds to a gain of 13 mAP over the winning solution of 2018. AVA Challenge report
The top 3 ranking teams all used SlowFast networks as backbone
We organized a tutorial on Visual Recognition at CVPR 2019
We organized a tutorial on Action Classification and Video Modelling at CVPR 2019
Invited talk at the International Conference on Predictive Vision 2019
We organized a tutorial on Visual Recognition at ECCV 2018
Timeline
2015 - 2018
Visiting Researcher at University of Oxford
Worked with Prof. Andrew Zisserman
Visual Geometry Group (VGG), Oxford, UK
2017 - 2018
PostDoc at TU Graz
Institute of Electrical Measurement and Sensor Systems (EMS), Graz, Austria
2014 - 2017
Visiting Researcher at York University
Worked with Prof. Richard P. Wildes
YorkU Vision Lab, Toronto, Canada
2014 - 2017
TU Graz: PhD
Thesis: Deep Learning for Video Recognition
2013
Visiting Researcher at York University
Worked with Prof. Richard P. Wildes
YorkU Vision Lab, Toronto, Canada
Publications · Google Scholar
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
International Conference on Learning Representations (ICLR) 2025, Oral
Best Paper Honorable Mention Award
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images
and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video
segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video
processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation,
we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more
accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve
as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model,
the dataset and an interactive demo.

The Llama 3 Herd of Models
Llama team, Meta
Technical report, arXiv, July 2024
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of
foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding,
reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to
128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable
quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including
pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and
output safety. The paper also presents the results of experiments in which we integrate image, video, and speech
capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the
state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released
as they are still under development.

Demystifying CLIP Data
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
International Conference on Learning Representations (ICLR) 2024, Spotlight
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles.
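To make the balancing idea concrete, below is a minimal sketch in Python. It assumes captions are matched to metadata entries by plain substring search and that each entry's matches are capped at a fixed budget so head concepts do not dominate; the function name, the cap, and the toy data are illustrative assumptions, not the released MetaCLIP pipeline.

```python
import random
from collections import defaultdict

def curate(pool, metadata, per_entry_cap=20_000, seed=0):
    """Balance a raw (image_url, caption) pool over a metadata list.

    Hypothetical sketch: captions are matched to metadata entries by substring
    search; entries with more than `per_entry_cap` matches are randomly
    down-sampled, while tail entries are kept in full.
    """
    rng = random.Random(seed)
    matches = defaultdict(list)                 # metadata entry -> pool indices
    for i, (_, caption) in enumerate(pool):
        text = caption.lower()
        for entry in metadata:
            if entry in text:
                matches[entry].append(i)

    keep = set()
    for entry, idxs in matches.items():
        if len(idxs) > per_entry_cap:
            idxs = rng.sample(idxs, per_entry_cap)   # down-sample head entries
        keep.update(idxs)
    return [pool[i] for i in sorted(keep)]

# toy usage
pool = [("u1", "a yellow school bus on the road"),
        ("u2", "a photo of a dog"),
        ("u3", "another dog photo")]
metadata = ["school bus", "dog"]
print(len(curate(pool, metadata, per_entry_cap=1)))   # 2
```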

Window Attention is Bugged: How not to Interpolate Position Embeddings
Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer
International Conference on Learning Representations (ICLR) 2024
Window attention, position embeddings, and high resolution finetuning are core concepts in the modern transformer era of computer vision. However, we find that naively combining these near ubiquitous components can have a detrimental effect on performance. The issue is simple: interpolating position embeddings while using window attention is wrong. We study two state-of-the-art methods that have these three components, namely Hiera and ViTDet, and find that both do indeed suffer from this bug. To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet. We finally combine the two to obtain HieraDet, which achieves 61.7 box mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k pretraining. This all stems from what is essentially a 3 line bug fix, which we name "absolute win".

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Chaitanya Ryali*, Yuan-Ting Hu*, Daniel Bolya*, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li*, Christoph Feichtenhofer*
*equal contribution
International Conference on Machine Learning (ICML) 2023, Oral
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition.


MAViL: Masked Audio-Video Learners
Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer
Advances in Neural Information Processing Systems (NeurIPS) 2023
We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks.
[Figure: inpainting results compared across ground truth, DiffMAE, and MAE]
Diffusion Models as Masked Autoencoders
Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer
International Conference on Computer Vision (ICCV) 2023
There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.

CiT: Curation in Training for Effective Vision-Language Data
Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
International Conference on Computer Vision (ICCV) 2023
Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed-up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternatively selects relevant training data from the pool by measuring the similarity of their text embeddings and embeddings of the metadata. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.

The effectiveness of MAE pre-pretraining for billion-scale pretraining
Mannat Singh*, Quentin Duval*, Kalyan Vasudev Alwala*, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
International Conference on Computer Vision (ICCV) 2023
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images).

Scaling Language-Image Pre-training via Masking
Yanghao Li*, Haoqi Fan*, Ronghang Hu*, Christoph Feichtenhofer‡, Kaiming He‡
*equal technical contribution, ‡equal advising
Conference on Computer Vision and Pattern Recognition (CVPR) 2023
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
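The core trick is easy to sketch: drop a random subset of the image patch tokens before they enter the image encoder, so each training step sees fewer tokens. A minimal PyTorch illustration under that assumption (the helper name and the 50% keep ratio are just for the example, not the released code):

```python
import torch

def random_mask_patches(patch_tokens, keep_ratio=0.5):
    """Randomly keep `keep_ratio` of the patch tokens of each image.

    patch_tokens: (batch, num_patches, dim) tensor from a ViT patch embedding.
    Returns the kept tokens (batch, num_keep, dim) and their indices.
    """
    b, n, d = patch_tokens.shape
    num_keep = max(1, int(n * keep_ratio))
    noise = torch.rand(b, n, device=patch_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]          # random subset per image
    kept = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep_idx

# toy usage: 196 patches -> 98 visible tokens fed to the image encoder
tokens = torch.randn(4, 196, 768)
visible, idx = random_mask_patches(tokens, keep_ratio=0.5)
print(visible.shape)   # torch.Size([4, 98, 768])
```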







Multiview Compressive Coding for 3D Reconstruction
Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari
Conference on Computer Vision and Pattern Recognition (CVPR) 2023
A central goal of visual recognition is to understand objects and scenes from
a single image. 2D recognition has witnessed tremendous progress thanks to
large-scale learning and general-purpose representations. Comparatively, 3D
poses new challenges stemming from occlusions not depicted in the image. Prior
works try to overcome these by inferring from multiple views or rely on scarce
CAD models and category-specific priors which hinder scaling to novel settings.
In this work, we explore single-view 3D reconstruction by learning
generalizable representations inspired by advances in self-supervised learning.
We introduce a simple framework that operates on 3D points of single objects or
whole scenes coupled with category-agnostic large-scale training from diverse
RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress
the input appearance and geometry to predict the 3D structure by querying a
3D-aware decoder. MCC's generality and efficiency allow it to learn from
large-scale and diverse data sources with strong generalization to novel
objects imagined by DALL-E 2 or captured in-the-wild with an iPhone.


Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman
International Conference on Learning Representations (ICLR) 2023, Oral
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
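A rough sketch of the merging step, assuming tokens are split into two alternating sets and the r most similar cross-set pairs are averaged together. This simplification ignores the class token and size-weighted averaging, and it relies on torch.Tensor.scatter_reduce from recent PyTorch; it is not the released ToMe code.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Merge the r most similar token pairs (a minimal bipartite-matching sketch).

    x: (batch, tokens, dim). Tokens are split into alternating sets A and B;
    each A token proposes its most similar B token, the r best pairs are merged
    by averaging, so the output has (tokens - r) tokens.
    """
    a, b = x[:, ::2], x[:, 1::2]
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    best_val, best_idx = scores.max(dim=-1)             # best B partner per A token
    order = best_val.argsort(dim=-1, descending=True)
    merged, kept = order[:, :r], order[:, r:]

    d = x.shape[-1]
    unmerged_a = torch.gather(a, 1, kept.unsqueeze(-1).expand(-1, -1, d))
    src = torch.gather(a, 1, merged.unsqueeze(-1).expand(-1, -1, d))
    dst = torch.gather(best_idx, 1, merged)
    # average each merged A token into its B destination (mean includes the B token)
    b = b.scatter_reduce(1, dst.unsqueeze(-1).expand(-1, -1, d), src,
                         reduce="mean", include_self=True)
    return torch.cat([unmerged_a, b], dim=1)

x = torch.randn(2, 197, 384)
print(merge_tokens(x, r=16).shape)   # torch.Size([2, 181, 384])
```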


Masked Autoencoders As Spatiotemporal Learners
Christoph Feichtenhofer*, Haoqi Fan*, Yanghao Li, Kaiming He
*equal technical contribution
Advances in Neural Information Processing Systems (NeurIPS) 2022
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (only except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
- arXiv
- code
PyTorch code has been open sourced in PySlowFast & PyTorchVideo.
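As a rough illustration of the pre-training input pipeline, here is a minimal sketch that patchifies a clip into spacetime patches and keeps a random 10% for the encoder. The patch size, tensor layout, and function name are assumptions for the example, not the released implementation.

```python
import torch

def random_spacetime_masking(video, patch=(2, 16, 16), mask_ratio=0.9):
    """Patchify a video into spacetime patches and keep a random 10%.

    video: (batch, channels, frames, height, width). Returns the visible
    patches (batch, num_visible, patch_dim) and the shuffle indices needed to
    place decoder mask tokens back in order. Purely illustrative.
    """
    b, c, t, h, w = video.shape
    pt, ph, pw = patch
    x = video.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).reshape(b, -1, pt * ph * pw * c)

    n = x.shape[1]
    num_keep = int(n * (1 - mask_ratio))
    shuffle = torch.rand(b, n, device=video.device).argsort(dim=1)
    visible_idx = shuffle[:, :num_keep]
    visible = torch.gather(x, 1, visible_idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
    return visible, shuffle

clip = torch.randn(2, 3, 16, 224, 224)       # 8 x 14 x 14 = 1568 spacetime patches
vis, _ = random_spacetime_masking(clip)
print(vis.shape)                              # torch.Size([2, 156, 1536])
```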


Masked Autoencoders that Listen
Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer
Advances in Neural Information Processing Systems (NeurIPS) 2022
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.


MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Chao-Yuan Wu*, Yanghao Li*, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*
*equal technical contribution
Conference on Computer Vision and Pattern Recognition (CVPR) 2022 (Oral)
In this paper, we propose a new strategy to overcome the challenge of modeling long videos efficiently. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models will be made publicly available.
- arXiv
- code
PyTorch code will be open sourced in PySlowFast & PyTorchVideo.
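A toy sketch of the caching idea: an attention layer that concatenates detached keys/values from earlier clips to the current clip, so temporal context grows online at marginal cost. The dimensions, cache length, and single-head design are illustrative assumptions, not the MeMViT implementation.

```python
import torch

class MemoryAttention(torch.nn.Module):
    """Single-head attention that also attends over cached keys/values from
    earlier clips (a simplified sketch of online long-term modeling).

    The cache is detached so no gradients flow into past iterations, and only
    the most recent `mem_len` memory tokens are kept.
    """

    def __init__(self, dim, mem_len=256):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.mem_len = mem_len
        self.mem_k = None
        self.mem_v = None

    def forward(self, x):                          # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if self.mem_k is not None:
            k = torch.cat([self.mem_k, k], dim=1)
            v = torch.cat([self.mem_v, v], dim=1)
        # cache the most recent keys/values (detached) for the next clip
        self.mem_k = k[:, -self.mem_len:].detach()
        self.mem_v = v[:, -self.mem_len:].detach()
        attn = (q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5).softmax(dim=-1)
        return self.proj(attn @ v)

layer = MemoryAttention(dim=96)
for clip_tokens in torch.randn(4, 2, 128, 96):     # 4 consecutive clips of one video
    out = layer(clip_tokens)
print(out.shape)                                    # torch.Size([2, 128, 96])
```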


Masked Feature Prediction for Self-Supervised Visual Pre-Training
Chen Wei*, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer*
*equal technical contribution
Conference on Computer Vision and Pattern Recognition (CVPR) 2022
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
- arXiv
- code
PyTorch code has been open sourced in PySlowFast & PyTorchVideo.
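A small sketch of how HOG regression targets for masked patches might be produced with scikit-image. The real method computes HOG over the full frame and pools it per patch, so the per-patch computation, mask ratio, and descriptor parameters below are simplifying assumptions for illustration only.

```python
import numpy as np
from skimage.feature import hog

def hog_targets(frame, patch=16, mask_ratio=0.4, seed=0):
    """Compute HOG regression targets for a random subset of patches.

    frame: (H, W) grayscale array. Returns the indices of masked patches and
    one HOG descriptor per masked patch; a model would be trained to predict
    these descriptors from the unmasked context.
    """
    h, w = frame.shape
    rows, cols = h // patch, w // patch
    rng = np.random.default_rng(seed)
    num_mask = int(rows * cols * mask_ratio)
    masked = rng.choice(rows * cols, size=num_mask, replace=False)

    targets = []
    for idx in masked:
        r, c = divmod(int(idx), cols)
        cell = frame[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        targets.append(hog(cell, orientations=9,
                           pixels_per_cell=(8, 8), cells_per_block=(1, 1)))
    return masked, np.stack(targets)

frame = np.random.rand(224, 224)
masked, targets = hog_targets(frame)
print(targets.shape)          # (78, 36): 9 orientations x 4 cells per 16x16 patch
```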


MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*
*equal technical contribution
Conference on Computer Vision and Pattern Recognition (CVPR) 2022
In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTs' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViT has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection as well as 86.1% on Kinetics-400 video classification.
Reversible Vision Transformers
Karttikeya Mangalam*, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer*, Jitendra Malik
*equal technical contribution
Conference on Computer Vision and Pattern Recognition (CVPR) 2022 (Oral)
We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory footprint from the depth of the model, Reversible Vision Transformers enable memory efficient scaling of transformer architectures. We adapt two popular models, namely Vision Transformer and Multi-scale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 3.9x over their non-reversible counterparts.


A ConvNet for the 2020s
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie
Conference on Computer Vision and Pattern Recognition (CVPR) 2022
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which
quickly
superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand,
faces
difficulties when applied to general computer vision tasks such as object detection and semantic segmentation.
It is the
hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making
Transformers
practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of
vision
tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic
superiority of
Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design
spaces
and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the
design of
a vision Transformer, and discover several key components that contribute to the performance difference along
the way.
The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from
standard
ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving
87.8%
ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while
maintaining
the simplicity and efficiency of standard ConvNets.
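For intuition, here is a sketch of a single "modernized" block in the ConvNeXt style: a 7x7 depthwise convolution, LayerNorm, an inverted-bottleneck MLP with GELU, a learnable per-channel scale, and a residual connection. It mirrors the published design at a high level but is not the reference implementation (stochastic depth and the stem/downsampling layers are omitted).

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """A single ConvNeXt-style block (illustrative sketch, not the reference code)."""

    def __init__(self, dim, mlp_ratio=4, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, mlp_ratio * dim)   # pointwise expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(mlp_ratio * dim, dim)   # pointwise project back
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))

    def forward(self, x):                        # x: (batch, dim, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)   # channels-last for LN / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = (self.gamma * x).permute(0, 3, 1, 2)
        return shortcut + x

block = ConvNeXtBlock(dim=96)
print(block(torch.randn(1, 96, 56, 56)).shape)   # torch.Size([1, 96, 56, 56])
```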


TrackFormer: Multi-Object Tracking with Transformers
Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixé, Christoph Feichtenhofer
Conference on Computer Vision and Pattern Recognition (CVPR) 2022
We present TrackFormer, an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture. Our approach introduces track query embeddings which follow objects through a video sequence in an autoregressive fashion. New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time. The Transformer decoder adjusts track query embeddings from frame to frame, thereby following the changing object positions. TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm by self- and encoder-decoder attention mechanisms which simultaneously reason about location, occlusion, and object identity. TrackFormer yields state-of-the-art performance on the tasks of multi-object tracking (MOT17) and segmentation (MOTS20). We hope our unified way of performing detection and tracking will foster future research in multi-object tracking and video understanding.

Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman et al.
Conference on Computer Vision and Pattern Recognition (CVPR) 2022 (Oral)
Best Paper Finalist
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.


PyTorchVideo: A Deep Learning Library for Video Understanding
Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer
Proceedings of the 29th ACM International Conference on Multimedia, 2021
We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision.


Multiscale Vision Transformers
Haoqi Fan*, Bo Xiong*, Karttikeya Mangalam*, Yanghao Li*, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer*
*equal technical contribution
International Conference on Computer Vision (ICCV) 2021
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal
idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several
channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages
hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale
pyramid of features with early layers operating at high spatial resolution to model simple low-level visual
information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this
fundamental architectural prior for modeling the dense nature of visual signals for a variety of video
recognition tasks where it outperforms concurrent vision transformers that rely on large scale external
pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension
and apply our model for image classification where it outperforms prior work on vision transformers.
- arXiv
- code
- talk
PyTorch code has been open sourced in PySlowFast & PyTorchVideo.



A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He
Conference on Computer Vision and Pattern Recognition (CVPR) 2021
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a
unified perspective on four recent image-based frameworks, we study a simple objective that can easily
generalize all these methods to space-time. Our objective encourages temporally-persistent features in the
same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised
frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a
series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency
can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple
benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised
counterpart.
- arXiv
- code
PyTorch code will be open sourced in PySlowFast & PyTorchVideo.

Multiview Pseudo-Labeling for Semi-supervised Learning from Video
Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer
International Conference on Computer Vision (ICCV) 2021
We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary
views in the form of appearance and motion information for semi-supervised learning in video. The
complementary views help obtain more reliable pseudo-labels on unlabeled video, to learn stronger video
representations than from purely supervised data. Though our method capitalizes on multiple views, it
nonetheless trains a model that is shared across appearance and motion input and thus, by design, incurs no
additional computation overhead at inference time. On multiple video recognition datasets, our method
substantially outperforms its supervised counterpart, and compares favorably to previous work on standard
benchmarks in self-supervised video representation learning.

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
Empirical Methods in Natural Language Processing (EMNLP) 2021 (Oral)
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches.


Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick, Dylan Campbell, Yuki M Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João Henriques
Advances in Neural Information Processing Systems (NeurIPS) 2021 (Oral)
In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame t may be entirely unrelated to what is found at that location in frame t+k. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers -- trajectory attention -- that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. While these ideas are useful in a range of settings, we apply them to the specific task of video action recognition with a transformer model and obtain state-of-the-art results on the Kinetics, Something-Something V2, and Epic-Kitchens datasets.

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer
Findings of the Association for Computational Linguistics (ACL) 2021
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pretraining masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training.

X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
Conference on Computer Vision and Pattern Recognition (CVPR) 2020 (Oral)
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8x and 5.5x fewer multiply-adds and parameters for similar accuracy as previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks.

Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
Technical report, arXiv, 2020
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.

Feature Pyramid Grids
Kai Chen, Yuhang Cao, Chen Change Loy, Dahua Lin, Christoph Feichtenhofer
Technical report, arXiv, 2020
Feature pyramid networks (FPN) have been widely adopted in the object detection literature to improve feature representations for better handling of variations in scale. In this paper, we present Feature Pyramid Grids (FPG), a simple extension to FPN, that represents the feature scale-space as a regular grid of parallel bottom-up pathways which are fused by multi-directional lateral connections between them. FPG is simple and flexible, which only adds a small overhead to regular, single pathway FPN while significantly increasing its performance. In addition to its general and simple structure, over complicated structures that have been found with neural architecture search, it also compares favorably against such approaches, providing higher accuracy and speed. We hope that FPG with its simple and effective nature can serve as a strong baseline for future work in object recognition.

A Multigrid Method for Efficiently Training Video Models
Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krähenbühl
Conference on Computer Vision and Pattern Recognition (CVPR) 2020 (Oral)
Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training assumes a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the optimal shape? High resolution models perform well, but train slowly. Low resolution models train faster, but they are inaccurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5x faster (wall-clock time, same hardware) while also improving accuracy (+0.8% absolute) on Kinetics-400 compared to the baseline training method.
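A toy sketch of a multigrid-style schedule: cheaper spatial-temporal shapes are paired with proportionally larger mini-batches and learning rates so the per-iteration cost stays roughly constant. The specific shapes, batch sizes, and linear LR scaling below are illustrative assumptions, not the schedule from the paper.

```python
import torch

# Cheaper clip shapes are paired with larger mini-batches so that the cost
# per iteration stays roughly constant. Shapes/splits here are illustrative.
long_cycle = [
    {"frames": 4, "size": 112, "batch_scale": 8},
    {"frames": 8, "size": 112, "batch_scale": 4},
    {"frames": 8, "size": 160, "batch_scale": 2},
    {"frames": 16, "size": 224, "batch_scale": 1},   # the final, full-size shape
]

def make_batch(dataset_clip, shape, base_batch=8, base_lr=0.1):
    """Resample a full-resolution clip to the current grid shape and return the
    batch size / learning rate to use with it (linear LR scaling)."""
    frames = dataset_clip[:, :: dataset_clip.shape[1] // shape["frames"]]
    frames = torch.nn.functional.interpolate(
        frames, size=(shape["size"], shape["size"]), mode="bilinear",
        align_corners=False)
    return frames, base_batch * shape["batch_scale"], base_lr * shape["batch_scale"]

clip = torch.randn(3, 16, 224, 224)          # (channels, frames, H, W)
for shape in long_cycle:
    x, bs, lr = make_batch(clip, shape)
    print(tuple(x.shape), "batch", bs, "lr", round(lr, 3))
```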
EGO-TOPO: Environment Affordances from Egocentric Video
Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman
Conference on Computer Vision and Pattern Recognition (CVPR) 2020 (Oral)
First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an ego-video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video.


SlowFast Networks for Video Recognition
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
International Conference on Computer Vision (ICCV) 2019 (Oral)
Winner of the AVA video activity detection challenge at CVPR 2019.
PyTorch code is open sourced as PySlowFast.
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report 79.0% accuracy on the Kinetics dataset without using any pre-training, largely surpassing the previous best results of this kind. On AVA action detection we achieve a new state-of-the-art of 28.3 mAP.
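A toy two-pathway sketch to illustrate the idea: the Slow pathway sees temporally strided frames with many channels, the Fast pathway sees all frames with beta-times fewer channels, and a strided lateral convolution fuses Fast into Slow. The layer choices and sizes are illustrative, not the released PySlowFast model.

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    """A minimal two-pathway network in the spirit of SlowFast (illustrative only)."""

    def __init__(self, alpha=8, slow_ch=64, beta=8, num_classes=400):
        super().__init__()
        self.alpha = alpha
        fast_ch = slow_ch // beta
        self.slow = nn.Conv3d(3, slow_ch, kernel_size=(1, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, fast_ch, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.lateral = nn.Conv3d(fast_ch, slow_ch, kernel_size=(alpha, 1, 1),
                                 stride=(alpha, 1, 1))
        self.head = nn.Linear(slow_ch + fast_ch, num_classes)

    def forward(self, video):                           # (batch, 3, T, H, W)
        slow = self.slow(video[:, :, :: self.alpha])    # temporally strided input
        fast = self.fast(video)                         # full frame rate, few channels
        slow = slow + self.lateral(fast)                # fuse Fast into Slow
        feat = torch.cat([slow.mean(dim=(2, 3, 4)), fast.mean(dim=(2, 3, 4))], dim=1)
        return self.head(feat)

model = TinySlowFast()
print(model(torch.randn(2, 3, 32, 224, 224)).shape)    # torch.Size([2, 400])
```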

Modeling Human Motion with Quaternion-based Neural Networks
Dario Pavllo, Christoph Feichtenhofer, Michael Auli, David Grangier
International Journal of Computer Vision (IJCV), 2019
Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angles or exponential maps as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This work addresses both limitations. QuaterNet represents rotations with quaternions and our loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors. We investigate both recurrent and convolutional architectures and evaluate on short-term prediction and long-term generation. For the latter, our approach is qualitatively judged as realistic as recent neural strategies from the graphics literature. Our experiments compare quaternions to Euler angles as well as exponential maps and show that only a very short context is required to make reliable future predictions. Finally, we show that the standard evaluation protocol for Human3.6M produces high variance results and we propose a simple solution.



Grounded Human-Object Interaction Hotspots from Video
Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman
International Conference on Computer Vision (ICCV) 2019
Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction "hotspots" directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching videos of real human behavior and anticipating afforded actions. Given a novel image or video, our model infers a spatial hotspot map indicating how an object would be manipulated in a potential interaction-- even if the object is currently at rest. Through results with both first and third person video, we show the value of grounding affordances in real human-object interactions. Not only are our weakly supervised hotspots competitive with strongly supervised affordance methods, but they can also anticipate object interaction for novel object categories.


Learning Temporal Pose Estimation from Sparsely-Labeled Videos
Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
Advances in Neural Information Processing Systems (NeurIPS) 2019
Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pair of video frames---a labeled Frame A and an unlabeled Frame B---we train our model to predict human pose in Frame A using the features from Frame B by means of deformable convolutions to implicitly learn the pose warping between A and B. We demonstrate that we can leverage our trained PoseWarper for several applications. First, at inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames. This makes it possible to generate pose annotations for the entire video given only a few manually-labeled frames. Compared to modern label propagation methods based on optical flow, our warping mechanism is much more compact (6M vs 39M parameters), and also more accurate (88.7% mAP vs 83.8% mAP). We also show that we can improve the accuracy of a pose estimator by training it on an augmented dataset obtained by adding our propagated poses to the original manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows our system to achieve state-of-the-art pose detection results on the PoseTrack2017 dataset.


Long-Term Feature Banks for Detailed Video Understanding
Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick
Conference on Computer Vision and Pattern Recognition (CVPR) 2019 (Oral)
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.


3D human pose estimation in video with temporal convolutions and semi-supervised training
Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli
Conference on Computer Vision and Pattern Recognition (CVPR) 2019
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce.
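The temporal model can be illustrated with a few dilated 1D convolutions over 2D keypoint sequences; the sketch below uses illustrative layer widths and dilations rather than the exact published architecture.

```python
# Sketch: a dilated temporal ConvNet that lifts 2D keypoint sequences to 3D poses.
import torch
import torch.nn as nn

J = 17  # number of joints
model = nn.Sequential(
    nn.Conv1d(2 * J, 256, kernel_size=3, dilation=1), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, dilation=9), nn.ReLU(),
    nn.Conv1d(256, 3 * J, kernel_size=1),
)

keypoints_2d = torch.randn(1, 2 * J, 243)   # (x, y) per joint over 243 frames
poses_3d = model(keypoints_2d)              # 3D joints for the frames inside the receptive field
print(poses_3d.shape)                        # temporal axis shrinks with the dilation pattern
```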



Deep insights into convolutional networks for video recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman
International Journal on Computer Vision (IJCV), 2019
What have we learned from deep representations for action recognition?
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman
Conference on Computer Vision and Pattern Recognition (CVPR), 2018
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video. We show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.


Learning Discriminative Motion Features Through Detection
Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
Technical report, arXiv, December 2018
Despite huge success in the image domain, modern detection models such as Faster R-CNN have not been used nearly as much for video analysis. This is arguably due to the fact that detection models are designed to operate on single frames and as a result do not have a mechanism for learning motion representations directly from video. We propose a learning procedure that allows detection models such as Faster R-CNN to learn motion features directly from the RGB video data while being optimized with respect to a pose estimation task. In our experiments we show that our training scheme helps learn effective motion cues, which can be used to estimate and localize salient human motion. Furthermore, we demonstrate that as a byproduct, our model also learns features that lead to improved pose detection in still-images, and better keypoint tracking. Finally, we show how to leverage our learned model for the tasks of spatiotemporal action localization and fine-grained action recognition.


Camera-based vehicle velocity estimation from monocular video
Moritz Kampelmühler, Michael G. Müller, Christoph Feichtenhofer
Computer Vision Winter Workshop (CVWW), 2018
Best Student Paper Award
This paper documents the winning entry at the CVPR2017 vehicle velocity estimation challenge. Velocity estimation is an emerging task in autonomous driving which has not yet been thoroughly explored. The goal is to estimate the relative velocity of a specific vehicle from a sequence of images. In this paper, we present a light-weight approach for directly regressing vehicle velocities from their trajectories using a multilayer perceptron. Another contribution is an explorative study of features for monocular vehicle velocity estimation. We find that lightweight trajectory based features outperform depth and motion cues extracted from deep ConvNets, especially for far-distance predictions where current disparity and optical flow estimators are challenged significantly. Our light-weight approach is real-time capable on a single CPU and outperforms all competing entries in the velocity estimation challenge. On the test set, we report an average error of 1.12 m/s which is comparable to a (ground-truth) system that combines LiDAR and radar techniques to achieve an error of around 0.71 m/s.
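A hedged sketch of the regression setup: a small multilayer perceptron mapping hand-crafted trajectory features to relative velocity. The feature dimensionality and targets below are synthetic placeholders, not the challenge data.

```python
# Sketch: regress relative vehicle velocity from light-weight trajectory features
# with a small multilayer perceptron (scikit-learn used purely for illustration).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Hypothetical per-vehicle features: bounding-box trajectory statistics
# (box center/scale and their temporal differences), flattened to a vector.
X = rng.normal(size=(500, 20))
y = rng.normal(size=(500, 2))        # relative velocity (longitudinal, lateral) in m/s

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
mlp.fit(X, y)
pred = mlp.predict(X[:5])            # predicted relative velocities
```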


Detect to Track and Track to Detect
Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
International Conference on Computer Vision (ICCV) 2017 (spotlight)
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce novel correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.
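The correlation features can be sketched as a local dot-product between the feature maps of two frames; the helper below (illustrative sizes, naive loop over displacements) shows the computation, not the paper's optimized implementation.

```python
# Sketch: correlate each feature vector in frame t with a (2d+1) x (2d+1)
# neighbourhood in frame t+tau, yielding one correlation channel per displacement.
import torch
import torch.nn.functional as F

def correlation(feat_t, feat_tau, d=4):
    B, C, H, W = feat_t.shape
    pad = F.pad(feat_tau, (d, d, d, d))
    out = []
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = pad[:, :, dy:dy + H, dx:dx + W]
            out.append((feat_t * shifted).sum(dim=1, keepdim=True) / C)
    return torch.cat(out, dim=1)      # (B, (2d+1)^2, H, W) correlation map

corr = correlation(torch.randn(1, 256, 38, 50), torch.randn(1, 256, 38, 50))
# `corr` would be concatenated with per-frame features and fed to a track-regression head.
```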

Spatiotemporal Multiplier Networks for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Conference on Computer Vision and Pattern Recognition (CVPR) 2017
This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies we inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and able to evaluate a video in a single forward pass. Empirical investigation reveals that our model produces state-of-the-art results on two standard action recognition datasets.
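A minimal sketch of the multiplicative gating idea: the motion pathway gates the appearance pathway inside a residual unit. Channel counts and the two-conv residual body are illustrative.

```python
# Sketch: motion features multiplicatively gate appearance features in a residual unit.
import torch
import torch.nn as nn

class GatedResidualUnit(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, appearance, motion):
        gated = appearance * motion              # motion gates appearance multiplicatively
        return appearance + self.conv(gated)     # residual connection

unit = GatedResidualUnit()
out = unit(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```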


Temporal Residual Networks for Dynamic Scene Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Conference on Computer Vision and Pattern Recognition (CVPR) 2017
This paper combines three contributions to establish a new state-of-the-art in dynamic scene recognition. First, we present a novel ConvNet architecture based on temporal residual units that is fully convolutional in spacetime. Our model augments spatial ResNets with convolutions across time to hierarchically add temporal residuals as the depth of the network increases. Second, existing approaches to video-based recognition are categorized and a baseline of seven previously top performing algorithms is selected for comparative evaluation on dynamic scenes. Third, we introduce a new and challenging video database of dynamic scenes that more than doubles the size of those previously available. This dataset is explicitly split into two subsets of equal size that contain videos with and without camera motion to allow for systematic study of how this variable interacts with the defining dynamics of the scene per se. Our evaluations verify the particular strengths and weaknesses of the baseline algorithms with respect to various scene classes and camera motion parameters. Finally, our temporal ResNet boosts recognition performance and establishes a new state-of-the-art on dynamic scene recognition, as well as on the complementary task of action recognition.

Spatiotemporal Residual Networks for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Advances in Neural Information Processing Systems (NIPS) 2016
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.
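The temporal residual connections can be sketched as a 1D convolution across adjacent feature maps in time, initialized as the identity so the pretrained image ConvNet is unchanged at initialization; sizes below are illustrative.

```python
# Sketch: a temporal convolution over adjacent feature maps, initialized to behave as a
# temporal identity/residual connection so the pretrained image ConvNet is preserved at init.
import torch
import torch.nn as nn

C = 256
temporal = nn.Conv3d(C, C, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False)
with torch.no_grad():
    temporal.weight.zero_()
    for c in range(C):
        temporal.weight[c, c, 1, 0, 0] = 1.0      # centre tap = per-channel identity

x = torch.randn(1, C, 8, 14, 14)                  # (batch, channels, time, H, W)
assert torch.allclose(temporal(x), x, atol=1e-6)  # acts as identity before training
```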

Convolutional Two-Stream Network Fusion for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
Conference on Computer Vision and Pattern Recognition (CVPR) 2016
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
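A minimal sketch of conv-layer fusion, assuming the simplest variant (channel concatenation followed by a learned 1x1 fusion filter); layer sizes are illustrative.

```python
# Sketch: fuse the spatial and temporal streams at a convolutional layer by stacking
# channels and learning a 1x1 fusion convolution.
import torch
import torch.nn as nn

C = 512
fuse = nn.Conv2d(2 * C, C, kernel_size=1)       # 1x1 fusion filter over stacked channels

spatial_feat = torch.randn(1, C, 14, 14)        # appearance (RGB) stream, last conv layer
temporal_feat = torch.randn(1, C, 14, 14)       # motion (optical-flow) stream, same layer
fused = fuse(torch.cat([spatial_feat, temporal_feat], dim=1))
# From this layer on, a single fused tower continues to the classifier, saving the
# parameters of the second tower's remaining layers.
```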

Dynamic Scene Recognition with Complementary Spatiotemporal Features
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2016
This paper presents Dynamically Pooled Complementary Features, a unified approach to dynamic scene recognition that analyzes a short video clip in terms of its spatial, temporal and color properties. The complementarity of these properties is preserved through all main steps of processing, including primitive feature extraction, coding and pooling. In the feature extraction step, spatial orientations capture static appearance, spatiotemporal oriented energies capture image dynamics and color statistics capture chromatic information. Subsequently, primitive features are encoded into a mid-level representation that has been learned for the task of dynamic scene recognition. Finally, a novel dynamic spacetime pyramid is introduced. This dynamic pooling approach can handle both global as well as local motion by adapting to the temporal structure, as guided by pooling energies. The resulting system provides online recognition of dynamic scenes that is thoroughly evaluated on the two current benchmark datasets and yields best results to date on both datasets. In-depth analysis reveals the benefits of explicitly modeling feature complementarity in combination with the dynamic spacetime pyramid, indicating that this unified approach should be well-suited to many areas of video analysis.

Dynamically Encoded Actions based on Spacetime Saliency
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Conference on Computer Vision and Pattern Recognition (CVPR) 2015
Human actions typically occur over a well localized extent in both space and time. Similarly, as typically captured in video, human actions have small spatiotemporal support in image space. This paper capitalizes on these observations by weighting feature pooling for action recognition over those areas within a video where actions are most likely to occur. To enable this operation, we define a novel measure of spacetime saliency. The measure relies on two observations regarding foreground motion of human actors: They typically exhibit motion that contrasts with that of their surrounding region and they are spatially compact. By using the resulting definition of saliency during feature pooling we show that action recognition performance achieves state-of-the-art levels on three widely considered action recognition datasets. Our saliency weighted pooling can be applied to essentially any locally defined features and encodings thereof. Additionally, we demonstrate that inclusion of locally aggregated spatiotemporal energy features, which efficiently result as a by-product of the saliency computation, further boosts performance over reliance on standard action recognition features alone.
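The pooling step reduces to a saliency-weighted average of local descriptors; the sketch below uses random stand-ins for the saliency map and the encoded features.

```python
# Sketch: saliency-weighted feature pooling over a spatial grid of local descriptors.
import numpy as np

H, W, D = 60, 80, 128
descriptors = np.random.randn(H, W, D)           # locally encoded features (stand-in)
saliency = np.random.rand(H, W)                  # spacetime saliency, higher = more action-like
weights = saliency / saliency.sum()

pooled = np.tensordot(weights, descriptors, axes=([0, 1], [0, 1]))   # (D,) video-level feature
```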

Bags of Spacetime Energies for Dynamic Scene Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
Conference on Computer Vision and Pattern Recognition (CVPR) 2014
This paper presents a unified bag of visual word (BoW) framework for dynamic scene recognition. The approach builds on primitive features that uniformly capture spatial and temporal orientation structure of the imagery (e.g., video), as extracted via application of a bank of spatiotemporally oriented filters. Various feature encoding techniques are investigated to abstract the primitives to an intermediate representation that is best suited to dynamic scene representation. Further, a novel approach to adaptive pooling of the encoded features is presented that captures spatial layout of the scene even while being robust to situations where camera motion and scene dynamics are confounded. The resulting overall approach has been evaluated on two standard, publicly available dynamic scene datasets. The results show that in comparison to a representative set of alternatives, the proposed approach outperforms the previous state-of-the-art in classification accuracy by 10%.
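A generic bag-of-visual-words sketch of the encoding stage, with k-means quantization and a normalized histogram; the descriptors are random stand-ins for the spacetime-energy features, and the codebook size is illustrative.

```python
# Sketch: learn a visual codebook with k-means, then describe a clip by its
# normalized histogram of codeword assignments.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

train_desc = np.random.randn(10000, 64)          # stand-in for oriented-energy descriptors
codebook = MiniBatchKMeans(n_clusters=256, n_init=3, random_state=0).fit(train_desc)

clip_desc = np.random.randn(500, 64)             # descriptors from one video clip
words = codebook.predict(clip_desc)
bow = np.bincount(words, minlength=256).astype(float)
bow /= bow.sum()                                  # normalized BoW representation of the clip
```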

Fusing RFID and Computer Vision for Probabilistic Tag Localization
Michael Goller, Christoph Feichtenhofer, Axel Pinz
International Conference on RFID (IEEE RFID) 2014
The combination of RFID and computer vision systems is an effective approach to mitigate the limited tag localization capabilities of current RFID deployments. In this paper, we present a hybrid RFID and computer vision system for localization and tracking of RFID tags. The proposed system combines the information from the two complementary sensor modalities in a probabilistic manner and provides a high degree of flexibility. In addition, we introduce a robust data association method which is crucial for the application in practical scenarios. To demonstrate the performance of the proposed system, we conduct a series of experiments in an article surveillance setup. This is a frequent application for RFID systems in retail where previous approaches solely based on RFID localization have difficulties due to false alarms triggered by stationary tags. Our evaluation shows that the fusion of RFID and computer vision provides robustness to false positive observations and allows for a reliable system operation.
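The probabilistic fusion can be sketched as multiplying per-sensor likelihoods over a discretized floor grid and normalizing; the Gaussian likelihood shapes below are purely illustrative, not the paper's sensor models.

```python
# Sketch: fuse RFID and vision evidence on a floor grid by multiplying independent
# per-sensor likelihoods and normalizing to a posterior over tag locations.
import numpy as np

yy, xx = np.mgrid[0:50, 0:50]

def gaussian(cx, cy, sigma):
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))

rfid_likelihood = gaussian(20, 25, sigma=8)       # coarse RSSI-based estimate (stand-in)
vision_likelihood = gaussian(23, 24, sigma=2)     # precise but possibly ambiguous detection

posterior = rfid_likelihood * vision_likelihood
posterior /= posterior.sum()
tag_cell = np.unravel_index(posterior.argmax(), posterior.shape)   # fused location estimate
```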

Spacetime Forests with Complementary Features for Dynamic Scene Recognition
Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes
British Machine Vision Conference (BMVC) 2013 (Oral)
This paper presents spacetime forests defined over complementary spatial and temporal features for recognition of naturally occurring dynamic scenes. The approach improves on the previous state-of-the-art in both classification and execution rates. A particular improvement is with increased robustness to camera motion, where previous approaches have experienced difficulty. There are three key novelties in the approach. First, a novel spacetime descriptor is employed that exploits the complementary nature of spatial and temporal information, as inspired by previous research on the role of orientation features in scene classification. Second, a forest-based classifier is used to learn a multi-class representation of the feature distributions. Third, the video is processed in temporal slices with scale matched preferentially to scene dynamics over camera motion. Slicing allows for temporal alignment to be handled as latent information in the classifier and for efficient, incremental processing. The integrated approach is evaluated empirically on two publicly available datasets to document its outstanding performance.

Spatio-Temporal Good Features to Track
Christoph Feichtenhofer, Axel Pinz
Workshop on Computer Vision for Autonomous Driving, International Conference on Computer Vision (ICCV) 2013
This paper presents two fundamental contributions that can be very useful for any autonomous system that requires point correspondences for visual odometry. First, the Spatio-Temporal Monitor (STM) is an efficient method to identify good features to track by monitoring their spatiotemporal (x-y-t) appearance without any assumptions about motion or geometry. The STM may be used with any spatial (x-y) descriptor, but it performs best when combined with our second contribution, the Histogram of Oriented Magnitudes (HOM) descriptor, which is based on spatially oriented multiscale filter magnitudes. To fulfil the real-time requirements of autonomous applications, the same descriptor can be used for both track generation and monitoring, to identify low-quality feature tracks at virtually no additional computational cost. Our extensive experimental validation on a challenging public dataset demonstrates the excellent performance of STM and HOM, where we significantly outperform the well known “Good Features to Track” method and show that our proposed feature quality measure highly correlates with the accuracy in structure and motion estimation.

A Perceptual Image Sharpness Metric Based on Local Edge Gradient Analysis
Christoph Feichtenhofer, Hannes Fassold, Peter Schallauer
IEEE Signal Processing Letters 2013
In this letter, a no-reference perceptual sharpness metric based on a statistical analysis of local edge gradients is presented. The method takes properties of the human visual system into account. Based on perceptual properties, a relationship between the extracted statistical features and the metric score is established to form a Perceptual Sharpness Index (PSI). A comparison with state-of-the-art metrics shows that the proposed method correlates highly with human perception and exhibits low computational complexity. In contrast to existing metrics, the PSI performs well for a wide range of blurriness and shows a high degree of invariance for different image contents.
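As a rough, hedged stand-in for the metric (not the published PSI formulation), the sketch below scores sharpness by the average gradient magnitude along the strongest edge pixels.

```python
# Simplified edge-gradient sharpness proxy: mean gradient magnitude over strong edges.
# This is only a rough illustration of the idea, not the PSI as published.
import numpy as np
from scipy import ndimage

def edge_gradient_sharpness(gray):
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    mag = np.hypot(gx, gy)
    edges = mag > np.percentile(mag, 95)          # keep only the strongest edge pixels
    return mag[edges].mean() if edges.any() else 0.0

img = np.random.rand(240, 320)                    # stand-in for a grayscale image in [0, 1]
print(edge_gradient_sharpness(img))
```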