We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is composed of distinct entities that undergo motion, and we present an approach that operationalizes this insight by implicitly predicting the future states of independent entities while reasoning about their interactions, and by composing the predicted states into future video frames. We overcome the inherent multi-modality of the task using global trajectory-level latent random variables, and show that this allows us to sample more diverse and plausible futures than commonly used per-timestep latent variable models. We empirically validate our approach against alternate representation choices and ways of incorporating multi-modality. We examine two datasets, one comprising stacked objects that may fall and another containing videos of humans performing activities in a gym, and show that our approach enables realistic stochastic video prediction across these diverse settings.
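To make the distinction concrete, the sketch below contrasts a single trajectory-level latent (sampled once and reused over the whole rollout) with per-timestep latents (sampled anew at every step). This is only an illustrative outline; `predictor` and `prior` are placeholder interfaces, not the released code.

```python
def rollout_global_latent(entities, predictor, prior, horizon):
    """Sample one latent for the whole trajectory, then roll out with it."""
    u = prior.sample()                      # global, trajectory-level latent
    states = []
    for _ in range(horizon):
        entities = predictor(entities, u)   # the same latent is reused at every step
        states.append(entities)
    return states


def rollout_per_step_latents(entities, predictor, prior, horizon):
    """Baseline: draw an independent latent at every time step."""
    states = []
    for _ in range(horizon):
        u_t = prior.sample()                # fresh latent each step
        entities = predictor(entities, u_t)
        states.append(entities)
    return states
```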
Method Overview
Our model takes as input an image with known or detected locations of entities. Each entity is represented by its location and an implicit feature. Given the current entity representations and a sampled latent variable, our prediction module predicts the representations at the next time step. A learned decoder composes the predicted representations into an image depicting the predicted future. During training, a latent encoder module infers the distribution over the latent variables from the initial and final frames.
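The overall computation can be sketched roughly as follows. This is a minimal, assumed-interface sketch (the module names, the mean-pooled interaction summary, and the flat image decoder are illustrative simplifications, not the actual architecture):

```python
import torch
import torch.nn as nn


class CompositionalPredictor(nn.Module):
    """Sketch: predict per-entity futures, then compose them into a frame."""

    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        # Updates each entity's (location, feature) given the others and the latent.
        self.entity_predictor = nn.Linear(feat_dim + latent_dim, feat_dim)
        # Composes the predicted entity representations into an image.
        self.frame_decoder = nn.Linear(feat_dim, 64 * 64 * 3)
        # Training only: infers q(u | initial frame, final frame).
        self.latent_encoder = nn.Linear(2 * feat_dim, 2 * latent_dim)

    def step(self, entities, u):
        # entities: (num_entities, feat_dim); u: (latent_dim,)
        ctx = entities.mean(dim=0, keepdim=True)                  # crude interaction summary
        inp = torch.cat([entities + ctx,
                         u.expand(entities.size(0), -1)], dim=-1)
        return self.entity_predictor(inp)                         # next-step entity states

    def infer_latent(self, first_entities, last_entities):
        # Training only: reparameterized sample from the inferred latent distribution.
        stats = self.latent_encoder(
            torch.cat([first_entities.mean(0), last_entities.mean(0)], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, entities, u, horizon):
        frames = []
        for _ in range(horizon):
            entities = self.step(entities, u)
            frames.append(self.frame_decoder(entities.mean(dim=0)))
        return torch.stack(frames)                                # (horizon, 64*64*3)
```

At test time only the prior over the latent is sampled; the encoder path is used during training to supervise the predicted rollout against the observed future.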
Generalization to more blocks (trained with 3 blocks)
Visualization of five randomly sampled future predictions
Penn Action Results
Visualization of three randomly sampled future predictions
Acknowledgements
We would like to thank the members of the CMU Visual and Robot Learning group for fruitful discussions and helpful comments. This webpage template was borrowed from some GAN folks.