PixelTransformer: Sample Conditioned Signal Generation
Overview. Left: Our model takes as input a set of observed pixel locations and values {(x_k, v_k)} and can then predict the value distribution for any query position x. Right: The model can be trained in a self-supervised manner by drawing random samples from training images and maximizing the likelihood of the true values at random query locations. Our approach allows us to autoregressively model distributions over many spatial signals (e.g. images, shapes, videos, polynomials) conditioned on a sparse set of sample observations (e.g. pixels).
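To make this setup concrete, the snippet below is a minimal training-step sketch, not the authors' released implementation: it assumes a hypothetical model exposing the interface model(obs_pos, obs_val, qry_pos) -> logits together with an n_bins attribute for discretized values, and uses a crude per-pixel scalar binning purely for illustration.

import torch

def training_step(model, images, n_obs=32, n_query=64):
    # images: (B, C, H, W) with values in [0, 1].
    B, C, H, W = images.shape

    def sample_pixels(n):
        # Draw n random pixel indices per image; return normalized
        # (x, y) positions and the corresponding pixel values.
        idx = torch.randint(0, H * W, (B, n), device=images.device)
        ys, xs = idx // W, idx % W
        pos = torch.stack([xs / (W - 1), ys / (H - 1)], dim=-1)   # (B, n, 2)
        flat = images.flatten(2)                                  # (B, C, H*W)
        val = flat.gather(2, idx.unsqueeze(1).expand(-1, C, -1)).transpose(1, 2)
        return pos, val                                           # (B, n, C)

    obs_pos, obs_val = sample_pixels(n_obs)    # conditioning set {(x_k, v_k)}
    qry_pos, qry_val = sample_pixels(n_query)  # random query locations

    # Hypothetical interface: a distribution over discretized values per query.
    logits = model(obs_pos, obs_val, qry_pos)  # (B, n_query, model.n_bins)

    # Crude scalar binning of the true values, for illustration only.
    target = (qry_val.mean(-1) * (model.n_bins - 1)).round().long()
    return torch.nn.functional.cross_entropy(logits.transpose(1, 2), target)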
Image Completion. Top: Ground-truth image. Bottom: Three random samples generated by our approach given 32 observed pixels (visualized in the initial frame of the animation).
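One plausible way to draw such conditional samples is the autoregressive loop sketched below; it assumes the same hypothetical interface as the training sketch above, drawing one value per query and appending it to the conditioning set before the next prediction.

import torch

@torch.no_grad()
def sample_autoregressive(model, obs_pos, obs_val, qry_pos):
    # obs_pos/obs_val: the observed pixels; qry_pos: the remaining pixels
    # in some generation order. Returns sampled values at qry_pos.
    pos, val = obs_pos, obs_val
    out = []
    for q in qry_pos.split(1, dim=1):              # one query location at a time
        logits = model(pos, val, q)                # (B, 1, n_bins)
        bins = torch.distributions.Categorical(logits=logits).sample()
        v = (bins.float() / (model.n_bins - 1)).unsqueeze(-1)
        v = v.expand(-1, -1, val.shape[-1])        # de-quantize; copy to channels
        pos = torch.cat([pos, q], dim=1)           # grow the conditioning set
        val = torch.cat([val, v], dim=1)
        out.append(v)
    return torch.cat(out, dim=1)                   # (B, n_query, C)

Running this loop several times with different random seeds yields diverse completions like those shown above.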
Shape Generation. Left: Ground-truth 3D shape and the locations of the 32 input SDF samples. Right: Conditionally generated sample shapes using our approach.
Polynomial Prediction. Given evaluations of a degree-6 polynomial (green) at a sparse set of points (red), our model allows sampling diverse possible functions (yellow).
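The 1D setting is easy to reproduce for experimentation; the generator below is an assumed recipe (the coefficient prior, input range, and sample counts are our guesses rather than the paper's settings) for producing sparse observations of a random degree-6 polynomial together with a dense query grid.

import numpy as np

def sample_polynomial_task(n_obs=5, degree=6, seed=None):
    # Draw a random degree-6 polynomial and evaluate it at a sparse set
    # of observed points (the red dots) and a dense grid of query points.
    rng = np.random.default_rng(seed)
    coeffs = rng.normal(scale=1.0, size=degree + 1)   # assumed coefficient prior
    poly = np.polynomial.Polynomial(coeffs)
    x_obs = rng.uniform(-1.0, 1.0, size=n_obs)        # sparse observations
    x_qry = np.linspace(-1.0, 1.0, 200)               # dense evaluation grid
    return (x_obs, poly(x_obs)), (x_qry, poly(x_qry))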
Video Synthesis. Given a total of 1024 observed pixels across 30 frames, our model can generate plausible videos that capture the coarse motion.
We would like to thank Deepak Pathak and the members of the CMU Visual Robot Learning lab for helpful discussions and feedback. This webpage template was borrowed from some colorful folks.
[Video Synthesis figure rows: GT | Nearest Neighbor Visualization of Initially Observed Pixels | Generated Videos]