I am exploring spatial intelligence at World Labs.
I am broadly interested in computer vision, machine learning, and cognitive science.
My goal is to understand how visual agents learn to represent their world with minimal supervision and easily generalize to novel objects and scenes.
Probing the 3D Awareness of Visual Foundation Models
TL;DR: Visual foundation models can classify, delineate, and localize objects in 2D. We study how well these models represent the 3D world that images depict.
Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, but their intermediate representations are also useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure. In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models.
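For readers curious what a probe looks like in practice, here is a minimal sketch of a dense probe trained on frozen features. The backbone interface, feature shapes, and loss are illustrative assumptions, not the exact configuration from the paper.

```python
# Minimal sketch: train a dense probe on frozen features for monocular depth.
# The backbone, feature shape, and loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthProbe(nn.Module):
    """A lightweight 1x1-conv probe on top of a frozen feature map."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats):          # feats: (B, C, H', W') frozen features
        return self.head(feats)        # (B, 1, H', W') predicted depth

def probe_loss(pred, depth_gt):
    """L1 loss after resizing the prediction to the ground-truth resolution."""
    pred = F.interpolate(pred, size=depth_gt.shape[-2:], mode="bilinear",
                         align_corners=False)
    return F.l1_loss(pred, depth_gt)

# Usage sketch: `backbone` is any frozen visual foundation model that returns
# a (B, C, H', W') feature map; only the probe's parameters are optimized.
# for images, depth_gt in loader:
#     with torch.no_grad():
#         feats = backbone(images)
#     loss = probe_loss(probe(feats), depth_gt)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```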
@inproceedings{elbanani2024probe3d,
  title     = {{Probing the 3D Awareness of Visual Foundation Models}},
  author    = {El Banani, Mohamed and Raj, Amit and Maninis, Kevis-Kokitsi and Kar, Abhishek and Li, Yuanzhen and Rubinstein, Michael and Sun, Deqing and Guibas, Leonidas and Johnson, Justin and Jampani, Varun},
  booktitle = {CVPR},
  year      = {2024},
}
Learning Visual Representations via Language-Guided Sampling
TL;DR: A picture is worth a thousand words, but a caption can describe a thousand images. We use language models to find image pairs with similar captions, and use them for stronger contrastive learning.
Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches.
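A minimal sketch of the sampling idea, assuming caption embeddings have already been computed with some pretrained sentence encoder; the pairing rule and InfoNCE loss below are illustrative, not the paper's exact recipe.

```python
# Minimal sketch of language-guided sampling: pair each image with the image
# whose caption embedding is most similar, then contrast the paired images.
import torch
import torch.nn.functional as F

def nearest_caption_pairs(caption_emb):
    """caption_emb: (N, D) precomputed caption embeddings.
    Returns, for each image, the index of its language nearest neighbor."""
    emb = F.normalize(caption_emb, dim=-1)
    sim = emb @ emb.t()                              # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))                # exclude self-pairs
    return sim.argmax(dim=-1)

def info_nce(z_a, z_b, temperature=0.07):
    """Standard InfoNCE between two batches of image embeddings (positives on the diagonal)."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

# Usage sketch: image i and image pair_idx[i] are treated as two "views" of
# the same concept and pulled together by the contrastive loss.
# pair_idx = nearest_caption_pairs(caption_emb)
# loss = info_nce(encoder(images), encoder(images[pair_idx]))
```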
@inproceedings{elbanani2023lgssl,
  title     = {{Learning Visual Representations via Language-Guided Sampling}},
  author    = {El Banani, Mohamed and Desai, Karan and Johnson, Justin},
  booktitle = {CVPR},
  year      = {2023},
}
Self-Supervised Correspondence Estimation via Multiview Registration
TL;DR: Self-supervised correspondence estimation struggles with wide-baseline images. We use multiview registration and SE(3) transformation synchronization to leverage long-term consistency in RGB-D video.
Video provides us with the spatio-temporal consistency needed for visual learning. Recent approaches have utilized this signal to learn correspondence estimation from close-by frame pairs. However, by only relying on close-by frame pairs, those approaches miss out on the richer long-range consistency between distant overlapping frames. To address this, we propose a self-supervised approach for correspondence estimation that learns from multiview consistency in short RGB-D video sequences. Our approach combines pairwise correspondence estimation and registration with a novel SE(3) transformation synchronization algorithm. Our key insight is that self-supervised multiview registration allows us to obtain correspondences over longer time frames, increasing both the diversity and difficulty of sampled pairs. We evaluate our approach on indoor scenes for correspondence estimation and RGB-D pointcloud registration and find that we perform on par with supervised approaches.
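The pairwise registration step can be sketched as a weighted Procrustes (Kabsch) solve over estimated correspondences; the SE(3) synchronization that aggregates these pairwise estimates across the sequence is not shown here. Names and shapes below are illustrative assumptions.

```python
# Minimal sketch of the pairwise step: a weighted Procrustes (Kabsch) solve
# that maps weighted 3D correspondences to a relative SE(3) transform.
import torch

def weighted_procrustes(src, tgt, w):
    """src, tgt: (N, 3) corresponding points; w: (N,) non-negative weights.
    Returns R (3, 3) and t (3,) such that R @ src[i] + t ~= tgt[i]."""
    w = w / w.sum().clamp(min=1e-8)
    src_c = (w[:, None] * src).sum(0)             # weighted centroids
    tgt_c = (w[:, None] * tgt).sum(0)
    src0, tgt0 = src - src_c, tgt - tgt_c
    cov = (w[:, None] * src0).t() @ tgt0          # (3, 3) weighted covariance
    U, _, Vt = torch.linalg.svd(cov)
    d = torch.sign(torch.det(Vt.t() @ U.t()))     # keep a proper rotation (det = +1)
    D = torch.eye(3, device=src.device, dtype=src.dtype)
    D[2, 2] = d
    R = Vt.t() @ D @ U.t()
    t = tgt_c - R @ src_c
    return R, t
```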
@inproceedings{elbanani2023syncmatch,
  title     = {Self-Supervised Correspondence Estimation via Multiview Registration},
  author    = {El Banani, Mohamed and Rocco, Ignacio and Novotny, David and Vedaldi, Andrea and Neverova, Natalia and Johnson, Justin and Graham, Ben},
  booktitle = {WACV},
  year      = {2023},
}
Bootstrap Your Own Correspondences
TL;DR: Good features give us accurate correspondences, and accurate correspondences are good for feature learning. We leverage this to learn {visual, geometric} features via self-supervised point cloud registration.
Geometric feature extraction is a crucial component of point cloud registration pipelines. Recent work has demonstrated how supervised learning can be leveraged to learn better and more compact 3D features. However, those approaches' reliance on ground-truth annotation limits their scalability. We propose BYOC: a self-supervised approach that learns visual and geometric features from RGB-D video without relying on ground-truth pose or correspondence. Our key observation is that randomly-initialized CNNs readily provide us with good correspondences, allowing us to bootstrap the learning of both visual and geometric features. Our approach combines classic ideas from point cloud registration with more recent representation learning approaches. We evaluate our approach on indoor scene datasets and find that our method outperforms traditional and learned descriptors, while being competitive with current state-of-the-art supervised approaches.
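A minimal sketch of the key observation, assuming a small untrained CNN and mutual nearest-neighbor matching; the architecture and shapes are illustrative, not the ones used in BYOC.

```python
# Minimal sketch: features from a randomly-initialized (untrained) CNN,
# matched by mutual nearest neighbors, already give usable pseudo-correspondences.
import torch
import torch.nn as nn
import torch.nn.functional as F

random_cnn = nn.Sequential(                  # untrained, randomly-initialized encoder
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, stride=2, padding=1),
)

@torch.no_grad()
def mutual_nn_matches(img_a, img_b):
    """img_a, img_b: (3, H, W). Returns index pairs of mutual nearest-neighbor matches."""
    fa = random_cnn(img_a[None]).flatten(2)[0].t()   # (Na, C) per-location features
    fb = random_cnn(img_b[None]).flatten(2)[0].t()   # (Nb, C)
    fa, fb = F.normalize(fa, dim=-1), F.normalize(fb, dim=-1)
    sim = fa @ fb.t()                                # (Na, Nb) cosine similarity
    ab = sim.argmax(dim=1)                           # best match in B for each A
    ba = sim.argmax(dim=0)                           # best match in A for each B
    idx_a = torch.arange(sim.size(0), device=sim.device)
    keep = ba[ab] == idx_a                           # keep only mutual (cycle-consistent) matches
    return idx_a[keep], ab[keep]
```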
@inproceedings{elbanani2021bootstrap,
  title     = {Bootstrap your own correspondences},
  author    = {El Banani, Mohamed and Johnson, Justin},
  booktitle = {ICCV},
  year      = {2021},
}
UnsupervisedR&R: Unsupervised Point Cloud Registration via Differentiable Rendering
TL;DR: Can we learn point cloud registration from RGB-D video? We propose a register-and-render approach that learns by minimizing photometric and geometric losses between close-by frames.
Aligning partial views of a scene into a single whole is essential to understanding one’s environment and is a key component of numerous robotics tasks such as SLAM and SfM. Recent approaches have proposed end-to-end systems that can outperform traditional methods by leveraging pose supervision. However, with the rising prevalence of cameras with depth sensors, we can expect a new stream of raw RGB-D data without the annotations needed for supervision. We propose UnsupervisedR&R: an end-to-end unsupervised approach to learning point cloud registration from raw RGB-D video. The key idea is to leverage differentiable alignment and rendering to enforce photometric and geometric consistency between frames. We evaluate our approach on indoor scene datasets and find that we outperform existing traditional approaches with classic and learned descriptors while being competitive with supervised geometric point cloud registration approaches.
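The photometric term can be sketched as back-projecting frame A's depth, transforming it with the estimated relative pose, projecting into frame B, and comparing sampled colors. The sketch below assumes pinhole intrinsics and omits occlusion handling and the geometric loss, so it is an illustration rather than the paper's full pipeline.

```python
# Minimal sketch of a photometric consistency loss between two RGB-D frames.
import torch
import torch.nn.functional as F

def photometric_loss(img_a, img_b, depth_a, K, R, t):
    """img_*: (3, H, W); depth_a: (H, W); K: (3, 3) intrinsics; R, t: pose A->B."""
    H, W = depth_a.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # (3, HW)
    pts_a = torch.linalg.inv(K) @ pix * depth_a.reshape(1, -1)              # back-project to 3D
    pts_b = R @ pts_a + t[:, None]                                          # move into frame B
    proj = K @ pts_b
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                                # perspective divide
    # Normalize pixel coordinates to [-1, 1] and resample frame B at those locations.
    grid = torch.stack([2 * uv[0] / (W - 1) - 1, 2 * uv[1] / (H - 1) - 1], dim=-1)
    grid = grid.reshape(1, H, W, 2)
    img_b_warp = F.grid_sample(img_b[None], grid, align_corners=True)       # (1, 3, H, W)
    return F.l1_loss(img_b_warp[0], img_a)
```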
@inproceedings{elbanani2021unsupervisedrr,
  title     = {UnsupervisedR\&R: Unsupervised Point Cloud Registration via Differentiable Rendering},
  author    = {El Banani, Mohamed and Gao, Luya and Johnson, Justin},
  booktitle = {CVPR},
  year      = {2021},
}
Novel Object Viewpoint Estimation through Reconstruction Alignment
TL;DR: Humans cannot help but see the 3D structure of novel objects, so aligning their viewpoints becomes very easy. We propose a reconstruct-and-align approach for novel object viewpoint estimation.
The goal of this paper is to estimate the viewpoint for a novel object. Standard viewpoint estimation approaches generally fail on this task due to their reliance on either a 3D model for alignment or large amounts of class-specific training data with a corresponding canonical pose. We overcome those limitations by learning a reconstruct-and-align approach. Our key insight is that although we do not have an explicit 3D model or a predefined canonical pose, we can still learn to estimate the object's shape in the viewer's frame and then use an image to provide our reference model or canonical pose. In particular, we propose learning two networks: the first maps images to a 3D geometry-aware feature bottleneck and is trained via an image-to-image translation loss; the second learns whether two instances of features are aligned. At test time, our model finds the relative transformation that best aligns the bottleneck features of our test image to a reference image. We evaluate our method on novel object viewpoint estimation by generalizing across different datasets, analyzing the impact of our different modules, and providing a qualitative analysis of the learned features to identify what representations are being learned for alignment.
@inproceedings{elbanani2020novel,
  title     = {Novel Object Viewpoint Estimation through Reconstruction Alignment},
  author    = {El Banani, Mohamed and Corso, Jason J and Fouhey, David F},
  booktitle = {CVPR},
  year      = {2020},
}
A Computational Exploration of Problem-Solving Strategies and Gaze Behaviors on the Block Design Task.
TL;DR: We present a computational architecture to model problem-solving strategies on the block design task. We generate detailed behavioral predictions and analyze cross-strategy error patterns.
The block design task, a standardized test of nonverbal reasoning, is often used to characterize atypical patterns of cognition in individuals with developmental or neurological conditions. Many studies suggest that, in addition to looking at quantitative differences in block design speed or accuracy, observing qualitative differences in individuals’ problem-solving strategies can provide valuable information about a person’s cognition. However, it can be difficult to tie theories at the level of problem-solving strategy to predictions at the level of externally observable behaviors such as gaze shifts and patterns of errors. We present a computational architecture that is used to compare different models of problem-solving on the block design task and to generate detailed behavioral predictions for each different strategy. We describe the results of three different modeling experiments and discuss how these results provide greater insight into the analysis of gaze behavior and error patterns on the block design task.
@inproceedings{kunda2016computational,
  title     = {A Computational Exploration of Problem-Solving Strategies and Gaze Behaviors on the Block Design Task.},
  author    = {Kunda, Maithilee and El Banani, Mohamed and Rehg, James M},
  booktitle = {CogSci},
  volume    = {38},
  year      = {2016},
}
A Pilot Study of a Modified Bathroom Scale To Monitor Cardiovascular Hemodynamic in Pregnancy
Odayme Quesada, Mohamed El Banani, James Heller, Shire Beach, Mozziyar Etemadi, Shuvo Roy, Omer Inan, Juan Gonzalez, and Liviu Klein
Journal of the American College of Cardiology, 2016
TL;DR: We use ballistocardiogram measurements extracted from a modified bathroom scale to analyze maternal cardiovascular adaptation during pregnancy for low-cost detection of preeclampsia.
@article{quesada2016pilot,
  title     = {A Pilot Study of a Modified Bathroom Scale To Monitor Cardiovascular Hemodynamic in Pregnancy},
  author    = {Quesada, Odayme and El Banani, Mohamed and Heller, James and Beach, Shire and Etemadi, Mozziyar and Roy, Shuvo and Inan, Omer and Gonzalez, Juan and Klein, Liviu},
  journal   = {Journal of the American College of Cardiology},
  volume    = {67},
  number    = {13S},
  year      = {2016},
  publisher = {American College of Cardiology Foundation Washington, DC},
}
Three-dimensional particle tracking in microfluidic channel flow using in and out of focus diffraction
Bushra Tasadduq, Gonghao Wang, Mohamed El Banani, Wenbin Mao, Wilbur Lam, Alexander Alexeev, and Todd Sulchek
Flow Measurement and Instrumentation, 2015
TL;DR: We use defocusing patterns to extract 3D particle motion trajectories in 2D bright field videos of microfluidic devices.
Three-dimensional particle tracking is important to accurately understand the motion of particles within complex flow fields. We show that three-dimensional trajectories of particles within microfluidic flow can be extracted from two-dimensional bright-field video microscopy. The method utilizes the defocusing that occurs as particles move out of the objective focal plane when viewed through a high numerical aperture objective lens. A fast and simple algorithm based on cross-correlation to a set of reference images taken at prescribed amounts of defocus is used to extract out-of-plane particle position. In-plane particle position is determined through center-point detection, and therefore the particle position in all three dimensions can be constructed at each time point. Particle trajectories at flow velocities greater than 2 mm/s can be tracked by utilizing a high-speed camera to obtain unblurred images. Three-dimensional computational fluid simulations are used to validate the particle tracking methods.
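A minimal sketch of the defocus lookup, assuming a cropped particle patch and a stack of reference images at known defocus; the zero-shift normalized cross-correlation below is a simplified stand-in for the full algorithm.

```python
# Minimal sketch: estimate out-of-plane (z) position by comparing a particle
# patch against reference images taken at known defocus values and picking
# the best normalized cross-correlation. Data shapes are assumptions.
import numpy as np

def estimate_z(particle_patch, reference_stack, reference_z):
    """particle_patch: (h, w) crop centered on the detected particle;
    reference_stack: (K, h, w) templates at known defocus; reference_z: (K,).
    Returns the z value of the best-matching reference image."""
    p = particle_patch - particle_patch.mean()
    p_norm = np.linalg.norm(p) + 1e-12
    scores = []
    for ref in reference_stack:
        r = ref - ref.mean()
        scores.append(float((p * r).sum() / (p_norm * (np.linalg.norm(r) + 1e-12))))
    return reference_z[int(np.argmax(scores))]
```

In-plane (x, y) position comes from the center-point detection described above, so combining it with the estimated z gives the full 3D position per frame.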
@article{tasadduq2015three,
  title     = {Three-dimensional particle tracking in microfluidic channel flow using in and out of focus diffraction},
  author    = {Tasadduq, Bushra and Wang, Gonghao and El Banani, Mohamed and Mao, Wenbin and Lam, Wilbur and Alexeev, Alexander and Sulchek, Todd},
  journal   = {Flow Measurement and Instrumentation},
  volume    = {45},
  year      = {2015},
  publisher = {Elsevier},
}