Hello, I'm Sifei Liu
I currently hold the position of a staff-level Senior Research Scientist at NVIDIA, where I am part of the LPR team led by Jan Kautz. My work primarily revolves around the development of generalizable visual representation learning for images, videos, and 3D content. Prior to this, I pursued my Ph.D. at the VLLAB, under the guidance of Ming-Hsuan Yang.
Over the years, I’ve been fortunate to receive several prestigious awards and recognitions. In 2013, I was honored with the Baidu Graduate Fellowship. This was followed by the NVIDIA Pioneering Research Award in 2017, and the Rising Star EECS accolade in 2019. Additionally, I was nominated for the VentureBeat Women in AI Award in 2020.
News
- Mar 2025: SpatialRGPT was demoed at GTC 2025 as part of Agentic AI for Physical Operations!
- Feb 2025: We released GSPN, a fast vision attention module that accelerates Stable Diffusion inference by 84x. Stay tuned for more details!
- Feb 2025: 5 papers were accepted to CVPR 2025! Stay tuned for more updates!
- Jan 2025: We released NaVILA, a navigation agent that follows language instructions to navigate 3D environments.
- Dec 2024: We presented CosAE at NeurIPS 2024! Stay tuned for the code release.
- Oct 2024: We released the SpatialRGPT code, datasets, and models! Feel free to try the demos!
Recent Research
A full list of publications can be found on Google Scholar and in my CV.
Parallel Sequence Modeling via Generalized Spatial Propagation Network
GSPN is a fast vision attention module that accelerates generic vision foundation models for high-resolution input images.
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
NaVILA is a two-level framework that combines VLAs with locomotion skills for navigation. It generates high-level language-based commands, while a real-time locomotion policy ensures obstacle avoidance.
NVILA: Efficient Frontier Visual Language Models
NVILA is a family of frontier VLMs designed for efficient training and inference.
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
SpatialRGPT is a grounded spatial reasoning model that can reason about spatial relationships in images.
TUVF: Learning Generalizable Texture UV Radiance Fields
The paper introduces TUVF, a method for learning generalizable texture UV radiance fields.
Open-vocabulary panoptic segmentation with text-to-image diffusion models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation.
Learning Continuous Image Representation with Local Implicit Image Function
The paper presents a method for learning continuous image representation with local implicit image function.
Learning 3D Dense Correspondence via Canonical Point Autoencoder
The paper presents a method for learning 3D dense correspondence using a canonical point autoencoder.
Joint-task self-supervised learning for temporal correspondence
This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner.