I am a Senior Applied Research Scientist on the NVIDIA TAO team. My mission is to build 3D into AI foundations, with a current focus on grounding 3D VLMs in the Embodied AI domain. Previously, I was an Applied Scientist at Amazon AGI, working on 3D vision problems and diffusion-based video generation. I obtained my Ph.D. in Computer Engineering at Virginia Tech, advised by Prof. A. Lynn Abbott, with research focused on deep 3D representation learning for dynamic scene understanding. I am interested in AR/VR, Embodied AI, and robotics.
During the summer of 2019, I was fortunate to work with Prof. Shuran Song (now Stanford University), Dr. He Wang (now Peking University), Dr. Li Yi (then Google Research, now Tsinghua University), and Johnny Chung Lee (Google Brain Robotics) as a student ML researcher at Google Brain Robotics in Mountain View. In spring 2020, I did a research internship on 3D perception at MERL, mentored by Prof. Siheng Chen (now Shanghai Jiao Tong University) and Dr. Alan Sullivan (MERL). In summer 2021, I worked with Dr. Ishani Chakraborty (HoloLens), Dr. Yale Song (MSR), and Dr. Bugra Tekin (HoloLens) during a research internship. I have also worked with Prof. Yunhui Zhu (VT 3D Optics Group) on X-ray phase imaging.
News
May 16, 2023
Named an Outstanding Reviewer for CVPR 2023
Jun 27, 2022
Joined AWS AI as an applied scientist working on 3D Vision!
Sep 28, 2021
My first submission to NeurIPS 2021 was accepted; check out the paper here!
May 17, 2021
Starting my research internship at HoloLens, Microsoft
Sep 21, 2020
Our method ranked 3rd on the SemanticKITTI multi-sweep semantic segmentation challenge!
Mar 13, 2020
One paper accepted to CVPR 2020 as Oral presentation!
We propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that accurately follow complex, compositional text prompts while achieving high fidelity, by using a pre-trained multi-view diffusion model.
We tackle the challenging task of jointly tracking hand and object poses and reconstructing their shapes from in-the-wild depth point cloud sequences, given the initial poses at frame 0.
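For context on this tracking setup, below is a minimal classical baseline: frame-to-frame rigid registration chained onto the given initial pose. This is an illustrative sketch, not the paper's learned method; it assumes point correspondences between consecutive frames are already known (in practice they would come from nearest neighbors or a learned matcher), and all names (`kabsch`, `track_pose`) are hypothetical.

```python
import numpy as np

def kabsch(src, dst):
    # Closed-form least-squares rigid alignment of src onto dst (Kabsch
    # algorithm); assumes corresponding rows of src and dst match.
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # reflection-safe rotation
    return R, dst_c - R @ src_c

def track_pose(frames, R0, t0):
    # Chain the incremental rigid motion between consecutive depth point
    # clouds onto the initial pose (R0, t0) given at frame 0.
    R, t = R0, t0
    poses = [(R, t)]
    for prev, curr in zip(frames, frames[1:]):
        dR, dt = kabsch(prev, curr)
        R, t = dR @ R, dR @ t + dt
        poses.append((R, t))
    return poses
```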
Category-level object pose estimation aims to find 6D object poses of previously unseen object instances from known categories without access to object CAD models. To reduce the huge amount of pose annotations needed for category-level learning, we propose for the first time a self-supervised learning framework to estimate category-level 6D object pose from single 3D point clouds. During training, our method assumes no ground-truth pose annotations, no CAD models, and no multi-view supervision. The key to our method is to disentangle shape and pose through an invariant shape reconstruction module and an equivariant pose estimation module, empowered by SE(3) equivariant point cloud networks. The invariant shape reconstruction module learns to perform aligned reconstructions, yielding a category-level reference frame without using any annotations. In addition, the equivariant pose estimation module achieves category-level pose estimation accuracy that is comparable to some fully supervised methods. Extensive experiments demonstrate the effectiveness of our approach on both complete and partial depth point clouds from the ModelNet40 benchmark, and on real depth point clouds from the NOCS-REAL 275 dataset.
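To make the invariance/equivariance split concrete, here is a toy numerical check (illustrative only; the paper uses learned SE(3)-equivariant networks rather than these handcrafted stand-ins). An invariant descriptor is unchanged by any rigid transform of the input, while an equivariant quantity transforms along with it:

```python
import numpy as np

def random_rotation(rng):
    # QR of a Gaussian matrix gives a random orthogonal matrix; flip a
    # column if needed so det = +1 (a proper rotation).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

def invariant_feature(points):
    # Toy stand-in for the invariant shape module: sorted pairwise
    # distances are unchanged by any rigid transform of the cloud.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d[np.triu_indices(len(points), k=1)])

def equivariant_pose(points):
    # Toy stand-in for the equivariant pose module: the centroid obeys
    # centroid(R @ X + t) = R @ centroid(X) + t.
    return points.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
R, t = random_rotation(rng), rng.normal(size=3)
Y = X @ R.T + t  # rigidly transformed copy of the same shape

assert np.allclose(invariant_feature(X), invariant_feature(Y))        # shape: invariant
assert np.allclose(R @ equivariant_pose(X) + t, equivariant_pose(Y))  # pose: equivariant
```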
This paper addresses the task of category-level pose estimation for articulated objects from a single depth image. We present a novel category-level approach that correctly accommodates object instances previously unseen during training. We introduce Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH), a canonical representation for different articulated objects in a given category. As the key to achieving intra-category generalization, the representation constructs a canonical object space as well as a set of canonical part spaces. The canonical object space normalizes the object orientation, scales, and articulations (e.g., joint parameters and states), while each canonical part space further normalizes its part pose and scale. We develop a deep network based on PointNet++ that predicts ANCSH from a single depth point cloud, including part segmentation, normalized coordinates, and joint parameters in the canonical object space. By leveraging the canonicalized joints, we demonstrate: 1) improved performance in part pose and scale estimation using the induced kinematic constraints from joints; 2) high accuracy for joint parameter estimation in camera space.
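A rough sketch of what the per-point prediction heads could look like follows. The feature dimension, number of parts, and six-number joint parameterization are assumptions for illustration; a random tensor stands in for the PointNet++ per-point features the abstract describes.

```python
import torch
import torch.nn as nn

class ANCSHHeads(nn.Module):
    # Illustrative per-point prediction heads in the spirit of ANCSH; the
    # actual model uses a PointNet++ backbone and the paper's exact output
    # parameterization, which this sketch does not reproduce.
    def __init__(self, feat_dim=128, num_parts=4):
        super().__init__()
        self.seg_head = nn.Linear(feat_dim, num_parts)       # part segmentation logits
        self.npcs_head = nn.Linear(feat_dim, 3 * num_parts)  # normalized coords per part
        self.joint_head = nn.Linear(feat_dim, 6)             # joint axis (3) + position (3)

    def forward(self, feats):  # feats: (B, N, feat_dim) per-point features
        B, N, _ = feats.shape
        seg = self.seg_head(feats)                           # (B, N, num_parts)
        npcs = self.npcs_head(feats).reshape(B, N, -1, 3)    # (B, N, num_parts, 3)
        joint = self.joint_head(feats)                       # (B, N, 6); per-point votes
        return seg, npcs, joint

# Dummy features standing in for a PointNet++ backbone over 1024 points:
feats = torch.randn(2, 1024, 128)
seg, npcs, joint = ANCSHHeads()(feats)
```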
Diabetic Retinopathy (DR) is the most common cause of avoidable vision loss, predominantly affecting the working-age population across the globe. Screening for DR, coupled with timely consultation and treatment, is a globally trusted policy to avoid vision loss. However, implementation of DR screening programs is challenging due to the scarcity of medical professionals able to screen a growing global diabetic population at risk for DR. Computer-aided disease diagnosis in retinal image analysis could provide a sustainable approach for such a large-scale screening effort. The recent scientific advances in computing capacity and machine learning approaches provide an avenue for biomedical scientists to reach this goal. Aiming to advance the state-of-the-art in automatic DR diagnosis, a grand challenge on "Diabetic Retinopathy – Segmentation and Grading" was organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI 2018). In this paper, we report the set-up and results of this challenge, which is primarily based on the Indian Diabetic Retinopathy Image Dataset (IDRiD). There were three principal sub-challenges: lesion segmentation, disease severity grading, and localization and segmentation of retinal landmarks. The multiple tasks in this challenge allow testing the generalizability of algorithms, which is what sets it apart from existing challenges. It received a positive response from the scientific community, with 148 submissions from 495 registrations effectively entering the challenge. This paper outlines the challenge, its organization, the dataset used, the evaluation methods, and the results of top-performing participating solutions. The top-performing approaches utilized a blend of clinical information, data augmentation, and an ensemble of models. These findings have the potential to enable new developments in retinal image analysis and image-based DR screening in particular.
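As a tiny illustration of the ensembling ingredient mentioned above (with purely hypothetical numbers and names; the actual winning pipelines are described in the paper), one simple form is to average per-model class probabilities before taking the predicted grade:

```python
import numpy as np

def ensemble_grade(prob_list):
    # Average class probabilities across models, then take the argmax grade;
    # a simple instance of the model ensembling reportedly used by top entries.
    return np.mean(prob_list, axis=0).argmax(axis=-1)

# Hypothetical softmax outputs from two models for 2 images x 5 DR grades (0-4):
p1 = np.array([[0.10, 0.20, 0.40, 0.20, 0.10],
               [0.70, 0.10, 0.10, 0.05, 0.05]])
p2 = np.array([[0.05, 0.15, 0.50, 0.20, 0.10],
               [0.60, 0.20, 0.10, 0.05, 0.05]])
print(ensemble_grade([p1, p2]))  # -> [2 0]
```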