I'm broadly interested in video and 3D vision, with the goal of enabling agents to perceive the physical world and reason over what they perceive. I'm also excited to explore other interesting topics (such as generative models).
3D visual grounding is a challenging task that often requires direct and dense supervision, notably semantic labels for every object in the scene.
In this paper, we instead study the naturally supervised setting, which learns only from 3D scene and QA pairs and in which prior works underperform.
We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting.
We propose TranSTR, which features a spatio-temporal rationalization (STR) module together with a more reasonable candidate-answer modeling strategy. The answer modeling strategy is independently verified to be effective in boosting other existing VideoQA models.
We propose RaFormer, a fully transformer-based VideoQA model that reduces neighboring-frame redundancy by highlighting object-level changes across adjacent frames and passing messages beyond the local neighborhood at the frame level. It also handles cross-modal redundancy via a novel adaptive sampling module.