Hello, I am a fourth-year PhD student in Computer Science at the University of North Carolina at Charlotte, advised by Dr. Srijan Das, and a member of the Charlotte Machine Learning Lab (CharMLab). My current focus is multi-modal learning in Vision Language Models (VLMs) for video understanding and robotic control. I have worked on tasks including ego-exo viewpoint transfer, cross-modal domain adaptation, and fine-grained action understanding. I am interested in developing simple, generalizable, and scalable methods.
Large Vision Language Models (VLMs) excel at general visual reasoning tasks, but their performance degrades sharply when deployed in novel domains with substantial distribution shifts from their pretraining data. Existing approaches to adapt VLMs to novel target domains rely on finetuning standard VLM components. Depending on which components are finetuned, these approaches either limit the VLM’s ability to learn domain-specific features or lead to catastrophic forgetting of pre-existing capabilities. To address this, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM’s vision encoder with a compact set of learnable visual probes, enabling domain-specific features to be learned with only minimal updates to the pretrained VLM components. We evaluate VisCoP across three challenging domain adaptation scenarios: cross-view (exocentric → egocentric), cross-modal (RGB → depth), and cross-task (human understanding → robot control). Our experiments demonstrate that VisCoP consistently outperforms existing domain adaptation strategies, achieving superior performance on the target domain while better retaining capabilities from the source domain.
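To give a concrete sense of the probing idea, here is a minimal PyTorch-style sketch of how a small set of learnable probe tokens could cross-attend to a frozen vision encoder's features. The module name, dimensions, and attention design are illustrative assumptions, not the actual VisCoP implementation.

```python
import torch
import torch.nn as nn

class VisionProbes(nn.Module):
    """Illustrative sketch (not the official VisCoP code): learnable probe
    tokens cross-attend to frozen vision-encoder features, so domain-specific
    signals can be captured without updating the backbone."""

    def __init__(self, num_probes: int = 16, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.probes = nn.Parameter(torch.randn(num_probes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, dim) from a frozen vision encoder
        batch = vision_tokens.size(0)
        queries = self.probes.unsqueeze(0).expand(batch, -1, -1)
        probed, _ = self.cross_attn(queries, vision_tokens, vision_tokens)
        # Append probe outputs to the original tokens before the LLM projector
        return torch.cat([vision_tokens, self.norm(probed)], dim=1)

# Only the probe parameters would receive gradients; the backbone stays frozen.
encoder_out = torch.randn(2, 196, 768)   # stand-in for frozen encoder features
tokens_for_llm = VisionProbes()(encoder_out)   # (2, 196 + 16, 768)
```

Training only these few parameters (plus, presumably, a small projector) is what keeps updates to the pretrained components minimal and helps retain source-domain capabilities.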
From My View to Yours: Ego-to-Exo Transfer in VLMs for Understanding Activities of Daily Living
Dominick Reilly, Manish Kumar Govind, Le Xue, and Srijan Das
Vision Language Models (VLMs) have achieved strong performance across diverse video understanding tasks. However, their viewpoint-invariant training limits their ability to understand egocentric properties (e.g., human-object interactions) from exocentric video observations. This limitation is critical for many applications, such as Activities of Daily Living (ADL) monitoring, where the understanding of egocentric properties is essential, and egocentric cameras are impractical to deploy. To address this limitation, we propose Ego2ExoVLM, a VLM that learns to infer egocentric properties from exocentric videos by leveraging time-synchronized ego-exo videos during training. Ego2ExoVLM accomplishes this through two components: Ego2Exo Sequence Distillation, which transfers knowledge from an egocentric teacher to an exocentric student, and Ego Adaptive Visual Tokens, designed to enhance the effectiveness of this knowledge transfer. To assess this capability, we introduce Ego-in-Exo Perception, a benchmark of 3.9K questions curated to explicitly measure the understanding of egocentric properties from exocentric videos. Ego2ExoVLM is evaluated on 10 tasks across Ego-in-Exo Perception and existing ADL benchmarks, achieving state-of-the-art results on the ADL-X benchmark suite and outperforming strong baselines on our proposed benchmark. Code, models, and data will be open-sourced.
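As a rough illustration of the sequence-distillation component, the sketch below aligns an exocentric student's visual token sequence with a frozen egocentric teacher's tokens using a per-token cosine objective. The loss form and tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sequence_distillation_loss(student_tokens: torch.Tensor,
                               teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: pull the exocentric student's visual tokens toward
    the time-synchronized egocentric teacher's tokens. Alignment here is a
    simple per-position cosine distance; the actual objective may differ."""
    # Both inputs: (batch, num_tokens, dim); the teacher is frozen, so detach.
    student = F.normalize(student_tokens, dim=-1)
    teacher = F.normalize(teacher_tokens.detach(), dim=-1)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()

# Toy usage with random features standing in for the two encoders' outputs.
exo_student = torch.randn(2, 64, 768, requires_grad=True)
ego_teacher = torch.randn(2, 64, 768)
loss = sequence_distillation_loss(exo_student, ego_teacher)
loss.backward()
```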
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, and 4 more authors
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL’s complex spatiotemporal relationships. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy that incorporates modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available.
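The sketch below illustrates the general shape of a progressive, curriculum-style schedule that introduces modalities in stages rather than aligning them all jointly. The toy model, stage ordering, and step counts are assumptions and do not reflect the released LLAVIDAL training code.

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Stand-in for an LLVM that accepts any subset of modality features."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.heads = nn.ModuleDict(
            {m: nn.Linear(dim, dim) for m in ["video", "skeleton", "hoi"]})

    def forward(self, **feats):
        # Fuse whichever modalities are present in this curriculum stage.
        fused = sum(self.heads[m](x) for m, x in feats.items())
        return fused.pow(2).mean()   # dummy loss for illustration

model = ToyMultimodalModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Modalities enter in stages instead of a single joint alignment.
stages = [["video"], ["video", "skeleton"], ["video", "skeleton", "hoi"]]
for active in stages:
    for _ in range(2):   # a couple of toy steps per stage
        batch = {m: torch.randn(4, 32) for m in ["video", "skeleton", "hoi"]}
        loss = model(**{m: x for m, x in batch.items() if m in active})
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```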
Just Add π! Pose Induced Video Transformers for Understanding Activities of Daily Living
Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or π-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of π-ViT are two plug-in modules, a 2D Skeleton Induction Module and a 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows π-ViT to discard the modules during inference. Notably, π-ViT achieves state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
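The sketch below illustrates the plug-in auxiliary-task idea: a pose-regression head adds a loss during training and is simply dropped at inference, so no pose input or extra compute is needed at test time. The head design, pose dimensionality, and loss weight are illustrative assumptions rather than π-ViT's exact modules.

```python
import torch
import torch.nn as nn

class PoseInductionHead(nn.Module):
    """Illustrative sketch: a plug-in head that regresses pose targets from the
    video transformer's features. It only contributes an auxiliary loss during
    training and is discarded at inference, adding no test-time cost."""
    def __init__(self, feat_dim: int = 768, pose_dim: int = 17 * 2):
        super().__init__()
        self.proj = nn.Sequential(nn.LayerNorm(feat_dim), nn.Linear(feat_dim, pose_dim))

    def forward(self, features: torch.Tensor, pose_targets: torch.Tensor) -> torch.Tensor:
        return nn.functional.mse_loss(self.proj(features), pose_targets)

# Training: action-classification loss plus a pose-aware auxiliary loss.
features = torch.randn(8, 768, requires_grad=True)   # stand-in backbone features
logits = nn.Linear(768, 60)(features)                # toy action classifier
aux_head = PoseInductionHead()
total_loss = nn.functional.cross_entropy(logits, torch.randint(0, 60, (8,))) \
             + 0.5 * aux_head(features, torch.randn(8, 34))
total_loss.backward()

# Inference: the auxiliary head is simply not used, keeping RGB-only compute.
```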
Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders
Srijan Das, Tanmay Jain, Dominick Reilly, Pranav Balaji, and 5 more authors
Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which can make it difficult to train them with limited data. To address this challenge, prior studies suggest training ViTs with self-supervised learning (SSL) and fine-tuning sequentially. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervised Auxiliary Task (SSAT) is surprisingly beneficial when the amount of training data is limited. We explore the appropriate SSL tasks that can be optimized alongside the primary task, the training schemes for these tasks, and the data scale at which they can be most effective. Our findings reveal that SSAT is a powerful technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks, achieving better performance than the typical pipeline of SSL pre-training followed by sequential fine-tuning. Our experiments, conducted on 10 datasets, demonstrate that SSAT significantly improves ViT performance while reducing the carbon footprint. We also confirm the effectiveness of SSAT in the video domain for deepfake detection, showcasing its generalizability.
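The sketch below conveys the SSAT idea of optimizing the primary task and a self-supervised objective in the same training step, rather than running SSL pre-training and fine-tuning as separate stages. The toy encoder, masking scheme, and loss weighting are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the paper's code): the encoder receives gradients
# from a classification loss and an MAE-style reconstruction loss in one step.
encoder = nn.Linear(196 * 16, 256)        # toy stand-in for a ViT encoder
classifier = nn.Linear(256, 10)           # primary-task head
decoder = nn.Linear(256, 196 * 16)        # reconstruction head for the SSL task
params = list(encoder.parameters()) + list(classifier.parameters()) + list(decoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

images = torch.randn(8, 196 * 16)         # flattened patch features (toy data)
labels = torch.randint(0, 10, (8,))
mask = (torch.rand_like(images) < 0.75).float()     # 1 = masked position

primary_loss = nn.functional.cross_entropy(classifier(encoder(images)), labels)
recon = decoder(encoder(images * (1.0 - mask)))     # reconstruct from masked input
ssl_loss = nn.functional.mse_loss(recon * mask, images * mask)
(primary_loss + 0.5 * ssl_loss).backward()          # one joint update, no separate SSL stage
optimizer.step()
```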