Dr. Karan Sikka is a Research Scientist at Meta AI. He was previously a Senior Computer Vision Scientist at [SRI International](https://www.sri.com/computer-vision/), Princeton, USA. He completed his PhD at the University of California, San Diego in 2016 (advised by Dr. Marian Bartlett) and his bachelor's degree in ECE from IIT Guwahati, India in 2010.
Dr. Sikka’s doctoral thesis centered on developing machine learning models for action classification in videos, specifically under conditions of weak supervision. Upon joining SRI, his research shifted towards multimodal learning, with an emphasis on learning under few- and zero-shot settings. He has explored the use of diverse modalities to enhance tasks ranging from visual grounding to social media analysis and geo-localization. His present research focuses on leveraging large language models (Generative AI) across a spectrum of applications, including robotics, visual understanding, and personalized content generation. He is also interested in improving the consistency of these models and mitigating issues such as hallucination. His work has been published at top venues such as CVPR, ICCV, and ECCV, and has won multiple awards.
Please check the following links for more details.
NAACL
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen, Karan Sikka, Michael Cogswell, and 2 more authors
We investigate chain-of-thought reasoning in vision-language models, proposing metrics to measure reasoning consistency and methods to improve the reliability of these models’ reasoning processes.
PNAS
SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments
Abhinav Rajvanshi, Karan Sikka, Xiao Lin, and 3 more authors
We present SayNav, a framework that grounds large language models for robot navigation in novel environments through dynamic planning, enabling robots to understand natural language commands and navigate effectively in previously unseen spaces.
CVPR
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen, Karan Sikka, Michael Cogswell, and 2 more authors
DRESS is a method for aligning vision-language models with human preferences using natural language feedback, enabling more effective human-AI interaction and improving model behavior through iterative refinement.