I build retrieval systems that understand images, text, video, and sound, not just literal matches.
I'm a PhD researcher at Virginia Tech, working on vision-language models (VLMs), RAG, and ranking/reranking. My focus is multi-prompt (multi-vector) embeddings: many small, controllable "views" of meaning that make search richer, more interpretable, and less prone to collapse.
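A minimal sketch of the multi-prompt (multi-vector) idea: embed one input under several prompts ("views"), keep every vector, and score a query against the best-matching view instead of a single collapsed embedding. The prompt wording and the model name below are illustrative placeholders, not the exact setup from my research.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

VIEWS = ["literal meaning", "figurative meaning", "emotional tone", "background context"]
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text encoder

def embed_views(text: str) -> np.ndarray:
    # One focused embedding per view instead of a single averaged vector.
    prompted = [f"{view}: {text}" for view in VIEWS]
    return model.encode(prompted, normalize_embeddings=True)  # (n_views, dim)

def score(query: str, doc_views: np.ndarray) -> float:
    q = model.encode([query], normalize_embeddings=True)[0]
    # Max over views: the query only needs to match one "angle" of the document.
    return float(np.max(doc_views @ q))

doc = "He finally broke the ice at the meeting."
print(score("idiom about easing social tension", embed_views(doc)))
```

Keeping the per-view vectors also makes retrieval more interpretable: you can report which view produced the match.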
What I work on
Reasoning in vision-language models (VLMs).
Cross-modal retrieval across images, text, video, and audio.
Structured information extraction from multimodal data.
Knowledge representation for multimodal reasoning.
Exploring room acoustics (room impulse responses, RIRs) as spatial signals for learning geometry-aware representations.
Why it matters
Real-world queries are polysemous: idioms, metaphor, culture, and context often matter more than surface similarity. I design retrieval pipelines that surface the right connections, not only the nearest neighbor.
Projects (quick view)
Multi-Prompt Embedding for Retrieval
One input -> multiple focused embeddings that boost recall and reduce length bias and embedding collapse.
RAG + Reranker for Multimodal Search
Lightweight bi-encoder retrieval + VLM reader + cross-encoder reranker for better final ranking (see the sketch after this list).
Diversity-Aware VLM Retrieval
Retrieves multiple perspectives (literal/figurative/emotional/abstract/background) instead of forcing a single vector.
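A rough sketch of the retrieve-then-rerank pattern behind the RAG project: a lightweight bi-encoder narrows the corpus, then a cross-encoder re-scores the shortlist before it is handed to a VLM reader. Model names and the toy corpus are placeholder assumptions; the actual system also handles image and video inputs.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

retriever = SentenceTransformer("all-MiniLM-L6-v2")                # fast first stage
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")    # precise second stage

corpus = ["a dog catching a frisbee", "a stock market chart", "kids playing in the rain"]
doc_emb = retriever.encode(corpus, normalize_embeddings=True)

def search(query: str, k: int = 2) -> list[str]:
    q = retriever.encode([query], normalize_embeddings=True)[0]
    candidates = np.argsort(-(doc_emb @ q))[:k]                    # top-k by cosine similarity
    scores = reranker.predict([(query, corpus[i]) for i in candidates])
    order = np.argsort(-np.asarray(scores))
    return [corpus[candidates[i]] for i in order]                  # reranked shortlist -> reader

print(search("pets playing outdoors"))
```

The bi-encoder keeps the first stage cheap enough to scan the whole index, while the cross-encoder spends its compute only on the short candidate list.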
Tech I use (most often)
Languages
ML / Data
Systems / Tools
Open to collaborations
If you are working on diversity-aware retrieval, interpretable VLMs, or multimodal reasoning benchmarks, let's talk.