By Mankeerat Sidhu, Hetarth Chopra, Ansel Blume, Jeonghwan Kim, Revanth Gangi Reddy and Heng Ji
The arXiv paper can be found here: SearchDet
This repository contains the official code for SearchDet, a training-free framework for long-tail, open-vocabulary object detection. SearchDet leverages web-retrieved positive and negative support images to dynamically generate query embeddings for precise object localization—all without additional training.
The architecture diagram of our pipeline. We compare the adjusted embeddings of the positive and negative support images, produced by the DINOv2 model, against the masks extracted by the SAM model to obtain an initial estimate of the segmentation bounding box. We use DINOv2 again to generate pixel-precise heatmaps, which provide a second estimate of the segmentation. We combine both estimates using a binarized overlap to obtain the final segmentation mask.
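The final combination step can be sketched as follows. This is a minimal illustration of binarized overlap, not the repository's implementation; the heatmap threshold value is an assumption.

```python
import numpy as np

def combine_estimates(sam_mask, heatmap, heatmap_thresh=0.5):
    """Combine a SAM region proposal with a DINOv2 similarity heatmap
    by binarizing both and keeping only their overlap.
    `heatmap_thresh` is an illustrative value, not the paper's setting."""
    sam_bin = sam_mask.astype(bool)          # SAM proposal as a boolean mask
    heat_bin = heatmap >= heatmap_thresh     # binarize the similarity heatmap
    return sam_bin & heat_bin                # final mask = overlap of the two

# Toy 3x3 example: the SAM proposal and the heatmap agree on three pixels.
sam_mask = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
heatmap = np.array([[0.9, 0.2, 0.1], [0.8, 0.7, 0.0], [0.1, 0.0, 0.0]])
final = combine_estimates(sam_mask, heatmap)
```

Requiring agreement between the two estimates suppresses pixels that only one source supports, which is the intuition behind taking the overlap rather than the union.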
- ✅ Enhance Open-Vocabulary Detection: Improve detection performance on long-tail classes by retrieving and leveraging web images.
- ✅ Operate Training-Free: Eliminate the need for costly fine-tuning and continual pre-training by computing query embeddings at inference time.
- ✅ Utilize State-of-the-Art Models: Integrate off-the-shelf models like DINOv2 for robust image embeddings and SAM for generating region proposals.
Our method demonstrates substantial mAP improvements over existing approaches on challenging datasets—all while keeping the inference pipeline lightweight and training-free.
- Web-Based Exemplars: Retrieve positive and negative support images from the web to create dynamic, context-sensitive query embeddings.
- Attention-Based Query Generation: Enhance detection by weighting support images based on cosine similarity with the input query.
- Robust Region Proposals: Use SAM to generate high-quality segmentation proposals that are refined via similarity heatmaps.
- Adaptive Thresholding: Apply frequency-based thresholding to automatically select the most relevant region proposals.
- Scalable Inference: Achieve strong performance with just a few support images—ideal for long-tailed object detection scenarios.
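The attention-based query generation above can be sketched as a cosine-similarity softmax over the support embeddings. This is a simplified illustration under our own assumptions (uniform temperature, no negative supports), not the exact computation in the repository.

```python
import numpy as np

def attention_weighted_query(query_emb, support_embs):
    """Weight support-image embeddings by their cosine similarity to the
    query embedding, then combine them into a single detection query.
    A minimal sketch; the softmax temperature of 1.0 is an assumption."""
    q = query_emb / np.linalg.norm(query_emb)
    S = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = S @ q                                  # cosine similarity per support
    weights = np.exp(sims) / np.exp(sims).sum()   # softmax over support images
    return weights @ support_embs                 # similarity-weighted combination

# Toy example: the first support aligns with the query, the second does not.
supports = np.array([[1.0, 0.0], [0.0, 1.0]])
query = np.array([1.0, 0.0])
combined = attention_weighted_query(query, supports)
```

Supports that resemble the query dominate the combined embedding, so off-topic retrievals from the web search contribute less to localization.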
Figure 3. Illustration of our method providing more precise masks after including the negative support image samples. The negative query (e.g., “waves”) helps avoid irrelevant areas and focus on the intended concept (e.g., “surfboard”).
We compare not only the accuracy of our method but also the performance-versus-inference-time trade-off of OWOD models on LVIS. With caching, SearchDet matches the speed of GroundingDINO and is faster than T-Rex, two state-of-the-art methods.
The following images illustrate SearchDet's performance on the benchmarks.
Run `pip install -r requirements.txt` in your virtual environment. If you plan to run this code on a GPU, first install `torch` and `torchvision` for your CUDA version (e.g., `pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118` for CUDA 11.8), comment out `torch` and `torchvision` in the requirements file, and then run `pip install -r requirements.txt`.
The entire design philosophy of SearchDet is that any developer can replace components of our system, according to their desired needs.
- If more mask precision is needed, one can use a larger variant of SAM (e.g., SemanticSAM); if more inference speed is needed, one can use a faster implementation of SAM (e.g., FastSAM or the PyTorch implementation of SAM).
- If higher retrieval quality is needed for the masks, one can substitute other embedding models suited to the use case, such as CLIP.
- We encourage experimenting with whether a given use case benefits from negative exemplar images; `adjust_embedding` (line 167) in `mask_with_search.py` is the place to modify. Users can test with and without negative images and choose whichever works best.
- The web crawler we use is a naive Selenium implementation without parallelization. We encourage spinning up multiple threads to speed it up.
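One possible modification is sketched below. This is a hypothetical stand-in for `adjust_embedding`, not the actual code in `mask_with_search.py`; the subtraction-based adjustment and the `alpha` hyperparameter are illustrative assumptions.

```python
import numpy as np

def adjust_embedding(pos_emb, neg_embs, alpha=0.5):
    """Hypothetical sketch of an embedding adjustment: push the positive
    query embedding away from the mean of the negative exemplar embeddings.
    `alpha` is an illustrative hyperparameter, not a value from the repo."""
    if neg_embs is None or len(neg_embs) == 0:
        return pos_emb  # without negatives, leave the query unchanged
    neg_mean = np.mean(neg_embs, axis=0)        # average negative direction
    adjusted = pos_emb - alpha * neg_mean       # subtract the negative signal
    return adjusted / np.linalg.norm(adjusted)  # renormalize to unit length

# With a negative exemplar orthogonal to the query, the adjusted embedding
# is pushed away from the negative direction.
with_neg = adjust_embedding(np.array([1.0, 0.0]), np.array([[0.0, 1.0]]))
without_neg = adjust_embedding(np.array([1.0, 0.0]), [])
```

Wrapping the negative-image logic in a single function like this makes it easy to A/B test with and without negatives, as suggested above.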











