By Mankeerat Sidhu, Hetarth Chopra, Ansel Blume, Jeonghwan Kim, Revanth Gangi Reddy and Heng Ji
The arXiv paper can be found here: SearchDet
This repository contains the official code for SearchDet, a training-free framework for long-tail, open-vocabulary object detection. SearchDet leverages web-retrieved positive and negative support images to dynamically generate query embeddings for precise object localization—all without additional training.
The architecture diagram of our pipeline. We compare the adjusted embeddings of the positive and negative support images, produced by the DINOv2 model, against the masks extracted by the SAM model to obtain an initial estimate of the segmentation bounding box. We use DINOv2 again to generate pixel-precise heatmaps, which provide a second estimate of the segmentation. We combine both estimates using a binarized overlap to obtain the final segmentation mask.
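The final combination step can be sketched as follows. This is a minimal illustration of binarized overlap, not the repository's implementation; the heatmap threshold value is an assumption.

```python
import numpy as np

def combine_estimates(sam_mask, heatmap, heatmap_thresh=0.5):
    """Combine a SAM region proposal with a DINOv2 similarity heatmap
    by binarizing both and keeping only their overlap.
    `heatmap_thresh` is an illustrative value, not the paper's setting."""
    sam_bin = sam_mask.astype(bool)          # SAM proposal as a boolean mask
    heat_bin = heatmap >= heatmap_thresh     # binarize the similarity heatmap
    return sam_bin & heat_bin                # final mask = overlap of the two

# Toy 3x3 example: the SAM proposal and the heatmap agree on three pixels.
sam_mask = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
heatmap = np.array([[0.9, 0.2, 0.1], [0.8, 0.7, 0.0], [0.1, 0.0, 0.0]])
final = combine_estimates(sam_mask, heatmap)
```

Requiring agreement between the two estimates suppresses pixels that only one source supports, which is the intuition behind taking the overlap rather than the union.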
- ✅ Enhance Open-Vocabulary Detection: Improve detection performance on long-tail classes by retrieving and leveraging web images.
- ✅ Operate Training-Free: Eliminate the need for costly fine-tuning and continual pre-training by computing query embeddings at inference time.
- ✅ Utilize State-of-the-Art Models: Integrate off-the-shelf models like DINOv2 for robust image embeddings and SAM for generating region proposals.
Our method demonstrates substantial mAP improvements over existing approaches on challenging datasets—all while keeping the inference pipeline lightweight and training-free.
- Web-Based Exemplars: Retrieve positive and negative support images from the web to create dynamic, context-sensitive query embeddings.
- Attention-Based Query Generation: Enhance detection by weighting support images based on cosine similarity with the input query.
- Robust Region Proposals: Use SAM to generate high-quality segmentation proposals that are refined via similarity heatmaps.
- Adaptive Thresholding: Apply frequency-based thresholding to automatically select the most relevant region proposals.
- Scalable Inference: Achieve strong performance with just a few support images—ideal for long-tailed object detection scenarios.
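The attention-based query generation above can be sketched as a cosine-similarity softmax over the support embeddings. This is a simplified illustration under our own assumptions (uniform temperature, no negative supports), not the exact computation in the repository.

```python
import numpy as np

def attention_weighted_query(query_emb, support_embs):
    """Weight support-image embeddings by their cosine similarity to the
    query embedding, then combine them into a single detection query.
    A minimal sketch; the softmax temperature of 1.0 is an assumption."""
    q = query_emb / np.linalg.norm(query_emb)
    S = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = S @ q                                  # cosine similarity per support
    weights = np.exp(sims) / np.exp(sims).sum()   # softmax over support images
    return weights @ support_embs                 # similarity-weighted combination

# Toy example: the first support aligns with the query, the second does not.
supports = np.array([[1.0, 0.0], [0.0, 1.0]])
query = np.array([1.0, 0.0])
combined = attention_weighted_query(query, supports)
```

Supports that resemble the query dominate the combined embedding, so off-topic retrievals from the web search contribute less to localization.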
Figure 3. Illustration of our method providing more precise masks after including the negative support image samples. The negative query (e.g., “waves”) helps avoid irrelevant areas and focus on the intended concept (e.g., “surfboard”).
We compare not only the accuracy of our method but also the performance-versus-inference-time trade-off of OWOD models on LVIS. With caching, SearchDet matches the speed of GroundingDINO and is faster than T-Rex, two state-of-the-art methods.
The following images illustrate SearchDet's performance on the benchmarks.
Run `pip install -r requirements.txt` in your virtual environment. If you plan to run this code on a GPU, first install `torch` and `torchvision` for your CUDA version (e.g., `pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118` for CUDA 11.8), comment out `torch` and `torchvision` in the requirements file, and then run `pip install -r requirements.txt`.
The entire design philosophy of SearchDet is that any developer can replace components of our system, according to their desired needs.
- If more mask precision is needed, one can use a larger variant of SAM (e.g., SemanticSAM); if more inference speed is needed, one can use a faster implementation of SAM (e.g., FastSAM or the PyTorch implementation of SAM).
- If higher retrieval quality is needed for the masks, one can substitute other embedding models suited to the use case, such as CLIP.
- We encourage experimenting with whether a given use case benefits from negative exemplar images; `adjust_embedding` (line 167) in `mask_with_search.py` is the place to modify. Users can test with and without negative images and choose whichever works best.
- The web crawler we use is a naive Selenium implementation without parallelization. We encourage spinning up multiple threads to speed it up.
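One possible modification is sketched below. This is a hypothetical stand-in for `adjust_embedding`, not the actual code in `mask_with_search.py`; the subtraction-based adjustment and the `alpha` hyperparameter are illustrative assumptions.

```python
import numpy as np

def adjust_embedding(pos_emb, neg_embs, alpha=0.5):
    """Hypothetical sketch of an embedding adjustment: push the positive
    query embedding away from the mean of the negative exemplar embeddings.
    `alpha` is an illustrative hyperparameter, not a value from the repo."""
    if neg_embs is None or len(neg_embs) == 0:
        return pos_emb  # without negatives, leave the query unchanged
    neg_mean = np.mean(neg_embs, axis=0)        # average negative direction
    adjusted = pos_emb - alpha * neg_mean       # subtract the negative signal
    return adjusted / np.linalg.norm(adjusted)  # renormalize to unit length

# With a negative exemplar orthogonal to the query, the adjusted embedding
# is pushed away from the negative direction.
with_neg = adjust_embedding(np.array([1.0, 0.0]), np.array([[0.0, 1.0]]))
without_neg = adjust_embedding(np.array([1.0, 0.0]), [])
```

Wrapping the negative-image logic in a single function like this makes it easy to A/B test with and without negatives, as suggested above.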











