
Method

Our method comprises three training routes: Route-1, Route-2, and Route-3. All three routes share the same object queries and the same detection heads for classification and regression. Route-2 is the primary route for one-to-one prediction, identical to the baseline model. Route-1 shares the self-attention and cross-attention but uses an independent feed-forward network (o2m FFN) for one-to-many prediction. Route-3 shares all components with the primary route and introduces a novel instructive self-attention, implemented by adding learnable instruction tokens to the object queries to guide them, and the subsequent network, toward one-to-many prediction. During inference, the auxiliary routes Route-1 and Route-3 are discarded.
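
The snippet below is a minimal PyTorch sketch of this design, not the authors' implementation: names such as MultiRouteDecoderLayer, o2m_ffn, and instruction_tokens are illustrative choices, the shared detection heads are omitted, and standard multi-head attention is assumed.

import torch
import torch.nn as nn

class MultiRouteDecoderLayer(nn.Module):
    """Sketch of one decoder layer with the three training routes."""

    def __init__(self, d_model=256, n_heads=8, n_instr=10):
        super().__init__()
        # Self- and cross-attention are shared by all three routes.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Shared FFN used by Route-2 (one-to-one) and Route-3.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        # Independent FFN for Route-1's one-to-many prediction.
        self.o2m_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        # Learnable instruction tokens that steer Route-3 toward one-to-many prediction.
        self.instruction_tokens = nn.Parameter(torch.randn(n_instr, d_model))

    def forward(self, queries, memory, training=True):
        # Route-2 (primary): plain self-attention, cross-attention, shared FFN.
        q2, _ = self.self_attn(queries, queries, queries)
        q2, _ = self.cross_attn(q2, memory, memory)
        out_o2o = self.ffn(q2)
        if not training:
            return out_o2o  # auxiliary routes are discarded at inference

        # Route-1: shares both attention outputs, swaps in the independent o2m FFN.
        out_r1 = self.o2m_ffn(q2)

        # Route-3 (instructive self-attention): prepend instruction tokens to the
        # object queries for self-attention, then drop them again; all weights
        # are shared with the primary route.
        instr = self.instruction_tokens.unsqueeze(0).expand(queries.size(0), -1, -1)
        cat = torch.cat([instr, queries], dim=1)
        q3, _ = self.self_attn(cat, cat, cat)
        q3 = q3[:, instr.size(1):]  # strip instruction tokens before cross-attention
        q3, _ = self.cross_attn(q3, memory, memory)
        out_r3 = self.ffn(q3)
        return out_o2o, out_r1, out_r3

During training, the Route-1 and Route-3 outputs would feed one-to-many objectives while Route-2 keeps the usual one-to-one assignment; since only out_o2o is computed at inference, the deployed model matches the baseline's cost.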

Figure: Illustration of our proposed multi-route training method.

Quantitative Results

Figure: Performance on the COCO 2017 validation set. All models use a ResNet-50 backbone.
Figure: Performance on the COCO 2017 validation set with a Swin-L backbone.

Extension to Instance Segmentation

Epochs | w/ Mr. DETR | Mask mAP    | Box mAP
12     |             | 32.4        | 46.5
12     | ✓           | 36.0 (+3.6) | 49.5 (+3.0)
24     |             | 35.1        | 48.6
24     | ✓           | 37.6 (+2.5) | 50.3 (+1.7)

Instance segmentation results on the COCO 2017 validation set. All experiments are based on Deformable-DETR++ with 300 queries and a ResNet-50 backbone.

Effectiveness of our Instructive Self-Attention

Route-1 | Route-2 | Route-3 | mAP         | AP50 | AP75
        | ✓       |         | 47.6        | 65.8 | 51.8
✓       | ✓       |         | 49.6 (+2.0) | 67.4 | 54.2
        | ✓       | ✓       | 50.4 (+2.8) | 67.9 | 55.3
✓       | ✓       | ✓       | 50.7 (+3.1) | 68.2 | 55.4

Ablation study of the different routes in our method. 'Route-1': the auxiliary training route with an independent FFN. 'Route-2': the primary route for one-to-one prediction. 'Route-3': the auxiliary training route with instructive self-attention.

We further visualize the attention maps of the instructive self-attention. When the 300 object queries act as queries and the 10 instruction tokens as keys, nearly all 300 object queries exhibit strong activation on the instruction tokens. This indicates that the instruction tokens effectively convey information to the object queries and the subsequent network layers, helping the model achieve one-to-many prediction.

Figure: Visualization of attention maps for instructive self-attention. We use Deformable-DETR++ with 300 object queries and 10 instruction tokens for this visualization. The first 10 tokens are instruction tokens; the vertical and horizontal axes represent Query and Key, respectively.
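
As a hypothetical sketch of how such a map can be inspected, the helper below plots a (10 + 300) x (10 + 300) attention-weight matrix, assuming the weights of the instructive self-attention have been extracted from the model (the function name and the random example weights are illustrative, not part of the released code).

import torch
import matplotlib.pyplot as plt

def plot_instructive_attention(attn_weights, n_instr=10):
    """attn_weights: (num_tokens, num_tokens); rows = Query, columns = Key."""
    plt.imshow(attn_weights.detach().cpu().numpy(), cmap="viridis")
    plt.xlabel("Key")
    plt.ylabel("Query")
    # Strong activation in columns [0, n_instr) for rows >= n_instr means the
    # object queries attend heavily to the instruction tokens.
    plt.axvline(n_instr - 0.5, color="red", linewidth=0.8)
    plt.colorbar()
    plt.show()

# Example with random weights standing in for a trained model's attention:
weights = torch.softmax(torch.randn(310, 310), dim=-1)
plot_instructive_attention(weights)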

BibTeX

@inproceedings{zhang2024mr,
  title={Mr. DETR: Instructive Multi-Route Training for Detection Transformers},
  author={Zhang, Chang-Bin and Zhong, Yujie and Han, Kai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
 