Method
Our method trains with three routes: Route-1, Route-2, and Route-3. All three share the same object queries and the same detection heads for classification and regression. Route-2 is the primary route for one-to-one prediction and is identical to the baseline models. Route-1 shares the self-attention and cross-attention of the primary route but uses an independent feed-forward network (o2m FFN) for one-to-many prediction. Route-3 shares all components with the primary route and introduces a novel instructive self-attention, implemented by adding a learnable instruction token to the object queries to guide them and the subsequent network toward one-to-many prediction. During inference, the auxiliary routes, Route-1 and Route-3, are discarded.
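The routing described above can be illustrated with a small, dependency-free sketch. It is not the released implementation: the single-head attention without learned projections, the `make_ffn` stand-in, the `decoder_layer` signature, and the choice to prepend the instruction tokens and drop their outputs after self-attention are all assumptions made for illustration.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (learned projections omitted)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

def make_ffn(dim, seed):
    """Hypothetical stand-in for an FFN: one random linear map followed by ReLU."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    def ffn(x):
        return [[max(0.0, sum(w * v for w, v in zip(row, tok))) for row in weight]
                for tok in x]
    return ffn

def decoder_layer(queries, memory, route, o2o_ffn, o2m_ffn, instruction_tokens):
    """One decoder layer; `route` is 1, 2, or 3 as named in the text."""
    if route == 3:
        # Route-3: instruction tokens join the self-attention sequence
        # (assumed: prepended, and their outputs dropped afterwards).
        tokens = instruction_tokens + queries
        x = attention(tokens, tokens, tokens)[len(instruction_tokens):]
    else:
        # Route-1 and Route-2 use the plain shared self-attention.
        x = attention(queries, queries, queries)
    x = attention(x, memory, memory)          # shared cross-attention
    ffn = o2m_ffn if route == 1 else o2o_ffn  # only Route-1 has its own FFN
    return ffn(x)

rng = random.Random(0)
dim = 4
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
memory = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
instruction_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
o2o_ffn, o2m_ffn = make_ffn(dim, 1), make_ffn(dim, 2)

# Training would run all three routes; inference keeps only Route-2.
outputs = {r: decoder_layer(queries, memory, r, o2o_ffn, o2m_ffn, instruction_tokens)
           for r in (1, 2, 3)}
```

In this sketch, Route-3 with an empty list of instruction tokens reduces exactly to Route-2, which mirrors the statement that Route-3 shares every component with the primary route.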
Quantitative Results
Extension to Instance Segmentation
| Epochs | w/ Mr. DETR | Mask mAP | Box mAP |
|---|---|---|---|
| 12 | | 32.4 | 46.5 |
| 12 | ✔ | 36.0 (+3.6) | 49.5 (+3.0) |
| 24 | | 35.1 | 48.6 |
| 24 | ✔ | 37.6 (+2.5) | 50.3 (+1.7) |
Instance segmentation results on the COCO 2017 validation set. All experiments use Deformable-DETR++ with 300 queries and a ResNet-50 backbone.
Effectiveness of our Instructive Self-Attention
| Route-1 | Route-2 | Route-3 | mAP | AP50 | AP75 |
|---|---|---|---|---|---|
| ✔ | 47.6 | 65.8 | 51.8 | ||
| ✔ | 49.6 (+2.0) | 67.4 | 54.2 | ||
| ✔ | 50.4 (+2.8) | 67.9 | 55.3 | ||
| ✔ | ✔ | ✔ | 50.7 (+3.1) | 68.2 | 55.4 |
The ablation study of different routes in our method. 'Route-1': the auxiliary training route with independent FFN. 'Route-2': the primary route for one-to-one prediction. 'Route-3': the auxiliary training route with instructive self-attention.
We further visualize the attention maps of the instructive self-attention. They reveal that when the 300 object queries act as queries and the 10 instruction tokens as keys, nearly all 300 object queries show strong activation on the instruction tokens. This indicates that the instruction tokens effectively convey information to the object queries and subsequent network layers, helping the model achieve one-to-many predictions.
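One way to quantify such an attention map is to sum, for each object query, the softmax weight that falls on the instruction-token keys. The sketch below is a hedged illustration only: it uses random vectors rather than trained queries, so it does not reproduce the reported activation pattern. The function names are hypothetical, and the key layout (instruction tokens followed by object queries) is an assumption.

```python
import math
import random

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products (learned projections omitted)."""
    scale = math.sqrt(len(queries[0]))
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def instruction_attention_mass(object_queries, instruction_tokens):
    """Per object query, the share of attention placed on instruction tokens."""
    keys = instruction_tokens + object_queries  # assumed key layout
    weights = attention_weights(object_queries, keys)
    n = len(instruction_tokens)
    return [sum(row[:n]) for row in weights]

rng = random.Random(0)
dim = 8
# Random placeholders standing in for 300 trained object queries
# and 10 trained instruction tokens.
object_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(300)]
instruction_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(10)]

mass = instruction_attention_mass(object_queries, instruction_tokens)
print(len(mass), min(mass), max(mass))
```

With trained weights, a value of `mass` close to 1 for most queries would correspond to the strong activation described above.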