Mr. DETR: Instructive Multi-Route Training for Detection Transformers

Chang-Bin Zhang¹, Yujie Zhong², Kai Han¹

¹Visual AI Lab, The University of Hong Kong
²Meituan Inc.

Performance

Qualitative Results

Our Findings

In a multi-task training framework that performs both one-to-one and one-to-many prediction, making any single decoder component independent significantly benefits the primary one-to-one route, even when all other components are shared.

No	Configurations	Routes	o2o	o2m w/ NMS
(1)	One-to-one only	1	47.6	-
(2)	Share All	1	41.6 (-6.0)	41.6
(3)	Not shared Self-Attention	2	49.7 (+2.1)	50.3
(4)	Not shared Cross-Attention	2	49.2 (+1.6)	50.0
(5)	Not shared FFN	2	49.6 (+2.0)	50.1
(6)	Shared Self-Attention	2	49.4 (+1.8)	50.3
(7)	Shared Cross-Attention	2	49.4 (+1.8)	50.0
(8)	Shared FFN	2	49.2 (+1.6)	50.0
(9)	(3) + (4)	3	49.4 (+1.8)	49.9
(10)	(3) + (5)	3	50.0 (+2.4)	50.8
(11)	(4) + (5)	3	49.0 (+1.4)	49.6
(12)	(3) + (4) + (5)	4	49.6 (+2.0)	50.2
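
The bracketed gains in the o2o column are differences from the one-to-one-only baseline in row (1). The short sketch below recomputes them from the scores in the table; the variable and function names are illustrative and not taken from the authors' code.

```python
# Recompute the o2o gains in the table relative to row (1).
# Scores are copied from the table; all names here are illustrative.

BASELINE_O2O = 47.6  # row (1): one-to-one only

O2O_SCORES = {
    "(2) Share All": 41.6,
    "(3) Not shared Self-Attention": 49.7,
    "(4) Not shared Cross-Attention": 49.2,
    "(5) Not shared FFN": 49.6,
    "(6) Shared Self-Attention": 49.4,
    "(7) Shared Cross-Attention": 49.4,
    "(8) Shared FFN": 49.2,
    "(9) (3) + (4)": 49.4,
    "(10) (3) + (5)": 50.0,
    "(11) (4) + (5)": 49.0,
    "(12) (3) + (4) + (5)": 49.6,
}


def delta(score: float, baseline: float = BASELINE_O2O) -> float:
    """Gain over the baseline, rounded to one decimal as in the table."""
    return round(score - baseline, 1)


deltas = {name: delta(score) for name, score in O2O_SCORES.items()}
best = max(deltas, key=deltas.get)  # configuration with the largest gain

for name, d in deltas.items():
    print(f"{name}: {d:+.1f}")
print("largest gain:", best)
```

Only row (2), where every component is shared between the two prediction types, falls below the baseline; every configuration with at least one independent or route-specific component improves on it.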

Method

Our method includes three training routes: Route-1, Route-2, and Route-3. All three routes share the same object queries and detection heads for classification and regression. Route-2 serves as the primary route for one-to-one prediction, identical to the baseline models. Route-1 shares self-attention and cross-attention but uses an independent feed-forward network (o2m FFN) for one-to-many prediction. Route-3, sharing all components with the primary route, introduces a novel instructive self-attention, implemented by adding a learnable instruction token to the object queries to guide them and the subsequent network for one-to-many prediction. During inference, the auxiliary routes, Route-1 and Route-3, are discarded.
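
The routing described above can be sketched as a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: each object query is a single float, the attention and FFN functions are trivial stand-ins for real layers, the instruction token is a fixed constant rather than a learned parameter, and "adding" the token is read here as prepending it to the queries before self-attention. All function and key names are hypothetical.

```python
from typing import Dict, List

Queries = List[float]  # toy stand-in: one scalar per object query


def self_attn(tokens: Queries) -> Queries:
    """Toy shared self-attention: mix each token with the mean of all tokens."""
    mean = sum(tokens) / len(tokens)
    return [0.5 * t + 0.5 * mean for t in tokens]


def cross_attn(tokens: Queries) -> Queries:
    """Toy shared cross-attention: add a constant standing in for image features."""
    return [t + 1.0 for t in tokens]


def ffn(tokens: Queries) -> Queries:
    """Toy FFN of the primary route (also reused by Route-3)."""
    return [2.0 * t for t in tokens]


def o2m_ffn(tokens: Queries) -> Queries:
    """Toy independent FFN used only by Route-1."""
    return [3.0 * t for t in tokens]


INSTRUCTION_TOKEN = 10.0  # assumption: learnable in practice, fixed here


def instructive_self_attn(queries: Queries) -> Queries:
    """Run the shared self-attention with the instruction token prepended,
    then drop the token's own output so the query count is unchanged."""
    mixed = self_attn([INSTRUCTION_TOKEN] + queries)
    return mixed[1:]


def decoder_layer(queries: Queries, training: bool) -> Dict[str, Queries]:
    # Route-2: primary one-to-one route, always computed.
    outputs = {"route2_o2o": ffn(cross_attn(self_attn(queries)))}
    if training:
        # Route-1: shared attention, independent o2m FFN.
        outputs["route1_o2m"] = o2m_ffn(cross_attn(self_attn(queries)))
        # Route-3: all components shared, instructive self-attention.
        outputs["route3_o2m"] = ffn(cross_attn(instructive_self_attn(queries)))
    return outputs


train_out = decoder_layer([1.0, 3.0], training=True)
infer_out = decoder_layer([1.0, 3.0], training=False)
print(sorted(train_out))  # three routes during training
print(sorted(infer_out))  # only the primary route at inference
```

Because the auxiliary routes exist only in the training branch, the inference path in this sketch is identical to a baseline decoder layer, matching the statement that Route-1 and Route-3 are discarded at inference.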

Quantitative Results

Extension to Instance Segmentation

Epochs	w/ Mr. DETR	Mask mAP	Box mAP
12		32.4	46.5
12	✔	36.0 (+3.6)	49.5 (+3.0)
24		35.1	48.6
24	✔	37.6 (+2.5)	50.3 (+1.7)

Instance segmentation results on the COCO 2017 validation set. All experiments use Deformable-DETR++ with 300 queries and a ResNet-50 backbone.