Scratching Visual Transformer's Back with Uniform Attention
Naver AI Lab, POSTECH, University of Tübingen
International Conference on Computer Vision (ICCV) 2023
Abstract
The favorable performance of Vision Transformers (ViTs) is often attributed to multi-head self-attention (MSA), which enables global interactions at each layer of a ViT model. Previous works credit the effectiveness of MSA to its long-range dependencies. In this work, we study the role of MSA along a different axis: density. Our preliminary analyses suggest that the spatial interactions of learned attention maps are closer to dense interactions than to sparse ones. This is a curious phenomenon, because dense attention maps are harder for the model to learn through softmax. We interpret this behavior, which runs against what softmax favors, as a strong preference of ViT models for dense interactions. We therefore manually insert dense uniform attention into each layer of the ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting (CB). Our study demonstrates that CB takes over the role of dense attention and thereby reduces the degree of density in the original attention maps, letting them better comply with softmax in MSA. We also show that, at the negligible cost of CB (one line in your model code and no additional parameters), both the capacity and generalizability of ViT models are increased.
Motivation


- A majority of the attention maps in ViTs have high entropy values, i.e. they are close to dense (near-uniform) interactions; an entropy sketch follows this list
- The gradient through the softmax of an MSA layer becomes steeper as its attention maps get denser
- Dense attention maps are therefore hard to learn, yet vital to ViTs
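As a rough illustration of the entropy measure referred to above (a sketch of the standard row-wise Shannon entropy, not the authors' analysis code): each softmax-normalized attention row has entropy at most log N, attained exactly by uniform attention, so values near that maximum indicate dense interactions.

import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # attn: (batch, heads, queries, keys), rows already softmax-normalized.
    # Row-wise Shannon entropy, averaged over batch, heads, and queries.
    ent = -(attn * (attn + eps).log()).sum(dim=-1)
    return ent.mean()

# Uniform attention over N tokens attains the maximum entropy log(N).
N = 197  # e.g. 196 patch tokens + 1 class token (an assumed token count)
uniform = torch.full((1, 6, N, N), 1.0 / N)
print(attention_entropy(uniform))         # ~5.28
print(torch.log(torch.tensor(float(N))))  # ~5.28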
Method
We decide to inject uniform attention because
(1) uniform attention is the densest possible attention and is unstable to learn from a gradient point of view,
(2) yet we can easily supply uniform attention by hand, without learning it, and
(3) it requires no additional parameters and only a small computational cost.
We do this by broadcasting the context with the CB module, as sketched below.
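A minimal PyTorch sketch of this idea, assuming CB acts on a token tensor of shape (batch, tokens, dim); the class name and the exact insertion point inside each block are my assumptions, not taken from the released code:

import torch
import torch.nn as nn

class ContextBroadcasting(nn.Module):
    # Adds the mean token, i.e. the output of uniform attention, to every token.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); the mean over the token axis is
        # broadcast back to all tokens. No learnable parameters are added.
        return x + x.mean(dim=1, keepdim=True)

Inside an existing ViT block this reduces to the advertised one-liner, e.g. x = x + x.mean(dim=1, keepdim=True).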
Characteristics
- The insertion of the CB module significantly lowers the entropy of the learned attention maps
- Injecting dense global interactions into ViTs does not hurt the range of interactions
- Upper layers prefer dense interactions more than lower layers do
- CB is more effective with a small number of heads than with a large number of heads
BibTeX
@inproceedings{hyeon2022scratching,
title={Scratching Visual Transformer's Back with Uniform Attention},
author={Hyeon-Woo, Nam and Yu-Ji, Kim and Heo, Byeongho and Han, Dongyoon and Oh, Seong Joon and Oh, Tae-Hyun},
booktitle = {ICCV},
year={2023}
}
