This repository is the official implementation of "Self-slimmed Vision Transformer" (ECCV 2022). The supported code and models for LV-ViT are provided.
Introduction
SiT (Self-slimmed Vision Transformer) is introduced in our arXiv paper and serves as a generic self-slimmed learning method for vanilla vision transformers. Our concise TSM (Token Slimming Module) softly integrates redundant tokens into fewer informative ones. For stable and efficient training, we introduce a novel FRD framework to leverage structure knowledge, which can densely transfer token information in a flexible auto-encoder manner.
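The core idea of soft token slimming can be illustrated with a minimal PyTorch sketch: each output token is a softmax-weighted combination of all input tokens, so no token is hard-dropped. The class name TokenSlimmingSketch, the two-layer weight predictor, and its hidden size are assumptions for illustration and do not reproduce the exact TSM defined in this repository.

import torch
import torch.nn as nn

class TokenSlimmingSketch(nn.Module):
    # Sketch of soft token slimming: N input tokens are softly aggregated
    # into M < N informative tokens via predicted mixing weights.
    # The weight predictor below is an illustrative assumption.
    def __init__(self, dim, num_out_tokens):
        super().__init__()
        self.weight_pred = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_out_tokens),
        )

    def forward(self, x):                      # x: (B, N, C) patch tokens
        w = self.weight_pred(x)                # (B, N, M) raw scores
        w = w.transpose(1, 2).softmax(dim=-1)  # (B, M, N) soft weights over input tokens
        return w @ x                           # (B, M, C) slimmed tokens

# Example: halve 196 tokens of a 384-dim ViT to 98 informative tokens.
x = torch.randn(2, 196, 384)
print(TokenSlimmingSketch(dim=384, num_out_tokens=98)(x).shape)  # (2, 98, 384)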
Our SiT can speed up ViTs by 1.7x with a negligible accuracy drop, and even by 3.6x while maintaining 97% of their performance. Surprisingly, by simply equipping LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet, surpassing all recent CNNs and ViTs.
Main results on LV-ViT
We follow the settings of LeViT for inference speed evaluation.
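Under this protocol, inference speed is typically reported as throughput (images/s) on a single GPU with a fixed large batch. The helper below is a rough sketch of such a measurement, not the exact script used in this repository; the batch size, warmup, and iteration counts are illustrative assumptions.

import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=256, img_size=224, warmup=10, iters=30):
    # Rough images/s measurement: warm up, then time repeated forward passes
    # on random inputs. Batch size and iteration counts are illustrative.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    for _ in range(warmup):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)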
If you find this repository useful, please use the following BibTeX entry for citation.
@misc{zong2021self,
    title={Self-slimmed Vision Transformer},
    author={Zhuofan Zong and Kunchang Li and Guanglu Song and Yali Wang and Yu Qiao and Biao Leng and Yu Liu},
    year={2021},
    eprint={2111.12624},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
License
This project is released under the MIT license. Please see the LICENSE file for more information.