[ECCV2024] VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
Introduction
Large language models are built on a transformer-based architecture to process textual inputs; LLaMA, for example, stands out among the many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modeling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms across a wide range of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA exhibits substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding.
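To make the idea of a "LLaMA-like vision transformer" concrete, the sketch below shows how LLaMA-style components (RMSNorm and a SwiGLU feed-forward) can be assembled into a transformer block that operates on image patch tokens. This is a minimal, illustrative example and not the repository's actual code: the class and parameter names are hypothetical, and the 2D rotary position embeddings used in the paper are omitted for brevity.

```python
# Hypothetical sketch of a LLaMA-style block over image patch tokens.
# Not the repository's API; 2D rotary position embeddings are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # Gated feed-forward: silu(gate) * value, then project back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class LLaMAStyleVisionBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden_dim=4 * dim)

    def forward(self, x):
        # x: (batch, num_patches, dim) tokens from a patch-embedding layer.
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ffn_norm(x))
        return x


# Example: 196 patch tokens (a 14x14 grid) with an embedding width of 768.
tokens = torch.randn(2, 196, 768)
block = LLaMAStyleVisionBlock(dim=768, num_heads=12)
print(block(tokens).shape)  # torch.Size([2, 196, 768])
```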
If you find VisionLLaMA useful in your research or applications, please consider giving a star ⭐ and citing using the following BibTeX:
```bibtex
@inproceedings{chu2024visionllama,
  title={VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks},
  author={Chu, Xiangxiang and Su, Jianlin and Zhang, Bo and Shen, Chunhua},
  booktitle={European Conference on Computer Vision},
  year={2024}
}
```