Open LLaMA Eyes to See the World

This project aims to optimize the LLaMA model for visual information understanding, in the manner of GPT-4, and to further explore the potential of large language models.

In general, we use a CLIP vision encoder to extract image features, which are then projected into the text embedding dimensionality by an MLP-based or Transformer-based connection network. The visual representation (including the additional special tokens [boi] and [eoi]) is then concatenated with the text representation and trained in an autoregressive manner. The framework is similar to Kosmos-1 and PaLM-E.
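
The sequence layout described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the placeholder `<img>`, the helper name `build_multimodal_sequence`, and the choice to skip image positions when forming next-token targets are all assumptions. The default of 10 image slots follows the "image sequence length" value listed in the training settings.

```python
from typing import List, Optional, Tuple

BOI, EOI = "[boi]", "[eoi]"  # special tokens named in the text
IMG = "<img>"                # hypothetical placeholder for one projected image feature


def build_multimodal_sequence(
    text_tokens: List[str], image_seq_len: int = 10
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (inputs, targets) for next-token prediction.

    inputs  = [boi] + image slots + [eoi] + text tokens, minus the last token
    targets = the same sequence shifted left by one; None marks positions
              skipped in the loss (assumption: only text tokens are predicted).
    """
    sequence = [BOI] + [IMG] * image_seq_len + [EOI] + list(text_tokens)
    inputs = sequence[:-1]
    shifted = sequence[1:]
    targets = [tok if tok not in (BOI, IMG, EOI) else None for tok in shifted]
    return inputs, targets


inputs, targets = build_multimodal_sequence(["a", "dog", "on", "grass"])
print(len(inputs))
print(targets[-4:])
```

In a real model the `<img>` slots would hold the projected CLIP feature vectors rather than strings; the sketch only shows where they sit relative to [boi], [eoi], and the text.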

Code adjustment to support multi-modal generation. Download the CLIP and LLaMA models from Hugging Face. We have also tested that the scripts are compatible with other LLaMA model sizes. Use the script preprocess.py to prepare the data.
Supervised training stage: freeze the LLaMA and CLIP encoder models and optimize only the connection network. In this stage, we use the COCO, CC-3M and COYO-700M datasets with the training script train.py. We provide the training hyper-parameters used in our experiments on the A100 (80G) GPU. We also evaluate image captioning performance on the COCO test set.

Argument	Values
`batch size`	1 * 8 * 8
`epochs`	3
`cut length`	256
`learning rate`	4e-3
`image sequence length`	10
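
The supervised-stage settings above can be collected into a small configuration object. This is a hedged sketch: the class and field names are illustrative and do not come from train.py. The source does not say what the three batch-size factors stand for; a common reading is per-device batch size, number of devices, and gradient accumulation steps, but that is an assumption. Only their product is computed here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageOneConfig:
    # Values copied from the settings above; field names are hypothetical.
    batch_factors: Tuple[int, int, int] = (1, 8, 8)
    epochs: int = 3
    cut_length: int = 256
    learning_rate: float = 4e-3
    image_sequence_length: int = 10

    @property
    def effective_batch_size(self) -> int:
        """Product of the batch-size factors (1 * 8 * 8)."""
        size = 1
        for factor in self.batch_factors:
            size *= factor
        return size


config = StageOneConfig()
print(config.effective_batch_size)  # 64
```
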
Instruction tuning stage: fine-tune the full model on a mix of VQA and language-only instruction data. We apply the LoRA strategy across the entire model using the fine-tuning script finetune.py.

Argument	Values
`batch size`	1024
`epochs`	3
`cut length`	256
`learning rate`	2e-5
`image sequence length`	10
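
As background on the LoRA strategy used in this stage, the standard low-rank update is shown below. The source does not state the rank, the scaling factor, or which weight matrices are adapted, so $r$ and $\alpha$ are left symbolic, and $W_0 \in \mathbb{R}^{d \times k}$ stands for any pretrained weight matrix that is adapted.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here $W_0$ stays frozen and only $A$ and $B$ are trained, so each adapted matrix contributes $r(d + k)$ trainable parameters instead of $dk$.
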
Open-source the trained checkpoint on Hugging Face and provide a Gradio interface for multi-modal generation.

Reference

[1] https://github.com/facebookresearch/llama

[2] https://github.com/tloen/alpaca-lora

feizc/Visual-LLaMA
