CosMo
For Interleaved Vision-language Pre-training
Alex Jinpeng Wang
Linjie Li
Kevin Qinhong Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
In the evolution of vision-language pre-training, the shift from short-text comprehension to extended textual contexts is pivotal. We introduce a contrastive loss into text-generation models, presenting the COntrastive-Streamlined MultimOdal framework (CosMo), which strategically partitions the language model into a dedicated unimodal text-processing component and a multimodal data-handling component.
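The contrastive term mentioned above can be illustrated with a generic CLIP-style symmetric InfoNCE loss over paired image/text embeddings. This is a minimal NumPy sketch of that standard formulation, not the authors' exact loss or code; the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: for a batch of B paired embeddings,
    pair (i, i) is the positive and all other pairs are negatives."""
    # L2-normalize rows so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilize the softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()  # -log p(positive) per row

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while random pairings give a loss near log(B).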
Howto-Interlink7M Dataset Download
Download the data directly from Hugging Face ListView.

HowTo100M Source Video Download
The source videos can be found here.

Dataset Statistics
More details about the dataset statistics can be found here.

Model Card
| Method | Language Model | Vision Model | Samples | Model Weight |
| CosMo2.1B | OPT-IML1.8B | ViT-L | 130M | Pretrained Weight |
| CosMo3.4B | RedPajama-3B | ViT-L | 180M | Pretrained Weight |
| CosMo8.1B | Mistral7B | ViT-L | 180M | Pretrained Weight |
Model Exploration
Our codebase also supports training the following models on A100 GPUs:

| Language Model | Size | Batch Size | GPU Memory |
| Vicuna | 7B | 196 | 70G |
| LLaMA | 7B | 196 | 70G |
| Mixtral-8x7B | 42B | 32 | 80G |
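The GPU-memory figures in the table above are empirical. As a rough sanity check, a common back-of-envelope estimate (my assumption here, not the paper's measurement methodology) counts bytes per parameter for weights, gradients, and AdamW optimizer states under mixed precision:

```python
def model_state_gb(n_params_billion, weight_bytes=2, grad_bytes=2,
                   optim_bytes=8, shard_factor=1):
    """Rough per-GPU memory (GB) for *model states only*: bf16 weights (2 B),
    bf16 gradients (2 B), and fp32 Adam moments (4 B + 4 B) per parameter.
    Activations and framework overhead are NOT included. shard_factor
    approximates ZeRO-style sharding of the states across GPUs."""
    per_param = weight_bytes + grad_bytes + optim_bytes  # bytes per parameter
    return n_params_billion * per_param / shard_factor   # 1e9 params * bytes ~= GB
```

By this estimate a 7B model needs about 84 GB of states unsharded, more than a single 80 GB A100 holds, which is why sharding, parameter freezing, or activation checkpointing is needed to land near the 70G figures reported above.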
Acknowledgement
This work is mainly based on:
- MMC4
Others
We thank:
- Ziteng Gao for discussing the training stability of multi-node training.
- Henry Zhao for his insights on the design of the lightweight cross-attention model.