We introduce DIP, a novel unsupervised post-training method that enhances dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder on pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model with the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of real-world downstream in-context scene understanding tasks, outperforming both the initial vision encoder and prior methods, and offering a practical and effective solution for improving dense representations.
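At its core, each post-training step samples a pseudo in-context task: a few support images with automatically generated dense pseudo-labels, plus a query image whose patch labels must be recovered from the support patch features. Below is a minimal sketch of that idea in PyTorch, under the assumption that the pseudo-task mirrors the downstream retrieval protocol; all helper names and shapes are hypothetical, and the actual training loop lives in `posttraindip.py`.

```python
# Minimal sketch of one DIP-style pseudo in-context task step.
# Hypothetical interface: `encoder` returns (B, N, D) patch tokens,
# and masks hold one pseudo-label id per patch.
import torch
import torch.nn.functional as F

def pseudo_task_step(encoder, support_imgs, support_masks,
                     query_imgs, query_masks, num_classes, beta=0.07):
    fs = F.normalize(encoder(support_imgs).flatten(0, 1), dim=-1)  # (S*N, D)
    fq = F.normalize(encoder(query_imgs).flatten(0, 1), dim=-1)    # (Q*N, D)
    # Transfer support pseudo-labels to query patches by similarity voting.
    attn = (fq @ fs.T / beta).softmax(dim=-1)                      # (Q*N, S*N)
    onehot = F.one_hot(support_masks.flatten(), num_classes).float()
    pred = attn @ onehot                                           # (Q*N, C)
    # Train the encoder so retrieved labels match the query pseudo-labels.
    return F.nll_loss(pred.clamp_min(1e-8).log(), query_masks.flatten())
```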
```bash
git clone https://github.com/sirkosophia/DIP.git
cd DIP
conda create -n dip python=3.10.13 -y -c conda-forge
conda activate dip
pip install -r requirements_dip.txt
```
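As a quick sanity check before training, you can confirm that PyTorch sees a CUDA device (post-training is benchmarked on a single A100):

```python
# Verify the environment: DIP post-training expects a CUDA-capable GPU.
import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```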
See Preparing Datasets for DIP for details on how to download the datasets.
Download our COCO dense pseudo-labels by running the following commands:

```bash
mkdir masks
cd masks
wget https://huggingface.co/datasets/SophiaSirko/DIP_COCO_pseudolabels/resolve/main/dip_COCO_masks.zip
wget https://huggingface.co/datasets/SophiaSirko/DIP_COCO_pseudolabels/resolve/main/dip_COCO_masks_base.zip
unzip dip_COCO_masks.zip
unzip dip_COCO_masks_base.zip
rm dip_COCO_masks.zip dip_COCO_masks_base.zip
cd ..
```
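To verify the download, you can inspect one of the unpacked masks. The snippet below is a sketch that assumes the archives unpack to per-image mask files in PNG format under `masks/`; adjust the glob pattern if the actual layout differs.

```python
# Inspect one pseudo-label mask (path/format are assumptions; adapt as needed).
from pathlib import Path
import numpy as np
from PIL import Image

mask_path = next(Path("masks").rglob("*.png"))  # first mask file found
mask = np.array(Image.open(mask_path))
print(mask_path, mask.shape, "pseudo-label ids:", np.unique(mask)[:10])
```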
To post-train DINOv2R ViT-S/14 on the COCO dataset, run:

```bash
torchrun posttraindip.py --config configs/dip_coco.yaml
```

To post-train DINOv2R ViT-B/14 on the COCO dataset, run:

```bash
torchrun posttraindip.py --config configs/dip_coco_base.yaml
```
| Backbone | Method | PascalVOC (mIoU) | ADE20K (mIoU) | Link |
|---|---|---|---|---|
| ViT-S/14 | DINOv2R | 79.4 | 39.3 | |
| ViT-S/14 | NeCo | 81.0 | 38.9 | |
| ViT-S/14 | DIP | 81.0 | 39.7 | Download |
| ViT-B/14 | DINOv2R | 79.0 | 40.8 | |
| ViT-B/14 | NeCo | 82.4 | 41.2 | |
| ViT-B/14 | DIP | 82.1 | 42.6 | Download |
```bash
# Create the output directory if it doesn't exist
mkdir -p output
wget https://github.com/your-username/your-repo/releases/download/v1.0/dip_coco_basecheckpoint-4.pth -O output/dip_coco_basecheckpoint-4.pth
wget https://github.com/your-username/your-repo/releases/download/v1.0/dip_coco_smallcheckpoint-4.pth -O output/dip_coco_smallcheckpoint-4.pth
```
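To load a checkpoint outside the provided scripts, something along these lines should work. This is a sketch: it assumes the `.pth` file holds backbone weights compatible with the torch.hub DINOv2-with-registers models, possibly nested under a `"model"` key, so inspect `ckpt.keys()` first.

```python
# Load DIP weights into a DINOv2-with-registers ViT-S/14 backbone (sketch).
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg")
ckpt = torch.load("output/dip_coco_smallcheckpoint-4.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # assumption: weights may be nested
missing, unexpected = model.load_state_dict(state, strict=False)
print("missing keys:", len(missing), "| unexpected keys:", len(unexpected))
```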
PascalVOC:

```bash
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn voc -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib small -mlpout 6144 -mlpr 7 -mw output/dip_coco_smallcheckpoint-4.pth
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn voc -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib base -mlpout 6144 -mlpr 7 -mw output/dip_coco_basecheckpoint-4.pth
```

ADE20K:

```bash
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn ade20k -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib small -mlpout 6144 -mlpr 7 -mw output/dip_coco_smallcheckpoint-4.pth
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn ade20k -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib base -mlpout 6144 -mlpr 7 -mw output/dip_coco_basecheckpoint-4.pth
```
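These scripts follow the Hummingbird in-context evaluation protocol: dense patch features from labeled training images are stored in a memory bank (its size presumably set by `-ms`), and each query patch receives a temperature-weighted vote (`--beta`) from its nearest memory entries. A minimal sketch of that retrieval step, with hypothetical tensors (the full protocol is in `hummingbird/launch_humm.py`):

```python
# Hummingbird-style label retrieval for a batch of query patches (sketch).
import torch
import torch.nn.functional as F

def retrieve_labels(query_feats, bank_feats, bank_labels,
                    num_classes, beta=0.07, k=30):
    """query_feats: (Q, D); bank_feats: (M, D); bank_labels: (M,) labels."""
    q = F.normalize(query_feats, dim=-1)
    b = F.normalize(bank_feats, dim=-1)
    topv, topi = (q @ b.T).topk(k, dim=-1)          # k nearest memory patches
    w = (topv / beta).softmax(dim=-1)               # temperature-weighted vote
    onehot = F.one_hot(bank_labels[topi], num_classes).float()  # (Q, k, C)
    return (w.unsqueeze(-1) * onehot).sum(1).argmax(-1)         # (Q,) labels
```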
```bibtex
@misc{sirkogalouchenko2025dipunsuperviseddenseincontext,
  title={DIP: Unsupervised Dense In-Context Post-training of Visual Representations},
  author={Sophia Sirko-Galouchenko and Spyros Gidaris and Antonin Vobecky and Andrei Bursuc and Nicolas Thome},
  year={2025},
  eprint={2506.18463},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.18463},
}
```
This repo relies on the following projects:

- Reproduction of Towards In-context Scene Understanding
- CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion
