OoD-Bench is a benchmark for both datasets and algorithms of out-of-distribution generalization. It positions datasets along two dimensions of distribution shift, diversity shift and correlation shift, unifying disjoint threads of research from the perspective of data distribution. OoD algorithms are then evaluated and compared on two groups of datasets, each dominated by one kind of distribution shift. See our paper (CVPR 2022 oral) for more details.
This repository contains the code to produce the benchmark, which has two main components:
- a framework for quantifying distribution shift that benchmarks the datasets, and
- a modified version of DomainBed that benchmarks the algorithms.
We are extending our work to OoD-Bench+ with many exciting updates:
- A new quantification formula for correlation shift. The sum of diversity shift and correlation shift now captures the squared Hellinger distance between any two joint distributions of the target variable and the non-causal feature variable (its standard definition is recalled right after this list). You can still use the old quantification formula by passing `--legacy_mode` to the quantification script.
- Improved numerical stability for the quantification of correlation shift under very small sample sizes. Classes with too few samples are now ignored with a warning. For ImageNet-V2, which has an extremely small sample size per class, we merge the 1000 classes into 400 super-classes following the WordNet synset hierarchy, which gives a more accurate estimate of the correlation shift between ImageNet and ImageNet-V2.
- Quantification results on two additional datasets: fMoW-WILDS and SVIRO.
- More updates are coming soon.
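For reference, the squared Hellinger distance between two joint distributions $p(y, z)$ and $q(y, z)$ of the target variable $y$ and the (non-causal) feature variable $z$ is, in its standard textbook form,

$$
H^2(p, q) \;=\; \frac{1}{2} \sum_{y} \int \Big( \sqrt{p(y, z)} - \sqrt{q(y, z)} \Big)^2 \, dz .
$$

How this quantity splits into the diversity and correlation terms is defined in the paper and the `ood_bench` implementation; the formula above is only the standard definition, recalled here for convenience.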
- Python 3.6 or above
- The packages listed in `requirements.txt`. You can install them via `pip install -r requirements.txt`.
Clone the submodules and add them to `PYTHONPATH`:

```sh
cd OoD-Bench
git submodule update --init --recursive
export PYTHONPATH="$PYTHONPATH:$(pwd)/external/DomainBed/"
export PYTHONPATH="$PYTHONPATH:$(pwd)/external/wilds/"
```
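To verify the setup, you can run a quick import check (this assumes the top-level packages of the two submodules are named `domainbed` and `wilds`):

```sh
python -c "import domainbed, wilds; print('submodules found on PYTHONPATH')"
```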
Please follow this instruction.
The quantification process consists of three main steps:
(1) training an environment classifier,
(2) extracting features from the trained classifier, and
(3) measuring the shifts with the extracted features.
The module `ood_bench.scripts.main` handles the whole process for you.
For example, to quantify the distribution shift between the training environments (indexed by 0 and 1) and the test environment (indexed by 2) of Colored MNIST with 16 trials, you can simply run:
```sh
python -m ood_bench.scripts.main \
    --n_trials 16 \
    --data_dir /path/to/my/data \
    --dataset ColoredMNIST_IRM \
    --envs_p 0 1 \
    --envs_q 2 \
    --backbone mlp \
    --output_dir /path/to/store/outputs
```
In other cases where pretrained models are used, `--pretrained_model_path` must be specified. For models in the torchvision model zoo, you can pass `auto` to the argument and the pretrained model will be downloaded automatically.
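For instance, quantifying PACS with an ImageNet-pretrained torchvision backbone might look like the sketch below. The names here are illustrative: it assumes PACS is registered under the dataset name `PACS`, that a `resnet18` backbone is available in `ood_bench/networks.py`, and that the environment indices match your split.

```sh
python -m ood_bench.scripts.main \
    --n_trials 16 \
    --data_dir /path/to/my/data \
    --dataset PACS \
    --envs_p 0 1 2 \
    --envs_q 3 \
    --backbone resnet18 \
    --pretrained_model_path auto \
    --output_dir /path/to/store/outputs
```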
These two optional arguments are also useful:
- `--parallel`: conduct the trials in parallel on multiple GPUs. The maximum number of parallel trials is the number of visible GPUs, which can be controlled by setting `CUDA_VISIBLE_DEVICES`.
- `--calibrate`: calibrate the thresholds `eps_div` and `eps_cor` so that the estimated diversity and correlation shift stay within a range close to 0 under the i.i.d. condition.
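For example, the Colored MNIST run above could be spread over two GPUs with calibration enabled, as sketched below (this assumes both options are simple on/off flags, as described above):

```sh
CUDA_VISIBLE_DEVICES=0,1 python -m ood_bench.scripts.main \
    --n_trials 16 \
    --data_dir /path/to/my/data \
    --dataset ColoredMNIST_IRM \
    --envs_p 0 1 \
    --envs_q 2 \
    --backbone mlp \
    --parallel \
    --calibrate \
    --output_dir /path/to/store/outputs
```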
The following results are produced by the scripts under `ood_bench/examples`, all with automatic calibration.
Dataset | Div. shift | Cor. shift |
---|---|---|
PACS | 0.66 ± 0.05* | 0.08 ± 0.03* |
Office-Home | 0.07 ± 0.01* | 0.07 ± 0.03* |
Terra Incognita | 1.00 ± 0.06* | 0.00 ± 0.00* |
WILDS-Camelyon | 0.96 ± 0.19* | 0.00 ± 0.00* |
DomainNet | 0.37 ± 0.03 | 0.25 ± 0.04 |
Colored MNIST | 0.00 ± 0.00 | 0.43 ± 0.03 |
CelebA | 0.01 ± 0.00 | 0.17 ± 0.06 |
NICO | 0.02 ± 0.02 | 0.20 ± 0.10 |
ImageNet-A † | 0.04 ± 0.01 | 0.03 ± 0.03 |
ImageNet-R † | 0.10 ± 0.02 | 0.18 ± 0.05 |
ImageNet-V2-Super400 † | 0.01 ± 0.00 | 0.06 ± 0.02 |
fMoW-WILDS | 0.21 ± 0.03 | 0.09 ± 0.02 |
SVIRO | 0.87 ± 0.09 | 0.00 ± 0.00 |
\* averaged over all leave-one-domain-out splits
† with respect to the original ImageNet
The following results can be obtained by setting `--legacy_mode` in the scripts under `ood_bench/examples`:
Dataset | Diversity shift | Correlation shift |
---|---|---|
PACS | 0.6715 ± 0.0392* | 0.0338 ± 0.0156* |
Office-Home | 0.0657 ± 0.0147* | 0.0699 ± 0.0280* |
Terra Incognita | 0.9846 ± 0.0935* | 0.0002 ± 0.0003* |
DomainNet | 0.3740 ± 0.0343* | 0.1061 ± 0.0181* |
WILDS-Camelyon | 0.9632 ± 0.1907 | 0.0000 ± 0.0000 |
Colored MNIST | 0.0013 ± 0.0006 | 0.5468 ± 0.0278 |
CelebA | 0.0031 ± 0.0017 | 0.1868 ± 0.0530 |
NICO | 0.0176 ± 0.0158 | 0.1968 ± 0.0888 |
ImageNet-A † | 0.0435 ± 0.0123 | 0.0222 ± 0.0192 |
ImageNet-R † | 0.1024 ± 0.0188 | 0.1180 ± 0.0311 |
ImageNet-V2 † | 0.0079 ± 0.0017 | 0.2362 ± 0.0607 |
Note: the results shown above differ somewhat from those reported in our paper, mainly because we reworked the original implementation to ease public use and to improve quantification stability. One of the main improvements is the use of calibration. Previously, the same empirically sound thresholds were used across all the datasets studied in our paper, but those thresholds may not be suitable for other datasets.
- New datasets must first be added to `external/DomainBed/domainbed/datasets.py` as a subclass of `MultipleDomainDataset`, for example (a command using the new dataset is sketched after this list):
```python
class MyDataset(MultipleDomainDataset):
    ENVIRONMENTS = ['env0', 'env1']  # at least two environments

    def __init__(self, root, test_envs, hparams):
        super().__init__()
        # you may change the transformations below
        transform = get_transform()
        augment_scheme = hparams.get('data_augmentation_scheme', 'default')
        augment_transform = get_augment_transform(augment_scheme)

        self.datasets = []  # required
        for i, env_name in enumerate(self.ENVIRONMENTS):
            if hparams['data_augmentation'] and (i not in test_envs):
                env_transform = augment_transform
            else:
                env_transform = transform
            # load the environments, not necessarily as ImageFolders;
            # you may write a specialized class to load them; the class
            # must possess an attribute named `samples`, a sequence of
            # 2-tuples where the second elements are the labels
            dataset = ImageFolder(Path(root, env_name), transform=env_transform)
            self.datasets.append(dataset)

        self.input_shape = (3, 224, 224)  # required
        self.num_classes = 2  # required
```
- New network backbones must first be added to `ood_bench/networks.py` as a subclass of `Backbone`, for example:
```python
class MyBackbone(Backbone):
    def __init__(self, hdim, pretrained_model_path=None):
        self._hdim = hdim
        super(MyBackbone, self).__init__(pretrained_model_path)

    @property
    def hdim(self):
        return self._hdim

    def _load_modules(self):
        self.modules_ = nn.Sequential(
            nn.Linear(3 * 14 * 14, self.hdim),
            nn.ReLU(True),
        )

    def forward(self, x):
        return self.modules_(x)
```
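Once registered, the new dataset and backbone can then be selected on the command line in the same way as `ColoredMNIST_IRM` and `mlp` in the earlier example. The sketch below assumes the dataset is selected by its class name and that the backbone has been exposed under the hypothetical name `mybackbone`; the actual naming mechanism is defined in `domainbed/datasets.py` and `ood_bench/networks.py`.

```sh
python -m ood_bench.scripts.main \
    --n_trials 16 \
    --data_dir /path/to/my/data \
    --dataset MyDataset \
    --envs_p 0 \
    --envs_q 1 \
    --backbone mybackbone \
    --output_dir /path/to/store/outputs
```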
Please refer to this repository.
If you find the code useful or find our paper relevant to your research, please consider citing:
```bibtex
@inproceedings{ye2022ood,
  title={OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization},
  author={Ye, Nanyang and Li, Kaican and Bai, Haoyue and Yu, Runpeng and Hong, Lanqing and Zhou, Fengwei and Li, Zhenguo and Zhu, Jun},
  booktitle={CVPR},
  year={2022}
}
```