We propose a simple yet effective framework for improving generalization from small amounts of data. In our work, we bring back fully-connected layers at the end of CNN-based architectures. We show that by adding as little as 0.37% extra parameters during training, we can significantly improve generalization in the low-data regime. Our network architecture consists of two main parts: a convolutional backbone network and our proposed Feature Refiner (FR), which is based on multi-layer perceptrons. Our method is task- and model-agnostic and can be applied to many convolutional networks. We first extract features with the convolutional backbone network, then apply our FR followed by a task-specific head. More precisely, we reduce the feature dimension d_bbf to d_frf with a single linear layer to limit the number of extra parameters, and then apply a symmetric two-layer MLP wrapped in normalization layers.
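For concreteness, the following is a minimal PyTorch-style sketch of such a Feature Refiner. The hidden width d_frf = 256, the use of LayerNorm as the wrapping normalization, and the GELU activation are illustrative assumptions rather than values taken from the text above.

import torch
import torch.nn as nn

class FeatureRefiner(nn.Module):
    # Sketch of a Feature Refiner: a single dimension-reducing linear layer
    # followed by a symmetric two-layer MLP wrapped in normalization layers.
    # d_frf = 256, LayerNorm, and GELU are assumptions for illustration.
    def __init__(self, d_bbf: int, d_frf: int = 256):
        super().__init__()
        # Reduce the backbone feature dimension d_bbf to d_frf with one
        # linear layer to keep the number of extra parameters small.
        self.reduce = nn.Linear(d_bbf, d_frf)
        # Symmetric two-layer MLP wrapped in normalization layers.
        self.mlp = nn.Sequential(
            nn.LayerNorm(d_frf),
            nn.Linear(d_frf, d_frf),
            nn.GELU(),
            nn.Linear(d_frf, d_frf),
            nn.LayerNorm(d_frf),
        )

    def forward(self, backbone_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.reduce(backbone_features))

A task-specific head, for example a linear classifier, is then applied on top of the refined features.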
One could argue that adding parameters improves performance simply because of the increased expressivity of the network. To rule out this explanation, we develop an online joint knowledge distillation (OJKD) method. OJKD uses our FR solely during training, which lets us keep the exact same architecture as our baseline networks during inference.
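Below is a minimal sketch of one possible training objective for such an online joint knowledge distillation, assuming two classification heads: one on top of the FR (used only during training) and one directly on the backbone features (kept for inference). The distillation direction, softmax temperature, and loss weighting are assumptions and may differ from the exact formulation used in the paper.

import torch.nn.functional as F

def ojkd_loss(backbone, refiner, fr_head, baseline_head, images, labels,
              temperature=4.0, kd_weight=1.0):
    # Hypothetical OJKD objective: both heads share the backbone and are
    # trained jointly; the FR branch is dropped at test time, so inference
    # uses exactly the baseline architecture.
    features = backbone(images)             # shared CNN features
    fr_logits = fr_head(refiner(features))  # FR branch, train time only
    base_logits = baseline_head(features)   # baseline branch, kept at test time

    # Supervised cross-entropy for both branches.
    ce = F.cross_entropy(fr_logits, labels) + F.cross_entropy(base_logits, labels)

    # Distill the FR branch's softened predictions into the baseline branch.
    kd = F.kl_div(
        F.log_softmax(base_logits / temperature, dim=1),
        F.softmax(fr_logits.detach() / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    return ce + kd_weight * kd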
We compare the results of our method with those of ResNet18. In the first training cycle (1000 labels), our method outperforms ResNet18 by 7.6 percentage points (pp). In the second cycle, we outperform ResNet18 by more than 10 pp. We keep outperforming ResNet18 until the seventh cycle, where our improvement is half a percentage point; for the remaining cycles, both methods reach the same accuracy. A common tendency across all datasets is that the gap between our method and the baseline shrinks as the number of labeled samples grows. Therefore, dropping the fully-connected layers does not cause any disadvantage when a large labeled dataset is available, as was found in [6]. However, that work did not analyze this question in the low-data regime, where using FC layers after CNN architectures is clearly beneficial.
We also check whether our method can be used with backbones other than ResNet18. The goal of this experiment is to show that our method is backbone-agnostic and generalizes both to different versions of ResNet and to other types of convolutional neural networks. Our method significantly outperforms the baselines on both datasets and for all three types of backbones.
The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes
NeurIPS 2022
Peter Kocsis
TU Munich
Peter Súkeník
IST Austria
Guillem Brasó
TU Munich
Matthias Nießner
TU Munich
Laura Leal-Taixé
TU Munich
Ismail Elezi
TU Munich
Abstract
Convolutional neural networks were the standard for solving many computer vision tasks until recently, when Transformer- or MLP-based architectures started to show competitive performance. These architectures typically have a vast number of weights and need to be trained on massive datasets; hence, they are not suitable for use in low-data regimes. In this work, we propose a simple yet effective framework to improve generalization from small amounts of data. We augment modern CNNs with fully-connected (FC) layers and show the massive impact this architectural change has in low-data regimes. We further present an online joint knowledge-distillation method to utilize the extra FC layers at train time but avoid them during test time. This allows us to improve the generalization of a CNN-based model without any increase in the number of weights at test time. We perform classification experiments for a large range of network backbones and several standard datasets on supervised learning and active learning. Our models significantly outperform networks without fully-connected layers, reaching a relative improvement of up to 16% in validation accuracy in the supervised setting without adding any extra parameters during inference.
Method
Feature Refiner
Online Joint Knowledge Distillation
Experiments
Citation
@inproceedings{kocsis2022lowdataregime,
author = {Peter Kocsis
and Peter S\'{u}ken\'{i}k
and Guillem Bras\'{o}
and Matthias Nie{\ss}ner
and Laura Leal-Taix\'{e}
and Ismail Elezi},
title = {The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes},
booktitle = {Proc. NeurIPS},
year={2022}
}