PURRlab @ IT University of Copenhagen
Pattern Recognition Revisited
About
PURRlab's research interests lie within the broad area of trustworthy machine learning and its applications to medical imaging, with a focus on datasets. We are particularly interested in understanding the similarity and diversity of datasets, methods for learning with limited labeled data such as transfer learning, and meta-research on machine learning in medical imaging.
We have two open postdoc positions in the CHEETAH project. Read more about the project and apply here (or help spread the word!)
Dec 01, 2025
Beatrix Miranda Ginn Nielsen joins the lab as a postdoctoral researcher, welcome Beatrix!
Oct 01, 2025
Niclas Claßen joins the lab as a research assistant to work on the project “CHEETAH: CHallenges of Evaluating Teams and Algorithms”, welcome Niclas!
Sep 23, 2025
We have two poster presentations at the MICCAI FAIMI workshop: one by Théo on CLIP-based models, and one by MSc students Regitze, Nikolette, and Andreas on robustness in skin lesion classification.
Sep 04, 2025
Théo Sourget starts as a PhD student to work on the project “CHEETAH: CHallenges of Evaluating Teams and Algorithms”.
The advancement of machine learning algorithms in medical image analysis requires the expansion of training datasets. A popular and cost-effective approach is automated annotation extraction from free-text medical reports, primarily due to the high costs associated with expert clinicians annotating medical images, such as chest X-rays. However, it has been shown that the resulting datasets are susceptible to biases and shortcuts. Another strategy to increase the size of a dataset is crowdsourcing, a widely adopted practice in general computer vision with some success in medical image analysis. In a similar vein to crowdsourcing, we enhance two publicly available chest X-ray datasets by incorporating non-expert annotations. However, instead of using diagnostic labels, we annotate shortcuts in the form of tubes. We collect 3.5k chest drain annotations for NIH-CXR14 and 1k annotations for four different tube types in PadChest, and create the Non-Expert Annotations of Tubes in X-rays (NEATX) dataset. We train a chest drain detector on the non-expert annotations and show that it generalizes well to expert labels. Moreover, we compare our annotations to those provided by experts and show “moderate” to “almost perfect” agreement. Finally, we present a pathology agreement study to raise awareness about the quality of ground truth annotations. We make our dataset available on Zenodo at https://zenodo.org/records/14944064 and our code available at https://github.com/purrlab/chestxr-label-reliability.
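As a minimal illustration of the kind of annotator-agreement check described above, the sketch below computes Cohen's kappa between non-expert and expert chest drain labels. The labels and the interpretation bands are illustrative assumptions, not the NEATX annotations or the study's actual analysis.

```python
# Minimal sketch of an inter-annotator agreement check, assuming binary
# chest drain labels (1 = drain present, 0 = absent). The labels below are
# made-up examples, not the NEATX annotations.
from sklearn.metrics import cohen_kappa_score

non_expert_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
expert_labels     = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(non_expert_labels, expert_labels)
print(f"Cohen's kappa: {kappa:.2f}")

# On the commonly used Landis & Koch scale, 0.41-0.60 is "moderate" and
# 0.81-1.00 is "almost perfect" agreement.
```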
Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static – they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings about datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifacts and datasets. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, examine the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at https://inthepicture.itu.dk/.
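To make the SQL-database component mentioned above concrete, here is a small sketch of how dataset/research-artifact citation links could be stored and queried with sqlite3. The table and column names are assumptions for illustration, not the actual schema behind https://inthepicture.itu.dk/.

```python
# Hypothetical schema linking datasets to research artifacts (bias reports,
# extra annotations, etc.) discovered after publication; names are assumed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,      -- e.g. 'NIH-CXR14'
    modality TEXT                -- e.g. 'chest X-ray'
);
CREATE TABLE research_artifact (
    id         INTEGER PRIMARY KEY,
    dataset_id INTEGER REFERENCES dataset(id),
    kind       TEXT,             -- e.g. 'shortcut annotations'
    citation   TEXT              -- the work reporting the finding
);
""")

conn.execute("INSERT INTO dataset VALUES (1, 'NIH-CXR14', 'chest X-ray')")
conn.execute(
    "INSERT INTO research_artifact VALUES "
    "(1, 1, 'shortcut annotations', 'NEATX, Zenodo record 14944064')"
)

# List each dataset together with the research artifacts that reference it.
for name, kind, citation in conn.execute(
    "SELECT d.name, a.kind, a.citation "
    "FROM research_artifact a JOIN dataset d ON a.dataset_id = d.id"
):
    print(name, kind, citation)
```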