My research focuses on efficient and user-friendly training methods for machine learning. I’m particularly interested in eliminating tedious hyperparameters (e.g., learning rates, schedules) to automate neural network training and make deep learning more accessible. A key aspect of my work is designing rigorous and meaningful benchmarks for training methods, such as AlgoPerf.
Previously, I earned my PhD in Computer Science from the University of Tübingen, supervised by Philipp Hennig, as part of the IMPRS-IS (International Max Planck Research School for Intelligent Systems). Before that, I studied Simulation Technology (B.Sc., M.Sc.) at the University of Stuttgart and Industrial and Applied Mathematics (M.Sc.) at Eindhoven University of Technology (TU/e). My master's thesis, supervised by Maxim Pisarenco and Michiel Hochstenbach, explored novel preconditioners for structured Toeplitz matrices; the work was carried out at ASML (Eindhoven), a company specializing in lithography systems for the semiconductor industry.
The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule-Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress made so far and the considerable room for further improvement.
@inproceedings{Kasimbeg2025AlgoPerfResults,
  title     = {Accelerating neural network training: An analysis of the {AlgoPerf} competition},
  author    = {Kasimbeg, Priya and Schneider, Frank and Eschenhagen, Runa and Bae, Juhan and Sastry, Chandramouli Shama and Saroufim, Mark and Feng, Boyuan and Wright, Less and Yang, Edward Z. and Nado, Zachary and Medapati, Sourabh and Hennig, Philipp and Rabbat, Michael and Dahl, George E.},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
  url       = {https://openreview.net/forum?id=CtM5xjRSfm},
}
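To make the competition's time-to-result scoring concrete, here is a minimal Python sketch of the idea: each submission is timed until it reaches a fixed validation target on each workload, and submissions are then compared with a performance-profile-style score. The runtimes, workload names, and the single-threshold score below are illustrative placeholders, not the official AlgoPerf scoring code (which integrates performance profiles over the full workload set).

```python
import math

# Hypothetical seconds-to-target per workload; math.inf marks a workload
# where the submission never reached the validation target.
runtimes = {
    "submission_a": {"wmt": 4200.0, "ogbg": 1800.0, "criteo": 3600.0},
    "submission_b": {"wmt": 5100.0, "ogbg": math.inf, "criteo": 3100.0},
}

def performance_profile_score(runtimes, tau=1.5):
    """Fraction of workloads a submission solves within a factor tau of the
    fastest submission on that workload (a single point of a performance
    profile; the official benchmark score integrates over tau)."""
    workloads = list(next(iter(runtimes.values())))
    best = {w: min(r[w] for r in runtimes.values()) for w in workloads}
    return {
        name: sum(r[w] <= tau * best[w] for w in workloads) / len(workloads)
        for name, r in runtimes.items()
    }

print(performance_profile_score(runtimes))
# e.g. {'submission_a': 1.0, 'submission_b': 0.666...}
```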
The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with weight-sharing. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC, K-FAC-expand and K-FAC-reduce. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via optimising the marginal likelihood for a Wide ResNet. Finally, we observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer. However, both variations are able to reach a fixed validation metric target in 50-75% of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. This highlights the potential of applying K-FAC to modern neural network architectures.
@inproceedings{Eschenhagen2023KFAC,
  author    = {Eschenhagen, Runa and Immer, Alexander and Turner, Richard and Schneider, Frank and Hennig, Philipp},
  booktitle = {Neural Information Processing Systems (NeurIPS)},
  title     = {{Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures}},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2023/file/6a6679e3d5b9f7d5f09cdb79a5fc3fd8-Paper-Conference.pdf},
  year      = {2023},
}
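The expand/reduce distinction can be sketched in a few lines of NumPy for a single linear layer whose weight is shared across an extra dimension R (sequence positions in a transformer, nodes in a graph, and so on). The arrays below are random placeholders and the aggregation/scaling conventions are simplified for illustration; the paper defines the exact estimators.

```python
import numpy as np

N, R, D_in, D_out = 8, 16, 32, 64          # batch size, shared dim, layer widths
a = np.random.randn(N, R, D_in)            # layer inputs (activations)
g = np.random.randn(N, R, D_out)           # backpropagated output gradients

# K-FAC-expand: treat the weight-sharing dimension R like extra batch examples.
a_e = a.reshape(N * R, D_in)
g_e = g.reshape(N * R, D_out)
A_expand = a_e.T @ a_e / (N * R)           # (D_in, D_in) input factor
B_expand = g_e.T @ g_e / (N * R)           # (D_out, D_out) gradient factor

# K-FAC-reduce: aggregate over the shared dimension first, then take the outer
# products, which makes computing the factors cheaper.
a_r = a.mean(axis=1)                       # (N, D_in)
g_r = g.sum(axis=1)                        # (N, D_out)
A_reduce = a_r.T @ a_r / N
B_reduce = g_r.T @ g_r / N

# Either pair (A, B) defines a Kronecker-factored curvature approximation
# kron(A, B) that preconditions the gradient of the shared weight matrix.
print(A_expand.shape, B_expand.shape, A_reduce.shape, B_reduce.shape)
```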
When engineers train deep learning models, they are very much 'flying blind'. Commonly used methods for real-time training diagnostics, such as monitoring the train/test loss, are limited. Assessing a network's training process solely through these performance indicators is akin to debugging software without access to internal states through a debugger. To address this, we present Cockpit, a collection of instruments that enable a closer look into the inner workings of a learning machine, and a more informative and meaningful status report for practitioners. It facilitates the identification of learning phases and failure modes, like ill-chosen hyperparameters. These instruments leverage novel higher-order information about the gradient distribution and curvature, which has only recently become efficiently accessible. We believe that such a debugging tool, which we open-source for PyTorch, is a valuable help in troubleshooting the training process. By revealing new insights, it also more generally contributes to explainability and interpretability of deep nets.
@inproceedings{Schneider2021Cockpit,
  author    = {Schneider, Frank and Dangel, Felix and Hennig, Philipp},
  booktitle = {Neural Information Processing Systems (NeurIPS)},
  editor    = {Ranzato, M. and Beygelzimer, A. and Dauphin, Y. and Liang, P.S. and Vaughan, J. Wortman},
  pages     = {20825--20837},
  publisher = {Curran Associates, Inc.},
  title     = {Cockpit: A Practical Debugging Tool for the Training of Deep Neural Networks},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2021/file/ae3539867aaeec609a4260c6feb725f4-Paper.pdf},
  volume    = {34},
  year      = {2021},
}
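For illustration, here is a plain-PyTorch sketch of the kind of per-step instruments Cockpit reports (gradient norm, distance from initialization, and similar). It only mimics the idea; the actual package builds on BackPACK to access richer quantities such as per-sample gradient statistics and curvature information, and ships its own API and plotting front-end.

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
init_params = [p.detach().clone() for p in model.parameters()]

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    # "Instruments": global gradient norm and distance travelled from the
    # initial parameters, logged alongside the loss.
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
    dist_init = torch.sqrt(sum(((p - p0) ** 2).sum()
                               for p, p0 in zip(model.parameters(), init_params)))
    opt.step()
    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}  "
              f"|grad| {grad_norm.item():.3f}  dist(init) {dist_init.item():.3f}")
```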
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.
@inproceedings{Schmidt2021CrowdedValley,
  author    = {Schmidt, Robin M. and Schneider, Frank and Hennig, Philipp},
  booktitle = {International Conference on Machine Learning (ICML)},
  title     = {{Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers}},
  year      = {2021},
  url       = {https://proceedings.mlr.press/v139/schmidt21a.html},
}
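A toy version of the comparison protocol behind finding (ii): evaluate several optimizers with their library-default hyperparameters on the same task and keep the best result, instead of tuning a single method heavily. The task, training budget, and optimizer list below are placeholders; the actual benchmark uses the DeepOBS test problems and far larger budgets.

```python
import torch

def run(optimizer_cls, steps=200):
    """Train a small regression model with an optimizer's default settings."""
    torch.manual_seed(0)                              # same init and data for all
    model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 1))
    opt = optimizer_cls(model.parameters())           # library-default hyperparameters
    loss_fn = torch.nn.MSELoss()
    x, y = torch.randn(256, 20), torch.randn(256, 1)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

for opt_cls in [torch.optim.Adam, torch.optim.AdamW, torch.optim.RMSprop,
                torch.optim.Adagrad, torch.optim.Adadelta]:
    print(f"{opt_cls.__name__:>10s}: final training loss {run(opt_cls):.4f}")
```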
Because the choice and tuning of the optimizer affect the speed, and ultimately the performance, of deep learning, there is significant past and recent research in this area. Yet, perhaps surprisingly, there is no generally agreed-upon protocol for the quantitative and reproducible evaluation of optimization strategies for deep learning. We suggest routines and benchmarks for stochastic optimization, with special focus on the unique aspects of deep learning, such as stochasticity, tunability and generalization. As the primary contribution, we present DeepOBS, a Python package of deep learning optimization benchmarks. The package addresses key challenges in the quantitative assessment of stochastic optimizers, and automates most steps of benchmarking. The library includes a wide and extensible set of ready-to-use realistic optimization problems, such as training Residual Networks for image classification on ImageNet or character-level language prediction models, as well as popular classics like MNIST and CIFAR-10. The package also provides realistic baseline results for the most popular optimizers on these test problems, ensuring a fair comparison to the competition when benchmarking new optimizers, and without having to run costly experiments. It comes with output back-ends that directly produce LaTeX code for inclusion in academic publications. It is written in TensorFlow and available open source.
@inproceedings{Schneider2018DeepOBS,
  title     = {Deep{OBS}: A Deep Learning Optimizer Benchmark Suite},
  author    = {Schneider, Frank and Balles, Lukas and Hennig, Philipp},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2019},
  url       = {https://openreview.net/forum?id=rJg6ssC5Y7},
}
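As a usage example, the quick-start for the original TensorFlow version of DeepOBS looks roughly like the sketch below: wrap an optimizer class and its tunable hyperparameters in a StandardRunner and point it at one of the packaged test problems. The argument names are reproduced from memory of the documentation and may differ between DeepOBS releases (a PyTorch variant was added later), so treat the exact signatures as assumptions rather than a verbatim API reference.

```python
# Sketch of the DeepOBS (TensorFlow) quick-start; exact argument names may
# differ between package versions.
import tensorflow as tf
from deepobs import tensorflow as tfobs

# Optimizer to benchmark and the hyperparameters DeepOBS should expose.
optimizer_class = tf.train.MomentumOptimizer
hyperparams = [
    {"name": "momentum", "type": float},
    {"name": "use_nesterov", "type": bool, "default": False},
]

# The runner wires the optimizer into a packaged test problem, runs training,
# and writes results in the benchmark's standard output format.
runner = tfobs.runners.StandardRunner(optimizer_class, hyperparams)
runner.run(
    testproblem="quadratic_deep",   # one of the ready-to-use test problems
    hyperparams={"momentum": 0.99},
    num_epochs=10,
)
```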