NeurIPS 2025: The Science of Benchmarking Tutorial

The Science of Benchmarking

What's Measured, What's Missing, What's Next

News: slides available at benchmarking.science/slides.pdf

NeurIPS 2025 Tutorial

Tuesday, December 2, 2025, 1:30pm to 4:00pm

NeurIPS 2025, San Diego Convention Center, Exhibit Hall G, H

Martin Ziqiao Ma

University of Michigan

Michael Saxon

University of Washington

Xiang Yue

Carnegie Mellon University (Now @ Meta)

https://benchmarking.science

Outline

1. Epistemology, Design & Practice

What should we measure? What makes a good benchmark?

Video coming soon

2. Limitations

What are the main current issues in benchmarking? How is the changing landscape of models making benchmarks worse? How do people approach these issues? What should attendees who want to get into evaluation know?

3. Emerging Paradigms

How are people addressing these problems? This part touches on adversarial methods, dynamic benchmarks, arenas and scaled human evals, simulators and sandboxes, and applied interpretability. What can attendees work on?

Video coming soon