NYU CTF Bench

A benchmark of CTF challenges to test LLM capabilities in cybersecurity

NeurIPS'24 Datasets and Benchmarks

Minghao Shao*, Sofija Jancheska*, Meet Udeshi*, Brendan Dolan-Gavitt*,
Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg,
Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

The NYU CTF Bench is designed to evaluate the cybersecurity capabilities of LLM agents. We provide difficult real-world CTF challenges to facilitate research on improving LLMs at interactive cybersecurity tasks and complex automated task planning. Evaluating LLM agents on the NYU CTF challenges yields insights into their potential for AI-driven cybersecurity and real-world threat management. Check out the paper for details.

@inproceedings{shao2024nyuctfbench,
     author = {Shao, Minghao and Jancheska, Sofija and Udeshi, Meet and Dolan-Gavitt, Brendan and Xi, Haoran and Milner, Kimberly and Chen, Boyuan and Yin, Max and Garg, Siddharth and Krishnamurthy, Prashanth and Khorrami, Farshad and Karri, Ramesh and Shafique, Muhammad},
     booktitle = {Advances in Neural Information Processing Systems},
     pages = {57472--57498},
     title = {NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security},
     url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/69d97a6493fbf016fff0a751f253ad18-Paper-Datasets_and_Benchmarks_Track.pdf},
     volume = {37},
     year = {2024}
}

Leaderboard

#	Agent	Model	Score	Logs	Link

How to Submit

All submissions are managed in the leaderboard submissions GitHub repository. Follow the README in that repository to make a submission.
