Code for evals measuring frontier model capabilities.
- PaperBench: End-to-end replication of state-of-the-art AI papers. Paper | Blog
- SWE-Lancer: Real freelance software engineering tasks with end-to-end tests. Paper | Blog
## Usage

### Requirements
We manage environments with uv. Install uv once, then run uv sync (or uv pip install -r ...) inside the project of interest to create its virtual environment from the checked-in uv.lock.
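A minimal setup sketch, assuming uv's standard installer (installing uv via pip or your system package manager also works):

```bash
# Install uv once (see uv's documentation for alternative install methods).
curl -LsSf https://astral.sh/uv/install.sh | sh

# Inside the eval you want to work on, build its environment from the checked-in lockfile.
cd project/paperbench   # or any other eval directory
uv sync
```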
### Running Evals
Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite's README.md, then consult any scripts under scripts/ or runtime_*/ directories for orchestration details. When in doubt:

- Each eval directory is its own isolated project with a README.md, pyproject.toml, and uv.lock.
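As a rough sketch, a run typically looks like the following; the orchestration script name and its arguments here are placeholders, so substitute whatever the eval's own README.md documents:

```bash
# Hypothetical walk-through of launching an eval; the real entry point and flags
# are documented in each eval's README.md.
cd project/paperbench            # every eval directory follows the same layout
cat README.md                    # reproduction steps and model configuration
ls scripts/ runtime_*/           # orchestration scripts referenced by the README
uv run scripts/<entry-point>.py  # placeholder name; run the documented script
```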
## Development Workflow
Use uv to create or activate the environment for the project you are working on. Example for PaperBench:
```bash
cd project/paperbench
uv sync
uv run pytest
```
Code style and linting use Ruff (with autofix profiles in pyproject.toml and project/common/tooling/ruff_autofix_minimal.toml) and Black. Run uv run ruff check --fix or use the provided Poe/make tasks where available.
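For example, the direct commands look like this (Poe or make task names vary by project, so treat these as the fallback invocation):

```bash
# Lint with autofix using the repo's Ruff configuration, then format with Black.
uv run ruff check --fix .
uv run black .
```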
Shared utilities live under project/common; changes there may affect multiple evals. If you create new shared subpackages, update the relevant editable dependency declarations in the evals that use them.
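As an illustration, registering a new shared subpackage as an editable dependency of an eval could look like the sketch below; the subpackage path is a made-up placeholder:

```bash
# From inside the eval that should depend on the new shared subpackage.
cd project/paperbench
uv add --editable ../common/<new-subpackage>   # placeholder path; point at the real directory
```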