AlgoTune
Can Language Models Speed Up General-Purpose Numerical Programs?
Can language models optimize the runtime of popular algorithms like gzip compression, AES encryption or SVD? To answer this, we built AlgoTune, a benchmark consisting of more than one hundred widely used math, physics, and computer science functions. For each function, the goal is to write code that is faster than the reference implementation while producing the same outputs as the reference, on a held-out test set of inputs. In addition to the benchmark, we also developed AlgoTuner, an agent which enables language models to iteratively optimize code.
This site contains AlgoTuner trajectories for all AlgoTune tasks. Each entry shows the complete conversation between the model and the AlgoTune environment, including code edits, timing evaluations, and the iterative optimization process.
Leaderboard
We use our agent, AlgoTuner, to optimize the functions in AlgoTune with each of the state-of-the-art models listed below. With these models, AlgoTuner achieves impressive surface-level speedups on many tasks, but it does not come up with novel algorithms.
| Model Name | AlgoTune Score |
|---|---|
| o4-mini | 1.72x |
| DeepSeek R1 | 1.70x |
| GPT-5 | 1.67x |
| Claude Sonnet 4.5 | 1.52x |
| GLM-4.5 | 1.52x |
| Gemini 2.5 Pro | 1.51x |
| Qwen3 Coder | 1.44x |
| gpt-oss-120b | 1.41x |
| GPT-5 Mini | 1.38x |
| Claude Opus 4.1 | 1.34x |
| Claude Opus 4 | 1.33x |
| GPT-5 Pro (medium) | 1.31x |
The AlgoTune score for each model is the harmonic mean of its speedups across all AlgoTune tasks. In the table at the bottom of this page, you can find the speedups achieved by each model on each AlgoTune task.
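To make the scoring rule concrete, here is a minimal sketch of a harmonic-mean score. The helper name `algotune_score` and the three per-task speedups are made up for illustration and are not taken from the benchmark.

```python
from statistics import harmonic_mean


def algotune_score(speedups: dict[str, float]) -> float:
    """Harmonic mean of per-task speedups (hypothetical helper name)."""
    return harmonic_mean(list(speedups.values()))


# Made-up speedups for three imaginary tasks, purely for illustration.
example_speedups = {"task_a": 1.0, "task_b": 2.0, "task_c": 4.0}

score = algotune_score(example_speedups)
print(f"{score:.2f}x")  # prints "1.71x"
```

For these numbers the arithmetic mean would be about 2.33x. The harmonic mean is lower because it is dominated by the tasks with the smallest speedups, so a single large speedup cannot make up for many tasks that were barely improved.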
AlgoTune Task Implementation
To measure speedups for the algorithms in AlgoTune, we implement a class with three methods for each algorithm. The first generates problem instances (e.g., for PCA, a matrix and the number of components). The second checks that a proposed solution is valid (e.g., for PCA, that the returned matrix is orthonormal and attains the same objective value as the reference). The third is a reference solver (for the PCA task, we simply use the PCA solver from scikit-learn).
import logging
from typing import Any

import numpy as np
import sklearn.decomposition


# The three methods below belong to the PCA task class.
def generate_problem(self, n: int, random_seed: int = 1) -> dict[str, Any]:
    """
    Generate random data matrix using n to control the hardness
    """
    np.random.seed(random_seed)
    # 50 * n samples
    m = 50 * n
    r = max(2, n * 5)  # factorization rank
    # Step 1: Generate non-negative W and H
    W = np.random.rand(m, r)  # m x r
    H = np.random.rand(r, 10 * n)  # r x (10 n)
    # Step 2: Generate Y = W H + small noise
    Y = W @ H
    noise_level = 0.01
    # additive small noise to simulate imperfection
    Y += noise_level * np.random.rand(m, 10 * n)
    return dict(X=Y.tolist(), n_components=r)


def solve(self, problem: dict[str, Any]) -> list[list[float]]:
    try:
        # use sklearn.decomposition.PCA to solve the task
        model = sklearn.decomposition.PCA(n_components=problem["n_components"])
        X = np.array(problem["X"])
        X = X - np.mean(X, axis=0)
        model.fit(X)
        V = model.components_
        return V.tolist()
    except Exception as e:
        logging.error(f"Error: {e}")
        n_components = problem["n_components"]
        n, d = np.array(problem["X"]).shape
        # One row per component, one column per feature (d, not the sample count n),
        # so the fallback has the shape that is_solution expects.
        V = np.zeros((n_components, d))
        identity = np.eye(n_components)
        V[:, :n_components] = identity
        return V.tolist()  # return trivial answer


def is_solution(self, problem: dict[str, Any], solution: list[list[float]]) -> bool:
    try:
        n_components = problem["n_components"]
        V = np.array(solution)
        X = np.array(problem["X"])
        X = X - np.mean(X, axis=0)
        r, n = V.shape
        # make sure that the number of components is satisfied
        if n_components != r:
            return False
        # check shape
        if n != X.shape[1]:
            return False
        tol = 1e-4
        # check if the matrix V is orthonormal
        VVT = V @ V.T
        if not np.allclose(VVT, np.eye(n_components), rtol=tol, atol=tol / 10):
            return False
        # check objective
        res = self.solve(problem)
        V_solver = np.array(res)
        obj_solver = np.linalg.norm(X @ V_solver.T) ** 2
        obj_sol = np.linalg.norm(X @ V.T) ** 2
        if np.allclose(obj_sol, obj_solver, rtol=tol, atol=tol / 10):
            return True
        return False
    except Exception as e:
        logging.error(f"Error when verifying solution: {e}")
        return False
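The three methods above are enough to compare a candidate solver against the reference. The following is a rough sketch of such a comparison, not AlgoTune's actual timing harness, whose details are not described here. `SortTask`, `best_time`, and `measure_speedup` are hypothetical names, and the toy sorting task stands in for a real AlgoTune task so that the example runs with the standard library alone.

```python
import random
import time
from typing import Callable, Optional


class SortTask:
    """Toy stand-in for an AlgoTune task (hypothetical, not part of the benchmark)."""

    def generate_problem(self, n: int, random_seed: int = 1) -> list[int]:
        rng = random.Random(random_seed)
        return [rng.randrange(10 * n) for _ in range(n)]

    def solve(self, problem: list[int]) -> list[int]:
        # Deliberately slow reference solver: insertion sort.
        out: list[int] = []
        for x in problem:
            i = len(out)
            while i > 0 and out[i - 1] > x:
                i -= 1
            out.insert(i, x)
        return out

    def is_solution(self, problem: list[int], solution: list[int]) -> bool:
        return solution == sorted(problem)


def best_time(fn: Callable, arg, repeats: int = 5) -> float:
    """Smallest wall-clock time over several runs, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def measure_speedup(task, candidate: Callable, problem, repeats: int = 5) -> Optional[float]:
    """Return reference_time / candidate_time, or None if the candidate is invalid."""
    if not task.is_solution(problem, candidate(problem)):
        return None  # a fast but wrong answer earns no speedup
    return best_time(task.solve, problem, repeats) / best_time(candidate, problem, repeats)


task = SortTask()
problem = task.generate_problem(n=2000)
speedup = measure_speedup(task, sorted, problem)          # built-in sort as the candidate
invalid = measure_speedup(task, lambda p: list(p), problem)  # returns the input unsorted
print(f"speedup: {speedup:.1f}x, invalid candidate: {invalid}")
```

The validity check runs before any timing, which mirrors the rule that a solution only counts if it produces the same outputs as the reference. The measured speedup depends on the machine, so no specific value is shown.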