Elo Ratings for classification and regression algorithms in scikit-learn.
This is just a proof of concept based on ideas from the LLM Chatbot Arena and AutoML Arena.
Elo ratings, originally developed for chess rankings, are a method for calculating the relative skill levels of players based on their head-to-head match outcomes. Here's how they work:
Each player starts with a base rating (e.g. 1200).
After each match, ratings are updated based on:
- The match outcome (win/loss/draw)
- The rating difference between players
- A K-factor that determines how much ratings can change per match
The core principle is that winning against a stronger opponent should increase your rating more than winning against a weaker one. Similarly, losing to a stronger opponent should decrease your rating less than losing to a weaker one.
The basic update formula works like this: After each match, the winner gains points while the loser loses points. The number of points exchanged depends on the expected probability of winning (calculated from rating difference) compared to the actual outcome.
If a heavily favored player wins, they gain few points while their opponent loses few points. But if an underdog wins, they gain many points while the favorite loses many points.
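As a concrete sketch of that update rule (a minimal version; the K-factor of 32 and the 400-point logistic scale are conventional defaults, not necessarily what this project uses):

```python
# Minimal Elo update, assuming the standard logistic expected-score formula
# and a conventional K-factor of 32 (illustrative defaults, not project settings).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings; score_a is 1.0 (A wins), 0.0 (A loses), or 0.5 (draw)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```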
We can assign all scikit-learn algorithms an Elo rating, then rank the algorithms by that rating. The procedure is as follows (a code sketch of the loop appears after the list):
- Select a collection of standard datasets (e.g., from scikit-learn's built-in datasets or the UCI repository).
- For each dataset:
  - Evaluate each algorithm on the dataset (e.g., via repeated k-fold cross-validation).
  - Record a performance metric (e.g., accuracy for classification, MSE for regression).
  - For each possible pair of algorithms, treat the one with the better performance as the "winner".
  - Update both algorithms' Elo ratings based on these pairwise "matches".
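A hypothetical sketch of this loop; the datasets, estimators, cross-validation settings, and K-factor below are illustrative placeholders, not the exact setup used by the arena scripts:

```python
# Hypothetical arena loop: every pair of estimators plays one "match" per dataset.
from itertools import combinations

from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def update_elo(r_a, r_b, score_a, k=32.0):
    """Standard Elo update; score_a is 1.0 (A wins), 0.0 (A loses), or 0.5 (draw)."""
    exp_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - exp_a), r_b + k * ((1.0 - score_a) - (1.0 - exp_a))


datasets = [load_iris(return_X_y=True),
            load_wine(return_X_y=True),
            load_breast_cancer(return_X_y=True)]
models = {
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
ratings = {name: 1200.0 for name in models}  # everyone starts at the base rating

for X, y in datasets:
    # Mean cross-validated accuracy is each algorithm's "performance" on this dataset.
    scores = {name: cross_val_score(est, X, y, cv=5).mean() for name, est in models.items()}
    # Pairwise matches: the better-scoring algorithm "wins"; equal scores count as a draw.
    for a, b in combinations(models, 2):
        outcome = 1.0 if scores[a] > scores[b] else (0.0 if scores[a] < scores[b] else 0.5)
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```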
For example, if you have 3 algorithms (RandomForest, SVM, LogisticRegression) and 5 datasets:
- Start each algorithm at 1200 Elo
- On Dataset1, if RandomForest gets 0.85 accuracy and SVM gets 0.80:
  - RandomForest "wins" vs SVM
  - Update both Elo scores accordingly (worked out in the snippet below)
- Continue this for all algorithm pairs on all datasets
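Working that first match out numerically (assuming both algorithms start at 1200 and K = 32, so each has an expected score of 0.5):

```python
# Worked Elo update for the RandomForest-vs-SVM match above
# (illustrative K-factor of 32; both algorithms start at the base rating of 1200).
k = 32.0
rf, svm = 1200.0, 1200.0

expected_rf = 1.0 / (1.0 + 10 ** ((svm - rf) / 400))  # 0.5, since the ratings are equal
expected_svm = 1.0 - expected_rf                      # also 0.5

rf_new = rf + k * (1.0 - expected_rf)     # 1200 + 32 * 0.5 = 1216
svm_new = svm + k * (0.0 - expected_svm)  # 1200 - 32 * 0.5 = 1184

print(rf_new, svm_new)  # 1216.0 1184.0
```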
The final Elo ratings will reflect each algorithm's relative performance across all datasets, accounting for:
- Consistency (performing well across many datasets)
- Margin of victory (winning by large or small performance differences)
- Quality of competition (beating strong algorithms vs weak ones)
Classification results:
Current Rankings:
Rank  Algorithm  Elo Rating
0 SVC 1483.476429
1 KNeighborsClassifier 1436.259629
2 ExtraTreesClassifier 1418.984442
3 RandomForestClassifier 1399.931241
4 MLPClassifier 1394.189639
5 HistGradientBoostingClassifier 1360.896926
6 RadiusNeighborsClassifier 1339.349720
7 LogisticRegressionCV 1333.265391
8 LogisticRegression 1317.042033
9 NuSVC 1297.331826
10 GradientBoostingClassifier 1289.215738
11 LinearSVC 1264.363682
12 PassiveAggressiveClassifier 1239.713321
13 SGDClassifier 1236.419448
14 Perceptron 1182.985894
15 RidgeClassifier 1177.249455
16 RidgeClassifierCV 1149.653122
17 NearestCentroid 1090.004834
18 BaggingClassifier 1076.131953
19 MultinomialNB 1063.118807
20 DecisionTreeClassifier 1044.169226
21 GaussianNB 1035.428305
22 BernoulliNB 1027.888777
23 ComplementNB 1011.798905
24 ExtraTreeClassifier 1007.745416
25 CategoricalNB 986.312061
26 GaussianProcessClassifier 983.557891
27 AdaBoostClassifier 953.515889
See arena_regression.py.
Regression results:
Current Rankings:
Rank  Algorithm  Elo Rating
0 HistGradientBoostingRegressor 1472.224472
1 ExtraTreesRegressor 1467.166315
2 RandomForestRegressor 1451.170396
3 GradientBoostingRegressor 1421.210202
4 BaggingRegressor 1413.318732
5 MLPRegressor 1395.907021
6 NuSVR 1369.294900
7 SVR 1354.186504
8 KNeighborsRegressor 1318.997026
9 DecisionTreeRegressor 1315.919028
10 Lars 1292.044124
11 TransformedTargetRegressor 1282.915341
12 RidgeCV 1246.612762
13 LinearRegression 1221.887239
14 LassoCV 1204.077560
15 Ridge 1182.862180
16 ElasticNetCV 1180.618743
17 ExtraTreeRegressor 1155.701573
18 LarsCV 1148.494488
19 AdaBoostRegressor 1129.530651
20 OrthogonalMatchingPursuit 1114.509803
21 TweedieRegressor 1086.206851
22 ElasticNet 1082.474490
23 DummyRegressor 1070.439746
24 Lasso 1038.544663
25 HuberRegressor 1027.773147
26 LinearSVR 1003.927533
27 RANSACRegressor 979.974378
28 PassiveAggressiveRegressor 951.083042
29 TheilSenRegressor 926.625594
30 SGDRegressor 894.301497
Future work:
- More datasets (e.g., from openml).
- More algorithms (e.g., xgboost, catboost, lightgbm, etc.).
- Capture the number of matches, wins, losses, win rate, etc.
- Confidence intervals for the ratings.

