Formalization of Erdős problems

[This is a guest post by Boris Alexeev. Now over to Boris.]

I’m here to tell you about various exciting developments centering on Erdős problems, especially involving the formalization of old and new mathematics using artificial intelligence.

Background

As is well known, Paul Erdős was a prolific mathematician of the 20th century who posed an extraordinary number of conjectures. Around May 2023, Thomas Bloom set up erdosproblems.com to collect these problems and keep track of progress on them. Over the past 2.5 years, this progress has accelerated as many people realized they could solve problems that were previously unknown to them. In August 2025, Thomas added a forum that became active very quickly, further accelerating developments.

In May 2025, Google DeepMind launched the Formal Conjectures project, an open repository of formalized mathematics conjectures, including (but not at all limited to) Erdős problems. In August 2025, Thomas Bloom and Terence Tao proposed a crowdsourced project to link up erdosproblems.com to the Online Encyclopedia of Integer Sequences (OEIS). These are both part of a greater push to increase the number of mathematical databases, as well as links between them.

At present, there are over 1100 problems on erdosproblems.com, of which approximately 40% have been solved. (Note that only ~100 problems are known to have monetary prizes associated with them.) Approximately 240 problems have statements formalized in Lean, and 17 have solutions formalized in Lean. Approximately 260 problems have been linked to sequences in the OEIS. Alexis Olson has implemented a progress graph displaying these statistics visually over time.

Human formalization without AI

At first, formalization was not a large part of the story with Erdős problems. One of the first developments was several years ago when Thomas Bloom and Bhavik Mehta formalized Bloom’s solution to Problem 47 about unit fractions (and its further applications to several related problems). I would like to highlight a couple of passages from their paper (emphasis mine):

The formalisation began in January 2022 and concluded in July 2022. At the beginning of the formalisation, the first author had no experience with Lean at all, and learnt Lean (or at least a sufficient subset of Lean) through the formalising process.

and

This formalisation is a first in several respects: it is the first recent analytic number theory result to be formally verified; the first instance of the circle method; the first solution to a long-standing problem of Erdős. Part of the motivation for this formalisation was as a proof of concept: the Lean proof assistant and accompanying mathlib is advanced enough to make feasible the fast formalisation of new research results in mathematics, on the same timescale as the production of the ‘human-readable’ paper. Of course, this was made feasible by the relatively elementary and self-contained nature of the mathematics involved. Nonetheless, we believe that this arrangement, with a formal certificate of validation accompanying the human version of the paper, is a sign of things to come.

In this post, I focus mostly on the formalization of solutions that completely resolve a problem, but following this proof, there was a lot of great formalization work on results that don’t technically resolve a problem in full. Luckily, this blog has already featured a guest post of this kind by Bhavik Mehta, involving Problem 77. Similarly, last year there was some extensive work on Problem 216 and, separately, on the Hadwiger-Nelson problem (which is Problem 508).

I am not personally aware of any further developments in this area (my apologies to anyone I left out!) until the launch of the Formal Conjectures project. This project, of course, includes the formalization of a large (and increasing) number of the statements of Erdős problems, but does not generally include the formalization of solutions. Nonetheless, it does include a succinct proof by Bhavik Mehta of a counterexample to Problem 316 found by Tom Stobart (slightly smaller than Csaba Sándor’s original counterexample).

Shortly after the launch of the forum on erdosproblems.com, a collaboration between Stijn Cambie, Vjekoslav Kovač, and Terence Tao resolved Problem 379, with the solution formalized in Lean. A couple of days later, Terence Tao also resolved Problem 987 (unaware that Erdős himself had done so) and formalized the solution in Lean. I believe that both of these formalizations were done primarily “by hand”.

Human formalization with AI

Kevin’s latest blog post, discussing the formal and informal approaches to theorem proving, mentioned my paper with Dustin Mixon (and maybe also ChatGPT and Lean — should we have further included Marshall Hall?) resolving Problem 707. There are several fun aspects to that story, and I encourage readers to look at Sections 7 and 1 of our paper for more details (or perhaps they may enjoy a summary). However, the part of the story relevant to formalization is that after the “usual” mathematics was complete, we were able to vibe code the proof in Lean using ChatGPT (without Pro).

While writing the paper, we felt like we were trying a new style of mathematical research: combining large language models with formal verification to produce significant, certifiably correct results. I’m happy to have heard from several friends that our experience motivated them to look into formalizing results in Lean themselves. One mentioned that they hadn’t looked at Lean in three years, and they found that the landscape had changed completely in that time: improvements to Mathlib and Lean, together with the rise of LLMs, made it significantly easier to prove results in a reasonable amount of time. Our experience also motivated Terence Tao to similarly vibe code a solution to Problem 613 by Oleg Pikhurko.

In retrospect, I feel that our paper was written during a very brief window in time when our specific manner of interacting with an LLM and a formal assistant was the most effective manner for a novice to formalize (existing) mathematics. In Section 7 of the paper, we describe what our preferred interaction style would have been using a simple schematic drawing:

[Schematic drawing of our preferred interaction style.]

This is much closer to reality now. At approximately the same time as we finished our paper, Harmonic released Aristotle to the general public.

At first, Aristotle could only fill in a sorry in a Lean proof; in other words, given a statement that had already been formalized in Lean, it could (attempt to) supply a proof. This immediately transformed my interactions with Lean. Problem 105 had been recently solved by Wu Xichuan in the forum, and I was able to use ChatGPT to generate formal Lean statements describing the proof, which Aristotle was able to fill in. Before Aristotle, I had tried to formalize this proof with ChatGPT alone, but did not succeed. Note that although Wu’s counterexample is not simple to find, verifying that it works is entirely straightforward. (I was also able to use Aristotle to formalize multiple results from non-Erdős mathematics I was working on. For example, see Section 5.3 of this paper, written in the first person from the perspective of my coauthor Dustin.)
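To make the “fill in a sorry” workflow concrete, here is a toy example of the kind of input this early version of Aristotle accepted (my own illustrative statement, nothing to do with Problem 105): a fully formalized Lean statement whose proof is left as a sorry for the tool to complete.

import Mathlib

-- Toy example only: a formalized statement whose proof is left as `sorry`,
-- which a tool like Aristotle then attempts to fill in.
theorem two_dvd_mul_succ (n : ℕ) : 2 ∣ n * (n + 1) := by
  sorry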

As it turns out, that style of interaction was also a brief moment in time. Shortly thereafter, Aristotle released “informal” mode, which accepts mathematics written in informal language (possibly in LaTeX) and formalizes it all. As a result, Wouter van Doorn, Gemini DeepThink, Terence Tao, Aristotle, and I were able to formalize a solution to part of Problem 367.

I was excited to try using these tools to formalize more solutions. I noticed Problem 418 had recent activity on the forum and that it might be suitable for an experiment. I asked ChatGPT to explain the solution to the problem and then Aristotle to auto-formalize the resulting LaTeX file. The actual theorem statement was already available at the Formal Conjectures project. This process worked beautifully, and I noticed that no intelligent interaction was involved on my part.

As a result, I wrote a pipeline to automate the entire process. After I selected a promising problem number by hand, ChatGPT explained a solution, Aristotle auto-formalized it, and then it was glued together with the statement from the Formal Conjectures project. The solution was verified by Lean and posted automatically to GitHub. My program even wrote a comment on erdosproblems.com for me, but I would then make sure everything looked good before pressing “Post comment”. This process worked perfectly the first time I tried it, on Problem 645. Subsequent runs have revealed that my pipeline is quite fragile, and I occasionally have to intervene manually, but I have run it successfully approximately ten times. Note that ChatGPT is not a necessary component of the formalization step, as Aristotle handles informal text quite well by itself, but I still find it useful for various auxiliary tasks, such as finding the relevant proofs and transcribing (and translating!) PDFs.
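For readers who like to see the shape of such a thing, here is a rough Python sketch of the pipeline. Every function is a hypothetical stand-in (the real pipeline talks to ChatGPT, Aristotle, and the Formal Conjectures repository, and its details differ); the point is only to show how little glue is involved, not to document any actual API.

def explain_solution(problem_number: int) -> str:
    """Stand-in for asking an LLM to write up the known solution informally."""
    raise NotImplementedError

def autoformalize(informal_proof: str) -> str:
    """Stand-in for Aristotle's informal mode: English/LaTeX in, Lean out."""
    raise NotImplementedError

def fetch_statement(problem_number: int) -> str:
    """Stand-in for pulling the Lean statement from the Formal Conjectures repository."""
    raise NotImplementedError

def lean_compiles(lean_file: str) -> bool:
    """Stand-in for running Lean on the assembled file."""
    raise NotImplementedError

def run_pipeline(problem_number: int) -> str | None:
    """Glue the steps together; pushing to GitHub and drafting the
    erdosproblems.com comment (which I check by hand) are omitted."""
    informal_proof = explain_solution(problem_number)
    lean_proof = autoformalize(informal_proof)
    lean_file = fetch_statement(problem_number) + "\n\n" + lean_proof
    return lean_file if lean_compiles(lean_file) else None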

Between the pipeline and other uses, Aristotle has now written the majority of the public formalized solutions on erdosproblems.com. (By this, I mean the number of problems, as reported in the community database. If you have solved a problem, whether formally or not, please let someone know either on erdosproblems.com or on the community database!) Furthermore, interaction with Aristotle is currently on the timescale of hours, which, while not yet “real time”, is a far cry from the half a year it took at the beginning of this story. The time also mainly consists of waiting, as no interactivity is required or possible (as of today). Given my previous experiences with outdated workflows, I have no doubt that soon this process will change completely yet again and become even better.

Formalization entirely by AI

Aristotle can also solve problems by itself, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad (IMO) problems.

I was curious what this meant in practice. As a trial run, I found Problem 488, which was available on the Formal Conjectures project and for which Stijn Cambie had already found a counterexample. And indeed, provided only with the formal statement of the problem, Aristotle was able to disprove the conjecture by itself. (Note: the statement of Problem 488 has since been modified.)

Excited, I moved on to open Erdős problems. And so last Friday night, I selected several problems by hand, launched Aristotle, and went to bed. (As mentioned previously, Aristotle runs can take several hours but you don’t have to do anything.) I must say, I was not emotionally prepared to wake up to an email that Aristotle had successfully resolved Problem 124. But after checking its proof and investigating various issues with the precise statement, I saw that indeed, Aristotle had resolved a problem listed by Erdős as open in two collections in 1997 (published shortly after his death).

As can be expected with many “firsts” or other records, there is an “asterisk” regarding this solution. In this case, it seems that Erdős probably intended a different formulation of the problem (and thus, unfortunately, the statement of Problem 124 has since been modified). Much of the discussion online about this accomplishment centered around how simple Aristotle’s solution to the problem was. Indeed, many described it as “Olympiad-level”, which we previously knew was within Aristotle’s capabilities. I would like to offer a different description of the solution: fundamentally, it is from “The Book”. While I agree the solution is simple, I find it remarkable that one of the first open problems resolved in this manner had such a beautiful proof, and I think it’s perfect for an Erdős problem.

Shortly thereafter, Kevin Barreto resolved Problem 481 (not knowing that it had already been resolved by David Klarner in 1982, as that reference was only located later), also with a nice proof. He then made two requests to Aristotle: one to formalize his proof, and one to solve the problem by itself. Both succeeded. Accordingly, this is another problem that Aristotle was able to solve by itself (in Lean), but had been solved by a human already. Afterwards, multiple other teams claimed to have solved these two problems autonomously.

AlphaProof, which produces Lean output, has also autonomously contributed to the community’s knowledge about Erdős problems. For example, it has found an interesting example for Problem 730 and a nice counterexample for Problem 198. The Lean proofs it produces are often reworked manually to be more readable; also, its counterexamples have often caused the problem statements to be changed. Sadly, as a result, no current Lean formalization on erdosproblems.com originates from AlphaProof. (Perhaps AlphaProof or other programs have independently solved more problems, but they have not yet published their results publicly.)

Efforts are underway to autonomously solve even more Erdős problems. Accordingly, I look forward to seeing more formalized proofs of both open and previously-solved problems from Aristotle, AlphaProof, and the other agents becoming available.

Side notes

I am aware of the recent paper about DeepSeekMath-V2, but I have not mentioned it because it has not (yet) directly contributed to formalization on erdosproblems.com. AlphaProof is mentioned briefly above, but I further note that there is a recent AlphaProof paper that may be of interest.

ChatGPT was used to find many problems on erdosproblems.com that had been labeled as “open” but were actually solved. ChatGPT has also been used to help solve (the “sufficiently large” statement of) Problem 848. I did not dive into those topics either because they are not related to formalization, though I will opine that ChatGPT Pro is very useful both for literature review and actual exploratory mathematics.

Misformalization

During my formalization adventures, I have encountered many instances of misformalization. I’m not attempting a full classification here, but they have come in three styles: “low-level” issues like bugs or incorrect definitions, “missing hypotheses” specifically, and “high-level” issues like the human mathematician missing something. Let me give an example of each of these:

  1. Low-level: The formalization of Problem 480 said m ≠ 0 but meant to say n ≠ 0. This was discovered when Aristotle found a counterexample taking advantage of this issue, which also happened to interact with the issue of “junk values” in Mathlib (see the Lean sketch after this list). (Problem 480 was solved by Fan Chung and Ron Graham with a very nice solution. I have not yet succeeded in getting Aristotle to formalize the proof.) Similarly, I have seen flipped inequalities, a trivial zero case forgotten, and flipped quantifiers.
  2. Missing hypotheses: in 1994, Rudolf Ahlswede and Levon Khachatrian found a counterexample to Problem 56 with a parameter equal to 212. When I attempted to formalize their proof, Aristotle found a completely trivial counterexample with tiny parameters like 2. It turns out that the conjecture was missing a hypothesis that had been left somewhat implicit. (I succeeded in formalizing the original proof with the added hypothesis, though not without making a low-level off-by-one error myself in the process.)
  3. High-level: Problem 124 might fit this description. In 1996, Stefan Burr, Paul Erdős, Ronald Graham, and Winnie Li formulated a particular conjecture, which appears to still be open. Shortly thereafter (just before his death, at age 83), Erdős re-formulated the problem in a slightly different language and altered the hypotheses. Unfortunately, in doing so, he may have missed that this problem has a simple solution. (As discussed above, Aristotle solved this later version autonomously.)
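To illustrate the “junk values” issue mentioned in the first item (a toy example of my own, not the actual statement of Problem 480): in Lean and Mathlib, natural-number division by zero is defined to be zero, so a formal statement that forgets to rule out a zero denominator can acquire unintended, trivially refutable instances.

import Mathlib

-- Toy example of a Mathlib "junk value": natural-number division by zero
-- is defined to return 0 rather than being an error.
example : (5 : ℕ) / 0 = 0 := Nat.div_zero 5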

Low-level errors are common in programming, of course, but I think at present mathematicians are probably more likely to make them in Lean than when writing similar code in, say, Python. I think this comes from relative unfamiliarity with Lean, as well as from the fact that the code is not “executed” in the same way. Over time, I expect these errors to decrease in frequency until they’re at a similar baseline level to other programming.

Tools like Aristotle can help find all of these kinds of errors. By itself, finding typos isn’t particularly impressive, but it is still useful. Finding missing hypotheses is nice, in part because it can help increase the correctness of existing mathematics; some of my first experiences with Lean involved realizing how many more hypotheses were necessary for a statement than I had realized. But perhaps it’s the third category that is most helpful, because as a result, we can focus our attention on the relevant mathematical details.

Takeaways

The world of mathematics formalization is moving fast! Tools for formalizing mathematics are available today and developing quickly. AI is making it much easier to formalize existing mathematics, and it is beginning to create new formalized mathematics by itself.

Some people might be tempted to dismiss the significance of some of these developments because they involve Erdős problems, which are often very accessible and not dependent on deep mathematics. While there is some truth to these thoughts, I feel there are other reasons why we are seeing results here. First, there is a social aspect in which a community of people value Erdős problems and encourage enthusiastic discussion about and collaboration on these problems. Second, this community has curated a large collection of problems available both as informal statements and as formal statements that work in Mathlib today. As more definitions and results are formalized, I expect similar developments across many fields of mathematics. In particular, open problems will be solved autonomously and viewed as “simple” in retrospect. Also, more significant conjectures will be resolved, though I refuse to hazard a guess regarding the timescale.

If you would like for your area of mathematics to see results from formalization, you can learn from the experience of the Erdős problems community. For example, if you have a nice conjecture you would like to see solved, you may submit it to the Formal Conjectures project and someone might try solving it (whether autonomously or the old-fashioned way). Beyond conjectures, Mathlib can always use contributions across all fields of mathematics. These definitions and theorems can then be used by both human and AI mathematicians. More broadly, organizing available results, open problems, and other data in curated databases (with some kind of forum!) has proven to be a very productive activity.

Finally, I did not want to go into too many of the details because it was a bit of a digression relative to the main discussion of AI, but in my opinion, misformalization is a big issue and was more frequent than I had expected. In particular, we need better tools to prevent and detect misformalization.

Acknowledgments

These exciting developments have been made possible by many people. Thank you to the lively community on erdosproblems.com, especially Thomas Bloom for creating and curating the site, and similarly for Lean and Mathlib. Thank you to Terence Tao, who has been supportive of many efforts involving AI, formalization, and increasing collaboration in mathematics. Thank you to OpenAI for access to ChatGPT Pro and Harmonic for access to Aristotle (which anyone can sign up for). Thank you to the Formal Conjectures authors. And thank you to Kevin Buzzard for suggesting a guest post on this wonderful blog!


Formal or not formal? That is the question in AI for theorem proving.

So it’s an interesting time for computers-doing-mathematics. A couple of interesting things happened in the last few days, which have inspired me to write about the question more broadly.

First there is the question of whether computers will ever prove interesting research-level mathematical theorems autonomously at all, which is really hard to answer right now, although clearly a lot of people are betting on “yes”. I personally am unsure: I’ve seen what can currently be done, and can’t quite extrapolate that far once you cut through the hype. I am however far more optimistic about AI tools becoming useful helpers to mathematics researchers, and this is one of the things which motivates my current work (trying to teach Lean a lot of research-level definitions, theorem statements and proofs).

But whether they beat us or just help us, one can still ask what an autonomous theorem-proving tool will look like. Let’s initially take a gigantic step down from research-level mathematics and talk instead about problems which can be solved by smart schoolchildren, namely those on the International Mathematical Olympiad (IMO). I wrote recently about the attempts by AI on the 2025 IMO but all you need to know here is that there were four attempts which were made public: two of them were using language models to write proofs in English (which were then graded by humans); let’s call this the “informal approach”. The other two were using a language model/Lean hybrid setup; let’s call this the “formal approach”. Lean is a computer proof checker or interactive theorem prover; it checks mathematical arguments at a superhuman level. In this hybrid setup, the language model writes proofs in Lean’s language which are then checked automatically by Lean; once Lean says the proofs are correct, humans only need to check that the statements have been translated correctly into Lean, although that is an important part of the process.

Worries about the informal approach.

What worries me most about the informal approach are the horror stories I am hearing from journal editors. Mathematics journals are being spammed with LLM-generated papers, which are becoming more and more plausible-looking, and sometimes the use of LLMs is not even revealed in the paper. I want to flag some differences between mathematics papers generated by LLMs and those generated by humans:

1. LLMs are sycophantic. You want to hear that the LLM-generated argument correctly proves the theorem? Sure you do, and that’s why the LLM will assure you that this is the case, even if the argument contains a hallucination (and remember, just one hallucination is enough to break a mathematical argument). In my experience, human mathematics researchers doing mathematics in the traditional way are vanishingly unlikely to send a paper to a journal claiming that they have proved something which they actually think they have not proved. Being stuck is a standard place for a mathematics researcher to be in. An LLM is far more likely to write plausible-looking rubbish to cover its tracks; in some sense, then, LLMs are trained to mislead. One thus has to read an LLM-generated paper with far more suspicion than a human-generated paper.

2. LLMs won’t say “I don’t know”. We expect them to perform very well on in-distribution data (i.e., what they have been trained on, which certainly includes all the techniques needed to solve an IMO problem). But what happens on out-of-distribution data (e.g. theorems that no humans know how to prove)? Well, the correct answer to many intermediate questions which show up when doing research is “I don’t know — indeed, nobody knows”. Whereas the LLM’s answer will be “I will confidently have a go”. Remember: one hallucination breaks an argument.

In short: I do not yet trust mathematics written by an LLM. Several mathematicians I have talked to have backed me up on this; they have tried typing hard research problems into language models and got nonsense output which they could instantly see through. Will LLMs become 100% reliable when out-of-distribution (i.e., when trying to do original research)? Nobody knows yet, and anyone who thinks they know is just guessing, or is stoking the hype. The disgraceful story from earlier this week about OpenAI and the ten Erdős problems is a case in point. People seem to be quick to believe that it’s happening; nothing close to it is happening right now.

One thing which is becoming more popular with the informal approach is testing systems on questions where the answer is not a long proof (indeed most breakthroughs in mathematics consist of 30-page proofs or even much longer ones) and instead testing them on hard problems for which the answer is a number. An example of this is Epoch AI’s Frontier Math Tier 4 dataset. I was amused this week to have been sent data on what happens if you ask lots of agents to try and solve these problems and you mark the question as being solved if at least one agent gets the answer correct at least once. I would like to submit my own AI-based approach to the Tier 4 dataset, if that’s the mark scheme now: it’s written in Python.

i = 0
while True:
    print(i)
    i = i + 1

My point of course is that a system which gives out lots of answers, one of which is correct but you don’t know which one, is rather less useful than it sounds.

Worries about the formal approach.

The amazing Erdős 707 story from earlier today (tl;dr: two researchers found a solution to an Erdős problem and vibe-coded the entire proof in Lean using ChatGPT 5 despite not having a clue about how to write Lean code; here is the paper by Alexeev and Mixon, with both Lean and ChatGPT listed as co-authors) might make people start to think that Lean is now becoming viable for research-level mathematics. And certainly this story shows that it can be used for some research-level mathematics. But all of it? Not so fast.

My worries about the formal approach are that currently even the best formal mathematics libraries do not contain the definitions needed to understand most of what is happening in modern research mathematics. Because of my Renaissance Philanthropy AI for Math grant, I have been ploughing through recent issues of the Annals and attempting to see which definitions need to be formalized before we can even formally state the main results in these papers. Examples of what I saw: Tate-Shafarevich groups, Calabi-Yau varieties, algebraic stacks, automorphic representations, compactly-supported etale cohomology, Jacobians, Maass forms. These are everyday objects used by graduate students in mathematics departments across the world; no theorem prover knows what these things are. If Lean and the other provers cannot even understand the statements of most modern theorems at research level, how can they possibly help mathematicians to prove them?

Whose job is it to fix this?

As is probably clear, I am currently skeptical about whether a next-token-prediction system is ever going to get to the point where it can be 100% accurate when generating 50 or more pages full of out-of-training data (which is probably the kind of thing which is needed for a typical new breakthrough in research mathematics). An LLM/Lean hybrid is something which seems to me to be far more plausible (and we’ve just seen it work to great effect with the Erdős 707 story above, even though here it was the humans who had had the ideas, not machines). Teaching Lean many modern definitions seems to me to be the bottleneck; it is by no means an impossible problem to solve, but it will take time, and solving it seems to me to be a fundamental problem. But who is in a position to solve this problem? Let’s look at some of the likely candidates.

1. The mathematics researcher. These are the people who have taught Lean pretty much all of the nontrivial mathematical definitions in its mathematics library so far. For example I taught mathlib what elliptic curves were, and Chris Birkbeck taught it modular forms, so we can state the Shimura-Taniyama-Weil conjecture relating elliptic curves to modular forms; this was the conjecture partially resolved by Wiles and Taylor–Wiles, which implied Fermat’s Last Theorem. The big problem with scaling this up is that the mathematics researcher lives in a “publish or perish” environment, and work of this nature does not fit into the traditional publishing model; the current set-up rewards new proofs, not formalisations of old material, even if it has never been formalised before.

2. The LLM. This is an approach being pushed by Christian Szegedy and his colleagues at math.inc. My worry about this approach is that right now LLMs write really unidiomatic and poor code, and poor definitions might ultimately result in technical debt when things start scaling. Another worry about LLM-generated definitions is that, unlike proofs, Lean cannot check that the Lean definition aligns with the human concept that it is supposed to be capturing; if an LLM forgets or mangles an axiom which is part of a definition then the code is likely to compile, it just won’t mean what it’s supposed to mean (see the toy Lean illustration after this list).

3. The Mathlib Initiative. This is a new institution whose aim is to support Lean’s mathematics library. They have a lot on their plate already as you can see from their roadmap, but the problem I raise above of teaching Lean how to understand modern definitions is not explicitly mentioned. Perhaps a simple solution would be to expand in this direction, but of course this will mean a financial commitment from somewhere.
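To illustrate the worry about definitions (a toy example of my own, nothing from mathlib or math.inc): the following Lean code compiles without complaint, but the definition forgets the condition 2 ≤ p, so the non-prime 1 satisfies it, and Lean has no way of knowing that this is not what was meant.

import Mathlib

-- Toy example: a definition that compiles but does not capture the intended
-- concept. `BadPrime` omits the hypothesis `2 ≤ p`, so 1 counts as "prime".
def BadPrime (p : ℕ) : Prop :=
  ∀ m : ℕ, m ∣ p → m = 1 ∨ m = p

-- Lean happily verifies that 1 is a `BadPrime`.
example : BadPrime 1 := fun m hm => Or.inl (Nat.dvd_one.mp hm)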

Until we find a better solution, it will be my group at Imperial College London who will work on this problem, manually. We gratefully acknowledge support by Renaissance Philanthropy and the EPSRC to work on what I believe is a fundamental question in AI for mathematics.


AI at IMO 2025: a round-up

Setting the scene

The 2025 International Mathematical Olympiad has come and gone. Reminder: this is an exam for high-school kids across the world (each country typically sends six kids), comprising two 4.5-hour exams each containing three questions, so six questions in total, which I’ll call P1 to P6. Solutions to each question are scored out of 7, for some reason lost in the mists of time. As you can see pretty clearly from the individual results and especially the sea of scores very close to the sequence “7,7,7,7,7,0”, this year had 5 reasonable questions and one stinker, namely P6. Around half of the participants get medals, determined by scores chosen so that approximately 1/12 of participants get a gold, 2/12 get a silver and 3/12 get a bronze. I still have my gold medal from 1987 in a drawer upstairs; I beat Terry Tao 😛 (although I was 18 and he had just turned 12 at the time; he was noticeably younger than all other contestants). The cut-off for gold this year was 35, which is also the answer to 7+7+7+7+7+0. You can take a look at the questions for 2025 (and indeed for any year) at the official IMO website.

Back in 2024 Google DeepMind “took the IMO” in the sense that they waited until the questions were released publicly and then threw two tools at them: AlphaGeometry, which is designed to solve geometry problems in a human-readable format, and AlphaProof, which is designed to write Lean code to solve any mathematics problem written in Lean format. Humans translated the questions into a form appropriate to each tool, and then between them the tools solved 4 out of the 6 problems over a period of several days (rather than the 9 hours allowed for the exam), which led to DeepMind reporting that their system had “achieved Silver Medal standard” and which then led to the rest of the world saying that DeepMind had “got a Silver Medal” (although right now it is only possible for humans to get medals). Indeed DeepMind were one point off a gold in 2024.

This leads inevitably to two questions about IMO 2025. The first: will a system “get gold in 2025”? And the second: Were the IMO committee going to embrace this interest from the tech companies, define precisely what it would mean for an AI to “get gold”, demand an entrance fee or sponsorship from the tech companies and in return use official IMO markers to give official grades, also reporting on scores from tech companies who didn’t do very well? Spoiler alert: the answers are “yes” and “no” respectively.

Deciding the rules

Because the IMO committee chose not to write down any formal rules for how AI could enter the competition, this enabled each of the tech companies to make up their own rules and in particular to define what it means to “get gold”. In short, it enabled the tech companies to both set and mark their own homework. But before I go any further, it’s worth discussing what kind of entries are going to come in to the Wild West that is “AI at IMO 2025”. There are going to be two kinds of submissions — formal and informal. Let’s just break down what these mean.

“Informal” is just using a language model like ChatGPT — you give it the questions in English, you ask it for the answers in English, you then decide how many points to give it out of 7. Let’s scrutinise this last part a little more. IMO marking is done in a very precise fashion and the full marking scheme is not publicly revealed; there are unpublished rules established by the coordinators (the markers) such as “lose a mark for not explicitly making a remark which deals with degenerate case X” and, because of this secrecy, it is not really possible for someone who is just “good at IMO problems” but who hasn’t seen the mark scheme to be able to accurately mark LLM output. Given that (spoiler alert) the systems are all going to get 0 points on problem 6, the claim that AI company X got a gold medal with an informal solution rests on things like “we paid some people to mark our solutions to questions 1 to 5 and in return they gave us 7/7 for each solution despite not having seen the official mark scheme”. Judging by this Reddit post there seems to have been some unofficial but failed(?) attempt by the tech companies to get the coordinators to officially give out scores to their informal solutions.

“Formal” is using a computer proof assistant, which is a programming language where code corresponds to mathematics. For such entries, someone has to translate the statement of each question into the language of the proof assistant, and then a language model trained to write code in this proof assistant will attempt to write a solution in code form. I had naively hoped that IMO 2025 would come with “official” translations of the questions into languages such as Lean (just as they supply official translations into many, many human languages), but no such luck. So the formal solutions will involve something (probably a human) translating the question into the relevant computer language (possibly in many different ways; translation, just like translation between human languages, is not uniquely determined by the input) and then giving the formal questions to a system which has been trained to write code solving them in the relevant language. Another catch here is that, as the name suggests, a proof assistant is something which can check proofs, and unfortunately the only question in the 2025 IMO of the form “Prove that…” was P2; all other questions were of the form “Determine this value”. This throws a spanner into the works of a proof-assistant-based attempt on the IMO: if one is asked to “determine the smallest real number c such that [something]” (which was what P3 asked us to do) then what is to stop a machine saying “the number to be determined is c, where c is the smallest real number such that [the same thing], and the proof that I am correct is that I am correct by definition”? It is actually rather difficult to formally even say what we mean by a “determine” question. Formal systems attempting the IMO thus typically use a second (usually informal) system to suggest answers and then get the formal tool to try and prove the suggested answer correct. Modulo checking that the corresponding “proof” question does correspond to the question being asked (and this does need checking), formal proofs are of a binary nature: if it compiles then it gets 7/7 because it is a computer-checked solution to the question which uses only the axioms of mathematics and their consequences. It is not really meaningful to ask for partial credit here, so anything other than a full solution gets 0/7 (although we’ll see an example of someone claiming to get 2/7 with a formal solution later).
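To give a flavour of what this looks like in Lean (a toy statement of my own, nothing to do with the actual 2025 problems): to formalize “determine the smallest real number with a given property”, the claimed answer has to be written into the statement explicitly, and the theorem then asserts that this value really is the least one.

import Mathlib

-- Toy example: "determine the smallest real number x with 0 ≤ x".
-- The answer (0) appears explicitly in the statement; the proof checks both
-- that 0 has the property and that 0 is a lower bound for everything that does.
example : IsLeast {x : ℝ | 0 ≤ x} 0 :=
  ⟨le_refl 0, fun _ hx => hx⟩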

The results are in!

First out of the starting blocks were the wonderful people at MathArena. These are people who don’t work for a tech company and are trying to do science. They reported on the performance of the best informal models available to mere mortals such as us (sometimes at a cost of hundreds of dollars a month), and the results were disappointing; the models that we regular people can get our hands on did not even get a bronze medal (this claim relies on the marking being accurate, which as I’ve already explained we cannot guarantee, but presumably they are in the right ball-park).

But of course we have not yet taken into account the fact that tech companies might have secret models up their sleeve, which have not been released to the public yet. And surprise, surprise, this is what happened. So from this point on in the blog post, all claims made by tech companies are impossible to verify independently, a situation which is very far from my experience as a mathematician; one wonders if one is even allowed to call it science. It seems that Gregor Dolinar, the President of the IMO, is also cautious about the remainder of this post; his comments on AI entries can be seen here on the 2025 IMO website and I’ll reproduce them to save you the click:

It is very exciting to see progress in the mathematical capabilities of AI models, but we would like to be clear that the IMO cannot validate the methods, including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced.

In particular, because there is a huge incentive to be the first to get gold, and no way to check what is going on, tech companies could in theory just cheat. Let’s assume they didn’t cheat though, and press on.

As we can see from the IMO 2025 website, the exams this year were on the 15th and 16th of July. Marking occurred on 17th and 18th, the students’ results were announced in the closing ceremony on the 19th, and students left Australia when the IMO officially ended on the 20th.

The first company to claim gold was OpenAI. Minutes after the closing ceremony (so a day before the IMO had even officially ended) OpenAI released a Twitter thread claiming a score of 7+7+7+7+7+0 with an informal model which is not available to the public. They also released their model’s natural language solutions to problems 1–5, and they are on the whole pretty good. The solutions were marked by former IMO contestants. Whether the solutions submitted were hand-picked by humans from a collection of potential solutions, who knows — there were no rules. I have no reason to believe that OpenAI cheated — but my feeling is that we have diverged a long way from the traditional scientific process at this point (peer review etc). What is clear is that OpenAI were desperate to claim the prize and couldn’t wait.

I had heard through the grapevine that there was an informal embargo on tech companies announcing anything until a week after the IMO had ended (i.e. July 28th) but of course once one company had claimed they’d got the ultimate prize, informal embargoes don’t count for much. On 21st July DeepMind published a blog post also claiming gold in 7+7+7+7+7+0 form, together with a quote from Dolinar saying “We can confirm that Google DeepMind has reached the much-desired milestone” (which rather seems to contradict his earlier statement quoted above, in which he explicitly states that IMO cannot validate the methods being used by the tech companies). DeepMind’s blog post will no doubt be regarded as the source of truth for this claim, neatly avoiding the pesky refereeing process. Indeed it could be noted that over a year after the 2024 announcement by DeepMind of their silver medal, we still only have the (unrefereed) blog post to cite for this claim, with no published refereed paper detailing the process, and this (paywalled) article from the Financial Times seems to indicate that perhaps the paper was blocked at senior management level.

Not to be outdone, Harmonic posted on Twitter that same day that they would be sticking to the embargo date of the 28th, together with a screenshot of what was plainly Lean code containing part of a solution to P3. By July 22nd both the New York Times and Reuters were uncritically reporting that both Google and OpenAI had achieved Gold, together with non-sequitur quotes from someone who works at Google saying “The achievement suggests AI is less than a year away from being used by mathematicians to crack unsolved research problems at the frontier of the field”. Let me come back to this later.

By the 23rd the Chinese had weighed in, with ByteDance announcing (in Chinese) that their Seed Prover model got a silver (2+7+7+7+7+0) after 3 days of working on the problems (note that there are no time limits because there are no rules, each tech company gets to make their own rules up, so quite how they determined that they got 2/7 for a formal solution is anyone’s guess), upgrading to a gold (7+7+7+7+7+0) if the solver was given “additional attempts” (whatever that means). This was the first announcement of a formal (Lean) approach in 2025. The announcement also contains the claim “The final score for Seed Prover at IMO 2025 disclosed here was confirmed with the IMO committee prior to this announcement”, which again seems to go against Dolinar’s quote above, which is a bit confusing. Seed Prover didn’t just do the IMO — in fact Seed Prover also achieved state-of-the-art results on a Lean database of old Putnam problems (although who knows if solutions were in the training data; this is a big problem with testing LLMs on old exams); if IMO gold is now regarded as old news, the 2026 Putnam will no doubt be next in the firing line.

Finally we get on to poor old Harmonic, who dutifully waited until the embargo date of the 28th before doing a livestream on Twitter announcing that they also got a gold, again with formal methods; mostly Lean, but (like ByteDance in 2025 and DeepMind in 2024) using a bespoke system to solve the geometry problem. There was no information given (either by Harmonic or by ByteDance) about how the numbers to be “determined” in P1, P3, P4 and P5 were given to the system. The joys of unrefereed announcements.

So there you have it, the take-home message is that “AI got a gold in the IMO” and you can forget about all the annoying details.

Summary

As is probably clear, I am a bit frustrated by all these shenanigans. Don’t get me wrong — it is abundantly clear that both informal and formal math provers are getting better and better. We have seen absolutely incredible and extraordinary developments this decade in computer systems doing mathematics autonomously. However at the end of the day, AI was one point off the gold medal boundary in 2024, and hit the gold medal boundary exactly in 2025, so in a sense what the last 12 months have given us is a one-point improvement on a test which is solvable using just high school methods. To extrapolate from this to “less than a year away from being used by mathematicians to crack unsolved research problems” is in my opinion completely unjustified. I cannot stress enough that these IMO problems are a million miles away from the kind of questions which most research mathematicians are working on. All the ignorant comments on social media of the form “it’ll be the Riemann hypothesis next” are in my mind simply way off.

I also feel that the IMO committee missed a trick. This was the year when the tech companies could have been forced to play by rules which the committee laid down. They did not take such action; we ended up with a rather chaotic endgame, and now everyone knows that “AI has got a gold in the IMO” and the question is probably no longer of interest. Note that things appear to be completely bimodal here — other than the enigmatic 2/7 claimed by ByteDance on P1, all scores were either 7/7 or 0/7. Four problems out of the six were solved in 2024, five in 2025, so perhaps it’s perfect score time in 2026. Probably a lot will depend on what the financial backers of the tech companies want to see, though; maybe the better headline would be “AI gets degree from Harvard/Cambridge/Stanford/whatever”, something which is clearly within reach given that most university exams are far more formulaic than the IMO because they are designed in such a way that an average student can pass them. And although this will no doubt generate a lot of excitement on social media, again it is a million miles away from what researchers are doing at the frontier of the field; indeed most pure mathematics taught to undergraduates at even the best universities was known by the 1940s. Mathematics is an extraordinarily conservative field in this regard; we choose not to teach our undergraduates what is happening on the boundary of research and instead focus on a rigorous development of everything from the axioms; because of this approach there is simply no time to get any further. I find it striking that at Imperial College (my university) basically the only pure mathematics course being offered to undergraduates which I was not offered myself in Cambridge 35 years ago is the Lean course.

I will finish by stating what I have stated so many times before. Yes it is absolutely clear that AI is getting better at mathematics. Yes it is absolutely clear that progress is currently very rapid. It is important however to remember that past performance is no guarantee of future results, and the systems are right now still worse at IMO problems than I was when I was 18, which is an extremely long way away from where I was at 30 with a PhD and several novel research papers under my belt. AI approaches to harder problems seem to become more and more vibes-based as the difficulty increases, with the systems being consistently unable to rigorously justify their claims, instead resorting to insisting that certain facts are true without having any understanding as to why. Will the systems start to solve interesting open problems which many mathematicians have spent lots of time on and failed to crack? Maybe — but this is still far from guaranteed.


Think of a number: an update

A month or two ago I wrote this post which expressed my frustration with various issues around private datasets as a way of measuring the mathematical abilities of language models. More generally I was frustrated about the difficulty of judging closed-source software owned by a tech company when it’s extremely difficult to do science (i.e. perform reproducible experiments) on it. The post was written in anger, and I’ve calmed down a bit now. In the post I basically said, rather foolishly, “If you want a job done well you should do it yourself, so let me try”.

In this post I will firstly go over my frustrations again (i.e. whinge a bit), and then I will report on my own (basically failed) attempt to do what Epoch AI did on the cheap. I’ll then reflect on lessons learnt and I would be interested in people’s suggestions about what to do next. My first post did open up a conversation, with several people making comments to me which I found very helpful and clarifying, and hopefully I’ll pass on some of these useful thoughts below.

Language models and mathematics

My frustrations with language models as a tool for mathematics remain. Since I wrote the first post I have been experimenting with them on questions around the mathematics of Fermat’s Last Theorem. A phenomenon I saw several times was that when it came to specific details, models were extremely good at inserting unjustified or invalid assumptions. Here is an extremely dumb example to start us off.

Earlier this year I asked several language models what the dimension of the space of modular forms of level Gamma_0(5^100) and weight 3^100 was. Obviously to many readers this question will be gobbledegook so let me explain the point. There is a moderately complicated formula for the dimension of the space of modular forms of level Gamma_0(N) and weight k, which all language models will know by now. That formula goes: “if k is odd, then the answer is 0, else the answer is [some messy equation depending on N and k]”. Every model I asked this question to just blindly applied the equation; none of them checked to see if 3^100 was odd or not (and of course it is, because it’s the product of 100 odd numbers). In short, the answer to this question is “it’s obviously zero”. No model got it right, and I tried four (the formula gives an answer which is a gigantic number). Anyone who finds an LLM which gets this right, please let me know in the comments: it surprised me that such basic mistakes were still being made but perhaps I didn’t try enough systems (and I’m aware that they’re getting better all the time).
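To spell out the point in code (a sketch only: the messy even-weight formula is deliberately left as an unimplemented placeholder, and the only real content here is the parity check which every model skipped):

def dimension_formula_even_weight(N: int, k: int) -> int:
    # Hypothetical placeholder for the messy closed-form formula, valid for even k;
    # it is not reproduced here.
    raise NotImplementedError

def dim_modular_forms_gamma0(N: int, k: int) -> int:
    """Dimension of the space of modular forms of level Gamma_0(N) and weight k
    (sketch only)."""
    if k % 2 == 1:
        return 0  # odd weight: the space is zero, no formula needed
    return dimension_formula_even_weight(N, k)

# 3**100 is odd (a product of 100 odd numbers), so the answer is immediately 0:
print(dim_modular_forms_gamma0(5**100, 3**100))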

I found several other examples of this phenomenon, where answers to specific questions used techniques which relied on assumptions which were invalid. It seems entirely plausible to me that language models are going to struggle if they are trained on textbooks or papers which start with sentences like “in this book/paper, all rings and schemes are assumed to be Noetherian”. The model takes ideas from the source and what is to stop it applying them in situations where this assumption is invalid? My understanding of these systems is that they do not even really “understand” that the assumption is there. The modular forms example above showed me that they can even lose track of an assumption which was made perhaps only half a page earlier.

We have seen the great successes of language models when applied to self-contained high-school or undergraduate-level problems but right now I really don’t see how we can move beyond this with the ideas we have; applying invalid assumptions is really just a form of confabulation (hallucination) which is occurring because these machines still have no understanding: they are regurgitating techniques which they have seen based on pattern-matching as opposed to mathematical rigour. I think these examples of systems implicitly making unjustified assumptions in order to proceed are exposing a problem which will be very difficult to resolve. Although I am not an expert, and my opinion is arguably biased, I think that integration of LLMs with computer theorem provers is required to take this stuff to the next level (just as integration of LLMs with traditional programming languages such as Python solved the problem of early LLMs being completely unable to accurately multiply two ten-digit numbers).

Language models are answering the wrong questions

A major problem that I think we have right now in the area of language models and mathematics is that scientists need to quantify success so that it can be measured, and the currently favoured way of doing this is hugely flawed. Both the AIMO prize competitions and the FrontierMath dataset involve private datasets of questions which are AI-hard, but the answer is a whole number. The dataset I proposed to make also had this property. But any mathematician would remark that this is completely unrepresentative of what mathematics actually is. Mathematicians try to find patterns, and then they try to prove that their observations are correct. They make conjectures, and they prove theorems or find counterexamples. We have seen examples of neural networks making conjectures and finding counterexamples (for example here, here, here, here, and there are many other examples) but language models are not involved at all in these works; these papers are all using neural networks to analyse mathematical data and do pattern-spotting. The new Epoch AI Frontier Math Tier 4 proposal is pushing this idea as far as it will go, asking for questions which test modern research methods but whose answers need to be numbers; this is effectively ruling out many areas of mathematics which cannot be shoehorned into such a framework.

I personally would like to see better ways of testing language models on research level mathematics, but right now this is difficult because if the question is “prove this theorem” (which would be a huge step forward from “work out this number”) but the answer is a page of mathematical text then it’s currently very difficult for a machine to judge this output. Such output thus needs to be read and judged by a human, and this is slow and expensive: it doesn’t scale. This is another reason why I think that formal mathematics needs to be involved in order to progress: theorem provers are able to accurately assess the correctness of mathematical proofs. The downside is that the proof needs to be written in the language of the theorem prover, making the task of producing it far more difficult for an AI.

The current status of the “Think of a number” experiment.

Having whinged about how “what is this number” is the wrong question, I will now report on my attempt to make a database of hard number theory questions of the form “what is this number”. Around six weeks ago I wrote the blog post linked to above, and I circulated it to many colleagues of mine in the number theory community. There was an initial flurry of interest and I got some super problems: within only a few days I had 15 very promising-looking questions. The vast majority were in number theory, essentially none of them could be solved using only the techniques in a typical undergraduate degree. Some of them were accessible using what one could call “1st year graduate student level” techniques; others needed genuinely profound results. The answers were numbers, or could easily be translated into numbers (e.g. they were lists of numbers).

Perhaps due to deficiencies in my own explanation of what I was looking for, the problems I was sent fell into two rather distinct groups. Firstly, there were problems which could be solved by human reasoning alone, i.e. using pen and paper, as long as you knew the relevant techniques and theorems in the literature. And secondly there were problems which boiled down to computations which were completely impossible to do by hand but which could be done by writing bespoke code in standard computer algebra packages (such as pari-gp or magma or SageMath) and then leaving your computer on (for a few seconds or perhaps a few hours).

Interestingly, some of the questions I was sent had non-unique solutions. This was not a development I had been expecting. They were of the form “find a number with the following property” (or variants of this, for example “find an elliptic curve over a finite field with the following property”; such an object can be encoded by a number). For these questions, a new problem arises in checking the answer. In all cases, checking the answer was much easier than finding an answer, but sometimes checking the answer was something which took a few seconds or minutes of running bespoke code on a computer algebra system, which provides an extra headache if one wants to do such a thing automatically. I had not anticipated such questions coming in.

Some of the problems came with solutions, but some did not. For some of the ones which did not, I could envisage spending many hours solving them manually. Ouch. I had not expected this twist either: my perhaps naive plan was clearly to get other people to do the work for me for free and here this had come back to bite me. In fact one comment made to me after the original post was by a computer scientist who said “how have these tech companies tricked you and your colleagues into doing work which benefits them, for free?”. Ouch.

But anyway, so far so good: we were only a few days in, I had 15 questions, an unexpectedly large amount of homework, some technical problems which I had not envisaged, but in general things were looking really interesting.

And then the submissions completely dried up. I had one more submission just before the 20th Feb deadline, taking me to 16.

And I still haven’t done the homework, so for several of these questions nobody actually knows the answers at all. However as the submitter of one of these questions pointed out, “whilst I don’t know the answer, it would still be very easy to look at a machine’s reasoning and point out the mistakes in it”.

So to summarise: I have some data, but I am not sure that it’s enough to be interesting yet; to make it interesting I would (a) need more data and (b) need to do a lot of time-consuming work which right now I am not sure I am motivated to do 😦. I guess the main problem is that even if I did do this work, it is not clear to me that the results would be of any interest.

I am not entirely clear about how to move forwards from here, but several people have asked me about the status of the experiment so I felt obliged to post this update. For those that submitted questions: I’ve not shown them to anyone else, and I would note that Epoch AI seem to be offering $7500 for each problem which they accept onto their dataset (see the link to Tier 4 above).

Lessons learnt

Epoch AI have made what sounds like a very impressive database, but it cost them a lot of money. I tried to do the same thing for free based on the good will of my number theory colleagues and I got nowhere near.

In retrospect I think that my angry reaction to “current datasets are seriously flawed, let’s make another one” was very naive, in the sense that even if I had got 20 or even 50 hard number theory problems with unique solutions, such a dataset would still have many of the problems which I am complaining about. There are all sorts of problems with private datasets (e.g. they can only be used once per tech company as far as I can see, although AIMO’s approach using Kaggle works great for open-source models). And they are not measuring what mathematicians are doing (proving theorems); instead they are measuring what the tech companies are capable of doing (producing numbers).

Several people said to me about the original post “this is clearly a mathematician in denial, scared for their career because the machines are coming”. I can see that my thoughts can be interpreted in this way. But right now I am so sceptical about current approaches when it comes to machines proving theorems that I’m going to stick to my guns. I think my current take on things is: progress is incredibly rapid, but these systems are still so far away from being a game-changer to a researcher when it comes to helping with technical details. And past performance is not a guarantee of future results.

There have been some successes with computers helping mathematicians in crazy new ways. Neural networks being used as data-analysis tools have helped research mathematicians to make progress in certain areas of mathematics where questions are of a certain form (e.g. “find a combinatorial counterexample to this claim” or “spot a new pattern in this database of numbers”). However, much of mathematics is not of this form. Language models have helped research mathematicians; I have an ever-growing list of examples of language models making helpful observations or helping with brainstorming ideas. The percentage hit rate right now is extremely low (e.g. “ask the LLM 20 questions and it says one useful thing for one of them”). Even a 5% hit rate is exciting — don’t get me wrong! It is great that my community is learning how to use these tools, and it is great that these tools are occasionally proving useful to research mathematicians. But Google is also a useful tool for a research mathematician, and nobody is running around claiming that search engines are going to solve mathematics. Sure LLMs will get better. But will they become a game-changer? Right now this is not at all clear to me.

Something which I think would help to take LLMs to the next level would be far more rigorous ways of measuring their abilities on technical questions. Right now the FrontierMath dataset seems to be the best thing we have. It is not adequate for many reasons, but right now it is difficult for me to see how to do better with current technology. As I continue to say, integration of LLMs with theorem provers seems to me to be a very natural next step. But this is expensive and difficult, and recent successes such as AlphaProof are still only engaging with mathematical questions which can be understood by a schoolchild.

It has been suggested to me that one big bottleneck with the approach of integrating LLMs and ITPs is that the size of the formal mathematics libraries available is too small. “The bottleneck is mathlib”, someone from a tech company told me recently; Lean’s formal mathematics library is “only” millions of lines of code rather than billions, even if you take into account the over 1300 open pull requests which need reviewing. The formal mathematics community is too small and mathematics departments are not good at rewarding work in this area. Perhaps a useful intermediate goal is figuring out how AI can help to increase the size of a formal mathematics library, so other AIs can learn from it. Sergei Gukov suggests around 32 minutes into the video on the Epoch AI Tier 4 blog post that autoformalization (that is, getting a machine to read in LaTeX and output code in a language such as Lean) will be the key. How difficult this will be remains to be seen; I have seen progress recently but whether it will scale is another question. One interesting test problem: how long will it be before we can feed the 10,000 page human-written proof of the classification of finite simple groups into a computer and get formalized code out? Whether or not this helps AI to solve modern problems, it would certainly be a spectacular milestone.


What is a quotient?

Undergraduate mathematicians usually have a hard time defining functions from quotients in Lean, because they have been taught a specific model for quotients in their classes, which is not the model that Lean uses. This post is an attempt to explain what’s going on.

What are the natural numbers?

Before I start on what quotients are, let’s talk about what the natural numbers are. What is 37? What is it made of?

If you asked Euclid or Euler or Gauss or Riemann this question, they would think you were talking nonsense. 37 is a number, and it’s not made of anything. If you asked Peano, you would perhaps get a more nuanced answer: Peano might say that it doesn’t matter what 37 is made of, all that matters is that the natural numbers satisfy Peano’s axioms, because this is all you need to set up the theory (and if you don’t believe him, you should play the natural number game).

But if you were to ask someone who knows something about the foundations of mathematics, they would say that it is possible to give an answer, but only after you have decided which foundations you are using. If you build mathematics within set theory then the natural numbers are a set, as is 37, and under the usual conventions we have that 37 is the set {0,1,2,3,…,36}. If you build things within type theory then the natural numbers are typically an inductive type, and 37 is the term succ (succ (succ (succ ... (succ 0)))...) of this type. Finally, if you build mathematics within category theory then the natural numbers are typically an object of a certain category, and 37 will be a certain morphism from a terminal object into this natural number object.
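
For the curious, here is a minimal sketch of the type-theoretic answer, written as a toy Lean inductive type; this mirrors how Lean’s own Nat is set up (the names are mine, chosen so as not to clash with the real thing):

-- a toy copy of the natural numbers as an inductive type
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

-- "3" is then literally the term `succ (succ (succ zero))`;
-- 37 is the same thing with 37 `succ`s.
def three : MyNat := .succ (.succ (.succ .zero))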

But the point here is that the actual details of what 37 “is” do not matter. Gauss was proving quadratic reciprocity before any of these foundational theories were even proposed. All that matters is that the natural numbers, however they are modelled, satisfy Peano’s axioms. Peano’s axioms characterise the natural numbers uniquely up to unique isomorphism. In other words, if you have two models of the natural numbers and they both satisfy Peano’s axioms then those models are in a very strong sense “the same thing”.

Similarly, it doesn’t matter what the real numbers actually are, they can be a set, a type or an object of a category, and their elements can be Cauchy sequences, Dedekind cuts, Bourbaki uniform space completions, Eudoxus reals or whatever your favourite construction of the real numbers is; all that matters is that they are a complete archimedean ordered field. This hypothesis characterises the reals uniquely up to unique isomorphism, and you can build all of modern analysis from it (indeed my first analysis lecturer did just this: they developed all the basic theory of sequences, sums and integrals just from this assumption).

What is a quotient?

When I was a kid, I was into the maths olympiad scene, and I knew about the integers mod n. For example I knew that the integers mod 10 were {0,1,2,3,4,5,6,7,8,9} with addition and multiplication redefined so that if you reach 10 or more then you just subtract 10 until you’re back below 10. For example 9 + 3 = 2 in the integers mod 10, because 12 = 2 (as 10 = 0, 11 = 1 and so on). This was my childhood model of the integers mod 10. But then I went to the University of Cambridge and there I was told by my algebra lecturer that this model was wrong. The integers mod 10, or \mathbb{Z}/10\mathbb{Z} as I was now expected to call them, were a quotient group, and so by definition the elements were cosets. Turns out that 3 wasn’t an element of the integers mod 10 after all; turns out that the thing I was calling 3 was actually the infinite set {...-17,-7,3,13,23,33,43,...}, or 3+10\mathbb{Z} or [3].

At the time, I was confused, because I had hard evidence that my model for the integers mod 10 worked fine, and nobody had hauled me up on this when marking my solutions to olympiad questions. But it was explained to me that my idea of choosing a canonical set of coset representatives did not generalise well, and that I would be better off modelling quotients as sets of sets. Whatever. It took me decades to understand that this claim was not the end of the story (and I thank Patrick Massot for pointing this out to me).

The point of this post is to explain that, just as it doesn’t matter what the natural numbers are as long as they satisfy Peano’s axioms, and just as it doesn’t matter what the real numbers are as long as they are a complete ordered archimedean field, it also doesn’t matter what the elements of a quotient group are, as long as a certain axiom is satisfied, which will characterise quotient groups uniquely up to unique isomorphism.

“Well-defined” functions.

So what is the “axiom for quotients”? It is typically very well-hidden in an undergraduate mathematics curriculum, and is usually explained in terms which heavily rely on the “set of sets” model for quotients. The axiom is often not clearly stated (I looked back in my undergraduate notes and it never seems to be made explicit). However I do understand why not; the axiom is actually a universal property, which is quite an abstract concept. One would like to talk about things like quotient groups as early as possible in the mathematics curriculum (for example, to state the first isomorphism theorem for groups); but students are typically only told about universal properties in more advanced algebra courses covering, for example, tensor products and localisations (both of which are, uncoincidentally, constructed as quotients).

The axiom we seek is typically hidden behind this strange idea that a function is “well-defined”. Let’s keep running with our example of the additive group of integers mod 10. If we want to define a function to this set, it’s very easy. There is a canonical “reduction mod 10” function from the integers to the integers mod 10; let’s call it q.

[diagram: the reduction-mod-10 map q from \mathbb{Z} to \mathbb{Z}/10\mathbb{Z}.]

If we now have some other set X and want to give a function f from X to \mathbb{Z}/10\mathbb{Z}, then one way to do so would simply be to give a function \tilde{f} from X to \mathbb{Z}, and then just compose this with q.

[diagram: \tilde{f} from X to \mathbb{Z}, q from \mathbb{Z} to \mathbb{Z}/10\mathbb{Z}, and the composite f from X to \mathbb{Z}/10\mathbb{Z}.]
Given \tilde{f}, we can just make f by composing with q.

If X happens to be a group and \tilde{f} happens to be a group homomorphism, then f will also be a group homomorphism, because q is. So we have a perfectly good way of defining maps into the quotient, which doesn’t rely on anything fancy. The fun starts when we want to define a map out of the quotient. This is where we have to check that something is “well-defined”.

Let’s work through a mathematically simple example to remind us of the “well-defined yoga” which we put undergraduates through. To know a positive integer mod 10 is to know its last digit, and if you know the last digit of a number then you can figure out if it’s even or odd. So there should be some natural map s from the integers mod 10 to the set {Even,Odd}. Let’s write this map with a dotted arrow and then discuss how we can go about defining it.

[diagram: q from \mathbb{Z} to \mathbb{Z}/10\mathbb{Z}, with a dotted arrow s from \mathbb{Z}/10\mathbb{Z} to {Even, Odd}.]
We want to define the map s and check it’s “well-defined”.

Let’s now run through the way we are formally taught how to construct s. First we choose an integer mod 10, for example 8, or, as my algebra lecturer would have called it, [8]. Then we remember that secretly this element is actually an infinite set {...-12,-2,8,18,28,...}. So now we choose a random element in this set, for example 28. Now 28 is clearly even, so we define s([8])=\mathrm{Even}.

But there’s a catch! The catch is that at some point in the procedure we made a completely random choice of an element of an infinite set. What if we had chosen 108 or 1000008 instead? Well, here’s a funny thing: both 108 and 1000008 are also even, as indeed is every element of the equivalence class, so it didn’t actually matter which element we chose; we always got the same answer. This means that our definition of s([8]) is “well-defined”. Indeed, we can dress up the key point of this argument as a theorem:

Theorem. If x and y are integers which are congruent mod 10, then x is even if and only if y is even, and x is odd if and only if y is odd.

Proof. If x and y are congruent mod 10, then x - y is a multiple of 10 and thus a multiple of 2, so x and y have the same parity: they’re either both even, or both odd. QED.
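
(As an aside, this particular check is well within reach of automation. Here is a one-line Lean rendering, a toy sketch of mine in which parity is encoded as the residue mod 2 and the omega tactic does the work.)

-- congruent mod 10 implies the same parity (parity encoded as the residue mod 2)
example (x y : Int) (h : x % 10 = y % 10) : x % 2 = y % 2 := by omega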

We defined s([8]); let’s now do the general case. Given t, an integer mod 10, t is secretly an equivalence class of integers, where two integers are equivalent if they’re congruent mod 10. Choose an element x of this equivalence class and see if it’s even or odd; define s(t) to be the answer. If we had instead chosen a different element y in t then by the theorem above we would have got the same answer for s(t); thus s(t) is “well-defined” and thus exists.

[diagram: the map s from \mathbb{Z}/10\mathbb{Z} to {Even, Odd}, now constructed.]

The question is: what actually just happened?

The universal property

What we actually did was the following. There’s certainly a well-defined map \tilde{s} from the integers to {Even,Odd}. So what we actually have is this situation:

[diagram: q from \mathbb{Z} to \mathbb{Z}/10\mathbb{Z}, and \tilde{s} from \mathbb{Z} to {Even, Odd}.]

And what we want to do is to construct the diagonal map s such that \tilde{s}=s\circ q:

[diagram: the same picture, now with the diagonal map s from \mathbb{Z}/10\mathbb{Z} to {Even, Odd} making the triangle commute.]

But, in contrast to when we were making f from \tilde{f}, we can’t define s by composing q and \tilde{s} because they’re not composable this time: the source of one of these maps isn’t equal to the target of the other. We already defined s in the traditional way above, using this “well-defined” argument, which boiled down to the theorem we stated and proved earlier. Now that we have introduced \tilde{s}, we can restate this theorem as

Theorem. If x and y are congruent mod 10 then \tilde{s}(x)=\tilde{s}(y).

In other words,

Theorem. If x and y are equivalent, then \tilde{s}(x) and \tilde{s}(y) are equal.

This theorem was exactly what we needed to construct the function s.

So the axiom must be: to give a function from \mathbb{Z}/10\mathbb{Z} to a set Y is to give a function from \mathbb{Z} to Y which is constant on equivalence classes. That is what all this “well-defined” stuff boils down to.
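
In symbols (my paraphrase of the axiom): composition with q gives a bijection between functions f : \mathbb{Z}/10\mathbb{Z} \to Y and functions \tilde{f} : \mathbb{Z} \to Y satisfying \tilde{f}(x)=\tilde{f}(y) whenever x and y are congruent mod 10; the bijection sends f to f\circ q, and the “well-defined” argument above is precisely the construction of its inverse.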

How it’s done in Lean

Let’s now consider the general case, and how it’s done in Lean. Given a type X and an equivalence relation r on X, one can make the quotient Quot r of X by this equivalence relation (so in the example above, r is “congruence mod 10” and Quot r is \mathbb{Z}/10\mathbb{Z}), and there’s a function Quot.mk r : X → Quot r (i.e. Quot.mk r is the function we were calling q above). The terms of type Quot r are not equivalence classes; they are opaque (i.e., we don’t know what they are). If you really want the equivalence class associated to an element of the quotient, you can have it; just take its preimage under Quot.mk r. But are you sure you actually want it?

One time when we think we want the equivalence class is when defining a function from the quotient, using the “well-defined yoga” that we just ploughed through above. Lean’s version of this yoga is called Quot.lift. This is a function which eats a function \tilde{s} : X → Y plus a proof that it is constant on equivalence classes, and “descends” it to a function s : Quot r → Y such that \tilde{s} = s \circ q. Furthermore the Lean developers have designed things so that the proof that \tilde{s}(x)=s(q(x)) is just rfl, i.e., it’s true by definition. Unfortunately they also decided to write their diagrams upside-down so we are stuck with calling s the “lift” of \tilde{s}.
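
To make this concrete, here is a minimal self-contained sketch of the parity example done with Quot.mk and Quot.lift. It is a toy example of mine rather than anything from mathlib, and it assumes a recent Lean where the omega tactic is available out of the box.

inductive Parity
  | even
  | odd

-- congruence mod 10, as a relation on the natural numbers
def congMod10 (x y : Nat) : Prop := x % 10 = y % 10

-- the map s̃, defined on the "big" type ℕ
def sTilde (x : Nat) : Parity := if x % 2 = 0 then .even else .odd

-- the "well-defined" lemma: s̃ is constant on equivalence classes
theorem sTilde_respects (x y : Nat) (h : congMod10 x y) : sTilde x = sTilde y := by
  unfold congMod10 at h
  unfold sTilde
  have hxy : x % 2 = y % 2 := by omega
  rw [hxy]

-- `Quot.lift` descends s̃ to the quotient
def s : Quot congMod10 → Parity :=
  Quot.lift sTilde sTilde_respects

-- the triangle commutes by definition: s (q x) = s̃ x is proved by `rfl`
example (x : Nat) : s (Quot.mk congMod10 x) = sTilde x := rfl

The final rfl is Lean’s version of the statement that \tilde{s}=s\circ q holds on the nose, not merely up to some identification.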

This axiom, that to give a function X/~ → Y is to give a function X → Y constant on equivalence classes, and that the obvious triangle commutes, is the universal property of quotients. Just like the earlier examples of axioms characterising constructions, it is all we need to know about quotients. We don’t need to assume that an element of a quotient is a coset; all we need to know is the axiom, and we can use this, and nothing more than this, to develop the entire theory of quotient groups, quotient rings and so on; indeed this is what happens in Lean’s mathematics library. There are also many more convenient consequences of this axiom which are available to a Lean user: for example QuotientGroup.lift is the construction which, given a group homomorphism G → H and a proof that a normal subgroup N of G is contained in the kernel of this homomorphism, descends (or “lifts” :-/) this homomorphism to a homomorphism G/N → H. To give another example, QuotientGroup.map is the function which eats a group homomorphism φ : G → H, normal subgroups N of G and M of H and a proof that N ⊆ φ⁻¹(M), and produces a group homomorphism G/N → H/M. But all of these more convenient ways to construct maps from a quotient are ultimately built from Quot.lift.

If you want to see more details of how to use quotients in Lean, I show how to define the integers from the naturals here, and the rationals from the integers here.

Appendix: what Lean’s quotients are made of

If you’re still a bit confused by the above, you might not want to read this last part.

One reason why Lean does not model quotients as sets of sets is that Lean uses type theory as its foundation, so there are types G and terms g, and Lean writes g : G when a set theorist would write g\in G. But g is a term, not a type, so x : g makes no sense. Set theory allows you to make sets of sets, but type theory does not allow you to make types of types; types contain terms, terms don’t contain anything because they’re not containers, and that’s how it works. We can make sets in type theory and then make sets of sets, but given that groups and rings and fields and so on are all set up as types in Lean’s mathematics library, there is little point using Lean’s sets just to make a model of a quotient when it can be done using the axiomatic approach above. However Lean still has to somehow actually define the quotient of a type X by an equivalence relation: what is the definition under the hood?

I was amused to learn that in Lean, internally the terms of X/~ are just defined to be the terms of X! But equality is redefined to be ~. This can be done because unlike in set theory, where an axiom tells you when two sets are equal, equality is a much more flexible concept in Lean’s type theory; it is defined as an inductive type rather than being a fundamental part of the logic. If you think this approach is strange, I would argue that actually it probably aligns far better with Gauss’ model of the integers mod 10 than the “set of sets” approach commonly adopted by mathematics departments nowadays. Surely we’re happy with the idea that 2 = 12 in \mathbb{Z}/10\mathbb{Z}? Lean just takes this literally, rather than muddying the waters.
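
For example (another toy snippet of mine, with the congruence-mod-10 relation redeclared so that it stands alone), the built-in Quot.sound is what upgrades relatedness to equality of terms of the quotient, so “2 = 12” really is a theorem rather than an abuse of notation:

def congMod10' (x y : Nat) : Prop := x % 10 = y % 10

-- relatedness becomes equality in the quotient
example : Quot.mk congMod10' 2 = Quot.mk congMod10' 12 :=
  Quot.sound (show 2 % 10 = 12 % 10 from rfl)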


Think of a number.

My feed was recently clogged up with news articles reporting that Sam Altman thinks that AGI is here, or will be here next year, or whatever. I will refrain from giving even more air to this nonsense by linking to the stories. This kind of irresponsible hype-generation drives me nuts (although it also drives up stock prices so I can see why the tech bros are motivated to do it). Sure AI can have a good crack at undergraduate mathematics right now, and sure that’s pretty amazing. But our universities are full of students who can also have a good crack at undergraduate mathematics so in some sense we have achieved very little (and certainly nothing which is of any use to me as a working mathematician in 2025). If AI cannot do mathematics at PhD-student level (i.e., if it can’t start thinking for itself) then AGI is certainly not here, whatever “AGI” even means.

In an attempt to move beyond the hype and to focus on where we are right now, I’m going to try an experiment. It might fail. But if you’re a number theorist reading this, you can help me make it succeed.

Can AI do mathematics?

The null hypothesis as far as I am concerned is “not really”. More precisely, I have seen absolutely no evidence that language models are doing anything other than paraphrasing and attempting to generalise what they’ve seen on the internet, not always correctly. Don’t get me wrong — this gets you a long way! Indeed many undergraduate mathematics exams at university level are designed so that students who have a basic knowledge of the ideas in the course will pass the exam. The exam will test this knowledge by asking the student either to regurgitate the ideas, or to apply them in situations analogous to those seen in the problem sheets which came with the course. You can pass these exams by intelligent pattern-matching (if you have an encyclopedic knowledge of the material, which a machine would have). Furthermore, undergraduate pure mathematics is stagnant. The courses (other than my Lean course) which we teach in the pure mathematics degree at Imperial College London are essentially exactly the same as those which were offered to me when I was an undergraduate in Cambridge in the late 1980s, and very little post-1960 mathematics is taught at undergraduate level, even at the top institutions in the world, because it simply takes too long to get there. This is exactly why language models can pass undergraduate mathematics exams. But as I already said, this is no use to me as a working mathematician and it is certainly not AGI.

How do we test AI?

If you look at currently available mathematics databases, they are essentially all focused on mathematics which is at undergraduate or olympiad-style level. Perhaps the tech bros are deluded in thinking that because their models are getting high marks on such databases, they’re doing well. I want to firstly stress the fundamental notion that these databases are completely unrepresentative of mathematics. More recently we had the FrontierMath dataset, which was designed by mathematicians and which looked like a real step in the right direction, even though the answers to the questions were not proofs or ideas, but just numbers (which is also a long way from what research mathematics looks like); the dataset is private, but the 5 sample questions which were revealed to us were clearly beyond undergraduate level. People (including me) were tricked into saying publicly that any machine which can do something nontrivial here would really be a breakthrough. The next thing we know, OpenAI were claiming that their system could get 25 percent on this database. Oof. But shortly after that, Epoch AI (who put the dataset together) revealed that 25 percent of the questions were undergraduate or olympiad level, and this week it transpires that OpenAI were behind the funding of the dataset and reportedly were even given access to some of the questions (added later: in fact Elliot Glazer from Epoch AI has confirmed this on Reddit); all of a sudden 25 percent doesn’t sound so great any more.

So maybe we need to do it all again :-/

Let’s make a database

I want people (i.e. researchers in number theory at PhD level or beyond) to help me put together a secret database of hard number theory problems. I already have 5 but I need at least 20 and ideally far more, so I need help. The problems need to be beyond undergraduate level in the sense that undergraduates will not have been taught all the skills necessary to solve them. Because LLMs are not up to writing proofs, the answers unfortunately need to be non-negative integers: there is simply no point asking an LLM basic research level questions and then ploughing through the crap they produce and giving it 0/10. Anyone who thinks that LLMs are actually capable of writing proofs should show me one that can do the Putnam, because all efforts I saw on the 2024 exam were abysmal, and this is undergraduate level. As always, the difficulty here is that soon after the questions and solutions are made public, the models train on them and the experiment is thus invalidated. In particular there is no point asking a model how to solve the 2024 Putnam questions now: one would expect perfect solutions, parroted off the internet by the parrots which we are being told are close to AGI. Note that restricting to questions for which the answer is a non-negative integer is again moving in a direction which is really really far from what researchers actually do, but unfortunately this is the level we are at right now.

Once we have a decent number of questions, which perhaps means at least 20 and maybe means 50, I’ll announce this, and then any AI company who wants a go can get in touch. I’ll send them the questions (but no solutions), they can send me back the answers, and then I’ll publicly announce the company and the score they got. Each company is allowed one go. When a few have had a go, I’ll make the questions public.

What will the questions look like?

I’ll finish this post with a more technical description of what I’m looking for. I need to be able to solve the questions myself, so they should be in number theory (broadly interpreted) or a nearby area. The answer should be a non-negative integer (expressible in base 10, so perhaps < 10^100). Ideally the questions should be expressible in LaTeX without any fancy diagrams. The questions may or may not be comprehensible to an undergraduate, but the proofs should be beyond all but the smartest ones, ideally because they need material which is not taught to undergraduates. The questions should not be mindless variants of standard questions which are already available on the internet — although variants which actually need some understanding of what is going on are fine; we are trying to test the difference between understanding and the stochastic parrot model, which is my model of an LLM: we are interested in testing the hypothesis that a language model can think mathematically. Language models can use computers and the internet, so questions where you need to use a computer or online database to answer them are fine, although the question should not simply be “look up a number in a database” as this is too easy. Ideally the question needs an idea which is not explicit in the question. Ideally the answer is not easily guessable, because I have heard reports of LLMs getting questions on the FrontierMath dataset right, backed up by completely spurious reasoning. Please don’t test your question by typing it into an LLM to see the answer. Any ideas? Contact me at my Imperial email.

Let’s give it 1 month, so submissions close on 20th Feb, and I’ll report back soon after that date, hopefully to say that we have a decent-sized database and the experiment can commence. Thanks in advance to all number theorists who respond.


Can AI do maths yet? Thoughts from a mathematician.

So the big news this week is that o3, OpenAI’s new language model, got 25% on FrontierMath. Let’s start by explaining what this means.

What is o3? What is FrontierMath?

A language model, as probably most people know, is one of these things like ChatGPT where you can ask it a question and it will write some sentences which are an attempt to give you an answer. There were language models before ChatGPT, and on the whole they couldn’t even write coherent sentences and paragraphs. ChatGPT was really the first public model which was coherent. There have been many other models since. Right now they’re still getting better really fast. How much longer this will go on for nobody knows, but there are lots of people pouring lots of money into this game so it would be a fool who bets on progress slowing down any time soon. o3 is a new language model.

FrontierMath is a secret dataset of “hundreds” of hard maths questions, curated by Epoch AI, and announced last month. “Hundreds” is a quote from the paper (first line of the abstract), but I’ve heard a rumour that when the paper came out there were under 200 questions, although I’ve heard another rumour that apparently more have been added since. As an academic mathematician who spent their entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me a little that already in this paragraph we’ve seen more questions than answers — I am not even able to give you a coherent description of some basic facts about this dataset, for example, its size. However there is a good reason for the secrecy. Language models train on large databases of knowledge, so the moment you make a database of maths questions public, the language models will train on it. And then if you ask such a model a question from the database, it’ll probably just rattle off the answer which it has already seen.

How hard is the FrontierMath dataset?

So what are the questions in the FrontierMath dataset like? Here’s what we know. They’re not “prove this theorem!” questions, they’re “find this number!” questions. More precisely, the paper says “Problems had to possess definitive, computable answers that could be automatically verified”, and in the five sample problems which were made public from the dataset (Appendix A of the paper, pages 14 to 23) the solutions are all positive whole numbers (one answer is 9811, another is 367707, and the final three solutions are even larger — clearly these questions are designed in such a way that random guesswork is extremely unlikely to succeed). The sample questions are nontrivial, even to a research mathematician. I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before that the function mapping a natural n to alpha^n was p-adically continuous in n iff the p-adic valuation of alpha-1 was positive) and I knew exactly how to do the 5th one (it’s a standard trick involving the Weil conjectures for curves) but I didn’t bother doing the algebra to work out the exact 13-digit answer. The first and second question I knew I couldn’t do, and I figured I might be able to make progress on the 4th one if I put some real effort in, but ultimately I didn’t attempt it, I just read the solution. I suspect that a typical smart mathematics undergraduate would struggle to do even one of these questions. To do the first one you would, I imagine, have to be at least a PhD student in analytic number theory. The FrontierMath paper contains some quotes from mathematicians about the difficulty level of the problems. Tao (Fields Medal) says “These are extremely challenging” and suggests that they can only be tackled by a domain expert (and indeed the two sample questions which I could solve are in arithmetic, my area of expertise; I failed to do all of the ones outside my area). Borcherds (also Fields Medal) however is quoted in the paper as saying that machines producing numerical answers “aren’t quite the same as coming up with original proofs”.

So why make such a dataset? The problem is that grading “hundreds” of answers to “prove this theorem!” questions is expensive (one would not trust a machine to do grading at this level, at least in 2024, so one would have to pay human experts), whereas checking whether hundreds of numbers in one list correspond to hundreds of numbers in another list can be done in a fraction of a second by a computer. As Borcherds pointed out, mathematics researchers spend most of the time trying to come up with proofs or ideas, rather than numbers; however, the FrontierMath dataset is still extremely valuable because the area of AI for mathematics is desperately short of hard datasets, and creating a dataset such as this is very hard work (or equivalently very expensive). This recent article by Frieder et al. talks in a lot more depth about the shortcomings in datasets for AI in mathematics.

So there was an article about the FrontierMath dataset in Science and I was quoted in it as saying “If you have a system that can ace that database, then it’s game over for mathematicians.” Just to be clear: I had nothing to do with the dataset, I’ve only seen the five public questions, and was basing my comments on those. I also said “In my opinion, currently, AI is a long way away from being able to do those questions … but I’ve been wrong before”. And then this week there’s an announcement that the language model o3 got a score of 25 percent on the dataset. I was shocked.

What exactly has happened here?

Why was I shocked? Because my mental model of where “AI” currently is, when it comes to doing mathematics, is “undergrad or pre-undergrad”. It’s getting very good at “Olympiad-style” problems of the sort given to bright high-schoolers. Within a year it’s absolutely clear that AI systems will be passing undergraduate mathematics exams (not least because when you’re setting an undergraduate mathematics exam you ideally need to make sure that you don’t fail 50 percent of the class, so you throw in a couple of standard questions which are very similar to questions that the students have seen already, to ensure that those with a basic understanding of the course will pass the exam. Machines will easily be able to ace such questions). But the jump from that to having innovative ideas at advanced undergrad/early PhD level beyond recycling standard ideas seems to me to be quite a big one. For example I was very unimpressed by the ChatGPT answers to the recent Putnam exam posted here — as far as I can see only question B4 was answered adequately by the machine; most other answers are worth one or two out of 10 at most. So I was expecting this dataset to remain pretty unattackable for a couple of years.

My initial excitement was tempered, however, by a post from Elliot Glazer from Epoch AI on Reddit where he claimed that in fact 25 percent of the problems in the dataset were “IMO/undergrad style problems”. This claim is a little confusing because I would be hard pressed to apply such adjectives to any of the five publicly released problems in the dataset; even the simplest one used the Weil conjectures for curves (or a brute force argument which is probably just about possible but would be extremely painful, as it involves factoring 10^12 degree 3 polynomials over a finite field, although this could certainly be parallelised). This of course raises questions in my mind about what the actual level of the problems in this secret dataset is (or equivalently whether the five public questions are actually a representative sample), but this is not knowledge which we’re likely to have access to. Given this new piece of information that 25 percent of the problems are undergraduate level, perhaps I will revert to being unsurprised again, but will look forward to being surprised when AI is getting nearer 50 percent on the dataset, because performance at “qual level” (as Elliot describes it — the next 50 percent of the questions) is exactly what I’m waiting to see from these systems — for me this would represent a big breakthrough.

Prove this theorem!

However, as Borcherds points out, even if we ended up with a machine which was super-human at “find this number!” questions, it would still have limited applicability in many areas of research mathematics, where the key question of interest is usually how to “prove this theorem!”. In my mind, the biggest success story in 2024 is DeepMind’s AlphaProof, which solved four out of the six 2024 IMO (International Mathematics Olympiad) problems. These were either “prove this theorem!” or “find a number and furthermore prove that it’s the right number” questions and for three of them, the output of the machine was a fully formalized Lean proof. Lean is an interactive theorem prover with a solid mathematics library mathlib containing many of the techniques needed to solve IMO problems and a lot more besides; DeepMind’s system’s solutions were human-checked and verified to be “full marks” solutions. However, we are back at high school level again; whilst the questions are extremely hard, the solutions use only school-level techniques. In 2025 I’m sure we’ll see machines performing at gold level standard in the IMO. However this now forces us to open up the “grading” can of worms which I’ve already mentioned once, and I’ll finish this post by talking a little more about it.

Who is marking the machines?

July 2025. I can envisage the following situation. As well as hundreds of the world’s smartest schoolchildren entering the IMO, there will be machines entering. Hopefully not too many though. Because the systems will be of two types. There will be systems submitting answers in the language of a computer proof checker like Lean (or Rocq, Isabelle, or many others). And there will be language models submitting answers in human language. The big difference between these two types of submission is this: if a marker verifies that the statement of the question has been correctly translated into the computer proof checker, then all they need to do is to check that the proof compiles and then they basically know that it is a “full marks” solution. For the language models we will have a situation like the poor Putnam solutions above — the computer will write something, it will look convincing, but a human is going to have to read it carefully and grade it, and there is certainly no guarantee that it will be a “full marks” solution. Borcherds is right to remind the AI community that “prove this theorem!” is what we really want to see as mathematicians, and language models are currently at least an order of magnitude less accurate than expert humans when it comes to logical reasoning. I am dreading the inevitable onslaught in a year or two of language model “proofs” of the Riemann hypothesis which will just contain claims which are vague or inaccurate in the middle of 10 pages of correct mathematics which the human will have to wade through to find the line which doesn’t hold up. On the other hand, theorem provers are at least an order of magnitude more accurate: every time I’ve seen Lean not accept a human argument in the mathematical literature, the human has been wrong.

In fact, as mathematicians, we would like to see more than “prove this theorem!”. We would like to see “prove this theorem, correctly, and explain what makes the proof work in a way which we humans understand”. With the language model approach I worry (a lot) about “correctly” and with the theorem prover approach I worry about “in a way which we humans understand”. There is still a huge amount to be done. Progress is currently happening really quickly. But we are a long way away. When will we “beat the undergraduate barrier”? Nobody knows.


Fermat’s Last Theorem — how it’s going

So I’m two months into trying to teach a proof of Fermat’s Last Theorem (FLT) to a computer. Most of “how it’s going” is quite tedious and technical to explain: to cut a long story short, Wiles proved an “R=T” theorem and much of the work so far has gone into teaching the computer what R and T are; we still have not finished either definition. However my PhD student Andrew Yang has already proved the abstract commutative algebra result which we need (“if abstract rings R and T satisfy lots of technical conditions then they’re equal”), and this is an exciting first step. The current state of the write-up is here, and the system we are using is Lean and its mathematical library mathlib, maintained by the Lean prover community. If you know a bit of Lean and a bit of number theory then feel free to read the contribution guidelines, check out the project dashboard and claim an issue. As I say, we’re two months in. However we already have one interesting story, which I felt was worth sharing. Who knows if it’s an indication of what is to come.

We’re not formalising the old-fashioned 1990s proof of FLT. Since then, work of many people (Diamond/Fujiwara, Kisin, Taylor, Scholze and several others) led to the proof being generalised and simplified, and part of the motivation of my work is not to just get FLT over the line but to prove these more general and powerful results. Why? Because if the AI mathematics revolution actually happens (which it might) and if Lean turns out to be an important component (which it might) then computers will be in a better position to start helping humans to push back the boundaries of modern number theory because of this formalization work, as they’ll have access to key modern definitions in a form which they understand. One concept which was not used in Wiles’ original proof but which is being used in the proof we’re formalizing is crystalline cohomology, a theory developed in the 60s and 70s in Paris, with the foundations laid down by Berthelot following ideas of Grothendieck. The basic idea here is that the classical exponential and logarithm functions play a key role in differential geometry (relating Lie algebras and Lie groups, for example), and in particular in understanding de Rham cohomology, but they do not work in more arithmetic situations (for example in characteristic p); the theory of “divided power structures”, developed in the 1960s in a series of beautiful papers by Roby, plays a crucial role in constructing an analogue of these functions which can be used in the arithmetic case. tl;dr: we need to teach the computer crystalline cohomology, so first we need to teach it divided powers.
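
To give a rough idea of what a divided power structure is (my paraphrase, and a simplification): it is a family of operations \gamma_n on an ideal I of a ring A which behave like x \mapsto x^n/n!, in the sense that, for example, n!\,\gamma_n(x) = x^n; with such operations in hand one can make sense of \exp(x) = \sum_{n\ge 0}\gamma_n(x) even when the factorials are not invertible, which is exactly the situation in characteristic p.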

Antoine Chambert-Loir and Maria Ines de Frutos Fernandez have been teaching the theory of divided powers to Lean, and over the summer Lean did that irritating thing which it sometimes does: it complained about the human presentation of an argument in the standard literature, and on closer inspection it turned out that the human argument left something to be desired. In particular a key lemma in Roby’s work seems to be incorrect. When Antoine told me this in a DM, he remarked that he supposed I would think this was funny, and indeed the very long string of laughing emojis which he got back as a response to his message confirmed this. However Antoine, being rather more professional than me, argued that instead of me tweeting about the issue (which I can’t do anyway because I left Twitter and joined Bluesky yesterday), we should in fact attempt to fix the problem. We went about this in rather different ways. Antoine put it on his job list to look at, and I completely ignored the problem and just started occasionally mentioning to people that the proof was in trouble, in a weak sense. I say “in a weak sense” because this observation has to be put into some context. According to the way I currently view mathematics (as a formalist), the entire theory of crystalline cohomology vanished from the literature at the moment Antoine discovered the issue, with massive collateral damage (for example huge chunks of Scholze’s work just disappeared, entire books and papers vaporised etc). But this disappearance is only temporary. Crystalline cohomology is in no practical sense “wrong”. The theorems were still undoubtedly correct; it’s just that the proofs were, as far as I am concerned, incomplete (or at least, the ones Antoine and Maria Ines were following were), and unfortunately it is now our job to fix them. The thing I want to stress is that it was absolutely clear to both me and Antoine that the proofs of the main results were of course going to be fixable, even if an intermediate lemma was false, because crystalline cohomology has been used so much since the 1970s that if there were a problem with it, it would have come to light a long time ago. Every expert I’ve spoken to is in complete agreement on this point (and several even went so far as to claim that I’m making a fuss about nothing, but perhaps they don’t understand what formalization actually means in practice: you can’t just say “I’m sure it’s fixable” — you have to actually fix it). One added twist is that Roby, Grothendieck and Berthelot have all died, so we could not go back to the original experts and ask directly for help.

[For those that are interested in more technical details, here they are: Berthelot’s thesis does not develop the theory of divided powers from scratch; he uses Roby’s “Les algèbres à puissances divisées”, published in Bull. Sci. Math., 2ième série, 89, 1965, pages 75-91. Lemme 8 (on p86) of that paper seems to be false and it’s not obvious how to repair the proof; the proof of the lemma misquotes another lemma of Roby from his 1963 Ann. Sci. ENS paper; the correct statement is \Gamma_A(M) \otimes_A R = \Gamma_R(M \otimes_A R), but one of the tensor products accidentally falls off in the application. This breaks Roby’s proof that the divided power algebra of a module has divided powers, and thus stops us from defining the ring A_{cris}.]

So as I say, Antoine worked on fixing the problem, whereas I just worked on gossiping about it to the experts, and I made the mistake of telling Tadashi Tokieda about it in a coffeeshop in Islington; he duly went back to Stanford and mentioned it to Brian Conrad, and the next thing I knew Conrad was in my inbox asking me what was all this about crystalline cohomology being wrong. I explained the technical details of the issue, Conrad agreed that there seemed to be a problem, and he went off to think about it. Several hours later he got back to me and pointed out that another, different, proof of the claim that the universal divided power algebra of a module had divided powers was in the appendix to the Berthelot-Ogus book on crystalline cohomology, and that as far as Conrad was concerned this approach should be fine. The proof was back on!

And that is pretty much the end of the story, other than the fact that last month I visited Berkeley and I had lunch with Arthur Ogus, who I’ve known since I was a post-doc there in the 90s. I’d promised Arthur a story of how he’d saved Fermat’s Last Theorem, and over the meal I told him about how his appendix had dug me out of a hole. His response was “Oh! That appendix has several errors in it! But it’s OK, I think I know how to fix them.”

This story really highlights, to me, the poor job which humans do of documenting modern mathematics. There appear to be so many things which are “known to the experts” but not correctly documented. The experts are in agreement that the important ideas are robust enough to withstand knocks like this, but the details of what is actually going on might not be where you expect them to be. For me, this is just one of many reasons why humans might want to consider getting mathematics written down properly, i.e. in a formal system, where the chances of error are orders of magnitude smaller. However most mathematicians are not formalists, and for those people I need to justify my work in a different way. For those mathematicians, I argue that teaching machines our arguments is a crucial step towards getting machines to do it themselves. Until then, we seem to be doomed to fix up human errors manually.

The story does have a happy ending though — two weeks ago Maria Ines gave a talk about formalization of divided powers in the Cambridge Formalization of Mathematics seminar (which was started by Angeliki Koutsoukou-Argyraki a couple of years ago — thanks Angeliki!), and my understanding of Maria Ines’ talk is that these issues have now been sorted out. So we are actually back on track. Until the next time the literature lets us down…


Lean in 2024

A huge amount happened in the Lean theorem prover community in 2023; this blog post looks back at some of these events, plus some of what we have to look forward to in 2024.

Modern mathematics

I personally am a member of the Lean community because of its phenomenal mathematics library mathlib, which was born around six and a half years ago and has now become a very fast-growing database of mathematics, containing many (but by no means all) of the definitions used by modern research mathematicians today. 2023 was the year where we really began to accrue evidence that formalisation of certain parts of serious modern mathematics is feasible in “real time”. Quanta’s video on 2023’s biggest breakthroughs in mathematics talks about several results; let’s take a look at their status when it comes to formalisation in Lean.

The first result in the Quanta video is the result by Campos–Griffiths–Morris–Sahasrabudhe giving an exponential improvement for upper bounds for Ramsey numbers. The previous post on this blog, by Bhavik Mehta, discusses his formalisation of the result; he formalised the 50+ page paper in five months (before the paper had even been refereed).

The second result in the video is the work by Smith, Myers, Kaplan and Goodman-Strauss on aperiodic monotiles. Myers has developed enough of the theory of planar geometry to formalise the solution to a 2019 IMO problem, and announced earlier this week his intention to formalise the result in Lean in 2024.

The third and final result mentioned in Quanta’s round-up was the Kelley–Meka result on bounds for how big a set of positive integers can be before it must contain a 3-term arithmetic progression. A formalisation of the result is being actively worked on by Yael Dillies; they have made substantial progress.

However, when it comes to formalising mathematics in real time, we now have an even more spectacular data point: Terry Tao led a team which formalised the main result in his breakthrough new paper with Gowers, Green and Manners proving Katalin Marton’s polynomial Freiman-Ruzsa conjecture. The 33 page paper was formalised in just three weeks (before the paper had even been submitted)! Tao used Patrick Massot’s Lean blueprint software to make this web page which contains a LaTeX/Lean hybrid document, containing an explanation of the proof which is comprehensible to both a human and a computer. He also made very effective use of the Lean Zulip, with summaries posted every few days of what was ready to be worked upon; ultimately 25 people contributed.

Based on these and other stories coming out of the mathlib community, one can come to the conclusion that there are parts of the modern theory of combinatorics (and in particular additive combinatorics) for which asking for a complete computer formalisation is not an unreasonable thing to do. Mehta and Bloom also formalised Bloom’s proof of the Erdos-Graham unit fractions conjecture, which for me is evidence that some parts of analytic number theory are also becoming amenable to this technique.

Fermat’s Last Theorem

However not all of modern mathematics is ready to be eaten by these systems. For example in September I was awarded a grant by the EPSRC (the UK mathematics funding body) to spend five years of my life formalising a proof of Fermat’s Last Theorem in Lean! Lean has the definitions of elliptic curves and modular forms required to start us off, but we are still working on other definitions, such as automorphic representations, finite flat group schemes and the like. I cannot yet link to a blueprint because the project does not officially start until October 2024 and right now I’m more worried about teaching my undergraduate Lean course, but I am part way through writing one and right now the graph looks like this:

[image: the current state of the Fermat’s Last Theorem blueprint dependency graph.]

People familiar with Tao’s blueprint graph will know that green means “done” and blue means “ready to be done”. The new orange colour means “a long way away”. Already we see the first major bifurcation: we need Mazur’s theorem on the torsion subgroups of elliptic curves over the rationals, a huge project, and completely independent of the R=T theorems we’ll need to push Wiles’ ideas over the line. The route we’ll be taking is not the original Wiles/Taylor-Wiles argument, but a more modern one due to Richard Taylor, who is collaborating with me on the mathematics of the project.

The Focused Research Organisation

In July 2023 the Lean Focused Research Organisation (FRO) was launched. Focused Research Organisations are a new type of nonprofit for science, backed by Convergent Research. Lean’s FRO is funded by the Simons Foundation, Sloan, and Richard Merkin. This has enabled a team of people to work full-time on supporting the ever-growing Lean infrastructure and focus on many things (such as making the mathematics library compile faster, and adding infrastructure and tactics to enhance the user experience: I give some examples below). The FRO is hiring, by the way. One example of the ways that the FRO has made my own life easier: Scott Morrison has developed omega for Lean 4, a tactic for proving obvious-to-mathematicians lemmas about integers and naturals; these things will no longer slow down my undergraduate students.
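
To give a flavour of the kind of goal omega closes instantly, here are a few toy examples of mine (not taken from my course materials):

-- obvious-to-mathematicians facts about naturals and integers, all closed by `omega`
example (a b : Nat) (h : a < b) : a + 1 ≤ b := by omega
example (n : Nat) : n % 3 < 3 := by omega
example (x : Int) (h : 2 * x = 10) : x = 5 := by omega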

In fact July 2023 was an amazing month for the community, because as well as the announcement of the FRO, it marked the end of the project to port all of Lean’s mathematics library from the now end-of-life Lean 3 to Lean 4.

Growing user base in computer science

But let’s get back to 2024. Sometimes it felt to me that Lean 3 was viewed as “a tool for mathematics”, with its mathematics library the flagship project. But both the FRO and other researchers in computer science are contributing code and tools to make Lean much more than this. Between 9th and 12th January 2024, the Lean Together 2024 conference showcased some of what was currently going on. As well as talks by mathematicians on growing the mathematics library, there were plenty of talks by computer scientists on tools or projects which they have been working on with Lean. For example, David Thrane Christiansen from the FRO presented Verso, his domain-specific language for documentation. Static blog posts such as Tao’s Lean 4 proof tour can now be enhanced using this tool. Take a look at Joachim Breitner’s blog post about recursive definitions on the Lean-lang web page: this contains “live” code (hover over it!) and the entire post was generated from a Lean file; if the language updates and the code breaks then a warning will be generated. Perhaps in the future, infrastructure developed using Alex Best’s leaff tool (also presented at the conference) could even be used to suggest fixes.

Other non-mathematical highlights for me at the conference: Emina Torlak gave a presentation about how Amazon are using Lean in their Cedar project; Evgenia Karunus and Anton Kovsharov spoke about Paperproof, their visualisation tool for Lean proofs (which undergraduates seem to really like when I show it off in talks), and Kaiyu Yang spoke about LeanCopilot, an LLM tool trained on Lean’s mathematics library, whose main goal is to generate Lean proofs. I have always been skeptical of people who want to claim “Lean + AI = AGI”; in fact I’ve been maximally on the other side of the argument, with my opinion being that LLMs are right now of essentially no use to me as a working mathematician. But because I thought my undergraduates might like what I’d seen in the talk, I added LeanCopilot as a dependency for the repository I’m using in my undergraduate course, and it really can do a bunch of the problems (sometimes in totally different ways to my model solutions). I also tried it on a question someone asked on the Lean Zulip chat and it solved that too — and quickly. I think it will be interesting to see whether future versions of these tools will actually be of use to me in the Fermat project I mentioned above (which is funded until 2029; who knows what language models will be capable of then).

What else is to come in 2024?

Lean’s widgets offer the user a very flexible way of presenting data interactively and graphically. One application: I am very much hoping that this year will be the year I can start doing high-school-level geometry interactively in Lean. I give talks in schools about the Natural Number Game, but I would like to write some new material about how to interact with statements such as “angle at centre equals twice angle at circumference” (proving it, using it) in Lean: there are some subtleties here which are not usually explained to schoolchildren (such as the difference between oriented and unoriented angles) and which would be exposed by a formal approach.
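As a taste of that subtlety, here is a minimal sketch using mathlib’s existing unoriented angles (standard mathlib declarations as I understand them, not the school material I have in mind): the unoriented angle ∠ A B C takes values in [0, π] and cannot tell clockwise from anticlockwise, whereas the clean general statement of “angle at centre equals twice angle at circumference” is about oriented angles, working modulo 2π.

```lean
import Mathlib

open EuclideanGeometry Real

-- Unoriented angles are symmetric in their outer arguments and bounded by π;
-- oriented angles keep track of the direction of turning as well.
example (A B C : EuclideanSpace ℝ (Fin 2)) : ∠ A B C = ∠ C B A :=
  angle_comm A B C

example (A B C : EuclideanSpace ℝ (Fin 2)) : ∠ A B C ≤ π :=
  angle_le_pi A B C
```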

As far as mathematics goes, my team at Imperial are currently working on Tate modules, Selmer groups, the local and global Langlands program, commutative algebra, Lie algebras and more. When I start working full-time on Fermat’s Last Theorem this will push the development of more of this kind of mathematics in Lean. There seems to be a coherent plan developing on the Lean Zulip for getting a basic theory of L-functions formalised (thanks to David Loeffler, Michael Stoll and others); one natural goal would be results like Chebotarev’s density theorem, which is used in Fermat. In 2023 Heather Macbeth formalised the definition of Riemannian manifolds; I feel like differential geometry is another area which now has huge potential for growth. The FRO have made it clear that they want to support this push for more mathematics in Lean; any mathematician interested in joining the community might want to start by working through the free online textbook Mathematics in Lean.

For plans in computer-science-related topics in 2024, I would invite you to consult the FRO roadmap. One thing I like about that page is the broad vision which the FRO has: one of the goals is “enhancing its interface and functionality to cater to a wide range of users, from mathematicians and software developers to verification engineers, AI researchers, and the academic community.”

As always, the Lean community web pages are another place where people can begin to learn about Lean (including installation instructions and so on); anyone who wants to try a fast, interactive, dependently typed programming language for their project, or who wants to formalise some mathematics, is welcome to jump in and ask questions on the Zulip chat. Full names are preferred, and the two rules which have emerged in our community are: stay on topic, and be nice. Beginner questions are welcome in the #new-members stream.

Posted in Uncategorized | Tagged Artificial Intelligence, math, mathematics

Formalising modern research mathematics in real time

(This is a guest post by Bhavik Mehta)

On March 16, 2023, a paper by Campos, Griffiths, Morris, and Sahasrabudhe appeared on the arXiv, announcing an exponential improvement to the upper bound on Ramsey numbers, an open problem since 1935. Around the same time, posts by Terence Tao, Timothy Gowers and Gil Kalai appeared, all discussing this announcement. These were soon followed by articles in Quanta, Nature and New Scientist. The new paper was clearly exciting news.

About six months after this, I finished formally verifying the titular result of Campos, Griffiths, Morris, and Sahasrabudhe’s beautiful paper in the Lean Theorem Prover. In this post I’ll discuss the gap between informal mathematics and formalised mathematics, demonstrating that very recent and involved arguments in combinatorics can often be formalised in real time. The code for this formalisation is available here: https://github.com/b-mehta/exponential-ramsey/blob/main/src/main_results.lean; this specific file gives an overview of the relevant definitions and theorems. Another version can be found in a branch of Lean’s mathlib, which additionally shows that the proof has been checked by Lean, and further verified by an external type-checker.

This isn’t the first recent result in combinatorics to be formalised in Lean soon after its announcement: together with Thomas Bloom I formalised a 2021 paper of his within a year of it being uploaded, and Ellenberg-Gijswijt’s 2016 result giving upper bounds on the cap-set problem (which later appeared in the Annals) was formalised by Dahmen, Hölzl, and Lewis in 2019. Of course, no post about formalisation in combinatorics would be complete without a mention of the four colour theorem, proved in 1976 (with the final gaps closed in 1989) and formalised in Coq in 2005.

5 years, 50 pages, 5 months

Quite unlike the remarkably short proof of the Ellenberg-Gijswijt result on cap-sets, the new result here has a particularly involved proof which was a 5-year-long project of the authors, and the paper comes to over 50 pages in total. In comparison, the formalisation took me about 5 months, including prerequisites, and I ended up with around 20k lines of Lean code.

For the sake of comparison with other recent large-scale formalisations, the Liquid Tensor Experiment took just over 18 months with a large amount of person-hours, and the portion of the lecture notes that was formalised was around 10 pages; although that project also had a massive amount of prerequisites to formalise. The cap-set proof (4 pages long) similarly had a large proportion of prerequisites to deal with, as that project was written at a time when Lean’s mathlib had little support for polynomials, which is of course necessary for a proof which uses the polynomial method.

The formalised exponential Ramsey proof also relies on mathlib, most heavily on the support for manipulating finite sets, but also, vitally, on asymptotic estimates, since almost all the statements are about “sufficiently large” naturals. Another key aspect of the proof is probabilistic constructions, in which we show the existence of an object because it has non-zero probability of appearing if we try to make it randomly. Fortunately, the probability spaces needed here are all finite, so they can be formally represented by sums over finite sets.[1]
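At its core, the finite version of the probabilistic method reduces to a statement about sums, and that reduction is short to state in Lean. The following is a minimal sketch of my own, using standard lemmas from a recent mathlib, and is not code from the actual project:

```lean
import Mathlib

-- If a real-valued quantity has positive sum over a finite set, then some
-- element contributes a positive amount: this is the shape of
-- "an object with non-zero probability of appearing must exist".
example {α : Type*} (s : Finset α) (f : α → ℝ)
    (h : 0 < ∑ x ∈ s, f x) : ∃ x ∈ s, 0 < f x := by
  by_contra hc
  push_neg at hc                          -- hc : ∀ x ∈ s, f x ≤ 0
  have hle : ∑ x ∈ s, f x ≤ 0 := Finset.sum_nonpos hc
  linarith
```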

The end of the proof requires surprisingly precise numerical approximations, as the final part of the argument fundamentally relies on the fact that the red and blue regions in the diagram below don’t intersect. In the original paper, it is enough to say that this can be checked by computer (though the authors give suggestions towards a slightly more detailed proof as well), but formally some more careful analysis is needed.

[Figure: a graph of a red and blue region which are very close to intersecting.]
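To illustrate what “more careful analysis” looks like in practice, here is a hedged, self-contained example of the flavour of numerical goal one ends up discharging formally; the lemma name and the particular numbers are my own illustration and are not taken from the Ramsey project:

```lean
import Mathlib

-- mathlib ships explicit decimal bounds on standard constants, and
-- norm_num deals with the surrounding rational arithmetic.
example : Real.exp 1 < 2.7182818286 := Real.exp_one_lt_d9

example : (1.414 : ℝ) ^ 2 < 2 := by norm_num
```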

In terms of the global structure of the proof, the formal version follows the informal version closely, which is a testament both to the clarity of the paper and to the strength of Lean and mathlib in formalising practical arguments. In particular, no blueprint was needed for this formalisation, and it was in some sense done “on the fly”. I understood each part of the proof as I came to translate it, rather than writing a separate semi-formal version first. Indeed, when I started, my goal wasn’t to verify the entire paper. It was just to understand the argument because I thought it was an interesting result!

Mathematical typos

An obvious question when formalising a maths paper is how much the proof needed to change in order to be formalised: in other words, whether there were any flaws in the paper. This question is surprisingly subtle, but a simple summary is that most of the mistakes I identified were easily fixable. Importantly, all mistakes were fixable in a way that preserved the spirit of the argument, so the Lean version of the proof can reasonably be said to be a translation of the informal proof.

I like to consider certain such small mistakes “mathematical typos”. It’s fair to call them errors, and just as typos in English prose can confuse a non-native speaker, mathematical typos can impede meaning for a reader who is less familiar with the material. However, with a little experience, it’s not hard to see what the intended meaning was, and fixing it isn’t too bad. A mathematical typo certainly doesn’t constitute a flaw in the proof, and almost all the problems I found fall into this category.

That said, there were two places where I needed some help from the authors to figure out how to fix something from the paper, but it still remains true that the spirit of the argument was unchanged.[2]

What are Ramsey numbers?

The existence of Ramsey numbers was shown by Frank Ramsey in a 1930 paper whose goal was to show decidability of a particular fragment of first-order logic. Shortly afterward, in 1935, a paper of Erdős and Szekeres popularised Ramsey numbers, giving rise to the fundamental branch of combinatorics now known as Ramsey theory. Ramsey theory itself can fill entire lecture series and textbooks, but in this post I’ll focus on the simplest and most well-studied case: two-colour Ramsey numbers.

The definition of two-colour Ramsey numbers is fairly easy to give. For any numbers k and l, R(k, l) is the smallest number of people you need at a party to guarantee that you can find k people who all know each other or l people who all don’t know each other. More mathematically, for any red/blue colouring of the edges of the complete graph on R(k, l) vertices, you can find k vertices with only red edges between them or l vertices with only blue edges between them.
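This definition is short to write down in Lean. The following is a hedged sketch in mathlib’s language (my own names and phrasing, not the definitions used in the actual formalisation), encoding the red edges as a simple graph G and the blue edges as its complement:

```lean
import Mathlib

-- "N is Ramsey for (k, l)": every graph on Fin N contains a k-clique, or
-- its complement contains an l-clique.
def RamseyProp (N k l : ℕ) : Prop :=
  ∀ G : SimpleGraph (Fin N),
    (∃ s : Finset (Fin N), G.IsNClique k s) ∨
    (∃ s : Finset (Fin N), Gᶜ.IsNClique l s)

-- R(k, l) would then be the least such N; Ramsey's theorem is the statement
-- that some such N exists, so that this infimum is attained.
noncomputable def ramseyNumber (k l : ℕ) : ℕ :=
  sInf {N | RamseyProp N k l}
```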

Erdős and Szekeres in their 1935 paper gave an upper bound of R(k, l) \leq \binom{k + l - 2}{k - 1} (implying, since \binom{2k - 2}{k - 1} \leq 2^{2k - 2}, that the diagonal Ramsey number R(k, k) can be upper-bounded by 4^k). The research community then worked on decreasing this upper bound, and in 1988 the first polynomial improvement was found by Thomason. His method was taken further by Conlon in 2009 and later by Sah, eventually reaching a bound of the form R(k, k) \leq 4^{k - (\log k)^2}. Conlon’s paper, giving the first super-polynomial improvement to the Erdős-Szekeres bound, was especially noteworthy and appeared in the Annals of Mathematics. But despite all of these successes, the question of decreasing the base of the exponent in the bound remained open: is there a constant C < 4 for which R(k, k) \leq C^k?

It is worth saying something briefly about the lower bound for R(k, k). It was first shown by Erdős in 1947 that R(k, k) \geq \sqrt{2}^k, using a probabilistic argument which was itself groundbreaking and gave rise to the field of probabilistic combinatorics. While lower-order improvements have been made to this, the base \sqrt{2} has not been improved, leaving a wide gap between the bases \sqrt{2} and 4 of the lower and upper bounds.
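To give a flavour of that 1947 argument (my own one-line paraphrase, not part of the original post): colour each edge of the complete graph on n vertices red or blue independently at random; the expected number of monochromatic sets of k vertices is \binom{n}{k} 2^{1 - \binom{k}{2}}, which is less than 1 once n \leq \sqrt{2}^k (for k \geq 3), so some colouring has no monochromatic set of k vertices at all, and hence R(k, k) > n.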

Finally, in March 2023, the paper of Campos, Griffiths, Morris and Sahasrabudhe was made public, giving the desired exponential improvement, a massive strengthening of Conlon’s result. They show that there is a constant c > 0 such that for all k, we have R(k, k) \leq (4 - c) ^ k. This paper is in the publication process, and this is the result which I formalised.

As part of this formalisation project, I additionally formalised a general form of Erdős’s lower bound and gave some upper and lower bounds for the off-diagonal Ramsey numbers R(3, k).[3]

“Suppose aliens invade the earth and threaten to obliterate it in a year’s time unless human beings can find the Ramsey number for red five and blue five. We could marshal the world’s best minds and fastest computers, and within a year we could probably calculate the value. If the aliens demanded the Ramsey number for red six and blue six, however, we would have no choice but to launch a preemptive attack.” – Paul Erdős, relayed by Graham and Spencer.

This famous quote of Erdős discusses the difficulty of computing R(5, 5) and R(6, 6), both of which are still unknown, and the main result of this post doesn’t get us any closer. Nonetheless, as part of this project, I formalised bounds on some of the known small values of Ramsey numbers, including R(4, 4).[4]

What’s next?

To get to the primary result – an exponential improvement on the upper bound for Ramsey numbers – the authors give two proofs, the second providing a slightly better constant. I formalised only the first, so a future project could be to improve the constant in the Lean proof. This should be easier now because a large portion of the technical difficulties have been overcome. Furthermore, any future claimed improvements to this bound should be much quicker to verify, since the framework built for this paper can be extended.

This formalisation was done in Lean 3, for multiple reasons, but most importantly because, when I started, the prerequisite mathlib lemmas were not yet available in Lean 4. I’m in the process of trying to port the project to Lean 4.[5]

Finally, if you want to get involved in the Lean community, contribute to mathlib, use Lean to check some part of your research, or just try to understand a new argument like I did, the community website is a great place to start.


  [1] While mathlib does have support for probability spaces in general, working with them in this case ultimately requires viewing them as finite sums anyway, so for convenience it’s simpler here to avoid the formal probability world.
  [2] For brevity I’ll avoid too much detail, but one example is in the proof of Theorem 9.1, where we cannot quite guarantee the red density is as large as claimed. Allowing for some small error term in \eta, however, the proof can be recovered using the fact that |U| can be made large enough.
  [3] The bounds on the off-diagonal Ramsey numbers I formalised aren’t the best-known ones; there’s really exciting recent work on lower bounds for R(3, k) and R(4, k) which I’d love to see formalised.
  [4] The problem of determining Schur numbers falls into Ramsey theory, and Schur numbers are also difficult to compute. But a recent result of Marijn Heule computes Schur number five using SAT solvers, and checks its result with a proof checker which has itself been formally verified in ACL2. It would be really exciting to see similar methods applied to Ramsey number five as well!
  [5] It should – in principle – be possible to port the formalisation to Lean 4, but at the moment it seems more difficult than one might hope.
Posted in Research formalisation | Tagged combinatorics, lean