Hi! I’m Huajian Xin (辛华剑), a Ph.D. student in Artificial Intelligence at the University of Edinburgh, where I’m supervised by Prof. Wenda Li. My research explores how large language models can learn to reason in the rigorous world of formal mathematics – and beyond. In short: I want to build AI systems that don’t just sound right, but can actually prove themselves right.
At the moment, I’m also working as a student researcher at ByteDance’s Seed Team in London. There, I’ve had the privilege of contributing to Seed-Prover, which recently reached silver medal–level performance at the International Mathematical Olympiad (IMO) 2025, a milestone that still feels a little surreal to me.
Before this, I interned at DeepSeek AI (where I worked on the DeepSeek-Prover series), and earlier, I studied Philosophy (Logic) at Sun Yat-sen University with a minor in Mathematics. Along the way, I’ve been lucky to collaborate with inspiring mentors including Xiaodan Liang, Daya Guo, Chong Ruan, and Zheng Yuan.
Recently, I was honored to be interviewed by Overseas Unicorn (海外独角兽) about my research journey and the broader vision I call Certified AI. If you’re curious, you can read it here.
If you’re wondering how to pronounce my name: “Xin” is like sheen and “Huajian” is roughly hwa-jyen.
You can find me on Google Scholar, GitHub, or my CV. And of course, feel free to drop me an email if you’d like to chat or collaborate.
Research
Motivation: A Journey Towards Certified AI
Artificial intelligence is often said to be entering the “Era of Experience” (a phrase coined by David Silver and Richard Sutton). I find this description apt: we’ve reached the limits of training purely on static human data, and the next breakthroughs will come from systems that learn by interacting with environments, generating their own data, and receiving constant feedback.
My conviction is that formal mathematics offers the perfect proving ground for this shift. Proof assistants like Lean are not just symbolic calculators—they define a miniature universe where every object and relation is rigorously specified, every inference is computable, and every step can be mechanically verified against a trusted kernel.
In other words, formal systems give us more than “rich symbolic structures.” They provide an idealized, structured world model: a simulation of reasoning itself, as clean and self-consistent as the idealized models physicists use to study nature. Within this simulated universe, an AI agent can experiment endlessly, receive dense and reliable feedback at every step, and accumulate experience without ambiguity or noise.
This makes formal mathematics uniquely suited as the training ground for what I call Certified AI—systems that move beyond plausible outputs to provably correct reasoning, capable of checking, reflecting on, and repairing their own thought processes.
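To ground this in something concrete, here is the smallest possible illustration of that feedback loop in Lean 4. The statements are toy examples of my own; the point is only that every proof is elaborated into a term and checked by the trusted kernel, so a flawed argument simply fails to compile.

```lean
-- A toy theorem in plain Lean 4 (no external libraries).
-- `Nat.le_add_right : n ≤ n + k` is a lemma from Lean's core library.
theorem two_le_two_add (n : Nat) : 2 ≤ 2 + n :=
  Nat.le_add_right 2 n

-- Purely computational facts go through the same kernel check.
example : 2 + 2 = 4 := rfl

-- Changing either statement to something false (say, `2 + 2 = 5`) is rejected
-- at compile time: the feedback is immediate, dense, and unambiguous.
```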
Looking Back: What Has Moved the Needle in LLMs for Formal Theorem Proving
When I look back at the past few years, I see that most of the real progress in large language models for formal theorem proving has come from two directions: better data synthesis and better task abstractions. These shifts didn’t just incrementally improve performance—they redefined how we think about training and evaluating theorem-proving models.

Pass rate on the miniF2F benchmark over time. The sharp acceleration after May 2024 reflects advances in data synthesis and task abstraction.
First, data synthesis through autoformalization.
Relying only on existing proof libraries like Mathlib soon runs into limitations. The real breakthrough was learning to generate new data by autoformalizing math problems from natural language into formal statements and proofs. This made it possible to curate large, domain-specific datasets that match the challenges of competition-style mathematics. With this richer supervision, methods like rejection-sampling fine-tuning and reinforcement learning from verifier feedback became far more effective.
→ This turning point was exemplified by systems such as DeepSeek-Prover V1 and the Lean Workbook project, which both showed how shifting from “using what’s available” to “creating what’s needed” could meaningfully push the frontier.
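To make “autoformalization” concrete, here is a toy example of the translation involved. The informal problem, the declaration names, and this particular formalization are all my own illustration (not drawn from any specific dataset), and the proof assumes a recent Mathlib for `ring` and `Nat.even_or_odd`.

```lean
import Mathlib

-- Informal problem (competition warm-up):
--   "Show that the product of two consecutive natural numbers is always even."
--
-- One possible autoformalized statement; `Even (n * (n + 1))` or `2 ∣ n * (n + 1)`
-- would be equally reasonable formal targets.
theorem consecutive_mul_even (n : ℕ) : ∃ k, n * (n + 1) = 2 * k := by
  -- A prover model would now have to search for this proof, with the Lean checker
  -- supplying the reward signal; here a simple parity case split happens to work.
  rcases Nat.even_or_odd n with ⟨m, hm⟩ | ⟨m, hm⟩
  · refine ⟨m * (n + 1), ?_⟩   -- n is even: n = m + m
    rw [hm]
    ring
  · refine ⟨n * (m + 1), ?_⟩   -- n is odd
    rw [hm]
    ring
```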
Second, task abstraction beyond step-by-step prediction.
Early systems treated theorem proving as predicting only the next tactic given a local goal, but this view is too narrow. Stronger results came from reframing the task as whole-proof generation, and eventually as lemma-driven reasoning—where the agent proposes, proves, and reuses lemmas in a growing knowledge base. This abstraction is closer to how mathematicians actually work and gives models a medium-term memory for strategy.
→ Advances like LEGO-Prover and later Seed-Prover demonstrated how lemma-centric approaches, combined with test-time refinement strategies, could enable both deeper reasoning chains and broader problem coverage, culminating in results strong enough to reach silver medal–level performance at IMO 2025.
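As a small sketch of what “propose, prove, and reuse lemmas” means in practice (all names and statements below are toy examples I made up, assuming a recent Mathlib for `ring`, `linarith`, and `sq_nonneg`):

```lean
import Mathlib

-- Step 1: the agent conjectures a small bridging lemma and proves it once.
lemma sq_add_expand (a b : ℝ) : (a + b) ^ 2 = (a - b) ^ 2 + 4 * (a * b) := by
  ring

-- Step 2: the lemma enters a growing library and is reused while attacking the
-- actual target, here the classic bound 4ab ≤ (a + b)².
theorem four_mul_le_sq_add (a b : ℝ) : 4 * (a * b) ≤ (a + b) ^ 2 := by
  rw [sq_add_expand]
  have h : 0 ≤ (a - b) ^ 2 := sq_nonneg (a - b)
  linarith
```

The decomposition is trivial here, but the mechanism is the point: once `sq_add_expand` exists, it can be retrieved for any later problem that needs the same algebraic bridge, instead of being rediscovered inside every proof attempt.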
Beyond Solving Theorems: Toward Automated Proof Engineering
As striking as the recent progress in theorem proving has been, it is important to acknowledge its limits. Most benchmarks still resemble math competitions: a collection of isolated problems, each awaiting a clever proof. Real-world formal mathematics is nothing like this. It is a living ecosystem, more akin to a software project than a problem set.
This has long been recognized. The QED Manifesto in 1994 already articulated the dream of a computer-verified repository of all significant mathematics. In practice, projects such as Flyspeck and the Archive of Formal Proofs have shown what it takes: hundreds of contributors, millions of lines of formal code, and years of collective effort to maintain a coherent body of knowledge. Day-to-day progress is less about solving a single theorem than about adding new material, refactoring old proofs, harmonizing interfaces, and keeping the entire library consistent as it evolves. In short, the real bottleneck is not proving per se, but engineering.
This trajectory mirrors the history of software itself. Early programming was about getting individual algorithms correct; today, software engineering is about maintaining systems at scale, with tools for modularity, refactoring, testing, and continuous integration. Formal mathematics is on the same path. The challenge is not just “can we solve this problem?” but “can we sustain a growing repository of verified knowledge?”
It is in this context that I propose Automated Proof Engineering (APE) as the next frontier. Instead of agents that merely close a goal, we need agents that can edit, repair, and extend large proof libraries—interpreting tasks, navigating dependencies, and landing changes that keep the whole ecosystem healthy. This is a natural continuation of the shifts that have already advanced theorem proving: if data synthesis taught us to create the right problems, and task abstraction taught us to reuse knowledge, then the next step is to embed these abilities within the maintenance of libraries themselves.
To make this vision concrete, I introduced APE-Bench I, the first benchmark built from the actual commit history of Mathlib4. Unlike competition-style datasets, APE-Bench I evaluates whether an AI can perform realistic proof-engineering tasks: adding a feature, fixing a bug, refactoring a file—exactly the kind of work that human contributors do every day.

APE-Bench I benchmark structure and evaluation pipeline. Real-world Mathlib commits are transformed into tasks with natural-language instructions, providing a realistic environment for testing proof-engineering agents.
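To give a flavor of what such a task looks like, here is a deliberately tiny, made-up example in the spirit of APE-Bench I (it is not an actual benchmark item, and the declaration names are invented). The natural-language instruction might read: “Generalize `double_add` from ℕ to an arbitrary commutative semiring, keeping downstream uses compiling.”

```lean
import Mathlib

-- Before the edit: the lemma is stated only for ℕ.
theorem double_add (a b : ℕ) : 2 * (a + b) = 2 * a + 2 * b := by
  ring

-- After the edit: the declaration is generalized over its typeclass assumptions.
-- In the real task the original would be rewritten in place; a primed name is
-- used here only so that both versions compile side by side in this sketch.
theorem double_add' {R : Type*} [CommSemiring R] (a b : R) :
    2 * (a + b) = 2 * a + 2 * b := by
  ring
```

Note how the unit of success shifts: it is no longer “a single goal was closed,” but “an edit landed and the surrounding library still elaborates,” which is what makes proof engineering a different problem from competition proving.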
Looking ahead, I see APE-Bench I as just the first step in a staged agenda. The broader APE-Bench series is designed to progressively increase realism and difficulty: from single-file edits (APE-Bench I), to multi-file coordination (APE-Bench II), and ultimately to autonomous, end-to-end workflows (APE-Bench III).

In my view, this marks the beginning of a shift from theorem proving as problem solving to proof engineering as codebase evolution. Ultimately, I see Automated Proof Engineering as an essential step toward Certified AI: systems that can not only generate correct proofs, but also sustain and grow entire libraries of verified knowledge.