I wrote 98 posts on my blog in July (that archive page was recently enhanced using OpenAI Codex). Here's your sponsors-only summary of the most important trends and highlights from the past month.
I've been spending a lot of time with Claude Code this month. I published a video showing how I used `claude --dangerously-skip-permissions` to add an automated table of contents to a README. I also wrote about using Claude Code to write, compile and run Mandelbrot in x86 assembly in a Docker container.
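For anyone who hasn't tried that pattern, here's a rough sketch of the kind of non-interactive invocation involved (the prompt is illustrative, not the exact one from the video). Since that flag skips every permission prompt, it's safest to run it in a fresh checkout or a disposable container:

```bash
# -p runs Claude Code non-interactively against a prompt;
# --dangerously-skip-permissions lets it edit files and run
# commands without asking for confirmation each time
claude --dangerously-skip-permissions \
  -p "Add a table of contents to README.md, derived from its headings"
```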
Working with Claude Code led me to the following idea:
Something I've realized about LLM tool use is that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.
The challenge then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
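Here's an entirely hypothetical sketch of that loop - the agent-sandbox Docker image is imaginary, and a test suite stands in for the success criteria:

```bash
# Hypothetical: "agent-sandbox" is an imaginary image with the project's
# toolchain installed. The test suite is the success criterion; the loop
# brute-forces the problem by re-running the agent until the tests pass.
while ! pytest -q; do
  docker run --rm -v "$PWD":/work -w /work agent-sandbox \
    claude --dangerously-skip-permissions \
    -p "Run the tests, read the failures and edit the code until they pass"
done
```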
I've also been experimenting a lot with OpenAI Codex - the tool that runs online (via the ChatGPT app) and files PRs against your code, as opposed to their Codex CLI tool, which is their version of Claude Code. I wrote about my most substantial experiment with that in Vibe scraping and vibe coding a schedule app for Open Sauce 2025 entirely on my phone.
There were so many new models released this month!
Grok 4 came out, followed by some embarrassing revelations - most notably that Grok would run a search for tweets `from:elonmusk` when asked for its opinion on controversial topics! This was fixed shortly afterwards by an update to the system prompt.
Google released Gemini 2.5 Flash-Lite, the least expensive model in their Gemini 2.5 family.
Mistral released their first audio-input models, Voxtral Small and Voxtral Mini. They also published detailed figures on their environmental impact and released an updated Codestral code autocomplete model.
It was a huge month for open weight models from Chinese AI labs. I wrote about the following:
- Moonshot Kimi-K2-Instruct - 11th July, 1 trillion parameters
- Qwen Qwen3-235B-A22B-Instruct-2507 - 21st July, 235 billion
- Qwen Qwen3-Coder-480B-A35B-Instruct - 22nd July, 480 billion
- Qwen Qwen3-235B-A22B-Thinking-2507 - 25th July, 235 billion
- Z.ai GLM-4.5 and GLM-4.5 Air - 28th July, 355 and 106 billion
- Qwen Qwen3-30B-A3B-Instruct-2507 - 29th July, 30 billion
- Qwen Qwen3-30B-A3B-Thinking-2507 - 30th July, 30 billion
- Qwen Qwen3-Coder-30B-A3B-Instruct - 31st July, 30 billion
These are all excellent models. I've been able to run the GLM-4.5 Air and Qwen-30B models on my 64GB M2 MacBook Pro laptop and I have been astonished at how useful they are. I started using a new benchmark, "Write an HTML and JavaScript page implementing space invaders", and got working games in a single shot from both GLM-4.5 Air and Qwen3-Coder-30B running directly on my own machine.
I wrote about those two in more detail, with extensive notes on how I ran them (a sketch of that kind of invocation follows the links below):
- My 2.5 year old laptop can write Space Invaders in JavaScript now, using GLM-4.5 Air and MLX
- Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM
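For a flavour of what running that benchmark locally looks like, here's a sketch using LLM with the llm-mlx plugin. The mlx-community model identifier here is an assumption - see the posts above for the exact quantizations I used:

```bash
# Install the MLX plugin, fetch a local model, then run the benchmark prompt.
# -x extracts the first fenced code block from the model's response.
llm install llm-mlx
llm mlx download-model mlx-community/GLM-4.5-Air-3bit
llm -m mlx-community/GLM-4.5-Air-3bit -x \
  'Write an HTML and JavaScript page implementing space invaders' \
  > invaders.html
```

Open invaders.html in a browser to see how the model did.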
There are two interesting trends here. First, it's now possible to run genuinely useful coding models directly on a high-end (32GB or 64GB) developer laptop. Second, the Chinese AI labs are now undeniably producing the best available open weight models.
OpenAI's open weight model is rumored to show up any day now. It has some substantial competition!
The IMO is the International Mathematical Olympiad, an annual mathematical competition for high school students that's been held since 1959. It's long been a goal of AI labs to produce a model that can compete in this contest at a high level.
This year two teams achieved gold medal performance at the IMO: one from OpenAI and one from Google DeepMind.
OpenAI announced first and got a lot of press coverage. The Gemini team announced later and there was then some dispute between the two teams about whether their announcement timings were compatible with the guidelines set out by the IMO themselves.
Both models scored the same, solving 5 of the 6 problems (the unsolved 6th was also the hardest for the human contestants). Notably, neither of the gold medal models had access to tools or internet search - they were able to reason through the problems using their model weights alone.
Google just released Gemini Deep Think for their $249.99/month Google AI Ultra subscribers - a close relative of the model that they used for the IMO.
One of the best ways I know of to level up as a prompt engineer is to reverse engineer the system prompts of other products and see how they work.
I wrote up three of those explorations in detail this month:
- Using GitHub Spark to reverse engineer GitHub Spark - GitHub Spark is GitHub's new prompt-to-app platform, and it has a fascinating system prompt which includes multiple paragraphs of instructions on how to implement good design.
- OpenAI's new study mode is a mode of ChatGPT designed to help you study without doing your homework for you - and it's implemented entirely as a system prompt.
- Reverse engineering some updates to Claude looks at two new Claude features - "create calendar event/create message" and "Upload PDFs, images, code files, and more to AI-powered apps" - and uses the system prompt to help explain what they are and how they work.
- My daily drivers have remained the same as last month: Claude Sonnet 4 for most things, OpenAI o3 for search and research tasks, both through their respective apps and websites.
- I've fully switched to Zed as my editor, because it uses so much less memory than VS Code.
- I'm running Claude Code a lot. I've also started tinkering with OpenAI's equivalent, codex-cli, to run Claude Code style tasks with their models.
- I continue to use my own LLM tool for other command-line tasks, defaulting to GPT-4.1 but often reaching for Gemini 2.5 Pro and o3 for harder tasks (see the short example after this list).
- The only time I use GPT-4o is for advanced voice mode. I wish they'd upgrade that to use a more powerful model!
- For local models I've been leaning more on LM Studio, especially now that they've changed their policy to allow commercial use of their free desktop app. I also still run Ollama for those, and frequently dabble with mlx-lm as well.
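To illustrate that LLM command-line workflow from the list above: assuming the relevant plugins and API keys are already configured (model IDs can vary with plugin versions), day-to-day usage looks roughly like this:

```bash
# Set the everyday default model, then override it per-invocation
llm models default gpt-4.1
llm 'Undocumented features of the sqlite3 command-line tool'
llm -m gemini-2.5-pro 'A harder reasoning task goes here'
llm -m o3 'Another hard task, this time using OpenAI o3'
```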
- If you're interested in LLM evals, Frequently Asked Questions (And Answers) About AI Evals by Hamel Husain and Shreya Shankar is essential reading.
- Another example of the lethal trifecta: Supabase MCP can leak your entire SQL database.
- Christopher Smith put together a delightful video introduction to my LLM tool: Become a command-line superhero with Simon Willison's llm tool.
- A paper on LLM programming productivity came out that got a lot of coverage: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. They found that developers using LLMs frequently over-estimated their productivity gains and often worked slower, not faster. Here are my own notes on that paper.
- Django celebrated its 20th birthday! I published an annotated version of a talk I gave on the 10th birthday about Django Origins.
If this newsletter was useful, feel free to forward it to friends who might find it useful too - especially if they might be convinced to sign up to sponsor me for the next one!
Thanks for your support,
Simon Willison https://simonwillison.net/
I'm available for consulting calls over Zoom or similar, you can contact me at contact@simonwillison.net
I also offer private remote workshops for teams, of both my Building software on top of Large Language Models workshop and a new workshop on Writing code with LLMs.