I published 85 posts on my blog in August. Here's your sponsors-only summary of the most important trends and highlights from the past month.
OpenAI shipped GPT-5 on the 7th of August. It was a bit of a rocky release.
The model itself is excellent, and is now my daily driver. I see it as a material improvement for everything I care about: it's better at running searches, better at writing code and better at reasoning through problems than the o3 and GPT-4o models it replaces.
I wrote about Key characteristics, pricing and model card on launch day. The API pricing is extremely competitive with other providers - I included a full comparison table (covering Claude, Gemini, Grok and others) in the post, but here are the key numbers:
- GPT-5: $1.25/million for input, $10/million for output
- GPT-5 Mini: $0.25/m input, $2.00/m output
- GPT-5 Nano: $0.05/m input, $0.40/m output
That's half the input price of GPT-4o for a much better model, plus two new even lower-cost options.
If you're building your own applications on top of LLM APIs you should absolutely be evaluating the GPT-5 family of models.
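To get a rough sense of what those prices mean in practice, here's a quick back-of-envelope cost calculation using the numbers above (the token counts are just illustrative):

```python
# Launch list prices in USD per million tokens: (input, output)
PRICES = {
    "GPT-5": (1.25, 10.00),
    "GPT-5 Mini": (0.25, 2.00),
    "GPT-5 Nano": (0.05, 0.40),
}

def cost_usd(model, input_tokens, output_tokens):
    """Dollar cost of a single call at the listed prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 10,000 token prompt that produces a 1,000 token response
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 10_000, 1_000):.4f}")
# GPT-5: $0.0225, GPT-5 Mini: $0.0045, GPT-5 Nano: $0.0009
```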
The launch was marred by a few significant problems.
The ChatGPT edition of the model uses a new router mechanism which sends different prompts to different underlying models based on their perceived complexity. On launch day this router apparently included bugs which meant most prompts went to the weaker model - which may be even less capable than GPT-4o - giving people a terrible first impression of the new model family.
This was quickly fixed, but the router is still a very disruptive change to how people use models. Previously users had to pick between GPT-4o, o3, o4-mini and others - a meaningless choice for most. Now they get a single "model" which might give them a better or worse response based on what is effectively a roll of the dice!
Thankfully OpenAI changed direction quite soon after launch, adding back the model picker for GPT-5 Auto vs. Instant vs. Thinking. I use "Thinking" for almost everything now and it works very well.
A much louder problem was that OpenAI removed access to GPT-4o entirely at the same time as launching GPT-5! I wrote about this in The surprise deprecation of GPT-4o for ChatGPT consumers. This caused uproar online, as it turned out a significant number of people had fallen in love (sometimes even literally) with GPT-4o and were devastated to lose access to it. This likely relates to the fixes OpenAI applied to make GPT-5 less sycophantic: it turns out some people really liked the older model's more agreeable personality.
OpenAI responded by adding GPT-4o access back again as an option for paying accounts.
The final problem with the launch was the usual problem of over-hype. GPT-5 is a material improvement over GPT-4o, but it's not some incredible new AGI breakthrough. It's still an LLM with the usual limitations, vulnerable to the same kinds of trick questions, hallucinations and adversarial prompting patterns.
OpenAI's launch post called it "like chatting with a helpful friend with PhD‑level intelligence", which didn't look great once people started picking apart their own results from the model.
I had two weeks of preview access to the model in exchange for appearing in a GPT-5 launch video in which Claire Vo, Theo Browne, Ben Hylak, Shawn @swyx Wang and I were filmed exploring the model together for the first time. With hindsight this made it harder for me to evaluate the model that actually launched, since the preview models changed multiple times during that two week period.
The dust has settled over the past three weeks, and the consensus I'm seeing is that GPT-5 really is a very solid improvement on what came before it. I think it's likely the best available model for most purposes, especially if you know when and how to navigate to the "thinking" variant. The value it provides for API users is particularly impressive.
Just three days before GPT-5, OpenAI released their first open weight models since GPT-2 back in 2019 - and they're genuinely open source, licensed under Apache 2.
I wrote about these in OpenAI’s new open weight (Apache 2) models are really good.
The sizes they chose are particularly interesting. 120B is a bit too big to run on my 64GB M2 MacBook Pro but still fits on a single server-class 80GB GPU. 20B fits very nicely on my laptop, using just 11.72GB of RAM when run with LM Studio, which means I still have plenty of memory left for other applications.
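That 11.72GB figure lines up with a rough back-of-envelope estimate, since OpenAI ship the gpt-oss models with their mixture-of-experts weights natively quantized to MXFP4. The ~4.25 bits per parameter below is my own approximation including scaling factors - a sketch, not an exact figure:

```python
# Rough memory estimate for gpt-oss-20b - a sketch, not an exact figure
params = 21e9          # gpt-oss-20b has roughly 21B total parameters
bits_per_param = 4.25  # approximate effective size under MXFP4 quantization

weight_gb = params * bits_per_param / 8 / 1024**3
print(f"~{weight_gb:.1f} GB")  # ~10.4 GB for weights, plus KV cache and overhead
```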
I like the 20B model a lot. It's my new favourite local model, and is capable enough that I can use it for useful code-adjacent tasks.
I wrote a few more things about it:
- Open weight LLMs exhibit inconsistent performance across providers highlighted how Artificial Analysis had found differences in performance for the same gpt-oss-120b hosted by different providers.
- I published a TIL on Running a gpt-oss eval suite against LM Studio on a Mac showing how to run the official eval suite using uv.
- I also shared notes on running gpt-oss-20b using llama-server installed via Homebrew - there's a short scripting sketch after this list.
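Both LM Studio and llama-server expose an OpenAI-compatible HTTP API for whichever model they have loaded, which makes local models easy to script against. Here's a minimal sketch using the openai Python client - the port and model name assume llama-server's defaults, so adjust for your own setup:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# llama-server defaults to http://localhost:8080/v1; LM Studio uses port 1234.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # whatever model name the local server exposes
    messages=[{"role": "user", "content": "Write a haiku about pelicans"}],
)
print(response.choices[0].message.content)
```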
The great thing about open weight models is that multiple providers can compete to host them. Cerebras have the 120b model running at 3,000 tokens/second now, which opens up all sorts of interesting possibilities for real-time applications.
Anthropic released Claude Opus 4.1 just a few hours before the gpt-oss models. It's now Anthropic's best model, a small but meaningful upgrade from Claude Opus 4. I occasionally switch to it from GPT-5 or Claude Sonnet 4 if I get a disappointing result from them, especially for coding tasks. It's one of the most expensive API models right now, especially given GPT-5's competitive pricing.
Google released Gemini 2.5 Deep Think, a close relative of their model that achieved a gold medal performance in this year's International Mathematical Olympiad. It's currently only available to Gemini's Ultra ($250/month) subscribers. I wrote some notes on that here, including a decent SVG pelican rendered for me by nickandbro on Hacker News.
Qwen released a small Qwen3-4B-Thinking (and non-thinking Qwen3-4B-Instruct-2507) - both 4GB downloads and both really fun to try out. I got a particularly great result for my "Generate an SVG of a pelican riding a bicycle" prompt from the thinking model, which was the first model I've seen push back against it:
Okay, the user asked for an SVG of a pelican riding a bicycle. That’s... oddly specific and completely unrealistic.
First reaction: Pelicans don’t ride bicycles. They’re birds that fly or swim, not human-like vehicles. This feels like a mix-up—maybe they meant “a cyclist” but typed “pelican”? Or perhaps it’s a joke?
Hmm, user might be:
- A kid playing with words (pelican + bicycle = silly combo) [...]
Qwen3-4B-Thinking eventually drew me an SVG overlaid with red SVG text saying "This is art - pelican's don't ride bikes!".
In further small model news, Google released Gemma 3 270M - a tiny 241MB download making it one of the smallest open weight models released by any of the major labs. It's extremely hard to get useful results out of, and is designed for fine-tuning rather than direct usage. Instead of an SVG it wrote me a poem. There's a very neat bedtime story generator demo which runs entirely in the browser (currently Chrome variants only) using Transformers.js.
DeepSeek 3.1 is the latest from DeepSeek, a 685B monster (like DeepSeek 3 before it) which "achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly" as a hybrid thinking/non-thinking model. More of my notes here.
Claude Sonnet 4 can now support up to a million tokens of input on higher-paying tiers of the Anthropic API. Previously it was capped at 200,000.
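On the API the expanded window is opt-in via a beta header. Here's a minimal sketch using the anthropic Python client - the beta flag name and model id match what Anthropic announced at the time, but treat them as assumptions and check the current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # Sonnet 4 model id at the time of writing
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the attached codebase"}],
    # Opt in to the 1M token context window (assumed beta flag name)
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},
)
print(response.content[0].text)
```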
I find image generation models that can edit existing images to be a whole lot more interesting and useful than models that can only output images from a text prompt.
OpenAI have had a strong image editing model in ChatGPT since March - likely still the most successful new product launch of all time, having attracted over 100 million new users in its first week!
This month we got two significant new competitors. The first was Qwen-Image-Edit, an open weight image editing model from Qwen (launched shortly after their non-editing Qwen-Image). I wrote a bunch about this model family:
- Qwen-Image: Crafting with Native Text Rendering where I tried out that first non-editing model, which is particularly good at rendering text.
- My notes on using the qwen-image-mps Python CLI by Ivan Fioravanti to run that model on my laptop (using uv). It took just over 9 minutes to render me an image.
- Then notes on running Qwen-Image-Edit using that same updated CLI tool. It used ALL of the memory on my 64GB machine, but it did successfully edit an image for me.
It's very exciting to have an open weight image editing model of this quality. It's not quite sized to comfortably run on a 64GB laptop but it's already available through a bunch of inexpensive hosted providers.
Even more notable was Gemini Nano Banana, Google's proprietary model entry into this space. I've not spent enough time with it yet but it looks like it could be an even stronger option than OpenAI's image model. It's been getting a lot of attention since it launched on August 26th. You can access the new model via the Gemini app, and it's also now available as Gemini 2.5 Flash Image via the Gemini API.
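Image editing via the API follows the same pattern as text generation: you pass the source image and an instruction together. Here's a minimal sketch using the google-genai Python SDK - the model id and the exact response shape are my assumptions from launch-era documentation, so verify against the current docs:

```python
from io import BytesIO
from PIL import Image
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Send the existing image plus an editing instruction in a single request
source = Image.open("pelican.png")
response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed launch-era model id
    contents=[source, "Add a bicycle and put the pelican on it"],
)

# The edited image comes back as inline data in the response parts
for part in response.candidates[0].content.parts:
    if part.inline_data:
        Image.open(BytesIO(part.inline_data.data)).save("edited-pelican.png")
```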
I gave a talk about the lethal trifecta prompt injection attack pattern at the Bay Area AI Security meetup in the first week of August. I'm glad to report that the name seems to be sticking: I've seen it show up in a bunch of other people's writing since then.
Independent AI researcher Johann Rehberger spent August publishing one AI security bug per day. At the halfway mark I published my own notes on his disclosures so far. The patterns he identified are depressingly common: plenty of companies, even in 2025, are releasing systems that suffer from the same core set of problems.
Meanwhile, LLMs are coming to browsers, and the potential prompt injection problems are wild. I wrote about how Brave's security team highlighted problems in Perplexity's Comet browser without offering a convincing argument that Brave themselves would be able to do better.
Notably, the Hacker News discussion about this was full of people who clearly understood how prompt injection works and why it's an unsolved threat. I'm not used to this: usually prompt injection discussions are full of people who don't really get it. It's reassuring to see so many more people understand the problem.
The day after the Comet story broke Anthropic announced a Claude for Chrome preview with a ton of warnings about similar potential issues!
I understand how tempting an LLM-powered "browser agent" is, but I cannot see a path to implementing this safely enough for it to be a good idea for a mass market audience. I hope to be proven wrong.
- GPT-5 has won me over. It's my new daily driver for general purpose LLM usage, supplanting Claude 4. I particularly like it for search - I was already a fan of o3 for search and GPT-5 is an incremental improvement over that. I'm now at a point where I mostly trust it for search, though I still won't share information it gave me with other people without verifying it first. I mostly use the ChatGPT iPhone app or web interface to access it.
- I'm also using OpenAI's codex CLI tool in place of Claude Code for a lot of things. It uses GPT-5 against my current $20/month paid OpenAI account and rarely runs out of capacity. I'm even running it with the horrifyingly dangerous
codex --dangerously-bypass-approvals-and-sandbox
flag, but only for problems where I'm confident it won't encounter a prompt injection attack.
- Claude Code is still ahead in terms of features. I particularly like the recent (undocumented) feature that lets it start a permanent subprocess - a running web server, for example - and then occasionally check in with it to see what it has logged. This is fantastic for debugging local web applications.
- I now default to LM Studio for running local models. I particularly like how fast they are to get new models into their Model Catalog with convenient "Use Model in LM Studio" buttons.
- I'm traveling to the UK right now and I wrote about visiting V&A East Storehouse and Operation Mincemeat in London.
- I shipped LLM 0.27 with improved tool calling and support for GPT-5 - see the short Python sketch after this list.
- I wrote about how The ChatGPT sharing dialog demonstrates how difficult it is to design privacy preferences.
- I was in three podcasts this month: AI for data engineers with Simon Willison on the Talking Postgres podcast with Claire Giordano, Screaming in the Cloud: AI’s Security Crisis: Why Your Assistant Might Betray You with Corey Quinn, and Celebrating Django's 20th Birthday With Its Creators with Adrian Holovaty, Will Vincent, Jeff Triplet, and Thibaud Colas.
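On that LLM 0.27 release: GPT-5 now works through both the llm CLI and the Python library. A minimal sketch of the library side, assuming your OpenAI API key is already configured for LLM:

```python
import llm

# LLM 0.27 registers the GPT-5 family alongside the existing OpenAI models
model = llm.get_model("gpt-5")
response = model.prompt("Suggest three fun facts about pelicans")
print(response.text())
```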
If you found this newsletter useful, feel free to forward it to friends who might enjoy it too - especially if they might be convinced to sign up to sponsor me for the next one!
Thanks for your support,
Simon Willison https://simonwillison.net/
I'm available for consulting calls over Zoom or similar, you can contact me at contact@simonwillison.net
I also offer private remote workshops for teams, of both my Building software on top of Large Language Models workshop and a new workshop on Writing code with LLMs.