2025 in LLMs so far, illustrated by Pelicans on Bicycles — Simon Willison

158.0K views · Jul 09, 2025 · 18:30 min

Takeaway

Frontier-level capability is rapidly commoditizing: more labs are shipping it in ever smaller models, and prices are collapsing, apart from a few overpriced outliers.

Summary

Tracks ~30 significant model releases from Dec 2024 to mid-2025, judged by a 'draw a pelican on a bicycle in SVG' benchmark instead of leaderboards
Notes a local-model breakthrough: Llama 3.3 70B and Mistral Small 3 (24B) deliver GPT-4-class capability and run on a 64GB MacBook
DeepSeek V3 (685B) was released on Hugging Face on Christmas Day 2024 with a reported training cost of ~$5.5M; DeepSeek R1 followed in January and triggered Nvidia's largest single-day market cap drop
Prices for capable models have fallen ~500x; GPT-4.5 at $75/M tokens was an outlier and has since been deprecated, while Gemini 2.5 Pro produces a quality pelican for ~4.5 cents
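
To make the per-token prices in the summary concrete, here is a minimal Python sketch of the arithmetic. Only the $75/M rate and the ~4.5 cent figure come from the summary; the helper names `cost_usd` and `tokens_for_budget` and the 2,000-token request size are hypothetical, chosen purely for illustration.

```python
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost in US dollars for `tokens` tokens at a per-million-token price."""
    return tokens * price_per_million_usd / 1_000_000


def tokens_for_budget(budget_usd: float, price_per_million_usd: float) -> int:
    """Number of tokens a budget buys at a per-million-token price."""
    return int(round(budget_usd * 1_000_000 / price_per_million_usd))


GPT_4_5_PRICE = 75.0   # USD per million tokens, as quoted in the summary
PELICAN_BUDGET = 0.045  # USD, the ~4.5 cent Gemini 2.5 Pro pelican

# Hypothetical request size, not a figure from the talk.
example_tokens = 2_000

# What the example request would cost at the $75/M rate.
print(f"{example_tokens} tokens at $75/M: ${cost_usd(example_tokens, GPT_4_5_PRICE):.3f}")

# How many tokens the 4.5 cent pelican budget would buy at the $75/M rate.
print(f"4.5 cents at $75/M buys {tokens_for_budget(PELICAN_BUDGET, GPT_4_5_PRICE)} tokens")
```

Under these assumptions, the budget that produced a complete pelican drawing would cover only a few hundred tokens at the outlier rate, which is one way to see why that price stood out.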

llm-models · benchmarks · open-weights

Original description

What's changed in the world of LLMs since the AIE World's Fair last year? A lot!

I'll be taking full advantage of my role as a fiercely independent researcher to review the past 12 months of advances in the field and catch everyone up on the latest models, free from any influence of vendors or employers.

About Simon Willison
Simon Willison is the creator of Datasette, an open source tool for exploring and publishing data. He currently works full-time building open source tools for data journalism, built around Datasette and SQLite.

Prior to becoming an independent open source developer, Simon was an engineering director at Eventbrite. Simon joined Eventbrite through their acquisition of Lanyrd, a Y Combinator funded company he co-founded in 2010.

He is a co-creator of the Django Web Framework, and has been blogging about web development and programming since 2002 at simonwillison.net.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

Timestamps:

00:00 A review of the last six months in LLMs
01:08 The "Pelican Riding a Bicycle" Benchmark
02:10 AWS Nova and Llama 3.3 70B
03:30 DeepSeek and its impact
05:42 Mistral Small 3 and the rise of local models
06:45 Claude 3.7 Sonnet and GPT-4.5
08:44 Gemini 2.5 Pro, GPT-4o, and Llama 4
11:21 GPT-4.1, o3, and o4-mini
12:05 Claude 4 and other recent releases
14:11 Amusing and concerning LLM bugs
16:58 The power of tools and reasoning in AI
17:41 Prompt injection and the "Lethal Trifecta"
18:11 The future of the pelican benchmark