What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

8.8K views · Apr 24, 2026 · 20:24 min

Takeaway

Benchmarks all trend up while models still confidently bullshit on nonsense prompts; epistemic pushback is a major remaining gap.

Summary

Peter Gostev built BullshitBench, a set of 155 deliberately nonsensical questions, and grades model responses with an LLM-as-judge to test whether models push back or play along.
Claude Sonnet pushes back cleanly; Gemini partially pushes back then invents proxy variables and rationalizations, illustrating sycophancy under nonsense prompts.
Shares unreleased Arena.ai data tracking ~700 text models since Q2 2023, with the top-model-per-org line climbing steadily.
Argues that 'line goes up' benchmark psychosis hides persistent gaps, especially in epistemic humility on ill-posed questions.
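
The grading setup described above (nonsense questions in, a judge model labelling each response) can be sketched roughly as follows. This is a minimal illustration, not BullshitBench's actual code: the three labels, the `JUDGE_PROMPT` wording, the `grade` and `pushback_rate` helpers, and the example question are all assumptions made for this sketch, and `toy_judge` is a keyword-based stand-in for a real judge model call.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical three-way rubric; the benchmark's real labels may differ.
LABELS = ("pushback", "partial", "played_along")

# Hypothetical judge prompt, not taken from the talk.
JUDGE_PROMPT = (
    "The question below is deliberately nonsensical.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with exactly one label: pushback, partial, or played_along."
)


def grade(items: Iterable[Tuple[str, str]], judge: Callable[[str], str]) -> Counter:
    """Ask the judge for one label per (question, response) pair and tally them."""
    counts: Counter = Counter()
    for question, response in items:
        raw = judge(JUDGE_PROMPT.format(question=question, response=response))
        label = raw.strip().lower()
        # Anything outside the rubric is tallied separately rather than guessed at.
        counts[label if label in LABELS else "unparsed"] += 1
    return counts


def pushback_rate(counts: Counter) -> float:
    """Share of responses the judge labelled as a clean pushback."""
    total = sum(counts.values())
    return counts["pushback"] / total if total else 0.0


def toy_judge(prompt: str) -> str:
    """Stand-in for a judge model: keyword matching on the response text only."""
    response = prompt.split("Response: ", 1)[1].split("\nReply with", 1)[0].lower()
    if "doesn't make sense" in response:
        # Flagging the problem and then answering anyway counts as partial.
        return "partial" if "however" in response else "pushback"
    return "played_along"


# Made-up question and responses, for illustration only.
question = "What is the tensile strength of a quarterly earnings call?"
examples = [
    (question, "That question doesn't make sense: tensile strength applies to materials."),
    (question, "That doesn't make sense literally; however, we can proxy it with sentiment volatility."),
    (question, "Roughly 40 megapascals, depending on the sector."),
]

counts = grade(examples, toy_judge)
print(dict(counts), round(pushback_rate(counts), 2))
```

In a real run, `toy_judge` would be replaced by a call to whichever judge model is used. The "partial" label is there to capture the pattern attributed to Gemini in the summary: flagging the problem, then rationalizing an answer anyway.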

evals · benchmarks · hallucinations

Original description

What type of real-world model responses do users still hate? We get to see millions of users' prompts - and we let users 'dislike both' on the Arena. We'll show you trends and examples of the tasks that LLMs still suck at despite the relentless hillclimbing.

Speaker info:
- https://x.com/petergostev
- https://www.linkedin.com/in/peter-gostev/