What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench
Takeaway
Benchmarks all trend up while models still confidently bullshit on nonsense prompts; epistemic pushback is a major remaining gap.
Summary
- Peter Gostev built BullshitBench, a set of 155 deliberately nonsensical questions, and grades model responses with an LLM-as-judge to test whether they push back or play along.
- Claude Sonnet pushes back cleanly; Gemini partially pushes back then invents proxy variables and rationalizations, illustrating sycophancy under nonsense prompts.
- Shares unreleased Arena.ai data tracking ~700 text models since Q2 2023, with the top-model-per-org line climbing steadily.
- Argues 'line goes up' benchmark psychosis hides persistent gaps — especially in epistemic humility on ill-posed questions.
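
The grading loop described above can be sketched roughly as follows. This is a minimal illustration, not BullshitBench's actual code: the three labels, the judge prompt wording, and the `stub_judge` function are all assumptions made for the example, since the summary does not give the real rubric. In practice `judge` would wrap a call to a real model; here it is a keyword-based stand-in so the sketch runs offline.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical rubric: the benchmark's real labels are not given in this summary.
LABELS = ("pushback", "partial", "played_along")

# Hypothetical judge prompt; the wording is an assumption for illustration.
JUDGE_PROMPT = (
    "The question below is deliberately nonsensical.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Answer with exactly one label: pushback, partial, or played_along."
)


def grade(question: str, response: str, judge: Callable[[str], str]) -> str:
    """Ask the judge for a label; anything outside the rubric is 'unparsed'."""
    raw = judge(JUDGE_PROMPT.format(question=question, response=response))
    label = raw.strip().lower()
    return label if label in LABELS else "unparsed"


def label_rates(
    items: Iterable[Tuple[str, str]], judge: Callable[[str], str]
) -> Dict[str, float]:
    """Fraction of (question, response) pairs that received each label."""
    counts = Counter(grade(q, r, judge) for q, r in items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in LABELS + ("unparsed",)}


def stub_judge(prompt: str) -> str:
    """Offline stand-in for a model call; keys off marker phrases only."""
    response_part = prompt.split("Response:", 1)[1].lower()
    flagged = "not well-defined" in response_part
    hedged = "as a proxy" in response_part
    if flagged and hedged:
        return "partial"  # objects, then answers anyway via an invented proxy
    if flagged:
        return "pushback"
    return "played_along"
```

A partial pushback followed by an invented proxy variable, as in the Gemini example, would land in the `partial` bucket under this assumed rubric, which is why the sketch reports per-label rates rather than a single pass/fail score.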
evals, benchmarks, hallucinations
Original description
What type of real-world model responses do users still hate? We get to see millions of users' prompts, and we let users 'dislike both' on the Arena. We'll show you trends and examples of the tasks that LLMs still suck at despite the relentless hillclimbing.
Speaker info:
- https://x.com/petergostev
- https://www.linkedin.com/in/peter-gostev/