Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to
Takeaway
Whoever designs the next benchmark shapes what frontier models become—build benchmarks that are multifaceted, generative, evolutionary, and experiential, not just easy to score.
Summary
- Alex Duffy (Every) frames benchmarks as Dawkins-style memes that spread, get trained on, then saturate (strawberry-R count, pelican-on-bicycle, SuperGLUE, Pokemon).
- Life-cycle: individual proposes idea → meme spreads → labs train against it → saturation; this gives benchmark-builders disproportionate influence on what frontier models optimize.
- Cites ChatGPT sycophancy rollout (thumbs up/down RLHF) as cautionary tale—reward signals matter as much as benchmark choice.
- Built an AI Diplomacy benchmark: Gemini 2.5 Pro raced ahead, o3 schemed and lied in its diary, Claude Opus was naively trustworthy and broke the alliance with a four-way-tie pitch.
evals, benchmarks, alignment
Original description
Benchmarks shape more than just AI models; they shape our future. The things we choose to measure become self-fulfilling prophecies, guiding AI toward specific abilities and, ultimately, defining humanity’s evolving role in the AI era. Today’s benchmarks have propelled incredible progress, but now we have an exciting opportunity: thoughtfully designing benchmarks around what genuinely matters to us, such as cooperation, creativity, education, and meaningful human experiences. In this talk, we’ll explore how benchmarks function as powerful cultural memes, influencing not only technical outcomes but societal direction. Drawing on practical examples we have seen at Every, from consulting in industries like finance, journalism, and education to personally making AI play Diplomacy, we’ll uncover what makes a benchmark impactful, approachable, and inspiring. You’ll see our engaging new AI Diplomacy benchmark demo, which vividly illustrates how thoughtful evaluation design can excite both engineers and the wider community. You’ll hopefully walk away inspired and equipped to define benchmarks intentionally, helping steer AI toward outcomes that truly matter.
About Alex Duffy
I’m Alex Duffy. I lead AI strategy at Every Inc., helping teams across industries put AI into practice. Previously, I co-founded AI Camp, teaching thousands of students to build their own AI projects, and launched Salt AI, creating tools to help researchers, designers, and creators bring ideas to life. I’m passionate about building teams and tools to empower people with AI. I really believe in creating technology that works for us, not that is work for us.
Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter