Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai

2.9K views · Aug 23, 2025 · 16:27 min

Takeaway

Aesthetic evals must center human perception — current metrics like FID and CLIP miss what people actually find broken in generative imagery.

Summary

Diego (Krea.ai) shows o3 spending 17 seconds calling Python/OpenCV tools and still failing to label an obviously broken AI-generated hand image: current models can't do basic aesthetic perception.
JPEG, MP3, and MP4 work by exploiting the limits of human perception (for example, we are more sensitive to brightness than to color), but training data is full of such lossy-compressed media, which contaminates model 'aesthetics'.
FID scores penalize JPEG artifacts heavily even when the images look identical to humans, so using FID to grade generative models is misaligned with perception.
Evals focus on what's easy to measure (CLIP prompt adherence, object counts) and miss perceptual coherence ('the clock doesn't look right, that sky makes no sense').
Quotes a friend at Midjourney: predicting the car was easy, predicting traffic was hard. What 'traffic' are AI engineers missing now? Possibly perceptual evaluation.
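
The brightness-over-color point in the summary can be illustrated with chroma subsampling, one of the perceptual tricks JPEG commonly uses. The sketch below is not from the talk. It assumes the BT.601 full-range RGB to YCbCr conversion used by JFIF and a 4:2:0 layout (color stored at half resolution in each direction), and the function names are hypothetical.

```python
# Minimal sketch of 4:2:0 chroma subsampling (assumption: BT.601 full-range
# conversion as in JFIF; all helper names here are illustrative only).

def rgb_to_ycbcr(r, g, b):
    """Split an 8-bit RGB pixel into brightness (Y) and two color channels."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_420(plane):
    """Average each 2x2 block of a channel (rows and columns must be even)."""
    h, w = len(plane), len(plane[0])
    return [
        [
            (plane[i][j] + plane[i][j + 1] + plane[i + 1][j] + plane[i + 1][j + 1]) / 4
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def sample_counts(h, w):
    """Return (samples at full resolution, samples after 4:2:0 subsampling)."""
    full = 3 * h * w                            # Y, Cb, Cr all at full size
    sub = h * w + 2 * (h // 2) * (w // 2)       # full-size Y, quarter-size Cb and Cr
    return full, sub


if __name__ == "__main__":
    # A 2x2 patch: brightness (Y) is kept per pixel, color is averaged to one value.
    patch = [[(200, 30, 30), (190, 40, 35)], [(205, 25, 28), (195, 35, 32)]]
    ycc = [[rgb_to_ycbcr(*px) for px in row] for row in patch]
    cb_plane = [[px[1] for px in row] for row in ycc]
    print("Cb after subsampling:", subsample_420(cb_plane))
    print("samples before/after for 4x4:", sample_counts(4, 4))
```

Half of the samples are discarded before any quantization happens, yet the brightness channel is untouched, which is why the result usually looks unchanged to a person. A metric that compares feature statistics rather than perceived quality has no built-in reason to share that tolerance.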

evals · aesthetics · generative-media

Original description

Special session with KREA.ai's cofounder Diego Rodriguez on how evals for aesthetics and image/generative media work — the hardest kinds of evals.

  linkedin.com/in/asciidiego/

Timestamps
00:15 Introduction to Perceptual Evaluations
00:50 The Problem with Current AI Evaluations
02:16 Historical Context and Compression
05:14 Limitations in AI and Human-centric Metrics
08:00 Rethinking Evaluation and the Future of AI
12:44 Evaluating Our Evaluations
13:32 Krea's Role and Call to Action