Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai
Takeaway
Aesthetic evals must center human perception: current metrics such as FID and CLIP-based scores miss what people actually find broken in generative imagery.
Summary
- Diego (Krea.ai) shows o3 spending 17 seconds calling Python/OpenCV tools and still failing to label an obviously broken AI-generated image of a hand. His point: current models can't do basic aesthetic perception.
- JPEG, MP3, and MP4 work by exploiting human perception (for example, we are more sensitive to brightness than to color), but our training data is full of such lossily compressed media, which contaminates model 'aesthetics'.
- FID penalizes JPEG artifacts heavily even when the images look identical to humans, so using FID to grade generative models is misaligned with perception.
- Evals focus on what's easy to measure (CLIP prompt adherence, object counts) and miss perceptual coherence ('the clock doesn't look right, that sky makes no sense').
- Quotes a friend at Midjourney: predicting the car was easy, predicting traffic was hard. What 'traffic' are AI engineers missing now? Possibly perceptual evaluation.
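
The "brightness > color" point in the summary can be made concrete with a small sketch. The summary does not say which mechanism the talk showed, so this is an illustration chosen here: 4:2:0-style chroma subsampling, the step where JPEG-like codecs keep brightness (luma) at full resolution and store color (chroma) at quarter resolution. It is not a JPEG codec (no DCT, no quantization), and the function names `average_2x2`, `chroma_subsample`, and `stored_samples` are hypothetical. The color conversion uses the full-range BT.601 coefficients that JFIF files use.

```python
# Hypothetical sketch of 4:2:0-style chroma subsampling (not a JPEG codec).
# Assumption: the image is a list of rows of (r, g, b) tuples in 0..255.

def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion, as used by JFIF/JPEG files."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def average_2x2(plane):
    """Replace every 2x2 block with its mean, keeping the plane's size."""
    height, width = len(plane), len(plane[0])
    out = [[0.0] * width for _ in range(height)]
    for top in range(0, height, 2):
        for left in range(0, width, 2):
            rows = range(top, min(top + 2, height))
            cols = range(left, min(left + 2, width))
            mean = sum(plane[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
            for i in rows:
                for j in cols:
                    out[i][j] = mean
    return out


def chroma_subsample(image):
    """Return (y, cb, cr) planes: luma untouched, chroma averaged per 2x2 block."""
    ycc = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in image]
    y = [[p[0] for p in row] for row in ycc]
    cb = average_2x2([[p[1] for p in row] for row in ycc])
    cr = average_2x2([[p[2] for p in row] for row in ycc])
    return y, cb, cr


def stored_samples(height, width):
    """Samples a 4:2:0 layout keeps: full-size luma plus two quarter-size chroma planes."""
    chroma = ((height + 1) // 2) * ((width + 1) // 2)
    return height * width + 2 * chroma


# Usage: four very different colors in one 2x2 block.
image = [[(255, 0, 0), (0, 0, 255)],
         [(0, 255, 0), (255, 255, 255)]]
y, cb, cr = chroma_subsample(image)
print(stored_samples(2, 2), "samples stored instead of", 3 * 2 * 2)
```

After subsampling, every pixel in a block shares one Cb and one Cr value while each keeps its own luma. Half the samples are gone, and on natural images viewers rarely notice. A distribution-level metric computed on pixels or features has no such built-in tolerance, which is the mismatch the FID bullet describes.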
evals, aesthetics, generative-media
Original description
Special session with KREA.ai's cofounder Diego Rodriguez on how evals for aesthetics and image/generative media work, the hardest kinds of evals.
linkedin.com/in/asciidiego/
Timestamps
00:15 Introduction to Perceptual Evaluations
00:50 The Problem with Current AI Evaluations
02:16 Historical Context and Compression
05:14 Limitations in AI and Human-centric Metrics
08:00 Rethinking Evaluation and the Future of AI
12:44 Evaluating Our Evaluations
13:32 Krea's Role and Call to Action