Vision AI in 2025 — Peter Robicheaux, Roboflow

11.9K views · Aug 03, 2025 · 17:24 min

Takeaway

Vision needs DinoV2-style self-supervised pre-training feeding into transformer detectors to unlock language-style scaling, plus harder benchmarks like RF100-VL to measure it.

Summary

Roboflow's ML lead argues that VLMs still can't see: Claude 3.5/4 guesses the time shown on a watch and gets school-bus orientation wrong on MMVP, hallucinating supporting details.
Root cause: CLIP-style contrastive pre-training on captions can't distinguish image pairs whose visual differences (such as a dog's pose) aren't captioned; DinoV2's self-supervised pre-training does tell them apart.
The object-detection field hasn't adopted large-scale pre-training: YOLOv8 (CNN) gains <0.2 mAP from Objects365 pre-training, while LW-DETR (transformer) gains 5-7 mAP.
Roboflow announces RF-DETR, a real-time detector that swaps in a DinoV2 backbone; it is close to SOTA on COCO and far ahead on the new RF100-VL benchmark for domain adaptation.
Introduces RF100-VL: 100 curated detection datasets covering aerial views and unusual modalities to measure visual intelligence beyond saturated COCO.
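
The CLIP point in the summary can be illustrated with a toy calculation. The sketch below is not from the talk: it assumes a symmetric InfoNCE-style loss as a stand-in for CLIP-style training, and the 3-dimensional embeddings (axes "dog", "car", "pose") are hypothetical values chosen for illustration. It shows that when captions carry no pose information, two dog images that differ only in pose receive the same loss, and an encoder that discards pose entirely scores slightly better.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_style_loss(images, captions, temperature=0.07):
    """Symmetric contrastive loss over matched (image, caption) embeddings."""
    images = [normalize(v) for v in images]
    captions = [normalize(v) for v in captions]
    logits = [[sum(a * b for a, b in zip(img, cap)) / temperature
               for cap in captions] for img in images]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Hypothetical caption embeddings: "a dog", "a car". The third axis (pose)
# is always zero because the captions never mention pose.
captions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
car_image = [0.0, 1.0, 0.0]

# Hypothetical image embeddings for the dog photo.
loss_no_pose = clip_style_loss([[1.0, 0.0, 0.0], car_image], captions)
loss_sitting = clip_style_loss([[1.0, 0.0, 0.5], car_image], captions)
loss_running = clip_style_loss([[1.0, 0.0, -0.5], car_image], captions)

print("pose discarded:", loss_no_pose)
print("sitting dog:   ", loss_sitting)
print("running dog:   ", loss_running)
```

Under these assumptions the sitting and running embeddings are indistinguishable to the objective, and encoding pose only lowers similarity to the caption. A self-supervised objective that compares images with images, as in DinoV2, has no such dependence on what the caption happens to mention.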

computer-vision · vlm · object-detection

Original description

Attendee-Only and Attendee-Led 10min lightning talks: see https://crowdcomms.com/aiengineer25/qanda/41445

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter