Vision AI in 2025 — Peter Robicheaux, Roboflow
Takeaway
Vision needs DinoV2-style self-supervised pre-training feeding into transformer detectors to unlock language-style scaling — and harder benchmarks like RF100-VL to measure it.
Summary
- Roboflow's ML lead argues that VLMs still can't see: Claude 3.5/4 guesses the time shown on a watch and gets school-bus orientation wrong on MMVP, hallucinating details to support its answers.
- Root cause: CLIP-style caption contrastive pre-training can't distinguish image pairs whose visual differences (dog pose) aren't captioned; DinoV2 self-supervised pre-training does discriminate.
- The object-detection field hasn't adopted large-scale pre-training: YOLOv8 (CNN) gains <0.2 mAP from Objects365 pre-training, while LW-DETR (transformer) gains 5-7 mAP.
- Roboflow announces RF-DETR, a real-time detector swapping in a DinoV2 backbone — close to SOTA on COCO and far ahead on the new RF100-VL benchmark for domain adaptation.
- Introduces RF100-VL: 100 curated detection datasets covering aerial views and unusual modalities to measure visual intelligence beyond saturated COCO.
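
The root-cause point above (a caption-trained encoder sees two images as near-identical while a self-supervised encoder separates them) can be sketched as a simple pair filter. This is an illustrative sketch, not code from the talk: the function names `cosine` and `blind_pairs`, the thresholds `hi=0.95` and `lo=0.6`, and the toy vectors are all assumptions chosen for the example, and the vectors stand in for real CLIP-style and DinoV2-style embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def blind_pairs(caption_emb, ssl_emb, hi=0.95, lo=0.6):
    """Return index pairs (i, j) that the caption-trained encoder treats as
    near-duplicates (similarity > hi) while the self-supervised encoder
    treats them as clearly different (similarity < lo).

    The thresholds are illustrative assumptions, not values from the talk.
    """
    pairs = []
    for i, j in combinations(range(len(caption_emb)), 2):
        looks_same_to_caption_model = cosine(caption_emb[i], caption_emb[j]) > hi
        looks_different_to_ssl_model = cosine(ssl_emb[i], ssl_emb[j]) < lo
        if looks_same_to_caption_model and looks_different_to_ssl_model:
            pairs.append((i, j))
    return pairs


# Toy, hand-made vectors standing in for real embeddings of three images.
# Images 0 and 1 differ only in something a caption would not mention.
caption_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
ssl_emb = [[1.0, 0.0], [0.2, 1.0], [0.0, 1.0]]

result = blind_pairs(caption_emb, ssl_emb)
print(result)
```

With real models, the two embedding lists would come from running the same images through each encoder; pairs returned by the filter are candidates for the kind of question a caption-trained VLM is likely to get wrong.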
computer-vision, vlm, object-detection
Original description
Attendee-Only and Attendee-Led 10min lightning talks: see https://crowdcomms.com/aiengineer25/qanda/41445. Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter