How Transformers Finally Ate Vision – Isaac Robinson, Roboflow

4.9K views · May 08, 2026 · 17:05 min

Takeaway

Pretraining beats inductive bias: ViTs won not because n^4 attention (cost in image side length n) is inherently good, but because Flash Attention plus MAE/DINO pretraining made it scale.

Summary

Isaac Robinson (Roboflow research lead) traces how ViTs ultimately beat CNNs: pure set-to-set attention with no inductive bias, costing n^4 in image side length n and run over 16x16 patches with learned positional encodings, won through massive ViT-specific pretraining and infrastructure borrowed from LLMs (Flash Attention).
Swin Transformer added back locality (windowed attention with shifted windows), cutting attention cost from n^4 to n^2 in image side length n: a convolution-like inductive bias retrofitted onto transformers.
ConvNeXt swung back: it took ViT lessons (4x4 patchify, mixer/feed-forward pattern, hierarchical structure, LayerNorm) and grafted them onto CNNs, beating ViT and Swin on ImageNet.
Meta's Hiera stripped inductive biases one by one and replaced them with pretraining (MAE, the masked autoencoder); MAE is ViT-specific, since you can't drop patches in a convolution.
DINOv2/v3 self-supervised pretraining yields semantically meaningful feature maps (a PCA of the features traces cat paws correctly); linear-probe accuracy nearly matches full supervision.
SAM lineage: SAM (ViT + MAE), MobileSAM (TinyViT hybrid), SAM2 (Hiera), SAM3 (gives up ablating, uses a big pretrained backbone).
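
The n^4 versus n^2 scaling mentioned above can be checked with a little arithmetic. The sketch below is not from the talk; it simply counts query-key pairs for global attention, for window-restricted attention, and for an encoder that only sees unmasked patches. The function names, the 7x7 token window, the shared 16-pixel patch size, and the 75% mask ratio (the default reported for MAE) are illustrative assumptions.

```python
def tokens_per_side(side_px: int, patch_px: int = 16) -> int:
    """Patch tokens along one side of a square image."""
    return side_px // patch_px


def global_pairs(side_px: int, patch_px: int = 16) -> int:
    """Query-key pairs when every token attends to every token (ViT-style)."""
    tokens = tokens_per_side(side_px, patch_px) ** 2
    return tokens * tokens  # tokens ~ n^2, so pairs ~ n^4


def windowed_pairs(side_px: int, patch_px: int = 16, window: int = 7) -> int:
    """Query-key pairs when each token attends only within its own window."""
    per_side = tokens_per_side(side_px, patch_px)
    if per_side % window != 0:
        raise ValueError("token grid must divide evenly into windows")
    tokens = per_side ** 2
    return tokens * window * window  # fixed window, so pairs ~ n^2


def masked_encoder_pairs(side_px: int, patch_px: int = 16,
                         mask_ratio: float = 0.75) -> int:
    """Pairs seen by an encoder that drops masked patches before attention."""
    tokens = tokens_per_side(side_px, patch_px) ** 2
    visible = int(tokens * (1 - mask_ratio))  # assumed 75% of patches dropped
    return visible * visible


if __name__ == "__main__":
    for side in (224, 448):
        print(side, global_pairs(side), windowed_pairs(side))
    print("masked encoder at 224:", masked_encoder_pairs(224))
```

Under these assumptions, doubling the image side from 224 to 448 multiplies the global pair count by 16 and the windowed count by 4. Dropping 75% of the patches cuts the encoder's pair count by 16x at a fixed resolution, which is only possible because a transformer treats its input as a set of tokens.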

vision-transformers · pretraining · computer-vision

Original description

Vision used to belong to CNNs. This talk explains why that changed, and why transformers only recently started winning for vision despite looking like the less natural fit for images. The answer runs through pretraining, scaling, borrowed infrastructure from the LLM world, and the long arc back to the simple architecture that scales best.

Using the evolution from ViT and Swin through ConvNeXt, Hiera, SAM, and RF-DETR, Isaac Robinson walks through what actually made transformer vision systems practical, where the tradeoffs still are, and why deployment flexibility now matters as much as raw benchmark wins. What comes next for VLMs, world models, and physical AI?

Speaker info:
- https://www.linkedin.com/in/robinsonish/