The Current State of Browser Agents - Jerry Wu and Wyatt Marshall

623 views · Jun 03, 2025 · 21:13 min

Takeaway

Browser agents are now feasible for read tasks and creeping into write tasks, but evaluation is hard and the underlying browser infrastructure can swing performance as much as the model.

Summary

Jerry Wu and Wyatt Marshall (founders of Halluminate) define browser agents as AI that controls a web browser through an observe-reason-act loop, where the observation is either a screenshot fed to a VLM or extracted HTML/DOM.
Common use cases include web scraping for sales prospecting, software QA, form and job-application filling, and generative RPA that replaces brittle UiPath-style workflows.
Tasks split into read tasks (info gathering, easier) and write tasks (state-changing actions, much harder both to build and evaluate).
They introduce their own benchmark, Web Bench (published the week before the talk), which measures agent performance, and emphasize that infrastructure (browser hosting) materially affects results.
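
To make the observe-reason-act loop concrete, here is a minimal sketch in Python. It is not the speakers' implementation and uses no real browser or model: `FakeBrowser`, `Action`, `scripted_policy`, and `run_agent` are hypothetical names invented for illustration, the "page" is an in-memory stand-in, and the reasoning step is a hard-coded rule where a real agent would call a VLM or LLM. The final click changes page state, which is what makes the toy task a write task in the sense used above.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""   # hypothetical element identifier
    text: str = ""     # text to type, or the final answer for "done"


@dataclass
class FakeBrowser:
    """In-memory stand-in for a real browser (assumption for illustration)."""
    fields: dict = field(default_factory=dict)
    submitted: bool = False

    def observe(self) -> str:
        # A real agent would return a screenshot or extracted HTML/DOM here.
        return f"fields={self.fields} submitted={self.submitted}"

    def act(self, action: Action) -> None:
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True  # state-changing step, i.e. a write


def scripted_policy(observation: str) -> Action:
    """Rule-based placeholder for the model's reasoning step."""
    if "'email'" not in observation:
        return Action("type", "email", "user@example.com")
    if "submitted=False" in observation:
        return Action("click", "submit")
    return Action("done", text="form submitted")


def run_agent(browser, policy, max_steps=10):
    """Observe, reason, act; stop when the policy says done or steps run out."""
    for _ in range(max_steps):
        observation = browser.observe()   # observe
        action = policy(observation)      # reason
        if action.kind == "done":
            return action.text
        browser.act(action)               # act
    return None  # step budget exhausted without finishing


if __name__ == "__main__":
    print(run_agent(FakeBrowser(), scripted_policy))
```

The step budget (`max_steps`) is a design choice of this sketch: without one, an agent whose policy never emits "done" would loop forever, and returning `None` gives an evaluator an unambiguous failure signal.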

browser-agents · benchmarks · rpa

Original description

Browser agents are here. But beyond simple sample use cases (I'm looking at you, flight booking demo), are they as good as advertised?

In this talk, we introduce Web Bench, a new benchmark we've developed that rigorously tests browser agents across 450+ websites on real-world, action-based objectives such as info extraction, login/auth, form filling, and others. We'll dive into the results, unpack some unexpected discoveries, and discuss broader implications for the future of general-purpose agents.

You'll walk away with practical insights into:

1. data-driven understanding of the capabilities and limitations of state-of-the-art browser agents
2. how to meaningfully evaluate browser agents 
3. hard-won lessons on designing and launching a benchmark

Come through and see what browser agents can really do.

Resources

Leaderboard: https://webbench.ai/
Technical Report: https://halluminate.ai/blog/benchmark
GitHub: https://github.com/Halluminate/WebBench
Hugging Face: https://huggingface.co/datasets/Halluminate/WebBench