
How to defend your sites from AI bots — David Mytton, Arcjet

2.0K views · Jul 30, 2025 · 20:11 min

Takeaway

Layered defenses (robots.txt + UA + reverse-DNS verification + behavioral signals) are needed to distinguish good AI crawlers from training/abuse bots and agent-driven browsers.

Summary

  • ~50% of web traffic is automated (60% in gaming), and AI crawlers are making it worse: Diaspora saw 24% of its traffic come from GPTBot, Read the Docs cut bandwidth from 800GB/day to 200GB/day by blocking AI crawlers, and Wikipedia spends ~35% serving bots
  • OpenAI has at least 4 bots: OAI-SearchBot (good, like Googlebot), ChatGPT-User (real-time user fetches, no training), GPTBot (training crawler with no citation benefit), and Operator (a computer-use agent that is indistinguishable from a Chrome browser)
  • Defense layers: robots.txt (compliance is voluntary), the User-Agent header (a forgeable string), and IP verification through reverse DNS or published IP ranges (works for Apple, Bing, Google, OpenAI)
  • Arcjet open-sources thousands of user-agent fingerprints for rule building; recommends combining UA + IP verification + behavioral signals
  • Computer-use agents like Operator are hardest — they present as a real browser, so detection requires deeper behavioral analysis
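
The User-Agent and IP verification layers from the summary can be sketched roughly as follows. This is a minimal illustration, not Arcjet's implementation: the function names, the purpose labels, and the injectable `reverse`/`forward` resolver parameters are all hypothetical, and the hostname suffixes or CIDR ranges you pass in should come from each operator's current documentation. With the default resolvers the reverse DNS check covers IPv4 only.

```python
import ipaddress
import socket

# User-Agent tokens for the crawlers named in the summary, mapped to the
# purpose the summary gives each one. The labels are illustrative.
OPENAI_UA_TOKENS = {
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user-fetch",
    "GPTBot": "training",
}


def classify_user_agent(user_agent):
    """Return the purpose a request claims via its UA string, or None.

    The UA header is forgeable, so treat this as a claim to be verified.
    """
    ua = user_agent.lower()
    for token, purpose in OPENAI_UA_TOKENS.items():
        if token.lower() in ua:
            return purpose
    return None


def verify_reverse_dns(ip, allowed_suffixes,
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check.

    1. Resolve the IP to a hostname (PTR lookup).
    2. Require the hostname to end in an operator-owned suffix.
    3. Resolve that hostname forward and require the original IP.
    """
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


def ip_in_ranges(ip, cidrs):
    """Check an IP against a list of published CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
```

A request would count as a verified crawler only when its UA claim and one of the IP checks agree, for example `classify_user_agent(ua) == "search" and ip_in_ranges(ip, published_cidrs)`. Keep the leading dot in suffixes such as `.googlebot.com` so that a lookalike domain ending in `googlebot.com` does not pass. Neither check helps with computer-use agents, which send ordinary browser traffic.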

ai-bots · web-security · arcjet

Original description
Constantly seeing CAPTCHAs? It used to be easy to tell the humans from the droids, but what else can we do when synthetic clients make up nearly half of all web requests? Rotating IPs, spoofed browsers, and agents acting on behalf of real users: are we doomed to forever be solving puzzles?

In this talk, we’ll explore user agents, HTTP fingerprints, and IP reputation signals that make humans and agents stand out from scrapers, build a realistic threat model, and dig into the behaviors that reveal the LLM-mimicry. Leave with AX- and UX-safe code, benchmarks, and tools to help you take back control.