Is Your Website Blocking AI Crawlers? How to Find Out Before It Costs You.
By Greg Arnold
Your security settings might be doing their job too well. 63% of ChatGPT agent visits to websites bounce immediately, mostly due to HTTP errors, CAPTCHAs, or bot blocking rules. Approximately 35.5% of the top 1,000 websites block GPTBot via robots.txt. And since July 2025, Cloudflare, which sits in front of approximately 20% of the public web, has blocked AI crawlers by default on every new domain.
Over 80% of Cloudflare customers chose to block AI bots when the one-click option launched. For publishers protecting intellectual property, this is a deliberate choice. For most businesses, it is an accident they do not even know happened.
Your CDN (Content Delivery Network), your WAF (Web Application Firewall), and your rate limiter do not distinguish between a credential-stuffing attack and ChatGPT trying to read your homepage. An analysis of over one million AI citations found that 73% of websites have technical barriers blocking AI crawler access. Unless someone on your team specifically configured exceptions for AI crawlers, your security infrastructure may be making your content invisible to the fastest-growing search channel in a decade.
How we got here: 25 years of scraping wars
The infrastructure that blocks AI crawlers today was not built for AI. It was built for a different war entirely.
The web scraping battles started in 2000, when eBay sued Bidder's Edge for making 100,000 automated requests per day to aggregate auction data. The court ruled it trespass. That set the template. Over the next two decades, scraping lawsuits shaped the legal and technical landscape: Craigslist sued 3Taps for scraping housing ads, settling for $1 million in 2015. Ryanair sued booking aggregators and even canceled tickets booked through scraped fare data. LinkedIn fought hiQ Labs all the way to the Supreme Court over public profile scraping.
In every case, the companies being scraped had a legitimate grievance. Competitors were taking their data and using it to build rival products. The real estate industry saw MLS listing data scraped and republished on competing sites. Airlines watched fare aggregators undercut their direct booking channels. The response was rational: build walls.
A billion-dollar industry formed around those walls. Akamai launched Bot Manager in 2016. Cloudflare introduced machine-learning-based bot management in 2019. The bot security market reached approximately $1.2 billion by 2024, with projections of $4.9 to $5.6 billion by 2030. The technology evolved from simple IP blocking to behavioral analytics, browser fingerprinting, and JavaScript challenges.
Then AI search arrived.
OpenAI launched GPTBot on August 7, 2023. Within six weeks, 25.9% of the top 1,000 websites had blocked it. The New York Times sued OpenAI in December 2023 for training on its articles without consent. Cloudflare launched one-click AI bot blocking in July 2024, and over a million customers activated it. Sites blocking AI crawlers increased 336% year-over-year.
The publishers and content companies had reason to block. AI training crawlers take your data to build models without necessarily sending anything back. OpenAI's crawl-to-referral ratio for training crawls was 1,700 to 1. Anthropic's was 73,000 to 1. That is not a fair trade.
But here is the problem: the same blocking infrastructure now catches AI search crawlers, the ones that retrieve your content in real time to answer user queries and send those users to your site. The walls built to keep data thieves out are now keeping potential customers from finding you. And for most businesses that are not publishers fighting over intellectual property, this trade-off was never a conscious decision. It was a default setting.
The distinction that changes everything
Every GEO (Generative Engine Optimization) guide tells you to optimize your content structure. Write self-contained passages. Add schema markup. Improve your heading hierarchy. All of that advice is correct. None of it matters if AI crawlers cannot reach your pages in the first place.
This is the GEO equivalent of renovating a storefront that faces a brick wall. The work is real. The effort is genuine. But the audience you are trying to reach never sees it, because the door is locked.
ChatGPT now has over 883 million monthly users. Perplexity processes over a billion queries per month. Google AI Overviews serve 2 billion monthly users. When those users ask a question about your industry, AI engines retrieve content from the open web to build their answers. If your site returns a challenge page instead of content, you are excluded from that answer. Not ranked lower. Excluded entirely.
The critical difference between scraping and AI search: scraping takes your data and gives you nothing. AI search reads your data and sends you customers. Your WAF treats both the same. Harvard Business Review reported on March 6, 2026, that LLMs are overtaking traditional search and that companies need to shift from optimizing for clicks to "engineering recall inside AI systems." That shift starts with making sure AI systems can reach your content in the first place.
How your security tools block AI crawlers
AI crawlers are bots. Your security tools are designed to block bots. The collision was inevitable, given 25 years of infrastructure built to stop exactly this kind of automated access.
Cloudflare is the most common source of accidental AI blocking. Cloudflare protects over 20% of all websites on the internet. On July 1, 2025, Cloudflare became the first major internet infrastructure provider to block AI crawlers by default on every new domain. Its bot management features include JavaScript challenges, CAPTCHA pages, and automated threat scoring. AI crawlers like GPTBot and ClaudeBot cannot execute JavaScript. When they encounter a Cloudflare challenge page, they receive an HTML page asking them to verify they are human. They are not human. The challenge fails. Your content is never retrieved. This blocking often does not appear in zone-level security event logs, giving site owners no visibility into what is happening.
Akamai Bot Manager uses behavioral analytics, fingerprinting, and real-time signature matching. AI crawlers that arrive with datacenter IP addresses and known bot user agents can be flagged as suspicious and served challenge pages or blocked outright.
AWS WAF Bot Control provides configurable rules for filtering bot traffic. AI crawlers can be categorized alongside scrapers and DDoS bots unless administrators specifically allowlist them.
Sucuri and other security plugins on WordPress and other CMS platforms often include aggressive bot filtering that catches AI crawlers in the same net as malicious traffic.
The standard cases are straightforward. But the scraping wars produced a generation of blocking methods that are far harder to detect, and far more likely to go unnoticed.
Rate limiting that looks like a normal page. Some sites return HTTP 429 (Too Many Requests) to automated visitors. The problem: a 429 response often includes a small error page that looks like HTML content. Tools that inspect only the response body see what looks like a page. Audit tools that do not check the status code will score the error page as if it were the real site, producing meaningless results. We encountered this firsthand when scanning a major real estate site that returned a 36-word rate-limit page. Every content check scored the error page. The overall score looked plausible. The entire report was garbage.
Challenge pages that return 200 OK. Many CDNs and bot management platforms return a 200 OK status code even when serving a JavaScript challenge or CAPTCHA page. The response looks successful. The content is a verification prompt. Uptime monitors do not catch it. Your analytics do not flag it. The site appears healthy from every human perspective while returning empty content to every AI crawler.
AI tarpits and honeypots. Cloudflare launched AI Labyrinth in March 2025, a feature that feeds unauthorized AI crawlers into a maze of AI-generated decoy pages filled with convincing but irrelevant content. The crawler follows hidden links deeper and deeper into meaningless pages, wasting its resources while real content stays out of reach. Other providers are building similar decoy systems. These are not accidental blocks. They are the latest evolution of anti-scraping technology, now turned against AI crawlers indiscriminately.
Bot management platforms with invisible fingerprinting. Services like Datadome, PerimeterX (now HUMAN Security), Imperva, and Kasada identify bots through TLS fingerprinting, behavioral analysis, and device characteristic analysis. They do not rely on user-agent strings or IP addresses. A crawler can present a valid browser user-agent and still be identified and blocked based on how its TLS handshake differs from a real browser. These platforms were built to stop sophisticated scrapers who had learned to mimic browsers. AI crawlers get caught in the same net.
The pattern across all of these: your security tools see a bot, your security tools block the bot. The tools do not know, and cannot know without explicit configuration, that this particular bot is ChatGPT trying to read your pricing page so it can recommend your product to a potential customer.
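The rate-limit and challenge-page cases above can be caught by reading the status code and the body together. The following Python sketch is a heuristic and not a complete detector: the marker phrases, the 50-word threshold, and the name `classify_response` are illustrative assumptions, and fingerprint-based blocking or decoy pages would pass through it unnoticed.

```python
import re

# Illustrative marker phrases; real challenge pages vary by provider.
CHALLENGE_MARKERS = ("just a moment", "captcha", "verify you are human")
# Hypothetical threshold: bodies with fewer visible words are treated as thin.
MIN_WORDS = 50


def visible_words(html: str) -> list[str]:
    """Drop script/style blocks and tags, then split what is left into words."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return text.split()


def classify_response(status: int, body: str) -> str:
    """Label a fetched response instead of assuming the body is the real page."""
    if status == 429:
        return "rate_limited"
    if status in (401, 403):
        return "blocked"
    if status != 200:
        return "error"
    words = visible_words(body)
    text = " ".join(words).lower()
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge_page"
    if len(words) < MIN_WORDS:
        return "thin_content"
    return "content"
```

Checking the status code first is what keeps a 36-word rate-limit page from being scored as if it were the site; the word-count check is only a fallback for challenge pages served with 200 OK.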
What you see vs. what AI crawlers see
This is why the problem stays invisible. When you visit your own website, everything works. Your browser executes JavaScript, passes CAPTCHA challenges, and renders the page normally. You have no indication that anything is wrong.
AI crawlers have a fundamentally different experience. They send an HTTP request and receive whatever the server returns without JavaScript execution. The same URL produces two completely different responses depending on who is asking:
Your browser: Request goes to yoursite.com/pricing -> CDN sees a real browser with JavaScript, cookies, and a standard user agent -> Response: 200 OK with your full pricing page HTML.
GPTBot: Request goes to yoursite.com/pricing -> CDN sees a bot user agent from a datacenter IP with no JavaScript capability -> Response: 200 OK with a Cloudflare challenge page, or 403 Forbidden. Either way, zero actual content.
The status code is the quiet part. Many CDNs return a 200 OK even when serving a challenge page, which means standard uptime monitors will not catch the problem. The response looks successful. The content is empty.
From the AI engine's perspective, your website contains a challenge page. Not your products, not your expertise, not your pricing. A challenge page. The engine moves on to the next source. Your content is never considered for citation.
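One rough way to see this difference is to request the same URL with a browser-style User-Agent and with crawler-style ones, then compare status codes and body sizes. The sketch below uses only Python's standard library. The user-agent values are shortened placeholders (real crawlers send longer strings), and the function names are hypothetical. Because the request comes from your own IP address and TLS stack, this can only reveal rules keyed on the User-Agent header, not IP- or fingerprint-based blocking.

```python
import urllib.error
import urllib.request

# Placeholder User-Agent values; the strings real crawlers send are longer.
PROBE_AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "GPTBot": "GPTBot/1.0",
    "OAI-SearchBot": "OAI-SearchBot/1.0",
    "PerplexityBot": "PerplexityBot/1.0",
}


def build_probe(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that presents the given User-Agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Return (status, body) without raising on 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response; its body is often the block page.
        return err.code, err.read().decode("utf-8", "replace")


def probe_all(url: str) -> dict[str, tuple[int, int]]:
    """Map each probe label to (status code, body length in characters)."""
    results = {}
    for label, agent in PROBE_AGENTS.items():
        status, body = fetch(build_probe(url, agent))
        results[label] = (status, len(body))
    return results
```

Calling `probe_all("https://yoursite.com/pricing")` returns one (status, length) pair per agent. A crawler row with a different status, or a 200 with a much shorter body than the browser row, is the signal worth investigating.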
The commercial cost of blocking AI crawlers
This is not a theoretical problem. Blocking AI crawlers has a direct, measured, and growing traffic cost.
A December 2025 study by researchers at Rutgers Business School and the Wharton School found that news publishers who blocked AI crawlers experienced a 23.1% decline in monthly visits. The study analyzed traffic data from October 2022 through June 2025 using SimilarWeb and Comscore data. The mechanism: AI systems act as discovery channels that drive indirect referrals. When you block them, you lose not just direct AI traffic but the downstream visits that AI-generated recommendations produce.
ChatGPT alone sent 293.5 million referral visits to websites in April 2025, up 53% from January 2025. AI referral traffic to websites grew 527% in the first five months of 2025. That growth rate has not slowed. AI search traffic currently represents approximately 1% of all web traffic and is growing at over 100% year-over-year.
The conversion data makes it more urgent. AI search traffic does not just visit. It converts. Perplexity referral traffic converts at 10.5% compared to 1.76% for Google organic in some verticals. Multiple studies show AI-referred visitors convert at 4 to 5 times higher rates in research-heavy verticals like B2B, professional services, and finance.
For a business receiving 100,000 monthly visits, blocking AI search crawlers today means forgoing roughly 1,000 direct AI visits. But the Rutgers/Wharton data suggests the total impact is significantly larger than the direct number alone, because AI systems also act as discovery channels that drive indirect referrals through other sources. When AI stops recommending you, the downstream traffic from those recommendations disappears too.
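The direct figure is simple arithmetic, sketched here in Python. The 1% share comes from the traffic data cited above; the projection is only an illustration that assumes total traffic stays flat and that AI traffic keeps doubling each year, which is an assumption rather than a forecast. Indirect referrals are not modeled.

```python
def direct_ai_visits(monthly_visits: int, ai_share: float = 0.01) -> int:
    """Estimate monthly visits arriving directly from AI search."""
    return round(monthly_visits * ai_share)


def projected_ai_visits(monthly_visits: int, years: int,
                        ai_share: float = 0.01,
                        yearly_growth: float = 1.0) -> int:
    """Project direct AI visits, assuming the growth rate simply compounds."""
    return round(monthly_visits * ai_share * (1 + yearly_growth) ** years)


print(direct_ai_visits(100_000))        # 1000
print(projected_ai_visits(100_000, 2))  # 4000
```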
Training bots vs. search bots: a critical distinction
Not all AI crawlers serve the same purpose, and the decision to block should not be binary.
Training bots collect data for model training. Blocking them has zero direct traffic impact. They do not send visitors to your site. Many publishers block training bots for legitimate intellectual property reasons. That decision is defensible.
Search and retrieval bots retrieve content in real time to build answers for user queries. Blocking these bots means your content cannot appear in AI search results. This is the equivalent of deindexing yourself from a search engine.
| Category | User Agent | Operator | Purpose | Block = Lost Traffic? |
|---|---|---|---|---|
| Training | GPTBot/1.0 | OpenAI | Model training data | No |
| Training | Google-Extended | Google | Gemini training | No |
| Training | ClaudeBot/1.0 | Anthropic | Model training data | No |
| Training | CCBot/2.0 | Common Crawl | Open training corpus | No |
| Search | OAI-SearchBot/1.0 | OpenAI | ChatGPT Search results | Yes (900M+ users) |
| Search | ChatGPT-User/1.0 | OpenAI | User-shared URL fetching | Yes (900M+ users) |
| Search | PerplexityBot/1.0 | Perplexity | Perplexity search results | Yes (30-45M users) |
| Search | Claude-SearchBot/1.0 | Anthropic | Claude web search | Yes (~19M users) |
Dark Visitors tracks over 200 distinct AI crawler user agents actively scanning the web. The taxonomy matters: AI Data Scrapers (training), AI Search Crawlers (real-time retrieval), AI Assistants (user-triggered fetching), and AI Agents (autonomous browsing). Your robots.txt and WAF rules should treat these categories differently.
The problem is that most security tools do not make this distinction. Cloudflare's bot management, Akamai's bot detection, and AWS WAF's bot control treat all non-human traffic through the same lens. Unless you have specifically configured rules to allow AI search crawlers, the default behavior is to challenge or block them alongside everything else.
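Making the distinction yourself can start with something as small as a lookup on the User-Agent header. The Python sketch below sorts a user-agent string into the two categories used in this article; the token lists mirror the crawlers named here, the function name is hypothetical, and a header match on its own does not prove the request really comes from that operator.

```python
# Tokens mirror the crawlers named in this article, lowercased for matching.
TRAINING_TOKENS = ("gptbot", "google-extended", "claudebot", "ccbot")
SEARCH_TOKENS = ("oai-searchbot", "chatgpt-user", "perplexitybot", "claude-searchbot")


def classify_user_agent(user_agent: str) -> str:
    """Return 'search', 'training', or 'other' for a User-Agent header value."""
    ua = user_agent.lower()
    if any(token in ua for token in SEARCH_TOKENS):
        return "search"
    if any(token in ua for token in TRAINING_TOKENS):
        return "training"
    return "other"


print(classify_user_agent("OAI-SearchBot/1.0"))  # search
print(classify_user_agent("GPTBot/1.0"))         # training
```

Running this over the User-Agent column of an access log gives a first picture of which category of AI crawler is actually reaching the origin.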
The cross-platform visibility trap
The platforms where AI blocking matters most are also the most diverse in how they retrieve content.
ChatGPT cites Wikipedia and encyclopedic content in 47.9% of its responses. Perplexity cites Reddit in 46.7% of responses. Only 11% of domains are cited by both ChatGPT and Perplexity. Google AI Overviews draw from its existing search index, which uses Googlebot (a different crawler entirely).
If your WAF blocks PerplexityBot but allows Googlebot, you are visible in Google AI Overviews but invisible in Perplexity. If your robots.txt blocks OAI-SearchBot, you are invisible in ChatGPT Search but potentially visible everywhere else. These are trade-offs most businesses are making without knowing it.
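The trade-off can be written down as a small lookup. In this Python sketch, the mapping from AI surface to crawler follows the pairings described in this article, and the function name `visibility` is hypothetical. It is a simplification: a platform may reach content through more than one crawler.

```python
# Which crawler feeds which AI surface, as described in this article.
SURFACE_CRAWLER = {
    "ChatGPT Search": "OAI-SearchBot",
    "Perplexity": "PerplexityBot",
    "Claude web search": "Claude-SearchBot",
    "Google AI Overviews": "Googlebot",
}


def visibility(blocked_crawlers: set[str]) -> dict[str, bool]:
    """Return, per AI surface, whether its crawler can still reach the site."""
    blocked = {name.lower() for name in blocked_crawlers}
    return {surface: crawler.lower() not in blocked
            for surface, crawler in SURFACE_CRAWLER.items()}


for surface, visible in visibility({"PerplexityBot"}).items():
    print(f"{surface}: {'visible' if visible else 'invisible'}")
```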
How to find out what AI crawlers see when they visit your site
Most businesses discover they are blocking AI crawlers only when they notice their content is absent from AI-generated answers. By then, the cost has already accumulated. The challenge is that this problem is designed to be invisible to you.
You can check your robots.txt manually. Visit yoursite.com/robots.txt and look for directives like these:
# This blocks AI training AND search crawlers (common default)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
If your robots.txt looks like this, every AI crawler listed is blocked from your entire site. A more targeted approach blocks training while allowing search:
# Block training crawlers (no traffic impact)
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow search crawlers (these send you traffic)
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
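A robots.txt file like the targeted one above can also be checked programmatically. This sketch uses `urllib.robotparser` from Python's standard library on the rules as a string; the helper name `robots_access` and the agent list are illustrative. Individual crawlers may interpret edge cases differently from Python's parser, so treat the result as a sanity check.

```python
from urllib.robotparser import RobotFileParser

TARGETED_ROBOTS_TXT = """\
# Block training crawlers (no traffic impact)
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow search crawlers (these send you traffic)
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
"""

AGENTS = ["GPTBot", "Google-Extended", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def robots_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Report, per user agent, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AGENTS}


access = robots_access(TARGETED_ROBOTS_TXT, "https://yoursite.com/pricing")
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

For a live site, the parser's `set_url()` and `read()` methods fetch the file directly.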
But robots.txt is only half the picture.
CDN and WAF blocking is harder to detect because it happens at a layer you cannot see from your browser. Challenge pages are returned only to bots, not to human visitors. You will not encounter the block by visiting your own site. Your site loads perfectly for you while returning a "Just a moment" challenge page to every AI crawler that tries to read it. This blocking often does not even appear in your security event logs.
You need a tool that tests your site from the crawler's perspective.
GeoScored's AI Visibility Screening detects when your website's security tools are blocking AI crawlers. The screening identifies the specific barrier, whether it is Cloudflare, Akamai, AWS WAF, a rate limiter, or a CAPTCHA provider. It shows you which AI systems are affected and which can still reach your content. And it provides specific remediation steps for your particular blocking source, so you can fix the problem without compromising your security.
Once you know what is blocking you, the fix is targeted
You do not need to lower your security posture. The fix is about precision, not removal.
For Cloudflare: Navigate to Dashboard > Security > Bots > AI Crawl Control. Cloudflare now separates AI crawlers into categories (training, search, assistant). You can block training crawlers while allowing search crawlers with individual toggles. If you are on the free plan, the one-click "block AI bots" toggle blocks all categories indiscriminately. Turn off that specific toggle (your other security protections remain active) and use robots.txt for selective training-bot blocking instead.
For robots.txt: Add explicit Allow directives for search-related AI crawlers while blocking training-only crawlers, as shown in the examples above. The key is making the distinction intentional rather than defaulting to block-all.
For AWS WAF: Create a custom rule in your Web ACL that matches the User-Agent header against known AI search crawler strings (OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot) and sets the action to Allow. Give the rule a priority that is evaluated before your Bot Control rule group: Allow is a terminating action, so matching requests skip the bot control rules that come after it.
For Akamai: In Bot Manager, add the AI search crawler user agents to your verified bot allowlist. Akamai's bot categories can be configured to treat AI search crawlers differently from unverified automated traffic.
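As an illustration of the AWS WAF step, the Python sketch below builds a rule definition of the kind the Web ACL JSON editor accepts. The field names follow the WAFv2 rule format as I understand it and should be checked against the AWS documentation before use; the rule name, metric name, and priority value are hypothetical. Matching on User-Agent alone trusts a header that anyone can forge, so this is a starting point and not a verified-bot check.

```python
import json

# Lowercased tokens for the search crawlers named in this article.
SEARCH_CRAWLER_TOKENS = ["oai-searchbot", "chatgpt-user", "perplexitybot", "claude-searchbot"]


def user_agent_contains(token: str) -> dict:
    """One statement: the lowercased User-Agent header contains `token`."""
    return {
        "ByteMatchStatement": {
            "SearchString": token,
            "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
            "TextTransformations": [{"Priority": 0, "Type": "LOWERCASE"}],
            "PositionalConstraint": "CONTAINS",
        }
    }


def allow_search_crawlers_rule(priority: int = 0) -> dict:
    """Build an Allow rule that matches any of the search crawler tokens."""
    return {
        "Name": "allow-ai-search-crawlers",  # hypothetical rule name
        "Priority": priority,  # lower numbers are evaluated first
        "Statement": {
            "OrStatement": {
                "Statements": [user_agent_contains(t) for t in SEARCH_CRAWLER_TOKENS]
            }
        },
        "Action": {"Allow": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "allowAiSearchCrawlers",  # hypothetical metric name
        },
    }


print(json.dumps(allow_search_crawlers_rule(), indent=2))
```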
Blocking AI training crawlers is a legitimate choice. Blocking AI search crawlers is a business decision with growing revenue consequences. The worst outcome is making neither decision and letting your CDN's default settings decide for you.
See what AI crawlers see when they visit your site. Run your AI Visibility Screening at geoscored.ai. It takes under two minutes, it is free, and no account is required.
Sources
- OtterlyAI: The AI Citations Report 2026 - OtterlyAI, January-February 2026. Analysis of 1M+ AI citations. 73% of sites have technical barriers blocking AI crawler access.
- Cloudflare: Content Independence Day - Cloudflare Blog, July 2025. Cloudflare blocks AI crawlers by default on every new domain, affecting 20% of the public web.
- Cloudflare Helps Content Creators Regain Control - Cloudflare, September 2024. Over 80% of customers chose to block AI bots after one-click feature launched.
- MIT Technology Review: Cloudflare Will Now by Default Block AI Bots - MIT Technology Review, July 2025.
- PPC.land: Blocking AI Crawlers Backfired - PPC.land, December 2025. Rutgers Business School and Wharton School study: publishers who blocked AI crawlers experienced 23.1% decline in monthly visits.
- Search Engine Land: Insights from ChatGPT Agent Mode - Search Engine Land, October 2025. 63% of ChatGPT agent visits bounced due to HTTP errors, CAPTCHAs, or bot blocking.
- DemandSage: ChatGPT Statistics - DemandSage, January 2026. ChatGPT reached 883 million monthly users.
- DemandSage: Perplexity AI Statistics - DemandSage, 2026. Perplexity processes over 1 billion queries per month.
- TechCrunch: Google's AI Overviews Have 2B Monthly Users - TechCrunch, July 2025.
- Search Engine Land: AI Referral Traffic - Search Engine Land, 2025. ChatGPT sent 293.5M referral visits in April 2025, up 53% from January.
- Search Engine Land: AI Traffic Up 527% - Search Engine Land, 2025.
- upGrowth: AI Traffic Share Report 2026 - upGrowth, 2026. AI search traffic represents approximately 1% of all web traffic, growing 100%+ YoY.
- SE Ranking: AI Traffic Research Study - SE Ranking, 2025. Perplexity referral conversion at 10.5% vs. Google organic at 1.76%.
- Dark Visitors: AI Bot User Agent Database - Dark Visitors, 2024-2026. 200+ AI crawler user agents tracked.
- Averi.ai: ChatGPT vs. Perplexity vs. Google AI Mode Citation Benchmarks - Averi.ai, 2026. ChatGPT cites Wikipedia 47.9%; Perplexity cites Reddit 46.7%; only 11% of domains cited by both.
- Harvard Business Review: LLMs Are Overtaking Search - HBR, March 6, 2026.
- Cloudflare: AI Crawl Control - Cloudflare, 2025-2026. AI crawler taxonomy and granular control settings.
- Cloudflare Community Forum: Invisible AI Crawler Blocking - Cloudflare Community, 2025. Default blocking does not appear in zone-level security event logs.
- eBay v. Bidder's Edge - 2000. First major anti-scraping ruling establishing trespass to chattels framework.
- Craigslist Inc. v. 3Taps Inc. - 2012-2015. $1M settlement. Established that circumventing IP blocks after cease-and-desist constitutes unauthorized access.
- hiQ Labs v. LinkedIn - 2017-2022. Supreme Court vacated and remanded. Settled with permanent injunction.
- PhocusWire: U.S. Court Rules Against Booking.com in Ryanair Lawsuit - PhocusWire. Ryanair wins CFAA jury verdict over screen-scraping.
- Grand View Research: Bot Security Market - Grand View Research. Bot management market ~$1.2B in 2024, projected $4.9-5.6B by 2030.
- Akamai Revolutionizes Bot Management - PR Newswire, February 2016. Akamai Bot Manager launch.
- Cloudflare Bot Management: Machine Learning and More - Cloudflare Blog, 2019. ML-based bot detection launch.
- Stan Ventures: Major Websites Block OpenAI's GPTBot - Stan Ventures, August 2023. 25.9% of top 1,000 blocked GPTBot within six weeks of launch.
- TechCrunch: NYT Wants OpenAI and Microsoft to Pay - TechCrunch, December 2023.
- Cloudflare: Declaring Your AIndependence - Cloudflare Blog, July 2024. One-click AI bot blocking; 1M+ customers activated.
- BuzzStream: 336% Increase in AI Crawler Blocking - BuzzStream, 2025. Year-over-year growth in sites blocking AI crawlers.
- Cloudflare: AI Labyrinth - Cloudflare Blog, March 2025. Honeypot that feeds unauthorized AI crawlers into a maze of AI-generated decoy pages.