Cloudflare accuses Perplexity of evading no-crawl rules with stealth bots, masking identity, and bypassing robots.txt, prompting a verified bot delisting.
Cloudflare is observing stealth crawling behaviour from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from its declared user agent, when it is presented with a network block it appears to obscure its crawling identity in an attempt to circumvent the website’s preferences.
Cloudflare sees continued evidence that Perplexity is repeatedly modifying its user agent and changing its source ASNs to hide its crawling activity, as well as ignoring, or sometimes failing to even fetch, robots.txt files. The Internet has changed rapidly over the past three decades, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Because Perplexity’s observed behaviour is incompatible with those preferences, Cloudflare has de-listed it as a verified bot and added heuristics to its managed rules that block this stealth crawling.
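For context, the opt-out signal at issue looks like the following. A site that wants to exclude Perplexity’s declared crawlers can publish directives such as these in its robots.txt (PerplexityBot and Perplexity-User are the user agents Perplexity publicly documents); stealth crawling under a different identity is precisely what renders such directives ineffective:

```
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```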
How well-meaning bot operators respect website preferences
The Internet has expressed clear preferences on how good crawlers should behave. All well-intentioned crawlers acting in good faith should:
- Be transparent. Identify themselves honestly, using a unique user-agent, a declared list of IP ranges or Web Bot Auth integration, and provide contact information if something goes wrong.
- Be well-behaved netizens. Don’t flood sites with excessive traffic, scrape sensitive data, or use stealth tactics to dodge detection.
- Serve a clear purpose. Whether it’s powering a voice assistant, checking product prices, or making a website more accessible, every bot has a reason to be there. The purpose should be clearly and precisely defined and easy for site owners to look up publicly.
- Separate bots for separate activities. Perform each activity from a unique bot. This makes it easy for site owners to decide which activities they want to allow. Don’t force site owners to make an all-or-nothing decision.
- Follow the rules. That means checking for and respecting website signals like robots.txt, staying within rate limits, and never bypassing security protections. A minimal sketch of the robots.txt check appears just after this list.
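To make the last point concrete, here is a minimal sketch, using Python’s standard urllib.robotparser, of how a well-behaved crawler checks robots.txt before fetching a page. The bot name and URL are hypothetical, and a production crawler would add error handling, caching of robots.txt, and rate limiting:

```python
from urllib import robotparser
from urllib.parse import urlsplit
import urllib.request

# Hypothetical identity: a transparent bot declares a unique, documented
# user agent instead of impersonating a browser.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"

def fetch_if_allowed(url: str) -> bytes | None:
    """Fetch url only if the site's robots.txt permits our user agent."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    if not rp.can_fetch(USER_AGENT, url):
        # Respect the directive: do not crawl, and never retry the same
        # request under a different identity.
        return None

    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```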
More details are outlined in Cloudflare’s official Verified Bots Policy Developer Docs.
How can users protect themselves?
All the undeclared crawling activity that Cloudflare observed from Perplexity’s hidden user agent was scored by the company’s bot management system as a bot and was unable to pass managed challenges. Any bot management customer with an existing block rule in place is already protected. Customers who don’t want to block traffic outright can set up rules to challenge requests, giving real humans an opportunity to proceed; customers with existing challenge rules are likewise already protected. Lastly, Cloudflare added signature matches for the stealth crawler to its managed rule that blocks AI crawling activity. This rule is available to all customers, including those on free plans.
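As an illustration of the challenge option, a custom WAF rule along these lines could apply a managed challenge to likely-automated traffic that is not a verified bot. The field names come from Cloudflare’s documented rule expression language, but the score threshold here is an assumption for illustration, not a recommendation from the post:

```
(cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)
```

With the rule action set to Managed Challenge, real users can proceed while automated clients that cannot pass the challenge are stopped.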
What’s next?
It’s been just over a month since Cloudflare announced Content Independence Day, giving content creators and publishers more control over how their content is accessed. Today, over two and a half million websites have chosen to completely disallow AI training through the company’s managed robots.txt feature or its managed rule blocking AI crawlers. Every Cloudflare customer can now selectively decide which declared AI crawlers may access their content in accordance with their business objectives.
Cloudflare is actively working with technical and policy experts around the world, including through IETF efforts to standardize extensions to robots.txt, to establish clear and measurable principles that well-meaning bot operators should abide by. This is an important next step in this quickly evolving space.