Unmasking the AI Visitors on Your Website: A Comprehensive Guide for Digital Agencies

The Rise of AI Bots and Crawlers: Benefits, Risks, and What You Need to Know
In the rapidly evolving digital landscape, artificial intelligence (AI) is no longer a futuristic concept; it's an integral part of our online world. From Google's omnipresent crawlers to the burgeoning presence of AI models like ChatGPT, Claude, and Perplexity, these automated entities are constantly interacting with websites. Understanding their behavior is no longer a niche concern for tech geeks; it’s a critical aspect of modern SEO optimisation and website performance auditing. This blog post will delve deep into the methods you can employ to identify, monitor, and manage AI bot traffic, focusing on practical approaches using platforms like Cloudflare Bot Management, Google Analytics, and direct server log analysis. By the end of this guide, you'll have a clear roadmap to demystify the AI presence on your website, allowing you to make informed decisions about your content strategy, infrastructure scaling, and overall data integrity. Let's embark on this journey to unmask the AI visitors and harness their presence for your benefit.
Why You Need to Pay Attention to AI Bots
The rise of AI crawlers is a double-edged sword for website owners. On one hand, these sophisticated bots are essential for the discoverability and indexing of your content in the new era of AI-driven search. On the other, they can introduce complexities that, if left unmanaged, could impact your site's performance and data accuracy. Understanding this dichotomy is the first step towards effective AI bot management.
The Good: AI as an Ally for Discoverability & SEO Impact
Legitimate AI crawlers, such as Googlebot, Bingbot, and the newer generation of AI search crawlers like OAI-SearchBot, play a crucial role in ensuring your content is found by users. They index pages, understand context, and may cite your content in AI-driven search experiences, potentially driving new referral traffic. For instance, Googlebot is fundamental to your site's visibility in Google Search results; without it, your content would remain largely undiscovered.[1] With the advent of generative AI and AI-powered search experiences, the role of these crawlers is expanding. AI search engines often provide summarised answers to user queries and may link back to the original source of information, so allowing legitimate AI search crawlers to access your site could open new avenues of referral traffic and brand visibility. For example, if your well-researched article provides the perfect answer to an AI search query, the AI might cite your page and send interested users directly to your site.
The Bad: Hidden Costs and Skewed Analytics Data
Each AI bot request consumes server resources and bandwidth, potentially affecting site performance for human users.[2] Unfiltered AI bot traffic can skew your analytics, inflate page views, and mislead your marketing analysis. For digital marketers and agencies relying on AI-powered analytics tools, this can lead to costly misinterpretations. Another growing concern is data scraping. While search engine crawlers index content for legitimate search purposes, some AI data scrapers are designed to harvest vast amounts of information to train large language models or for other commercial uses, often without explicit permission. This raises questions about content ownership, intellectual property, and the fair use of online data. Understanding which bots are scraping your site, and for what purpose, becomes paramount in protecting your digital assets.
The Ugly: Malicious Bots and Website Security
Some bots are outright malicious, including scrapers, spambots, and DDoS attackers. It's essential to differentiate between helpful AI bots and harmful automated traffic.
A sudden, unexplained surge in traffic, unusual navigation patterns, or suspicious IP addresses can all be indicators of malicious bot activity. Effective bot management isn't just about optimising for AI; it's about safeguarding your digital presence from all unwanted automated threats.
In the following sections, we will explore practical methods to distinguish between these different types of visitors, giving you the power to understand, control, and leverage the AI presence on your website.
References:
[1] Googlebot and Other Google Crawler Verification | Documentation
[2] AI Crawlers Are Reportedly Draining Site Resources & Skewing Analytics
Part 1: Your First Line of Defence - Using Cloudflare to Uncover AI Bot Activity
For many website owners and agencies, Cloudflare serves as a powerful shield, offering a suite of services from DDoS protection to content delivery. Beyond its security and performance enhancements, Cloudflare has become an invaluable tool for understanding and managing the influx of AI bot traffic. Its dedicated Bot Management features provide granular insights into who (or what) is crawling your site, offering a crucial first line of defence and analysis.
Introduction to Cloudflare’s Bot Management
Cloudflare’s Bot Management is designed to identify and mitigate automated traffic, distinguishing between legitimate and malicious bots. It uses machine learning, behavioural analysis, and fingerprinting to classify incoming requests. This gives you visibility into bot activity and lets you take appropriate action, so that your website remains accessible to human users and beneficial bots while fending off unwanted automated interactions.[3]
Finding the Hidden Data: The AI Audit Tab
One of Cloudflare’s most significant recent additions for understanding AI bot activity is the AI Audit tab. This feature provides a summary of the crawling behaviour of popular and known AI services accessing your site. It’s a game-changer for agencies and website owners looking to understand the specifics of AI scanning. To access this data, navigate to your Cloudflare dashboard, select the site you wish to analyse, and look for the AI Audit tab in the left-side navigation bar. Here, you’ll find a clear breakdown of AI bot activity, including the names of the crawlers, their operators, and the number of requests they’ve made to your site.
Figure: Cloudflare AI Audit Tab Overview

This detailed breakdown allows you to identify which AI services are most actively engaging with your content. Cloudflare achieves this by examining the `User-Agent` HTTP header in incoming requests, which often contains the bot’s name. Even if a bot doesn’t explicitly identify itself, Cloudflare employs other heuristics, such as IP addresses and behavioural patterns, to classify it.[4]
Understanding these distinctions is crucial. For instance, OpenAI uses `GPTBot` for data scraping to train models, while `OAI-SearchBot` is used for their new AI search engine. The former might simply consume your content, while the latter could potentially drive traffic back to your site through AI search results. Cloudflare’s AI Audit tab helps you differentiate between these types of interactions, allowing you to tailor your strategy accordingly.
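To illustrate the `User-Agent` matching described above, here is a minimal Python sketch that labels a request by crawler name and purpose. It is not how Cloudflare classifies traffic. The `CRAWLER_PURPOSE` table and the `classify_user_agent` function are hypothetical names, and the table covers only the two OpenAI crawlers mentioned in this section.

```python
from typing import Optional, Tuple

# Hypothetical lookup table, built only from the two crawlers named in the text.
CRAWLER_PURPOSE = {
    "GPTBot": "model training",
    "OAI-SearchBot": "AI search",
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (crawler, purpose) if a known token appears in the User-Agent."""
    lowered = user_agent.lower()
    for token, purpose in CRAWLER_PURPOSE.items():
        if token.lower() in lowered:
            return token, purpose
    return None  # Not a crawler this sketch knows about


# Illustrative header value, not taken from a real request.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

A plain substring match like this trusts whatever the client sends, which is why real bot management also looks at IP addresses and behaviour.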
Taking Control: Blocking and Managing AI Bots in Cloudflare
Once you’ve identified the AI bots interacting with your site, Cloudflare provides powerful tools to manage their access. Whether you want to block all AI bots, allow only specific ones, or implement nuanced policies, Cloudflare’s Bot Management and Web Application Firewall (WAF) features offer the flexibility you need.
For quick and decisive action, Cloudflare offers a one-click option to block all known AI bots and crawlers. This is particularly useful if you need a temporary pause to assess your strategy or if you decide to restrict all AI scraping. To enable it, navigate to the Security tab in your Cloudflare Dashboard, then select Bots. Look for the “Block AI Scrapers and Crawlers” card and switch the toggle to “On”. This blocks AI-related bots based on a list maintained by Cloudflare.
Figure: Enable or disable AI scraper blocking in one click

Figure: Cloudflare WAF Rule Creation for Bot Management

References:
[3] Cloudflare Bot Management & Protection
[4] Start auditing and controlling the AI models accessing your content
Part 3: Digging Deeper – Uncovering AI Insights in GA4
While Cloudflare provides an excellent overview and control mechanism for AI bot traffic at the network edge, your website analytics platform offers a different, yet equally crucial, perspective. Google Analytics 4 (GA4) is the modern standard for website analytics, and while it has built-in bot filtering, understanding how to create custom reports is essential for truly uncovering AI-driven insights.
Google Analytics 4 (GA4) and the AI Traffic Challenge
GA4 automatically excludes traffic from known bots and spiders, which is helpful for maintaining clean data. However, the landscape of AI crawlers is rapidly evolving, and GA4’s default filtering might not catch every new or emerging AI bot. More importantly, even legitimate AI crawlers can skew your understanding of human user behaviour if their activity isn't segmented and analysed separately. For instance, a high volume of AI bot traffic could artificially inflate page views or reduce engagement metrics, making it difficult to assess the true performance of your content with human audiences.[5]
The challenge lies in identifying these AI-driven interactions within your GA4 data. Traditional analytics often focus on human user behaviour, but with the rise of AI search engines and large language models, understanding how these automated entities interact with your site is becoming increasingly vital for SEO and content strategy. You need to be able to differentiate between a human user exploring your site and an AI bot systematically crawling it.
Creating Custom AI Traffic Reports in GA4
To gain a clearer picture of AI traffic in GA4, you’ll need to create custom reports. This involves building a custom segment to identify AI sources and then creating a custom channel group for ongoing monitoring. The process takes a few steps, but it provides invaluable insights into the impact of AI on your website traffic.
Step 1: Start an Exploration & Create an AI Traffic Segment
Begin by opening your GA4 property and navigating to the Explore section. Start a new blank exploration. This is where you’ll build your custom report.
1. In the exploration, add Session Source/Medium as a dimension and Sessions as a metric. You might also want to include Engaged Sessions or Key Events for additional context on how these sessions behave.
2. Under Segments, create a new custom Session segment. Name this segment something descriptive, like “AI Sources”.
3. Add a condition to this segment: Session Source, Matches regex, and then insert a regular expression (regex) pattern that covers known AI platform sources. These are the referring domains of AI tools, meaning visits that arrive from them, rather than crawler user agents. A good starting point, which you can update as new AI platforms emerge, is:
`.*(aitastic\.app|bnngpt\.com|chat-gpt\.org|chatgpt\.com|claude\.ai|copilot\.microsoft\.com|copy\.ai|edgepilot|edgeservices|gemini\.google\.com|iask\.ai|neeva|nimble\.ai|openai\.com|perplexity|writesonic\.com).*`
This regex pattern captures traffic from a wide range of AI platforms. Keep an eye on your traffic sources and update the pattern as the AI landscape evolves. Tools like ChatGPT or Claude can help you generate or refine regex patterns if you are not comfortable writing them yourself.[6]
4. Click Save to property so you can reuse this segment in the future, then click Apply to see the data in your exploration.
Figure: Creating an AI Traffic Segment in GA4
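Before pasting a pattern like this into GA4, it can help to sanity-check it locally. The sketch below uses Python's `re` module, whose behaviour may differ from GA4's regex engine in edge cases, so treat it as a rough check, not a guarantee. The function name `is_ai_source` and the sample source values are hypothetical.

```python
import re

# The segment pattern from Step 1, with the domain written as chat-gpt.org.
AI_SOURCE_PATTERN = re.compile(
    r".*(aitastic\.app|bnngpt\.com|chat-gpt\.org|chatgpt\.com|claude\.ai"
    r"|copilot\.microsoft\.com|copy\.ai|edgepilot|edgeservices"
    r"|gemini\.google\.com|iask\.ai|neeva|nimble\.ai|openai\.com"
    r"|perplexity|writesonic\.com).*"
)


def is_ai_source(session_source: str) -> bool:
    """True when the whole source string matches the AI pattern."""
    return AI_SOURCE_PATTERN.fullmatch(session_source) is not None


# Made-up source values, for illustration only.
for source in ["chatgpt.com", "perplexity.ai", "google", "newsletter"]:
    print(f"{source}: {is_ai_source(source)}")
```

Running a handful of your real Session Source values through a check like this is a quick way to spot AI platforms the pattern misses.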

Step 2: Visualize AI Traffic Over Time
Once your segment is applied, you can visualize the growth of AI traffic over time. In your exploration:
1. Change the Visualization to a line chart.
2. Select Session Source/Medium as the breakdown dimension and add Sessions to the Values section.
3. Set the Time Range to a broader view, such as the last 90 days, to observe trends.
4. Adjust the Granularity from Day to Week for a clearer, less noisy view of changes.
This visualization will help you understand the historical impact and growth of AI traffic on your site, allowing you to spot trends and anomalies.
Figure: GA4 AI Traffic Over Time Visualization
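To see why weekly granularity reads as less noisy, the following Python sketch sums daily session counts into weekly buckets. It assumes ISO weeks (Monday start), which may not match GA4's own week boundaries. The function name `sessions_per_week` and the daily figures are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Tuple


def sessions_per_week(daily: Dict[date, int]) -> Dict[Tuple[int, int], int]:
    """Sum daily session counts into (ISO year, ISO week) buckets."""
    weekly: Dict[Tuple[int, int], int] = defaultdict(int)
    for day, sessions in daily.items():
        iso_year, iso_week, _weekday = day.isocalendar()
        weekly[(iso_year, iso_week)] += sessions
    return dict(weekly)


# Made-up daily AI session counts; real values would come from a GA4 export.
daily_ai_sessions = {
    date(2025, 7, 14): 3,
    date(2025, 7, 15): 0,
    date(2025, 7, 16): 7,
    date(2025, 7, 21): 5,
}
print(sessions_per_week(daily_ai_sessions))
```

Days with zero or very few sessions disappear into the weekly total, which is what makes the trend line easier to read.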

Step 3: Creating an AI Traffic Channel Group
For ongoing reporting and easier integration into standard GA4 reports, you can create a custom channel group for AI traffic. This ensures that AI traffic is consistently categorized and visible alongside your other traffic channels.
1. In the GA4 Admin section, go to Channel groups under Data display.
2. Click Create new channel group and give it a clear name, such as “Custom channel group with AI”.
3. Add a new channel within this group, naming it “AI”. Set the channel conditions to Source, matches regex, and insert the same regex pattern you used for your AI Sources segment.
4. Crucially, reorder the new “AI” channel so that it sits above “Referral” and other general channels and is therefore evaluated first. This prevents AI traffic from being miscategorized. Save the group.[6]
Figure: Creating a Custom AI Channel Group in GA4
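The reason ordering matters can be shown with a small Python sketch. It assumes channel rules are checked top to bottom and the first match wins. The rule definitions are simplified stand-ins (GA4's real Referral definition is more involved), the pattern is shortened, and `assign_channel` is a hypothetical name.

```python
import re
from typing import Callable, List, Tuple

# Shortened version of the AI pattern, for illustration only.
AI_PATTERN = re.compile(r".*(chatgpt\.com|claude\.ai|perplexity).*")

# A rule is a channel name plus a test on (source, medium).
Rule = Tuple[str, Callable[[str, str], bool]]

ai_rule: Rule = ("AI", lambda source, medium: AI_PATTERN.fullmatch(source) is not None)
# Simplified assumption: anything with medium "referral" counts as Referral.
referral_rule: Rule = ("Referral", lambda source, medium: medium == "referral")


def assign_channel(source: str, medium: str, rules: List[Rule]) -> str:
    """Return the name of the first rule that matches."""
    for name, matches in rules:
        if matches(source, medium):
            return name
    return "Unassigned"


# The same session lands in different channels depending on rule order.
print(assign_channel("chatgpt.com", "referral", [ai_rule, referral_rule]))
print(assign_channel("chatgpt.com", "referral", [referral_rule, ai_rule]))
```

With the AI rule first, the session is labelled “AI”; with Referral first, the broader rule claims it before the AI rule is ever checked.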

Step 4: View the AI Channel Group in the Traffic Acquisition Report
Once your new channel group has been collecting data for some time (it won’t apply retroactively), you can view it in your standard reports:
1. Go to Reports > Acquisition > Traffic Acquisition.
2. At the top of the data table, change the primary dimension to your newly created “Custom channel group with AI” (or whatever you named it).
You will now see “AI” as a distinct line item in your traffic acquisition report, providing a clear overview of its performance.
Figure: Viewing AI Traffic in GA4 Traffic Acquisition Report

Excluding Bot Traffic for Cleaner Data
While the custom reports help you understand AI traffic, it’s also important to ensure your core analytics data accurately reflects human user behaviour. GA4 excludes traffic from known bots and spiders automatically. This exclusion is always on: there is no setting to enable or disable it, and GA4 does not report how much traffic it excluded. If you navigate to Admin > Data Settings > Data Filters, you will find the “Internal Traffic” and “Developer Traffic” filter types, but no separate bot filter to activate. The automatic exclusion draws on Google’s own research and the International Spiders and Bots List maintained by the IAB, helping to keep your data clean for human-centric analysis.[5] Bots that are not on those lists can still reach your reports, which is why the custom segment and channel group above remain useful.
Figure: GA4 Bot Traffic Exclusion Settings

References
[5] [GA4] Known bot-traffic exclusion - Analytics Help
[6] Tracking AI Traffic in GA4: A Step-by-Step Guide | Two Octobers
Part 4: The Raw Data - Analysing Server Logs for Ultimate Proof
While Cloudflare provides a high-level overview and Google Analytics offers valuable insights into user behaviour, server logs represent the unvarnished, ground-truth record of every single request made to your website. For those who want the ultimate proof of AI bot activity, delving into server logs is the definitive method. This raw data, while more technical to analyse, provides unparalleled detail and accuracy.
What are Server Logs and Why are They Important?
Server logs are plain text files that are automatically created and maintained by a server, recording every action that occurs on it. For a web server, this means logging every request for a file, whether it’s an HTML page, an image, a CSS file, or a JavaScript file. Each log entry contains a wealth of information, including:
* IP Address: The unique address of the client making the request.
* Timestamp: The date and time of the request.
* Request Method: The type of request (e.g., GET, POST).
* Requested URL: The specific file or page that was requested.
* HTTP Status Code: The server’s response to the request (e.g., 200 for success, 404 for not found).
* User Agent: A string of text that identifies the client making the request, such as a browser or, crucially, a bot.
Server logs are incredibly important because they capture all traffic, including bots that may not execute JavaScript and therefore wouldn’t be tracked by Google Analytics. They provide a complete and unfiltered view of every interaction with your site, making them the most reliable source for identifying and analysing bot activity.[7]
Accessing and Analysing Your Server Logs
Accessing your server logs will depend on your hosting provider and server configuration. Most hosting providers offer access to raw log files through their control panels, such as cPanel or Plesk. You can typically download these log files as compressed archives (e.g., `.gz` files) and then analyse them on your local machine.
Once you have your log files, the key to identifying AI bots is to look for their user agent strings. As discussed above, most legitimate bots identify themselves through their user agent. For example, Googlebot’s user agent string typically includes “Googlebot”, and OpenAI’s crawlers use strings like “ChatGPT-User” or “GPTBot”. To analyse your logs, you can use command-line tools like `grep` on Linux or macOS, or text editors that support searching through large files. For example, to find all requests from OpenAI’s bots in a log file named `access.log`, you could use the following command:
`grep -iE "GPTBot|ChatGPT-User|OAI-SearchBot" access.log`
This command searches the log file for any lines containing the specified user agent strings, ignoring case, and gives you a list of all interactions from those bots.
Here is a list of common AI bot user agent strings to look for in your server logs:
| Operator | User Agent String |
| --- | --- |
| Google | Googlebot |
| Microsoft | Bingbot |
| OpenAI | ChatGPT-User, GPTBot, OAI-SearchBot |
| Anthropic | ClaudeBot, Claude-Web, anthropic-ai |
| Amazon | Amazonbot |
| Apple | Applebot |
| Perplexity | PerplexityBot, Perplexity-User |
| Meta | Meta-ExternalAgent |
| You.com | YouBot |
| ByteDance | Bytespider |
| Common Crawl | CCBot |
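When you want counts per bot, not just the matching lines, a short script can do the tallying. The Python sketch below takes its tokens from the list above and, like `grep -i`, matches them anywhere in a log line regardless of case. The function name `count_bot_hits` and the sample lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# User agent tokens taken from the list above.
AI_BOT_TOKENS = [
    "Googlebot", "Bingbot", "ChatGPT-User", "GPTBot", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "anthropic-ai", "Amazonbot", "Applebot",
    "PerplexityBot", "Perplexity-User", "Meta-ExternalAgent", "YouBot",
    "Bytespider", "CCBot",
]

TOKEN_PATTERN = re.compile(
    "|".join(re.escape(token) for token in AI_BOT_TOKENS), re.IGNORECASE
)
CANONICAL = {token.lower(): token for token in AI_BOT_TOKENS}


def count_bot_hits(lines: Iterable[str]) -> Counter:
    """Count log lines per bot token, using the first token found in each line."""
    hits: Counter = Counter()
    for line in lines:
        match = TOKEN_PATTERN.search(line)
        if match:
            hits[CANONICAL[match.group(0).lower()]] += 1
    return hits


# Made-up log lines, for illustration only.
sample_lines = [
    '203.0.113.5 - - [x] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [x] "GET /about HTTP/1.1" 200 900 "-" "ClaudeBot/1.0"',
    '198.51.100.7 - - [x] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/128.0"',
]
print(count_bot_hits(sample_lines).most_common())
```

To run it against a real file, pass an open file handle, for example `count_bot_hits(open("access.log", encoding="utf-8", errors="replace"))`.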
Example Server Log Entry for an AI Bot
172.23.45.67 - - [21/Jul/2025:10:30:00 +0100] "GET /blog/my-latest-article HTTP/1.1" 200 12345 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; GPTBot/1.0; +https://openai.com/bot)"
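In this example, you can clearly see the `GPTBot` token in the user agent string, indicating a visit that identifies itself as one of OpenAI’s crawlers. Because a user agent can be spoofed, treat the string as a strong hint and check the IP address against the operator’s published ranges when you need certainty.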
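The example entry above can be split into the fields listed earlier with a short script. This Python sketch assumes the log uses the common "combined" layout produced by Apache and Nginx, which is an assumption about your server. The pattern and the function name `parse_log_line` are illustrative and would need adjusting for other formats.

```python
import re
from typing import Dict, Optional

# Assumed layout: ip, two unused fields, [timestamp], "request", status,
# size, "referrer", "user agent".
COMBINED_LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one log line, or None if it does not match."""
    match = COMBINED_LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The example entry from the text.
example = (
    '172.23.45.67 - - [21/Jul/2025:10:30:00 +0100] '
    '"GET /blog/my-latest-article HTTP/1.1" 200 12345 "-" '
    '"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 '
    'Mobile Safari/537.36 (compatible; GPTBot/1.0; +https://openai.com/bot)"'
)

entry = parse_log_line(example)
if entry is not None:
    print(entry["ip"], entry["status"], entry["url"])
    print("GPTBot" in entry["user_agent"])
```

Once each line is a dictionary of fields, you can filter on the user agent alone, so a bot name appearing in a URL or referrer no longer counts as a hit.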