Identifying And Analyzing Bot Activity

This is a cleaned up, reformatted, and updated repost of a message thread I started on WMW (webmasterworld.com) in 2008, titled "Quick primer on identifying bot activity".

The article assumes the reader is familiar with server-side technologies used to create dynamic pages, and has a good understanding of the purpose of browser request headers and how to access and use them. This page is more than a bit dated when it comes to mobile browsers and the abilities of the major search engines; I have for the most part left it as I wrote it back in 2008.

Background: Content Scrapers

The main purpose of this article is to describe methods to limit the content which a content scraper can access at any one point in time. As a content producer I want to restrict and stop scrapers from gaining access to the content which I have assembled, so they cannot profit off all the work that has gone into assembling it.

Content scrapers are people or entities that re-brand/steal/copy other people's content, place it on their own site, and reap the benefits from said content. The usual benefit from the new content is more traffic to their website(s), and more high-quality traffic usually leads to more advertising dollars being earned.

The job of a content producer is to limit unwanted traffic and unneeded access to the content while maximizing the target audience's access to it, so that in the end the content producer receives the most benefit from the content provided. The goal is to never send the content to an unwanted recipient.

Things to take note of

There are a few things which will be used in the process of weeding out unwanted visitors.

  • Originating Browser IP
  • User-Agent Header
  • Other browser supplied Headers
  • Referer header, if any
  • The full URL of the page that was requested

With those bits of information, and using the tools listed in this article, the following can be determined (a sketch of collecting these request values follows the list below):
  • Generally, the country of origin of the browsing IP (excluding proxy traffic)
  • The reported browser name and version details, based on the User-Agent alone
  • Keywords used to reach the site from search engines, and the related landing page
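
A minimal Python sketch of gathering these values, assuming a WSGI-style environ dict; the function name and the sample values are placeholders for whatever your server stack actually exposes:

    def collect_request_info(environ):
        """Pull out the bits of request information used by the later checks."""
        headers = {
            key[5:].replace("_", "-").title(): value
            for key, value in environ.items()
            if key.startswith("HTTP_")
        }
        return {
            "ip": environ.get("REMOTE_ADDR", ""),
            "user_agent": headers.get("User-Agent", ""),
            "referer": headers.get("Referer", ""),
            "url": environ.get("PATH_INFO", "") + (
                "?" + environ["QUERY_STRING"] if environ.get("QUERY_STRING") else ""),
            "headers": headers,
        }

    # Example with a trimmed-down environ:
    info = collect_request_info({
        "REMOTE_ADDR": "203.0.113.7",
        "PATH_INFO": "/articles/widgets",
        "QUERY_STRING": "page=2",
        "HTTP_USER_AGENT": "Mozilla/5.0 ...",
        "HTTP_REFERER": "http://www.example.com/search?q=widgets",
    })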

Common blocking actions

a). Send a 403 "Forbidden" status code with no further content, which is usually my choice. I send no content at all to save overall bandwidth, since one of the goals is to limit the resources a scraper can consume, so those resources can instead benefit the site and the real clients who are our target audience.


b). Another approach is to feed them an empty page with no real content and a 200 "OK" status code. This keeps them guessing, trying to figure out whether there is something wrong with their scraper.
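
A rough WSGI-style sketch of both options in Python; the function names are placeholders, and which response to use is up to you:

    def block_forbidden(start_response):
        """Option a): a bare 403 with no body, spending as little bandwidth as possible."""
        start_response("403 Forbidden", [("Content-Length", "0")])
        return [b""]

    def block_empty_page(start_response):
        """Option b): a 200 with an empty page, leaving the scraper to wonder what broke."""
        body = b"<html><head></head><body></body></html>"
        start_response("200 OK", [("Content-Type", "text/html"),
                                  ("Content-Length", str(len(body)))])
        return [body]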

Check for Proxies

Look for "X-Forwarded-For" header used by proxy servers to denote the original browser IP address. This header is often not supplied when the proxy server setup to hide the identities of the browser.

One of the goals here is to try to detect where a browser is really coming from. This will be required later on to help filter out unwanted traffic that makes use of proxy servers to get around bans currently in place to block them.

The reason to note this is that you do not want, say, Googlebot to crawl your entire site through a proxy and get hit with a duplicate content penalty, or to have someone else earn money by inserting their own Google AdSense ads.
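
A small sketch of pulling the full chain of claimed addresses out of a request, assuming a WSGI-style environ; remember the X-Forwarded-For value is written by the proxy and can be forged, so treat it as a hint rather than proof:

    def client_ip_chain(environ):
        """Return every IP the request claims to have passed through,
        ending with the address that actually connected to us."""
        chain = []
        forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
        if forwarded:
            chain.extend(ip.strip() for ip in forwarded.split(",") if ip.strip())
        chain.append(environ.get("REMOTE_ADDR", ""))
        return chain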

Geo Location Checks

Depending on business requirements and what type of content is being protected, it may be possible to exclude whole regions that offer no potential benefit.

For example, if the company can only do business inside the United States, it might be safe to exclude all foreign visitors. This is a risk that has to be weighed on a site-by-site basis.

Run all the known IP addresses through the geolocation software to map each IP to the country where it is registered. The list of IPs includes the originating IP and any "X-Forwarded-For" IP. Then compare the resulting list of countries against the ones you will be doing business in; if any of them do not match, you have some options. 1). The most common action is to simply block them. 2). Redirect them to a nicely worded page stating the reason why they cannot order from, or view, the site.
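
A sketch of that country check in Python, assuming MaxMind's geoip2 reader as the geolocation software; the database path and the allowed-country set are placeholders:

    import geoip2.database  # any geolocation library with a country lookup will do

    ALLOWED_COUNTRIES = {"US"}  # the countries you actually do business in
    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # hypothetical path

    def disallowed_ips(ip_list):
        """Map each IP (originating plus any X-Forwarded-For) to a country and
        return the ones that fall outside the allowed set."""
        rejected = []
        for ip in ip_list:
            try:
                country = reader.country(ip).country.iso_code
            except Exception:  # private, malformed, or unknown addresses
                country = None
            if country not in ALLOWED_COUNTRIES:
                rejected.append((ip, country))
        return rejected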

By not allowing this group of visitors you reduce the number of problems you have to deal with many fold, depending on how many countries are being blocked.

Robots.txt checks

Most good bots will read robots.txt to see where they are not wanted. Others use it as a way to find out where the bot traps are and work around them, so they can steal your content. The only true way I have come up with to protect my content is to cloak the robots.txt file and serve different versions to different bots. Only the bots which are allowed in receive the true robots.txt; all others receive a highly restrictive version that bans all access to the website. I usually assume anything that reads the robots.txt file is a bot, or someone snooping around who is up to no good. If they are not in my white list of allowed bots, are using a non-bot user-agent, or supply no user-agent at all, they receive the highly restrictive version of robots.txt. Since robots.txt was never designed for humans to actively look at, I feel safe in my treatment of people who download it for any reason.
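
A minimal sketch of the cloaked robots.txt in Python; the user-agent substrings and the two file bodies are examples only:

    # The real rules for white-listed crawlers, and a lock-everything-out version
    # for everybody else.
    OPEN_ROBOTS = "User-agent: *\nDisallow: /private/\n"
    CLOSED_ROBOTS = "User-agent: *\nDisallow: /\n"
    WHITELISTED_BOTS = ("googlebot", "slurp", "msnbot", "teoma")

    def robots_txt_body(user_agent):
        """Serve the true robots.txt only to white-listed bots; browsers, blank
        user-agents, and unknown bots all get the restrictive version."""
        ua = (user_agent or "").lower()
        if any(bot in ua for bot in WHITELISTED_BOTS):
            return OPEN_ROBOTS
        return CLOSED_ROBOTS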

If they read robots.txt, log the IP address and user-agent. Assuming a cloaked robots.txt, also log whether, by the rules defined in the robots.txt they were served, they should be banned or not. Keep this audit log of robots.txt activity for 24 hours for the IPs and user-agents in question; 24 hours is an accepted and reasonable amount of time for a crawler to cache the contents of robots.txt.
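
A sketch of that audit log as a simple in-memory table, assuming the cloaked robots.txt above; a database table works the same way:

    import time

    ROBOTS_AUDIT = {}          # (ip, user_agent) -> (timestamp, banned_by_rules)
    AUDIT_TTL = 24 * 60 * 60   # keep entries for 24 hours

    def log_robots_fetch(ip, user_agent, banned_by_rules):
        """Record who fetched robots.txt and whether the version they were served
        bans them, dropping entries older than 24 hours as we go."""
        now = time.time()
        for key, (stamp, _) in list(ROBOTS_AUDIT.items()):
            if now - stamp > AUDIT_TTL:
                del ROBOTS_AUDIT[key]
        ROBOTS_AUDIT[(ip, user_agent)] = (now, banned_by_rules)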

Make sure robots.txt allows only the bots you wish to crawl and index the website. I suggest only the top 3 or 4; the current major players in my opinion are Google, Yahoo, MSN, and Ask Jeeves.

Please note: do not check the robots.txt file on this website, because you will be banned from further access until the timeout period has elapsed.

Check Robots.txt Audit

Check to see if the IP address or user-agent has downloaded the robots.txt file in the previous 24 hours. Take action only if the IP or user-agent is banned because it disobeyed the robots.txt rules. Note the IP ban, date, and time in a ban audit log for later return visits by this IP.
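
A sketch of consulting that audit log on a normal page request, restating the in-memory tables from the previous sketch so the snippet stands on its own:

    import time

    ROBOTS_AUDIT = {}          # (ip, user_agent) -> (timestamp, banned_by_rules)
    BAN_LOG = {}               # ip -> timestamp of the ban
    AUDIT_TTL = 24 * 60 * 60

    def violates_robots_txt(ip, user_agent):
        """True only when this IP or user-agent fetched robots.txt within the last
        24 hours and the version it was served disallows it; the ban is then noted."""
        now = time.time()
        for (seen_ip, seen_ua), (stamp, banned_by_rules) in ROBOTS_AUDIT.items():
            if seen_ip != ip and seen_ua != user_agent:
                continue
            if now - stamp <= AUDIT_TTL and banned_by_rules:
                BAN_LOG[ip] = now
                return True
        return False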

Only if it has been banned, follow the steps listed previously under Common blocking actions.

Captchas

The purpose of a captcha is to help weed out bots by making the browser prove a human is present, by doing something that is normally too time consuming for a computer to figure out with any accuracy. Captchas are not foolproof, but they usually raise the bar much higher than most run-of-the-mill scrapers are willing to deal with.

Check if the IP or user-agent has previously been given a captcha check and has not answered it successfully. If it has not successfully answered one, send it another captcha check and make a note of how many captcha checks it has triggered.

If the recorded IP has failed a predetermined number of captcha checks, it is safe to ban the IP from further requests for a predetermined amount of time. Note the IP ban, date, and time in a ban audit log for later return visits by this IP.
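
A sketch of the captcha bookkeeping; the threshold and ban duration are arbitrary numbers to tune for your own site:

    import time

    CAPTCHA_FAILURES = {}        # ip -> number of captcha checks not answered
    BAN_LOG = {}                 # ip -> timestamp of the ban
    MAX_CAPTCHA_FAILURES = 3     # predetermined number of chances
    BAN_SECONDS = 24 * 60 * 60   # predetermined ban length

    def note_unanswered_captcha(ip):
        """Count another unanswered captcha for this IP; once it blows through the
        threshold, record the ban so later requests can be refused outright."""
        CAPTCHA_FAILURES[ip] = CAPTCHA_FAILURES.get(ip, 0) + 1
        if CAPTCHA_FAILURES[ip] >= MAX_CAPTCHA_FAILURES:
            BAN_LOG[ip] = time.time()
            return True   # caller applies one of the common blocking actions
        return False

    def is_banned(ip):
        """True while the recorded ban is still inside the predetermined window."""
        stamp = BAN_LOG.get(ip)
        return stamp is not None and time.time() - stamp < BAN_SECONDS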

Only if it has been banned, follow the steps listed previously under Common blocking actions.

Banned Audit Log

Check if the IP has previously been banned in the last 24 hours.

Only if it has been banned, follow the steps listed previously under Common blocking actions.

From who??

Check to see if the "From" header is present or not. This header is only be supplied by bots. So if it is found it can be marked as a bot even when the user-agent is a non-crawler or Identifiable as bot by other methods.

This header usually takes the form of an email address, to which you can report problems with the bot's activities on the website.
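
A tiny sketch of the From check; headers here is a plain dict of request headers:

    def from_header_flags_bot(headers):
        """The From header is essentially only ever sent by crawlers, so its mere
        presence is enough to flag the request as a bot.  The value (normally a
        contact email address) is worth keeping for any complaints later on."""
        contact = headers.get("From")
        return (contact is not None), contact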

Common Bot Identifiers

Check to see if the User-Agent contains one of the following terms so it can be flagged as a possible bot. These may catch some malformed User-Agents, but at this point the request is only flagged as a possible bot; it is not yet known for sure whether it is one. (A sketch of this substring check follows the list below.)

  • "Crawler"
  • "Bot"
  • "Spider"
  • "Ask Jeeves"
  • "Search"
  • "Indexer"
  • "Archiver"
  • "Larbin" <-- Email scraper
  • "Nutch" <-- Open source web crawler which is abused.
  • "Libwww" <-- Used by a lot of scrapers
  • "User-Agent" <-- Badly formed User-Agent

Review the supplied User-Agent

If the request was previously flagged as a possible bot, perform analysis of the supplied User-Agent. The purpose is to weed out bots that are not on a white list of allowed bots, or bots that have been explicitly banned from accessing the site. (A sketch of this decision logic follows the list below.)

  • This is done by seeing if the User-Agent matches a known string which an allowed bot uses, and letting it continue on through further checks.
  • If the User-Agent matches a known disallowed bot, mark the IP as banned and give it a proper message.
  • If it does not match any known bot User-Agent that has been coded as disallowed or allowed, the proper thing to do is to show it a captcha page where a human may continue on but a bot would get stuck. Mark the IP and User-Agent as having been given a captcha check and note whether they answer it properly.
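
A sketch of that three-way decision; the allow and deny substrings are examples to replace with your own lists:

    ALLOWED_BOT_UAS = ("googlebot", "slurp", "msnbot", "teoma")
    DISALLOWED_BOT_UAS = ("larbin", "nutch", "libwww")

    def review_user_agent(user_agent):
        """For a request already flagged as a possible bot: let white-listed bots
        continue to the later checks, ban known bad ones, and hand everything
        else a captcha that only a human is likely to get past."""
        ua = (user_agent or "").lower()
        if any(bot in ua for bot in ALLOWED_BOT_UAS):
            return "continue"   # goes on to the From header and DNS checks
        if any(bot in ua for bot in DISALLOWED_BOT_UAS):
            return "ban"        # mark the IP as banned and send the block response
        return "captcha"        # unknown: record the challenge and wait for an answer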

From Header

  1. Check if the User-Agent is identified as a bot and whether the "From" header was supplied.
  2. Check the supplied "From" header, if any, against the known valid value for that white-listed bot, to confirm it is present when expected and matches. If the white-listed bot's "From" header does not match what is expected, mark the IP address as banned and give it a proper message (see the sketch below).
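
A sketch of the comparison; the expected value shown is illustrative only, so check each engine's own documentation for what its crawler actually sends:

    # Expected From values per white-listed crawler (illustrative entries only).
    EXPECTED_FROM = {
        "googlebot": "googlebot(at)googlebot.com",
    }

    def from_header_matches(bot_name, headers):
        """For a white-listed bot, require that the From header matches the value
        on record whenever we have one; a mismatch means the IP gets banned."""
        expected = EXPECTED_FROM.get(bot_name)
        if expected is None:
            return True   # nothing recorded for this bot, so nothing to enforce
        return headers.get("From", "").lower() == expected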

DNS Validation

Check the allowed bots that have made it this far against the list of bots that support DNS Checks to validate them.

The following checks will also stop major search engines which are unknowingly crawling through a transparent proxy server, thus avoiding duplicate content penalties for the website as a side benefit.

(A) The DNS check requires looking up the IP to get the hostname. Check the resolved hostname against the known patterns for the search engine in question; if they do not match, mark the IP as banned and give it a proper message.

(B) Then do a lookup on the hostname to see if it resolves back to a list of IP addresses that contains the IP you started with.

Something to watch out for: some fake bots will have their IP address resolve to a hostname that is simply the IP address itself, and thus would pass the test, so this must be explicitly tested for and rejected by default. For example, the IP "10.0.0.1" would resolve to the hostname "10.0.0.1".

MSN, Yahoo, Google, and Ask Jeeves all currently support this functionality; others may as well. The purpose of this check is to prevent others from spoofing well-known crawlers by setting up their DNS records to resolve their IPs to a well-known search engine hostname; since they do not control the forward resolution of that hostname back to an IP, they will get caught by this check.
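
A sketch of the reverse-then-forward lookup using the standard socket module; the hostname suffixes are era-appropriate examples, so verify them against each engine's current documentation before relying on them:

    import socket

    CRAWLER_HOST_SUFFIXES = {
        "googlebot": (".googlebot.com", ".google.com"),
        "slurp": (".crawl.yahoo.net",),
        # add entries for the other engines you allow
    }

    def crawler_ip_is_genuine(bot_name, ip):
        """Reverse-resolve the IP, check the hostname against the engine's known
        pattern, then forward-resolve that hostname and require the original IP
        to appear in the answer.  A hostname that is simply the IP written out is
        rejected outright, since it would otherwise 'pass' both lookups."""
        try:
            hostname = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if hostname == ip:   # the self-referential fake described above
            return False
        suffixes = CRAWLER_HOST_SUFFIXES.get(bot_name, ())
        if not suffixes or not hostname.lower().endswith(suffixes):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(hostname)[2]
        except OSError:
            return False
        return ip in forward_ips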

Mobile Checks

Check to see if the User-Agent contains one of the following terms so it can be flagged as a possible mobile phone browser (a sketch covering this and the x-wap-profile check appears after the next section).

  • "Windows CE"
  • "PalmOS"
  • "MIDP-"
  • "Portalmmm"
  • "Symbian OS"
  • "UP.Browser"

x-wap-profile

Check to see if the "x-wap-profile" header is present, this is another header which is often sent with mobile phone browsers. If present it is usually safe to assume and flag the browser as a mobile browser of some type. The value of this header is usually a URL pointing to an xml file describing the supported features of the browser and phone.

More Checks

Check if the "Accept" header is present, and make a note of this for later.

Check if the browser identifies itself as one of the following browsers, "IE", "Opera", or "Firefox", and is missing the "Accept" header. There are two options one could take in this case, listed below. If checking all non-bot User-Agents excluding mobiles, it is advised to use the first method, just in case.

  • One option is to show it a captcha page where a human may continue on but a bot would get stuck. Mark the IP and User-Agent as having been given a captcha check and note whether they answer it properly.
  • The other is to mark the IP and block it from accessing the site, sending a 403 status and no further content.

Most major web browsers send this header along with the request; it tells the web server what content types the browser can accept. I have only listed the major browser providers, but this check is usually safe for all known browsers except a few mobile browsers, which is why the mobile browser checks are in place earlier.
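
A sketch of the Accept check; note it deliberately skips requests already flagged as mobile, and "msie" is the token Internet Explorer actually places in its User-Agent:

    MAJOR_BROWSER_TOKENS = ("msie", "opera", "firefox")

    def missing_accept_is_suspect(user_agent, headers, is_mobile):
        """A desktop request claiming to be IE, Opera, or Firefox should carry an
        Accept header; when it does not, treat it as suspect.  Of the two options
        above, the captcha is the safer response."""
        if is_mobile or "Accept" in headers:
            return False
        ua = (user_agent or "").lower()
        return any(token in ua for token in MAJOR_BROWSER_TOKENS)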