Detect Bots By Parsing The User Agent With PHP
Because I’ve been keeping a closer eye on my traffic, I’ve been logging everything to my databases. When writing reports on this data I noticed that there were huge amounts of traffic from bots. Yahoo, Google, MSN and many others have been hammering my sites quite a lot lately. Since I’m writing my reports in PHP, I needed a quick little function to tell whether a visitor was real traffic or some machine scraping my site. I couldn’t find one after a few quick Google queries, but the solution was trivial so I wrote my own. Hopefully this will save someone a few minutes. The function is as simple as possible and it seems to be working so far. I’ve been watching it for a while and it hasn’t missed any yet. I’m assuming it won’t catch everything, but it would be nice to get most of them. Any bot user agent string suggestions would be helpful.
//returns 1 if the user agent is a bot, 0 otherwise
function is_bot($user_agent)
{
    //if no user agent is supplied then assume it's a bot
    if($user_agent == "")
        return 1;

    //array of bot strings to check for
    $bot_strings = array(
        "google", "bot",
        "yahoo", "spider",
        "archiver", "curl",
        "python", "nambu",
        "twitt", "perl",
        "sphere", "PEAR",
        "java", "wordpress",
        "radian", "crawl",
        "yandex", "eventbox",
        "monitor", "mechanize",
        "facebookexternal"
    );

    foreach($bot_strings as $bot)
    {
        //stripos() ignores case, so "Python-urllib" and "Java/1.6" are caught by "python" and "java"
        if(stripos($user_agent, $bot) !== false)
            return 1;
    }

    return 0;
}
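For the reporting side, a minimal sketch of how the check might be used to tally logged hits. The helper name summarize_traffic() and its callback parameter are hypothetical and not part of the function above; any detector that takes a user agent string and returns a truthy value for bots should work.

```php
<?php
// Hypothetical helper: counts bot and human hits in a list of user agent strings.
// $is_bot_callback is any callable that returns a truthy value for bot user agents.
function summarize_traffic(array $user_agents, $is_bot_callback)
{
    $totals = array('bot' => 0, 'human' => 0);

    foreach ($user_agents as $user_agent) {
        if (call_user_func($is_bot_callback, $user_agent)) {
            $totals['bot']++;
        } else {
            $totals['human']++;
        }
    }

    return $totals;
}
```

With the function from this post loaded, the call would look like summarize_traffic($logged_user_agents, 'is_bot'), where $logged_user_agents is assumed to be an array of user agent strings pulled from the database.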
June 14th, 2009 at
It seems unnecessary to log things in your database. First of all, it will probably increase the time it takes your page to generate. It also seems a bit unnecessary when the web server logs everything anyway.
Why don’t you just parse the web server’s logs?
If you need more information than is available, you can add information to the Apache logs directly:
http://ca.php.net/manual/en/function.apache-note.php
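A rough sketch of that approach, assuming PHP is running as an Apache module (apache_note() is not available otherwise). The helper name note_for_access_log() and the note name is_bot are made up for illustration.

```php
<?php
// Hypothetical helper: attaches a note to the current Apache request so that
// it can be written to the access log. Returns false when apache_note() is
// unavailable (for example under the CLI or a non-Apache server).
function note_for_access_log($name, $value)
{
    if (!function_exists('apache_note')) {
        return false;
    }

    apache_note($name, (string) $value);
    return true;
}

// Example: flag the request, then log the note with %{is_bot}n in a LogFormat
// directive such as:
//   LogFormat "%h %l %u %t \"%r\" %>s %b %{is_bot}n" with_bot_flag
note_for_access_log('is_bot', 1);
```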
June 14th, 2009 at
I insert at the very end of the page, after I flush everything out, so there is no delay in delivering content even if the db is bogged down. And if I were to give it to someone else who is using IIS, I would have to recode the entire thing. Also, when doing reporting on large sets of data I don’t want to have to parse huge log files for every report.
June 15th, 2009 at
First, one point about your coding style: if you’re writing a function where you expect a boolean result, please return booleans. Though PHP is weakly typed, it’s much better to have your controller functions return the right type. When you’re inserting the data into your database, you can always cast it to anything you want.
The second thing: if you want to identify the visitor based on the user agent, it’s smart to use browscap (http://code.google.com/p/phpbrowscap/). That’s a very reliable solution for user agent detection (and it will even help you with further statistics!).
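As a sketch of the first point, a boolean-returning variant might look like the following. The name is_bot_bool() and the second parameter are hypothetical, and the list is shortened for the example; it is an illustration of the suggestion rather than the post author's code.

```php
<?php
// Hypothetical boolean variant of the check: true for bots, false otherwise.
function is_bot_bool($user_agent, array $bot_strings = array('google', 'bot', 'spider', 'crawl'))
{
    // An empty user agent is treated as a bot, as in the original function.
    if ($user_agent == "") {
        return true;
    }

    foreach ($bot_strings as $bot) {
        // Case-insensitive substring match.
        if (stripos($user_agent, $bot) !== false) {
            return true;
        }
    }

    return false;
}

// Cast only at the point of storage, e.g. for an integer column.
$is_bot_column = (int) is_bot_bool('Googlebot/2.1 (+http://www.google.com/bot.html)');
```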
August 19th, 2009 at
Jurian is right in that you ought to return true/false instead of 1/0.
Interesting idea, but maintaining a list of bots is painful and annoying, which is why Jurian recommends browscap.
However, please examine these test results before using browscap for any large project. For it to be a viable solution, even with caching, it would have to be several orders of magnitude faster.
Dual core Xeon E3110 3.16GHz, 4GB RAM, Apache/2.2.11 (Unix) and PHP 5.2.9 – Load Avg: 0.20, 0.20, 0.13.
Using PHP’s get_browser() and php_browscap.ini: 0.0370211601257 sec
Using Browscap.php class and php_browscap.ini: 0.0388770103455 sec
Using PHP’s get_browser() and lite_php_browscap.ini: 0.0258259773254 sec
Using Browscap.php class and lite_php_browscap.ini: 0.0213708877563 sec
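The comment does not say how these timings were taken. One common way to get comparable numbers is to average wall-clock time over repeated calls with microtime(); the helper below is a sketch under that assumption, and its name average_seconds() is hypothetical.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $callback.
function average_seconds($callback, $iterations = 100)
{
    $start = microtime(true);

    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($callback);
    }

    return (microtime(true) - $start) / $iterations;
}
```

To time get_browser() this way, the callback would wrap a call such as get_browser(null, true), which requires the browscap directive in php.ini to point at a browscap file.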