Detect Bots By Parsing The User Agent With PHP

Because I’ve been keeping a closer eye on my traffic, I’ve been logging every request to my databases. While writing reports on this data I noticed that there were huge amounts of traffic from bots. Yahoo, Google, MSN and many others have been hammering my sites lately. Since I’m writing my reports in PHP, I needed a quick little function to tell whether a visitor was real traffic or some machine scraping my site. I couldn’t find one after a few quick Google queries, but the solution was trivial so I wrote my own. Hopefully this will save someone a few minutes.

The function is as simple as possible: it checks the user agent for a list of substrings that bots tend to include. I’ve been watching it for a while and it hasn’t missed any yet. I don’t expect it to catch everything, but it would be nice to get most. Any bot user agent string suggestions would be helpful.


//returns 1 if the user agent is a bot, 0 otherwise
function is_bot($user_agent)
{
  //if no user agent is supplied then assume it's a bot
  if($user_agent == "")
    return 1;

  //array of bot strings to check for (matched case-insensitively)
  $bot_strings = array(
    "google",     "bot",
    "yahoo",      "spider",
    "archiver",   "curl",
    "python",     "nambu",
    "twitt",      "perl",
    "sphere",     "PEAR",
    "java",       "wordpress",
    "radian",     "crawl",
    "yandex",     "eventbox",
    "monitor",    "mechanize",
    "facebookexternal"
  );
  foreach($bot_strings as $bot)
  {
    //stripos() ignores case, so "Yahoo! Slurp" matches "yahoo"
    if(stripos($user_agent, $bot) !== false)
      return 1;
  }

  return 0;
}
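For the reports, a function like this would typically be run over every logged user agent. Here is a rough sketch of how that might look; the helper name count_traffic() and the sample user agents are made up for illustration, and the stand-in detector exists only so the snippet runs on its own (in practice you would pass 'is_bot').

```php
<?php
// Hypothetical helper: tally logged user agents into bots and humans.
// $detector is any callable that returns a truthy value for bots,
// for example the is_bot() function from this post.
function count_traffic(array $user_agents, $detector)
{
    $counts = array('bots' => 0, 'humans' => 0);
    foreach ($user_agents as $ua) {
        if (call_user_func($detector, $ua)) {
            $counts['bots']++;
        } else {
            $counts['humans']++;
        }
    }
    return $counts;
}

// Stand-in detector (an assumption, much cruder than is_bot()):
// empty user agents and anything containing "bot" count as bots.
$demo_detector = function ($ua) {
    return $ua === '' || stripos($ua, 'bot') !== false;
};

// Sample rows standing in for what a log table might contain.
$sample = array(
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0',
    '',
);

$counts = count_traffic($sample, $demo_detector);
echo "bots: {$counts['bots']}, humans: {$counts['humans']}\n"; // bots: 2, humans: 1
```

Passing the detector in as a callable keeps the counting separate from the matching rules, so the substring list can change without touching the report code.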


5 Responses to “Detect Bots By Parsing The User Agent With PHP”

  • Old Man Says:

    It seems unnecessary to log things in your database. First of all, it will probably increase the time it takes your page to generate. And it also seems a bit unnecessary when the web server logs everything anyway.

    Why don’t you just parse the web server’s logs?

    If you need more information than is available, you can add information to the apache logs directly:
    http://ca.php.net/manual/en/function.apache-note.php

  • Cody Taylor Says:

    I insert at the very end of the page, after I flush everything out, so there will be no delay of content even if the db is bogged down. And if I were to give it to someone else who is using IIS, I would have to recode the entire thing. Also, when doing reporting on large sets of data I don’t want to have to parse huge log files for every report.

  • abcphp.com Says:

    Detect Bots By Parsing The User Agent With PHP…

    Because I’ve been starting to keep a closer eye on my traffic I’ve been logging everything to my databases. When writing reports for this information I noticed that there was huge amounts of traffic from bots. Yahoo, Google, MSN and many others have be…

  • Jurian Sluiman Says:

    First, one point about your coding style: if you’re writing a function where you expect a boolean result, please return booleans. Though PHP is weakly typed, it’s much better to have your controller functions return the right type. When you’re inserting the data into your database, you can always cast it back to anything you want.

    The second thing: if you want to classify the visitor based on the user agent, it’s smart to use browscap (http://code.google.com/p/phpbrowscap/). That’s a very reliable solution for user agent detection (and it will even help you with further statistics!).

  • anon Says:

    Jurian is right in that you ought to return true/false instead of 1/0.

    Interesting idea, but maintaining a list of bots is painful and annoying, hence why Jurian recommends browscap.

    However, please examine these test results before using browscap for any large project. In order for it to be a viable solution — even with caching — it would have to be several orders of magnitude faster.

    Dual core Xeon E3110 3.16GHz, 4GB RAM, Apache/2.2.11 (Unix) and PHP 5.2.9 – Load Avg: 0.20, 0.20, 0.13.

    Using PHP’s get_browser() and php_browscap.ini: 0.0370211601257 sec

    Using Browscap.php class and php_browscap.ini: 0.0388770103455 sec

    Using PHP’s get_browser() and lite_php_browscap.ini: 0.0258259773254 sec

    Using Browscap.php class and lite_php_browscap.ini: 0.0213708877563 sec
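Two of the suggestions in the responses above, returning real booleans and timing a detector before relying on it, could be sketched roughly like this. The names is_bot_bool() and average_call_time() are hypothetical, the substring list is shortened for the example, and this is not the harness that produced the numbers quoted above.

```php
<?php
// Hypothetical boolean variant of is_bot(); the name and the shortened
// substring list are assumptions made for this sketch.
function is_bot_bool($user_agent)
{
    // Treat a missing user agent as a bot, as the original function does.
    if ($user_agent === null || $user_agent === '') {
        return true;
    }
    $bot_strings = array('google', 'bot', 'yahoo', 'spider', 'crawl', 'curl');
    foreach ($bot_strings as $bot) {
        // stripos() is a case-insensitive substring search.
        if (stripos($user_agent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

// Average wall-clock seconds per call for any detector callable.
function average_call_time($detector, array $user_agents, $iterations = 1000)
{
    $calls = $iterations * count($user_agents);
    if ($calls === 0) {
        return 0.0; // nothing to measure
    }
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        foreach ($user_agents as $ua) {
            call_user_func($detector, $ua);
        }
    }
    return (microtime(true) - $start) / $calls;
}

// Sample user agents chosen for illustration.
$agents = array(
    'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
    'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0',
);

var_dump(is_bot_bool($agents[0])); // bool(true)
var_dump(is_bot_bool($agents[1])); // bool(false)
printf("%.10f sec per call\n", average_call_time('is_bot_bool', $agents));

// Cast back to 1/0 when the database column expects an integer.
$flag = (int) is_bot_bool($agents[0]);
```

The same average_call_time() call could be pointed at a wrapper around get_browser() to compare the two approaches on your own hardware, assuming browscap is configured in php.ini.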