Jun 4 2009

Make Web Sites Think that PHP CURL is a Browser.

I’ve been doing a bit of site scraping using curl and PHP lately. I’ve found that most sites will ban your ip if they think you’re a bot (good thing I’m on DSL) so you need to make them think that your script is a browser. The easiest way to do this is to add a user agent header to your script. Here is an example of getting a results page from google for a specific search query.


function get_google_result($search_term)
{
  $ch = curl_init();
  $url = 'http://www.google.ca/search?hl=en&safe=off&q='
                  .urlencode($search_term);
  $useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1)".
                            " Gecko/20061204 Firefox/2.0.0.1";
  
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_HEADER, 1);
  curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  
  $google_string = curl_exec($ch);
  
  curl_close($ch);
  $google_string = utf8_encode($google_string);
  return $google_string;
}

So this function will take the search term that was provided and request the results from google for it. Our user agent header makes the script look like firefox.
I put the returned string (The entire google results page) into the google_string variable and return it for parsing out what is needed. My newest scrape experiment is a site called Quick Content and basically scrapes some results from google based on some parameters from google hot trends and posts it all as a feed. It was fun to code up but it is desperate need of a makeover.

Share