I’ve been doing a bit of site scraping using curl and PHP lately. I’ve found that most sites will ban your IP if they think you’re a bot (good thing I’m on DSL), so you need to make them think that your script is a browser. The easiest way to do this is to send a user agent header with your request. Here is an example of getting a results page from Google for a specific search query.
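// Fetch a Google results page for $query (the function name is illustrative).
function get_google_results($query) {
    $ch = curl_init();
    // urlencode() keeps spaces and symbols in the search term from breaking the URL.
    $url = 'http://www.google.ca/search?hl=en&safe=off&q=' . urlencode($query);
    $useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1)" .
        " Gecko/20061204 Firefox/2.0.0.1";
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);          // include the response headers in the output
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page instead of printing it
    $google_string = curl_exec($ch);
    curl_close($ch);
    // utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2.
    $google_string = utf8_encode($google_string);
    return $google_string;
}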
So this function will take the search term that was provided and request the results from Google for it. Our user agent header makes the script look like Firefox.
I put the returned string (the entire Google results page) into the google_string variable and return it for parsing out what is needed. My newest scrape experiment is a site called Quick Content. It basically scrapes some results from Google based on some parameters from Google Hot Trends and posts it all as a feed. It was fun to code up but it is in desperate need of a makeover.
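As for "parsing out what is needed", here is a minimal sketch of one way to do it, assuming you want the link targets from the page. The function name extract_links and the sample response are hypothetical, and it relies on PHP's DOM extension. Because CURLOPT_HEADER is switched on, the returned string starts with the HTTP headers, so the sketch splits those off first. It only handles a single header block; redirects or a "100 Continue" reply would need more care.

```php
<?php
// Split a raw HTTP response (headers + body, as returned with CURLOPT_HEADER on)
// and collect the href of every <a> tag in the HTML body.
function extract_links($response) {
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $response, 2);
    $body = isset($parts[1]) ? $parts[1] : $parts[0];
    if (trim($body) === '') {
        return array();
    }

    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($body);
    libxml_clear_errors();

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical stand-in for the string returned by curl_exec().
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    . '<html><body><a href="http://example.com/one">One</a>'
    . '<a href="http://example.com/two">Two</a></body></html>';

print_r(extract_links($sample));
```

A DOM parser is usually less fragile than regular expressions here, since the markup of a results page changes often.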
With the unveiling of Microsoft’s new search engine Bing, I was curious to see which site uses more bandwidth per page load and query. These results are taken from the Net panel of Firebug, the Firefox plugin. It tells you which files were downloaded, how big they were, and how long they took. It’s a great way to see how fast your site is going to be.
Google main page: 20 KB
Google search for Cody Taylor: 38 KB
Google search for asdf asdf: 10 KB
The search for Cody Taylor was so large because of images displayed at the top of the pages.
Bing main page: 109 KB
Bing search for Cody Taylor: 22 KB
Over 5 times the data of Google for the main page. The search for Cody Taylor on Bing didn’t show any pictures but was still about twice the size of an image-free Google search. If we take into account the new Ajax onMouseOver event on Bing for each search result, it becomes 27 KB, still without any images. Remember those articles that went off about how much energy every Google query uses? Looks like Bing nearly triples that amount.
Of course both these sites are much more efficient when we take into account that after the first visit most of the large data is already cached in our browser. For this test I was clearing my entire cache between each request.
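The comparisons above are simple ratios, and they can be checked with a few lines of PHP. This is just a sketch: the sizes are the Firebug numbers listed above, and the array keys and the size_ratio helper are names I made up for the illustration.

```php
<?php
// Page sizes in KB, copied from the Firebug measurements listed above.
$sizes = array(
    'google_main'       => 20,
    'google_search'     => 10,   // the image-free "asdf asdf" search
    'bing_main'         => 109,
    'bing_search'       => 22,
    'bing_search_hover' => 27,   // including the onMouseOver request
);

// How many times larger $a is than $b, rounded to two decimals.
function size_ratio($a, $b) {
    return round($a / $b, 2);
}

printf("Main page:         %.2fx\n", size_ratio($sizes['bing_main'], $sizes['google_main']));
printf("Search:            %.2fx\n", size_ratio($sizes['bing_search'], $sizes['google_search']));
printf("Search with hover: %.2fx\n", size_ratio($sizes['bing_search_hover'], $sizes['google_search']));
// prints 5.45x, 2.20x and 2.70x respectively
```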
Bing does have quite a few redeeming features, and as a first impression it looks like it may be a serious contender, but it still lacks the simplicity and speed of Google.
Here are a few more little-known tricks that can be used to get better results from the Google search engine. Save some bandwidth and tell your friends.
You can add all sorts of arguments to your Google search query. The most useful that I’ve found so far is filetype. This allows you to specify the type of file that you want to search for.
If I type:
into the Google search engine then I only get PDF files in my search results, most of which are useful instruction manuals for the iPhone.
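If you are building these queries from a script, the filetype argument just gets appended to the search terms before the query is URL-encoded. Here is a small sketch of that. The function name filetype_search_url, the google.ca base URL, and the search terms 'iphone manual' are assumptions for illustration only.

```php
<?php
// Build a Google search URL that restricts results to one file type.
// The function name and the base URL are illustrative assumptions.
function filetype_search_url($terms, $filetype) {
    $query = $terms . ' filetype:' . $filetype;
    // urlencode() turns spaces into '+' and ':' into '%3A'.
    return 'http://www.google.ca/search?hl=en&q=' . urlencode($query);
}

echo filetype_search_url('iphone manual', 'pdf'), "\n";
// prints: http://www.google.ca/search?hl=en&q=iphone+manual+filetype%3Apdf
```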
There are many other arguments that can be useful:
Google automatically removes certain words from searches. These are called stop words and consist of words like ‘I’, ‘a’, ‘the’, and ‘of’. To force Google to use these words, add a ‘+’ to the beginning of the word. So searching for a statement with ‘+the’ in it would force the query to look for the ‘the’. If you don’t care whether these words are included in the search then why even enter them?
Order and repetition matter.
emphasizes the “codytaylor” and produces different results than searching
The keywords to the left are always given higher precedence in the query.
produces different results than searching
“codytaylor” scp scp
So if you’re looking for a page that is saturated with a specific keyword then you’ll have much more luck if you type it in more than once.
Basic Google Syntax Explanations
Ever see someone spend hours trying to find something in Google and just give up due to the enormous amount of content for any given keyword? I’m amazed at how little everyone knows about using the Google search engine. Most of the population uses Google every day but is still unaware of some very simple and effective syntax rules for Google queries. All that fruitless searching takes energy and bandwidth. In the following I try to outline two of the most common methods of narrowing your search results down to only what you want.
- Basic Boolean: Use ‘AND’ and ‘OR’ in your query. The ‘AND’ will require the result to include both keywords and the ‘OR’ will allow results that have either keyword in them. You can also use the ‘|’ (pipe) character to specify ‘OR’. To make sure that none of the results include a specific word, put the ‘-’ character in front of the word. So searching ‘cody AND taylor AND -yoyo’ will return results for cody taylor that do not include yoyo.
- Quotes: Use quotes on a query to specify that you only want search results that are exactly as you write them. If I google codytaylor most of my results are for cody taylor, but if I google “codytaylor” then I get results only containing codytaylor without any spaces. Google’s forethought in displaying results and splitting up words is very useful, but a lot of the time you will want your results to be exactly as you specified. Quotes are also used to specify keyword order. If I wanted results for only the useful, and not some sentence or combination of words that happens to include those three words, then I would specify “only the useful”. Try a couple of queries and you’ll notice how much more specific your results become.
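The two rules above are easy to apply from a script as well. Here is a rough sketch of a helper that assembles a query string in the same form as the ‘cody AND taylor AND -yoyo’ example. The function name build_query and its parameters are made up for this illustration; it only builds the text you would type into the search box.

```php
<?php
// Assemble a query from an optional exact phrase, required words and
// excluded words, joined with AND as in the example in the text.
// The function name and parameters are illustrative assumptions.
function build_query($required, $excluded = array(), $phrase = '') {
    $terms = array();
    if ($phrase !== '') {
        $terms[] = '"' . $phrase . '"';   // quotes ask for an exact match
    }
    foreach ($required as $word) {
        $terms[] = $word;                 // every required word must appear
    }
    foreach ($excluded as $word) {
        $terms[] = '-' . $word;           // a leading '-' excludes the word
    }
    return implode(' AND ', $terms);
}

echo build_query(array('cody', 'taylor'), array('yoyo')), "\n";
// prints: cody AND taylor AND -yoyo
echo build_query(array(), array(), 'only the useful'), "\n";
// prints: "only the useful"
```

Pass the result through urlencode() before putting it in a search URL.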
These two basic features are surprisingly little known yet so straightforward. Save everyone some bandwidth and explain this to the people around you.
Yes, Google is a verb.
Check out Part 2
Google Search and Google News performance slowed to a crawl, while an outage seemed to spread from Gmail to Google Maps and Google Reader. Comments about the failure were flying on Twitter, with “googlefail” quickly becoming one of the most searched terms on the popular micro-blogging site.
peterwayner writes “Six of the top ten links on a Google search for one of my books point to a pirate site when I type in ‘wayner data compression textbook.’ Other search strings actually locate pages that are selling legit copies including digital editions for the Kindle. I’ve started looking around for suggestions. Any thoughts from the Slashdot crowd? The free copies aren’t boosting sales for my books. Do I (1) get another job, (2) sue people, or (3) invent some magic spell? Is society going to be able to support people who synthesize knowledge or will we need to rely on the Wikipedia for everything? I’m open to suggestions.”