Jun 4 2009

Make Web Sites Think that PHP CURL is a Browser.

I’ve been doing a bit of site scraping with curl and PHP lately. I’ve found that most sites will ban your IP if they think you’re a bot (good thing I’m on DSL), so you need to make them think that your script is a browser. The easiest way to do this is to send a User-Agent header from your script. Here is an example that fetches a results page from Google for a specific search query.


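// Fetch a Google results page for $search_term while sending a browser User-Agent.
// Returns the raw response (headers plus body) as UTF-8, or false if the request fails.
function get_google_result($search_term)
{
  $ch = curl_init();
  $url = 'http://www.google.ca/search?hl=en&safe=off&q='
                  .urlencode($search_term);
  // User-Agent string copied from a desktop Firefox build.
  $useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1)".
                            " Gecko/20061204 Firefox/2.0.0.1";
  
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_HEADER, 1);          // include the response headers in the output
  curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the response instead of printing it
  
  $google_string = curl_exec($ch);
  
  curl_close($ch);
  if ($google_string === false) {
    return false;  // the request failed, so there is nothing to encode
  }
  // Assumes the response is ISO-8859-1; note utf8_encode() is deprecated as of PHP 8.2.
  $google_string = utf8_encode($google_string);
  return $google_string;
}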

So this function takes the search term that was provided and requests the Google results for it. Our User-Agent header makes the script look like Firefox.
I put the returned string (the response headers followed by the entire Google results page) into the $google_string variable and return it so that whatever is needed can be parsed out. My newest scrape experiment is a site called Quick Content, which scrapes some results from Google based on parameters from Google Hot Trends and posts it all as a feed. It was fun to code up but it is in desperate need of a makeover.
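
Because CURLOPT_HEADER is set to 1 in the function above, the string that comes back starts with the HTTP response headers, and the HTML only begins after the first blank line. Here is a minimal sketch of separating the two before parsing. The helper name split_http_response and the sample response are my own inventions for illustration, and it assumes a single header block (no redirects being followed).

```php
<?php
// Hypothetical helper (not from the original function): split a raw response,
// as returned when CURLOPT_HEADER is on, into its header block and its body.
function split_http_response($raw)
{
    // The headers end at the first blank line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $raw, 2);
    $headers = $parts[0];
    $body = isset($parts[1]) ? $parts[1] : '';
    return array('headers' => $headers, 'body' => $body);
}

// Made-up sample response, for illustration only.
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>results</html>";
$response = split_http_response($sample);
echo $response['body'], "\n";   // <html>results</html>
```

If you only ever want the HTML, the simpler route is probably to leave CURLOPT_HEADER off and skip this step entirely.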


Jun 1 2009

Bing VS Google Bandwidth Comparison

With the unveiling of Microsoft’s new search engine, Bing, I was curious to see which site uses more bandwidth per page load and query. These results are taken from the Net panel of the Firefox plugin Firebug. It tells you which files were downloaded, how big they were, and how long they took. It’s a great way to see how fast your site is going to be.

Google Main Page 20 KB
Search for Cody Taylor 38 KB
Search for asdf asdf 10 KB

The search for Cody Taylor was so large because of the images displayed at the top of the page.

Bing Main Page 109 KB
Search for Cody Taylor 22 KB

Bing’s main page is over 5 times the data of Google’s. The search for Cody Taylor on Bing didn’t show any pictures but was still twice the size of a normal Google search. If we take into account the new Ajax onMouseOver event on Bing for each search result, it becomes 27 KB, still without any images. Remember those articles that went off about how much energy every Google query uses? Looks like Bing nearly triples that amount.
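
For anyone who wants to check the arithmetic, here is a quick sketch that recomputes the ratios from the Firebug numbers listed above. The array keys and the size_ratio helper are names I made up for this example.

```php
<?php
// Page weights in KB, copied from the Firebug measurements in this post.
$google = array('main' => 20, 'search_images' => 38, 'search_plain' => 10);
$bing   = array('main' => 109, 'search' => 22, 'search_hover' => 27);

// Hypothetical helper: how many times larger $a is than $b, to one decimal place.
function size_ratio($a, $b)
{
    return round($a / $b, 1);
}

printf("Main page:             %.1fx\n", size_ratio($bing['main'], $google['main']));
printf("Plain search:          %.1fx\n", size_ratio($bing['search'], $google['search_plain']));
printf("Search with mouseover: %.1fx\n", size_ratio($bing['search_hover'], $google['search_plain']));
```

The comparison uses the plain 10 KB Google search as the baseline, since the 38 KB one was inflated by images.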

Of course both of these sites are much more efficient once we take into account that after the first visit most of the large data is already cached in the browser. For this test I cleared my cache between each request.

Bing does have quite a few redeeming features, and on first impression it looks like it may be a serious contender, but it still lacks the simplicity and speed of Google.


May 18 2009

Google Query Syntax Explanations: Part 2

Here are a few more little-known tricks that can be used to get better results from the Google search engine. Save some bandwidth and tell your friends.

Operators
You can add all sorts of arguments to your Google search query. The most useful that I’ve found so far is filetype, which lets you specify the type of file that you want to search for.
If I type:
“iphone” filetype:pdf
into the Google search engine then I only get PDF files in my search results, most of which are useful instruction manuals for the iPhone.

There are many other arguments that can be useful:

intitle:”tech stuff”
inurl:”codytaylor”
intext:”iphone”
inanchor:”tech stuff”
site:codytaylor.org
link:www.codytaylor.org
cache:codytaylor.org
daterange:2452389-2452389
related:codytaylor.org
info:codytaylor.org
phonebook:”someone”
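
If you build these queries from a script, the arguments are just name:value pairs appended to the search string. Here is a rough sketch of that idea. The function build_operator_query is a made-up helper for illustration, not something Google provides, and it only quotes values that contain spaces.

```php
<?php
// Hypothetical helper: append name:value arguments to a base search string,
// wrapping any value that contains a space in double quotes.
function build_operator_query($base, array $operators)
{
    $parts = array($base);
    foreach ($operators as $name => $value) {
        if (strpos($value, ' ') !== false) {
            $value = '"' . $value . '"';
        }
        $parts[] = $name . ':' . $value;
    }
    return implode(' ', $parts);
}

$query = build_operator_query('"iphone"', array('filetype' => 'pdf'));
echo $query, "\n";              // "iphone" filetype:pdf

// urlencode() makes the query safe to place in a search URL's q= parameter.
echo urlencode($query), "\n";
```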

Stop Words

Google automatically removes certain words from searches. These are called stop words and consist of words like ‘I’, ‘a’, ‘the’, and ‘of’. To force Google to use these words, add a ‘+’ to the beginning of the word. So searching for a statement with ‘+the’ in it would force the query to look for the ‘the’. If you don’t care whether these words are included in the search, then why even enter them?

Order and repetition matter.

Searching
“codytaylor” scp
emphasizes the “codytaylor” and produces different results than searching
scp “codytaylor”
The keywords to the left are always given higher precedence in the query.

Searching
“codytaylor” scp
produces different results than searching
“codytaylor” scp scp

So if you’re looking for a page that is saturated with a specific keyword then you’ll have much more luck if you type it in more than once.


May 17 2009

Google Query Syntax Explanations: Part 1

Basic Google Syntax Explanations

Ever see someone spend hours trying to find something in Google and just give up because of the enormous amount of content for any given keyword? I’m amazed at how little most people know about using the Google search engine. Most of the population uses Google every day but is still unaware of some very basic, simple, and effective syntax rules for Google queries. All that wasted searching takes energy and bandwidth. Below I try to outline two of the most common methods of narrowing your search results down to only what you want.

  • Basic Boolean: Use ‘AND’ and ‘OR’ in your query. The ‘AND’ will require the result to include both keywords and the ‘OR’ will allow results that have either keyword in them. You can also use the ‘|’ (pipe) character to specify ‘OR’. To make sure that none of the results include a specific word, put the ‘-’ character in front of the word. So searching ‘cody AND taylor AND -yoyo’ will return results for cody taylor that do not include yoyo.
  • Quotes: Use quotes on a query to specify that you only want results that are exactly as you wrote them. If I google codytaylor most of my results are for cody taylor, but if I google “codytaylor” then I get results only containing codytaylor without any spaces. Google’s forethought in displaying results and splitting up words is very useful, but a lot of the time you will want your results to be exactly as you specified. Quotes are also used to specify keyword order. If I wanted results for only the useful, and not some sentence or combination of words that include those three words, then I would specify “only the useful”. Try a couple of queries and you’ll notice how much more specific your results become.
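
If you ever generate searches from code, the two rules above are easy to apply mechanically. This is only a sketch under my own naming: all_of and exact_phrase are hypothetical helpers, not part of any Google API.

```php
<?php
// Hypothetical helper: require every word in $words and exclude every word
// in $excluded, joining the terms with AND as in the example above.
function all_of(array $words, array $excluded = array())
{
    $terms = $words;
    foreach ($excluded as $word) {
        $terms[] = '-' . $word;   // a leading '-' removes results containing the word
    }
    return implode(' AND ', $terms);
}

// Hypothetical helper: wrap text in double quotes to ask for an exact match.
function exact_phrase($text)
{
    return '"' . $text . '"';
}

echo all_of(array('cody', 'taylor'), array('yoyo')), "\n";   // cody AND taylor AND -yoyo
echo exact_phrase('only the useful'), "\n";                  // "only the useful"
```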

These two basic features are surprisingly little known, yet they are so straightforward. Save everyone some bandwidth and explain this to the people around you.

Yes, Google is a verb.

Check out Part 2


May 14 2009

Lawyer challenges Mulroney's 1996 testimony during Airbus lawsuit – CBC.ca


National Post

Lawyer challenges Mulroney's 1996 testimony during Airbus lawsuit
CBC.ca
Brian Mulroney testifies at the Oliphant commission, which is looking into business dealings between Karlheinz Schreiber and the former prime minister.
Mulroney in hot seat over selective answers The Canadian Press
Oh, cry me a river, Mr. Mulroney National Post

May 14 2009

Caregiver advocate questions Dhalla's story – Canada.com


CTV.ca

Caregiver advocate questions Dhalla's story
Canada.com
By Juliet O'Neill, Canwest News Service, May 14, 2009 7:21 PM. Brampton Liberal MP Ruby Dhalla during a press conference in Toronto, Friday.
Dhalla lawyer hints scandal was politically motivated StarPhoenix
Dhalla lying, advocate implies Toronto Star

May 14 2009

Domestic abuse – Ottawa Citizen


CTV.ca

Domestic abuse
Ottawa Citizen
There's blood in the waters, and the sharks are circling for Liberal MP Ruby Dhalla. As fascinating as the spectacle might be for political junkies, the importance of this episode will ultimately be measured by whether it has any effect on the lives of
Dhalla abuse charges 'false' Edmonton Sun
Dhalla a victim of campaign: lawyer National Post

May 14 2009

Google suffers major failure

Google Search and Google News performance slowed to a crawl, while an outage seemed to spread from Gmail to Google Maps and Google Reader. Comments about the failure were flying on Twitter, with “googlefail” quickly becoming one of the most searched terms on the popular micro-blogging site.



May 14 2009

What Can I Do About Book Pirates?

peterwayner writes “Six of the top ten links on a Google search for one of my books points to a pirate site when I type in ‘wayner data compression textbook.’ Others search strings actually locate pages that are selling legit copies including digital editions for the Kindle. I’ve started looking around for suggestions. Any thoughts from the Slashdot crowd? The free copies aren’t boosting sales for my books. Do I (1) get another job, (2) sue people, or (3) invent some magic spell? Is society going to be able to support people who synthesize knowledge or will we need to rely on the Wikipedia for everything? I’m open to suggestions.”

Read more of this story at Slashdot.



May 14 2009

Advocate contradicts Dhalla on caregivers – The Canadian Press


CBC.ca

Advocate contradicts Dhalla on caregivers
The Canadian Press
OTTAWA – An advocate for immigrant caregivers is contradicting testimony by Liberal MP Ruby Dhalla. Agatha Mason told a parliamentary committee today that she dealt with Dhalla directly in demanding the return of a caregiver's passport.
Foreign-caregiver advocate contradicts Dhalla Globe and Mail
Dhalla lying, advocate implies Toronto Star