Saturday, September 09, 2006


Mind-Boggling Search Results

I use the most popular (and populous) search engine on the Internet quite often; I thought I knew how. Now I'm not so sure, because its logic does not correspond to what years of set theory and symbolic logic have conditioned me to expect.

Here's what happened. I searched for a phrase: "get the net". The resulting list included several categories of pages: some about fishing, some about the film "Wayne's World", some about the Internet and some about Microsoft's ".net".

The first unexpected result is the inclusion of ".NET Framework Developer Center: Get the .NET Framework 1.1". "Get the .NET" is not the same as "get the net" even when converted to lower case; " ." is not the same as " ". Punctuation is essential in phrases, and should not be ignored when seeking matches. I may disagree with the rule applied, but I can understand what happened.

The second and more problematic result came when I refined the previous search by adding "Wayne" as a criterion; there might be some pages about anglers or informatics experts named Wayne, but mostly I should get the pages about the phrase in the film. Indeed, that is what the search results seem to be, but wait a second! After excluding all the pages from the first search that do not contain the word "Wayne", the new search returned more pages! A list of 266,000 is supposedly a subset of a list of 148,000.

Determined to understand how this could be, I hypothesized that, whereas queries on a list of terms automatically require all terms from the list, a query mixing a phrase and a term could return pages with either the phrase or the term. If that were so, and no page contained both, a query on "Wayne" alone should find about 118,000 (266,000 - 148,000). Tada! No way: 147,000,000. The previous result set can be neither the union of the two result sets nor their intersection. What is it?
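To spell out the arithmetic, here is a minimal Python sketch of the limits that set theory puts on the size of an AND or an OR result, checked against the counts reported above. The function names are my own invention for illustration, the counts are the search engine's "about" estimates rather than exact figures, and the lower bound on the intersection is simply taken as zero.

```python
def and_bounds(a, b):
    """Smallest and largest possible size of the intersection of two sets of sizes a and b."""
    return (0, min(a, b))

def or_bounds(a, b):
    """Smallest and largest possible size of the union of two sets of sizes a and b."""
    return (max(a, b), a + b)

def within(count, bounds):
    """True if count lies inside the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= count <= high

phrase = 148_000        # "get the net"
term = 147_000_000      # Wayne
combined = 266_000      # "get the net" Wayne

# Neither reading of the combined query fits the reported count.
print(within(combined, and_bounds(phrase, term)))  # False: an AND result cannot exceed 148,000
print(within(combined, or_bounds(phrase, term)))   # False: an OR result cannot be below 147,000,000
```

Both checks fail, which is the puzzle in a nutshell: 266,000 is too big to be an intersection and far too small to be a union.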

To recap the results of the various searches:

About how many pages    Query                      Remarks
about 145,000           "get the net" -Wayne       Ergo, 3,000 were excluded by the condition '-Wayne'.
about 148,000           "get the net"
about 266,000           "get the net" Wayne        17,000 were excluded, but why?
about 249,000           "get the net" "Wayne"
about 147,000,000       "Wayne"                    None excluded (vs. next row), or 17,000 of 147,000,000 was too insignificant to mention.
about 147,000,000       Wayne
about 142,000,000       Wayne -"get the net"       5,000,000 were excluded for containing "get the net".
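Another way to test the recap, sketched in Python under the assumption that the counts behave like sizes of ordinary sets: every page matching one query either does or does not match the other, so the "with" and "without" counts should add up to the total. The helper name partition_gap is hypothetical, and the figures are the engine's rounded estimates.

```python
def partition_gap(total, with_other, without_other):
    """How far (with + without) overshoots the total; 0 means the counts are consistent."""
    return with_other + without_other - total

# Pages containing "get the net", split by whether they also contain Wayne.
phrase_gap = partition_gap(total=148_000, with_other=266_000, without_other=145_000)

# Pages containing Wayne, split by whether they also contain "get the net".
wayne_gap = partition_gap(total=147_000_000, with_other=266_000, without_other=142_000_000)

print(phrase_gap)  # 263000: the two parts are far larger than the whole
print(wayne_gap)   # -4734000: the two parts fall well short of the whole
```

Neither split comes close to zero, so rounding alone does not seem to explain the numbers.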

Anyone with a plausible explanation, please comment!
