Saturday, September 09, 2006
Mind-Boggling Search Results
I use the most popular (and populous) search engine on the Internet quite often, and I thought I knew how to use it. Now I'm not so sure, because its logic does not correspond to what years of set theory and symbolic logic have conditioned me to expect.
Here's what happened. I searched for a phrase: "get the net". The resulting list included several categories of pages: some about fishing, some about the film "Wayne's World", some about the Internet and some about Microsoft's ".net".
The first unexpected result is the inclusion of ".NET Framework Developer Center: Get the .NET Framework 1.1". "Get the .NET" is not the same as "get the net" even when converted to lower case; " ." is not the same as " ". Punctuation is essential in phrases, and should not be ignored when seeking matches. I may disagree with the rule applied, but I can understand what happened.
The second and more problematic result came when I refined the previous search by adding "Wayne" as a criterion; there might be some pages about anglers or informatics experts named Wayne, but mostly I should get the pages about the phrase in the film. Indeed, that is what the search results seem to be, but wait a second! After excluding all the pages from the first search which do not contain the word "Wayne", the new search returned more pages! A list of 266,000 pages is supposed to be a subset of a list of 148,000?
Determined to understand how this could be, I hypothesized that, whereas a query on a list of terms automatically requires all terms from the list, a query mixing a phrase and a term could return pages with either the phrase or the term. If that were so, a query on "Wayne" alone should find at least 118,000 pages (266,000 - 148,000), and no more than 266,000. Ta-da! No way: 147,000,000. So the combined result set can be neither the union nor the intersection of the two individual result sets. What is it?
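For anyone who wants to check the reasoning, here is a minimal sketch of the set-theory bounds involved. It assumes only the reported counts from my searches; the function names are my own invention for illustration, and nothing here reflects how the search engine actually computes its numbers.

```python
# Reported (approximate) result counts from the searches described above.
phrase = 148_000        # "get the net"
term = 147_000_000      # "Wayne"
combined = 266_000      # "get the net" Wayne


def and_bounds(a, b):
    """Smallest and largest possible size of an intersection of sets sized a and b."""
    return 0, min(a, b)


def or_bounds(a, b):
    """Smallest and largest possible size of a union of sets sized a and b."""
    return max(a, b), a + b


def within(value, bounds):
    """True if value lies inside the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


fits_and = within(combined, and_bounds(phrase, term))
fits_or = within(combined, or_bounds(phrase, term))

print("consistent with AND:", fits_and)
print("consistent with OR:", fits_or)
```

With these numbers, the combined count is too large to be an intersection and far too small to be a union, which is exactly the puzzle.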
To recap the results of the various searches:
|About how many pages||Query||Remarks|
|about 145,000||"get the net" -Wayne.||Ergo, 3,000 were excluded by the condition '-Wayne'|
|about 148,000||"get the net"|
|about 266,000||"get the net" Wayne||17,000 were excluded, but why?|
|about 249,000||"get the net" "Wayne"|
|about 147,000,000||"Wayne"||None excluded (vs. next row), or 17,000 of 147,000,000 was too insignificant to mention.|
|about 142,000,000||Wayne -"get the net"||5,000,000 were excluded for containing "get the net".|
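Another way to read the table: if the counts were exact, the pages matching a query would split cleanly into those that also match a second condition and those that do not, so the two parts should add up to the whole. The sketch below checks that with the figures from the table. The helper name is hypothetical, and since the counts are only reported as "about", small gaps would be expected; it is the size of the gaps that is puzzling.

```python
# Approximate counts taken from the table above.
phrase_total = 148_000          # "get the net"
phrase_with_wayne = 266_000     # "get the net" Wayne
phrase_without_wayne = 145_000  # "get the net" -Wayne
wayne_total = 147_000_000       # "Wayne"
wayne_without_phrase = 142_000_000  # Wayne -"get the net"


def partition_gap(whole, part_with, part_without):
    """Return (part_with + part_without) - whole; zero when the counts are consistent."""
    return part_with + part_without - whole


# Pages with the phrase, split by whether they mention Wayne.
gap_phrase = partition_gap(phrase_total, phrase_with_wayne, phrase_without_wayne)

# Pages mentioning Wayne, split by whether they contain the phrase.
gap_wayne = partition_gap(wayne_total, phrase_with_wayne, wayne_without_phrase)

print("gap for the phrase:", gap_phrase)
print("gap for Wayne:", gap_wayne)
```

Neither gap is anywhere near zero: the parts of the phrase results add up to far more than the whole, while the parts of the "Wayne" results add up to several million fewer.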
Anyone with a plausible explanation, please comment!
Tags: logic : search queries : get the net : Google : quotations