Saturday, September 09, 2006
Mind-Boggling Search Results
I use the most popular (and populous) search engine on the Internet quite often; I thought I knew how. Now I'm not so sure, because its logic does not correspond to what years of set theory and symbolic logic have conditioned me to expect.
Here's what happened. I searched for a phrase: "get the net". The resulting list included several categories of pages: some about fishing, some about the film "Wayne's World", some about the Internet and some about Microsoft's ".net".
The first unexpected result is the inclusion of ".NET Framework Developer Center: Get the .NET Framework 1.1". "Get the .NET" is not the same as "get the net" even when converted to lower case; " ." is not the same as " ". Punctuation is essential in phrases, and should not be ignored when seeking matches. I may disagree with the rule applied, but I can understand what happened.
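To illustrate how such a match could come about, here is a minimal sketch in Python. It assumes (and this is only my guess, not documented behaviour of the search engine) that both the page text and the phrase are lowercased and stripped of punctuation before being compared word by word. The function names `normalize` and `contains_phrase` are made up for this example.

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(page_text: str, phrase: str) -> bool:
    """Return True if the phrase's tokens appear consecutively in the page."""
    page, wanted = normalize(page_text), normalize(phrase)
    n = len(wanted)
    return any(page[i:i + n] == wanted for i in range(len(page) - n + 1))

# Hypothetical page title, taken from the search result quoted above.
title = ".NET Framework Developer Center: Get the .NET Framework 1.1"
print(contains_phrase(title, "get the net"))  # True: ".NET" is reduced to "net"
```

Under that assumption, the leading dot of ".NET" simply disappears during normalization, so "Get the .NET" and "get the net" become the same three tokens.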
The second and more problematic result came when I refined the previous search by adding "Wayne" as a criterion; there might be some pages about anglers or informatics experts named Wayne, but mostly I should get the pages about the phrase in the film. Indeed, that is what the search results seem to be, but wait a second! After excluding all the pages from the first search which do not contain the word "Wayne", the new search returned more pages! Apparently, a list of 266,000 is a subset of a list of 148,000.
Determined to understand how this could be, I hypothesized that, whereas queries on a list of terms automatically require all terms from the list, a query mixing a phrase and a term could return pages with either the phrase or the term. If that were so, a query on "Wayne" alone should find at least 118,000 pages (266,000 - 148,000) and no more than 266,000. Tada! No way: 147,000,000. The previous result set can be neither the union of the two result sets nor their intersection. What is it?
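The set-theory argument can be checked mechanically. The sketch below uses only the basic bounds on the size of an intersection and a union; the helper names are mine, and the counts are the engine's rounded "about" figures, so this is a sanity check rather than a proof.

```python
def intersection_possible(a: int, b: int, result: int) -> bool:
    """|A ∩ B| can never exceed the smaller of |A| and |B|."""
    return 0 <= result <= min(a, b)

def union_possible(a: int, b: int, result: int) -> bool:
    """|A ∪ B| lies between max(|A|, |B|) and |A| + |B|."""
    return max(a, b) <= result <= a + b

phrase = 148_000       # "get the net"
term = 147_000_000     # Wayne
combined = 266_000     # "get the net" Wayne

print(intersection_possible(phrase, term, combined))  # False: 266,000 > 148,000
print(union_possible(phrase, term, combined))         # False: 266,000 < 147,000,000
```

Both checks fail, and the gaps are far larger than any rounding in the reported counts could explain.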
To recap the results of the various searches:
Approximate page count | Query | Remarks |
---|---|---|
about 145,000 | "get the net" -Wayne. | Ergo, 3,000 were excluded by the condition '-Wayne' |
about 148,000 | "get the net" | |
about 266,000 | "get the net" Wayne | 17,000 more than the next row, where "Wayne" is quoted, but why? |
about 249,000 | "get the net" "Wayne" | |
about 147,000,000 | "Wayne" | None excluded (vs. next row), or 17,000 of 147,000,000 was too insignificant to mention. |
about 147,000,000 | Wayne | |
about 142,000,000 | Wayne -"get the net" | 5,000,000 were excluded for containing "get the net". |
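
The two exclusion queries in the table can be tested the same way. Removing the pages of set B from set A can shrink A by at most the size of B, which gives the bounds used in this sketch. Again, the function name is my own and the counts are the rounded figures from the table.

```python
def difference_possible(a: int, b: int, result: int) -> bool:
    """|A - B| is at least |A| - |B| (never below 0) and at most |A|."""
    return max(0, a - b) <= result <= a

phrase = 148_000       # "get the net"
term = 147_000_000     # Wayne

# "get the net" -Wayne: about 145,000
print(difference_possible(phrase, term, 145_000))       # True
# Wayne -"get the net": about 142,000,000
print(difference_possible(term, phrase, 142_000_000))   # False
```

The first exclusion is plausible, but the second is not: excluding a phrase found on only about 148,000 pages supposedly removed 5,000,000 pages.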
Anyone with a plausible explanation, please comment!
Tags: logic : search queries : get the net : Google : quotations