The principles at work behind a search engine are really simple, but, I think, seldom understood.
This is a series of posts about the subject; my goal is to keep each post really short. This first post is about principles; in the other parts we’ll build a search engine from scratch.
Google or Bing employ hundreds of the best engineers in the world and their products are far from perfect: so the problem must really be difficult?
Well, yes, the many problems Google or Bing are trying to solve are extremely complex, esp. ranking and spam-fighting.
(Ranking is useful if you’re searching a very big corpus; spam-fighting is relevant if there are people out there trying to game your results. But if you’re trying to search a closed corpus of reasonable size, then you don’t have these problems.)
I insist, the basics are simple. A search engine:
An index is a list of words, and for each word, a list of documents containing these words. So for example if our corpus contains the following two documents:
the index for this corpus will be:
That’s it. That’s the single most important concept of how a search engine actually works.
(We’ll come back to how to build an index further in this series).
Once we have indexed our corpus, answering queries is just a matter of reading the index:
Let’s see how this works for sample queries:
query1: “dog cat”
meaning of the query: give me the list of documents that contain both words “dog” and “cat”
how to solve the query:
query2: “big dog”
meaning of the query: give me the list of documents that contain both words “big” and “dog”
how to solve the query:
The astute reader will notice that in the sample queries above we always imply an “AND” operator, which means that we ask the search engine which documents contain ALL of the words in the query (“dog AND cat”). This is how modern search engines work — they work that way because it makes more sense, and it reduces noise.
But of course you could want to have a list of documents that contain ANY of the words of the query (dog OR cat). AltaVista operated that way by default: I don’t think it helped them stay in business, but let’s try do do that anyway.
All we have to do is to build a union of the lists of documents instead of an intersection; the solution to query2 would therefore be:
Quick summary: a search engine