The principles at work behind a search engine are really simple, but, I think, seldom understood.
This is a series of posts about the subject; my goal is to keep each post really short. This first post is about principles; in the other parts we’ll build a search engine from scratch.
Simple, you say?
Google or Bing employ hundreds of the best engineers in the world and their products are far from perfect: so the problem must really be difficult?
Well, yes, the many problems Google or Bing are trying to solve are extremely complex, esp. ranking and spam-fighting.
(Ranking is useful if you’re searching a very big corpus; spam-fighting is relevant if there are people out there trying to game your results. But if you’re trying to search a closed corpus of reasonable size, then you don’t have these problems.)
I insist, the basics are simple. A search engine:
- indexes a group of documents for the words it contains
- uses this index to answer queries
What is an index?
An index is a list of words, and for each word, a list of documents containing these words. So for example if our corpus contains the following two documents:
- doc1 “The white cat ate a mouse.”
- doc2 “My dog is more lazy than my cat.”
the index for this corpus will be:
- a doc1
- ate doc1
- cat doc1 doc2
- dog doc2
- is doc2
- lazy doc2
- mouse doc1
- more doc2
- my doc2
- than doc2
- the doc1
- white doc1
That’s it. That’s the single most important concept of how a search engine actually works.
(We’ll come back to how to build an index further in this series).
How do queries work?
Once we have indexed our corpus, answering queries is just a matter of reading the index:
- retrieving the list of documents for each word in the query
- crossing those lists and returning the documents common to all lists.
Let’s see how this works for sample queries:
-
query1: “dog cat”
-
meaning of the query: give me the list of documents that contain both words “dog” and “cat”
-
how to solve the query:
- retrieve the list of docs for “dog”: doc2
- retrieve the list of docs for “cat”: doc1 doc2
- compute the intersection of those two lists: doc2
- return the result: doc2
-
query2: “big dog”
-
meaning of the query: give me the list of documents that contain both words “big” and “dog”
-
how to solve the query:
- retrieve the list of docs for “big”: -nil-
- return the answer: -nil-
- there are no other steps because any intersection containing an empty list will always result in an empty list
But what about “OR”?
The astute reader will notice that in the sample queries above we always imply an “AND” operator, which means that we ask the search engine which documents contain ALL of the words in the query (“dog AND cat”). This is how modern search engines work — they work that way because it makes more sense, and it reduces noise.
But of course you could want to have a list of documents that contain ANY of the words of the query (dog OR cat). AltaVista operated that way by default: I don’t think it helped them stay in business, but let’s try do do that anyway.
All we have to do is to build a union of the lists of documents instead of an intersection; the solution to query2 would therefore be:
- query2b: “big OR dog”
- meaning of the query: give me the list of documents that contain either “big” or “dog”
- how to solve the query:
- retrieve the list of docs for “big”: -nil-
- retrieve the list of docs for “dog”: doc2
- compute the union of those two lists: doc2
- return the result: doc2
And that’s it!
Quick summary: a search engine
- indexes documents
- (therefore, needs an index)
- returns lists of documents from its index that match the words in the query