Invalid Domain

You are here :

Roll Your Own Search Engine
by Brian Slesinsky 23 Apr 1997

Brian Slesinsky is a former HotWired engineer. He left the company to work full-time on his death ray.

Page 1

With all the hype about Web search engines, you might think writing one is a feat not to be attempted by mere mortals. But writing a small search engine is actually quite easy. Don't believe me? Well, in the space of this column, I'll show you how to add a search page to your own Web site with just a few lines of Perl. (For this to work, your site should be hosted on a computer running Unix, and you need the ability to install CGI scripts.)

Coming up with an algorithm

For the most part, a Web site is just a directory tree. If you didn't care about efficiency or response time, you could write a script using find and grep that would search your site the same way you would from the command line. But this brute-force method will bog down as your site grows, since each search has to read every single file on your site.

It's much better to build what's known as an inverted index. This is a table of words much like the index in the back of a book. For example, suppose we had a very simple Web site containing only two pages that look like this:

one.html:

   <html><head>
     <title>Doc One</title>
   </head><body>
     <p>This is document one.
   </body></html>

two.html:

    <html><head>
      <title>Doc Two</title>
    </head><body>
      <p>Here is another document.
    </body></html>

To index this Web site, we need to create two tables. First, we number each page and list its title and URL. This saves space by allowing us to use numbers to refer to the pages later on:

  1 => /one.html, "Doc One"
  2 => /two.html, "Doc Two"

Next, we create the inverted index itself by listing each word and the documents it appears in:

  another => 2
  doc => 1,2
  document => 1,2
  here => 2
  is => 1,2
  one => 1
  this => 1
  two => 2

To do a search, look in the inverted index for the word you want, then look up the Web page listed after that word. So, if you search for the word "here," our script will look up "here" in the inverted index, find "2," then look up "2" and print information about the document as a link:

Similarly, if you type two words, the script would look up both words in the inverted index and list only the pages that contain both.

next page»