With all the hype about Web search engines, you might think writing
one is a feat not to be attempted by mere mortals. But writing a
small search engine is actually quite easy. Don't believe me? Well, in
the space of this column, I'll show you how to add a search page to
your own Web site with just a few lines of Perl. (For
this to work, your site should be hosted on a computer running Unix,
and you need the ability to install CGI scripts.)
Coming up with an
algorithm
For the most part, a Web site is just a directory tree. If you didn't care
about efficiency or response time, you could write a script using find and grep that would search your
site the same way you would from the command line. But this brute-force method
will bog down as your site grows, since each search has to read every single file on
your site.
It's much better to build what's known as an inverted index. This is a table of words much like the index in the
back of a book. For example, suppose we had a very simple Web site
containing only two pages that look like this:
one.html:
<html><head>
<title>Doc One</title>
</head><body>
<p>This is document one.
</body></html>
two.html:
<html><head>
<title>Doc Two</title>
</head><body>
<p>Here is another document.
</body></html>
To index this Web site, we need to create two tables. First, we
number each page and list its title and URL. This saves space by
allowing us to use numbers to refer to the pages later on:
1 => /one.html, "Doc One"
2 => /two.html, "Doc Two"
Next, we create the inverted index itself by listing each word and
the documents it appears in:
another => 2
doc => 1,2
document => 1,2
here => 2
is => 1,2
one => 1
this => 1
two => 2
To do a search, look in the inverted index for the word you
want, then look up the Web page listed after that word. So, if you
search for the word "here," our script will look up "here" in the
inverted index, find "2," then look up "2" and print information
about the document as a link:
- <a href="/two.html">Doc Two</a>
Similarly, if you type two words, the script would look up
both words in the inverted index and list only the pages that
contain both.
next page»