Roll Your Own Search Engine

Page 3 — Building the Crawler

Now we've got two scripts to write: the code that reads all the files on your Web site and builds the inverted index (the crawler), and the CGI script that looks up the words the user enters in the search form. Let's do the crawler first.


First, we open the DBM file where the inverted index will be stored. I'm going to use the Berkeley DB implementation because it's fast and allows records to be of arbitrary length. We need that capability because there's no limit to how many times a common word such as "the" can appear on a Web site.

Open the index file like this:

  use DB_File;
  use Fcntl;
  # dbmopen() won't pick up DB_File just because it's loaded,
  # so we use the tie interface to get Berkeley DB explicitly.
  tie(%db, 'DB_File', 'search_index.db', O_CREAT|O_RDWR, 0644)
      or die "tie: $!";

The easiest way to find files on a Unix box is, of course, the Unix find command. In this case, we're using it to list all the .html files on the Web site:

  open(FILES, "find . -name '*.html' -print|") or die "open for find: $!";

We open each HTML file in turn and load its contents into a variable:

  my $filename;
  while (defined($filename = <FILES>)) {
      chomp $filename;
      print "indexing $filename\n";
      open(HTML, $filename) or do { warn "open $filename: $!"; next; };
      my $html = join('', <HTML>);
      close HTML;

Then use a regular expression to extract the title and make an entry for this page in the table of Web pages:

      my ($title) = ($html =~ /<title>([^<]*)/i);
      $title = $filename unless defined $title;
      # File numbers are negative (-1, -2, ...); the leading minus
      # sign will double as a delimiter in the inverted index later.
      $db{--$fileno} = "<a href=\"$filename\">$title</a>";

Now we need to make a list of all the words in the page. First we remove the HTML tags:

      $html =~ s/<[^>]+>//g;

If we want this to be a case-insensitive search, it will be easier if all the words are stored in the same case, so let's convert the document to lowercase:

      $html =~ tr/A-Z/a-z/;

Next, we'll make a list of all the words in the document:

      my @words = ($html =~ /\w+/g);

Finally, we'll append each word's file number to the appropriate row in the inverted index, making sure we don't index the same word twice for the same page. Because the file numbers are negative, every number in a row begins with a minus sign, so we can concatenate them directly and still split them apart later.

      my $last = "";
      for (sort @words) {
          next if($_ eq $last);
          $last = $_;
          $db{$_} = defined $db{$_} ? $db{$_}.$fileno : $fileno;
      }

That's basically it. When you run it on your Web site, it will create a file named "search_index.db" in the top-level directory of your site. This file will contain an index of all the words used on your site.
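Before moving on to the CGI script, it can be reassuring to check that the index actually works. Here's a minimal sketch of reading a row back, using DB_File's tie interface and the negative-file-number format described above (the query word "perl" is just an illustrative example):

  use DB_File;
  use Fcntl;

  tie(my %db, 'DB_File', 'search_index.db', O_RDONLY, 0644)
      or die "tie: $!";

  my $row = $db{'perl'};    # a row looks something like "-1-4-9"
  if (defined $row) {
      # Each file number starts with a minus sign, so a global
      # match splits the row back into individual numbers.
      for my $fileno ($row =~ /-\d+/g) {
          print "$db{$fileno}\n";    # the stored <a href=...> link
      }
  }

Each file number is also a key in the same DBM file, so looking it up gives back the ready-made link we stored for that page.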

(Note: The script takes a long time to run and, depending on how many documents you have, will create a rather large file. In one test I did, the index file took 40 percent of the disk space used by the original files.)
