Now we've got two scripts to write: the code that reads all the files on your
Web site and builds the inverted index (the crawler), and the CGI script that
looks up the words the user enters in the search form. Let's do the crawler
first.
First, we open the DBM file where the inverted index will be
stored. I'm going to use the Berkeley DB implementation because
it's fast and allows records to be of arbitrary length. We need that
capability because a common word such as "the" appears on nearly every
page, so its entry in the index can grow without limit.
Open the index file like this:
use DB_File;   # loading DB_File first makes dbmopen use Berkeley DB
dbmopen(%db, "search_index.db", 0644) or die "dbmopen: $!";
The easiest way to find files on a Unix box is, of course, the Unix
find command. In this case, we're using it to
list all the .html files on the Web site:
open(FILES, "find . -name '*.html' -print|") or die "open for find: $!";
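If you're not on a Unix box, or just want to stay inside Perl, the standard
File::Find module can build the same list. This is only a rough sketch; you
would then loop over @html_files instead of reading from the FILES handle:
use File::Find;
my @html_files;
find(sub { push @html_files, $File::Find::name if /\.html$/ }, '.');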
We open each HTML file in turn and load its contents into a variable:
my $filename;
while(defined($filename = <FILES>)) {
chomp $filename;
print "indexing $filename\n";
open(HTML, $filename) or do { warn "open $filename: $!"; next; };
my $html = join('', <HTML>);
close HTML;
Then use a regular expression to extract the title and make an entry
for this page in the table of Web pages:
my ($title) = ($html =~ /<title>([^<]*)/i);
$title = $filename if(!defined $title);
$db{--$fileno} = "<a href=\"$filename\">$title</a>";
Notice that $fileno counts downward, so the pages get the keys -1, -2, -3,
and so on. Because every page key starts with a minus sign, it can never
collide with a word (words never contain a "-"), and the sign will double
as a separator when we string file numbers together in a moment.
Now we need to make a list of all the words in the page. First
we remove the HTML tags:
$html =~ s/<[^>]+>//g;
If we want this to be a case-insensitive search, it will be easier if all
the words are stored in the same case, so let's convert the document
to lowercase:
$html =~ tr/A-Z/a-z/;
Next, we'll make a list of all the words in the document:
my @words = ($html =~ /\w+/g);
Finally, we'll append this page's file number to each word's entry in the
inverted index, making sure that we don't index the same word twice for
this page:
my $last = "";
for (sort @words) {
    next if($_ eq $last);   # already indexed this word for this page
    $last = $_;
    $db{$_} = defined $db{$_} ? $db{$_} . $fileno : $fileno;
}
}   # this closes the while loop that reads the file list
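To make the layout concrete, here is roughly what a few entries might look
like after indexing two hypothetical pages (the file names, titles, and
numbers are made up):
$db{-1}     = '<a href="./index.html">Home</a>';
$db{-2}     = '<a href="./about.html">About Us</a>';
$db{"home"} = "-1";
$db{"the"}  = "-1-2";    # on both pages; each file number begins with "-"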
That's basically it.
Run the script from the top-level directory of your Web site (the find
command starts looking in the current directory). It will create a file
there named "search_index.db" containing an index of all the words used
on your site.
(Note: The script takes a long time to run and, depending on
how many documents you have, will create a rather large file. In one
test I did, the index file took 40 percent of the disk space
used by the original files.)
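Before moving on to the CGI script, you can sanity-check the index with a
few throwaway lines like these (the word "perl" is just an example):
use DB_File;
my %db;
dbmopen(%db, "search_index.db", 0644) or die "dbmopen: $!";
my $entry = $db{"perl"} or die "that word isn't in the index\n";
print "$db{$_}\n" for ($entry =~ /-\d+/g);   # each file number begins with "-"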