Invalid Domain

You are here :

Roll Your Own Search Engine

Page 2 — Deciding on a Data Structure

It wouldn't be hard to store the inverted index in an ordinary file and grep through it for the words we want. Since the index would be smaller than the site, this would be an improvement over using find and grep to search the whole site.

For large sites, however, even the index file can be huge, and reading the whole file on every search just to find a few words would be rather wasteful. So, let's use a DBM file. It so happens that both the table of Web pages and the inverted index consist entirely of (name => value) records, which makes it quite easy to map them to strings in the DBM file. In Perl's notation, our example index would now look like this:

  %dbm = (
    '-1' => '<a href="/one.html">Doc One</a>',
    '-2' => '<a href="/two.html">Doc Two</a>',

    'another' => '-2',
    'doc' => '-1-2',
    'document' => '-1-2',
    'here' => '-2',
    'is' => '-1-2',
    'one' => '-1',
    'this' => '-1',
    'two' => '-2'
  );

(I've switched to using negative numbers so any numbers that appear as words within Web pages won't conflict with the document index.)

next page»