Great Circle Associates Majordomo-Workers
(March 1999)
 


Subject: Re: Search engine hooks
From: Oliver Xymoron <oxymoron @ waste . org>
Date: Sun, 28 Mar 1999 10:32:46 -0600 (CST)
To: Jason L Tibbitts III <tibbs @ math . uh . edu>
Cc: majordomo-workers @ greatcircle . com
In-reply-to: <ufasoaqrsjf.fsf@epithumia.math.uh.edu>

On 28 Mar 1999, Jason L Tibbitts III wrote:

> OX> I'd like to be able to send a command like this:
> 
> OX> search listname "query string"
> 
> I've planned on implementing something like this, but it's been a pretty
> low priority for me as Wilma takes care of most of my needs in that area
> (though it is of course web-only).

I have many private lists I'd like to search, in addition to the public
ones. Therefore, access control has to be through Majordomo.
 
> OX> The first, obviously, is the search front-end that takes the listname
> OX> and the query, calls a search engine, and returns some result text.
> 
> I can hook something up pretty quickly, probably.
> 
> OX> The second is a hook called for each message archived, which passes the
> OX> necessary information to the search back-end for it to pull it out of
> OX> the archive and add it to the index.
> 
> Hmmm.  This is more interesting.  When cooking up the archive format I had
> two basic requirements:
> 
> 1) It has to be mbox format (compatibility).
> 
> 2) It should be easy to go from a byte, line or message number to the
>    starting offset of the message and its byte length (so you can just seek
>    and sysread.)
> 
> So each mbox file has an associated index file, which contains message
> number, line number of start of message, byte offset of start of message,
> length in bytes of message and some header data (Subject:, From:,
> Message-ID:, References:, Date:) of each message.  If the search engine is
> simply run periodically over the mbox files and gives back line numbers of
> matches then it should be easy to sift through the index and give the
> subjects and message numbers of the hits.  We already have retrieval of
> message numbers (in digests) although this is currently pretty rough.
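
For illustration, the index layout described above could be sketched
roughly as follows. This is Python rather than Majordomo's own Perl, and
the names build_index, fetch and message_for_line are hypothetical, not
part of Majordomo. It assumes every line starting with "From " begins a
new message, and records only the Subject: header.

```python
import bisect
import io

def build_index(mbox_bytes):
    """Scan mbox data and return one index entry (a dict) per message."""
    index = []
    offset = 0          # byte offset of the line being examined
    line_no = 0         # 1-based line number of the line being examined
    current = None
    in_headers = False
    for line in io.BytesIO(mbox_bytes):
        line_no += 1
        if line.startswith(b"From "):
            # Assumption: a line starting with "From " is a message separator.
            if current is not None:
                current["length"] = offset - current["offset"]
            current = {"number": len(index) + 1, "line": line_no,
                       "offset": offset, "length": 0, "subject": ""}
            index.append(current)
            in_headers = True
        elif in_headers:
            if line in (b"\n", b"\r\n"):
                in_headers = False      # blank line ends the header block
            elif line.lower().startswith(b"subject:"):
                current["subject"] = line[8:].strip().decode("ascii", "replace")
        offset += len(line)
    if current is not None:
        current["length"] = offset - current["offset"]
    return index

def fetch(fileobj, entry):
    """Seek to the message start and read exactly its length in bytes."""
    fileobj.seek(entry["offset"])
    return fileobj.read(entry["length"])

def message_for_line(index, line_no):
    """Map a line number reported by a search engine to its index entry."""
    starts = [entry["line"] for entry in index]
    pos = bisect.bisect_right(starts, line_no) - 1
    return index[pos] if pos >= 0 else None
```

Given a hit at some line number, message_for_line returns the entry whose
subject and message number can be reported, and fetch pulls the message
itself out of the mbox file.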

Not quite generic enough to be pluggable, though it works perfectly with
my current implementation, which has an auxiliary module for indexing mbox
files. What I'd like is:

 indexmsg mbox-file byte-offset length

  or

 indexmsg list-name msg-number 

  with a corresponding

 mjshell getmessage list-name msg-number

This will allow the index to always be current.
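
As a rough sketch of the first form of the hook (mbox file, byte offset,
length), a back-end might do something like the following. This is
Python, not Majordomo code; the SearchBackend class, its in-memory
postings table and the word-splitting rule are all assumptions made for
the example.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

class SearchBackend:
    """Hypothetical back-end that is told about each archived message."""

    def __init__(self):
        # term -> set of (mbox_path, offset, length) locations
        self.postings = defaultdict(set)

    def indexmsg(self, mbox_path, offset, length):
        """Read one message out of the archive and add its words to the index."""
        with open(mbox_path, "rb") as f:
            f.seek(offset)
            text = f.read(length).decode("latin-1").lower()
        for term in set(WORD.findall(text)):
            self.postings[term].add((mbox_path, offset, length))

    def search(self, query):
        """Return the locations of messages containing every query word."""
        terms = WORD.findall(query.lower())
        if not terms:
            return set()
        hits = None
        for term in terms:
            found = self.postings.get(term, set())
            hits = set(found) if hits is None else hits & found
        return hits
```

Because the hook runs once per archived message, a message is searchable
as soon as it has been written to the archive.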

> OX> By the way, I've got a cute little full-text search engine module that
> OX> I'm nearly ready to unleash on the world..
> 
> What's the indexer?  I've basically given up on Glimpse as being totally
> useless; the new versions don't work and the old versions don't build on
> new OSes.  I looked at FreeWAIS-sf but the documentation is useless.  Or
> is your module itself the indexer?

It's completely homebrew with a tied GNU DBM backend. Unlike Glimpse,
all data for performing searches, including abstracts, etc., are stored in
the database, which means it can be used for indexing remote files,
but also means the indexes are somewhat larger.
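
To illustrate the trade-off (this is not the module itself, which is not
shown here), a self-contained index of this general kind could be sketched
in Python with the standard dbm module standing in for a tied GNU DBM
file. The key layout ("term:" and "abstract:" prefixes), the function
names and the abstract length are assumptions for the example, and
document ids are assumed to contain no whitespace.

```python
import dbm
import re

WORD = re.compile(r"[a-z0-9]+")

def add_document(db, doc_id, text, abstract_len=60):
    """Store an abstract and the word postings for one document in the database."""
    db[("abstract:" + doc_id).encode("utf-8")] = text[:abstract_len].encode("utf-8")
    for term in set(WORD.findall(text.lower())):
        key = ("term:" + term).encode("utf-8")
        ids = db[key].decode("utf-8").split() if key in db else []
        if doc_id not in ids:
            ids.append(doc_id)
            db[key] = " ".join(ids).encode("utf-8")

def search(db, term):
    """Return (doc_id, abstract) pairs using only data held in the database."""
    key = ("term:" + term.lower()).encode("utf-8")
    if key not in db:
        return []
    results = []
    for doc_id in db[key].decode("utf-8").split():
        abstract = db[("abstract:" + doc_id).encode("utf-8")].decode("utf-8")
        results.append((doc_id, abstract))
    return results
```

Nothing in search touches the original documents, which is why remote
files can be indexed; the stored abstracts are also what makes the
database larger.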

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 


