All Databases MacTech Vol 14-1998

Using AIAT

Volume Number: 14

Issue Number: 4

Column Tag: Emerging Technologies

Using Apple Information Access Toolkit

by Mark Holtz

Apple's new indexing technology provides powerful ways to

search and retrieve your data, regardless of how it's

stored

Exactly What is AIAT, and What Can It Do For You?

The Apple Information Access Toolkit (AIAT) is a library of routines designed to

distill, index, and query collections of textual data. The elegance of the technology

stems from its complete independence from the actual data source it is indexing. It can

be used in a fully-interfaced, stand-alone application, or in a memory-constrained

plug-in. You can feed it everything from your next-generation database to a catalog of

your Compact Disc collection. Best of all, rumor has it that AIAT will be a part of

Rhapsody, so your hard work now will pay off in the future.

Sounds great! How do I use it?

If you haven't guessed by now, you're going to have to get your hands dirty and write

some code. AIAT provides a set of C++ objects that can be sub-classed to provide the

desired functionality. Strange runtime quirks notwithstanding, the overall process is

actually easier than it may first appear. In a recent development project, I created an

AIAT-based plug-in that indexed and queried data from a third-party database in less

than two days. This article will focus on that development project and some of the

lessons I learned. The most important lesson was that AIAT will do a lot of the work for

you, but you have to tell it exactly what you want.

AIAT: A Technology Overview

Upon unwrapping the AIAT package, you will find a nice, clean set of components to

speed you on your way. Documentation is provided in Adobe PDF format, and is

relatively comprehensive in scope, but occasionally lacks the necessary depth to help

you fully understand a particular object. Next comes a set of 68K and PowerPC libraries for

Metrowerks' CodeWarrior 11 and Pro 1. Where there are libraries, there are also

headers. AIAT's headers comprise what I consider to be the missing portions of the

documentation. Not only are they slightly more current than the documentation, but they

also provide the necessary implementation details one needs to avoid MacsBug. The

C++ classes in AIAT are robust, but you only need to deal with a small number of

methods to create a working application. Finally, there are two example applications,

one that indexes files in a folder and another that lets you query against the generated

indices. They are fairly complete in their coverage of basic AIAT concepts from a client

perspective, but give only marginal insight into how new data sources can be

interfaced to AIAT.

The AIAT documentation stresses thorough design and analysis. Having a clear picture

of which objects deal with which data and where that data goes will make your coding

efforts much more enjoyable. Now that you've done that, let's get down to architecture.

AIAT functionality is broken up into six major categories: Index, Analysis, Corpus,

Accessor, Storage, and Storable. The Index classes handle creation and maintenance of

keyword lists and document references. The Analysis classes are responsible for

generating and filtering keywords based on various criteria. The Corpus classes are an

abstraction layer for obtaining data from various sources. The Accessor classes

provide query and statistical information from indices. Finally, the Storage and

Storable classes provide a flexible mechanism for the storage and retrieval of

arbitrary sets of data. A quick browse through the headers shows an abstract C++

object for each of these categories, such as the class IACorpus.

You will also find various utility functions, including a set of memory management

calls. You can substitute your own allocator and deallocator routines easily, and AIAT

will happily use them for block and object allocations. There are a few robust concrete

subclasses provided, which you can use directly or subclass to enhance their

functionality. The HFSCorpus and HFSTextFolderCorpus classes allow AIAT to index

collections of text documents. The EnglishAnalysis class allows AIAT to filter document

keywords with arbitrary stop-word and word-stem lists. AIAT makes use of its own

exception handling mechanism, based on C++ exceptions with a few additions. A little

exploration through all the components of the AIAT distribution is time well spent, as

there are a number of classes and functions not specifically called out in the

documentation.

The Task at Hand

I was first introduced to AIAT early in 1997 during a conversation with a good friend

of mine. He suggested that I look into it as a possible full-text indexing solution for a

Mac-based Web site we were creating. As the months passed, we settled on Purity

Software's WebSiphon product as the back-end scripting engine for this Web site.

WebSiphon includes a fast flat-file database called Verona. It had basic search

capabilities, but they were not robust enough to support the kind of ranked queries

you'd find on more powerful full-text indexing systems. WebSiphon's language is

extensible via Code Fragment libraries, so the idea for an AIAT library for WebSiphon

became an appealing solution. We could collect information from a variety of sources

with WebSiphon, store it in Verona, index it using AIAT, and then query against that

data using WebSiphon scripts. Since AIAT is simply a set of libraries with little

runtime environment dependency, the task was not daunting. The project was distilled

into a few discrete tasks: Create the scripting interface for WebSiphon, write the C++

code to interface with the AIAT accessor and index classes, write the C++ code for

interfacing AIAT to Verona, and make the entire thing work in a multi-threaded

environment. Each phase of the project involved one of AIAT's major areas of

functionality, so I was able to focus on one set of concepts at a time, a tribute to AIAT's

modular design.

Habeas Corpus

The first task to tackle was to provide AIAT with an interface to Verona so it could

access the reams of data our site would produce. This involved getting familiar with

AIAT's Corpus classes. In AIAT, the Corpus provides access to a collection of

"documents" and the data they contain. The IADoc class is the abstract representation

for these documents, and it consists of a name for the document and access methods for

the data. In this case, our documents were records in the Verona database, and the data

would be accessed via an API to the Verona application instead of reading it directly

from a file. After a little bit of work with the Code Fragment Manager, I had a

functional C++ interface to Verona, so it was time to tell AIAT how to deal with it.

First, I created the CVeronaCorpus class, and defined the two pure virtual methods

required to make it work: GetProtoDoc and GetDocText.

Listing 1: CVeronaCorpus.h

CVeronaCorpus

Class definition and required methods for our Corpus interface to a Verona database.

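class CVeronaCorpus : public IACorpus
{
public:
    CVeronaCorpus(CVeronaGlue* inVeronaGlue,
                  const char* inDBName);
    virtual ~CVeronaCorpus();

    // The two pure virtual methods every Corpus must define.
    virtual IADoc*         GetProtoDoc();
    virtual IADocText*     GetDocText(const IADoc* doc);

    // Optional: lets AIAT walk every document in the Corpus.
    virtual IADocIterator* GetDocIterator();
};

// Hands AIAT a "sample" document of the type this Corpus contains.
IADoc* CVeronaCorpus::GetProtoDoc()
{
    return new CVeronaDoc(this, 0);
}

// Hands AIAT an object that can stream the text of the given document.
IADocText* CVeronaCorpus::GetDocText(const IADoc* doc)
{
    return new CVeronaDocText(this, (CVeronaDoc*) doc);
}

// Hands AIAT an object that steps through the records in order.
IADocIterator* CVeronaCorpus::GetDocIterator()
{
    return new CVeronaDocIterator(this);
}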

Phew! That wasn't so bad. GetProtoDoc() returns a new CVeronaDoc object, and

GetDocText() returns a new CVeronaDocText object. But what are these objects?

CVeronaDoc is a subclass of IADoc, AIAT's abstract representation of a document within

a Corpus. CVeronaDocText is a subclass of IADocText, which is responsible for

providing the actual text of the document to AIAT's indexing functions. The third

method, which is not required by AIAT to make a valid Corpus mechanism, is

GetDocIterator(). The IADocIterator class is used to implement a Corpus that consists

of multiple documents, and for providing each of those documents to AIAT in a

consistent, ordered fashion. You may notice as you peruse the AIAT documentation that

many functions deal with IADoc objects as a fundamental unit of data. It is up to the Corpus to

determine what that unit of data is and how to return it to AIAT when it's requested.
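
To make the iterator's job concrete, here is a minimal sketch of the pattern, not the plug-in's actual code. It cannot include the AIAT headers, so SketchDoc and SketchDatabase are hypothetical stand-ins for IADoc and the Verona glue, and it assumes that GetNextDoc() signals the end of the collection with a null pointer and that the caller owns each returned document.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for IADoc: only the record reference matters here.
struct SketchDoc {
    unsigned long recRef;
};

// Hypothetical stand-in for a Verona database: an ordered list of record references.
struct SketchDatabase {
    std::vector<unsigned long> recRefs;
};

// Mirrors the shape of CVeronaDocIterator: walk the records in a stable order
// and hand back one newly allocated document per call.
class SketchDocIterator {
public:
    explicit SketchDocIterator(const SketchDatabase& db)
        : mDb(db), mCurrentIndex(0) {}

    // Assumed convention: null pointer at the end, caller deletes the result.
    SketchDoc* GetNextDoc() {
        if (mCurrentIndex >= mDb.recRefs.size()) return nullptr;
        return new SketchDoc{mDb.recRefs[mCurrentIndex++]};
    }

private:
    const SketchDatabase& mDb;
    std::size_t mCurrentIndex;
};

// Drains an iterator and reports the record references in the order delivered.
std::vector<unsigned long> CollectRecRefs(const SketchDatabase& db) {
    std::vector<unsigned long> seen;
    SketchDocIterator it(db);
    while (SketchDoc* doc = it.GetNextDoc()) {
        seen.push_back(doc->recRef);
        delete doc;
    }
    assert(it.GetNextDoc() == nullptr);  // an exhausted iterator stays exhausted
    return seen;
}
```

Calling CollectRecRefs() twice on the same database yields the same sequence both times, which is the consistency the Corpus has to guarantee.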

Here are the definitions of the other Corpus subclasses I created:

Listing 2: CVeronaCorpus.h (cont'd.)

CVeronaDoc, CVeronaDocText, CVeronaDocIterator

Classes for representing the various sub-elements of CVeronaCorpus.

class CVeronaDocIterator : public IADocIterator
{
public:
    CVeronaDocIterator(CVeronaCorpus* inCorpus);
    virtual ~CVeronaDocIterator();

    // Returns the next document in the Corpus, in a consistent order.
    virtual IADoc* GetNextDoc();

private:
    CVeronaCorpus* mCorpus;
    unsigned long  mCurrentIndex;
};

class CVeronaDoc : public IADoc
{
public:
    CVeronaDoc(CVeronaCorpus* inCorpus,
               unsigned long inRecRef);
    virtual ~CVeronaDoc();

    // Copying and conversion to and from a data stream.
    IAStorable* DeepCopy() const;
    IABlockSize StoreSize() const;
    void        Store(IAOutputBlock* output) const;
    IAStorable* Restore(IAInputBlock* input) const;

    // Ordering of documents within the index.
    bool LessThan(const IAOrderedStorable* neighbor) const;
    bool Equal(const IAOrderedStorable* neighbor) const;

    virtual byte* GetName(uint32 *length) const;

    // The Verona record this document represents.
    unsigned long GetRecRef(void) const { return mRecRef; }

protected:
    // Construct the superclass portions during a DeepCopy or Restore.
    virtual void DeepCopying(const IAStorable* source);
    virtual void Restoring(IAInputBlock* input,
                           const IAStorable* proto);

private:
    CVeronaCorpus* mCorpus;
    unsigned long  mRecRef;
};

class CVeronaDocText : public IADocText
{
public:
    CVeronaDocText(CVeronaCorpus* inCorpus,
                   CVeronaDoc* inDoc);
    virtual ~CVeronaDocText();

    // Supplies the next chunk of the record's text to AIAT.
    virtual uint32 GetNextBuffer(byte* buffer,
                                 uint32 bufferLen);
    virtual IADocText* DeepCopy() const;

private:
    CVeronaCorpus* mCorpus;
    CVeronaDoc*    mDoc;
    byte*          mBuffer;
    unsigned long  mAmtRead;   // how much AIAT has already been given
    unsigned long  mBufSize;
};

At this point, we've got all the elements for our Corpus implementation. However, it

may still be unclear how these items work together. When AIAT receives a request to

update a particular index, it starts a dialog with the Corpus object that is tied to that

index. It starts by asking, "What sort of documents do you contain?" By calling

GetProtoDoc(), the Corpus can supply AIAT with a "sample" document. AIAT then asks

for an object that can iterate through all the documents in the Corpus' collection. If one

is available, it is returned via the GetDocIterator() method. Since AIAT knows nothing

about the particular data set it's indexing, the Corpus needs to provide these

mechanisms. If a document iterator is available (which is true in this case), AIAT

begins asking the iterator for successive documents in the collection.

AIAT makes two assumptions about documents that you must keep in mind. First, all

documents in a particular Corpus are of the same type (i.e. CVeronaDoc), and second,

that the order of the documents is always the same for a particular set of documents.

The latter is important because AIAT uses the document sequence for the indexing

mechanism. Hence, the notion of a document being "Less than" another document really

has to do with its order in this sequence. Be sure to be consistent for whatever kind of

data you're delivering, and this should not be a problem. Getting back to the dialog,

AIAT now has an IADoc it can work with. It starts by asking the document for an

IADocText object that contains the document's data. In our example, the

CVeronaDocText class knows how to access the text in a Verona database record, so our

CVeronaDoc object hands a fresh CVeronaDocText object back to AIAT. This object can

access its parent CVeronaDoc object, and uses that link to obtain the record number in

the database that the CVeronaDoc represents.
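
The dialog described above can be sketched from the indexer's side. This is only an illustration under stated assumptions: Doc, DocText, DocIterator, and Corpus are hypothetical, heavily reduced stand-ins for the AIAT classes, LoggingCorpus is an invented in-memory Corpus that records each request it receives, and UpdateIndex() is my guess at the order of calls based on the description here, not AIAT's real update code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced stand-ins for IADoc, IADocText, IADocIterator, IACorpus.
struct Doc { unsigned long recRef; };
struct DocText { std::string text; };

struct DocIterator {
    virtual ~DocIterator() {}
    virtual Doc* GetNextDoc() = 0;              // assumed: null at the end
};

struct Corpus {
    virtual ~Corpus() {}
    virtual Doc* GetProtoDoc() = 0;
    virtual DocIterator* GetDocIterator() = 0;  // assumed: null if unsupported
    virtual DocText* GetDocText(const Doc* doc) = 0;
};

// Invented in-memory Corpus that logs every request it answers.
class LoggingCorpus : public Corpus {
public:
    LoggingCorpus(std::vector<std::string> records, std::vector<std::string>& log)
        : mRecords(std::move(records)), mLog(log) {}

    Doc* GetProtoDoc() override {
        mLog.push_back("GetProtoDoc");
        return new Doc{0};
    }
    DocIterator* GetDocIterator() override {
        mLog.push_back("GetDocIterator");
        return new Iter(*this);
    }
    DocText* GetDocText(const Doc* doc) override {
        assert(doc->recRef < mRecords.size());
        mLog.push_back("GetDocText(" + std::to_string(doc->recRef) + ")");
        return new DocText{mRecords[doc->recRef]};
    }

private:
    struct Iter : DocIterator {
        explicit Iter(LoggingCorpus& c) : corpus(c), next(0) {}
        Doc* GetNextDoc() override {
            if (next >= corpus.mRecords.size()) return nullptr;
            corpus.mLog.push_back("GetNextDoc");
            return new Doc{next++};
        }
        LoggingCorpus& corpus;
        unsigned long next;
    };

    std::vector<std::string> mRecords;
    std::vector<std::string>& mLog;
};

// The indexer's half of the conversation; returns the number of bytes of text seen.
std::size_t UpdateIndex(Corpus& corpus) {
    std::unique_ptr<Doc> proto(corpus.GetProtoDoc());          // "What do you contain?"
    std::unique_ptr<DocIterator> it(corpus.GetDocIterator());  // "Can you list them?"
    if (!it) return 0;
    std::size_t bytesSeen = 0;
    for (std::unique_ptr<Doc> doc(it->GetNextDoc()); doc; doc.reset(it->GetNextDoc())) {
        std::unique_ptr<DocText> text(corpus.GetDocText(doc.get()));
        bytesSeen += text->text.size();                        // placeholder for real indexing
    }
    return bytesSeen;
}
```

Running UpdateIndex() against a LoggingCorpus and then reading the log shows the requests arriving in the same order as the conversation above: prototype first, iterator second, then one GetNextDoc and GetDocText pair per record.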

Finally, it is time for AIAT to get the text of the document. It does this by calling

CVeronaDocText's GetNextBuffer() method. This method copies up to the requested number

of bytes from the document into the supplied buffer and returns the number of bytes

copied. Note that the CVeronaDocText object must maintain its own

information about what data AIAT has already requested. It may help to think of

CVeronaDocText as a one-way stream of data that is read in chunks of arbitrary size.

AIAT will continue to call GetNextBuffer() until the method returns zero, indicating

the end of the data.
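
The one-way stream idea is easy to get subtly wrong, so here is a small sketch of it. It is not the real CVeronaDocText: SketchDocText is a hypothetical stand-in that holds the record's text in a std::string rather than fetching it through the Verona API, and DrainText() is an invented helper that plays the part of AIAT's read loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for CVeronaDocText.
class SketchDocText {
public:
    explicit SketchDocText(const std::string& recordText)
        : mText(recordText), mAmtRead(0) {}

    // Copies up to bufferLen bytes into buffer and returns the number copied.
    // A return value of zero means the end of the data has been reached.
    std::uint32_t GetNextBuffer(std::uint8_t* buffer, std::uint32_t bufferLen) {
        const std::size_t remaining = mText.size() - mAmtRead;
        const std::size_t count = std::min<std::size_t>(remaining, bufferLen);
        if (count > 0) {
            std::memcpy(buffer, mText.data() + mAmtRead, count);
            mAmtRead += count;  // remember how far the caller has read
        }
        return static_cast<std::uint32_t>(count);
    }

private:
    std::string mText;
    std::size_t mAmtRead;
};

// Reads a whole document in chunks: keep asking until a buffer comes back empty.
std::string DrainText(SketchDocText& text, std::uint32_t chunkSize) {
    assert(chunkSize > 0);
    std::string out;
    std::string chunk(chunkSize, '\0');
    std::uint32_t got = 0;
    while ((got = text.GetNextBuffer(
                reinterpret_cast<std::uint8_t*>(&chunk[0]), chunkSize)) != 0) {
        out.append(chunk, 0, got);
    }
    return out;
}
```

The chunk size is the caller's choice, so the object should return the same text whether it is asked for 4 bytes at a time or 4,000.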

AIAT will continue to call the iterator and resulting document and document text

objects to obtain the complete set of data in the Corpus' collection. The other methods of

IADoc are used to determine how a particular document should be placed in the index. It

is necessary to override these in your subclasses so that they are meaningful to the

data you're representing. In my case, the LessThan() and Equal() methods compare

documents based on their specific Verona index number. Also amongst the methods of

IADoc you'll have to override are Store(), Restore() and DeepCopy(). These methods

handle converting the object into a data stream, creating an object based on a data

stream, and making a complete and independent copy of an object. These functions are

nicely explained in the AIAT documentation, except for the DeepCopying() and

Restoring() methods, which are used to construct the superclasses of your document

class in a DeepCopy or Restore situation. The only place these functions seem to be

documented is within the header file of IADoc itself. This may be a perfect time to go

back and browse the AIAT headers again.
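
As a rough illustration of what these overrides amount to for a record-based document, consider the sketch below. It makes several assumptions that differ from the real classes: SketchDoc is a hypothetical stand-in for CVeronaDoc, a plain byte vector takes the place of IAOutputBlock and IAInputBlock, Restore() is a static function instead of a method called on a prototype, and the four-byte big-endian layout is my own choice, not AIAT's storage format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for CVeronaDoc, reduced to the record reference.
class SketchDoc {
public:
    explicit SketchDoc(std::uint32_t recRef) : mRecRef(recRef) {}

    // Ordering and equality follow the record number, so the sequence of
    // documents is the same every time the Corpus is walked.
    bool LessThan(const SketchDoc& other) const { return mRecRef < other.mRecRef; }
    bool Equal(const SketchDoc& other) const { return mRecRef == other.mRecRef; }

    // Store: convert the object into a data stream (four bytes, big-endian).
    void Store(std::vector<std::uint8_t>& out) const {
        for (int shift = 24; shift >= 0; shift -= 8)
            out.push_back(static_cast<std::uint8_t>((mRecRef >> shift) & 0xFF));
    }

    // Restore: create an object from a data stream written by Store().
    static std::unique_ptr<SketchDoc> Restore(const std::vector<std::uint8_t>& in,
                                              std::size_t offset = 0) {
        assert(in.size() >= offset + 4);
        std::uint32_t ref = 0;
        for (int i = 0; i < 4; ++i) ref = (ref << 8) | in[offset + i];
        return std::unique_ptr<SketchDoc>(new SketchDoc(ref));
    }

    // DeepCopy: a complete and independent copy of the object.
    std::unique_ptr<SketchDoc> DeepCopy() const {
        return std::unique_ptr<SketchDoc>(new SketchDoc(mRecRef));
    }

    std::uint32_t GetRecRef() const { return mRecRef; }

private:
    std::uint32_t mRecRef;
};
```

The property worth checking in your own subclass is the round trip: a document that is stored and then restored, or deep-copied, should compare Equal() to the original.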

Now we have a complete Corpus structure for AIAT to handle Verona Databases. A bit of

implementation here and there, and we're ready to start indexing and querying the

data. However, before moving on to the next section, it is important to keep in mind

that there is a lot of possible functionality within the Corpus classes that I have not
