Using AIAT
Volume Number: 14
Issue Number: 4
Column Tag: Emerging Technologies
Using Apple Information Access Toolkit
by Mark Holtz
Apple's new indexing technology provides powerful ways to
search and retrieve your data, regardless of how it's
stored
Exactly What is AIAT, and What Can It Do For You?
The Apple Information Access Toolkit (AIAT) is a library of routines designed to
distill, index, and query collections of textual data. The elegance of the technology
stems from its complete independence from the actual data source it is indexing. It can
be used in a fully-interfaced, stand-alone application, or in a memory-constrained
plug-in. You can feed it everything from your next-generation database to a catalog of
your Compact Disc collection. Best of all, rumor has it that AIAT will be a part of
Rhapsody, so your hard work now will pay off in the future.
Sounds great! How do I use it?
If you haven't guessed by now, you're going to have to get your hands dirty and write
some code. AIAT provides a set of C++ objects that can be subclassed to provide the
desired functionality. Strange runtime quirks notwithstanding, the overall process is
actually easier than it may first appear. In a recent development project, I created an
AIAT-based plug-in that indexed and queried data from a third-party database in less
than two days. This article will focus on that development project and some of the
lessons I learned. The most important lesson was that AIAT will do a lot of the work for
you, but you have to tell it exactly what you want.
AIAT: A Technology Overview
Upon unwrapping the AIAT package, you will find a nice, clean set of components to
speed you on your way. Documentation is provided in Adobe PDF format, and is
relatively comprehensive in scope, but occasionally lacks the necessary depth to help
you fully understand a particular object. Next comes a set of 68K and PowerPC libraries for
Metrowerks' CodeWarrior 11 and Pro 1. Where there are libraries, there are also
headers. AIAT's headers comprise what I consider to be the missing portions of the
documentation. Not only are they slightly more current than the documentation, but
they also provide the necessary implementation details one needs to avoid MacsBug. The
C++ classes in AIAT are robust, but you only need to deal with a small number of
methods to create a working application. Finally, there are two example applications,
one that indexes files in a folder and another that lets you query against the generated
indices. They are fairly complete in their coverage of basic AIAT concepts from a client
perspective, but give only marginal insight into how new data sources can be
interfaced to AIAT.
The AIAT documentation stresses thorough design and analysis. Having a clear picture
of which objects deal with which data and where that data goes will make your coding
efforts much more enjoyable. Now that you've done that, let's get down to architecture.
AIAT functionality is broken up into six major categories: Index, Analysis, Corpus,
Accessor, Storage, and Storable. The Index classes handle creation and maintenance of
keyword lists and document references. The Analysis classes are responsible for
generating and filtering keywords based on various criteria. The Corpus classes are an
abstraction layer for obtaining data from various sources. The Accessor classes
provide query and statistical information from indices. Finally, the Storage and
Storable classes provide a flexible mechanism for the storage and retrieval of
arbitrary sets of data. A quick browse through the headers shows an abstract C++
object for each of these categories, such as the class IACorpus.
You will also find various utility functions, including a set of memory management
calls. You can substitute your own allocator and deallocator routines easily, and AIAT
will happily use them for block and object allocations. There are a few robust concrete
subclasses provided, which you can use directly or subclass to enhance their
functionality. The HFSCorpus and HFSTextFolderCorpus classes allow AIAT to index
collections of text documents. The EnglishAnalysis class allows AIAT to filter document
keywords with arbitrary stop-word and word-stem lists. AIAT makes use of its own
exception handling mechanism, based on C++ exceptions with a few additions. A little
exploration through all the components of the AIAT distribution is time well spent, as
there are a number of classes and functions not specifically called out in the
documentation.
The Task at Hand
I was first introduced to AIAT early in 1997 during a conversation with a good friend
of mine. He suggested that I look into it as a possible full-text indexing solution for a
Mac-based Web site we were creating. As the months passed, we settled on Purity
Software's WebSiphon product as the back-end scripting engine for this Web site.
WebSiphon includes a fast flat-file database called Verona. It had basic search
capabilities, but they were not robust enough to support the kind of ranked queries
you'd find on more powerful full-text indexing systems. WebSiphon's language is
extensible via Code Fragment libraries, so the idea for an AIAT library for WebSiphon
became an appealing solution. We could collect information from a variety of sources
with WebSiphon, store it in Verona, index it using AIAT, and then query against that
data using WebSiphon scripts. Since AIAT is simply a set of libraries with little
runtime environment dependency, the task was not daunting. The project was distilled
into a few discrete tasks: create the scripting interface for WebSiphon, write the C++
code to interface with the AIAT accessor and index classes, write the C++ code for
interfacing AIAT to Verona, and make the entire thing work in a multi-threaded
environment. Each phase of the project involved one of AIAT's major areas of
functionality, so I was able to focus on one set of concepts at a time, a tribute to AIAT's
modular design.
Habeas Corpus
The first task to tackle was to provide AIAT with an interface to Verona so it could
access the reams of data our site would produce. This involved getting familiar with
AIAT's Corpus classes. In AIAT, the Corpus provides access to a collection of
"documents" and the data they contain. The IADoc class is the abstract representation
for these documents, and it consists of a name for the document and access methods for
the data. In this case, our documents were records in the Verona database, and the data
would be accessed via an API to the Verona application instead of reading it directly
from a file. After a little bit of work with the Code Fragment Manager, I had a
functional C++ interface to Verona, so it was time to tell AIAT how to deal with it.
First, I created the CVeronaCorpus class, and defined the two pure virtual methods
required to make it work: GetProtoDoc and GetDocText.
Listing 1: CVeronaCorpus.h
CVeronaCorpus
Class definition and required methods for our Corpus interface to a Verona database.
class CVeronaCorpus : public IACorpus
{
public:
    CVeronaCorpus(CVeronaGlue* inVeronaGlue,
                  const char* inDBName);
    virtual ~CVeronaCorpus();
    virtual IADoc* GetProtoDoc();
    virtual IADocText* GetDocText(const IADoc* doc);
    virtual IADocIterator* GetDocIterator();

    .
    .
    .
};

// Returns a prototype ("sample") document for this Corpus.
IADoc* CVeronaCorpus::GetProtoDoc()
{
    return new CVeronaDoc(this, 0);
}

// Returns an object that supplies the text of the given document.
IADocText* CVeronaCorpus::GetDocText(const IADoc* doc)
{
    return new CVeronaDocText(this, (CVeronaDoc*) doc);
}

// Returns an object that walks the documents in this Corpus.
IADocIterator* CVeronaCorpus::GetDocIterator()
{
    return new CVeronaDocIterator(this);
}
Phew! That wasn't so bad. GetProtoDoc() returns a new CVeronaDoc object, and
GetDocText() returns a new CVeronaDocText object. But what are these objects?
CVeronaDoc is a subclass of IADoc, AIAT's abstract representation of a document within
a Corpus. CVeronaDocText is a subclass of IADocText, which is responsible for
providing the actual text of the document to AIAT's indexing functions. The third
method, which is not required by AIAT to make a valid Corpus mechanism, is
GetDocIterator(). The IADocIterator class is used to implement a Corpus that consists
of multiple documents, and for providing each of those documents to AIAT in a
consistent, ordered fashion. You may notice as you peruse the AIAT documentation that
many functions deal with IADoc objects as a fundamental unit of data. It is up to the Corpus to
determine what that unit of data is and how to return it to AIAT when it's requested.
Here are the definitions of the other Corpus subclasses I created:
Listing 2: CVeronaCorpus.h (cont'd.)
CVeronaDoc, CVeronaDocText, CVeronaDocIterator
Classes for representing the various sub-elements of CVeronaCorpus.
class CVeronaDocIterator : public IADocIterator
{
public:
    CVeronaDocIterator(CVeronaCorpus* inCorpus);
    virtual ~CVeronaDocIterator();

    virtual IADoc* GetNextDoc();

private:
    CVeronaCorpus* mCorpus;
    unsigned long mCurrentIndex;
};

class CVeronaDoc : public IADoc
{
public:
    CVeronaDoc(CVeronaCorpus* inCorpus,
               unsigned long inRecRef);
    virtual ~CVeronaDoc();

    // Copying, and conversion to and from a data stream
    IAStorable* DeepCopy() const;
    IABlockSize StoreSize() const;
    void Store(IAOutputBlock* output) const;
    IAStorable* Restore(IAInputBlock* input) const;

    // Ordering of documents within the index
    bool LessThan(const IAOrderedStorable* neighbor) const;
    bool Equal(const IAOrderedStorable* neighbor) const;

    virtual byte* GetName(uint32 *length) const;
    unsigned long GetRecRef(void) { return mRecRef; }

protected:
    virtual void DeepCopying(const IAStorable* source);
    virtual void Restoring(IAInputBlock* input,
                           const IAStorable* proto);
    .
    .
    .

private:
    CVeronaCorpus* mCorpus;
    unsigned long mRecRef;
};

class CVeronaDocText : public IADocText
{
public:
    CVeronaDocText(CVeronaCorpus* inCorpus,
                   CVeronaDoc* inDoc);
    virtual ~CVeronaDocText();
    virtual uint32 GetNextBuffer(byte* buffer,
                                 uint32 bufferLen);
    virtual IADocText* DeepCopy() const;

protected:

private:
    CVeronaCorpus* mCorpus;
    CVeronaDoc* mDoc;
    byte* mBuffer;
    unsigned long mAmtRead;
    unsigned long mBufSize;
};
At this point, we've got all the elements for our Corpus implementation. However, it
may still be unclear how these items work together. When AIAT receives a request to
update a particular index, it starts a dialog with the Corpus object that is tied to that
index. It starts by asking, "What sort of documents do you contain?" AIAT calls
GetProtoDoc(), and the Corpus answers with a "sample" document. AIAT then asks
for an object that can iterate through all the documents in the Corpus' collection. If one
is available, it is returned via the GetDocIterator() method. Since AIAT knows nothing
about the particular data set it's indexing, the Corpus needs to provide these
mechanisms. If a document iterator is available (which is true in this case), AIAT
begins asking the iterator for successive documents in the collection.
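To make the first half of that dialog concrete, here is a minimal sketch of it. Every type in the sketch (SketchDoc, SketchCorpus, SketchDocIterator) and the WalkCorpus() driver is a hypothetical stand-in rather than an AIAT class, and the convention that GetNextDoc() returns a null pointer at the end of the collection is an assumption; check the AIAT headers for the real contract.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Every type here is a hypothetical stand-in, NOT a real AIAT class.
struct SketchDoc {
    unsigned long recRef;  // record reference in the backing store
};

class SketchCorpus;

class SketchDocIterator {
public:
    explicit SketchDocIterator(const SketchCorpus* corpus)
        : mCorpus(corpus), mCurrentIndex(0) { assert(corpus != nullptr); }

    // Returns a new document the caller owns, or a null pointer when the
    // collection is exhausted (an assumed convention for this sketch).
    SketchDoc* GetNextDoc();

private:
    const SketchCorpus* mCorpus;
    std::size_t mCurrentIndex;  // position of the next record to hand out
};

class SketchCorpus {
public:
    explicit SketchCorpus(const std::vector<unsigned long>& recRefs)
        : mRecRefs(recRefs) {}

    SketchDoc* GetProtoDoc() const { return new SketchDoc{0}; }
    SketchDocIterator* GetDocIterator() const { return new SketchDocIterator(this); }

    std::size_t Count() const { return mRecRefs.size(); }
    unsigned long RecRefAt(std::size_t i) const { return mRecRefs.at(i); }

private:
    std::vector<unsigned long> mRecRefs;  // fixed order, so every pass matches
};

SketchDoc* SketchDocIterator::GetNextDoc() {
    if (mCurrentIndex >= mCorpus->Count()) return nullptr;
    return new SketchDoc{ mCorpus->RecRefAt(mCurrentIndex++) };
}

// Plays the indexer's role and returns the record references in visit order.
std::vector<unsigned long> WalkCorpus(const SketchCorpus& corpus) {
    std::vector<unsigned long> visited;

    SketchDoc* proto = corpus.GetProtoDoc();  // "What sort of documents do you contain?"
    delete proto;                             // the sketch only shows that the step happens

    SketchDocIterator* it = corpus.GetDocIterator();
    while (SketchDoc* doc = it->GetNextDoc()) {
        visited.push_back(doc->recRef);
        delete doc;
    }
    delete it;
    return visited;
}
```

Walking the same SketchCorpus twice yields the same sequence both times, which is the property the indexer depends on.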
AIAT makes two assumptions about documents that you must keep in mind. First, all
documents in a particular Corpus are of the same type (CVeronaDoc, in this case). Second,
the order of the documents is always the same for a particular set of documents.
The latter is important because AIAT uses the document sequence for the indexing
mechanism. Hence, the notion of a document being "less than" another document really
has to do with its order in this sequence. As long as you are consistent for whatever kind of
data you're delivering, this should not be a problem. Getting back to the dialog,
AIAT now has an IADoc it can work with. It starts by asking for an
IADocText object that contains the document's data. In our example, the
CVeronaDocText class knows how to access the text in a Verona database record, so
GetDocText() hands a fresh CVeronaDocText object back to AIAT. This object can
access its parent CVeronaDoc object, and uses that link to obtain the record number in
the database that the CVeronaDoc represents.
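The ordering assumption is easy to satisfy when each document wraps a record reference. The sketch below shows one way to do it, assuming the record reference alone defines the sequence. SketchDoc and SortDocs() are hypothetical names, not AIAT classes, and the real methods take an IAOrderedStorable pointer rather than the simplified type used here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a record-backed document; NOT the real IADoc.
class SketchDoc {
public:
    explicit SketchDoc(unsigned long recRef) : mRecRef(recRef) {}

    unsigned long GetRecRef() const { return mRecRef; }

    // A document precedes another when its record reference is smaller.
    bool LessThan(const SketchDoc* neighbor) const {
        assert(neighbor != nullptr);
        return mRecRef < neighbor->mRecRef;
    }

    // Two documents are equal when they refer to the same record.
    bool Equal(const SketchDoc* neighbor) const {
        assert(neighbor != nullptr);
        return mRecRef == neighbor->mRecRef;
    }

private:
    unsigned long mRecRef;
};

// Puts documents into the sequence implied by LessThan().
void SortDocs(std::vector<SketchDoc>& docs) {
    std::sort(docs.begin(), docs.end(),
              [](const SketchDoc& a, const SketchDoc& b) { return a.LessThan(&b); });
}
```

Because the comparison depends only on the record reference, the same set of records always sorts into the same sequence, no matter what order they were collected in.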
Finally, it is time for AIAT to get the text of the document. It does this by calling
CVeronaDocText's GetNextBuffer() method. This method fills the supplied buffer with up to
bufferLen bytes from the document. Note that the CVeronaDocText object must maintain its own
information about what data AIAT has already requested. It may help to think of
CVeronaDocText as a one-way stream of data that is read in chunks of arbitrary size.
AIAT will continue to call GetNextBuffer() until the method returns zero, indicating
the end of the data.
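The one-way stream idea can be sketched in a few lines. SketchDocText and ReadAll() below are hypothetical stand-ins (the real class reads from a Verona record and uses AIAT's byte and uint32 types); the sketch assumes the whole record text is already in memory and that the return value is the number of bytes copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a document text stream; NOT the real IADocText.
class SketchDocText {
public:
    explicit SketchDocText(const std::string& recordText)
        : mText(recordText), mAmtRead(0) {}

    // Copies at most bufferLen bytes into buffer, advances the read position,
    // and returns the number of bytes copied. Zero means end of data.
    std::size_t GetNextBuffer(char* buffer, std::size_t bufferLen) {
        assert(buffer != nullptr);
        std::size_t remaining = mText.size() - mAmtRead;
        std::size_t n = remaining < bufferLen ? remaining : bufferLen;
        std::memcpy(buffer, mText.data() + mAmtRead, n);
        mAmtRead += n;
        return n;
    }

private:
    std::string mText;     // the record's text
    std::size_t mAmtRead;  // how much the caller has already consumed
};

// Plays the caller's role: drains the stream in chunks of chunkLen bytes.
std::string ReadAll(SketchDocText& text, std::size_t chunkLen) {
    assert(chunkLen > 0 && chunkLen <= 64);
    char chunk[64];
    std::string whole;
    std::size_t n;
    while ((n = text.GetNextBuffer(chunk, chunkLen)) > 0) {
        whole.append(chunk, n);
    }
    return whole;
}
```

The caller never tells the stream where to resume. The mAmtRead member is the only record of progress, which is why the document text object has to keep that state itself.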
AIAT will continue to call the iterator and resulting document and document text
objects to obtain the complete set of data in the Corpus' collection. The other methods of
IADoc are used to determine how a particular document should be placed in the index. It
is necessary to override these in your subclasses so that they are meaningful to the
data you're representing. In my case, the LessThan() and Equal() methods compare
documents based on their specific Verona index number. Also amongst the methods of
IADoc you'll have to override are Store(), Restore() and DeepCopy(). These methods
handle converting the object into a data stream, creating an object based on a data
stream, and making a complete and independent copy of an object. These functions are
nicely explained in the AIAT documentation, except for the DeepCopying() and
Restoring() methods, which are used to construct the superclasses of your document
class in a DeepCopy or Restore situation. The only place these functions seem to be
documented is within the header file of IADoc itself. This may be a perfect time to go
back and browse the AIAT headers again.
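As a rough illustration of what Store(), Restore() and DeepCopy() have to accomplish for a document that wraps a record reference, consider the sketch below. It does not use the AIAT block classes: SketchBlock is a plain byte vector standing in for the output and input blocks, SketchDoc is a hypothetical class, and the four-byte big-endian layout is an assumption chosen for the example. The DeepCopying() and Restoring() hooks are left out, since their exact contract lives in the IADoc header.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<unsigned char> SketchBlock;  // stand-in for a storage block

// Hypothetical stand-in for a storable document; NOT the real IADoc.
class SketchDoc {
public:
    explicit SketchDoc(unsigned long recRef) : mRecRef(recRef) {}

    unsigned long GetRecRef() const { return mRecRef; }

    // Number of bytes Store() appends: a fixed four-byte record reference.
    std::size_t StoreSize() const { return 4; }

    // Appends the record reference to the block in big-endian byte order.
    void Store(SketchBlock* output) const {
        assert(output != nullptr);
        assert(mRecRef <= 0xFFFFFFFFUL);  // must fit the four-byte layout
        for (int shift = 24; shift >= 0; shift -= 8) {
            output->push_back(static_cast<unsigned char>((mRecRef >> shift) & 0xFF));
        }
    }

    // Builds a new document from the bytes at *offset and advances the offset.
    // It is const because it creates a new object instead of changing this one.
    SketchDoc* Restore(const SketchBlock& input, std::size_t* offset) const {
        assert(offset != nullptr && *offset + 4 <= input.size());
        unsigned long recRef = 0;
        for (int i = 0; i < 4; ++i) {
            recRef = (recRef << 8) | input[(*offset)++];
        }
        return new SketchDoc(recRef);
    }

    // Returns an independent copy that shares nothing with the original.
    SketchDoc* DeepCopy() const { return new SketchDoc(mRecRef); }

private:
    unsigned long mRecRef;
};
```

A round trip through Store() and Restore() should yield a document that compares equal to the original, which makes a handy sanity check when writing the real overrides.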
Now we have a complete Corpus structure for AIAT to handle Verona databases. A bit of
implementation here and there, and we're ready to start indexing and querying the
data. However, before moving on to the next section, it is important to keep in mind
that there is a lot of possible functionality within the Corpus classes that I have not