September 96 - The Speech Recognition Manager Revealed
Matt Pallakoff and Arlo Reeves
As any Star Trek fan knows, the computer of the future will talk
and listen. Macintosh computers have already been talking for a
decade, using speech synthesis technologies such as MacinTalk or
the Speech Synthesis Manager. Now any Power Macintosh
application can use Apple's new Speech Recognition Manager to
recognize and respond to spoken commands as well. We'll show you
how easy it is to add speech recognition to your application.
Speech recognition technology has improved significantly in the last few years. It may
still be a long while before you'll be able to carry on arbitrary conversations with
your computer. But if you understand the capabilities and limitations of the new
Speech Recognition Manager, you'll find it easy to create speech recognition
applications that are fast, accurate, and robust.
With code samples from a simple speech recognition application, SRSample, this
article shows you how to quickly get started using the Speech Recognition Manager.
You'll also get some tips on how to make your application's use of speech recognition
compelling, intuitive, and reliable. For everything you need in order to use the Speech
Recognition Manager in your application (including SRSample and detailed
documentation), see this issue's CD or Apple's speech technology Web site.
WHAT THE SPEECH RECOGNITION MANAGER CAN AND CANNOT DO
The Speech Recognition Manager consists of an API and a recognition engine. Under
System 7.5, these are packaged together in version 1.5 or later of the Speech
Recognition extension. (This packaging may change in future OS versions.)
The Speech Recognition Manager runs only on Power Macintosh computers with
16-bit sound input. Speech recognition is simply too computation-intensive to run
well on most 680x0 systems. The installed base of Power Macs is growing by about
five million a year, however, so plenty of machines -- including the latest
PowerPC(TM) processor-based PowerBooks -- can run speech recognition.
The current version of the Speech Recognition Manager has the following capabilities
and limitations:
• It's speaker independent, meaning that users don't need to train it before
they can use it.
• It recognizes continuous speech, so users can speak naturally, without
... pausing ... between ... words.
• It's designed for North American adult speakers of English. It's not
localized yet, and in general it won't work as well for children.
• It supports command-and-control recognition, not dictation. It works
well when your application asks it to listen for at most a few dozen phrases at
a time; however, it can't recognize arbitrary sentences and its accuracy
decreases substantially if the number of utterances it's asked to listen for
grows too large. For example, it won't accurately recognize one name out of a
list of five thousand names.
OVERVIEW OF THE SPEECH RECOGNITION MANAGER API
To use the Speech Recognition Manager, you must first open a recognition system,
which loads and initializes the recognition toolbox. You then allocate a recognizer,
which listens to a speech source for sound input. A recognizer might also display a
feedback window that shows the user when to speak and what the recognizer thinks was
said.
To define the spoken utterances that the recognizer should listen for, you build a
language model and pass it to the recognizer. A language model is a flexible network of
words and phrases that defines a large number of possible utterances in a compact and
efficient way. The Speech Recognition Manager lets your application rapidly change the
active language model, so that at different times your application can listen for
different things.
After the recognizer is told to start listening, it sends your application a recognition
result whenever it hears the user speak an utterance contained in the current language
model. A recognition result contains the part of the language model that was recognized
and is typically sent to your application via Apple events. (Alternatively, you can
request notification using callbacks if you cannot support Apple events.) Your
application then processes the recognition result to examine what the user said and
responds appropriately.
Figure 1 shows how the Speech Recognition Manager works. Note that the telephone
speech source is not supported in version 1.5 of the Speech Recognition extension.
Figure 1. How the Speech Recognition Manager works
SPEECH OBJECTS
The recognition system, recognizer, speech source, language models, and recognition
results are all objects belonging to classes derived from the SRSpeechObject class, in
accordance with object-oriented design principles. These and other objects are
arranged into the class hierarchy shown in Figure 2. The class hierarchy gives the
Speech Recognition Manager API the flexibility of polymorphism. For example, you
can call the routine SRReleaseObject to dispose of any SRSpeechObject.
Figure 2. The speech object class hierarchy
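The idea behind this polymorphism can be sketched in plain C (this is a hypothetical illustration of the design, not the Speech Recognition Manager's actual implementation): every object begins with a common header, so a single routine can accept any speech object.

```c
#include <string.h>

/* Hypothetical sketch: every object starts with a common header that
   identifies its class, so one routine can operate on any of them,
   just as SRReleaseObject accepts any SRSpeechObject. */
typedef enum { kWordClass, kPhraseClass, kLanguageModelClass } ObjectClass;

typedef struct {
    ObjectClass objectClass;   /* common header; first in every object */
} SpeechObjectHeader;

typedef struct {
    SpeechObjectHeader header;
    char               spelling[32];
} Word;

/* A single routine that works on any speech object. */
const char *ObjectClassName(const void *anyObject)
{
    switch (((const SpeechObjectHeader *)anyObject)->objectClass) {
        case kWordClass:          return "SRWord";
        case kPhraseClass:        return "SRPhrase";
        case kLanguageModelClass: return "SRLanguageModel";
    }
    return "unknown";
}
```

The same header-first layout is how many C toolboxes achieve single-inheritance polymorphism without C++.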
The most important speech objects are as follows:
• SRRecognitionSystem -- An application typically opens one of these at
startup (by calling SROpenRecognitionSystem) and closes it at shutdown (by
calling SRCloseRecognitionSystem). Applications allocate other kinds of
objects by calling routines like SRNewWord, which typically take the
SRRecognitionSystem object as their first argument.
• SRRecognizer -- An application gets an SRRecognizer from an
SRRecognitionSystem by calling SRNewRecognizer. The SRRecognizer does the
work of recognizing utterances and sending recognition results back to the
application. It begins doing this whenever the application calls
SRStartListening and stops whenever the application calls SRStopListening.
• SRLanguageModel, SRPath, SRPhrase, SRWord -- An application builds
its language models from these object types, which are all subclasses of
SRLanguageObject. (A phrase is a sequence of one or more words, and a path is
a sequence of words, phrases, and language models.) A language model, in turn,
describes what a user can say at any given moment. For example, if an
application displayed ten animals and wanted to allow the user to say any of the
animals' names, it might build a language model containing ten phrases, each
corresponding to an animal's name.
• SRRecognitionResult -- When an utterance is recognized, an
SRRecognitionResult object is sent (using either an Apple event or a callback
routine, whichever the application prefers) to the application that was
listening for that utterance. The SRRecognitionResult object describes what
was recognized. An application can then look at the result in several forms: as
text, as SRWords and SRPhrases, or as an SRLanguageModel, which can assist
in quickly interpreting the uttered phrase.
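At its simplest, a language model is just the set of utterances the recognizer will accept, like the ten animal names in the example above. The concept can be sketched in plain C (a hypothetical illustration only; the real API builds SRPhrase objects inside an SRLanguageModel rather than matching strings):

```c
#include <string.h>

/* Hypothetical sketch: the simplest language model is a list of
   acceptable phrases; the recognizer reports which one it heard. */
static const char *kAnimalPhrases[] = {
    "cat", "dog", "elephant", "giraffe", "hippopotamus",
    "kangaroo", "lion", "monkey", "penguin", "zebra"
};
enum { kAnimalCount = sizeof(kAnimalPhrases) / sizeof(kAnimalPhrases[0]) };

/* Returns the index of the recognized phrase, or -1 if the
   utterance isn't in the language model. */
int RecognizeUtterance(const char *utterance)
{
    int i;
    for (i = 0; i < kAnimalCount; i++)
        if (strcmp(utterance, kAnimalPhrases[i]) == 0)
            return i;
    return -1;
}
```

Keeping the set of phrases this small is exactly what makes command-and-control recognition fast and accurate.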
Each class of speech object has a number of properties that define how the objects
behave. For example, all descendants of SRLanguageObject have a kSRSpelling property
that shows how they're spelled. Your application uses the SRSetProperty and
SRGetProperty routines to set and get the various properties of each of these objects.
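The calling convention these routines use (a property selector plus an untyped buffer and its size) can be sketched in plain C. The names and error codes below are hypothetical stand-ins, not the real API:

```c
#include <string.h>

/* Hypothetical sketch of the SRSetProperty/SRGetProperty convention:
   the caller names a property with a selector and passes an untyped
   buffer plus its size. (Illustration only; not the real API.) */
enum { kSpellingProperty = 1 };

typedef struct {
    char spelling[64];
} DemoObject;

/* Returns 0 on success, -1 on failure, mirroring OSErr-style results. */
int DemoSetProperty(DemoObject *obj, int selector,
                    const void *data, unsigned long size)
{
    if (selector != kSpellingProperty || size >= sizeof(obj->spelling))
        return -1;
    memcpy(obj->spelling, data, size);
    obj->spelling[size] = '\0';
    return 0;
}

int DemoGetProperty(const DemoObject *obj, int selector,
                    void *buffer, unsigned long bufferSize)
{
    if (selector != kSpellingProperty ||
            bufferSize <= strlen(obj->spelling))
        return -1;
    strcpy((char *)buffer, obj->spelling);
    return 0;
}
```

Passing the size explicitly lets one pair of routines serve properties of any type, which is why the real API can cover everything from spellings to listening modes.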
RELEASING OBJECT REFERENCES
You create objects by calling routines like SRNewRecognizer and SRNewWord. When
you've finished using them, you dispose of them by calling SRReleaseObject. You can
also acquire references to existing objects by calling routines like SRGetIndexedItem
(for example, to get the second word in a phrase of several words). The Speech
Recognition Manager maintains a reference count for each object. An object's reference
count is incremented by SRNew... and SRGet... calls, and is decremented by calls to
SRReleaseObject. An object gets disposed of only when its reference count is
decremented to 0. Thus, to avoid memory leaks, your application must balance every
SRNew... or SRGet... call with a call to SRReleaseObject.
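The balancing rule can be sketched in plain C (a hypothetical illustration; applications never see the Speech Recognition Manager's internal counts):

```c
/* Hypothetical sketch of the reference-counting rule: SRNew... and
   SRGet... calls increment an object's count, SRReleaseObject
   decrements it, and the object is disposed of only at zero. */
typedef struct {
    int refCount;
} RefCounted;

/* Stands in for any SRNew... or SRGet... call. */
void Retain(RefCounted *obj)
{
    obj->refCount++;
}

/* Stands in for SRReleaseObject; returns the remaining count,
   where 0 means the object would now be disposed of. */
int Release(RefCounted *obj)
{
    obj->refCount--;
    return obj->refCount;
}
```

For example, if SRNewWord creates a word and SRGetIndexedItem later hands out a second reference to it, the application owes two SRReleaseObject calls before the word's memory is actually freed.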
A SIMPLE SPEECH RECOGNITION EXAMPLE
It's easy to add simple speech recognition capabilities to your application. All you need
to do is perform a small number of operations in sequence:
• Initialize speech recognition by determining whether a valid version of
the Speech Recognition Manager is installed, opening an SRRecognitionSystem,
allocating an SRRecognizer, and installing an Apple event handler to handle
recognition result notifications.
• Build a language model that specifies the utterances your application is
listening for.
• Set the recognizer's active language model to the one you built and call
SRStartListening to start listening and processing recognition result
notifications.
We'll describe each of these operations in more detail.
INITIALIZING SPEECH RECOGNITION
First, you must verify that a valid version of the Speech Recognition Manager is
installed on the target machine. Listing 1 shows how to do this. Note that only versions
1.5 and later of the Speech Recognition Manager adhere to the API used in this article.
______________________________
Listing 1. Determining the Speech Recognition Manager version
Boolean HasValidSpeechRecognitionVersion (void)
{
    OSErr       status;
    long        theVersion;
    Boolean     validVersion = false;
    const unsigned long kMinimumRequiredSRMVersion = 0x00000150;

    status = Gestalt(gestaltSpeechRecognitionVersion, &theVersion);
    if (!status)
        if (theVersion >= kMinimumRequiredSRMVersion)
            validVersion = true;

    return validVersion;
}
______________________________
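Gestalt reports the version as a packed binary-coded-decimal value, which is why version 1.5 appears as 0x00000150 and an ordinary numeric comparison suffices. A standalone sketch of that comparison (Gestalt itself is replaced by parameters here):

```c
/* Standalone sketch of the packed-version logic in Listing 1.
   The version is packed binary-coded decimal: 1.5 is 0x00000150. */

/* Decode the human-readable major version from the packed value. */
int MajorVersion(long packedVersion)
{
    int bcd = (int)((packedVersion >> 8) & 0xFF);
    return (bcd >> 4) * 10 + (bcd & 0x0F);
}

/* Decode the minor version digit. */
int MinorVersion(long packedVersion)
{
    return (int)((packedVersion >> 4) & 0x0F);
}

/* The same test HasValidSpeechRecognitionVersion performs. */
int VersionIsAtLeast(long theVersion, unsigned long minimumVersion)
{
    return (unsigned long)theVersion >= minimumVersion;
}
```

Because the digits are packed most significant first, later versions always compare numerically greater, so no decoding is needed just to check a minimum.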
Listing 2. Initializing the Speech Recognition Manager
/* Our global variables */
SRRecognitionSystem  gRecognitionSystem = NULL;
SRRecognizer         gRecognizer = NULL;
SRLanguageModel      gTopLanguageModel = NULL;
AEEventHandlerUPP    gAERoutineDescriptor = NULL;

OSErr InitSpeechRecognition (void)
{
    OSErr status = kBadSRMVersion;

    /* Ensure that the Speech Recognition Manager is available. */
    if (HasValidSpeechRecognitionVersion()) {
        /* Open the default recognition system. */
        status = SROpenRecognitionSystem(&gRecognitionSystem,
                    kSRDefaultRecognitionSystemID);

        /* Use standard feedback window and listening modes. */
        if (!status) {
            short feedbackNeeded = kSRHasFeedbackHasListenModes;

            status = SRSetProperty(gRecognitionSystem,
                        kSRFeedbackAndListeningModes, &feedbackNeeded,
                        sizeof(feedbackNeeded));
        }

        /* Create a new recognizer. */
        if (!status)
            status = SRNewRecognizer(gRecognitionSystem, &gRecognizer,
                        kSRDefaultSpeechSource);

        /* Install our Apple event handler for recognition results. */
        if (!status) {
            status = memFullErr;
            gAERoutineDescriptor =
                NewAEEventHandlerProc(HandleRecognitionDoneAE);
            if (gAERoutineDescriptor)
                status = AEInstallEventHandler(kAESpeechSuite,
                            kAESpeechDone, gAERoutineDescriptor, 0,
                            false);
        }
    }
    return status;
}
______________________________
Listing 2 shows how to open an SRRecognitionSystem, allocate an SRRecognizer, and
install your Apple event handler. All of this happens when your application starts up.
The Apple event handler HandleRecognitionDoneAE is shown later (in Listing 4).
Notice in Listing 2 how we call SRSetProperty to request Apple's standard feedback and
listening modes for the recognizer. To have a successful experience with speech
recognition, users need good feedback indicating when the recognizer is ready for them
to talk and what utterances the recognizer has recognized (for more on giving
feedback, see "Speech Recognition Tips"). In addition, because of the recognizer's
tendency to misinterpret background conversation and noises as speech, it's usually a
good idea to let the user tell the recognizer when to listen by pressing a predefined key
(the "push-to-talk" key). Your application can get all of this important behavior for
free, simply by setting the kSRFeedbackAndListeningModes property.
______________________________
SPEECH RECOGNITION TIPS
Speech recognition is a completely new input mode, and using it properly isn't
always as straightforward as it might seem. While we don't yet have a
complete set of human interface guidelines to guarantee a consistent and
intuitive speech recognition user experience, there are a few simple rules
that all speech recognition applications should follow.
GIVE FEEDBACK
Your application must always provide feedback to let users know when they
can speak, when their utterance has been recognized, and how it was
interpreted. The feedback services in the Speech Recognition Manager perform
this for you, using the standard feedback window shown in Figure 3. (The
user's recognized utterances are shown in italics, and the displayed feedback is
in plain text. The string under the feedback character's face indicates the
push-to-talk key.) All you need to do is set the
kSRFeedbackAndListeningModes property as shown in Listing 2.
Figure 3. Standard feedback window
Your application should use this standard feedback behavior unless you have a
very good reason to provide your own feedback and custom push-to-talk
options. (Fast action games that take over the entire screen and don't call
WaitNextEvent are examples of applications that wouldn't use the standard
feedback.) Not only will users enjoy the benefits of consistent behavior, but as
Apple improves the feedback components, your speech recognition applications
will automatically inherit this improved behavior without having to be