September 96 - The Speech Recognition Manager Revealed
Matt Pallakoff and Arlo Reeves
As any Star Trek fan knows, the computer of the future will talk
and listen. Macintosh computers have already been talking for a
decade, using speech synthesis technologies such as MacinTalk or
the Speech Synthesis Manager. Now any Power Macintosh
application can use Apple's new Speech Recognition Manager to
recognize and respond to spoken commands as well. We'll show you
how easy it is to add speech recognition to your application.
Speech recognition technology has improved significantly in the last few years. It may
still be a long while before you'll be able to carry on arbitrary conversations with
your computer. But if you understand the capabilities and limitations of the new
Speech Recognition Manager, you'll find it easy to create speech recognition
applications that are fast, accurate, and robust.
With code samples from a simple speech recognition application, SRSample, this
article shows you how to quickly get started using the Speech Recognition Manager.
You'll also get some tips on how to make your application's use of speech recognition
compelling, intuitive, and reliable. For everything you need in order to use the Speech
Recognition Manager in your application (including SRSample and detailed
documentation), see this issue's CD or Apple's speech technology Web site.
WHAT THE SPEECH RECOGNITION MANAGER CAN AND CANNOT DO
The Speech Recognition Manager consists of an API and a recognition engine. Under
System 7.5, these are packaged together in version 1.5 or later of the Speech
Recognition extension. (This packaging may change in future OS versions.)
The Speech Recognition Manager runs only on Power Macintosh computers with
16-bit sound input. Speech recognition is simply too computation-intensive to run
well on most 680x0 systems. The installed base of Power Macs is growing by about
five million a year, however, so plenty of machines -- including the latest
PowerPC(TM) processor-based PowerBooks -- can run speech recognition.
The current version of the Speech Recognition Manager has the following capabilities
and limitations:
• It's speaker independent, meaning that users don't need to train it before
they can use it.
• It recognizes continuous speech, so users can speak naturally, without
... pausing ... between ... words.
• It's designed for North American adult speakers of English. It's not
localized yet, and in general it won't work as well for children.
• It supports command-and-control recognition, not dictation. It works
well when your application asks it to listen for at most a few dozen phrases at
a time; however, it can't recognize arbitrary sentences and its accuracy
decreases substantially if the number of utterances it's asked to listen for
grows too large. For example, it won't accurately recognize one name out of a
list of five thousand names.
OVERVIEW OF THE SPEECH RECOGNITION MANAGER API
To use the Speech Recognition Manager, you must first open a recognition system,
which loads and initializes the recognition toolbox. You then allocate a recognizer,
which listens to a speech source for sound input. A recognizer might also display a
feedback window that shows the user when to speak and what the recognizer thinks was
said.
To define the spoken utterances that the recognizer should listen for, you build a
language model and pass it to the recognizer. A language model is a flexible network of
words and phrases that defines a large number of possible utterances in a compact and
efficient way. The Speech Recognition Manager lets your application rapidly change the
active language model, so that at different times your application can listen for
different things.
After the recognizer is told to start listening, it sends your application a recognition
result whenever it hears the user speak an utterance contained in the current language
model. A recognition result contains the part of the language model that was recognized
and is typically sent to your application via Apple events. (Alternatively, you can
request notification using callbacks if you cannot support Apple events.) Your
application then processes the recognition result to examine what the user said and
responds appropriately.
Figure 1 shows how the Speech Recognition Manager works. Note that the telephone
speech source is not supported in version 1.5 of the Speech Recognition extension.
Figure 1. How the Speech Recognition Manager works
SPEECH OBJECTS
The recognition system, recognizer, speech source, language models, and recognition
results are all objects belonging to classes derived from the SRSpeechObject class, in
accordance with object-oriented design principles. These and other objects are
arranged into the class hierarchy shown in Figure 2. The class hierarchy gives the
Speech Recognition Manager API the flexibility of polymorphism. For example, you
can call the routine SRReleaseObject to dispose of any SRSpeechObject.
Figure 2. The speech object class hierarchy
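The idea behind this polymorphism can be sketched in plain C (this is a hypothetical illustration of the design, not the Speech Recognition Manager's actual implementation): every object begins with a common header, so a single routine can accept any speech object.

```c
#include <string.h>

/* Hypothetical sketch: every object starts with a common header that
   identifies its class, so one routine can operate on any of them,
   just as SRReleaseObject accepts any SRSpeechObject. */
typedef enum { kWordClass, kPhraseClass, kLanguageModelClass } ObjectClass;

typedef struct {
    ObjectClass objectClass;   /* common header; first in every object */
} SpeechObjectHeader;

typedef struct {
    SpeechObjectHeader header;
    char               spelling[32];
} Word;

/* A single routine that works on any speech object. */
const char *ObjectClassName(const void *anyObject)
{
    switch (((const SpeechObjectHeader *)anyObject)->objectClass) {
        case kWordClass:          return "SRWord";
        case kPhraseClass:        return "SRPhrase";
        case kLanguageModelClass: return "SRLanguageModel";
    }
    return "unknown";
}
```

The same header-first layout is how many C toolboxes achieve single-inheritance polymorphism without C++.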
The most important speech objects are as follows:
• SRRecognitionSystem -- An application typically opens one of these at
startup (by calling SROpenRecognitionSystem) and closes it at shutdown (by
calling SRCloseRecognitionSystem). Applications allocate other kinds of
objects by calling routines like SRNewWord, which typically take the
SRRecognitionSystem object as their first argument.
• SRRecognizer -- An application gets an SRRecognizer from an
SRRecognitionSystem by calling SRNewRecognizer. The SRRecognizer does the
work of recognizing utterances and sending recognition results back to the
application. It begins doing this whenever the application calls
SRStartListening and stops whenever the application calls SRStopListening.
• SRLanguageModel, SRPath, SRPhrase, SRWord -- An application builds
its language models from these object types, which are all subclasses of
SRLanguageObject. (A phrase is a sequence of one or more words, and a path is
a sequence of words, phrases, and language models.) A language model, in turn,
describes what a user can say at any given moment. For example, if an
application displayed ten animals and wanted to allow the user to say any of the
animals' names, it might build a language model containing ten phrases, each
corresponding to an animal's name.
• SRRecognitionResult -- When an utterance is recognized, an
SRRecognitionResult object is sent (using either an Apple event or a callback
routine, whichever the application prefers) to the application that was
listening for that utterance. The SRRecognitionResult object describes what
was recognized. An application can then look at the result in several forms: as
text, as SRWords and SRPhrases, or as an SRLanguageModel, which can assist
in quickly interpreting the uttered phrase.
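At its simplest, a language model is just the set of utterances the recognizer will accept, like the ten animal names in the example above. The concept can be sketched in plain C (a hypothetical illustration only; the real API builds SRPhrase objects inside an SRLanguageModel rather than matching strings):

```c
#include <string.h>

/* Hypothetical sketch: the simplest language model is a list of
   acceptable phrases; the recognizer reports which one it heard. */
static const char *kAnimalPhrases[] = {
    "cat", "dog", "elephant", "giraffe", "hippopotamus",
    "kangaroo", "lion", "monkey", "penguin", "zebra"
};
enum { kAnimalCount = sizeof(kAnimalPhrases) / sizeof(kAnimalPhrases[0]) };

/* Returns the index of the recognized phrase, or -1 if the
   utterance isn't in the language model. */
int RecognizeUtterance(const char *utterance)
{
    int i;
    for (i = 0; i < kAnimalCount; i++)
        if (strcmp(utterance, kAnimalPhrases[i]) == 0)
            return i;
    return -1;
}
```

Keeping the set of phrases this small is exactly what makes command-and-control recognition fast and accurate.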
Each class of speech object has a number of properties that define how the objects
behave. For example, all descendants of SRLanguageObject have a kSRSpelling property
that shows how they're spelled. Your application uses the SRSetProperty and
SRGetProperty routines to set and get the various properties of each of these objects.
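The calling convention these routines use (a property selector plus an untyped buffer and its size) can be sketched in plain C. The names and error codes below are hypothetical stand-ins, not the real API:

```c
#include <string.h>

/* Hypothetical sketch of the SRSetProperty/SRGetProperty convention:
   the caller names a property with a selector and passes an untyped
   buffer plus its size. (Illustration only; not the real API.) */
enum { kSpellingProperty = 1 };

typedef struct {
    char spelling[64];
} DemoObject;

/* Returns 0 on success, -1 on failure, mirroring OSErr-style results. */
int DemoSetProperty(DemoObject *obj, int selector,
                    const void *data, unsigned long size)
{
    if (selector != kSpellingProperty || size >= sizeof(obj->spelling))
        return -1;
    memcpy(obj->spelling, data, size);
    obj->spelling[size] = '\0';
    return 0;
}

int DemoGetProperty(const DemoObject *obj, int selector,
                    void *buffer, unsigned long bufferSize)
{
    if (selector != kSpellingProperty ||
            bufferSize <= strlen(obj->spelling))
        return -1;
    strcpy((char *)buffer, obj->spelling);
    return 0;
}
```

Passing the size explicitly lets one pair of routines serve properties of any type, which is why the real API can cover everything from spellings to listening modes.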
RELEASING OBJECT REFERENCES
You create objects by calling routines like SRNewRecognizer and SRNewWord. When
you've finished using them, you dispose of them by calling SRReleaseObject. You can
also acquire references to existing objects by calling routines like SRGetIndexedItem
(for example, to get the second word in a phrase of several words). The Speech
Recognition Manager maintains a reference count for each object. An object's reference
count is incremented by SRNew... and SRGet... calls, and is decremented by calls to
SRReleaseObject. An object gets disposed of only when its reference count is
decremented to 0. Thus, to avoid memory leaks, your application must balance every
SRNew... or SRGet... call with a call to SRReleaseObject.
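The balancing rule can be sketched in plain C (a hypothetical illustration; applications never see the Speech Recognition Manager's internal counts):

```c
/* Hypothetical sketch of the reference-counting rule: SRNew... and
   SRGet... calls increment an object's count, SRReleaseObject
   decrements it, and the object is disposed of only at zero. */
typedef struct {
    int refCount;
} RefCounted;

/* Stands in for any SRNew... or SRGet... call. */
void Retain(RefCounted *obj)
{
    obj->refCount++;
}

/* Stands in for SRReleaseObject; returns the remaining count,
   where 0 means the object would now be disposed of. */
int Release(RefCounted *obj)
{
    obj->refCount--;
    return obj->refCount;
}
```

For example, if SRNewWord creates a word and SRGetIndexedItem later hands out a second reference to it, the application owes two SRReleaseObject calls before the word's memory is actually freed.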
A SIMPLE SPEECH RECOGNITION EXAMPLE
It's easy to add simple speech recognition capabilities to your application. All you need
to do is perform a small number of operations in sequence:
• Initialize speech recognition by determining whether a valid version of
the Speech Recognition Manager is installed, opening an SRRecognitionSystem,
allocating an SRRecognizer, and installing an Apple event handler to handle
recognition result notifications.
• Build a language model that specifies the utterances your application is
listening for.
• Set the recognizer's active language model to the one you built and call
SRStartListening to start listening and processing recognition result
notifications.
We'll describe each of these operations in more detail.
INITIALIZING SPEECH RECOGNITION
First, you must verify that a valid version of the Speech Recognition Manager is
installed on the target machine. Listing 1 shows how to do this. Note that only versions
1.5 and later of the Speech Recognition Manager adhere to the API used in this article.
______________________________
Listing 1. Determining the Speech Recognition Manager version
Boolean HasValidSpeechRecognitionVersion (void)
{
    OSErr       status;
    long        theVersion;
    Boolean     validVersion = false;
    const unsigned long kMinimumRequiredSRMVersion = 0x00000150;

    status = Gestalt(gestaltSpeechRecognitionVersion, &theVersion);
    if (!status)
        if (theVersion >= kMinimumRequiredSRMVersion)
            validVersion = true;

    return validVersion;
}
______________________________
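Gestalt reports the version as a packed binary-coded-decimal value, which is why version 1.5 appears as 0x00000150 and an ordinary numeric comparison suffices. A standalone sketch of that comparison (Gestalt itself is replaced by parameters here):

```c
/* Standalone sketch of the packed-version logic in Listing 1.
   The version is packed binary-coded decimal: 1.5 is 0x00000150. */

/* Decode the human-readable major version from the packed value. */
int MajorVersion(long packedVersion)
{
    int bcd = (int)((packedVersion >> 8) & 0xFF);
    return (bcd >> 4) * 10 + (bcd & 0x0F);
}

/* Decode the minor version digit. */
int MinorVersion(long packedVersion)
{
    return (int)((packedVersion >> 4) & 0x0F);
}

/* The same test HasValidSpeechRecognitionVersion performs. */
int VersionIsAtLeast(long theVersion, unsigned long minimumVersion)
{
    return (unsigned long)theVersion >= minimumVersion;
}
```

Because the digits are packed most significant first, later versions always compare numerically greater, so no decoding is needed just to check a minimum.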
Listing 2. Initializing the Speech Recognition Manager
/* Our global variables */
SRRecognitionSystem  gRecognitionSystem = NULL;
SRRecognizer         gRecognizer = NULL;
SRLanguageModel      gTopLanguageModel = NULL;
AEEventHandlerUPP    gAERoutineDescriptor = NULL;

OSErr InitSpeechRecognition (void)
{
    OSErr status = kBadSRMVersion;

    /* Ensure that the Speech Recognition Manager is available. */
    if (HasValidSpeechRecognitionVersion()) {
        /* Open the default recognition system. */
        status = SROpenRecognitionSystem(&gRecognitionSystem,
                    kSRDefaultRecognitionSystemID);

        /* Use standard feedback window and listening modes. */
        if (!status) {
            short feedbackNeeded = kSRHasFeedbackHasListenModes;

            status = SRSetProperty(gRecognitionSystem,
                        kSRFeedbackAndListeningModes, &feedbackNeeded,
                        sizeof(feedbackNeeded));
        }

        /* Create a new recognizer. */
        if (!status)
            status = SRNewRecognizer(gRecognitionSystem, &gRecognizer,
                        kSRDefaultSpeechSource);

        /* Install our Apple event handler for recognition results. */
        if (!status) {
            status = memFullErr;
            gAERoutineDescriptor =
                NewAEEventHandlerProc(HandleRecognitionDoneAE);
            if (gAERoutineDescriptor)
                status = AEInstallEventHandler(kAESpeechSuite,
                            kAESpeechDone, gAERoutineDescriptor, 0,
                            false);
        }
    }
    return status;
}
______________________________
Listing 2 shows how to open an SRRecognitionSystem, allocate an SRRecognizer, and
install your Apple event handler. All of this happens when your application starts up.
The Apple event handler HandleRecognitionDoneAE is shown later (in Listing 4).
Notice in Listing 2 how we call SRSetProperty to request Apple's standard feedback and
listening modes for the recognizer. To have a successful experience with speech
recognition, users need good feedback indicating when the recognizer is ready for them
to talk and what utterances the recognizer has recognized (for more on giving
feedback, see "Speech Recognition Tips"). In addition, because of the recognizer's
tendency to misinterpret background conversation and noises as speech, it's usually a
good idea to let the user tell the recognizer when to listen by pressing a predefined key
(the "push-to-talk" key). Your application can get all of this important behavior for
free, simply by setting the kSRFeedbackAndListeningModes property.
______________________________
SPEECH RECOGNITION TIPS
Speech recognition is a completely new input mode, and using it properly isn't
always as straightforward as it might seem. While we don't yet have a
complete set of human interface guidelines to guarantee a consistent and
intuitive speech recognition user experience, there are a few simple rules
that all speech recognition applications should follow.
GIVE FEEDBACK
Your application must always provide feedback to let users know when they
can speak, when their utterance has been recognized, and how it was
interpreted. The feedback services in the Speech Recognition Manager perform
this for you, using the standard feedback window shown in Figure 3. (The
user's recognized utterances are shown in italics, and the displayed feedback is
in plain text. The string under the feedback character's face indicates the
push-to-talk key.) All you need to do is set the
kSRFeedbackAndListeningModes property as shown in Listing 2.
Figure 3. Standard feedback window
Your application should use this standard feedback behavior unless you have a
very good reason to provide your own feedback and custom push-to-talk
options. (Fast action games that take over the entire screen and don't call
WaitNextEvent are examples of applications that wouldn't use the standard
feedback.) Not only will users enjoy the benefits of consistent behavior, but as
Apple improves the feedback components, your speech recognition applications
will automatically inherit this improved behavior without having to be