June 93 - WRITING LOCALIZABLE APPLICATIONS
WRITING LOCALIZABLE APPLICATIONS
JOSEPH TERNASKY AND BRYAN K. ("BEAKER") RESSLER
More and more
software companies are finding rich new markets overseas. Unfortunately, many of
these developers have also discovered that localizing an application involves a lot more
than translating a bunch of STR# resources. In fact, localization often becomes an
unexpectedly long, complex, and expensive development cycle. This article describes
some common problems and gives proactive engineering advice you can use during
initial U.S. development to speed your localization efforts later on.
Most software localization headaches are associated with text drawing and character
handling, so that's what this article stresses. Four common areas of difficulty are:
• keyboard input (specifically for two-byte scripts)
• choice of fonts and sizes for screen display
• date, time, number, and currency formats and sorting order
• character encodings
We discuss each of these potential pitfalls in detail and provide data structures and
example code.
PRELIMINARIES
Throughout the discussion, we assume you're developing primarily for the U.S.
market, but you're planning to publish internationally eventually (or at least you're
trying to keep your options open). As you're developing your strategy, here are a few
points to keep in mind:
• Don't dismiss any markets out of hand -- investigate the potential
rewards for entry into a particular market and the features required for that
market.
• The amount of effort required to support western Europe is relatively
small. Depending on the type of application you're developing, the additional
effort required for other countries isn't that much more. There's also a
growing market for non-Roman script systems inside the U.S.
• The labor required to build a truly global program is much less if you do
the work up front, rather than writing quick-and-dirty code for the U.S. and
having to rewrite it later.
• Consider market growth trends. A market that's small now may be big
later.
This article concentrates on features for western Europe and Japan because those are
the markets we're most familiar with. We encourage you to investigate other markets
on your own.
LINGO LESSON 101
This international software thing is rife with specialized lingo. For a complete
explanation of all the terms, see the hefty "Worldwide Software Overview," Chapter
14 of Inside Macintosh Volume VI. But we're not here to intimidate, so let's go over a
few basic terms.
Script. A writing system that can be used to represent one or more human languages.
For example, the Roman script is used to represent English, Spanish, Hungarian, and
so on. Scripts fall into several categories, as described in the next section, "Script
Categories."
Script code. An integer that identifies a script on the Macintosh.
Encoding. A mapping between characters and integers. Each character in the
character set is assigned a unique integer, called its character code. If a character
appears in more than one character set it may have more than one encoding, a situation
discussed later in the section "Dealing With Character Encodings." Since each script
has a unique encoding, sometimes the terms script and encoding are used
interchangeably.
Character code. An integer that's associated with a given character in a script.
Glyph. The displayed form of a character. The glyph for a given character code may
not always be the same -- in some scripts the codes of the surrounding characters
provide a context for choosing a particular glyph.
Line orientation. The overall direction of text flow within a line. For instance,
English has left-to-right line orientation, while Japanese can use either
top-to-bottom (vertical) or left-to-right (horizontal) line orientation.
Character orientation. The relationship between a character's baseline and the
line orientation. When the line orientation and the character baselines go in the same
direction, it's called with-stream character orientation. When the line orientation
differs from the character baseline direction, it's called cross-stream character
orientation. For instance, in Japanese, when the line orientation is left-to-right,
characters are also oriented left-to-right (with-stream). Japanese can also be
formatted with a top-to-bottom (vertical) line orientation, in which case character
baselines can be left-to-right (cross-stream) or top-to-bottom (with-stream). See
Figure 1.
Figure 1 Line and Character Orientation in Mixed Japanese/English Text
SCRIPT CATEGORIES
Scripts fall into different categories that require different software solutions. Here
are the basic categories:
• Simple scripts have small character sets (fewer than 256 characters),
and no context information is required to choose a glyph for a given character
code. They have left-to-right lines and top-to-bottom pages. Simple scripts
encompass the languages of the U.S. and Europe, as well as many other
countries worldwide. For example, some simple scripts are Roman, Cyrillic,
and Greek.
• Two-byte scripts have large character sets (up to 28,000 characters)
and require no context information for glyph choice. They use various
combinations of left-to-right or top-to-bottom lines and top-to-bottom or
right-to-left pages. Two-byte scripts include the languages of Japan, China,
Hong Kong, Taiwan, and Korea.
• Context-sensitive scripts have a small character set (fewer than 256
characters) but may have a larger glyph set, since there are potentially
several graphic representations for any given character code. The mapping
from a given character code to a glyph depends on surrounding characters.
Most context-sensitive scripts, such as Devanagari and Bengali, have
left-to-right lines and top-to-bottom pages.
• Bidirectional scripts can have runs of left-to-right and right-to-left
characters appearing simultaneously in a single line of text. These scripts
have small character sets (fewer than 256 characters) and require no context
information for glyph choice. Bidirectional scripts are used for languages such
as Hebrew that have both left-to-right and right-to-left characters, with
top-to-bottom pages.
There are a few exceptional scripts that fall into more than one of these categories,
such as Arabic and Urdu. Arabic, for instance, is both context sensitive and
bidirectional.
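One way to carry these categories around in code is as a set of independent trait flags rather than a single enumerated type, so that a script such as Arabic can be both context sensitive and bidirectional. The following is a minimal portable C sketch, not a Script Manager interface. The names kTwoByte, kContextual, kBidirectional, ScriptInfo, and FindScript are hypothetical, and the table lists only scripts mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical trait flags; a simple script has none of them set. */
enum {
    kTwoByte       = 1 << 0,   /* large character set, two bytes per character */
    kContextual    = 1 << 1,   /* glyph depends on surrounding characters */
    kBidirectional = 1 << 2    /* mixed left-to-right and right-to-left runs */
};

typedef struct {
    const char *name;
    unsigned    traits;
} ScriptInfo;

/* Illustrative table built from the categories described in the text. */
static const ScriptInfo kScripts[] = {
    { "Roman",      0 },
    { "Cyrillic",   0 },
    { "Greek",      0 },
    { "Japanese",   kTwoByte },
    { "Devanagari", kContextual },
    { "Bengali",    kContextual },
    { "Hebrew",     kBidirectional },
    { "Arabic",     kContextual | kBidirectional }
};

/* Returns the table entry for a named script, or NULL if it isn't listed. */
const ScriptInfo *FindScript(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof kScripts / sizeof kScripts[0]; i++) {
        if (strcmp(kScripts[i].name, name) == 0)
            return &kScripts[i];
    }
    return NULL;
}
```

Code that needs to know whether to run a bidirectional layout pass can then test `FindScript("Arabic")->traits & kBidirectional` without caring which other traits are set.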
Now with the preliminaries out of the way, we're ready to discuss some localization
pitfalls.
KEYBOARD INPUT
Sooner or later, your users are going to start typing. You can't stop them. So now what
do you do? One approach is to simply ignore keyboard input. While this is perfectly
acceptable to open-minded engineers like yourself, your Marketing colleagues may
find it unacceptable. So, let's examine what happens when two-byte script
users type on their keyboards.
Obviously, a Macintosh keyboard doesn't have enough keys to allow users of two-byte
script systems to simply press the key corresponding to the one character they want
out of 28,000. Instead, two-byte systems are equipped with a software input method,
also called a front-end processor or FEP, which allows users to type phonetically on a
keyboard similar to the standard U.S. keyboard. (Some input methods use strokes or
codes instead of phonetics, but the mechanism is the same.)
As soon as the user begins typing, a small input window appears at the bottom of the
screen. When the user signals the input method, it displays various readings that
correspond to the typed input. These readings may include one or more two-byte
characters. There may be more than one valid reading of a given "clause" of input, in
which case the user must choose the appropriate reading.
When satisfied, the user accepts the readings, which are then flushed from the input
window and sent to the application as key-down events. Since the Macintosh was never
really designed for two-byte characters, a two-byte character is sent to the
application as two separate one-byte key-down events. Interspersed in the stream of
key-down events there may also be one-byte characters, encoded as ASCII.
Before getting overwhelmed by all this, consider two important points. First, the input
method is taking the keystrokes for you. The keystrokes the user types are not being
sent directly into your application -- they're being processed first. Also, since the
user can type a lot into the input method before accepting the processed input, you can
get a big chunk of key-down events at once.
So let's see what your main event loop should look like in its simplest form if you want
to properly accept mixed one- and two-byte characters:
// Globals
unsigned short  gCharBuf;       // Buffer that holds our (possibly
                                // two-byte) character
Boolean         gNeed2ndByte;   // Flag that tells us we're waiting
                                // for the second byte of a two-byte
                                // character

void EventLoop(void)
{
    EventRecord     event;          // The current event
    short           cbResult;       // The result of our CharByte call
    unsigned char   oneByte;        // Single byte extracted from event
    Boolean         processChar;    // Whether we should send our
                                    // application a key message

    if (WaitNextEvent(everyEvent, &event, SleepTime(), nil)) {
        switch (event.what) {
            . . .
            case keyDown:
            case autoKey:
                . . .
                // Your code checks for Command-key equivalents here.
                . . .
                processChar = false;
                oneByte = (event.message & charCodeMask);
                if (gNeed2ndByte) {
                    // We're expecting the second byte of a two-byte
                    // character. So OR the byte into the low byte of
                    // our accumulated two-byte character.
                    gCharBuf = (gCharBuf << 8) | oneByte;
                    cbResult = CharByte((Ptr)&gCharBuf, 1);
                    if (cbResult == smLastByte)
                        processChar = true;
                    gNeed2ndByte = false;
                } else {
                    // We're not expecting anything in particular. We
                    // might get a one-byte character, or we might
                    // get the first byte of a two-byte character.
                    gCharBuf = oneByte;
                    cbResult = CharByte((Ptr)&gCharBuf, 1);
                    if (cbResult == smFirstByte)
                        gNeed2ndByte = true;
                    else if (cbResult == smSingleByte)
                        processChar = true;
                }

                // Now possibly send the typed character to the rest
                // of the application.
                if (processChar)
                    AppKey(gCharBuf);
                break;
            case . . .
        }
    }
}
CharByte returns smSingleByte, smFirstByte, or smLastByte. You use this
information to determine what to do with a given key event. Notice that the AppKey
routine takes an unsigned short as a parameter. That's very important. For an
application to be two-byte script compatible, you need to always pass unsigned shorts
around for a single character. This example is also completely one-byte compatible --
if you put this event loop in your application, it works in the U.S.
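To make the "always pass unsigned shorts" rule a little more concrete, here's a small portable sketch of helpers that move between a packed unsigned short character and the bytes stored in a text buffer. The names CharLength and AppendChar are hypothetical. The sketch assumes, as the event loop above does, that a one-byte character has a zero high byte and that a two-byte character carries its first byte in the high byte.

```c
#include <stddef.h>

/* Number of bytes a packed character occupies in a text buffer:
   a zero high byte means a one-byte character. */
size_t CharLength(unsigned short ch)
{
    return (ch & 0xFF00) ? 2 : 1;
}

/* Appends a packed character to buf at *len, first byte first.
   Returns 1 on success, 0 if there isn't room (buf holds cap bytes). */
int AppendChar(unsigned char *buf, size_t cap, size_t *len, unsigned short ch)
{
    size_t n = CharLength(ch);
    if (*len + n > cap)
        return 0;
    if (n == 2)
        buf[(*len)++] = (unsigned char)(ch >> 8);
    buf[(*len)++] = (unsigned char)(ch & 0xFF);
    return 1;
}
```

Because AppendChar checks the room needed for the whole character before writing anything, a full buffer never ends up holding the first half of a two-byte character.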
The example assumes that the grafPort is set to the document window and the port's
font is set correctly, which is important because the Script Manager's behavior is
governed by the font of the current grafPort (see "Script Manager Caveats"). Although
this event loop works fine on both one-byte and two-byte systems, it could be made
more efficient. For example, since input methods sometimes send you a whole mess of
characters at a time, you could buffer up the characters into a string and send them
wholesale to AppKey, making it possible for your application to do less redrawing on
the screen.
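The buffering idea might be sketched in portable C along the following lines. This is a simplified stand-in, not Toolbox code: IsLeadByte is a hypothetical substitute for CharByte that hard-codes the first-byte ranges of a Shift-JIS-based Japanese encoding, and CollectChars is a hypothetical routine that turns a burst of one-byte key codes into packed unsigned short characters your application could process in one pass.

```c
#include <stddef.h>

/* Hypothetical stand-in for CharByte: first-byte ranges of a
   Shift-JIS-based Japanese encoding. A real application would ask
   the Script Manager instead of hard-coding one encoding. */
int IsLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Packs a burst of n key-down bytes into whole characters.
   out must have room for n characters (each byte yields at most one).
   Returns the number of characters written. A trailing unpaired first
   byte is left in *pending (0 if none) so that it can be combined
   with the first byte of the next burst. */
size_t CollectChars(const unsigned char *bytes, size_t n,
                    unsigned short *out, unsigned char *pending)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        unsigned char b = bytes[i];
        if (*pending) {
            /* Second byte: combine it with the saved first byte. */
            out[count++] = (unsigned short)((*pending << 8) | b);
            *pending = 0;
        } else if (IsLeadByte(b)) {
            /* First byte of a two-byte character: wait for its partner. */
            *pending = b;
        } else {
            /* Ordinary one-byte character. */
            out[count++] = b;
        }
    }
    return count;
}
```

The caller keeps the pending byte between calls, playing the same role as gNeed2ndByte and gCharBuf in the event loop, and redraws once per burst instead of once per character.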
AVOIDING FONT TYRANNY
Have you ever written the following lines of code?
void DrawMessage(short messageNum)
{
    Str255 theString;

    GetIndString(theString, kMessageStrList, messageNum);
    TextFont(geneva);
    TextSize(9);
    MoveTo(kMessageXPos, kMessageYPos);
    DrawString(theString);
}
If so, you're overdue for a good spanking. While we're very proud of you for putting
that string into a resource like a good international programmer, the font, size, and
pen position are a little too, well, specific. Granted, it's hard to talk yourself out of
using all those nice constants defined in Fonts.h, but if you're trying to write a
localizable application, this is definitely the wrong approach.
A better approach is to do this:
    FontInfo fontInfo;      // Declared with DrawMessage's other locals
    . . .
    TextFont(applFont);
    TextSize(0);
    GetFontInfo(&fontInfo);
    MoveTo(kMessageXPos, kMessageYMargin + fontInfo.ascent +
           fontInfo.leading);
Since applFont is always a font in the system script, and TextSize(0) gives a size
appropriate to the system script, you get the right output. Plus, you're now
positioning the pen based on the font, instead of using absolute coordinates. This is
important. For instance, on a Japanese system TextSize(0) results in a point size of
12, so the code in the preceding example might not work if the pen-positioning
constants were set up to assume a 9-point font height.
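The arithmetic behind font-relative positioning is simple enough to show in isolation. The sketch below is portable C rather than QuickDraw code. Metrics is a hypothetical stand-in for the ascent, descent, and leading fields that GetFontInfo fills in, and LineHeight and BaselineForLine are illustrative names, not Toolbox routines.

```c
/* Hypothetical stand-in for the vertical fields of a FontInfo record. */
typedef struct {
    short ascent;
    short descent;
    short leading;
} Metrics;

/* Distance from one baseline to the next. */
short LineHeight(const Metrics *m)
{
    return (short)(m->ascent + m->descent + m->leading);
}

/* Baseline of the given line (0-based), measured down from topMargin. */
short BaselineForLine(const Metrics *m, short topMargin, short line)
{
    return (short)(topMargin + m->leading + m->ascent + line * LineHeight(m));
}
```

With made-up metrics of ascent 9, descent 2, leading 1 for a small font and ascent 12, descent 3, leading 2 for a larger one, a 4-pixel top margin puts the first baseline at 14 for the small font and 18 for the larger one. A hard-coded vertical position tuned for the small font would clip the larger text.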
If you want to make life even easier for your localizers, you could eliminate the
pen-positioning constants altogether. Instead, use an existing resource type (the
'DITL' type is appropriate for this example) to store the layout of the text items in the
window. Even though you're drawing the items yourself, you can still use the
information in the resource to determine the layout, and the localizers can then change
the layout using a resource editor -- which is a lot better than hacking your code.
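A data-driven layout might look something like the following portable sketch. The kMessageItems table is a hypothetical stand-in for item rectangles read from a resource such as a 'DITL'. No resource-parsing code is shown, the coordinates are invented, and ItemRect, PenPos, and PenForItem are illustrative names only.

```c
#include <stddef.h>

/* Hypothetical item rectangle, in the top/left/bottom/right order
   QuickDraw uses for Rect. */
typedef struct { short top, left, bottom, right; } ItemRect;

/* Hypothetical pen position. */
typedef struct { short h, v; } PenPos;

/* Stand-in for rectangles a localizer would edit in the resource;
   the coordinates here are invented for illustration. */
static const ItemRect kMessageItems[] = {
    { 10, 10, 26, 200 },
    { 30, 10, 46, 200 }
};

/* Computes where to put the pen so that text with the given ascent
   sits at the top-left of item `index`.
   Returns 1 on success, 0 if index is out of range. */
int PenForItem(size_t index, short ascent, PenPos *pen)
{
    if (index >= sizeof kMessageItems / sizeof kMessageItems[0])
        return 0;
    pen->h = kMessageItems[index].left;
    pen->v = (short)(kMessageItems[index].top + ascent);
    return 1;
}
```

The drawing code never mentions a coordinate, so moving or resizing a message is a change to the table (in practice, to the resource) rather than to the source.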
There are some other interesting ways to approach this problem. Depending on what