develop - June 1993

WRITING LOCALIZABLE APPLICATIONS

JOSEPH TERNASKY AND BRYAN K. ("BEAKER") RESSLER

More and more

software companies are finding rich new markets overseas. Unfortunately, many of

these developers have also discovered that localizing an application involves a lot more

than translating a bunch of STR# resources. In fact, localization often becomes an

unexpectedly long, complex, and expensive development cycle. This article describes

some common problems and gives proactive engineering advice you can use during

initial U.S. development to speed your localization efforts later on.

Most software localization headaches are associated with text drawing and character

handling, so that's what this article stresses. Four common areas of difficulty are:

• keyboard input (specifically for two-byte scripts)

• choice of fonts and sizes for screen display

• date, time, number, and currency formats and sorting order

• character encodings

We discuss each of these potential pitfalls in detail and provide data structures and

example code.

PRELIMINARIES

Throughout the discussion, we assume you're developing primarily for the U.S.

market, but you're planning to publish internationally eventually (or at least you're

trying to keep your options open). As you're developing your strategy, here are a few

points to keep in mind:

• Don't dismiss any markets out of hand -- investigate the potential

rewards for entry into a particular market and the features required for that

market.

• The amount of effort required to support western Europe is relatively

small. Depending on the type of application you're developing, the additional

effort required for other countries isn't that much more. There's also a

growing market for non-Roman script systems inside the U.S.

• The labor required to build a truly global program is much less if you do

the work up front, rather than writing quick-and-dirty code for the U.S. and

having to rewrite it later.

• Consider market growth trends. A market that's small now may be big

later.

This article concentrates on features for western Europe and Japan because those are

the markets we're most familiar with. We encourage you to investigate other markets

on your own.

LINGO LESSON 101

This international software thing is rife with specialized lingo. For a complete

explanation of all the terms, see the hefty "Worldwide Software Overview," Chapter

14 of Inside Macintosh Volume VI. But we're not here to intimidate, so let's go over a

few basic terms.

Script. A writing system that can be used to represent one or more human languages.

For example, the Roman script is used to represent English, Spanish, Hungarian, and

so on. Scripts fall into several categories, as described in the next section, "Script

Categories."

Script code. An integer that identifies a script on the Macintosh.

Encoding. A mapping between characters and integers. Each character in the

character set is assigned a unique integer, called its character code. If a character

appears in more than one character set it may have more than one encoding, a situation

discussed later in the section "Dealing With Character Encodings." Since each script

has a unique encoding, sometimes the terms script and encoding are used

interchangeably.

Character code. An integer that's associated with a given character in a script.

Glyph. The displayed form of a character. The glyph for a given character code may

not always be the same -- in some scripts the codes of the surrounding characters

provide a context for choosing a particular glyph.

Line orientation. The overall direction of text flow within a line. For instance,

English has left-to-right line orientation, while Japanese can use either

top-to-bottom (vertical) or left-to-right (horizontal) line orientation.

Character orientation. The relationship between a character's baseline and the

line orientation. When the line orientation and the character baselines go in the same

direction, it's called with-stream character orientation. When the line orientation

differs from the character baseline direction, it's called cross-stream character

orientation. For instance, in Japanese, when the line orientation is left-to-right,

characters are also oriented left-to-right (with-stream). Japanese can also be

formatted with a top-to-bottom (vertical) line orientation, in which case character

baselines can be left-to-right (cross-stream) or top-to-bottom (with-stream). See

Figure 1.

Figure 1 Line and Character Orientation in Mixed Japanese/English Text

SCRIPT CATEGORIES

Scripts fall into different categories that require different software solutions. Here

are the basic categories:

• Simple scripts have small character sets (fewer than 256 characters),

and no context information is required to choose a glyph for a given character

code. They have left-to-right lines and top-to-bottom pages. Simple scripts

encompass the languages of the U.S. and Europe, as well as many other

countries worldwide. For example, some simple scripts are Roman, Cyrillic,

and Greek.

• Two-byte scripts have large character sets (up to 28,000 characters)

and require no context information for glyph choice. They use various

combinations of left-to-right or top-to-bottom lines and top-to-bottom or

right-to-left pages. Two-byte scripts include the languages of Japan, China,

Hong Kong, Taiwan, and Korea.

• Context-sensitive scripts have a small character set (fewer than 256

characters) but may have a larger glyph set, since there are potentially

several graphic representations for any given character code. The mapping

from a given character code to a glyph depends on surrounding characters.

Most context-sensitive scripts, such as Devanagari and Bengali, have

left-to-right lines and top-to-bottom pages.

• Bidirectional scripts can have runs of left-to-right and right-to-left

characters appearing simultaneously in a single line of text. These scripts

have small character sets (fewer than 256 characters) and require no context

information for glyph choice. Bidirectional scripts are used for languages such

as Hebrew that have both left-to-right and right-to-left characters, with

top-to-bottom pages.

There are a few exceptional scripts that fall into more than one of these categories,

such as Arabic and Urdu. Arabic, for instance, is both context sensitive and

bidirectional.
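Because a script can belong to more than one category, a set of bit flags describes a script better than a single category code. The following is only a sketch in portable C: the flag names, the ScriptInfo structure, and the FindScript routine are all hypothetical and aren't part of the Script Manager. A script with no flags set is treated as a simple script.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical trait flags; a script may set more than one. */
enum {
    kTraitTwoByte    = 1 << 0,  /* large character set, two-byte codes  */
    kTraitContextual = 1 << 1,  /* glyph choice depends on neighbors    */
    kTraitBidi       = 1 << 2   /* mixed left-to-right / right-to-left  */
};

typedef struct {
    const char *name;    /* script name, for illustration only */
    unsigned    traits;  /* OR of the kTrait flags; 0 means simple */
} ScriptInfo;

/* A few scripts mentioned in the text, classified as described there. */
static const ScriptInfo kScripts[] = {
    { "Roman",      0 },
    { "Cyrillic",   0 },
    { "Greek",      0 },
    { "Japanese",   kTraitTwoByte },
    { "Devanagari", kTraitContextual },
    { "Hebrew",     kTraitBidi },
    { "Arabic",     kTraitContextual | kTraitBidi }
};

/* Returns the table entry for a script name, or NULL if it isn't listed. */
static const ScriptInfo *FindScript(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof kScripts / sizeof kScripts[0]; i++) {
        if (strcmp(kScripts[i].name, name) == 0)
            return &kScripts[i];
    }
    return NULL;
}

/* A simple script has none of the special traits. */
static int IsSimpleScript(const ScriptInfo *info)
{
    return info->traits == 0;
}
```

With this arrangement Arabic needs no special case: its entry simply carries both the context-sensitive and the bidirectional flag.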

Now with the preliminaries out of the way, we're ready to discuss some localization

pitfalls.

KEYBOARD INPUT

Sooner or later, your users are going to start typing. You can't stop them. So now what

do you do? One approach is to simply ignore keyboard input. While perfectly

acceptable to open-minded engineers like yourself, your Marketing colleagues may

find this approach unacceptable. So, let's examine what happens when two-byte script

users type on their keyboards.

Obviously, a Macintosh keyboard doesn't have enough keys to allow users of two-byte

script systems to simply press the key corresponding to the one character they want

out of 28,000. Instead, two-byte systems are equipped with a software input method,

also called a front-end processor or FEP, which allows users to type phonetically on a

keyboard similar to the standard U.S. keyboard. (Some input methods use strokes or

codes instead of phonetics, but the mechanism is the same.)

As soon as the user begins typing, a small input window appears at the bottom of the

screen. When the user signals the input method, it displays various readings that

correspond to the typed input. These readings may include one or more two-byte

characters. There may be more than one valid reading of a given "clause" of input, in

which case the user must choose the appropriate reading.

When satisfied, the user accepts the readings, which are then flushed from the input

window and sent to the application as key-down events. Since the Macintosh was never

really designed for two-byte characters, a two-byte character is sent to the

application as two separate one-byte key-down events. Interspersed in the stream of

key-down events there may also be one-byte characters, encoded as ASCII.

Before getting overwhelmed by all this, consider two important points. First, the input

method is taking the keystrokes for you. The keystrokes the user types are not being

sent directly into your application -- they're being processed first. Also, since the

user can type a lot into the input method before accepting the processed input, you can

get a big chunk of key-down events at once.

So let's see what your main event loop should look like in its simplest form if you want

to properly accept mixed one- and two-byte characters:

// Globals
unsigned short gCharBuf;      // Buffer that holds our (possibly
                              // two-byte) character
Boolean        gNeed2ndByte;  // Flag that tells us we're waiting
                              // for the second byte of a two-byte
                              // character

void EventLoop(void)
{
    EventRecord   event;        // The current event
    short         cbResult;     // The result of our CharByte call
    unsigned char oneByte;      // Single byte extracted from event
    Boolean       processChar;  // Whether we should send our
                                // application a key message

    if (WaitNextEvent(everyEvent, &event, SleepTime(), nil)) {
        switch (event.what) {
            . . .
            case keyDown:
            case autoKey:
                . . .
                // Your code checks for Command-key equivalents here.
                . . .
                processChar = false;
                oneByte = (event.message & charCodeMask);
                if (gNeed2ndByte) {
                    // We're expecting the second byte of a two-byte
                    // character. So OR the byte into the low byte of
                    // our accumulated two-byte character.
                    gCharBuf = (gCharBuf << 8) | oneByte;
                    cbResult = CharByte((Ptr)&gCharBuf, 1);
                    if (cbResult == smLastByte)
                        processChar = true;
                    gNeed2ndByte = false;
                } else {
                    // We're not expecting anything in particular. We
                    // might get a one-byte character, or we might
                    // get the first byte of a two-byte character.
                    gCharBuf = oneByte;
                    cbResult = CharByte((Ptr)&gCharBuf, 1);
                    if (cbResult == smFirstByte)
                        gNeed2ndByte = true;
                    else if (cbResult == smSingleByte)
                        processChar = true;
                }
                // Now possibly send the typed character to the rest
                // of the application.
                if (processChar)
                    AppKey(gCharBuf);
                break;
            case . . .
        }
    }
}

CharByte returns smSingleByte, smFirstByte, or smLastByte. You use this

information to determine what to do with a given key event. Notice that the AppKey

routine takes an unsigned short as a parameter. That's very important. For an

application to be two-byte script compatible, you need to always pass unsigned shorts

around for a single character. This example is also completely one-byte compatible --

if you put this event loop in your application, it works in the U.S.

The example assumes that the grafPort is set to the document window and the port's

font is set correctly, which is important because the Script Manager's behavior is

governed by the font of the current grafPort (see "Script Manager Caveats"). Although

this event loop works fine on both one-byte and two-byte systems, it could be made

more efficient. For example, since input methods sometimes send you a whole mess of

characters at a time, you could buffer up the characters into a string and send them

wholesale to AppKey, making it possible for your application to do less redrawing on

the screen.
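To make the buffering idea concrete, here's a minimal sketch in portable C of collecting a burst of key-down bytes into whole characters before handing them on. It makes several assumptions: IsLeadByte stands in for the Script Manager's CharByte call and hard-codes the Shift-JIS first-byte ranges, and the KeyAssembler structure and the FeedByte and CollectChars routines are hypothetical names invented for this illustration. A real application would ask the Script Manager rather than hard-code byte ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS first-byte ranges stand in for CharByte(). */
static int IsLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical state carried between key-down events. */
typedef struct {
    unsigned short pending;  /* first byte waiting for its partner      */
    int            need2nd;  /* nonzero while the second byte is awaited */
} KeyAssembler;

/* Feed one byte. Returns 1 and stores a whole character in *out
   when one is complete; returns 0 if more input is needed. */
static int FeedByte(KeyAssembler *ka, unsigned char b, unsigned short *out)
{
    if (ka->need2nd) {
        *out = (unsigned short)((ka->pending << 8) | b);
        ka->need2nd = 0;
        return 1;
    }
    if (IsLeadByte(b)) {
        ka->pending = b;
        ka->need2nd = 1;
        return 0;
    }
    *out = b;  /* one-byte character */
    return 1;
}

/* Collect a burst of bytes into whole characters so the caller can
   process (and redraw) them in one pass. Returns the number of
   characters stored. The caller should supply room for n characters;
   input is not consumed once maxChars characters have been stored. */
static size_t CollectChars(KeyAssembler *ka, const unsigned char *bytes,
                           size_t n, unsigned short *chars, size_t maxChars)
{
    size_t i, count = 0;

    assert(ka != NULL && bytes != NULL && chars != NULL);
    for (i = 0; i < n && count < maxChars; i++) {
        unsigned short c;

        if (FeedByte(ka, bytes[i], &c))
            chars[count++] = c;
    }
    return count;
}
```

Because the assembler keeps its state between calls, a two-byte character whose halves arrive in separate bursts is still put together correctly, and each complete character travels as an unsigned short, just as AppKey expects.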

AVOIDING FONT TYRANNY

Have you ever written the following lines of code?

void DrawMessage(short messageNum)
{
    Str255 theString;

    GetIndString(theString, kMessageStrList, messageNum);
    TextFont(geneva);
    TextSize(9);
    MoveTo(kMessageXPos, kMessageYPos);
    DrawString(theString);
}

If so, you're overdue for a good spanking. While we're very proud of you for putting

that string into a resource like a good international programmer, the font, size, and

pen position are a little too, well, specific. Granted, it's hard to talk yourself out of

using all those nice constants defined in Fonts.h, but if you're trying to write a

localizable application, this is definitely the wrong approach.

A better approach is to do this:

TextFont(applFont);
TextSize(0);
GetFontInfo(&fontInfo);
MoveTo(kMessageXPos, kMessageYMargin + fontInfo.ascent +
       fontInfo.leading);

Since applFont is always a font in the system script, and TextSize(0) gives a size

appropriate to the system script, you get the right output. Plus, you're now

positioning the pen based on the font, instead of using absolute coordinates. This is

important. For instance, on a Japanese system TextSize(0) results in a point size of

12, so the code in the preceding example might not work if the pen-positioning

constants were set up to assume a 9-point font height.
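The arithmetic behind font-relative positioning can be sketched in portable C. The FontMetrics structure below mirrors the four fields of QuickDraw's FontInfo record, but the structure name and the three routines are hypothetical, and in a real application the numbers would come from GetFontInfo rather than from hand-written values.

```c
/* Mirrors the fields of QuickDraw's FontInfo record. */
typedef struct {
    short ascent;
    short descent;
    short widMax;
    short leading;
} FontMetrics;

/* Baseline of the first line, measured down from a top margin. */
static short FirstBaseline(short topMargin, const FontMetrics *fm)
{
    return (short)(topMargin + fm->ascent + fm->leading);
}

/* Distance from one baseline to the next. */
static short LineHeight(const FontMetrics *fm)
{
    return (short)(fm->ascent + fm->descent + fm->leading);
}

/* Baseline of line n (0-based) below the same top margin. */
static short NthBaseline(short topMargin, const FontMetrics *fm, short n)
{
    return (short)(FirstBaseline(topMargin, fm) + n * LineHeight(fm));
}
```

With made-up metrics of ascent 9 and leading 1, a 4-pixel margin puts the first baseline at 14. With a taller font of ascent 12 and leading 2, the same margin puts it at 18. The layout adapts without any change to the constants.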

If you want to make life even easier for your localizers, you could eliminate the

pen-positioning constants altogether. Instead, use an existing resource type (the

'DITL' type is appropriate for this example) to store the layout of the text items in the

window. Even though you're drawing the items yourself, you can still use the

information in the resource to determine the layout, and the localizers can then change

the layout using a resource editor -- which is a lot better than hacking your code.
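As a rough illustration of layout driven by data rather than constants, the following portable C sketch looks up an item's rectangle in a table and derives the pen position from it. The table stands in for rectangles that would be read from the resource; the ItemRect structure (which uses the same field order as a QuickDraw Rect) and the PenForItem routine are hypothetical names, and item numbers are assumed to start at 1, as dialog item numbers do.

```c
/* Same field order as a QuickDraw Rect. */
typedef struct {
    short top;
    short left;
    short bottom;
    short right;
} ItemRect;

/* Computes the pen position for drawing text in item itemNo
   (1-based): the left edge of the item's rectangle, with the
   baseline one ascent below its top. Returns 1 on success,
   0 if itemNo is out of range. */
static int PenForItem(const ItemRect *items, int itemCount, int itemNo,
                      short ascent, short *h, short *v)
{
    if (itemNo < 1 || itemNo > itemCount)
        return 0;
    *h = items[itemNo - 1].left;
    *v = (short)(items[itemNo - 1].top + ascent);
    return 1;
}
```

A localizer who needs more room for a translated string then edits the rectangle in the resource, and the drawing code picks up the new position without being recompiled.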

There are some other interesting ways to approach this problem. Depending on what
