An Introduction to Scripts

A script, as used in this section, is a writing system for a human language.

There are about 30 living scripts that are used to represent the official

languages of one or more regions and countries. Examples of writing systems

are Roman, Chinese, Japanese, Hebrew, and Arabic. They all have distinct

attributes. Simple scripts, such as Roman, Greek, and Cyrillic, usually have

fewer than 256 characters; the Japanese script theoretically contains more

than 40,000. The characters of printed Roman are relatively independent of

each other; Arabic characters change shape depending on the characters that

surround them.

Scripts may vary in other attributes: the direction in which their characters

and lines run, the size of the character set used to represent the script, and the

context sensitivity of the script. Some scripts, such as Japanese, actually

include multiple subscripts. (A subscript is a distinguishable subset of

characters that is included within a script. Subscripts in the Japanese script

include Hiragana, Kanji, Katakana, and Romaji.) Each of these attributes

significantly affects the script's representation on the computer, and each is

discussed in the following sections. The Figure below shows notations for the

names of various scripts, languages, and regions in the appropriate script.

Scripts

Character Representation

Scripts differ in the kind and number of characters they require to represent

words. Some scripts are basically alphabetic: the characters in the script

symbolize, more or less, the discrete phonemic elements in the languages

represented by the script. Other scripts, such as Japanese Hiragana and

Katakana, are syllabic: the characters stand for syllables in the language. The

languages that syllabic scripts represent tend to have relatively simple

syllables.

Other scripts (namely, Japanese Kanji, Chinese Hanzi, and Korean

Hanja) include ideographic characters. These do not represent pronunciation

alone, but are also related to the component meanings of words. A typical

character set for ideographic scripts is quite large, ranging from 7,000 to

30,000 characters. Obviously, a standard single-byte encoding (limited to

256 distinct values) cannot be used to represent these characters, nor can a

keyboard be used to enter so many characters directly.
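
The size argument above can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of the Script Manager; the function name bytes_needed is hypothetical, and the character counts are the estimates quoted in this section.

```python
def bytes_needed(character_count: int) -> int:
    """Smallest whole number of bytes that gives each character its own code."""
    if character_count < 1:
        raise ValueError("character_count must be positive")
    # Bits required to number the codes 0 .. character_count - 1.
    bits = max(1, (character_count - 1).bit_length())
    # Round up to whole bytes.
    return (bits + 7) // 8


# Estimates taken from the text; the labels are illustrative only.
for label, count in [
    ("simple script (at most 256 characters)", 256),
    ("ideographic set, low estimate", 7000),
    ("ideographic set, high estimate", 30000),
]:
    print(f"{label}: {bytes_needed(count)} byte(s) per character")
```

Under these assumptions a simple script fits in one byte per character, while both ideographic estimates need at least two.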

The Figure below shows examples of alphabetic, syllabic, and ideographic

character representations.

Alphabetic, syllabic, and ideographic representations of characters

Text Direction

Scripts also vary in the direction in which characters are written. In Roman

scripts, characters are inscribed from left to right, with horizontal lines of

characters written from top to bottom. However, scripts like Arabic and

Hebrew have most characters written from right to left, although the

horizontal lines of text are still written from top to bottom. In Japanese and

Chinese, characters are traditionally written from top to bottom, with vertical

lines (columns) of characters written from right to left. The Figure below

shows three text directions. These three script types (that is, left-right

top-bottom, right-left top-bottom, and top-bottom right-left) are the most

common of the eight possible combinations of character and line directions.

Different scripts can occur in the same line on a screen. Thus a line of text

containing both Arabic and English is actually bidirectional: some characters

go from left to right, and some from right to left.
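
To make the idea of a bidirectional line concrete, the sketch below splits a string into left-to-right and right-to-left runs. It is an illustration only: it relies on the character classes in Python's unicodedata module rather than on the Script Manager, the function names are hypothetical, and letting neutral characters (such as spaces) inherit the direction of the preceding character is a deliberate simplification, not a full bidirectional layout algorithm.

```python
import unicodedata

# Unicode bidirectional classes for strongly right-to-left characters.
RTL_CLASSES = {"R", "AL"}


def strong_direction(ch):
    """Return 'RTL' or 'LTR' for strongly directional characters, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in RTL_CLASSES:
        return "RTL"
    if bidi_class == "L":
        return "LTR"
    return None  # spaces, digits, punctuation, and so on


def directional_runs(text, default="LTR"):
    """Split text into (direction, substring) runs.

    Simplifying assumption: a neutral character joins the run of the
    strong character that precedes it.
    """
    runs = []
    current = default
    for ch in text:
        current = strong_direction(ch) or current
        if runs and runs[-1][0] == current:
            runs[-1][1].append(ch)
        else:
            runs.append((current, [ch]))
    return [(direction, "".join(chars)) for direction, chars in runs]


# English, then an Arabic word (written with escapes), then English again.
sample = "Hello \u0633\u0644\u0627\u0645 world"
for direction, run in directional_runs(sample):
    print(direction, repr(run))
```

The sample line produces three runs: a left-to-right run, a right-to-left run, and a second left-to-right run.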

The Macintosh script systems, accessed through the Script Manager,

provide the capability to write from right to left, as required by Arabic,

Hebrew, and other bidirectional scripts, to mix right-to-left and left-to-right

directional text within lines and blocks of text, and to use ideographic text.

Your application can add the capability to handle vertical text, if desired.

Three text directions

Contextual Forms

The displayed form, or glyph, that represents a character in printed English

does not usually depend on bordering characters. This is not the case for many

scripts. Even in cursive English, for example, when one letter is joined to the

preceding letter, the connecting line varies according to which letters are

being joined. Characters may also have considerably different shapes depending

on where they occur within a word, for example, at the beginning (initial

form) or elsewhere in the word (noninitial form). The Figure below

illustrates two of these variations in cursive English, which are called

contextual forms.

Contextual forms in cursive English

The ability to represent contextual forms is required for the proper display

of Arabic text. The Figure below shows stand-alone and contextual forms in

Arabic.

Stand-alone and contextual forms in Arabic

Furthermore, certain character forms may be combined into a new form

when they occur together. The Figure below provides an example of how

characters combine to form ligatures or conjunct characters in Roman text.

A ligature in Roman text

The use of ligatures can be highly developed in Arabic text, and some ligatures

are required for the proper display of Arabic text. The Figure below provides

examples of ligatures in Arabic text.

Ligatures in Arabic text

In script systems, context dependence means that character forms may be

modified by the values of preceding and following characters in the input

stream. In Arabic, the displayed form of many characters changes depending on

other characters nearby. Context analysis is usually handled by the script

system under the control of the Script Manager.
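
The relationship between one stored character and its several displayed forms can be inspected in a modern environment. The sketch below is an assumption-laden illustration, not the Script Manager's mechanism: it uses the legacy presentation-form code points that Unicode keeps for compatibility, and Python's unicodedata module, to show that the isolated, initial, medial, and final forms of the Arabic letter beh, and two ligatures, all map back to ordinary underlying characters. The helper name underlying_characters is hypothetical.

```python
import unicodedata


def underlying_characters(presentation_form: str) -> str:
    """Map a compatibility presentation form back to its base character(s)."""
    return unicodedata.normalize("NFKC", presentation_form)


BEH = "\u0628"  # ARABIC LETTER BEH as it would be stored in text

# Four contextual forms of beh from the compatibility block.
contextual_forms = ["\ufe8f", "\ufe91", "\ufe92", "\ufe90"]
for form in contextual_forms:
    base = underlying_characters(form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(base):04X}")

# Ligatures: one Roman, one Arabic. Each stands for two characters.
for ligature in ["\ufb01", "\ufefb"]:
    parts = underlying_characters(ligature)
    codes = " ".join(f"U+{ord(ch):04X}" for ch in parts)
    print(f"U+{ord(ligature):04X} {unicodedata.name(ligature)} -> {codes}")
```

In practice, text stores only the base characters, and choosing the displayed form is left to the layout software.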

Diacritical Marks

Many scripts use diacritical marks, that is, signs that modify the implicit

sound or value of the characters with which they are associated. Some

diacritical marks are often referred to as accents in Roman scripts: the acute

accent in é, for instance. Others, such as certain Vietnamese diacritical marks,

may indicate pitch, while certain Arabic diacritical marks, such as shadda,

specify the doubling of consonants. See Macintosh Worldwide Development:

Guide to System Software for details on diacritical marks available in the

standard Roman character set. With system software version 7.0, routines are

provided that strip diacritical marks.

See Converting Case and Stripping Diacritical Marks for details.
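
As a rough illustration of what stripping diacritical marks involves, the sketch below decomposes each character into a base character plus combining marks and then drops the marks. It is written in Python against Unicode data and is not the system software routine referred to above; the function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, keeping the base characters."""
    # NFD splits a precomposed character into base + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and nonzero for marks.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


print(strip_diacritics("r\u00e9sum\u00e9"))  # acute accents removed
print(strip_diacritics("Vi\u1ec7t"))         # two marks on one vowel removed
```

The same approach removes the Arabic shadda, because it is also encoded as a combining mark. Whether stripping is appropriate depends on the language: in some languages a marked letter is a distinct letter rather than a variant.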

Uppercase and Lowercase Characters

English speakers are familiar with uppercase and lowercase characters in

Roman script; however, the majority of the world's scripts do not have

separate uppercase and lowercase forms. The implications for computer

applications are primarily in the areas of searching, sorting, and proofreading

(for example, spell-checking). With system software version 7.0, there are

routines to perform case conversion. See

Converting Case and Stripping Diacritical Marks for details.

Note: In the Roman script, different languages (and even different

regions or countries that use the same language) have different

conventions for the treatment of accents and diacritical marks on

uppercase characters.
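
The points above can be sketched in a few lines. This is a hedged illustration in Python using Unicode case mappings, not the system software routines; the function names are hypothetical. It shows that case conversion changes text only in scripts that have case, and that uppercasing can either keep or drop accents, which is the kind of convention that differs between languages and regions.

```python
import unicodedata


def has_case_distinction(text: str) -> bool:
    """True if the text changes under case conversion."""
    return text.upper() != text.lower()


def upper_keep_marks(text: str) -> str:
    """Uppercase the text and keep its diacritical marks."""
    return text.upper()


def upper_drop_marks(text: str) -> str:
    """Uppercase the text and drop its diacritical marks."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


samples = {
    "Roman": "Hello",
    "Greek": "\u03b1\u03b2\u03b3",
    "Hebrew": "\u05e9\u05dc\u05d5\u05dd",
    "Japanese": "\u65e5\u672c\u8a9e",
}
for script, text in samples.items():
    print(script, has_case_distinction(text))

word = "\u00e9cole"
print(upper_keep_marks(word), upper_drop_marks(word))
```

A search or sort routine that folds case therefore needs to know which convention the text follows before it compares uppercase forms.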
