An Introduction to Scripts
A script, as used in this section, is a writing system for a human language.
There are about 30 living scripts that are used to represent the official
languages of one or more regions and countries. Examples of writing systems
are Roman, Chinese, Japanese, Hebrew, and Arabic. They all have distinct
attributes. Simple scripts, such as Roman, Greek, and Cyrillic, usually have
fewer than 256 characters; the Japanese script theoretically contains more
than 40,000. The characters of printed Roman are relatively independent of
each other; Arabic characters change shape depending on the characters that
surround them.
Scripts may vary in other attributes: the direction in which their characters
and lines run, the size of the character set used to represent the script, and the
context sensitivity of the script. Some scripts, such as Japanese, actually
include multiple subscripts. (A subscript is a distinguishable subset of
characters that is included within a script. Subscripts in the Japanese script
include Hiragana, Kanji, Katakana, and Romaji.) Each of these attributes
significantly affects the script's representation on the computer, and each is
discussed in the following sections. The Figure below shows notations for the
names of various scripts, languages, and regions in the appropriate script.
Scripts
Character Representation
Scripts differ in the kind and number of characters they require to represent
words. Some scripts are basically alphabetic: the characters in the script
symbolize, more or less, the discrete phonemic elements in the languages
represented by the script. Other scripts, such as Japanese Hiragana and
Katakana, are syllabic: the characters stand for syllables in the language. The
languages that syllabic scripts represent tend to have relatively simple
syllables.
Other scripts (namely, Japanese Kanji, Chinese Hanzi, and Korean
Hanja) include ideographic characters. These do not represent pronunciation
alone, but are also related to the component meanings of words. A typical
character set for ideographic scripts is quite large, ranging from 7,000 to
30,000 characters. Obviously, a standard single-byte encoding (limited to
256 distinct values) cannot be used to represent these characters, nor can a
keyboard be used to enter so many characters directly.
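The arithmetic behind this limit can be sketched briefly. The following Python fragment is only an illustration and is not part of any script system; the function name bytes_needed is hypothetical. It computes the smallest number of whole bytes per character that a fixed-width encoding would need for a character set of a given size.

```python
def bytes_needed(character_count: int) -> int:
    """Return the smallest whole number of bytes per character whose
    value range (256 ** n distinct values) covers the character set."""
    if character_count < 1:
        raise ValueError("a character set needs at least one character")
    width = 1
    while 256 ** width < character_count:
        width += 1
    return width


if __name__ == "__main__":
    # Sizes taken from the text: a simple script, and the stated range
    # for ideographic character sets.
    for size in (256, 7_000, 30_000):
        print(size, "characters need", bytes_needed(size), "byte(s) each")
```

Under this simple fixed-width assumption, any character set larger than 256 characters requires at least two bytes per character.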
The Figure below shows examples of alphabetic, syllabic, and ideographic
character representations.
Alphabetic, syllabic, and ideographic representations of characters
Text Direction
Scripts also vary in the direction in which characters are written. In Roman
scripts, characters are inscribed from left to right, with horizontal lines of
characters written from top to bottom. However, scripts like Arabic and
Hebrew have most characters written from right to left, although the
horizontal lines of text are still written from top to bottom. In Japanese and
Chinese, characters are traditionally written from top to bottom, with vertical
lines (columns) of characters written from right to left. The Figure below
shows three text directions. These three script types (that is, left-right
top-bottom, right-left top-bottom, and top-bottom right-left) are the most
common of the eight possible combinations of character and line directions.
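The count of eight can be checked by enumeration. The short Python sketch below is an illustration only; the names used are hypothetical. It assumes that characters run along one axis and lines advance along the other, so each of the four character directions pairs with the two directions on the perpendicular axis.

```python
from itertools import product

HORIZONTAL = ("left-right", "right-left")
VERTICAL = ("top-bottom", "bottom-top")


def direction_combinations():
    """Return every (character direction, line direction) pair in which
    the two directions lie on perpendicular axes."""
    horizontal_characters = list(product(HORIZONTAL, VERTICAL))
    vertical_characters = list(product(VERTICAL, HORIZONTAL))
    return horizontal_characters + vertical_characters


if __name__ == "__main__":
    for characters, lines in direction_combinations():
        print(characters, lines)
```

The three common script types named above all appear in the resulting list of eight pairs.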
Different scripts can occur in the same line on a screen. Thus a line of text
containing both Arabic and English is actually bidirectional: some characters
go from left to right, and some from right to left.
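To make the idea of a bidirectional line concrete, the following Python sketch splits a string into runs of left-to-right and right-to-left characters. It is an illustration under stated assumptions: it relies on the Unicode bidirectional classes exposed by Python's unicodedata module rather than on the Script Manager, the function names are hypothetical, and it skips direction-neutral characters such as spaces instead of resolving them, so it is far simpler than a complete bidirectional layout algorithm.

```python
import unicodedata


def strong_direction(ch):
    """Return "ltr" or "rtl" for a character with a strong direction,
    or None for neutral characters such as spaces and punctuation."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style letters
        return "rtl"
    return None


def direction_runs(text):
    """Group the strongly directional characters of text into runs.
    Neutral characters are skipped (a simplification)."""
    runs = []
    for ch in text:
        direction = strong_direction(ch)
        if direction is None:
            continue
        if runs and runs[-1][0] == direction:
            runs[-1] = (direction, runs[-1][1] + ch)
        else:
            runs.append((direction, ch))
    return runs


if __name__ == "__main__":
    # English text surrounding an Arabic word (given as escapes).
    mixed = "abc \u0633\u0644\u0627\u0645 xyz"
    for direction, run in direction_runs(mixed):
        print(direction, [hex(ord(c)) for c in run])
```

A line that yields more than one kind of run is bidirectional in the sense described above.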
The Macintosh script systems, accessed through the Script Manager,
let you write from right to left, as required by Arabic, Hebrew, and
other bidirectional scripts; mix right-to-left and left-to-right text
within lines and blocks of text; and use ideographic text.
Your application can add the capability to handle vertical text, if desired.
Three text directions
Contextual Forms
The displayed form, or glyph, that represents a character in printed English
does not usually depend on bordering characters. This is not the case for many
scripts. Even in cursive English, for example, when one letter is joined to the
preceding letter, the connecting line varies according to which letters are
being joined. Characters may also have considerably different shapes depending
on where they occur within a word, for example, at the beginning (initial
form) or elsewhere in the word (noninitial form). The Figure below
illustrates two of these variations in cursive English, which are called
contextual forms.
Contextual forms in cursive English
The ability to represent contextual forms is required for the proper display
of Arabic text. The Figure below shows stand-alone and contextual forms in
Arabic.
Stand-alone and contextual forms in Arabic
Furthermore, certain character forms may be combined into a new form
when they occur together. The Figure below provides an example of how
characters combine to form ligatures or conjunct characters in Roman text.
A ligature in Roman text
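Ligature formation can be pictured as a substitution of one combined form for a sequence of characters. The Python sketch below is a simplified illustration, not a description of how any script system or font implements ligatures. It assumes the Unicode ligature characters U+FB00 through U+FB04 as the combined forms; the table and function names are hypothetical.

```python
import unicodedata

# Character sequences and the single Unicode ligature character assumed
# to stand for each of them in this sketch.
LIGATURES = {
    "ffi": "\ufb03",
    "ffl": "\ufb04",
    "ff": "\ufb00",
    "fi": "\ufb01",
    "fl": "\ufb02",
}


def apply_ligatures(text):
    """Replace each listed sequence with its ligature, longest first,
    so that "ffi" is not consumed as "ff" followed by "i"."""
    for sequence in sorted(LIGATURES, key=len, reverse=True):
        text = text.replace(sequence, LIGATURES[sequence])
    return text


def remove_ligatures(text):
    """Recover the separate characters by compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    displayed = apply_ligatures("office")
    print(len("office"), "characters are shown as", len(displayed), "forms")
    print(remove_ligatures(displayed))
```

Searching and sorting generally work on the separate characters, which is why the sketch also shows the reverse mapping.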
The use of ligatures can be highly developed in Arabic text, and some ligatures
are required for the proper display of Arabic text. The Figure below provides
examples of ligatures in Arabic text.
Ligatures in Arabic text
In script systems, context dependence means that character forms may be
modified by the values of preceding and following characters in the input
stream. In Arabic, the displayed form of many characters changes depending on
other characters nearby. Context analysis is usually handled by the script
system under the control of the Script Manager.
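The following Python sketch shows one way such context analysis might work for Arabic. It is a simplified illustration and not the method used by any script system. It assumes the Unicode Arabic presentation forms (U+FE70 through U+FEFF) as the displayed forms, builds its table from Python's unicodedata module, and ignores ligatures and diacritical marks; the function names are hypothetical.

```python
import unicodedata

FORM_TAGS = {
    "<isolated>": "isolated",
    "<final>": "final",
    "<initial>": "initial",
    "<medial>": "medial",
}


def build_form_table():
    """Map each base Arabic letter to its available contextual forms,
    read from the decomposition data of the presentation-form block."""
    table = {}
    for code_point in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(code_point)).split()
        # Keep only entries of the shape "<tag> XXXX" (one base letter).
        if len(parts) == 2 and parts[0] in FORM_TAGS:
            base = chr(int(parts[1], 16))
            table.setdefault(base, {})[FORM_TAGS[parts[0]]] = chr(code_point)
    return table


def shape(text, table=None):
    """Choose a displayed form for each letter from its neighbors.
    The text is in logical (typing) order; other characters pass through."""
    table = table or build_form_table()
    shaped = []
    for i, ch in enumerate(text):
        forms = table.get(ch)
        if forms is None:
            shaped.append(ch)
            continue
        prev_forms = table.get(text[i - 1], {}) if i > 0 else {}
        next_forms = table.get(text[i + 1], {}) if i + 1 < len(text) else {}
        # A letter joins forward only if it has an initial form, and
        # joins backward only if it has a final form.
        joins_prev = "initial" in prev_forms and "final" in forms
        joins_next = "initial" in forms and "final" in next_forms
        if joins_prev and joins_next:
            name = "medial"
        elif joins_prev:
            name = "final"
        elif joins_next:
            name = "initial"
        else:
            name = "isolated"
        shaped.append(forms.get(name, ch))
    return "".join(shaped)


if __name__ == "__main__":
    beh = "\u0628"
    for ch in shape(beh * 3):
        print(unicodedata.name(ch))
```

The same stored character (here, the letter beh) is displayed in initial, medial, or final form depending only on the characters before and after it.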
Diacritical Marks
Many scripts use diacritical marks, that is, signs that modify the implicit
sound or value of the characters with which they are associated. Some
diacritical marks are often referred to as accents in Roman scripts: the acute
accent in é, for instance. Others, such as certain Vietnamese diacritical marks,
may indicate pitch, while certain Arabic diacritical marks, such as shadda,
specify the doubling of consonants. See Macintosh Worldwide Development:
Guide to System Software for details on diacritical marks available in the
standard Roman character set. With system software version 7.0, routines are
provided that strip diacritical marks.
See Converting Case and Stripping Diacritical Marks for details.
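As a general illustration of what stripping diacritical marks involves, the Python sketch below separates each character into a base character and its marks, then discards the marks. It is not the system software routine mentioned above. It assumes Unicode text and Python's unicodedata module, and the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed.
    Each character is first decomposed into a base character followed
    by its marks (normalization form NFD); the marks are then dropped."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("r\u00e9sum\u00e9", "Vi\u1ec7t"):
        print(word, "becomes", strip_diacritics(word))
```

Characters that do not decompose into a base character plus marks are left unchanged by this approach.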
Uppercase and Lowercase Characters
English speakers are familiar with uppercase and lowercase characters in
Roman script; however, the majority of the world's scripts do not have
separate uppercase and lowercase forms. The implications for computer
applications are primarily in the areas of searching, sorting, and proofreading
(for example, spell-checking). With system software version 7.0, there are
routines to perform case conversion. See
Converting Case and Stripping Diacritical Marks for details.
Note: In the Roman script, different languages (and even different
regions or countries that use the same language) have different
conventions for the treatment of accents and diacritical marks on
uppercase characters.
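The effect of case on searching can be sketched in a few lines of Python. This is an illustration only and not the system software case-conversion routines. It uses Python's built-in Unicode case mappings, which are not sensitive to language or region and therefore do not model the differing conventions described in the note; the function names are hypothetical.

```python
def has_case(ch):
    """Return True if the character has distinct uppercase and
    lowercase forms under the default Unicode case mappings."""
    return ch.upper() != ch.lower()


def equal_ignoring_case(first, second):
    """Compare two strings for a search, ignoring case differences."""
    return first.casefold() == second.casefold()


if __name__ == "__main__":
    samples = {
        "Roman a": "a",
        "Cyrillic zhe": "\u0416",
        "Hebrew alef": "\u05d0",
        "Arabic beh": "\u0628",
        "Kanji": "\u6f22",
    }
    for label, ch in samples.items():
        print(label, "has case:", has_case(ch))
    print(equal_ignoring_case("Script", "SCRIPT"))
```

For scripts without case, case conversion leaves the text unchanged, so a case-insensitive comparison behaves like an exact one.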