All Databases Inside Mac - WWSO Mgr

Defining Word Boundaries and Line Breaks

This section describes how the NFindWord procedure uses state machines

and associated tables to determine word boundaries and line breaks.

The NFindWord procedure examines a block of text to determine the

boundaries of the word that includes a specified character in the block.

Usually, NFindWord uses different state tables to define words for word

selection and words for word wrap (line breaking).

Note: NFindWord considers offsets within a block of text to be

positions between characters-for example, an offset of 1 in Roman

text is between the first and second characters (or on the trailing

edge of the first and on the leading edge of the second).

NFindWord uses a state machine to determine word boundaries. The state

machine must start at a point at or before the beginning of the word that

includes the specified character. This can be accomplished in two ways. First,

if the specified character is sufficiently close to the beginning of the text

buffer, the state machine simply starts from the beginning of the buffer. This

is determined by the doBackupMin parameter in the tables: if the offset

parameter is less than doBackupMin, the state machine starts from the

beginning. Otherwise, NFindWord uses a second state table, BackwardTable,

for backward processing. With BackwardTable, NFindWord starts at the

specified character, moving backward as necessary until it encounters a word

break.

Once determined, this starting point is saved as an initial word break location.

From this point, the NFindWord state machine moves forward using

ForwardTable until it encounters another word break. If that word break is

still before the specified character, its location is saved as the starting point,

and the state machine is restarted from that location. This process repeats

until the state machine finds a word break that is after the specified character.

At that point, NFindWord returns the last saved word break location and the

current word break location as the offset pair defining the word boundaries.

The state machine operates in a similar manner whether moving backward or

forward; any differences in behavior are determined by the tables. The

machine begins in the start state (state 1). It then cycles one character at a

time until it finds a word break and exits. Each cycle proceeds as follows: the

current character is mapped to a class number. The character class and the

current state are used as indices into a two-dimensional array of byte- length

action codes. Each action code specifies the following:

• whether to mark the current offset

• the next state, which may be the exit state (state 0)

When the state machine exits, it has encountered a word break. The location of

the word break is the last character offset that was marked. In general, the

state machine marks a character offset when it determines that the word that

began at its starting point extends at least to the marked offset.

The Figure below gives two examples of the forward operation of the state

machine for word selection. In each case, the state at a given offset and the class

of the character following the offset determine (1) whether to mark that

offset, (2) whether to exit at that point, and, if not exiting, (3) what the state

at the next offset will be. When the state machine exits, the first and last

marked offsets are returned as the word boundaries.

Forward operation of the state machine for word selection

Mapping characters to classes is simple for single-byte characters, but the

process gets a little more involved for double-byte character codes. The byte

value at the current character offset is used as an offset into the ClassTable

array, an array of 256 signed bytes. If the value in ClassTable is positive, it

signifies that the byte at the current character offset is a single-byte

character, and the value in ClassTable is the class number for the character. If

the value in ClassTable is negative, it signifies that the byte at the current

character offset is the first byte of a double-byte character code, and

AuxClassTable must be used to determine the character class.

The AuxClassTable begins with a variable-length word array. The first word

contains a default class number for double-byte character codes. The following

words are offsets to RowTables, which have the same format as ClassTable, but

are used by NFindWord for mapping the second byte of a double-byte

character code to a class number. If the value in ClassTable was -1 (or any

odd, negative number), the double-byte character code is assigned the default

class from the first word of AuxClassTable. For other double-byte characters,

the value in ClassTable is an even negative number; NFindWord negates this

value to provide an offset from the beginning of AuxClassTable to the

appropriate RowTable offset. The RowTable table specified by this offset in this

way is used to map the second byte of the character to a class number.

Note: There is a maximum of 128 classes and 128 states

(including the start and exit states).

The Figure below provides a de scription of the new break table. Note that the

high bit of the first word is set to indicate that this table is in the new format;

otherwise, FindWord assumes that the tables are in the old format. The new

break table begins with an 8-word header, followed by the class and state

tables.

NFindWord header and class tables

The Figure below shows the NFindWord state table. It begins with a list of

words containing byte offsets from the beginning of the state table to the rows

of the state table; this is followed by a C-by-S byte array, where C is the

number of classes and S is the number of states. The bytes in this array are

stored with the column index varying most rapidly; that is, the bytes for the

State 1 row precede the bytes for the State 2 row. Each byte in this array is an

action code whose format is defined in the figure below.

NFindWord state table

The Figure below shows the format of an action code.

Format of NFindWord action code

Referenced by (3):