All Databases MacTech Vol 04-1988

XCMD Import Text

Volume Number: 4

Issue Number: 10

Column Tag: HyperChat®

XCMD Corner

By Donald Koscheka, Apple Computer, Inc.

Importing Text into Hypercard

A new controversy seems to be emerging in the Hypercard community. Some

Hypercard pundits are discouraging the use of XCMDs and XFCNs in stack design. Their

most convincing argument is that those of us who jump into writing XCMDs aren’t

giving ourselves an opportunity to see if HyperTalk can perform the task, perhaps

just as well as an XCMD.

I frequently consider writing an XCMD solution to a programming problem

without first considering whether HyperTalk can do the same job for me. Recently, I

needed to import Microsoft WORD files into Hypercard. What a wonderful opportunity

to write an XCMD!

When I sat down to write the script to invoke the XCMD, I realized that I could

write the entire WORD import routine in HyperTalk. Ed Wischmeyer of Apple

Computer Inc. pointed out that although fields in HyperTalk prefer to see straight ASCII

text, there is no such restriction on the contents of containers. Hypercard also allows

you to open and read any file type you want; you aren’t restricted to reading text files.

Of course, you need to figure out how to translate what’s in that container into a

format that can be presented in a field.

The hard part of importing text from a Word file is not reading the data into

Hypercard but rather figuring out how Word stores its text. By committing the import

code to a simple HyperTalk script, I could concentrate my efforts on decoding Word’s

file format.

To simplify my search through the file format, I made the assumption that I could

ignore any formatting information such as rulers, font and style changes. All I was after

was the text portion of the file. This turns out to be a valid assumption since I

wanted to import the file into a Hypercard field as text.

Finding the text was a snap with John Mitchell’s “FEDIT+”. I created a Word file

using WORD and then examined it in FEDIT+. I noticed that the text always started at

location 256 in the file. Since the size of the file was larger than the size of the text

plus this 256 byte header, I needed to determine where the end of text occurred

(assuming that the formatting and ruler information follows the text in the file). Since

I knew how long the text was, I again used FEDIT+ to search the 256-byte header portion of

the file. This time I was looking for any portion of the header that contained a count of

the number of bytes in the text. Since I knew that my file contained exactly 100

characters (bytes), all I had to do was find this number somewhere in the header

portion of the file. I found something close to what I was looking for at offset 16 in the

file. This location corresponded to the number of characters in the text portion of the

file plus 256 which was the length of the header.

The creators of Microsoft Word may be reading this and wondering why I’m

assuming that the text size is a 16-bit entity rather than a 32-bit number. I’m not.

Since Hypercard text fields are currently limited to 32K bytes, and since I knew none

of my Word files were longer than this, I’m only interested in the low-order word of

the text length.

Reading the text portion of a Microsoft Word file into a Hypercard container

requires the following steps: (1) Position the mark at byte 16 of the file. (2) Read the

byte at this position and multiply it by 256, making it the high-order half of the file

length. (3) Read the next byte and add it to the high-order half of the length. (4) Move 238

more bytes into the file (16+2+238 = 256). This is the start of the text portion of

the file. (5) Read the number of bytes calculated minus 256. The IT container gets the

imported text.

The HyperTalk script in Listing 1 performs the above steps for importing up to

16K bytes of text from a Word file. I use the filename function called in the script to display WORD files only in the GetFile

dialog and to get the full pathname of the file from the user. This script reads in the

text without any looping so an XCMD may not speed things up enough to be warranted.

{1}
on mouseup
  put filename("WDBN") into filename
  if filename is not empty then
     -- filename is the full pathname of a WORD file
     open file filename
     -- move file mark to the text length word
     read from file filename for 16
     -- read the upper half of the length
     read from file filename for 1
     -- shift up by 8 bits
     put chartonum( it ) * 256 into filesize
     -- get the lower half of the length
     read from file filename for 1
     add chartonum( it ) mod 256 to filesize
     -- move to start of text in the file
     read from file filename for 238
     -- read in the text; IT now contains the imported data
     read from file filename for filesize-256
     close file filename
  end if
end mouseup

Listing 1. Script to Import Text from a Microsoft Word File
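For readers who think more easily in C, the header arithmetic used by the script can be sketched as follows. This is only an illustration of the same steps, not part of the original listing: the function name word_text_span and the idea of working on a file that has already been read into memory are assumptions made for the example. Like the script, it treats bytes 16-17 as a 16-bit value with the high-order byte first.

```c
#include <stddef.h>

#define WORD_HEADER_SIZE   256u /* text starts at offset 256 (per the article) */
#define WORD_LENGTH_OFFSET 16u  /* low-order word of (text length + 256) */

/*
 * Hypothetical helper: given the raw bytes of a Word file, report where the
 * text starts and how many bytes of text there are.
 * Returns 0 on success, -1 if the buffer is too short or the stored
 * length does not fit inside the buffer.
 */
int word_text_span(const unsigned char *file, size_t file_len,
                   size_t *text_start, size_t *text_len)
{
    unsigned end_of_text;

    if (file == NULL || file_len < WORD_HEADER_SIZE)
        return -1;

    /* High byte first, then low byte, as in the HyperTalk script. */
    end_of_text = ((unsigned)file[WORD_LENGTH_OFFSET] << 8)
                | (unsigned)file[WORD_LENGTH_OFFSET + 1];

    if (end_of_text < WORD_HEADER_SIZE || end_of_text > file_len)
        return -1;

    *text_start = WORD_HEADER_SIZE;
    *text_len   = end_of_text - WORD_HEADER_SIZE;
    return 0;
}
```

For the 100-character test file described earlier, bytes 16-17 would hold 356 ($0164), so the sketch reports a text start of 256 and a text length of 100.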

Not all file formats can be imported quite so simply. MacWrite uses a packed text

format, storing one or two characters per byte using a simple compression scheme.

Because the text is compressed, we can’t just read the file into a container and

return the result to Hypercard. We must first decompress the file a byte at a time.

Such a process suggests looping and loops, as we know, are not particularly fast in

HyperTalk. Although the decompression can be performed in a HyperTalk script, we

can write an XCMD that performs the decompression faster.

The key to reading in a MacWrite file is understanding that MacWrite stores its

data by paragraph. Whereas Word files are clearly divided between the text and

formatting information, MacWrite stores formatting information for each paragraph at

the end of the text for that paragraph. Hypercard doesn’t do formatted text; we want to

ignore the formatting information at the end of each paragraph. Our algorithm then

becomes a loop that reads in a paragraph at a time and decompresses the text for that

paragraph, ignoring the formatting information. This process is repeated for each

paragraph in the file.

One small “gotcha” to this approach stems from the fact that rulers and pictures

are also considered paragraphs. When we encounter either of these objects, we just

move on to the next paragraph.

Listing 2 depicts the code for this XFCN. I chose “C” because pointer arithmetic

is easier to perform in “C” and because last month’s example was written in Pascal. I

made every attempt to keep the “C” isomorphic to a Pascal program so that you can

easily convert the code to Pascal.

Finding the paragraph information in the file requires a little arithmetic.

Bytes 2-3 in the file tell us how many paragraphs the main document contains

(MacWrite makes a distinction between the main document, the header document and

the footer document. For our purposes, we only want to read in the main body of text).

If bytes 2-3 contain a 5 then there are 5 paragraphs in the main document.

For each paragraph, MacWrite stores an information array. We start reading the

information arrays at the file position pointed to in file offset $108. An information

array is an array of 16-byte elements that tell us something about each paragraph.

The first two bytes in the information array tell us whether the paragraph contains

text, a ruler or a picture. If this value is positive, the paragraph contains text; if this

value is 0 or negative, the paragraph is a ruler or a picture respectively, and we can

ignore it.

Offset 8 in the information array contains a status byte that provides some

information about the text. If bit 3 is set, the text in this paragraph is compressed.

Bytes 9-11 tell us the absolute file offset for the start of the data in the paragraph and

bytes 12-13 contain the length of the data (paragraph addressing is 24 bits and each

paragraph contains up to 64K of characters or data). The trick is to read in the

number of characters indicated in the information array, determine if the paragraph

contains text and, if so, decompress the text if it’s compressed.
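The layout of one information array element can be sketched in C as below. This is a minimal illustration built only from the offsets given above. The type MWParaInfo and the function mw_parse_info are hypothetical names, and two assumptions are made that the text does not spell out: multi-byte values are stored high-order byte first (as is usual on the 68000), and "bit 3" counts from the least significant bit, giving a mask of $08.

```c
#include <stdint.h>

#define MW_INFO_SIZE 16 /* each information array element is 16 bytes */

/* Hypothetical summary of the fields this article cares about. */
typedef struct {
    int      is_text;       /* nonzero when the type word is positive */
    int      is_compressed; /* status byte (offset 8), bit 3 */
    uint32_t data_offset;   /* bytes 9-11: 24-bit absolute file offset */
    uint16_t data_length;   /* bytes 12-13: length of the paragraph data */
} MWParaInfo;

/* Decode one 16-byte element; e must point at MW_INFO_SIZE readable bytes. */
MWParaInfo mw_parse_info(const unsigned char *e)
{
    MWParaInfo info;

    /* Bytes 0-1: signed 16-bit type word, assumed high byte first. */
    long kind = ((long)e[0] << 8) | (long)e[1];
    if (kind >= 0x8000L)
        kind -= 0x10000L;

    info.is_text       = (kind > 0);          /* 0 = ruler, negative = picture */
    info.is_compressed = (e[8] & 0x08) != 0;  /* assumed mask for "bit 3" */
    info.data_offset   = ((uint32_t)e[9] << 16)
                       | ((uint32_t)e[10] << 8)
                       |  (uint32_t)e[11];
    info.data_length   = (uint16_t)(((unsigned)e[12] << 8) | (unsigned)e[13]);
    return info;
}
```

To walk the whole array, a caller would step through it MW_INFO_SIZE bytes at a time, once per paragraph, and skip any element whose is_text field is zero.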

Once we read in the paragraph, we get some more information. The first two

bytes of the paragraph tell us how many characters of text will appear in the

decompressed paragraph. Following the text on an even word boundary is the

formatting information for the paragraph which we ignore in this example.

MacWrite’s text compression is based on a letter frequency scheme stored as STR

resource #700 in MacWrite’s resource fork. For English, this string contains

“ etnroaisdlhcfp” (with a leading space). MacWrite maps these characters onto the array [$0..$F]. The space

character ($20) has a value of 0, letter “e” has a value of 1, “t” a value of 2 and so

on. Since any number up to $F fits into a nibble, the word “eels” can be

represented as “$11A8” rather than the byte-wide representation of $65656C73. In

this example, we realize a 50% space saving (the best case for this algorithm).
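The best case can be sketched in C as follows. This is an illustrative packer, not MacWrite's own code: the name mw_pack_simple is hypothetical, it handles only characters that appear in the frequency string, and it pads an odd final nibble with 0, which is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Frequency string from the article: space, then the 14 commonest letters. */
static const char MW_FREQ[] = " etnroaisdlhcfp";

/*
 * Hypothetical packer for the simple case only: every character must be in
 * MW_FREQ. Two characters are packed per byte, first character in the high
 * nibble. Returns the number of bytes written, or -1 if a character is not
 * in the table or the output buffer is too small.
 */
long mw_pack_simple(const char *text, unsigned char *out, size_t out_cap)
{
    size_t n = strlen(text);
    size_t needed = (n + 1) / 2;
    size_t i;

    if (needed > out_cap)
        return -1;
    memset(out, 0, needed);

    for (i = 0; i < n; i++) {
        const char *p = strchr(MW_FREQ, text[i]);
        unsigned code;

        if (p == NULL)
            return -1; /* would need the escape mechanism instead */

        code = (unsigned)(p - MW_FREQ);
        if (i % 2 == 0)
            out[i / 2] = (unsigned char)(code << 4);
        else
            out[i / 2] |= (unsigned char)code;
    }
    return (long)needed;
}
```

Packing "eels" with this sketch yields the two bytes $11 and $A8, matching the example in the text.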

This compression scheme only works for lower-case letters since 4 bits is not

enough information to code for letter frequency and case for the 14 most popular

letters. This scheme also doesn’t compress non-alphabetic characters such as

numerals and punctuation marks. In these cases, the 16th array element, $F, is used

as a flag to indicate that the next 2 nibbles represent one character. “Then”

would be coded as $F54B13 ($F flags the literal $54 for “T”, then h = $B, e = $1, n = $3). Note that the letter “T” crosses byte boundaries: the top

nibble is in byte 0 and the lower nibble is in byte 1. This is of no consequence to the

algorithm.
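A decoder for this scheme might look like the sketch below. It is written from the description above rather than taken from the XFCN, so the names mw_unpack and nibble_at are hypothetical, and it assumes the caller already knows the decompressed character count (the first two bytes of the paragraph) and supplies an output buffer with room for that many characters plus a terminating zero.

```c
#include <stddef.h>

/* Frequency string from the article; nibble $F is the escape flag. */
static const char MW_TABLE[] = " etnroaisdlhcfp";

/* Fetch nibble k from packed data: even k is the high nibble of a byte. */
static unsigned nibble_at(const unsigned char *in, size_t k)
{
    unsigned char b = in[k / 2];
    return (k % 2 == 0) ? (unsigned)(b >> 4) : (unsigned)(b & 0x0F);
}

/*
 * Hypothetical decoder: expand up to n_chars characters from in[0..in_len).
 * out must have room for n_chars + 1 bytes. Returns the number of
 * characters actually produced.
 */
size_t mw_unpack(const unsigned char *in, size_t in_len,
                 size_t n_chars, char *out)
{
    size_t total = in_len * 2; /* nibbles available */
    size_t k = 0;              /* current nibble index */
    size_t produced = 0;

    while (produced < n_chars && k < total) {
        unsigned code = nibble_at(in, k++);

        if (code != 0xF) {
            out[produced++] = MW_TABLE[code];
        } else {
            unsigned hi, lo;

            if (k + 2 > total)
                break; /* truncated escape sequence */
            hi = nibble_at(in, k);
            lo = nibble_at(in, k + 1);
            k += 2;
            out[produced++] = (char)((hi << 4) | lo);
        }
    }
    out[produced] = '\0';
    return produced;
}
```

With this sketch, the bytes $F4 $13 $20 and a character count of 3 decode to "Ant". The trailing nibble is ignored, which is why the stored character count matters. Working a nibble at a time also means an escaped character that straddles a byte boundary needs no special handling.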

Armed with this information, you should have little trouble understanding the

XFCN. In fact, I hope you find it useful and informative! (Next month: printing from

XCMDs).

{2}
/*************************
* file:  MWRead.c *
*  *
* an XFCN that imports text  *
* directly from a MacWrite file *
* whose full pathname is passed *
* as an input parameter. *
* *
* -------------------------------- *
* To Build this file: *
* *
* C -q2 -g MWRead.c *
* *
* link -sn Main=MWRead ∂ *
* -sn STDIO=MWRead ∂ *
* -sn INTENV=MWRead ∂ *
