LZW Compression
Volume Number: 6
Issue Number: 10
Column Tag: Pascal Procedures
LZW Compression/Decompression 
By Dennis R. Cohen, Santa Clara, CA
Adaptive Lempel-Ziv-Welch Compression/Decompression
Data compression is a topic that becomes of interest to programmers (and the
people who hire us) as our programs get larger and the data sets they manipulate grow.
One of the most prevalent examples of data compression for most of us is compressing
and archiving files either for backup purposes or for transmission via modem or
network. The most common Mac programs for this are StuffIt (by Raymond Lau) and
PackIt (by Harry Chesley). PackIt uses a technique known as Huffman encoding;
StuffIt employs Lempel-Ziv-Welch (LZW, for short), Huffman, or
Run-Length Encoding (RLE), based upon the characteristics of the input data. While it
is not necessarily the most efficient technique for all files, LZW has proven in practice
to be the most efficient general technique and is the subject of this article.
The seminal discussion concerning LZW compression may be found in:
A Technique for High-Performance Data Compression
by Terry A. Welch
IEEE Computer, June 1984 (vol. 17, no. 6)
Since Macintosh files are really two files (resource fork and data fork) and have
various other pieces of associated data, we start each compressed file with a header
record which contains the original name of the file, the size of each fork, and the
Finder information (creator, file-type, etc) so that we will be able to reconstruct an
indistinguishable copy of the original file.
LZW operates by finding common “substrings” and replacing them with a
fixed-length (frequently 12-bit) code. This technique is deterministic and is
accomplished during a single pass over the input file. The decompression procedure
needs no input table, and rebuilds the table as it goes. A 12-bit code means that there
are only 4K possible codes. We’ll be using a 14-bit code and thus allow 16K possible
encodings before our string table fills up. The table is initialized to contain encodings
of the 256 possible single-byte values.
The decompression algorithm translates each received code into a prefix string
and an extension byte. The extension byte is pushed onto a stack and the prefix
translated again, cycling until the prefix is atomic. The entire (decompressed) code is
then output by popping the stack until it is empty.
Due to the nature of the algorithm, the first code encountered is known to be of a
single byte (atomic) and can be directly converted. An update to the string table is
made for each code received after the first one, which is known to already be in the
table from the initialization. When a code has been translated, its final byte is used as
the suffix and combined with the prior string to add the new entry to the string table so
long as the string table hasn’t been filled. This new string is assigned a new value that
is the same code the compressor assigned to that string. In this manner, the
decompressor incrementally reconstructs the same string table that the compressor
used. Unfortunately, there is an abnormal case for which this algorithm does not
work. The abnormal case occurs whenever an input character string containing a
cyclic sequence of the form byte-code-byte-code-byte already appears in the
compressor string table. The decompression algorithm is modified (see step 3 of the
decompression description, below) to handle the special case.
To Compress
1. Initialize the table to contain all single-byte strings
2. Read the first byte, c, from the input file. Set the prefix, w, to that byte.
3. Read the next input byte into c.
4. If at end of file then go to step 7.
5. If wc is in the string table then set w to wc and go to 3.
6. Output Code(w); Put wc in the string table; Set w to c; go to 3
7. Output Code(w) and signify completion
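The compression steps above can be sketched as follows. This is a Python illustration rather than the article’s Pascal, and it uses a dictionary in place of the hashed string table so the shape of the loop stays visible:

```python
def lzw_compress(data: bytes, max_codes: int = 16384):
    # Step 1: the table starts with all 256 single-byte strings;
    # code 256 is reserved as the adaptive clear code, so new
    # strings are assigned codes starting at 257.
    table = {bytes([i]): i for i in range(256)}
    next_code = 257
    out = []
    w = b""                      # current prefix (steps 2-3 fold into the loop)
    for b in data:
        c = bytes([b])
        wc = w + c
        if wc in table:          # step 5: grow the prefix
            w = wc
        else:                    # step 6: emit Code(w), record wc, restart at c
            out.append(table[w])
            if next_code < max_codes:
                table[wc] = next_code
                next_code += 1
            w = c
    if w:
        out.append(table[w])     # step 7: flush the final prefix
    return out
```

For example, lzw_compress(b"abababab") yields [97, 98, 257, 259, 98]: the repeated “ab” and “aba” substrings are replaced by single codes.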
To Decompress
1. Read the first input code and store it in both code and oldcode. Since the first
code is known to be atomic (cf. compression step 2, above), it encodes a single byte
c; output c and set the extension to c.
2. Read the next code and save the value in incode. If at end of file then go to 8.
3. If the code is not in the string table, we have the special case, so: output the
extension, set code to oldcode, and set incode to the code for wc.
4. If the retrieved code is the code for wc then push c onto the stack, set code to
code of w and repeat this step.
5. If the code is atomic, output the byte it represents and set the extension to that
byte.
6. While the stack is non-empty, output the byte on top and pop the stack.
7. Put the string oldcode followed by c into the string table, set oldcode to incode, and go to 2.
8. Signify completion.
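A matching Python sketch of the decompression steps follows. For brevity it stores each table entry as a complete byte string, so the explicit stack of extension bytes (steps 4 and 6) collapses into outputting the stored string; the step-3 special case appears as the code that equals the next unassigned table slot:

```python
def lzw_decompress(codes, max_codes: int = 16384):
    # Mirror of the compressor's table initialization.
    table = {i: bytes([i]) for i in range(256)}
    next_code = 257
    it = iter(codes)
    first = next(it)
    out = bytearray(table[first])   # step 1: the first code is always atomic
    old = table[first]
    for code in it:
        if code in table:
            entry = table[code]
        elif code == next_code:     # step 3: the byte-code-byte special case
            entry = old + old[:1]
        else:
            raise ValueError("corrupt input")
        out += entry                # steps 4-6: emit the whole translated string
        if next_code < max_codes:   # step 7: add old string + first byte of entry
            table[next_code] = old + entry[:1]
            next_code += 1
        old = entry
    return bytes(out)
```

Feeding it the codes from the compression example reproduces the original input: lzw_decompress([97, 98, 257, 259, 98]) returns b"abababab", and code 259 exercises the special case.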
In the case of adaptive LZW (which is demonstrated by the source in this article)
the compression is tested periodically, in our case every 10000 bytes, to see whether
the compression ratio has declined. If it has, a reserved code is saved to the output
file, the string table is reinitialized, and we start the process again. Obviously,
during decompression, if the reset code is encountered then the program reinitializes
the string table and then continues blithely with its processing.
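The periodic ratio test can be sketched as a small helper. The constant names mirror clearCode and checkGap in the listing, but the bookkeeping here is illustrative, not the article’s Pascal:

```python
CLEAR_CODE = 256    # reserved code written to the output on a reset
CHECK_GAP = 10000   # re-check the ratio every 10000 input bytes

def check_adaptive(bytes_read, bytes_written, checkpoint, last_ratio):
    """Return (emit_clear, next_checkpoint, new_ratio)."""
    if bytes_read < checkpoint or bytes_written == 0:
        return False, checkpoint, last_ratio  # not at a checkpoint yet
    ratio = bytes_read / bytes_written
    if ratio < last_ratio:
        # Compression has declined: the caller emits CLEAR_CODE and
        # reinitializes the string table, and the ratio starts over.
        return True, bytes_read + CHECK_GAP, 0.0
    return False, bytes_read + CHECK_GAP, ratio
```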
A hashing algorithm is employed to determine the code for a string (prefix +
suffix byte). In the example program, I just shift and xor the two values. This is a
very simplistic hash evaluation and will result in a lot of collisions; however, a more
efficient hashing algorithm could introduce additional processing overhead in its
computation. One such possibility is included as a comment. The choice of hash
algorithm is yours, just be sure that you use the same one for both compression and
decompression.
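A shift-and-xor hash of the kind described might look like the sketch below. The shift amount is illustrative (the Pascal listing shows the actual expression), and collisions would be chained through the next field of the table entries:

```python
TABLE_MASK = 0x3FFF   # maxTab: a 16K-entry table for 14-bit codes

def hash_code(prefix: int, suffix: int) -> int:
    # Fold the prefix code and the suffix byte into a table index.
    return ((prefix << 5) ^ suffix) & TABLE_MASK
```

Because the result is masked to 14 bits, every string maps somewhere in the table; the simplicity of the mixing is what produces the collisions mentioned above.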
There are a lot of places where this program could be optimized; however, most
of the optimizations are at the expense of clarity and are avoided here.
The major difference between this straightforward presentation of adaptive LZW
and that used by Raymond Lau in his StuffIt program follows. StuffIt uses the UNIX
compress code to do its compression, with 14-bit compression being selected. The UNIX
compress code uses up all the 9-bit codes first, then the 10-bit codes, etc. until it has
used up the 14-bit codes, meanwhile packing them in VAX byte/bit order. This
involves a great deal of masking and shifting that obscures the general algorithm. We
are using a general string table into which we hash and always store 14-bit values in
the “natural” 680x0 byte order, although we do buffer our partial bytes and pack the
output data.
While our example is for compressing files, it could be modified to compress and
decompress structures in memory by replacing the GetByte routine on the
compression side and the PutByte routine on the decompression side. All the
compression code cares about is that it is fed data to be compressed one byte at a time,
not where the data came from. Similarly, all the decompression code cares about is
that it be fed one-byte chunks of data to be decompressed and that it will spew the
result out a byte at a time.
The example programs could also be consolidated into a single application with
the introduction of a less modal (and more Macish) interface, but that is not the
purpose of this article. I used TMLPascal II (vers. 3.0) for this article but have also
recompiled with both MPW 3.0 and Think Pascal 2.01; Think Pascal requires some
changes in the area of compiler directives, USES statements, and references to
_DataInit; the MPW Pascal compiler does not require any changes to the source (just
the makefile). Turbo Pascal would require a lot of changes in the source due to a
number of incompatible features (OR, AND, SHL, etc. for bit operations, for example).
Listing: LComp.Proj
#############################################################
#
# Created: Monday, October 10, 1988 4:08:42 PM
#
# Project Name: LComp
# Project Type: Application
#
# Finder Type: APPL
# Finder Creator: ????
# Set Bundle Bit: TRUE
#
# Project Source Files:
# LComp.r
# lcomp.p
lcomp.p.o ƒ lcomp.p
	TMLPascal lcomp.p

LComp ƒƒ lcomp.p.o
	Link -w -t 'APPL' -c 'LZWC' ∂
		lcomp.p.o ∂
		"{Libraries}"Runtime.o ∂
		"{Libraries}"Interface.o ∂
		"{TMLPLibraries}"PasLib.o ∂
		-o LComp

LComp ƒƒ LComp.r
	Rez -append -o LComp LComp.r
SetFile -a B LComp
Listing: LComp.p
{$R-}
{$D+}
{$DEFC DEBUG}
{$SETC DEBUG=TRUE}
PROGRAM LComp;
{ Simple case LZW compression }
USES
MemTypes,
QuickDraw,
OSIntf,
ToolIntf,
PackIntf;
CONST
maxBuff = 8192; {i/o buffer size}
maxTab = 16383; {Table size minus 1 ($3FFF)}
noPrev = $7FFF;
eofChar = -2;
endList = -1;
empty = -3;
clearCode = 256; {Reserved code to signal adaptive reset ($100) }
checkGap = 10000; {How frequently do we check for adaptive?}
TYPE
StringTableEntry = RECORD
prevByte: Integer;
follByte: Integer;
next: Integer;
used: Boolean;
reserved: Boolean;
END;
StringTableArray = ARRAY [0..maxTab] OF StringTableEntry; {128K structure unless packed}
StringTablePtr = ^StringTableArray;
IntPtr = ^Integer;
Buffer = PACKED ARRAY [1..maxBuff] OF Char;
BufPtr = ^Buffer;
HeaderRecord = RECORD
name: String[31];
dfSize: LongInt;
rfSize: LongInt;
fndrInfo: FInfo;
END;
Remainder = (none, sixBit, fourBit, twoBit);
VAR
inRef: Integer; {Reference number of input file}
outRef: Integer; {Reference number of output file}
inVRefNum: Integer;{Volume/WD reference num. of input file}
outVRefNum: Integer; {Volume/WD reference number of output file}
eofSignal: Boolean; {Flag that it’s time to clean up}
inBufSize: LongInt; {Count of characters in input buffer }
inputPos: Integer; {Position in input buffer}
outputPos: Integer; {Position in output buffer}
bytesRead: LongInt; {Total bytes read from input file}
bytesWritten: LongInt; {Total bytes written to output file}
ratio: Extended;{Compression ratio (bytesRead/bytesWritten)}
checkPoint: LongInt; {Next time we check whether table adaptation is necessary}
inputBuffer: BufPtr; {Dynamically allocated data storage}
outputBuffer: BufPtr; { “ }
stringTable: StringTablePtr;
infileName: Str255; {Name of the file we’re compressing}
tableUsed: Integer;{Number of entries currently in string table}
outputCode: Integer; {Code (14-bit) that we’re going to output}
carryOver: Remainder; {How many bits we have in the code we’re building}
doingDFork: Boolean; {Flag that tells which fork of the file we’re compressing}
fsErr: OSErr; {Result of last file system call}
dataForkSize: LongInt; {Number of bytes in data fork}
rsrcForkSize: LongInt; {Number of bytes in resource fork}
progWindow: WindowPtr; {Window where we display progress}
boundsRect: Rect; {Bounding rect of the progress window}
hdrRec: HeaderRecord; {File information so that decompress will get things right}
resetCode: Integer; {This is the hashCode for clearCode}
PROCEDURE _DataInit; EXTERNAL; {MPW specific}
PROCEDURE FileAlert(str: Str255);
CONST
fsAlert = 1111;
VAR
item: Integer;
BEGIN
ParamText(str, '', '', '');
item := StopAlert(fsAlert, NIL);
fsErr := FSClose(inRef);
fsErr := FSClose(outRef);
fsErr := FlushVol(NIL, outVRefnum);
END {FileAlert} ;
{$IFC DEBUG}
PROCEDURE DebugAlert(l1, l2: LongInt);