LZW Compression
Volume Number: 6
Issue Number: 10
Column Tag: Pascal Procedures
LZW Compression/Decompression 
By Dennis R. Cohen, Santa Clara, CA
Adaptive Lempel-Ziv-Welch Compression/Decompression
Data compression is a topic that becomes of interest to programmers (and the
people who hire us) as our programs get larger and the data sets they manipulate grow.
One of the most prevalent examples of data compression for most of us is compressing
and archiving files either for backup purposes or for transmission via modem or
network. The most common Mac programs for this are StuffIt (by Raymond Lau) and
PackIt (by Harry Chesley). PackIt uses a technique known as Huffman encoding;
StuffIt employs Lempel-Ziv-Welch (LZW, for short), Huffman, or
Run-Length Encoding (RLE), based upon the characteristics of the input data. While it
is not necessarily the most efficient technique for all files, LZW has proven in practice
to be the most efficient general technique and is the subject of this article.
The seminal discussion concerning LZW compression may be found in:
A Technique for High-Performance Data Compression
by Terry A. Welch
IEEE Computer, June 1984 (vol. 17, no. 6)
Since Macintosh files are really two files (resource fork and data fork) and have
various other pieces of associated data, we start each compressed file with a header
record which contains the original name of the file, the size of each fork, and the
Finder information (creator, file-type, etc) so that we will be able to reconstruct an
indistinguishable copy of the original file.
LZW operates by finding common “substrings” and replacing them with a
fixed-length (frequently 12-bit) code. This technique is deterministic and is
accomplished during a single pass over the input file. The decompression procedure
needs no input table, and rebuilds the table as it goes. A 12-bit code means that there
are only 4K possible codes. We’ll be using a 14-bit code and thus allow 16K possible
encodings before our string table fills up. The table is initialized to contain encodings
of the 256 possible single-byte values.
The decompression algorithm translates each received code into a prefix string
and an extension byte. The extension byte is pushed onto a stack and the prefix
translated again, cycling until the prefix is atomic. The entire (decompressed) code is
then output by popping the stack until it is empty.
Due to the nature of the algorithm, the first code encountered is known to be of a
single byte (atomic) and can be directly converted. An update to the string table is
made for each code received after the first one, which is known to already be in the
table from the initialization. When a code has been translated, its final byte is used as
the suffix and combined with the prior string to add the new entry to the string table so
long as the string table hasn’t been filled. This new string is assigned a new value that
is the same code the compressor assigned to that string. In this manner, the
decompressor incrementally reconstructs the same string table that the compressor
used. Unfortunately, there is an abnormal case for which this algorithm does not
work. The abnormal case occurs whenever an input character string containing a
cyclic sequence of the form byte-code-byte-code-byte already appears in the
compressor string table. The decompression algorithm is modified (see step 3 of the
decompression description, below) to handle the special case.
To Compress
1. Initialize the table to contain all single-byte strings
2. Read the first byte, c, from the input file. Set the prefix, w, to that byte.
3. Read the next input byte into c.
4. If at end of file then go to step 7.
5. If wc is in the string table then set w to wc and go to 3.
6. Output Code(w); Put wc in the string table; Set w to c; go to 3
7. Output Code(w) and signify completion
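The compression steps above can be sketched as follows. This is a Python illustration rather than the article’s Pascal, and it uses a dictionary in place of the hashed string table so the shape of the loop stays visible:

```python
def lzw_compress(data: bytes, max_codes: int = 16384):
    # Step 1: the table starts with all 256 single-byte strings;
    # code 256 is reserved as the adaptive clear code, so new
    # strings are assigned codes starting at 257.
    table = {bytes([i]): i for i in range(256)}
    next_code = 257
    out = []
    w = b""                      # current prefix (steps 2-3 fold into the loop)
    for b in data:
        c = bytes([b])
        wc = w + c
        if wc in table:          # step 5: grow the prefix
            w = wc
        else:                    # step 6: emit Code(w), record wc, restart at c
            out.append(table[w])
            if next_code < max_codes:
                table[wc] = next_code
                next_code += 1
            w = c
    if w:
        out.append(table[w])     # step 7: flush the final prefix
    return out
```

For example, lzw_compress(b"abababab") yields [97, 98, 257, 259, 98]: the repeated “ab” and “aba” substrings are replaced by single codes.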
To Decompress
1. Read the first input code and store it in both code and oldcode. Since the first
code is known to be atomic (cf. compression step 2, above), it encodes a single byte
c; output c and set the extension to c.
2. Read the next code and save the value in incode. If at end of file then go to 8.
3. If the code is not in the string table, we have the special case, so: output the
extension, set code to oldcode, and set incode to the code for wc.
4. If the retrieved code is the code for wc then push c onto the stack, set code to
code of w and repeat this step.
5. If the code is atomic, output the byte it represents and set the extension to that
byte.
6. While the stack is non-empty, output the byte on top and pop the stack.
7. Put the string oldcode followed by c into the string table, set oldcode to incode, and go to 2.
8. Signify completion.
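A matching Python sketch of the decompression steps follows. For brevity it stores each table entry as a complete byte string, so the explicit stack of extension bytes (steps 4 and 6) collapses into outputting the stored string; the step-3 special case appears as the code that equals the next unassigned table slot:

```python
def lzw_decompress(codes, max_codes: int = 16384):
    # Mirror of the compressor's table initialization.
    table = {i: bytes([i]) for i in range(256)}
    next_code = 257
    it = iter(codes)
    first = next(it)
    out = bytearray(table[first])   # step 1: the first code is always atomic
    old = table[first]
    for code in it:
        if code in table:
            entry = table[code]
        elif code == next_code:     # step 3: the byte-code-byte special case
            entry = old + old[:1]
        else:
            raise ValueError("corrupt input")
        out += entry                # steps 4-6: emit the whole translated string
        if next_code < max_codes:   # step 7: add old string + first byte of entry
            table[next_code] = old + entry[:1]
            next_code += 1
        old = entry
    return bytes(out)
```

Feeding it the codes from the compression example reproduces the original input: lzw_decompress([97, 98, 257, 259, 98]) returns b"abababab", and code 259 exercises the special case.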
In the case of adaptive LZW (which is demonstrated by the source in this article)
the compression is tested periodically, in our case every 10000 bytes, to see whether
the compression ratio has declined. If it has, a reserved code is saved to the output
file, the string table is reinitialized, and we start the process again. Obviously,
during decompression, if the reset code is encountered then the program reinitializes
the string table and then continues blithely with its processing.
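The periodic ratio test can be sketched as a small helper. The constant names mirror clearCode and checkGap in the listing, but the bookkeeping here is illustrative, not the article’s Pascal:

```python
CLEAR_CODE = 256    # reserved code written to the output on a reset
CHECK_GAP = 10000   # re-check the ratio every 10000 input bytes

def check_adaptive(bytes_read, bytes_written, checkpoint, last_ratio):
    """Return (emit_clear, next_checkpoint, new_ratio)."""
    if bytes_read < checkpoint or bytes_written == 0:
        return False, checkpoint, last_ratio  # not at a checkpoint yet
    ratio = bytes_read / bytes_written
    if ratio < last_ratio:
        # Compression has declined: the caller emits CLEAR_CODE and
        # reinitializes the string table, and the ratio starts over.
        return True, bytes_read + CHECK_GAP, 0.0
    return False, bytes_read + CHECK_GAP, ratio
```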
A hashing algorithm is employed to determine the code for a string (prefix +
suffix byte). In the example program, I just shift and xor the two values. This is a
very simplistic hash evaluation and will result in a lot of collisions; however, a more
efficient hashing algorithm could introduce additional processing overhead in its
computation. One such possibility is included as a comment. The choice of hash
algorithm is yours, just be sure that you use the same one for both compression and
decompression.
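A shift-and-xor hash of the kind described might look like the sketch below. The shift amount is illustrative (the Pascal listing shows the actual expression), and collisions would be chained through the next field of the table entries:

```python
TABLE_MASK = 0x3FFF   # maxTab: a 16K-entry table for 14-bit codes

def hash_code(prefix: int, suffix: int) -> int:
    # Fold the prefix code and the suffix byte into a table index.
    return ((prefix << 5) ^ suffix) & TABLE_MASK
```

Because the result is masked to 14 bits, every string maps somewhere in the table; the simplicity of the mixing is what produces the collisions mentioned above.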
There are a lot of places where this program could be optimized; however, most
of the optimizations are at the expense of clarity and are avoided here.
The major difference between this straightforward presentation of adaptive LZW
and that used by Raymond Lau in his StuffIt program follows. StuffIt uses the UNIX
compress code to do its compression, with 14-bit compression being selected. The UNIX
compress code uses up all the 9-bit codes first, then the 10-bit codes, etc. until it has
used up the 14-bit codes, meanwhile packing them in VAX byte/bit order. This
involves a great deal of masking and shifting that obscures the general algorithm. We
are using a general string table into which we hash and always store 14-bit values in
the “natural” 680x0 byte order, although we do buffer our partial bytes and pack the
output data.
While our example is for compressing files, it could be modified to compress and
decompress structures in memory by replacing the GetByte routine on the
compression side and the PutByte routine on the decompression side. All the
compression code cares about is that it is fed data to be compressed one byte at a time,
not where the data came from. Similarly, all the decompression code cares about is
that it be fed one-byte chunks of data to be decompressed and that it will spew the
result out a byte at a time.
The example programs could also be consolidated into a single application with
the introduction of a less modal (and more Macish) interface, but that is not the
purpose of this article. I used TMLPascal II (vers. 3.0) for this article but have also
recompiled with both MPW 3.0 and Think Pascal 2.01; Think Pascal requires some
changes in the area of compiler directives, USES statements, and references to
_DataInit; the MPW Pascal compiler does not require any changes to the source (just
the makefile). Turbo Pascal would require a lot of changes in the source due to a
number of incompatible features (OR, AND, SHL, etc. for bit operations, for example).
Listing: LComp.Proj
#############################################################
#
# Created: Monday, October 10, 1988 4:08:42 PM
#
# Project Name: LComp
# Project Type: Application
#
# Finder Type: APPL
# Finder Creator: ????
# Set Bundle Bit: TRUE
#
# Project Source Files:
# LComp.r
# lcomp.p
lcomp.p.o ƒ lcomp.p
	TMLPascal lcomp.p

LComp ƒƒ lcomp.p.o
	Link -w -t 'APPL' -c 'LZWC' ∂
		lcomp.p.o ∂
		"{Libraries}"Runtime.o ∂
		"{Libraries}"Interface.o ∂
		"{TMLPLibraries}"PasLib.o ∂
		-o LComp

LComp ƒƒ LComp.r
	Rez -append -o LComp LComp.r
SetFile -a B LComp
Listing: LComp.p
{$R-}
{$D+}
{$DEFC DEBUG}
{$SETC DEBUG=TRUE}
PROGRAM LComp;
{ Simple case LZW compression }
USES
MemTypes,
QuickDraw,
OSIntf,
ToolIntf,
PackIntf;
CONST
maxBuff = 8192; {i/o buffer size}
maxTab = 16383; {Table size minus 1 ($3FFF)}
noPrev = $7FFF;
eofChar = -2;
endList = -1;
empty = -3;
clearCode = 256; {Reserved code to signal adaptive reset ($100) }
checkGap = 10000; {How frequently do we check for adaptive?}
TYPE
StringTableEntry = RECORD
prevByte: Integer;
follByte: Integer;
next: Integer;
used: Boolean;
reserved: Boolean;
END;
StringTableArray = ARRAY [0..maxTab] OF StringTableEntry; {128K structure unless packed}
StringTablePtr = ^StringTableArray;
IntPtr = ^Integer;
Buffer = PACKED ARRAY [1..maxBuff] OF Char;
BufPtr = ^Buffer;
HeaderRecord = RECORD
name: String[31];
dfSize: LongInt;
rfSize: LongInt;
fndrInfo: FInfo;
END;
Remainder = (none, sixBit, fourBit, twoBit);
VAR
inRef: Integer; {Reference number of input file}
outRef: Integer; {Reference number of output file}
inVRefNum: Integer;{Volume/WD reference num. of input file}
outVRefNum: Integer; {Volume/WD reference number of output file}
eofSignal: Boolean; {Flag that it’s time to clean up}
inBufSize: LongInt; {Count of characters in input buffer }
inputPos: Integer; {Position in input buffer}
outputPos: Integer; {Position in output buffer}
bytesRead: LongInt; {Total bytes read from input file}
bytesWritten: LongInt; {Total bytes written to output file}
ratio: Extended;{Compression ratio (bytesRead/bytesWritten)}
checkPoint: LongInt; {Next time we check whether table adaptation is necessary}
inputBuffer: BufPtr; {Dynamically allocated data storage}
outputBuffer: BufPtr; { “ }
stringTable: StringTablePtr;
infileName: Str255; {Name of the file we’re compressing}
tableUsed: Integer;{Number of entries currently in string table}
outputCode: Integer; {Code (14-bit) that we’re going to output}
carryOver: Remainder; {How many bits we have in the code we’re building}
doingDFork: Boolean; {Flag that tells which fork of the file we’re compressing}
fsErr: OSErr; {Result of last file system call}
dataForkSize: LongInt; {Number of bytes in data fork}
rsrcForkSize: LongInt; {Number of bytes in resource fork}
progWindow: WindowPtr; {Window where we display progress}
boundsRect: Rect; {Bounding rect of the progress window}
hdrRec: HeaderRecord; {File information so that decompress will get things right}
resetCode: Integer; {This is the hashCode for clearCode}
PROCEDURE _DataInit; EXTERNAL; {MPW specific}
PROCEDURE FileAlert(str: Str255);
CONST
fsAlert = 1111;
VAR
item: Integer;
BEGIN
ParamText(str, '', '', '');
item := StopAlert(fsAlert, NIL);
fsErr := FSClose(inRef);
fsErr := FSClose(outRef);
fsErr := FlushVol(NIL, outVRefnum);
END {FileAlert} ;
{$IFC DEBUG}
PROCEDURE DebugAlert(l1, l2: LongInt);