Aug 95 Challenge
Volume Number: 11
Issue Number: 8
Column Tag: Programmer’s Challenge
Programmer’s Challenge 
By Bob Boonstra, Westford, Massachusetts
Note: Source code files accompanying article are located on MacTech CD-ROM orsource code disks.
Diff-Warrior
Authors frequently use a software tool to compare different versions of the same
document and identify what changes have been made between versions. Something like
the well-known diff utility for programmers, but operating on a words instead of a
lines. This month your Challenge is to write a simplified version of such a file
comparison tool.
The prototype of the code you should write is:
#define ulong unsigned long
/* describes type of text change between old and new versions of text */
deletedText=0, insertedText, movedText } DiffType;
typedef struct {
DiffType type;
/* The meaning of the remaining fields depends on type.
For type == deletedText:
rangeStart - offset in oldText of first char deleted
rangeEnd - offset in oldText of last char deleted
position - offset in newText before which text would be inserted to convert
newText into oldText
For type == insertedText:
rangeStart - offset in newText of first char inserted
rangeEnd - offset in newText of last char inserted
position - offset in oldText before which text would be inserted to convert
oldText into newText
For type == movedText:
rangeStart- offset in oldText of first char moved
rangeEnd- offset in oldText of last char moved
position- offset in oldText before which text would be inserted to convert
oldText into newText
*/
ulong rangeStart;
ulong rangeEnd;
ulong position;
short /* numDiffsFound */ FindWordDifferences (
char *oldText, /* pointer to old version of text */
char *newText, /* pointer to new version of text */
ulong numOldChars, /* number of characters in oldText */
ulong numNewChars, /* number of characters in newText */
DiffRec diffs[], /* pointer to preallocated array where text differences are to
be stored */
ulong maxDiffRecs /* number of DiffRecs preallocated */
);
FindWordDifferences should store a DiffRec in the diffs array for each sequence
of words that were deleted or moved from oldText, and for each sequence inserted into
newText. So, for example, if newText and oldText contained the following characters
oldText->The quick brown fox jumps over the lazy dog.
newText->The brown and white lazy fawn jumps over the dog.
you would return the value 5 and store the following into the diffs array:
diffs[0]={deletedText, 4, 9, 4} (deleted "quick ")
diffs[1]={insertedText,10,19,16} (inserted "and white ")
diffs[2]={movedText, 35,39,16} (moved "lazy ")
diffs[3]={deletedText, 16,18,25} (deleted "fox")
diffs[4]={insertedText,25,28,16} (inserted "fawn")
The solution to this problem is clearly not unique. For example, each problem
has a trivial solution where all of the oldText is deleted, and all of the newText is
inserted. To encourage nontrivial and optimal solutions, selection of the winner will
be based on the quality of the FindWordDifferences job you do, as well as how quickly
you find a solution. Quality will be measured first by the number of characters
deleted, inserted, and moved (the sum of the (rangeEnd-rangeStart+1) values in your
DiffRec array), and next on the number of differences found (numDiffsFound), with
smaller being better in both cases. The fastest solution within 10% of the best quality
score (both parameters) will be declared the winner. These criteria place a premium
on finding movedText (as opposed to deletedText / insertedText pairs) and on finding
larger contiguous sections of inserted / deleted text (as opposed to sequences of smaller
changes).
Now for the fine print. Notice that the comparison is based on words, not just on
characters (look at “fox” and “fawn” above), so be careful when calculating offsets.
Words are delimited by white space (spaces, CR, NL, tabs), or punctuation.
Extra/missing white space and/or punctuation should also result in a DiffRec. Your
solution should be case sensitive (i.e., “The” is different from “the”). You may not
change the input text pointed to by newText and oldText. The DiffRec offsets calculated
by your program should apply to the unchanged input oldText and newText (i.e., they
should be independent of changes implied by any other DiffRecs). Correctness will be
determined by using the DiffRecs you produce to convert the oldText into the newText,
and vice versa (ignoring CRs and NLs). In doing the conversions, the DiffRecs will be
applied in order of decreasing position / rangeStart value, and (in case of duplicates)
in reverse order of the DiffRec array. Testing will use text files averaging 5-20K
characters in size, and at least one test case will exceed 64K.
Thanks to the generosity of Metrowerks, who provided a copy of CodeWarrior for
use in the Challenge, and in response to requests from several readers, this Challenge
will be scored using the CodeWarrior compiler. The target instruction set is the
680x0 family; I’ll be testing on a 68040. (We’ll have a PowerPC Challenge in a
couple of months.)
Finally, I want to express the gratitude that all Challenge participants feel
toward Mike Scanlin for his excellent work writing this column. I am honored to have
the opportunity to take over and hope to maintain the high standard Mike set. If any of
you have suggestions on how to make the column and the Challenge even better, please
send them to me at one of the Programmer’s Challenge e-mail addresses, or directly to
boonstra@ultranet.com.
Two Months Ago Winner
Apparently there are at least a few chess players among MacTech Challenge readers. Of
the six entries I received for the Check Checkmate Challenge, four worked correctly.
Congratulations to Gary Beith (Largo, FL) for his efficient and instructive solution.
Gary’s win is especially notable because this is his first entry in the Programmer’s
Challenge.
Here are the times and code sizes for the entries that worked correctly. Numbers
in parens after a person’s name indicate that person’s cumulative point total for all
previous Challenges, not including this one:
Name time code
Gary Beith 188 4282
Ernst Munter (60) 260 19810
Xan Gregg (24) 639 4242
David Rees 2919 2604
Gary’s solution demonstrates several features associated with efficient solutions.