All Databases MacTech Vol 11-1995

Aug 95 Challenge

Volume Number: 11

Issue Number: 8

Column Tag: Programmer’s Challenge

Programmer’s Challenge

By Bob Boonstra, Westford, Massachusetts

Note: Source code files accompanying article are located on MacTech CD-ROM or

source code disks.

Diff-Warrior

Authors frequently use a software tool to compare different versions of the same

document and identify what changes have been made between versions. Something like

the well-known diff utility for programmers, but operating on a words instead of a

lines. This month your Challenge is to write a simplified version of such a file

comparison tool.

The prototype of the code you should write is:

#define ulong unsigned long

typedef enum {

/* describes type of text change between old and new versions of text */

  deletedText=0, insertedText, movedText } DiffType;
typedef struct {
DiffType type;

/* The meaning of the remaining fields depends on type.

For type == deletedText:

rangeStart - offset in oldText of first char deleted

rangeEnd - offset in oldText of last char deleted

position - offset in newText before which text would be inserted to convert

newText into oldText

For type == insertedText:

rangeStart - offset in newText of first char inserted

rangeEnd - offset in newText of last char inserted

position - offset in oldText before which text would be inserted to convert

oldText into newText

For type == movedText:

rangeStart- offset in oldText of first char moved

rangeEnd- offset in oldText of last char moved

position- offset in oldText before which text would be inserted to convert

oldText into newText

ulong rangeStart;

ulong rangeEnd;

ulong position;

} DiffRec;

short /* numDiffsFound */ FindWordDifferences (

char *oldText, /* pointer to old version of text */

char *newText, /* pointer to new version of text */

ulong numOldChars, /* number of characters in oldText */

ulong numNewChars, /* number of characters in newText */

DiffRec diffs[], /* pointer to preallocated array where text differences are to

be stored */

ulong maxDiffRecs /* number of DiffRecs preallocated */

);

FindWordDifferences should store a DiffRec in the diffs array for each sequence

of words that were deleted or moved from oldText, and for each sequence inserted into

newText. So, for example, if newText and oldText contained the following characters

oldText->The quick brown fox jumps over the lazy dog.

newText->The brown and white lazy fawn jumps over the dog.

you would return the value 5 and store the following into the diffs array:

diffs[0]={deletedText,  4, 9, 4}  (deleted "quick ")
diffs[1]={insertedText,10,19,16}  (inserted "and white ")
diffs[2]={movedText,   35,39,16}  (moved "lazy ")
diffs[3]={deletedText, 16,18,25}  (deleted "fox")
diffs[4]={insertedText,25,28,16}  (inserted "fawn")

The solution to this problem is clearly not unique. For example, each problem

has a trivial solution where all of the oldText is deleted, and all of the newText is

inserted. To encourage nontrivial and optimal solutions, selection of the winner will

be based on the quality of the FindWordDifferences job you do, as well as how quickly

you find a solution. Quality will be measured first by the number of characters

deleted, inserted, and moved (the sum of the (rangeEnd-rangeStart+1) values in your

DiffRec array), and next on the number of differences found (numDiffsFound), with

smaller being better in both cases. The fastest solution within 10% of the best quality

score (both parameters) will be declared the winner. These criteria place a premium

on finding movedText (as opposed to deletedText / insertedText pairs) and on finding

larger contiguous sections of inserted / deleted text (as opposed to sequences of smaller

changes).

Now for the fine print. Notice that the comparison is based on words, not just on

characters (look at “fox” and “fawn” above), so be careful when calculating offsets.

Words are delimited by white space (spaces, CR, NL, tabs), or punctuation.

Extra/missing white space and/or punctuation should also result in a DiffRec. Your

solution should be case sensitive (i.e., “The” is different from “the”). You may not

change the input text pointed to by newText and oldText. The DiffRec offsets calculated

by your program should apply to the unchanged input oldText and newText (i.e., they

should be independent of changes implied by any other DiffRecs). Correctness will be

determined by using the DiffRecs you produce to convert the oldText into the newText,

and vice versa (ignoring CRs and NLs). In doing the conversions, the DiffRecs will be

applied in order of decreasing position / rangeStart value, and (in case of duplicates)

in reverse order of the DiffRec array. Testing will use text files averaging 5-20K

characters in size, and at least one test case will exceed 64K.

Thanks to the generosity of Metrowerks, who provided a copy of CodeWarrior for

use in the Challenge, and in response to requests from several readers, this Challenge

will be scored using the CodeWarrior compiler. The target instruction set is the

680x0 family; I’ll be testing on a 68040. (We’ll have a PowerPC Challenge in a

couple of months.)

Finally, I want to express the gratitude that all Challenge participants feel

toward Mike Scanlin for his excellent work writing this column. I am honored to have

the opportunity to take over and hope to maintain the high standard Mike set. If any of

you have suggestions on how to make the column and the Challenge even better, please

send them to me at one of the Programmer’s Challenge e-mail addresses, or directly to

boonstra@ultranet.com.

Two Months Ago Winner

Apparently there are at least a few chess players among MacTech Challenge readers. Of

the six entries I received for the Check Checkmate Challenge, four worked correctly.

Congratulations to Gary Beith (Largo, FL) for his efficient and instructive solution.

Gary’s win is especially notable because this is his first entry in the Programmer’s

Challenge.

Here are the times and code sizes for the entries that worked correctly. Numbers

in parens after a person’s name indicate that person’s cumulative point total for all

previous Challenges, not including this one:

Name time code

Gary Beith 188 4282

Ernst Munter (60) 260 19810

Xan Gregg (24) 639 4242

David Rees 2919 2604

Gary’s solution demonstrates several features associated with efficient solutions.

Referenced by (5):