
Dec 95 Challenge

Volume Number: 11

Issue Number: 12

Column Tag: Programmer’s Challenge

Programmer’s Challenge

By Bob Boonstra, Westford, Massachusetts

Note: Source code files accompanying article are located on MacTech CD-ROM or

source code disks.

Find Again And Again

This month the Challenge is to write a text search engine that is optimized to operate

repeatedly on the same text. You will be given a block of text, some storage for data

structures, and an opportunity to analyze the text before being asked to perform any

searches against that text. Then you will repeatedly be asked to find a specific

occurrence of a given word in that block of text. The prototypes for the code you should

write are:

void InitFind(
  char *textToSearch,     /* find words in this block of text */
  long textLength,        /* number of chars in textToSearch */
  void *privateStorage,   /* storage for your use */
  long storageSize        /* number of bytes in privateStorage */
);

long FindWordOccurrence(
                          /* return offset of wordToFind in textToSearch */
  char *wordToFind,       /* find this word in textToSearch */
  long wordLength,        /* number of chars in wordToFind */
  long occurrenceToFind,  /* find this instance of wordToFind */
  char *textToSearch,     /* same parameter passed to InitFind */
  long textLength,        /* same parameter passed to InitFind */
  void *privateStorage,   /* same parameter passed to InitFind */
  long storageSize        /* same parameter passed to InitFind */
);

The InitFind routine will be called once for a given block of textLength

characters at textToSearch to allow you to analyze the text, create data structures,

and store them in privateStorage. When InitFind is called, storageSize bytes of

memory at privateStorage will have been preallocated and initialized to zero.

FindWordOccurrence is to search for words, where a word is defined as a

continuous sequence of alphanumeric characters delimited by a non-alphanumeric

character (e.g., space, tab, punctuation, hyphen, CR, NL, or other special character).

Your code should look for complete words - it would be incorrect, for example, to

return a value pointing to the word “these” if the wordToFind was “the”. The

wordToFind will be a legal word (i.e., no embedded delimiters).

FindWordOccurrence should return the offset in textToSearch of the

occurrenceToFind-th instance of wordToFind. It should return -1 if wordToFind

does not occur in textToSearch, or if there are fewer than occurrenceToFind

instances of wordToFind.
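To make the word-matching and return-value rules concrete, here is a minimal baseline sketch that rescans the text on every call and uses no private storage. It is only an illustration, not a competitive entry: the name NaiveFindWordOccurrence and the helper IsWordChar are hypothetical, and case-sensitive matching and the use of isalnum to classify characters are assumptions the problem statement does not spell out.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: a "word character" is anything isalnum accepts
   (an assumption; the problem only says "alphanumeric"). */
static int IsWordChar(char c)
{
    return isalnum((unsigned char)c) != 0;
}

/* Hypothetical baseline: returns the offset of the occurrenceToFind-th
   complete-word match of wordToFind, or -1 if there are too few matches.
   Matching is case-sensitive (an assumption). */
long NaiveFindWordOccurrence(const char *wordToFind, long wordLength,
                             long occurrenceToFind,
                             const char *textToSearch, long textLength)
{
    long pos = 0;
    long seen = 0;

    if (wordLength <= 0 || occurrenceToFind <= 0)
        return -1;

    while (pos < textLength) {
        long start;

        /* skip any run of delimiters */
        while (pos < textLength && !IsWordChar(textToSearch[pos]))
            pos++;
        start = pos;

        /* consume one complete word */
        while (pos < textLength && IsWordChar(textToSearch[pos]))
            pos++;

        /* compare only whole words, so "the" never matches "these" */
        if (pos - start == wordLength &&
            memcmp(textToSearch + start, wordToFind,
                   (size_t)wordLength) == 0) {
            if (++seen == occurrenceToFind)
                return start;
        }
    }
    return -1;
}
```

Because every call walks the entire text, this version does nothing to exploit the repeated searches; it is useful mainly as a reference against which an indexed solution can be checked.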

Both the InitFind and the FindWordOccurrence routines will be timed in

determining the winner. In designing your code, you should assume that

FindWordOccurrence will be called approximately 1000 times for each call to

InitFind (with the same textToSearch, but possibly differing values of wordToFind

and occurrenceToFind).

There is no predefined limit on textLength - you should handle text of arbitrary

length. The amount of privateStorage available could be very large, but is

guaranteed to be at least 64K bytes. While the test cases will include at least one large

textToSearch with a small storageSize, most test cases will provide at least 32

bytes for each occurrence of a word in textToSearch, so you might want to optimize

for that condition.
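One possible way to use the private storage, sketched under stated assumptions: InitFind records a hash, offset, and length for every word, and FindWordOccurrence compares hashes before touching the text. The WordEntry and IndexHeader layouts, the helper names NextWord and HashBytes, the FNV-style hash, and the rescan fallback when the index does not fit are all hypothetical choices for illustration. The sketch also assumes privateStorage is suitably aligned for long and that matching is case-sensitive.

```c
#include <ctype.h>
#include <string.h>

typedef struct {            /* hypothetical index entry, one per word */
    unsigned long hash;
    long offset;
    long length;
} WordEntry;

typedef struct {            /* hypothetical header at start of storage */
    long count;             /* number of entries, or -1 if index did not fit */
} IndexHeader;

/* Advance *pos past the next word; return its start, or -1 if none. */
static long NextWord(const char *text, long textLength, long *pos)
{
    long start;
    while (*pos < textLength && !isalnum((unsigned char)text[*pos]))
        (*pos)++;
    if (*pos >= textLength)
        return -1;
    start = *pos;
    while (*pos < textLength && isalnum((unsigned char)text[*pos]))
        (*pos)++;
    return start;
}

/* FNV-1a style hash; any reasonable string hash would do here. */
static unsigned long HashBytes(const char *s, long n)
{
    unsigned long h = 2166136261UL;
    long i;
    for (i = 0; i < n; i++)
        h = (h ^ (unsigned char)s[i]) * 16777619UL;
    return h;
}

void InitFind(char *textToSearch, long textLength,
              void *privateStorage, long storageSize)
{
    IndexHeader *hdr = (IndexHeader *)privateStorage;
    WordEntry *entries = (WordEntry *)(hdr + 1);
    long capacity = (storageSize - (long)sizeof(IndexHeader))
                    / (long)sizeof(WordEntry);
    long pos = 0, start;

    hdr->count = 0;
    while ((start = NextWord(textToSearch, textLength, &pos)) >= 0) {
        if (hdr->count >= capacity) {
            hdr->count = -1;    /* too many words: searches will rescan */
            return;
        }
        entries[hdr->count].hash = HashBytes(textToSearch + start, pos - start);
        entries[hdr->count].offset = start;
        entries[hdr->count].length = pos - start;
        hdr->count++;
    }
}

long FindWordOccurrence(char *wordToFind, long wordLength,
                        long occurrenceToFind,
                        char *textToSearch, long textLength,
                        void *privateStorage, long storageSize)
{
    const IndexHeader *hdr = (const IndexHeader *)privateStorage;
    const WordEntry *entries = (const WordEntry *)(hdr + 1);
    long seen = 0, i;

    (void)storageSize;
    if (wordLength <= 0 || occurrenceToFind <= 0)
        return -1;

    if (hdr->count < 0) {       /* no index available: rescan the text */
        long pos = 0, start;
        while ((start = NextWord(textToSearch, textLength, &pos)) >= 0) {
            if (pos - start == wordLength &&
                memcmp(textToSearch + start, wordToFind,
                       (size_t)wordLength) == 0 &&
                ++seen == occurrenceToFind)
                return start;
        }
        return -1;
    }

    {
        unsigned long h = HashBytes(wordToFind, wordLength);
        for (i = 0; i < hdr->count; i++) {
            if (entries[i].hash == h && entries[i].length == wordLength &&
                memcmp(textToSearch + entries[i].offset, wordToFind,
                       (size_t)wordLength) == 0 &&
                ++seen == occurrenceToFind)
                return entries[i].offset;
        }
    }
    return -1;
}
```

This still walks the whole entry list on each search. With roughly 32 bytes available per word occurrence, a faster design would probably bucket the entries by hash and chain occurrences of the same word together, so that the n-th occurrence can be reached without scanning unrelated words.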

Other fine print: you may not change the input pointed to by textToSearch or

wordToFind, and you should not use any static storage other than that provided in

privateStorage.

This will be a native PowerPC Challenge, scored using the latest CodeWarrior

compiler. Good luck, and happy searching.

Programmer’s Challenge Mailing List

We are pleased to announce the creation of the Programmer’s Challenge Mailing List.

The list will be used to distribute the latest Challenge, provide answers to questions

about the current Challenge, and discuss suggestions for future Challenges. The

Challenge problem will be posted to the list each month, sometime between the 20th

and the 25th of the month. This should alleviate problems caused by variations in the

publication and mailing date of the magazine, and provide a predictable amount of time

to work on each Challenge.

To subscribe to the list, send a message to autoshare@mactech.com with the

SUBJECT line “sub challenge YourName”, substituting your real name for YourName.

To unsubscribe from the list, send a message to autoshare@mactech.com with the

SUBJECT line “unsub challenge”.

Note: the list server, autoshare, is set to accept commands in the SUBJECT line,

not the body of the message. If you have any problems, please contact

online@mactech.com.

Two Months Ago Winner

The Master Mindreader Challenge inspired ten readers to enter, and all ten solutions

gave correct results. Congratulations to Xan Gregg (Durham, N.C.) for producing the

fastest entry and winning the Challenge.

The problem required you to write code that would correctly guess a sequence of

colors using a callback routine provided in the problem statement that returned two

values for each guess: the number of elements of the guess where the correct color is

located in the correct place in the sequence, and the number of elements where the

correct color is in an incorrect place in the sequence. The number of guesses was not

an explicit factor in determining the winner, but the time used by the callback routine

was included in determining the winner. Participants correctly noted that this made

the relative execution time of the guessing routine and the callback routine a factor in

designing a fast solution. A couple of entries went so far as to offer their own, more

efficient, callbacks. Nice try, but I didn’t use them - the callback in the problem was

designed to provide a known time penalty for making a guess, and that was the callback

I used in evaluating solutions.
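For readers who did not see the original problem, the two values returned for each guess can be computed along the following lines. This is a sketch of the standard scoring rule, not the callback from the problem statement: the function name ScoreGuess, the kMaxColors bound, the explicit answer parameter, and the treatment of out-of-range guess values (they simply never match) are assumptions made for illustration. Colors are assumed to be numbered from 1.

```c
#define kMaxColors 32   /* hypothetical upper bound for this sketch */

/* Hypothetical scoring routine: counts guess elements with the right
   color in the right place, and elements whose color appears elsewhere
   in the answer. Answer colors are assumed to lie in 1..kMaxColors. */
void ScoreGuess(const unsigned char *answer, const unsigned char *guess,
                unsigned short length,
                unsigned short *numInCorrectPos,
                unsigned short *numInWrongPos)
{
    unsigned short answerLeft[kMaxColors + 1] = {0};
    unsigned short guessLeft[kMaxColors + 1] = {0};
    unsigned short correct = 0, wrong = 0;
    unsigned short i, c;

    for (i = 0; i < length; i++) {
        if (guess[i] == answer[i]) {
            correct++;                     /* right color, right place */
        } else {
            answerLeft[answer[i]]++;       /* unmatched answer color */
            if (guess[i] <= kMaxColors)    /* ignore out-of-range guesses */
                guessLeft[guess[i]]++;
        }
    }

    /* each unmatched color counts as often as it appears on both sides */
    for (c = 0; c <= kMaxColors; c++)
        wrong += (answerLeft[c] < guessLeft[c]) ? answerLeft[c]
                                                : guessLeft[c];

    *numInCorrectPos = correct;
    *numInWrongPos = wrong;
}
```

Note that the second pass over the color counts is extra work the first value does not need, which is consistent with the winner's observation that a solution relying only on the correct-position count pays for information it ignores.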

The callback I supplied had one unanticipated side effect - it permitted callers to

supply an out-of-range value for positions in the sequence that they didn’t care about

for that guess, and six of the entries took advantage of this loophole. This wasn’t what I

had intended, and I gave some thought to giving priority to solutions that did not use the

loophole. In the end, however, I decided not to treat these entries any differently,

because the problem statement permitted and provided a defined callback behavior for

out-of-range guesses. As it turned out, the winning entry and three of the fastest four

entries did not use out-of-range guesses.

Xan’s winning code first makes a sequence of guesses to determine how many

positions are set to each of the possible colors. He then starts with an initial guess

corresponding to these colors and begins swapping positions to determine how the

number of correctly placed colors is affected. Separate logic handles the cases where

the number of correctly placed colors increased or decreased by 0, 1, or 2, all the

while keeping track of which color possibilities have been eliminated for each position.

These and other details of Xan’s algorithm are documented in the comments to his code.
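The first phase described above, determining how many positions hold each color, can be sketched as follows. This is an illustration of the idea rather than Xan's code: the function name CountColors and the stand-in callback DemoCheckGuess (with its fixed hidden answer, and a wrong-position count that is always reported as 0) are hypothetical and exist only so the fragment is self-contained.

```c
#define kMaxLength 16

typedef void (*CheckGuessProcPtr)(unsigned char *theGuess,
                                  unsigned short *numInCorrectPos,
                                  unsigned short *numInWrongPos);

/* Hypothetical hidden answer and stand-in callback, for demonstration
   only. It reports just the correct-position count. */
static const unsigned char gDemoAnswer[4] = {2, 1, 2, 3};
static const unsigned short gDemoLength = 4;

void DemoCheckGuess(unsigned char *theGuess,
                    unsigned short *numInCorrectPos,
                    unsigned short *numInWrongPos)
{
    unsigned short i, correct = 0;
    for (i = 0; i < gDemoLength; i++)
        if (theGuess[i] == gDemoAnswer[i])
            correct++;
    *numInCorrectPos = correct;
    *numInWrongPos = 0;     /* not needed by the counting phase */
}

/* Fill numOfColor[1..numColors] by guessing a sequence made entirely of
   one color at a time; the correct-position count for such a guess is
   the number of positions holding that color. Stops early once every
   position is accounted for. Assumes answerLength <= kMaxLength and
   that numOfColor has at least numColors + 1 elements. */
void CountColors(CheckGuessProcPtr checkGuess,
                 unsigned short answerLength,
                 unsigned short numColors,
                 long numOfColor[])
{
    unsigned char guess[kMaxLength];
    unsigned short color, i, correct, wrong;
    long found = 0;

    for (color = 1; color <= numColors; color++)
        numOfColor[color] = 0;

    for (color = 1; color <= numColors && found < answerLength; color++) {
        for (i = 0; i < answerLength; i++)
            guess[i] = (unsigned char)color;
        checkGuess(guess, &correct, &wrong);
        numOfColor[color] = correct;
        found += correct;
    }
}
```

With the demonstration answer {2, 1, 2, 3} and three colors, this phase uses three guesses and reports one position of color 1, two of color 2, and one of color 3.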

The table of results below indicates, in addition to execution time, the cumulative

number of guesses used by each entry for all test cases. In general, it shows the

expected rough correlation between execution time and the number of guesses, with a

notable exception for the second-place entry from Ernst Munter, which took

significantly fewer guesses. Ernst precalculated tables to define the guessing strategy

for problems of length 5 or less and devised a technique for partitioning larger

problems to use these tables. Normally I try to discourage the use of extensive

precalculated data, but I decided to allow this entry because the amount of data was not

unreasonable, because the tables guided the algorithm but did not precalculate a

solution, and because I thought the approach was innovative and interesting. Although

including the second-place entry in the article is not possible because of length

restrictions, I have included the preamble from Ernst’s solution describing his

approach.

Here are the times and code sizes for each of the entries. Numbers in parentheses

after a person’s name indicate that person’s cumulative point total for all previous

Challenges, not including this one.

Name                 time  guesses  code  data  out-of-range values used?

Xan Gregg (61)        102     4123  1360    16  no
Ernst Munter (90)     109     2880  6264  5480  limited
Gustav Larsson (60)   116     3700   712    40  no
Greg Linden           127     5002   576    16  no
M. Panchenko (4)      146     5391   344    16  yes
Eric Lengyel (20)     176     6456   312    16  yes
Peter Hance           206     6557   336    16  yes
J. Vineyard (42)      228     9933   328    16  no
Ken Slezak (10)       251     6544   808    16  yes
Stefan Sinclair       259    11058   200    16  yes

Top 20 Contestants of All Time

Here are the Top 20 Contestants for the Programmer’s Challenges to date. The

numbers below include points awarded for this month’s entrants. (Note: ties are listed

alphabetically by last name - there are more than 20 people listed this month because

of ties.)

Rank Name Points

1. [Name deleted] 176

2. Munter, Ernst 100

3. Gregg, Xan 81

4. Karsh, Bill 78

5. Larsson, Gustav 67

6. Stenger, Allen 65

7. Riha, Stepan 51

8. Goebel, James 49

9. Nepsund, Ronald 47

10. Cutts, Kevin 46

11. Mallett, Jeff 44

12. Kasparian, Raffi 42

13. Vineyard, Jeremy 42

14. Darrah, Dave 31

15. Landry, Larry 29

16. Elwertowski, Tom 24

17. Lee, Johnny 22

18. Noll, Robert 22

19. Anderson, Troy 20

20. Beith, Gary 20

21. Burgoyne, Nick 20

22. Galway, Will 20

23. Israelson, Steve 20

24. Landweber, Greg 20

25. Lengyel, Eric 20

26. Pinkerton, Tom 20

There are three ways to earn points: (1) scoring in the top 5 of any Challenge,

(2) being the first person to find a bug in a published winning solution, or (3) being

the first person to suggest a Challenge that I use. The points you can win are:

1st place 20 points

2nd place 10 points

3rd place 7 points

4th place 4 points

5th place 2 points

finding bug 2 points

suggesting Challenge 2 points

Here is Xan’s winning solution:

MindReader

By Xan Gregg, Durham, N.C.

I try to minimize the number of guesses without adding too much complexity to the

code. First I figure out how many of each color are present in the answer by

essentially repeatedly guessing all of each color.

Then I figure out the correct positions one at a time starting at slot 0. I exchange it

with each other slot (one at a time) until the correct color is found. When there is a

change in the numCorrect response from checkGuess I can tell which of the two

slots caused the change by looking at my remembered information or, if necessary,

by performing a second guess with one of the colors in both slots.

The “remembered information” includes keeping track of colors that were

determined (via the numCorrect change) to be wrong before and/or after a swap is made.

This doesn’t help out too often, but it doesn’t take much time to record compared to

calling checkGuess.

While the outer loop determines the color of each slot “left-to-right” (0 to n-1), I

found that indexing the inner loop right-to-left instead of left-to-right increased the

speed by 30% - 40%. I wish I understood why!

Oddly, the checkGuess function spends most of its time figuring out the numWrong

value, which we generally ignore.

typedef void (*CheckGuessProcPtr)(
  unsigned char *theGuess,
  unsigned short *numInCorrectPos,
  unsigned short *numInWrongPos);

#define kMaxLength 16
#define Bit(color) (1L << (long) (color))

MindReader

void MindReader(unsigned char guess[],
                CheckGuessProcPtr checkGuess,
                unsigned short answerLength,
                unsigned short numColors)
{
  long prevColorsFound;
  long colorsFound;
  long curColor;
  long i, j;
  long curCorrect;
  long numOfColor[kMaxLength + 1]; /* 1-based */
  Boolean isCorrect[kMaxLength];
  long possibilities[kMaxLength]; /* bit fields */
  long colorBit1;
  long colorBit2;
  char color1;
  char color2;
  long delta;
  unsigned short newCorrect;
  unsigned short newWrong;

  /* first find the correct set of colors */
  colorsFound = 0;
  curColor = 1;
  while (colorsFound < answerLength)
  {
