Oct 96 Challenge
Volume Number: 12
Issue Number: 10
Column Tag: Programmer’s Challenge
Programmer’s Challenge 
By Bob Boonstra
Note: Source code files accompanying article are located on MacTech CD-ROM orsource code disks.
DNA Match
This month’s Challenge is based on a suggestion submitted by Vicente Giles of the
Universidad de Málaga. Vincente faces a real-world problem to look for all the genomic
sequences that match certain
criteria, given a DNA database sequence and a problem sequence. A DNA sequence
is a string of the four different nucleotides involved in the genetic code, denoted ‘A’,
‘C’, ‘G’, and ‘U’, which stand for adenine, cytosine, guanine, and uracil. The problem is
to find all possible matches of the problem sequence in the database sequence, allowing
a specified number of differences.
The prototype for the code you should write is:
long FindMatch(
char *alphabet, /* legal characters for database and fragment strings */
char *database, /* reference string to compare fragments against */
void *storage, /* storage preallocated for your use */
char *fragment, /* string to match against database */
long diffsAllowed, /* differences allowed between fragment and database */
long matchPosition[] /* return match positions in this array*/
);
void InitMatch(
char *alphabet, /* legal characters for database and fragment strings */
char *database, /* reference string to compare fragments against */
void *storage /* storage preallocated for your use */
);
Because we would like our DNA-matching algorithm to be useful even if
scientists discover an extraterrestrial genetic code based on other nucleotides, the
algorithm accepts the genetic alphabet as a parameter. In the problem posed by
Vincente, this would be the string “ACGU”, but in our Challenge it might include any of
the characters ‘a’..’z’ or ‘A’..’Z’ (Extraterrestrial DNA is case sensitive). The
null-terminated reference string contained in the database parameter can be up to
1000000000 (109) characters long. The fragment that you are to match is also
null-terminated, but will be significantly shorter on average (up to 10000
characters) than the database string. You should compare the input fragment against
database, finding all occurrences of fragment that differ in no more than diffsAllowed
positions from a substring of database. Your code should populate one entry in the
preallocated matchPosition array for each match found, storing the offset of the
character in database that corresponds to the first character of fragment. The
FindMatch function should return the number of matches found.
As an example, given the following input
alphabet: ACGU
database: ACGTACGTACGTAAAAAATACGTACGTATA
fragment: ACGTACGTAC
diffsAllowed: 5
your code should find 7 matches and store the following values in
matchPosition:
-4 0 4 8 15 19 23
Notice that partial matches can occur at the beginning or the end of database, and
as a result, the offsets returned in matchPosition can be negative or greater than
strlen(database) - strlen(fragment).
To allow you to do some preprocessing, your InitMatch routine will be called once
before a sequence of calls to FindMatch. InitMatch will be called with the same alphabet
and database parameters provided to subsequent FindMatch calls. Both routines will
also be given the same storage parameter that points to at least 1MB of memory
allocated and initialized to zero by the calling routine. FindMatch will be called
between 100 and 1000 times, on average, for each call to InitMatch. The winning
solution will be the one with the fastest execution time, including the execution time
for both InitMatch and FindMatch.
Other fine print: The alphabet characters will be provided in increasing ASCII
order. The offsets you store in matchPosition need not be in any particular order. The
value for diffsAllowed will typically be smaller than 50% of strlen(fragment).
Finally, you should not allocate any dynamic storage in your solution beyond that
provided in the storage parameter.
This will be a native PowerPC Challenge using the latest Symantec environment.
Solutions may be coded in C or C++.
Two Months Ago Winner
Congratulations to Randy Boring for submitting the fastest entry to the
A-Maze-ing Programmer’s Challenge. The Challenge this month was to write code that
would find a path leading out of a three-dimensional maze. The solutions were provided
with the maze size, an initial position, some storage for use in mapping the maze, and a
callback routine. The callback provided the result of attempting to move in a given
direction, indicating whether the attempt to move succeeded, failed because there was
no opening in the specified direction, resulted in a fall down a shaft in the mine, or
found an exit to the mine. Of the four entries submitted, only two successfully solved
all of my test mazes; one of the entries crashed, and one went into an infinite loop.
The table below summarizes the results for each correct entry, including the
language in which the solution was written, the size of the solution code, the amount of
static data used by the solution, the total execution time for all test cases, and the
number of moves needed to solve the mazes.
Name Language Code Size Data Size Time Moves
Randy Boring C++ 2792 484 343153 33519
Jay Negro C++ 1788 51 40945114 7120802
The test mazes used in the evaluation ranged in size from 10x20x30 to