All Databases MacTech Vol 12-1996

Oct 96 Challenge

Volume Number: 12

Issue Number: 10

Column Tag: Programmer’s Challenge

Programmer’s Challenge

By Bob Boonstra

Note: Source code files accompanying article are located on MacTech CD-ROM or

source code disks.

DNA Match

This month’s Challenge is based on a suggestion submitted by Vicente Giles of the

Universidad de Málaga. Vincente faces a real-world problem to look for all the genomic

sequences that match certain

criteria, given a DNA database sequence and a problem sequence. A DNA sequence

is a string of the four different nucleotides involved in the genetic code, denoted ‘A’,

‘C’, ‘G’, and ‘U’, which stand for adenine, cytosine, guanine, and uracil. The problem is

to find all possible matches of the problem sequence in the database sequence, allowing

a specified number of differences.

The prototype for the code you should write is:

long FindMatch(

char *alphabet, /* legal characters for database and fragment strings */

char *database, /* reference string to compare fragments against */

void *storage, /* storage preallocated for your use */

char *fragment, /* string to match against database */

long diffsAllowed, /* differences allowed between fragment and database */

long matchPosition[] /* return match positions in this array*/

);

void InitMatch(

char *alphabet, /* legal characters for database and fragment strings */

char *database, /* reference string to compare fragments against */

void *storage /* storage preallocated for your use */

);

Because we would like our DNA-matching algorithm to be useful even if

scientists discover an extraterrestrial genetic code based on other nucleotides, the

algorithm accepts the genetic alphabet as a parameter. In the problem posed by

Vincente, this would be the string “ACGU”, but in our Challenge it might include any of

the characters ‘a’..’z’ or ‘A’..’Z’ (Extraterrestrial DNA is case sensitive). The

null-terminated reference string contained in the database parameter can be up to

1000000000 (109) characters long. The fragment that you are to match is also

null-terminated, but will be significantly shorter on average (up to 10000

characters) than the database string. You should compare the input fragment against

database, finding all occurrences of fragment that differ in no more than diffsAllowed

positions from a substring of database. Your code should populate one entry in the

preallocated matchPosition array for each match found, storing the offset of the

character in database that corresponds to the first character of fragment. The

FindMatch function should return the number of matches found.

As an example, given the following input

alphabet: ACGU

database: ACGTACGTACGTAAAAAATACGTACGTATA

fragment: ACGTACGTAC

diffsAllowed: 5

your code should find 7 matches and store the following values in

matchPosition:

-4 0 4 8 15 19 23

Notice that partial matches can occur at the beginning or the end of database, and

as a result, the offsets returned in matchPosition can be negative or greater than

strlen(database) - strlen(fragment).

To allow you to do some preprocessing, your InitMatch routine will be called once

before a sequence of calls to FindMatch. InitMatch will be called with the same alphabet

and database parameters provided to subsequent FindMatch calls. Both routines will

also be given the same storage parameter that points to at least 1MB of memory

allocated and initialized to zero by the calling routine. FindMatch will be called

between 100 and 1000 times, on average, for each call to InitMatch. The winning

solution will be the one with the fastest execution time, including the execution time

for both InitMatch and FindMatch.

Other fine print: The alphabet characters will be provided in increasing ASCII

order. The offsets you store in matchPosition need not be in any particular order. The

value for diffsAllowed will typically be smaller than 50% of strlen(fragment).

Finally, you should not allocate any dynamic storage in your solution beyond that

provided in the storage parameter.

This will be a native PowerPC Challenge using the latest Symantec environment.

Solutions may be coded in C or C++.

Two Months Ago Winner

Congratulations to Randy Boring for submitting the fastest entry to the

A-Maze-ing Programmer’s Challenge. The Challenge this month was to write code that

would find a path leading out of a three-dimensional maze. The solutions were provided

with the maze size, an initial position, some storage for use in mapping the maze, and a

callback routine. The callback provided the result of attempting to move in a given

direction, indicating whether the attempt to move succeeded, failed because there was

no opening in the specified direction, resulted in a fall down a shaft in the mine, or

found an exit to the mine. Of the four entries submitted, only two successfully solved

all of my test mazes; one of the entries crashed, and one went into an infinite loop.

The table below summarizes the results for each correct entry, including the

language in which the solution was written, the size of the solution code, the amount of

static data used by the solution, the total execution time for all test cases, and the

number of moves needed to solve the mazes.

Name Language Code Size Data Size Time Moves

Randy Boring C++ 2792 484 343153 33519

Jay Negro C++ 1788 51 40945114 7120802

The test mazes used in the evaluation ranged in size from 10x20x30 to

Referenced by (3):