All Databases MacTech Vol 08-1992

Virus Detection

Volume Number: 8

Issue Number: 2

Column Tag: Cover Story

Related Info: Resource Manager File Manager

Generic Virus Detection

Virus Detection - the state of the art solutions

By William Hsu, Millersville, Maryland

Note: Source code files accompanying article are located on MacTech CD-ROM or

source code disks.

About the author

William Hsu is a third-year undergraduate majoring in Computer Science at the

Johns Hopkins University. The following article was based on a paper for an

independent programming workshop under Dr. Steven Salzberg. His Internet address is

hsu@cs.jhu.edu. His research interests include recent advances in computer virus

theory and treatment, program synthesis, and randomized and approximation

algorithms.

Recently, computer viruses have attracted a great deal of public, media, and
scientific attention. This is no surprise considering the explosion in the development
rate of computer virus code. Since current methods for detecting viruses have had
limited success, a new approach to the detection of such code is needed. We're going
to look at two variations on algorithms originally developed for string matching.
Both modifications allow an increased tolerance for variant “strains” of known
viruses, especially for the so-called evolutionary class, which mutate themselves at
predetermined intervals.

First, a randomization scheme can be applied to an established fast substring

matching procedure (such as the Boyer-Moore algorithm). This randomization allows

mutation-resistant searching. Second, an approximate pattern matching algorithm for a

maximum number of differences can be used. The algorithm is modified by weighting the

edit distance metric to make it robust to character padding and removal. Both functions

are combined to create a generalized detector capable of finding viral clones, whether

produced by human authors or by automatic variations.

This article was produced as part of an undergraduate computer science

workshop at Johns Hopkins University. First and foremost, I would like to thank Dr.

Steven Salzberg, my project advisor, for his guidance, insight, and instruction. Many

thanks to the faculty and staff of the JHU Computer Science Department, and to John

Norstad, Brian Seborg, Ephraim Vishniac, and Jan Christian van Winkel for their help

and comments.

Virus Detection: Classification, Methods, and History

There has been a great deal of interest in the detection of active “viral” code on

both Macs and mainframes, especially as members of an interconnected network.

Definitions of the set of programs called viruses have been put forth in many recent

articles, most notably in the work by Cohen.1 In this article, a virus is simply
treated as a pattern string to be searched for in a larger text (a possibly
infected program).

Current viral detection procedures are classified according to a system put forth

by the Computer Virus Industry Association.2 The association divides anti-viral

products into three categories: Class 1 antiviruses (“Infection Prevention Products”)

halt the virus replication and prevent the initial infection from occurring - an example

of such a program for the Macintosh operating system is the cdev (Control Panel

Device) Vaccine. Class 2 programs (“Infection Detection Products”) detect infection

soon after it has occurred and mark specific components of system segments that have

become infected - they do so by periodically inspecting executable files, and may or

may not attempt repair of an infected file. An example of a Class 2 program on the

Macintosh is the INIT package GateKeeper and GateKeeper Aid. Finally, Class 3

antiviruses (“Infection Identification Products”) identify specific viral strains (see

below for more information about this topic) on systems that are already infected and

may remove the virus, returning the system to its state prior to infection. This

category is most common, although its effectiveness is dependent on the frequency of

user invocation. McAfee's MS-DOS program SCAN and the Macintosh Disinfectant series

are Class 3 products.

At present, three approaches to viral code detection have been prevalent. The

first, viral signature matching, requires information about the virus. Specifically, a

signature-based detector requires the virus’ code length and the location of its

“contagious” segment (which is essential to its replication and transfer among storage

media, computer memory, and networks). Code enumeration, a second technique,

involves examining known programs periodically to test whether any unknown

segments of code have been added to the original file. It is most effective when applied

before each execution of the program. Finally, checksum methods compare the current

size of a file and the summed value of its bytes to the same attributes of a known

uninfected version - an infection will often change these values. None of the above

methods require the use of the code of the virus itself in its entirety, and all three

require user action upon discovery of a virus. Usually, the computer user is urged to

restore an earlier copy of the infected file or files from backup. In specific cases,

disinfection is possible.
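The checksum method described above can be sketched in a few lines of C. This is only an illustration, not code from any of the products named here: the `fingerprint` structure and both function names are hypothetical, and the file is assumed to be already loaded into a memory buffer.

```c
#include <stddef.h>

/* Hypothetical record of a file's known-good state: its size and
   the sum of its bytes, the two attributes named in the text. */
struct fingerprint {
    size_t size;
    unsigned long sum;
};

/* Compute the size and byte sum of a buffer holding the file contents. */
struct fingerprint make_fingerprint(const unsigned char *data, size_t size)
{
    struct fingerprint fp;
    size_t i;

    fp.size = size;
    fp.sum = 0;
    for (i = 0; i < size; i++)
        fp.sum += data[i];
    return fp;
}

/* Return 1 if the current contents differ from the stored fingerprint
   in size or in byte sum, 0 otherwise. */
int fingerprint_changed(struct fingerprint known,
                        const unsigned char *data, size_t size)
{
    struct fingerprint now = make_fingerprint(data, size);

    return now.size != known.size || now.sum != known.sum;
}
```

A fingerprint taken from a clean copy is stored, and `fingerprint_changed` is called on the current contents before each run. A plain byte sum cannot see reordered bytes or padding chosen to leave the sum unchanged, which is one reason an infection changes these values often rather than always.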

Most software developers claim that the absence of the actual viral code from a

detection program prevents its reuse by other virus authors. These authors could

conceivably design a virus based on the stolen code and thwart the original detection

program. For this reason, developers choose against using a whole (deactivated) virus

in detectors. An opposing viewpoint suggests that the only defense against viral code

that will not inevitably fail is one which does not depend on the secrecy of its internal

mechanisms. A conflict has arisen between these two opinions, as each side handles
sensitive information very differently. The former advocates secrecy regarding
breaches of security, weaknesses in system defenses, and vulnerabilities of protective
software (which will be discussed later in this article), while the latter believes
revealing such details will attract support from developers of anti-viral programs and
prevent unexpected attacks.

MacDTS has argued that attempting to support anti-viral measures is a futile

struggle. This viewpoint fails to factor in the practical results approximate methods

have had with problems which were once considered intractable. On the other hand,

algorithmic detection of all viruses, including those for which no specimens exist, has

been established as an intractable problem. It has been shown that to determine

whether any given program is infected is undecidable.3 However, examination of the

sample code of a virus, once it has been discovered, allows a signature detection schema

to be implemented. Extending the use of this necessary information to both increase the

range of the detectable viral set and decrease the amount of data required to do so is a

logical goal. It still makes sense to include established algorithms suited for such
detection as part of the process. Tracking down new viruses quickly limits their
damage more than distributing cures long after the spread is under way.

Viral Variant Detection: an Algorithmic Approach

Choosing an appropriate class of algorithms depends on the treatment of the

subject data; in this case, it is beneficial to consider viral and program code as text.

For this problem, it is reasonable to use string matching procedures, which are

specifically oriented toward character sequences with a pre-defined alphabet (such as

the ASCII alphabet). In this context, viruses are searched for by two methods similar to
the “Find” and spelling check commands in most word processors. The first is an exact
search like “Find”; what is unique to virus detection is that the search must be
repeated many times (for hundreds of small pieces of the virus). The second, similar to
the “Suggest Word” command in a spelling checker, looks for approximate matches rather
than exact ones. For virus detection, deleted characters are weighted more heavily
than inserted characters.

We have developed algorithms that will detect known viruses and unknown clones

and mutants of those viruses. For our purposes a viral clone is defined as a known

virus which has been modified prior to its release without changing its viral properties

in any appreciable way. For instance, its replication and infection techniques and

“detonation” effect (damage done when its preset trigger goes off) should remain

identical. An example of a clone is the Hpat virus, a first-generation modified version

of the nVIR A Macintosh virus. A viral mutant is defined as a known virus to which

self-modification parameters have been added which cause it to create successive clones

of itself at intervals or upon a trigger event - mutation may occur after release and

may or may not be limited to a finite number of iterations. No Macintosh viral mutants

currently exist. The term strain is often applied interchangeably to groups of clones,

mutants, and even unrelated viruses (developed separately) which share any common

feature. Due to this ambiguity, the expression is not used except in reference to sets of

viruses previously designated as “strains”, such as nVIR A and B. The word variant is

applied to mean an altered virus which may or may not be known and whose function

may or may not have been changed during modification of its code; a variant is not

necessarily a clone, but all viral clones and generations of mutants are variants of their

ancestor viruses. Note that viruses of a single strain are not necessarily variants of

each other under this definition; an example is the WDEF strain (Macintosh), with

substrains A and B - they are so designated merely because they share the same code

label.

A primary method of creating clones and mutants is character padding, the

addition of code sequences or characters which do not affect the operation of the virus.

Safeguards against this technique are presented in the algorithm discussions. A more

difficult strategy is the removal of segments from a known virus - of course, this

cannot be carried out indefinitely or even for a large portion of the virus. Finally, a

virus may be designed to relocate itself once appended onto or inserted into a host

program; auto-propagation is formally considered a feature of worms, but shifting code

segments is another way of avoiding detection.

To implement variation-tolerant matching, one of several approaches may be

selected. First, approximate string matching for a text of length n, a pattern of length

m, and an integer k is among the most common of these. k is the maximum number of

differences allowed between a pattern string and the text which is being searched.

Algorithms exist which can compute an edit distance, based on the number and type of

differences. The “edit” operations are character deletion, insertion, and “twiddle”

(transformation of one character to any other). This distance metric can be computed
by dynamic programming, a method which breaks down a problem that would otherwise
require recursion and solves it by computing a table of partial results. A straightforward

implementation requires O(mn) time; a more complex version solves the problem in

O(kn) time4 with some overhead. Parallelization of the procedure allows the values to

be computed in O(k) complexity.

The second approach uses a fast substring matching function for small segments

of the viral code which is being searched for. The length of each segment is proportional

to the expected frequency of variation in the text by addition and deletion of characters.

Since the base algorithms and user interface used in this project have been developed

elsewhere, our work focuses on general methods for virus detection, rather than

implementation issues.

Experimental Input

This code was developed on a UNIX machine before porting to the Mac. Input data

at all stages of program development consisted of ASCII and binary data treated as ASCII

text (the smallest alphabetic unit was one byte). The “text” used in mainframe testing

comprises alphanumeric text, compiled binaries (executable and object files), and

ASCII script files. An application was developed for use on the Mac, for which known

viruses and their variants exist.5 All text used in the microcomputer development

stage consisted of resource data because two primary requirements of viral code - the

abilities to replicate and gain control of the operating system - require the execution of

the infectious resources.

During the experimental stage, our pattern and text strings were obtained by

extracting CODE resources from files found on the average Macintosh hard disk; the

segments that were used are listed below. All input data was processed with ASCII

character-handling functions. Simple character arrays were used to store both

strings. Space requirements were relatively small for the string matching algorithms

used. Internal data structures included: a matrix of O(2m) size in the first algorithm -

the array requires O(mn) space, but the dynamic programming method only needs two

columns at a time - and two arrays for internal computation by the Boyer-Moore

algorithm.

The Boyer-Moore Algorithm

The string-matching algorithm developed by Boyer and Moore6 for substring

matching has proven significantly faster, in practice, than both straightforward

scanning and the finite-state automaton technique implemented by Knuth, Morris, and

Pratt. This advantage applies even to binary strings, and becomes increasingly evident

as the size of the alphabet increases. Thus, the number of character comparisons per

text character scanned is even lower for executables than for alphanumeric text. The

Boyer-Moore algorithm employs right-to-left scanning of the pattern string while

attempting to find a match within the text body. The main savings are achieved by

computing two failure functions which store, for each character in the pattern and the

alphabet, respectively, the number of positions to be skipped when a mismatch occurs.

Boyer and Moore suggest that entries from both arrays be compared and the larger skip

selected. The Boyer-Moore string search requires m+n comparisons in the worst case,

and can reliably use n/m steps for large alphabets and short pattern strings.
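The right-to-left scan and the alphabet-indexed skip table can be illustrated with a deliberately simplified search. The sketch below is Horspool's variant of Boyer-Moore, which keeps only the per-alphabet table and omits the per-pattern-position table, so it is not the full algorithm described above. The name `bmh_search` is an assumption of this example, and bytes are treated as the alphabet.

```c
#include <limits.h>
#include <stddef.h>

/* Simplified Boyer-Moore search (Horspool's variant): only the
   per-alphabet skip table is used. Returns the index of the first
   occurrence of pat in text, or -1 if there is none (an empty
   pattern is treated as "not found"). */
long bmh_search(const unsigned char *text, size_t n,
                const unsigned char *pat, size_t m)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i, pos;

    if (m == 0 || m > n)
        return -1;

    /* A byte that never occurs in the pattern allows a skip of m. */
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = m;
    /* Otherwise align the byte's last occurrence in pat[0..m-2]. */
    for (i = 0; i + 1 < m; i++)
        skip[pat[i]] = m - 1 - i;

    pos = 0;
    while (pos + m <= n) {
        /* Compare the pattern against the text from right to left. */
        i = m;
        while (i > 0 && pat[i - 1] == text[pos + i - 1])
            i--;
        if (i == 0)
            return (long)pos;
        /* Shift by the skip for the text byte under the pattern's end. */
        pos += skip[text[pos + m - 1]];
    }
    return -1;
}
```

Because the table is indexed by byte value, the routine works on binary resource data as well as on text. With a large alphabet most text bytes do not occur in the pattern, so most shifts are the full pattern length.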

Our modification of the Boyer-Moore algorithm involved the introduction of a

randomized system of string selection. An integer l was chosen to be sufficiently large

that an accidental match of a substring of length l was extremely unlikely. We

determined this likelihood experimentally (see the discussion below). The pattern

source was an original (unmodified) sample of a known virus. Strings of length l were

chosen randomly by generating an index between 0 and m-l and designating the next l

characters (including the indexed one) of the source string as the pattern P. It was

postulated that this probabilistic factor would establish tolerance for simple changes

made to viral code by a potential author in possession of existing code. These changes

include:

• Disassembling the viral code and changing variable identifiers.

• Padding null characters or sequences to calibrate the virus checksum.

• Removing small, superfluous amounts of code from the original virus.

• Automatic padding within a viral mutant.

• Pasting viral code segments under new labels or merging segments.

• Reversal of code order using logical jumps.
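The selection step described above can be sketched as follows. This is a minimal illustration, not the article's listing: `count_random_hits` and `contains` are hypothetical names, `contains` is a plain exact search standing in for Boyer-Moore, and `rand()` with a modulus is assumed to be random enough for the purpose.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in exact search; any fast substring search could replace it. */
static int contains(const unsigned char *text, size_t n,
                    const unsigned char *pat, size_t l)
{
    size_t pos;

    if (l == 0 || l > n)
        return 0;
    for (pos = 0; pos + l <= n; pos++)
        if (memcmp(text + pos, pat, l) == 0)
            return 1;
    return 0;
}

/* Draw `trials` random substrings of length l from the virus sample
   and count how many occur exactly in the text. Returns -1 if l is
   zero or longer than the sample. */
int count_random_hits(const unsigned char *virus, size_t m,
                      const unsigned char *text, size_t n,
                      size_t l, int trials)
{
    int t, hits = 0;

    if (l == 0 || l > m)
        return -1;
    for (t = 0; t < trials; t++) {
        /* Random start index between 0 and m-l, as in the text. */
        size_t start = (size_t)rand() % (m - l + 1);

        if (contains(text, n, virus + start, l))
            hits++;
    }
    return hits;
}
```

A caller would seed the generator once (for example with `srand`), run the count against each suspect resource, and compare the number of hits with a chosen limit. Since the sampled strings differ on every run, a virus author cannot know in advance which bytes will be searched for.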

The probability of a match is experimentally shown (through the tests described

below) to be extremely small when a virus is not present - it is clearly possible to

discriminate with high precision between infected and uninfected files. An exact match

is a very difficult event to duplicate coincidentally; the likelihood of such a match

between random strings is infinitesimal even in practice. False positives are

relatively rare, though more common than false negatives. Thus, use of a randomized

algorithm appears to be a feasible approach to generalized (“inter-clone”) viral

detection.

Manual “mutation” of code is already becoming commonplace, as is evidenced by

the multiple clones of the nVIR strain which already exist on the Apple Macintosh.

Simple self-modification has been accomplished in the Core Wars class of programs,

and it is not at all unfeasible for a simple virus to be programmed to pad itself with

null or checksum-neutral character sequences in an effort to evade detection. Such

changes would appear trivial under human inspection. The straightforward searching

techniques used in current commercial products, by contrast, are unable to handle even

trivial changes. Early efforts to deal with the emergence of viral clones involve

omission of parts of the viral signature or selectively summing or enumerating only

specified portions of the suspected code.7 This approach lacks generality, however; it

is not guaranteed to be proof against even a single revision of a known virus, and is

certain to fail against an evolutionary version.8

Important advantages of randomization include the fact that the instructions of

viruses need not be physically oriented in their order of execution, but may instead be

scrambled by jump instructions (see Figure 1). A second consideration is that

preselection of a single segment of code (i.e., the “signature”), as the search pattern,

renders the anti-viral system susceptible to circumvention. Once the identity of the

target code is discovered, the procedure may be fooled in one of several ways: by

specifically changing or deleting the targeted string; by shifting its physical position;

or by disguising it using character padding. Note that none of the above techniques

requires real knowledge of how the virus works! A slightly more sophisticated author

may easily disassemble the executable code and change certain variable identifiers to

thoroughly mask the virus. These variations, in addition to changes made to hide the

virus without any detector in mind, may be virtually bypassed when the search string

is different for each scan run.

Figure 1

A major concern in refining the probabilistic extension of the Boyer-Moore

search is the selection of a string length l. This choice is affected by at least three

factors, the most important of which is the chance of a false positive result. A
false alarm is highly improbable when the pattern and text are unrelated (as is
experimentally demonstrated and documented in the tables below), and the vast majority
of legitimate code lacks viral aspects and is dissimilar to the virus search pattern,
so the overall likelihood of a false alarm is low.

Moreover, false alarms are easily avoided by making l as high as possible. On

the other hand, l must be made shorter to make the modified Boyer-Moore procedure

less sensitive to padding. Figure 2 illustrates a padded clone below an original virus.

Each padding sequence ensures that at most l - 1 out of the m - l + 1 possible strings
will fail to be matched, but paddings within l - 1 bytes of each other will overlap
and “mask” fewer

strings. An important feature of drawing random strings from the original virus is

that the length of padding sequences is irrelevant; only their frequency must be

considered.
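The effect of padding frequency on the pool of candidate strings can be checked with a small counting routine. This is an illustrative sketch rather than part of the detector: `intact_windows` and its gap numbering are assumptions made for the example, and padding bytes are assumed never to reproduce the original bytes by accident.

```c
#include <stddef.h>

/* Count the length-l substrings of an m-byte original that still
   occur unbroken after padding has been inserted at the given gaps.
   gaps[i] = g means padding was inserted just before original byte g
   (1 <= g <= m-1). How much padding is inserted at a gap does not
   affect the count. */
size_t intact_windows(size_t m, size_t l, const size_t *gaps, size_t ngaps)
{
    size_t s, i, intact = 0;

    if (l == 0 || l > m)
        return 0;
    for (s = 0; s + l <= m; s++) {
        int broken = 0;

        /* The window covering bytes s..s+l-1 is broken by any gap
           that falls strictly inside it. */
        for (i = 0; i < ngaps; i++) {
            if (gaps[i] > s && gaps[i] < s + l) {
                broken = 1;
                break;
            }
        }
        if (!broken)
            intact++;
    }
    return intact;
}
```

For m = 20 and l = 5 there are 16 candidate strings. One padding point in the middle breaks l - 1 = 4 of them, while two padding points two bytes apart break 6 rather than 8, because the strings they mask overlap.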

Figure 2

The second factor in determining l is the instruction length of the viral host

machine; the unused space in each segment of a binary executable is filled with null

(neutral) characters, and a selection of sampled pattern strings containing a high

proportion of such characters is likely to contain an excessive number of strings which

match with an uninfected text file. One minor weakness will be present regardless of

the string length chosen: the virus author will always be able to defeat the randomized

filter by increasing padding frequency (although this cannot be done indefinitely). This

is one example of the “strength in secrecy” argument in anti-virus programming. On

the other hand, the dynamic programming method is reasonably tolerant to padding.

A Dynamic Programming Approach

While randomized string-search algorithms present a viable next step in

developing countermeasures to computer virus proliferation, they are only a

refinement of the simple straightforward technique. An exact match is required for

each string, regardless of how short it may be or how many others are selected and

compared. Therefore, the approach is subject to failure under two conditions, the latter of which

results in a false alert.

First, if the length of the randomly selected strings consistently exceeds the

distance between padded or removed characters, the algorithm will fail to achieve any

matches. Second, the program will erroneously report viruses when the text contains

code which is sufficiently similar to the sample virus data to effect more matches than

the allowable limit. A heuristic is needed which will deterministically verify or refute

the presence of the virus and yield consistent results on every run. Since we are

specifically dealing with variants of known viruses, an approximate matching

procedure is required.

Fast string matching has traditionally been applied to many text search

problems. Where a partial match is available, dynamic programming offers an

efficient solution. The algorithm used in our experiments is a straightforward dynamic

implementation which relies on a matrix whose components are computed based on

previous entries. The scan function is designed to return the boolean true upon

encountering any instance of a k-approximate match between pattern P = p1 p2 ... pm
and text T = t1 t2 ... tn for a positive integer k. Assume that n is large relative to m.
The following rules are used.9

1. Let D be an m x n matrix of integers for which D[i,j] equals the minimum number of
differences between p1 ... pi and a segment of T ending at tj.

2. A k-approximate match is detected at any j for which D[m,j] ≤ k.

3. The rules for computing D[i,j] consider each of the possible differences that may

occur at pi and tj, and the instance for which the two characters match. D[i,j] is

assigned the minimum of the following three values:

a) If pi = tj then D[i-1,j-1] else D[i-1,j-1]+a.

b) D[i-1,j]+b (the case where pi is missing from T (deletions)).

c) D[i,j-1]+c (the case where tj is missing from P (insertions)).

a, b, and c are the integer values added whenever a mismatch occurs, and are the

central parameters in our modification. Each entry is updated by inspection of the

entries above it, to its left, and to its upper left.
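The three rules can be sketched as a two-column computation in C. This is a minimal illustration under stated assumptions, not the accompanying listing: the function name `approx_match` and its parameter names are hypothetical, with `sub_cost`, `del_cost`, and `ins_cost` standing for a, b, and c.

```c
#include <stdlib.h>
#include <stddef.h>

/* Weighted k-approximate search following the three rules above.
   Returns 1 if some segment of text matches pat with total cost <= k,
   0 if not, and -1 if memory cannot be allocated. */
int approx_match(const unsigned char *pat, size_t m,
                 const unsigned char *text, size_t n,
                 long sub_cost, long del_cost, long ins_cost, long k)
{
    long *prev, *cur, *tmp;
    size_t i, j;
    int found = 0;

    /* Only the previous and current columns of D are kept. */
    prev = malloc((m + 1) * sizeof *prev);
    cur = malloc((m + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return -1;
    }

    /* Column 0: against empty text, p1..pi costs i deletions. */
    for (i = 0; i <= m; i++)
        prev[i] = (long)i * del_cost;
    if (prev[m] <= k)
        found = 1;

    for (j = 1; j <= n && !found; j++) {
        cur[0] = 0;                  /* a match may start anywhere in T */
        for (i = 1; i <= m; i++) {
            /* Rule a: D[i-1,j-1], plus a on a mismatch. */
            long best = prev[i - 1]
                      + (pat[i - 1] == text[j - 1] ? 0 : sub_cost);
            long del = cur[i - 1] + del_cost;   /* rule b: D[i-1,j] + b */
            long ins = prev[i] + ins_cost;      /* rule c: D[i,j-1] + c */

            if (del < best)
                best = del;
            if (ins < best)
                best = ins;
            cur[i] = best;
        }
        if (cur[m] <= k)             /* D[m,j] <= k: match ending at tj */
            found = 1;
        tmp = prev;
        prev = cur;
        cur = tmp;
    }

    free(prev);
    free(cur);
    return found;
}
```

With a = b = 100, c = 1, and k = 50, the text "xxVI--RUSxx" matches the pattern "VIRUS" at a cost of 2 for the two padded characters, while "xxVIUSxx" does not match, because accounting for the missing R costs at least 100.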

Figure 3

The computation may be done in O(2m) space since only the current and

previous columns need to be stored. The work requires O(mn) complexity in the

straightforward implementation, but can be achieved in O(kn) time using the improved

serial technique by Landau and Vishkin.10 The standard application of string matching

by dynamic programming uses a constant value for a, b, and c (for instance, 1). Our

method boosts tolerance for padded characters by increasing the ratio between the

parameters b and c.

Let r be this ratio, a positive integer. If c is assigned a unit value i, then the
matching function may be made tolerant to cases in which characters in the text are
missing from the pattern (i.e., the text has been padded) by setting a and b equal to
r·c, making the effective “price” for padded characters lower. A final consideration is the

selection of the “threshold” k. It may be determined based on the expected frequency of

padding, as is the string length in our randomized Boyer-Moore component. Since r has

already been defined relative to i and m, it is a fairly simple task to assign k a value.

Typically, it should be close to r (actually, slightly lower to ensure against false

alerts) and may be computed using the ratio m/n or simply set to a large fraction of r.

Our padding-tolerant implementation uses 1 for i and c, 100 for r, a, and b, and 50 for

k. A false match is possible whenever the ratio r is greater than i. However, this only

holds in the absolutely worst cases in which an extremely small pattern is matched

against a text string of very high length. The probability of this event is equal to that of

the consecutive occurrence of all m characters in P within k+m positions of each other.

Again, this is experimentally shown to be a statistically rare occurrence, which can

reliably be ignored as long as the viral segment length m is not much smaller than r.

Generally, a value for r that is higher than the threshold k can be expected to yield few

false alarms and will rarely miss a variant created by padding. Conversely, tolerance

for missing characters may be effected by increasing the ratio between c and b; in both

cases, a is assigned the larger of the two values.

Future applications (and viral threats)

The code below introduces two effective methods of computer virus detection,

using newly developed modifications of proven algorithmic techniques. Though

previously used in many other applications of computation, these systems are applied

here for the first time to the problem of viral code identification. Despite the

previously established results on the intractability of universal detection put forth by

Cohen, a new class of post-infection scanning methods seems entirely feasible. Further

investigation into the circumvention of virus concealment techniques produced

experimental results which have supported our assumptions concerning the

probabilities of detection and false positives, and support the main premise of both of

the algorithms used: that a standard string matching program may be adapted for

tolerance toward modification of the text to be scanned. Two significant questions

remain concerning general virus detection: First, can clone and mutation detection be

extended under a strictly algorithmic foundation to include a broader range of detectable

code - especially groups of viruses which have not yet been developed? Second, what

optimizations may be performed on the programs to increase speed without sacrificing

probabilistic safety? One possible solution is offered through the accompanying

program.

Work in string matching, like work in virus detection, is by no means complete.

Modern algorithms make use of parallel hardware and improved data structures, such as

suffix trees (which may be respectively applied to randomized matching and dynamic

programming). Mutating viruses are by no means prevalent yet and have (fortunately)

not appeared in the Macintosh operating system. All recent research in

“compuvirology”, however, suggests that such programs are feasible and may debut

soon - if not on the Mac, then possibly on a larger-scale machine. The viral

“visibility” threshold (i.e., the typical size of a virus compared to the average

executable size and the machine's general capacity) would be even lower. As an

illustration, consider that current viruses approach an order of 10 kilobytes in length

and would be considered gigantic if they appeared on machines of 20 years ago. As

machine size increases, utilities for virus detection may possess the same precision,

but this is not sufficient - they must also match the increasingly sophisticated products

of virus authors. Using advances in fast string matching and parallel computing, the

software industry can stay not one but many steps ahead of viral attackers.

Code resources used in Randomized Boyer-Moore Experiments

The randomized Boyer-Moore program was tested on many files to illustrate that

on an average Macintosh system, the likelihood of false positives is low. To draw this

conclusion, resources from several common Macintosh programs were extracted and

searched for variants of nVIR.

Below is a listing of the origin of each code segment used for the generation of

Table 2, detailing the size of the file and the application name and resource type

(required for Macintosh operating system classification) from which it was extracted.

Group  Segment  Source File                        Length (bytes)
B      1        MS Word, CODE 1                     1474
B      2        Disinfectant 2.0, CODE 7            1116
B      3        Red Ryder 9.4, CODE 37              2164
B      4        SuperPaint 2.0, CODE 42             2480
C      1        WordPerfect, CODE 31                5002
C      2        Font/DA Mover, CODE 1               4670
C      3        WordPerfect File, CODE 1            4542
C      4        ZTerm 0.85, CODE 5                  4378
D      1        HyperCard, CODE "HyperTools 2"     26078
D      2        Disinfectant 2.0, CODE 5           18720
D      3        THINK C Debugger, CODE 2           21960
D      4        SuperPaint 2.0, CODE 20            19754

References

[Baase 88] Baase, Sara. Computer Algorithms: Introduction to Design and

Analysis. Addison-Wesley, Reading, MA, 1988.

[Boyer 77] Boyer, R.S., Moore, J.S. A Fast String Searching Algorithm. In
Communications of the ACM, pages 762-772. October, 1977.

[Cohen 86] Cohen, Fred B. Computer Viruses. PhD thesis, Electrical

Engineering Department, University of Southern California, December, 1986.

[Landau 86] Landau, Gad M., Vishkin, Uzi. Introducing Efficient Parallelism Into
Approximate String Matching and a New Serial Algorithm. In Proceedings of the 18th
Annual ACM Symposium on Theory of Computing, pages 220-230. 1986.

[McAfee 88] McAfee, John. Implementing Anti-Viral Programs: Special Report

for the Computer Virus Industry Association. Public Information Packet. InterPath

Corporation, Santa Clara, CA, 1988.

[McAfee 89] McAfee, John. Computer Viruses, Worms, Data Diddlers, Killer

Programs, and Other Threats to Your System: What They Are, How They Work, and How

to Defend Your PC, Mac, or Mainframe. St. Martin's Press, New York, 1989.

Listing 1: Dynamic.c

/* Dynamic.c - Functions for dynamic k-approximate virus infection

detection.

Thanks to John Norstad and Ephraim Vishniac for help and comments.

Portions of this code are based on [Morton 90], which appears in the

May 1990 issue of MacTutor. Reused with permission. You may copy,

alter, use, and distribute all code listed here if you leave the file

unchanged up to this line.

Think C version.

Notes:

• A main advantage of this code, as explained in the "Methods and

History" section, is that its effectiveness is not diminished by its

availability. No matter how many potential virus authors read it,

the algorithm will remain equally effective against circumvention.

• To use this code in your programs:

1. It will be necessary to obtain non-functional but significant

(larger than 300 bytes) resource segments from the virus you are

trying to detect.

2. Using a resource editor, insert the viral data under an unused

type, such as 'VDAT', used in the code below -- this will render the

virus code inactive and most likely invisible to conventional (Class

2 and 3) detection programs. As an added security measure, you may

wish to include only code segments above 300 bytes (or a similar

threshold length) to ensure that the virus is crippled.

3. Both the C routines and the Boyer-Moore routines require an

expanded 512K Mac or later (specifically, System 3.2 or later); they

have been fully tested on the SE, II, and IIcx.

4. This function should be run upon first launching your

application, or, if it is an operating system utility, during a

"dormant" or idle period.

5. The code below assumes that the VDAT resource contains all 5

segments of nVIR A; change this accordingly by adding additional

virus types (under a name other than the original infected type) */

#include "dec.h"

FILE *my_file;

void dynamic()

{

char m[MAXSIZE];

int pattern_length, index;

MATRIX table;

short resCount;

ToolBoxInit();

CurResFile();

resCount = Count1Resources ('VDAT');

/* how many of this type are there? */

open_file (&my_file, WRITE_MODE);

#ifdef _REPORT /* developer debug flag */

printf("Searching for <>:\n\n");

fprintf(my_file, "Searching for <>:\n\n");

#endif

while (resCount) /* loop down to 1 */

{

if (resCount == 3)

{

printf("\nSearching for <>:\n\n");

fprintf(my_file, "\nSearching for <>:\n\n");

}

rsrc = Get1IndResource ('VDAT', resCount);

/* get the resource's handle, but don't load it */

index = SizeResource (rsrc);

HLock (rsrc);

/* load next virus segment */

pattern_length = copy_array (*rsrc, m, &index);

#ifdef _REPORT

printf("Next virus segment loaded (length %d). "

"Resources left to scan: %d\n", pattern_length, resCount);

fprintf(my_file, "Next virus segment loaded (length %d). "

"Resources left to scan: %d\n", pattern_length, resCount);

#endif

HUnlock (rsrc);

initialize(&table, pattern_length+1);

vResCheck('nVRB', m, pattern_length, table, NO_REPORT);

# ifdef _REPORT

printf("\n");

fprintf(my_file, "\n");

# endif

vResCheck('nVRA', m, pattern_length, table, NO_REPORT);

# ifdef _REPORT

printf("\n\n");

fprintf(my_file, "\n\n");

# endif

--resCount;

}

fclose(my_file);

}

void initialize(table, length)

MATRIX *table;

int length;

{

allocate_table(table, length);

clear_table(*table, length - 1); /* rows run 0..length-1; clear_table's bound is inclusive */

}

void allocate_table(table, size)

MATRIX *table;

int size;

{

int i;

*table = (MATRIX)calloc(size, (size_t)sizeof(long *));

for (i = 0; i < size; i++)

(*table)[i] = (long *)calloc(2,(size_t)sizeof(long));

}

void clear_table(table, length)

MATRIX table;

int length;

{

int i;

for (i = 0; i <= length; i++)

table[i][0] = (long)UNIT*i;

}

/* vResCheck - Perform dynamic string search on all resources of a

specified type in the current application. */

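void vResCheck (type, m, pattern_length, table, report)

ResType type; /* INPUT: resource type to scan */

char m[MAXSIZE];

int pattern_length;

MATRIX table;

int report; /* INPUT: >0 => report errors with debugger */

{

short resCount; /* number of resources of this type */

short oldResFile; /* for preserving current resource file */

Boolean oldResLoad; /* for preserving "ResLoad" flag */

Handle rsrc; /* handle to the resource being scanned */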

Boolean found;

char n[MAXSIZE];

int text_length, local_count = 1;

int index;

/* Switch to the application's resource file. Note that all

resource calls from here on are the "one deep" calls from Inside Mac,

Volume

IV. */

oldResFile = CurResFile();

/* remember initial resource file */

oldResLoad = ResLoad; /* remember "ResLoad" state */

resCount = Count1Resources (type);

/* how many of this type are there? */

if (report)

{

fprintf(my_file, "Text string ");

printf("Text string ");

}

while (resCount) /* loop down to 1 */

{ /* get the resource's handle, but don't load it */

rsrc = Get1IndResource (type, resCount);

/* see if it's already in memory */

if (!rsrc) /* not available? */

{

if (report > 0) /* debugging flag */

DebugStr ("\pResource not available!");

goto EXIT;

}

else

{

index = SizeResource (rsrc);

HLock (rsrc);

found = FALSE;

while ((text_length = copy_array(*rsrc, n, &index)) &&

(!found))

{

fprintf (my_file, "%d ", local_count);

printf ("%d ", local_count);

local_count++;

clear_table (table, pattern_length);

if (pattern_length <= text_length)

{

if (compare (m, n, pattern_length,

text_length, table))

found = TRUE;

}

} /* end of loop through this resource's data */

HUnlock (rsrc);

}

--resCount; /* get next index number */

} /* end of loop through resources */

EXIT: /* goto here on tampering or error */

UseResFile (oldResFile); /* restore original resource file */

SetResLoad (oldResLoad); /* restore original loading state */

} /* end of vResCheck() */

/* compare: the actual dynamic programming algorithm, modified to a

level of padding tolerance defined by THRESHOLD */

char compare(p, t, pattern_length, text_length, table)

char p[], t[];

int pattern_length, text_length;

MATRIX table;

{

long value1, value2, value3;

int i, j, flip, beep;

flip = TRUE;

for (j = 1; j <= text_length; j++)

{

table[0][flip] = 0;

for (i = 1; i <= pattern_length; i++)

{

if (p[i-1] == t[j-1]) /* initialize */

value1 = table[i-1][!flip];

else

value1 = (table[i-1][!flip])+UNIT;

value2 = (table[i-1][flip])+UNIT;

/* UNIT: the original algorithm uses this

weight for all variations in the text */

value3 = (table[i][!flip])+EPSILON;

/* EPSILON: small weighted "distance" --

as opposed to the single unit */

table[i][flip] = MIN3(value1, value2, value3);

/* see discussion of dynamic */

}

if (table[pattern_length][flip] <=

THRESHOLD)

{

if (report)

{

printf("%ld-approximate match found.\n",

table[pattern_length][flip]);

fprintf(my_file, "%ld-approximate match found.\n",

table[pattern_length][flip]);

}

return (TRUE);

}

flip = !(flip); /* only an O(2m)-sized array is needed to

simulate a "matrix", because only 2 columns are used */

}

return (FALSE);

}

/* data.c: data structure operations (static allocation) for dynamic

AND Boyer-Moore algs */

int read_array(fp, array)

FILE *fp;

char array[];

{

char c;

int n;

n = 0;

while(((c = fgetc(fp)) != EOF) && (n < MAXSIZE)) /* Read one

element */

{

array[n] = c;

n++;

}

if (c != EOF)

ungetc(c, fp);

return(n);

}

/* fileops.c : file operations for dynamic */

#include "dec.h"

#include "errors.h"

char open_file(fp, operation)

FILE **fp;

char *operation;

{

SFReply reply;

char filename[BUFSIZ];

GetfileName(&reply);

strcpy(filename, (char *)reply.fName);

/* open the file only if the user confirmed the dialog */

*fp = (reply.good) ? fopen(filename, operation) : NULL;

if (*fp == NULL) {

fprintf(stderr, CANNOT_OPEN_FILE, filename);

exit_cleanly(NO_ERROR, EXIT_FAILURE);

}

else return(TRUE);

}

/* The following routines deal with files, using the Macintosh HFS

throughout. */

/* GetfileName: read a file name using the HFS */

GetfileName(reply)

SFReply *reply;

{

Point dlgPoint;

Str255 defName = "\pDynamic Output";

int numTypes = 1;

dlgPoint.h = 100; /* position of the 'open' dialog box */

dlgPoint.v = 100;

SFPutFile (dlgPoint, "\pSave output file as", defName,

NIL_POINTER,

reply);

PtoCstr ((char *) (*reply).fName);

/* convert from PASCAL to 'C' string */

} /* GetfileName */

/* dec.h - dynamic definitions and declarations */

#include

#define FALSE 0

#define TRUE 1

#define NO_REPORT -1

#define K 1024

#define MAXSIZE 8*K

/* All of the tweaking is done here */

#define UNIT 1

#define EPSILON 1

#define THRESHOLD UNIT*50

#define READ_MODE "r"

#define WRITE_MODE "w"

#define APPEND_MODE "a"

#define MIN2(a, b) (((a) < (b)) ? (a) : (b))

#define MIN3(a, b, c) ((MIN2((a), (b)) < (c)) ? MIN2((a), (b)) : (c))

#define MAX2(a, b) (((a) > (b)) ? (a) : (b))

#define NIL_POINTER 0L

#define NIL_STRING "\p"

#define IGNORED_STRING NIL_STRING

#define NIL_FILE_FILTER NIL_POINTER

#define NIL_DIALOG_HOOK NIL_POINTER

#define VDAT_RES_ID 0

typedef long **MATRIX;

void initialize(), clear_table(), vResCheck(), allocate_table(),

error_message(), exit_cleanly(), main();

char open_file(), compare();

int read_array();

int copy_array(array1, array2, bytes_left)

char array1[], array2[];

int *bytes_left;

{

int bytes_gotten = 0;

if (!(*bytes_left))

return (FALSE);

if (*bytes_left < MAXSIZE)

{

memmove ((void *)array2, (void *)array1, (size_t)(*bytes_left));

bytes_gotten = *bytes_left;

*bytes_left = 0;

}

else

{

memmove ((void *)array2, (void *)array1, (size_t)MAXSIZE);

bytes_gotten = MAXSIZE;

*bytes_left -= MAXSIZE;

}

return (bytes_gotten);

}
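The weighted table update in compare() can also be tried outside Think C. The following is a minimal ANSI C sketch of the same two-column scheme, not the article's code: the function name best_approx_distance and the weight values (UNIT = 2, EPSILON = 1) are assumptions chosen for illustration, and it returns the smallest distance found instead of stopping at the first value under THRESHOLD.

```c
#include <stdlib.h>

/* Illustrative weights only; the listing's dec.h sets both to 1. */
#define UNIT    2L   /* cost of a substituted or missing pattern byte */
#define EPSILON 1L   /* cost of an extra (padding) byte in the text   */

static long min3(long a, long b, long c)
{
    long m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

/* Hypothetical helper: smallest weighted distance between the pattern
   p[0..m-1] and any substring of the text t[0..n-1], or -1 if memory
   cannot be allocated. */
long best_approx_distance(const char *p, size_t m, const char *t, size_t n)
{
    long *prev, *cur, *tmp, best;
    size_t i, j;

    prev = malloc((m + 1) * sizeof *prev);
    cur = malloc((m + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return -1;
    }

    for (i = 0; i <= m; i++)          /* column 0: empty text prefix */
        prev[i] = (long)i * UNIT;
    best = prev[m];

    for (j = 1; j <= n; j++) {
        cur[0] = 0;                   /* a match may start anywhere */
        for (i = 1; i <= m; i++) {
            long diag = prev[i - 1] + ((p[i - 1] == t[j - 1]) ? 0 : UNIT);
            long up = cur[i - 1] + UNIT;      /* pattern byte missing */
            long left = prev[i] + EPSILON;    /* padding byte in text */
            cur[i] = min3(diag, up, left);
        }
        if (cur[m] < best)
            best = cur[m];
        tmp = prev;                   /* swap the two columns */
        prev = cur;
        cur = tmp;
    }

    free(prev);
    free(cur);
    return best;
}
```

With these weights, searching for "VIRUSCODE" in "xxxVIRUSCODExxx" gives a distance of 0, while "xxxVIR--USCODExxx" (two padding bytes) gives 2. Making EPSILON smaller than UNIT is what lets padded variants stay under the threshold.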

Listing 2: BoyerMoore.c
/* BoyerMoore.c - Functions for fast, variation-tolerant, randomized
virus infection detection.
 Copyright © 1992 by William H. Hsu.
 Think C version.
 Notes:
 • As explained in the dynamic algorithm code, these routines
tolerate a wide variety of variations, including padded and
mutating viral code. */
/* byrmoore.c: main  searching file */
#include "dec.h"
void boyer_moore()
{
  FILE *my_file;
  char n[MAXSIZE], *sub_string, **pattern_array;
  int text_length, i, j, sum, match_count, size, divisor,
total_match, index, index2, vdat_count,  refNum,
 files_to_scan = 5, total_virus_length;
  int *pattern_length_array, *pattern_index_array; /* virus segment
lengths and delimiters */
  long file_size, total_file_size;
  register Handle rsrc, rsrc2;
/* Note: the code which scans files in the same way that
Disinfectant does is far too long to include in this article.  The
array used below is for the purpose of example only.  John Norstad
has made the enumeration part of his program publicly available (by
FTP at acns.nwu.edu) */
  Str255 ResFileArray[5] = {"\pOne*", "\pTwo*",  "\pThree",
"\pFour", "\pFive"};
  Str255 DescriptionArray[5] = {"File 1\t", "File 2\t",  "File
3\t", "File 4\t", "File 5\t"};
  ResType typeName;
  short resCount, typeCount, resCount2;
  srand((unsigned)time(NULL));
  ToolBoxInit();
  open_file (&my_file, WRITE_MODE);
  csettabs (TABS, stdout);
#ifdef _REPORT
  printf("File description\t\t\tScore\tFile size\t"
         "Algorithm's Decision\n");
  printf(
"================\t\t\t=====\t=========\t====================\n\n");
  fprintf(my_file, "File description\t\t\tScore\tFile size\t"
          "Algorithm's Decision\n");
  fprintf(my_file,
"================\t\t\t=====\t=========\t====================\n\n");
#endif
  sub_string = (char *)calloc(SIZE, sizeof(char));
  CurResFile();
  resCount = Count1Resources ('VDAT');
  vdat_count = resCount;
  pattern_length_array = (int *)calloc(resCount, sizeof(int));
  pattern_index_array = (int *)calloc(resCount, sizeof(int));
  pattern_array = (char **)calloc(resCount,
 sizeof(char *));
  while (resCount) /* loop resCount down to 1 */
  { /* get  handle, but don't load it */
    rsrc = Get1IndResource ('VDAT', resCount);
    index = SizeResource (rsrc);
    HLock (rsrc);
    pattern_array[resCount-1] = (char *)  calloc(index,
sizeof(char));
    pattern_length_array[resCount-1] = copy_array (*rsrc,
pattern_array[resCount-1], &index);
    pattern_index_array[resCount-1] = ((resCount < vdat_count) ?
(pattern_length_array[resCount- 1] +
pattern_index_array[resCount]) : pattern_length_array[resCount-1]);
    HUnlock (rsrc);
    --resCount;
  }
  total_virus_length = pattern_index_array[0];
  SetResLoad (true);
  for (i = 0; i <  files_to_scan; i++)
  {
    refNum = OpenResFile (ResFileArray[i]);
    match_count = 0;
    divisor = 0;
    for (j = 0; j < ITERATIONS; j++)
    {
      total_file_size = 0;
      sum = 0;
      while (!random_string(pattern_array, sub_string,
pattern_index_array,  total_virus_length, SIZE,
vdat_count));
      typeCount = Count1Types ();
      while (typeCount)
      {
        Get1IndType (&typeName, typeCount);
        resCount2 = Count1Resources (typeName);
        while (resCount2)
        {
          rsrc2 = Get1IndResource (typeName, resCount2);
          index2 = SizeResource (rsrc2);
          file_size = 0;
          HLock (rsrc2);
          while (text_length = copy_array (*rsrc2, n, &index2))
          {
            compare(sub_string, n, SIZE,text_length, &sum);
            file_size += text_length;
          }
          HUnlock (rsrc2);
          total_file_size += file_size;
          --resCount2;
        }
        --typeCount;
      }
      divisor++;
      if (sum)
        match_count++;
    }
#  ifdef _REPORT
    printf("%s\t\t", DescriptionArray[i]);
    fprintf(my_file, "%s\t\t", DescriptionArray[i]);
    printf("%d\t\t%ld\t", match_count, total_file_size);
    fprintf(my_file, "%d\t\t%ld\t", match_count, total_file_size);
# endif
    if (match_count >= (divisor/LIMIT))
    {
# ifdef _REPORT
      printf("\t%s\n", INFECTED_STRING);
      fprintf(my_file, "\t%s\n", INFECTED_STRING);
# endif
    }
    else
    {
# ifdef _REPORT
      printf("\t%s\n", CLEAN_STRING);
      fprintf(my_file, "\t%s\n", CLEAN_STRING);
# endif
    }
    CloseResFile(refNum);
  }
  printf("\nScore represents matches out of %d, with %d needed to "
         "diagnose infection.\n", divisor, divisor/LIMIT);
  fprintf(my_file, "\nScore represents matches out of %d, with %d "
          "needed to diagnose infection.\n", divisor, divisor/LIMIT);
  free(sub_string);
  fclose(my_file);
}
char random_string(string_array, sub_string, index_array, length,
substring_length, vdat_count)
char **string_array, sub_string[];
int index_array[], length, substring_length, vdat_count;
{
  int location, segment, i, zero_count = 0;
  Boolean legal = false, In_The_Right_Segment = false;   /* length
and segments okay? */
  while (!legal)
  {
    segment = vdat_count-1;  /* restart the segment search on each retry */
    location = (int)((rand()/(double)MAXINT)* (length -
substring_length));
    In_The_Right_Segment = false;
    while (!In_The_Right_Segment)
    {
      if (location <= index_array[segment])
      {
        In_The_Right_Segment = true;
        if (location <= (index_array[segment] - substring_length +
1))
          legal = true;
        else
          legal = false;
      }
      else
        segment--;
    }
  }
  if (segment < vdat_count-1)
    location -= index_array[segment+1];
  for (i = location; i < location + substring_length; i++)
  {
    sub_string[i-location] =  (string_array[segment])[i];
    if (!string_array[segment][i])
      zero_count++;
  }
  if (zero_count < substring_length/2)
    return(TRUE);
  else
    return(FALSE);
}
/* compare: the heart of the Boyer-Moore heuristic, similar to
Knuth-Morris-Pratt's matching engine */
void compare(p, t, pattern_length, text_length, sum)
char *p, *t;
int pattern_length, text_length, *sum;
{
  ALPHABET_ARRAY char_jump;
  int *match_jumps, print;
  allocate_array(&match_jumps, pattern_length);
  compute_jumps(p, char_jump, pattern_length-1);
  compute_match_jumps(p, &match_jumps, pattern_length);
  if (bm_match(p, t, char_jump, match_jumps, pattern_length,
text_length))
 (*sum)++;
  free(match_jumps);
}
void allocate_array(array, size)
INDEX_ARRAY array;
int size;
{
  *array = (int *)calloc(size, sizeof(int));
}
/* the bad-character failure function
NOTE: if the ASCII alphabet, which has size 256, is
used, this function is not worth computing for resource text strings
of length ≤ 256 */
void compute_jumps(p, char_jump, length)
char *p;
ALPHABET_ARRAY char_jump;
int length;
{
  int c, k;
  for (c = 0; c < CHARS; c++)
    char_jump[c] = length;
  for (k = 0; k < length; k++)
    char_jump[POSITIVE(p[k])] = length-k-1;
}
/* implementation of pseudocode from [Baase 88]
   - uses the good-suffix failure function */
void compute_match_jumps(p, match_jump, pattern_length)
char *p;
INDEX_ARRAY match_jump;
int pattern_length;
{
  int m, k, q, qq;
  int *back;
  allocate_array(&back, pattern_length+1);
  m = pattern_length;
  for (k = 0; k < m; k++)
    (*match_jump)[k] = 2*m-k-1;
  q = m;
  for (k = m-1; k >= 0; k--)
  {
    back[k] = q;
    while ((q < m) && (p[k] != p[q]))
    {
      (*match_jump)[q] = MIN2((*match_jump)[q], m-k-1);
      q = back[q];
    }
    q--;
  }
  for (k = 0; k < q; k++)
    (*match_jump)[k] = MIN2((*match_jump)[k], m+q-k-1);
  qq = back[q];
  while (q < m)
  {
    while (q < qq)
    {
      (*match_jump)[q] = MIN2((*match_jump)[q], qq-q+m-1);
      q++;
    }
    qq = back[qq];
  }
  free(back);
}
int bm_match(p, t, char_jump, match_jump, pattern_length,
text_length)
char *p, *t;
ALPHABET_ARRAY char_jump;
int *match_jump, pattern_length, text_length;
{
  int j, k; /* j indexes text characters; k indexes
 the pattern */
  j = pattern_length - 1;
  k = j;
  while (j < text_length)
  {
    if (k == -1)
      return(TRUE);
    if (t[j] == p[k])
    {
      j--;
      k--;
    }
    else
    {
      j += MAX2(char_jump[POSITIVE(t[j])], match_jump[k]);
      k = pattern_length - 1;
    }
  }
  return(FALSE);
}
/* dec.h - definitions and declarations for bm */
#include 
#include 
#include 
#include 
#define TABS 4
#define K 1024
#define MAXSIZE 8*K
#define MAXINT 32767
#define MINSUB 8
#define MAXSUB 12
#define STEP 4
#define ITERATIONS 1000
#define FALSE 0
#define TRUE 1
#define READ_MODE "r"
#define WRITE_MODE "w"
#define APPEND_MODE "a"
#define MIN2(a, b) (((a) < (b)) ? (a) : (b))
#define MIN3(a, b, c) ((MIN2((a), (b)) < (c)) ? MIN2((a), (b)) : (c))
#define MAX2(a, b) (((a) > (b)) ? (a) : (b))
#define POSITIVE(a) ((abs(a) == (a)) ? (a) : abs(a)+127)
#define CHARS 256
#define NIL_POINTER 0L
#define NIL_STRING "\p"
#define IGNORED_STRING NIL_STRING
#define NIL_FILE_FILTER  NIL_POINTER
#define NIL_DIALOG_HOOK  NIL_POINTER
#define VDAT_RES_ID 0
typedef int ALPHABET_ARRAY[CHARS];
typedef int **INDEX_ARRAY;
typedef ResType **ResTypeHandle;
void compare(), allocate_array(), compute_jumps(),
compute_match_jumps(), error_message(), exit_cleanly(), main();
char open_file(), random_string();
int read_array();
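
The sampling-and-voting idea in boyer_moore() can be tried in isolation with the short ANSI C sketch below. It is not the article's code: it swaps in the simpler Horspool variant (bad-character shift only) for the full Boyer-Moore matcher, it scans a plain byte buffer instead of resource files, and the names horspool_find and sample_hits are hypothetical.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Horspool search: returns 1 if pat[0..m-1] occurs in txt[0..n-1]. */
int horspool_find(const unsigned char *pat, size_t m,
                  const unsigned char *txt, size_t n)
{
    size_t shift[UCHAR_MAX + 1];
    size_t c, k, pos;

    if (m == 0 || m > n)
        return 0;
    for (c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;                  /* byte absent from the pattern */
    for (k = 0; k + 1 < m; k++)
        shift[pat[k]] = m - 1 - k;     /* distance to the last position */

    pos = 0;
    while (pos + m <= n) {
        if (memcmp(pat, txt + pos, m) == 0)
            return 1;
        pos += shift[txt[pos + m - 1]];  /* shift on the window's last byte */
    }
    return 0;
}

/* Draws `trials` random windows of `len` bytes from the signature and
   counts how many of them occur somewhere in the text. */
int sample_hits(const unsigned char *sig, size_t sig_len,
                const unsigned char *txt, size_t txt_len,
                size_t len, int trials)
{
    int hits = 0, i;

    if (len == 0 || len > sig_len)
        return 0;
    for (i = 0; i < trials; i++) {
        size_t start = (size_t)rand() % (sig_len - len + 1);
        hits += horspool_find(sig + start, len, txt, txt_len);
    }
    return hits;
}
```

A caller would compare the returned count against a voting threshold, as the listing does with divisor/LIMIT. An exact copy of the signature scores every trial, unrelated data scores none, and a padded or partly rewritten copy falls in between because only the windows that straddle a changed region miss.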

Footnotes

1. [Cohen 86] is the most complete and formal of these publications. He gives a full

definition of the term virus and a technical discussion of worm propagation and viral

spread.

2. An inter-corporation group composed of personal computer industry

professionals (generally hardware and software developers) which is devoted to the

distribution of anti-viral information (e.g., training seminars and publications) and

tracking of new viruses. It was founded and is coordinated by John McAfee, the

president of InterPath Corporation in Santa Clara, CA. The full text of his classification

schema may be found in [McAfee 88].

3. This proof is available in its original form in [Cohen 86]; the doctoral thesis is

exclusively published by the micrographics department of the University of Southern

California. [Burger 88], [van Winkel 88], and many other works contain versions of

this reduction of new virus detection to the halting problem [Turing 36].

4. A brief definition of O-notation, from [Baase 88]:

f(n) = O(g(n)) (f is “order of” g) if and only if there exist c > 0, N > 0, such that

f(n) ≤ cg(n) for every n ≥ N.

Thus an O(mn)-time implementation requires time proportional to the product of

the lengths of the pattern and text strings, in the long run. An O(kn) version requires

time proportional to the product of the maximum acceptable number of differences and

the length of the text.

Our implementation of the dynamic programming algorithm was coded in C, using

Pascal-type pseudocode from [Baase 88] (Chapter 6) as a guide. The O(kn) version can

be found in [Landau 86], in the 18th annual ACM STOC Proceedings, with more general

pseudocode.
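
As a rough worked instance of the two bounds, take figures that appear in the listings: a text buffer of MAXSIZE = 8K bytes and a THRESHOLD of 50 differences (with UNIT = 1). The pattern size of 300 bytes is an assumption, borrowed from the minimum segment length in the usage notes. Writing m for the pattern length, n for the text length, and k for the difference limit, the table-entry counts are on the order of:

```latex
\begin{align*}
m &= 300, \qquad n = 8192, \qquad k = 50 \\
O(mn):\quad m \cdot n &= 300 \times 8192 = 2{,}457{,}600 \\
O(kn):\quad k \cdot n &= 50 \times 8192 = 409{,}600
\end{align*}
```

These are operation counts up to a constant factor, not measured running times. For these figures the O(kn) bound is smaller by a factor of m/k = 6.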

5. Among the Macintosh viruses with known variants (both strains and clones) are

the following: WDEF, with strains A and B, and nVIR, with very prolific strains A and

B, each with multiple clones found under Hpat, MEV#, AIDS, and other resource titles.

An explicit definition of the terms “strain”, “clone”, and “viral mutant” as they are

used in this article is given in the introduction.

6. The original presentation of the algorithm is given in [Boyer 77], a paper in the

October 1977 CACM; again, pseudocode from [Baase 88] (Chapter 5) was used as a

guide in our implementation.

7. This is the pivotal concept in [Morton 90], a recent article in MacTutor. The

evident weaknesses in this technique are stressed by the author, who recommends user

modification of the anti-viral source code as a means of circumventing viral tampering.

This comment forebodes the use of expert systems techniques in viral code design; the

use of artificial intelligence intermeshed with viral programs has been predicted in

[Cohen 86], and is expected to appear as the availability of compiler tools increases and

the viral visibility threshold decreases.

8. The evolutionary virus is a largely theoretical program, first proposed in

[Cohen 86]; however, mildly evolutionary code (viral and otherwise) already exists in

abundance. User modification of an antivirus is nearly certain to leave it “blind” to

successive generations of an automatically self-modifying virus.

9. The table computation rules (with the exception of the distance metric

modification - a, b, and c replace 1 in each rule) are quoted verbatim from [Baase

88], Section 6.3.

10. The article is [Landau 86], in the Proceedings of the 18th Annual ACM STOC.
