All Databases MacTech Vol 06-1990

Fine Tune MPW

Volume Number: 6

Issue Number: 7

Column Tag: Mac Workshop

Fine Tuning Code With MPW

By Allen Stenger, Gardena, CA

Tuning with MPW

[Allen Stenger works on the F-15 radar software for Hughes Aircraft Co. His

technical interests are software reliability, computer architecture, and computer

languages. Programming the Macintosh has been his hobby for the past five years.]

The Macintosh Programmer’s Workshop documentation does a good job of

describing how to use the performance analysis tools, but it does not explain what to do

with the results. This article is a tutorial on how to tune a program with MPW, using

the performance reports as a guide to which areas of the program are most in need of

improvement.

For the example program we will use a dragon-curve drawer. This is an example

in the Smalltalk books, and Listing 1 is a straightforward translation of the

Smalltalk-80 program into SemperSoft Modula-2. This straightforward program

takes about 8 seconds (477 ticks) to draw the dragon. This is not extremely slow, but

one imagines that it could be done faster. In fact, we will be able to get a 16-times

speedup with some simple transformations, and by being a little trickier we will get a

22-times speedup.

This is an artificial example, for a couple of reasons. First, for any imaginable

use to which DrawDragon might be put, its present performance is adequate--there is

no practical value to speeding it up. Second, the MPW tools would normally be applied

to much larger programs, and are in some ways an overkill for this simple example.

Speculations

We will start out with some speculations on the causes of the slowness of the

straightforward dragon. After doing the tuning we will revisit these and see how good

our intuition was.

From inspecting the code, one might generate a list of possible “problem areas”

such as the following:

Use of recursion. According to the folklore of programming, recursive programs

are slow, and certainly the Dragon procedure seems to spend most of its time

calling itself rather than doing any useful work.

Use of high-order language. Traditionally, in order to get the highest

performance, one must write in assembly language. The SemperSoft compiler is

a one-pass compiler, and presumably does little optimization.

Use of range-checking. This example is compiled with checks on, which in some

programs can cause a significant slowdown.

Use of floating-point. This example was run on a Macintosh Plus, which has no

floating-point hardware and does all floating point in software.

Graphics operations. These typically take a lot of CPU time. (There might not be

much we can do about this one, since the program’s whole purpose is to draw on

the screen.)

Tuning -- Preparation

We can instrument the dragon program following the recipes in the MPW

manual, Chapter 14 (the method is the same regardless of language, although the

interface details vary slightly). Here we will only change the calling program,

DrawDragon, and leave the other modules undisturbed. Since the performance

measurements are done by random sampling of the program counter, the whole

program will be instrumented merely by starting the measurement at the beginning of

the run and ending it at the end - it is not necessary to modify each routine. The

instrumented DrawDragon module is shown in Listing 2.

The performance analyzer yields an execution “profile” of the program,

characterizing it by where it spends its time. It is usually only helpful to look at the

top five routines - everything below that takes such a small amount of time that, even

if we could eliminate a routine totally, it would have a negligible effect on the timing.

Here, on our first run, we need only look at the #1 routine, since it takes 72.6% of

the time (total time for the program is 493 ticks). This is the Pack4 routine, which

is not in the Dragon program -- it is the SANE floating-point routines, as may be

looked up in Inside Macintosh. So our first tuning step is to reduce the amount of time

spent doing floating-point work.

Tuning -- First Pass

Some people (particularly FORTH fans) advocate doing no floating point at all,

but using instead scaled integer arithmetic. For most programs, eliminating floating

point altogether would be a drastic operation. For our little DrawDragon program it is

not so bad, but let’s see what else we could do first. Unfortunately the report does not

break down the floating-point operations used, but by examining the code we see that

each line drawn requires 2 integer-to-float conversions, 2 float-to-integer

conversions, 3 floating-point multiplies, and 1 each sine and cosine. It seems likely

that most of the floating-point time is spent on the sine and cosine, with the multiplies

coming in a distant second. Further examination of the program reveals that sine and

cosine are called with only 4 possible arguments (0, 90, 180, and 270 degrees), so

this suggests the use of a look-up table storing the values of sine and cosine for these

arguments.

Making this change and re-running the program and reports shows that this does

cause a large speedup - the timing is now 86 ticks, or 5.7 times faster. The top item

on the performance report is still Pack4 with 41.6%, so our next step is to try to

eliminate still more floating-point operations.

Tuning -- Second Pass

After scrutinizing the improved program some more, we may remember that the

reason we put in the sines and cosines in the first place was to handle drawing of lines

at arbitrary angles. For the 4 angles that we actually use, the sines and cosines have

integer values, so (for our particular case) there is no need for floating point anyway!

We therefore make a simple re-coding of the Go procedure to do all calculations as

integers. (Since the routine is now getting faster, there are not enough

program-counter samples to get a good breakdown of where the time is spent, so we

will also draw the dragon 10 times in a loop and divide by 10 to get the time per

dragon.) Upon re-running this, we find that the time has decreased to 30 ticks (2.9

times faster), and the top 5 routines in the performance report are all QuickDraw

routines and comprise 73.2% of the time. Further improvement seems to depend on

inventing some way of drawing lines faster than QuickDraw can, which seems a dim

hope.

Tuning -- Third Pass

Surprisingly, this is possible, in the special case we are dealing with (which has

only horizontal and vertical lines). By trying some separate timing tests, it appears

that our lines can be drawn in 1/3 less time by using FillRect rather than Line.

Implementing this leads to a program which runs in 22 ticks (1.4 times faster), for

an overall improvement over the straightforward Dragon of 22:1. This speedup

requires changes to only one routine, Go. The revised Pen module is shown in Listing

Speculations Inspected

The big winners among our speculations were floating point and graphics

operations. Floating point took up most of the time in the original program, and by

eliminating it we were able to get most of the speedup. Graphics operations take up

most of the remaining time, and although we made some reduction in this, it seems

unlikely that there is much more we can do. Possibly we could do something sneaky

such as building the whole picture in memory without using graphics operations, then

copying it to the screen with CopyBits; but this would be very complicated and perhaps

not much of an improvement. (One of the hardest parts of tuning is knowing when to

stop. Most programs can be improved indefinitely, but they gradually get more

complicated and error-prone.)

The issue of recursion turned out to be a red herring. The total time spent in the

Dragon procedure is only 3.9% of the total, and this includes the recursive calls.

Recursion has a bad reputation, for two reasons. First is the bad recursion examples

often given in textbooks (usually factorials or Fibonacci numbers), which can be done

more simply and faster by iteration. Second, in some older languages and

implementations, either there is no recursion in the language, so it must be simulated

by the program (FORTRAN), or it is treated as a special case (JOVIAL, PL/I) and

really is much slower because it requires a special dynamic allocation of variables

rather than the normal static allocation. In more modern languages such as Pascal or

Modula-2, all variables are dynamically allocated and recursive calls are no different

(and therefore no slower) than other calls.

The impact of using a high-order language or of range checking varies a great

deal depending on the program and on the language implementation. In this example

most of the time was taken by things outside the generated code, which seems to be

inherent in the particular problem being solved and not likely to be affected much by

the implementation language. In the final version of the program, the Modula-2

routines take a total of only 13.3%, so re-writing in assembly or taking out the range

checks could not possibly save more than this. The typical Macintosh application tends

to consist mainly of calls to the Toolbox, interspersed with occasional calculations, so

the timing of a tuned program tends to be dominated by the Toolbox time and does not

depend greatly on the implementation language. For programs such as spreadsheets and

compilers this is not true, since they really do spend a great deal of time on internal

calculations, but it is true of most applications.

Conclusion

By applying the MPW performance analysis tools and making the improvements

suggested by the program profile, we were able to speed up the dragon-drawer 22

times. The value of the execution profile is in telling us what not to look at. Most of a

program’s execution time is spent in a few very places, and we should concentrate our

efforts on those few places. (In Quality Control this is called the principle of the

“vital few and the trivial many.”) Profiles are more valuable for large programs,

since there is more to ignore there. In any case, success in tuning depends on

measurement.

For Further Reading

Jon Louis Bentley, Writing Efficient Programs. Prentice-Hall, 1982. An

excellent book on tuning; of value both to the professional programmer and the

hobbyist. Our Dragon example illustrates his principles Space-For-Time Rule 2 --

Store Precomputed Results (sine and cosine tables), and Procedure Rule 2 -- Exploit

Common Cases (eliminate floating-point since not needed in our case, and use FillRect

for faster horizontal and vertical line drawing). He gives many other principles

which we did not have a chance to use. In the beginning of the book he goes through 10

steps of optimizing an Approximate Travelling-Salesman Tour to get an overall

improvement of 17:1.

Adele Goldberg and David Robson, Smalltalk-80: The Language and Its

Implementation. Addison-Wesley, 1983. The source of the Dragon program (on pp.

372-3).

Donald E. Knuth, “An Empirical Study of FORTRAN Programs”,

Software--Practice and Experience, v. 1 (1971), pp. 105-133. Source of the term

“profile.” This article is written from the viewpoint of a compiler developer, and

considers several levels of optimization which can be applied to programs. Many

examples of optimization.

Mike Morton, “Faster Bitmap Rotation”. MacTutor, v. 4 (1988), no. 11, pp.

86-90 (reprinted in The Definitive MacTutor, pp. 56-60). Another example of

tuning, using only the total time of the routine as a measure and yielding a speedup of

3:1. Morton was careful to measure each attempted improvement to the routine; the

exact method of measurement is not important, but you must measure.

Stephen Dubin, Thomas W. Moore, and Sheel Kishore, “Using Regions in Medicine

with C”. MacTutor, v. 3 (1987), no. 10, pp. 27-31 (reprinted in The Essential

MacTutor, pp. 146-150). Description of re-coding a routine to calculate the area of

a region from Pascal or C to assembler, yielding a 1000:1 speedup.

James Plamondon, “Finding The Area Of A Region in C” (letter). MacTutor, v. 4

(1988), no. 4 (reprinted in The Definitive MacTutor, p. 699). Criticizes the

previous reference for solving the wrong problem (i.e., working very hard to get a

very good estimate of the area of a region drawn freehand). Gives a faster C routine for

doing a less-accurate but adequate estimate. In most applications you have some

leeway in the problem you solve, and you can use this to make the solution easier and

faster. We did not take advantage of this in the DrawDragon program; one speedup we

might have applied is to draw the line only one pixel wide (which is about twice as fast

as drawing it 4 pixels wide). Bentley (reference above) also considers this in his

Approximate Travelling-Salesman Tour -- since it is only an approximate solution

anyway, there is little harm in going from floating-point to slightly less accurate

scaled fixed point numbers.

Source Listings

Listing 1 - Straightforward Dragon
(******************************************************)
(*   *)
(* file:  DrawDragon.m   *)
(*   *)
(* Main program to test Dragon method.   *)
(*   *)
(* Written in SemperSoft Modula-2 v.1.1.2   *)
(*   *)
(* Allen Stenger August 1989     *)
(*   *)
(******************************************************)
MODULE DrawDragon;
FROM InOut IMPORT WriteLong, WriteString, Read;
FROM InsideMac IMPORT TickCount;
FROM DragonModule IMPORT Dragon;
FROM Pen IMPORT Home;
VAR
oldTime,
newTime : LONGINT;
ch  :  CHAR;
BEGIN
oldTime := TickCount();
Home;
Dragon( 8 );
newTime := TickCount();
WriteString( “Run time is “ );
WriteLong( newTime - oldTime, 6 );
WriteString( “ -- press space to exit “ );
Read( ch );
END DrawDragon.
(******************************************************)
(*   *)
(* file:  DragonModule.d   *)
(*   *)
(* Implementation of Dragon method from    *)
(* Goldberg and Robson,      *)

Referenced by (1):

Vol 6 Issues (MacTech Index)