Fine Tune MPW
Volume Number: 6
Issue Number: 7
Column Tag: Mac Workshop
Fine Tuning Code With MPW 
By Allen Stenger, Gardena, CA
Tuning with MPW
[Allen Stenger works on the F-15 radar software for Hughes Aircraft Co. His
technical interests are software reliability, computer architecture, and computer
languages. Programming the Macintosh has been his hobby for the past five years.]
The Macintosh Programmer’s Workshop documentation does a good job of
describing how to use the performance analysis tools, but it does not explain what to do
with the results. This article is a tutorial on how to tune a program with MPW, using
the performance reports as a guide to which areas of the program are most in need of
improvement.
For the example program we will use a dragon-curve drawer. This is an example
in the Smalltalk books, and Listing 1 is a straightforward translation of the
Smalltalk-80 program into SemperSoft Modula-2. This straightforward program
takes about 8 seconds (477 ticks) to draw the dragon. This is not extremely slow, but
one imagines that it could be done faster. In fact, we will be able to get a 16-times
speedup with some simple transformations, and by being a little trickier we will get a
22-times speedup.
This is an artificial example, for a couple of reasons. First, for any imaginable
use to which DrawDragon might be put, its present performance is adequate--there is
no practical value to speeding it up. Second, the MPW tools would normally be applied
to much larger programs, and are in some ways an overkill for this simple example.
Speculations
We will start out with some speculations on the causes of the slowness of the
straightforward dragon. After doing the tuning we will revisit these and see how good
our intuition was.
From inspecting the code, one might generate a list of possible “problem areas”
such as the following:
Use of recursion. According to the folklore of programming, recursive programs
are slow, and certainly the Dragon procedure seems to spend most of its time
calling itself rather than doing any useful work.
Use of high-order language. Traditionally, in order to get the highest
performance, one must write in assembly language. The SemperSoft compiler is
a one-pass compiler, and presumably does little optimization.
Use of range-checking. This example is compiled with checks on, which in some
programs can cause a significant slowdown.
Use of floating-point. This example was run on a Macintosh Plus, which has no
floating-point hardware and does all floating point in software.
Graphics operations. These typically take a lot of CPU time. (There might not be
much we can do about this one, since the program’s whole purpose is to draw on
the screen.)
Tuning -- Preparation
We can instrument the dragon program following the recipes in the MPW
manual, Chapter 14 (the method is the same regardless of language, although the
interface details vary slightly). Here we will only change the calling program,
DrawDragon, and leave the other modules undisturbed. Since the performance
measurements are done by random sampling of the program counter, the whole
program will be instrumented merely by starting the measurement at the beginning of
the run and ending it at the end - it is not necessary to modify each routine. The
instrumented DrawDragon module is shown in Listing 2.
The performance analyzer yields an execution “profile” of the program,
characterizing it by where it spends its time. It is usually only helpful to look at the
top five routines - everything below that takes such a small amount of time that, even
if we could eliminate a routine totally, it would have a negligible effect on the timing.
Here, on our first run, we need only look at the #1 routine, since it takes 72.6% of
the time (total time for the program is 493 ticks). This is the Pack4 routine, which
is not in the Dragon program -- it is the SANE floating-point routines, as may be
looked up in Inside Macintosh. So our first tuning step is to reduce the amount of time
spent doing floating-point work.
Tuning -- First Pass
Some people (particularly FORTH fans) advocate doing no floating point at all,
but using instead scaled integer arithmetic. For most programs, eliminating floating
point altogether would be a drastic operation. For our little DrawDragon program it is
not so bad, but let’s see what else we could do first. Unfortunately the report does not
break down the floating-point operations used, but by examining the code we see that
each line drawn requires 2 integer-to-float conversions, 2 float-to-integer
conversions, 3 floating-point multiplies, and 1 each sine and cosine. It seems likely
that most of the floating-point time is spent on the sine and cosine, with the multiplies
coming in a distant second. Further examination of the program reveals that sine and
cosine are called with only 4 possible arguments (0, 90, 180, and 270 degrees), so
this suggests the use of a look-up table storing the values of sine and cosine for these
arguments.
Making this change and re-running the program and reports shows that this does
cause a large speedup - the timing is now 86 ticks, or 5.7 times faster. The top item
on the performance report is still Pack4 with 41.6%, so our next step is to try to
eliminate still more floating-point operations.
Tuning -- Second Pass
After scrutinizing the improved program some more, we may remember that the
reason we put in the sines and cosines in the first place was to handle drawing of lines
at arbitrary angles. For the 4 angles that we actually use, the sines and cosines have
integer values, so (for our particular case) there is no need for floating point anyway!
We therefore make a simple re-coding of the Go procedure to do all calculations as
integers. (Since the routine is now getting faster, there are not enough
program-counter samples to get a good breakdown of where the time is spent, so we
will also draw the dragon 10 times in a loop and divide by 10 to get the time per
dragon.) Upon re-running this, we find that the time has decreased to 30 ticks (2.9
times faster), and the top 5 routines in the performance report are all QuickDraw
routines and comprise 73.2% of the time. Further improvement seems to depend on
inventing some way of drawing lines faster than QuickDraw can, which seems a dim
hope.
Tuning -- Third Pass
Surprisingly, this is possible, in the special case we are dealing with (which has
only horizontal and vertical lines). By trying some separate timing tests, it appears
that our lines can be drawn in 1/3 less time by using FillRect rather than Line.
Implementing this leads to a program which runs in 22 ticks (1.4 times faster), for
an overall improvement over the straightforward Dragon of 22:1. This speedup
requires changes to only one routine, Go. The revised Pen module is shown in Listing
3.
Speculations Inspected
The big winners among our speculations were floating point and graphics
operations. Floating point took up most of the time in the original program, and by
eliminating it we were able to get most of the speedup. Graphics operations take up
most of the remaining time, and although we made some reduction in this, it seems
unlikely that there is much more we can do. Possibly we could do something sneaky
such as building the whole picture in memory without using graphics operations, then
copying it to the screen with CopyBits; but this would be very complicated and perhaps
not much of an improvement. (One of the hardest parts of tuning is knowing when to
stop. Most programs can be improved indefinitely, but they gradually get more
complicated and error-prone.)
The issue of recursion turned out to be a red herring. The total time spent in the
Dragon procedure is only 3.9% of the total, and this includes the recursive calls.
Recursion has a bad reputation, for two reasons. First is the bad recursion examples
often given in textbooks (usually factorials or Fibonacci numbers), which can be done
more simply and faster by iteration. Second, in some older languages and
implementations, either there is no recursion in the language, so it must be simulated
by the program (FORTRAN), or it is treated as a special case (JOVIAL, PL/I) and
really is much slower because it requires a special dynamic allocation of variables
rather than the normal static allocation. In more modern languages such as Pascal or
Modula-2, all variables are dynamically allocated and recursive calls are no different
(and therefore no slower) than other calls.
The impact of using a high-order language or of range checking varies a great
deal depending on the program and on the language implementation. In this example
most of the time was taken by things outside the generated code, which seems to be
inherent in the particular problem being solved and not likely to be affected much by
the implementation language. In the final version of the program, the Modula-2
routines take a total of only 13.3%, so re-writing in assembly or taking out the range
checks could not possibly save more than this. The typical Macintosh application tends
to consist mainly of calls to the Toolbox, interspersed with occasional calculations, so
the timing of a tuned program tends to be dominated by the Toolbox time and does not
depend greatly on the implementation language. For programs such as spreadsheets and
compilers this is not true, since they really do spend a great deal of time on internal
calculations, but it is true of most applications.
Conclusion
By applying the MPW performance analysis tools and making the improvements
suggested by the program profile, we were able to speed up the dragon-drawer 22
times. The value of the execution profile is in telling us what not to look at. Most of a
program’s execution time is spent in a few very places, and we should concentrate our
efforts on those few places. (In Quality Control this is called the principle of the
“vital few and the trivial many.”) Profiles are more valuable for large programs,
since there is more to ignore there. In any case, success in tuning depends on
measurement.
For Further Reading
Jon Louis Bentley, Writing Efficient Programs. Prentice-Hall, 1982. An
excellent book on tuning; of value both to the professional programmer and the
hobbyist. Our Dragon example illustrates his principles Space-For-Time Rule 2 --
Store Precomputed Results (sine and cosine tables), and Procedure Rule 2 -- Exploit
Common Cases (eliminate floating-point since not needed in our case, and use FillRect
for faster horizontal and vertical line drawing). He gives many other principles
which we did not have a chance to use. In the beginning of the book he goes through 10
steps of optimizing an Approximate Travelling-Salesman Tour to get an overall
improvement of 17:1.
Adele Goldberg and David Robson, Smalltalk-80: The Language and Its
Implementation. Addison-Wesley, 1983. The source of the Dragon program (on pp.
372-3).
Donald E. Knuth, “An Empirical Study of FORTRAN Programs”,
Software--Practice and Experience, v. 1 (1971), pp. 105-133. Source of the term
“profile.” This article is written from the viewpoint of a compiler developer, and
considers several levels of optimization which can be applied to programs. Many
examples of optimization.
Mike Morton, “Faster Bitmap Rotation”. MacTutor, v. 4 (1988), no. 11, pp.
86-90 (reprinted in The Definitive MacTutor, pp. 56-60). Another example of
tuning, using only the total time of the routine as a measure and yielding a speedup of
3:1. Morton was careful to measure each attempted improvement to the routine; the
exact method of measurement is not important, but you must measure.
Stephen Dubin, Thomas W. Moore, and Sheel Kishore, “Using Regions in Medicine
with C”. MacTutor, v. 3 (1987), no. 10, pp. 27-31 (reprinted in The Essential
MacTutor, pp. 146-150). Description of re-coding a routine to calculate the area of
a region from Pascal or C to assembler, yielding a 1000:1 speedup.
James Plamondon, “Finding The Area Of A Region in C” (letter). MacTutor, v. 4
(1988), no. 4 (reprinted in The Definitive MacTutor, p. 699). Criticizes the
previous reference for solving the wrong problem (i.e., working very hard to get a
very good estimate of the area of a region drawn freehand). Gives a faster C routine for
doing a less-accurate but adequate estimate. In most applications you have some
leeway in the problem you solve, and you can use this to make the solution easier and
faster. We did not take advantage of this in the DrawDragon program; one speedup we
might have applied is to draw the line only one pixel wide (which is about twice as fast
as drawing it 4 pixels wide). Bentley (reference above) also considers this in his
Approximate Travelling-Salesman Tour -- since it is only an approximate solution
anyway, there is little harm in going from floating-point to slightly less accurate
scaled fixed point numbers.
Source Listings
Listing 1 - Straightforward Dragon
(******************************************************)
(* *)
(* file: DrawDragon.m *)
(* *)
(* Main program to test Dragon method. *)
(* *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* *)
(* Allen Stenger August 1989 *)
(* *)
(******************************************************)
MODULE DrawDragon;
FROM InOut IMPORT WriteLong, WriteString, Read;
FROM InsideMac IMPORT TickCount;
FROM DragonModule IMPORT Dragon;
FROM Pen IMPORT Home;
VAR
oldTime,
newTime : LONGINT;
ch : CHAR;
BEGIN
oldTime := TickCount();
Home;
Dragon( 8 );
newTime := TickCount();
WriteString( “Run time is “ );
WriteLong( newTime - oldTime, 6 );
WriteString( “ -- press space to exit “ );
Read( ch );
END DrawDragon.
(******************************************************)
(* *)
(* file: DragonModule.d *)
(* *)
(* Implementation of Dragon method from *)
(* Goldberg and Robson, *)