Performance Tuning
Volume Number: 10
Issue Number: 7
Column Tag: Powering UP
Touch that memory, go directly to jail, or at least to a wait state
By Richard Clark, General Magic, and Jordan Mattson, Apple Computer, Inc.
The new Power Macintosh systems are definite price/performance winners: they
run most 68K software at Quadra speeds, and recompiled code runs two to four times
faster than a top-of-the-line Quadra. Yet simple recompilation isn’t enough if you want
the best possible performance. Getting the most speed out of a Power Macintosh
program requires performance tuning: identifying and changing the sections of code
which are holding back performance.
This month’s Powering Up looks at the common actions which rob Power
Macintosh programs of their maximum performance. We’ll start with a discussion of
how reading and writing memory affects the speed of PowerPC code, take a detour
through the Power Macintosh performance analysis tool, look at some specific
performance enhancement techniques, and come back to an older 68K trick which
doesn’t work so well anymore. Armed with this information, you should be well on
your way to tuning your own PowerPC code.
Remember this!
If you only remember one thing from this article, remember this: in almost
every PowerPC program, memory accesses affect performance the most. Compare this
to the 68K, where reducing the number and complexity of instructions is the goal, and
you’ll see that tuning PowerPC code calls for some different techniques.
Why should memory accesses be such a big deal on the PowerPC? First, the
PowerPC is designed to be clocked at high speeds, but high-speed memory is expensive
and hard to get. You might have noticed that Power Macintosh systems use the same
80ns SIMMs as 33MHz 68040 and 80x86 systems. Since the PowerPC can easily
issue requests for memory in less than 80ns, Power Macintosh systems use several
techniques to connect the fast PowerPC chip to slower RAM.
One technique involves running the memory bus at a slower speed than the
processor, and asking the processor to wait every time it needs a memory location.
This is often called “inserting wait states” into each memory access. (The first Power
Macintosh models run the bus at 1/2 the processor clock speed, so a 66MHz system
has a 33MHz bus.) This technique isn’t unique to the Power Macintosh - most
commercial systems insert at least a single wait state into every memory access, and
some “clock doubled” microprocessors connect to the outside world at one speed but
multiply the clock speed by two before feeding it to the microprocessor’s logic.
Since the system’s RAM can’t supply information as quickly as the processor
wants, every PowerPC chip contains a built-in “memory cache.” A cache is a block of
memory placed between the processor and some slower external device, so a memory
cache is a block of extremely fast memory placed between the processor and regular
RAM, a disk cache is a block of memory between the disk drive and the processor, and
so on. If the processor has to read the same location multiple times, subsequent reads
can come from the “cached” copy of the information, which leads to dramatic
improvements in performance. (A standard 601 PowerPC chip contains a single 32KB
cache internally; in practical terms, a read from the cache on one typical configuration
takes about 1/4 the time of a read from external memory, although this is dependent
on clock rates.)
The cache doesn’t help so much when writing values to memory. Imagine a
scenario where the PowerPC changes location 1 from the value 0 to 1. If the write only
went as far as the cache, the cached copy of location 1 would read “1” while the copy in
RAM would contain “0”. This leads to a conflict: the processor sees one value at
location 1, while external devices such as the disk controller see the old value!
Systems designers can use several techniques to avoid such “stale data”, but the
simplest and most practical method involves a “write-through cache”. (Technically
speaking, Apple marks most of RAM as “copy-back”; the video buffer is the only
major exception.) In this model, when the PowerPC writes to location 1, it changes
both the cached copy and the location in RAM. Since the write involves external
memory, the cache’s speed advantage goes away.
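One practical consequence: when stores can travel all the way to RAM, fewer, wider writes are cheaper than many narrow ones. Here is a hedged illustration (the function names are ours, invented for this sketch, not from any Apple header), filling a buffer one 32-bit word at a time instead of one byte at a time, which issues a quarter as many stores:

```c
/* Byte-at-a-time fill: one store per byte, so every byte pays
   the cost of a write that may go all the way to RAM. */
void fill_bytes(unsigned char *buf, unsigned long count, unsigned char value)
{
    unsigned long i;
    for (i = 0; i < count; i++)
        buf[i] = value;
}

/* Word-at-a-time fill: one store per 4 bytes. Assumes 'count' is
   a multiple of 4 and 'buf' is 4-byte aligned, as Memory Manager
   blocks normally are. 'unsigned int' is a 32-bit word here. */
void fill_words(unsigned char *buf, unsigned long count, unsigned char value)
{
    unsigned int *p = (unsigned int *)buf;
    unsigned int word = value;
    unsigned long i;

    word |= word << 8;      /* replicate the byte into all */
    word |= word << 16;     /* four bytes of the word      */
    for (i = 0; i < count / 4; i++)
        p[i] = word;
}
```

Both routines produce identical buffers; only the number of stores (and hence the external-memory traffic) differs.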
The PowerPC memory cache, like most memory caches, is currently organized
into “lines” of 32 bytes each. (Future chips may differ; on the 601, a line is actually
64 bytes, divided into two 32-byte sectors.) The first time a program accesses one memory
location, the chip actually loads 32 bytes (or 8 instructions) into the cache. So,
reading a byte at address 0x01 actually loads locations 0x00 through 0x1F into the
cache. Organizing the cache this way not only simplifies the logic a great deal, but
makes sense since both data and program memory accesses tend to occur in “clusters”.
Since the cache holds several contiguous locations, the PowerPC can use “burst
mode” when reading memory. In this mode, the microprocessor supplies an address,
then keeps asking for the “next” location. Since the memory chips only have to decode
the first address, then simply move to the next memory cell, RAM can supply the
series of contiguous locations much more quickly than if each location was read
separately.
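The 32-byte line fills and burst reads reward code that touches memory sequentially. As a sketch (the array dimensions are arbitrary, chosen for illustration), summing a two-dimensional array along its rows walks straight through each cache line, while summing down its columns touches a different line on nearly every access:

```c
#define ROWS 256
#define COLS 256

/* Row-major traversal: consecutive accesses fall within the same
   32-byte cache line, so most reads hit the cache and each line
   fill can use burst mode. */
long sum_by_rows(long table[ROWS][COLS])
{
    long sum = 0;
    int r, c;
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            sum += table[r][c];
    return sum;
}

/* Column-major traversal: each access lands COLS * sizeof(long)
   bytes past the previous one, so nearly every read starts a
   fresh line fill from RAM. The result is identical; only the
   memory traffic differs. */
long sum_by_columns(long table[ROWS][COLS])
{
    long sum = 0;
    int r, c;
    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            sum += table[r][c];
    return sum;
}
```

When an algorithm permits either order, the row-major version is the one the cache was built for.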
The other major reason that memory accesses affect the PowerPC so dramatically
is that the PowerPC (like other RISC chips) uses many simple instructions which
were designed for efficient execution. The chip wants another instruction every clock
cycle or two (giving the cache a real workout), and was built around the assumption
that most work will be done inside the registers. If a program places multiple reads or
writes to external memory in a row, the entire chip will have to wait (multiple reads
have a different behavior on the 603 and 604 to diminish this effect).
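One way to respect that assumption in C is to keep hot values in local variables, which the compiler can assign to registers, rather than re-reading and re-writing memory on every loop iteration. A hedged sketch (the global and both functions are invented for illustration, not taken from any real program):

```c
long gTotal;    /* illustrative global, of the kind many classic Mac apps use */

/* Updates the global on every pass: each iteration both loads and
   stores external memory, stalling a chip built to work out of
   registers. */
void tally_slow(const short *data, long count)
{
    long i;
    gTotal = 0;
    for (i = 0; i < count; i++)
        gTotal += data[i];      /* load + store per iteration */
}

/* Accumulates in a local the compiler can keep in a register,
   touching gTotal exactly once at the end. */
void tally_fast(const short *data, long count)
{
    long total = 0;             /* lives in a register */
    long i;
    for (i = 0; i < count; i++)
        total += data[i];
    gTotal = total;             /* single store to memory */
}
```

A good optimizing compiler will sometimes perform this hoisting itself, but it cannot when the variable is global or reachable through a pointer, since some other code might observe the intermediate values.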
Improving the performance of PowerPC code
Now that we know that memory accesses have a dramatic effect on PowerPC
performance, we’ll look at some specific examples. Let’s assume that you have a 68K
application which you’ve just ported to the PowerPC. (If you actually have such an
application, give yourself a pat on the back!) Your new application is fast, but you
expected it to be even faster. What can you do?
Look for inefficient code.
As in any application, the efficiency of your algorithms plays a large role in the