Performance Tuning
Volume Number: 10
Issue Number: 7
Column Tag: Powering UP
Touch that memory, go directly to jail, or at least to a wait state
By Richard Clark, General Magic, and Jordan Mattson, Apple Computer, Inc.
The new Power Macintosh systems are definite price/performance winners: they
run most 68K software at Quadra speeds, and recompiled code runs two to four times
faster than a top-of-the-line Quadra. Yet simple recompilation isn’t enough if you want
the best possible performance. Getting the most speed out of a Power Macintosh
program requires performance tuning: identifying and changing the sections of code
which are holding back performance.
This month’s Powering Up looks at the common actions which rob Power
Macintosh programs of their maximum performance. We’ll start with a discussion of
how reading and writing memory affects the speed of PowerPC code, take a detour
through the Power Macintosh performance analysis tool, look at some specific
performance enhancement techniques, and come back to an older 68K trick which
doesn’t work so well anymore. Armed with this information, you should be well on
your way to tuning your own PowerPC code.
Remember this!
If you only remember one thing from this article, remember this: in almost
every PowerPC program, memory accesses affect performance the most. Compare this
to the 68K, where reducing the number and complexity of instructions is the goal, and
you’ll see that tuning PowerPC code calls for some different techniques.
Why should memory accesses be such a big deal on the PowerPC? First, the
PowerPC is designed to be clocked at high speeds, but high-speed memory is expensive
and hard to get. You might have noticed that Power Macintosh systems use the same
80ns SIMMs as 33MHz 68040 and 80x86 systems. Since the PowerPC can easily
issue requests for memory in less than 80ns, Power Macintosh systems use several
techniques to connect the fast PowerPC chip to slower RAM.
One technique involves running the memory bus at a slower speed than the
processor, and asking the processor to wait every time it needs a memory location.
This is often called “inserting wait states” into each memory access. (The first Power
Macintosh models run the bus at 1/2 the processor clock speed, so a 66MHz system
has a 33MHz bus.) This technique isn’t unique to the Power Macintosh - most
commercial systems insert at least a single wait state into every memory access, and
some “clock doubled” microprocessors connect to the outside world at one speed but
multiply the clock speed by two before feeding it to the microprocessor’s logic.
Since the system’s RAM can’t supply information as quickly as the processor
wants, every PowerPC chip contains a built-in “memory cache.” A cache is a block of
memory placed between the processor and some slower external device, so a memory
cache is a block of extremely fast memory placed between the processor and regular
RAM, a disk cache is a block of memory between the disk drive and the processor, and
so on. If the processor has to read the same location multiple times, subsequent reads
can come from the “cached” copy of the information, which leads to dramatic
improvements in performance. (A standard 601 PowerPC chip contains a single 32KB
cache internally; in practical terms, a read from the cache on one typical configuration
takes about 1/4 the time of a read from external memory, although this is dependent
on clock rates.)
The cache doesn’t help so much when writing values to memory. Imagine a
scenario where the PowerPC changes location 1 from the value 0 to 1. If the write only
went as far as the cache, the cached copy of location 1 would read “1” while the copy in
RAM would contain “0”. This leads to a conflict: the processor sees one value at
location 1, while external devices such as the disk controller see the old value!
Systems designers can use several techniques to avoid such “stale data”, but the
simplest and most practical method involves a “write-through cache”. (Technically
speaking, Apple marks most of RAM as “copy-back”; the video buffer is the only
major exception.) In this model, when the PowerPC writes to location 1, it changes
both the cached copy and the location in RAM. Since the write involves external
memory, the cache’s speed advantage goes away.
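One practical consequence: when stores can travel all the way to RAM, fewer, wider writes are cheaper than many narrow ones. Here is a hedged illustration (the function names are ours, invented for this sketch, not from any Apple header), filling a buffer one 32-bit word at a time instead of one byte at a time, which issues a quarter as many stores:

```c
/* Byte-at-a-time fill: one store per byte, so every byte pays
   the cost of a write that may go all the way to RAM. */
void fill_bytes(unsigned char *buf, unsigned long count, unsigned char value)
{
    unsigned long i;
    for (i = 0; i < count; i++)
        buf[i] = value;
}

/* Word-at-a-time fill: one store per 4 bytes. Assumes 'count' is
   a multiple of 4 and 'buf' is 4-byte aligned, as Memory Manager
   blocks normally are. 'unsigned int' is a 32-bit word here. */
void fill_words(unsigned char *buf, unsigned long count, unsigned char value)
{
    unsigned int *p = (unsigned int *)buf;
    unsigned int word = value;
    unsigned long i;

    word |= word << 8;      /* replicate the byte into all */
    word |= word << 16;     /* four bytes of the word      */
    for (i = 0; i < count / 4; i++)
        p[i] = word;
}
```

Both routines produce identical buffers; only the number of stores (and hence the external-memory traffic) differs.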
The PowerPC memory cache, like most memory caches, is currently organized
into “lines” of 32 bytes each. (Future chips may differ; on the 601, a line is actually
64 bytes, divided into two 32-byte sectors.) The first time a program accesses one memory
location, the chip actually loads 32 bytes (or 8 instructions) into the cache. So,
reading a byte at address 0x01 actually loads locations 0x00 through 0x1F into the
cache. Organizing the cache this way not only simplifies the logic a great deal, but
makes sense since both data and program memory accesses tend to occur in “clusters”.
Since the cache holds several contiguous locations, the PowerPC can use “burst
mode” when reading memory. In this mode, the microprocessor supplies an address,
then keeps asking for the “next” location. Since the memory chips only have to decode
the first address, then simply move to the next memory cell, RAM can supply the
series of contiguous locations much more quickly than if each location was read
separately.
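The 32-byte line fills and burst reads reward code that touches memory sequentially. As a sketch (the array dimensions are arbitrary, chosen for illustration), summing a two-dimensional array along its rows walks straight through each cache line, while summing down its columns touches a different line on nearly every access:

```c
#define ROWS 256
#define COLS 256

/* Row-major traversal: consecutive accesses fall within the same
   32-byte cache line, so most reads hit the cache and each line
   fill can use burst mode. */
long sum_by_rows(long table[ROWS][COLS])
{
    long sum = 0;
    int r, c;
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            sum += table[r][c];
    return sum;
}

/* Column-major traversal: each access lands COLS * sizeof(long)
   bytes past the previous one, so nearly every read starts a
   fresh line fill from RAM. The result is identical; only the
   memory traffic differs. */
long sum_by_columns(long table[ROWS][COLS])
{
    long sum = 0;
    int r, c;
    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            sum += table[r][c];
    return sum;
}
```

When an algorithm permits either order, the row-major version is the one the cache was built for.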
The other major reason that memory accesses affect the PowerPC so dramatically
is that the PowerPC (like other RISC chips) uses many simple instructions which
were designed for efficient execution. The chip wants another instruction every clock
cycle or two (giving the cache a real workout), and was built around the assumption
that most work will be done inside the registers. If a program places multiple reads or
writes to external memory in a row, the entire chip will have to wait (multiple reads
have a different behavior on the 603 and 604 to diminish this effect).
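One way to respect that assumption in C is to keep hot values in local variables, which the compiler can assign to registers, rather than re-reading and re-writing memory on every loop iteration. A hedged sketch (the global and both functions are invented for illustration, not taken from any real program):

```c
long gTotal;    /* illustrative global, of the kind many classic Mac apps use */

/* Updates the global on every pass: each iteration both loads and
   stores external memory, stalling a chip built to work out of
   registers. */
void tally_slow(const short *data, long count)
{
    long i;
    gTotal = 0;
    for (i = 0; i < count; i++)
        gTotal += data[i];      /* load + store per iteration */
}

/* Accumulates in a local the compiler can keep in a register,
   touching gTotal exactly once at the end. */
void tally_fast(const short *data, long count)
{
    long total = 0;             /* lives in a register */
    long i;
    for (i = 0; i < count; i++)
        total += data[i];
    gTotal = total;             /* single store to memory */
}
```

A good optimizing compiler will sometimes perform this hoisting itself, but it cannot when the variable is global or reachable through a pointer, since some other code might observe the intermediate values.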
Improving the performance of PowerPC code
Now that we know that memory accesses have a dramatic effect on PowerPC
performance, we’ll look at some specific examples. Let’s assume that you have a 68K
application which you’ve just ported to the PowerPC. (If you actually have such an
application, give yourself a pat on the back!) Your new application is fast, but you
expected it to be even faster. What can you do?
Look for inefficient code.
As in any application, the efficiency of your algorithms plays a large role in the