Efficient 68030
Volume Number: 8
Issue Number: 5
Column Tag: Efficient coding
Efficient 68030 Programming
Optimizing your code to run faster on the 68030
By Mike Scanlin, MacTutor Regular Contributing Author
So you want to optimize your code to run faster on the 68030? Well, good for
you. This article will help. Most of these optimizations apply to running on the
68020/040, too, and in no case will any of these optimizations make your code run
slower on a non-030 machine (like the 68000). These optimizations are not language
specific, although most of the examples given here are in Think C and assembly.
The two most important things you need to know in order to make your program
run fast on a 68030 are: (1) How to intelligently align your code and data and (2) How
to use the data and instruction caches effectively. Once you understand a few basic
attributes of the 030, you can do a few simple things to your code, without much work, to ensure
optimal cache use and alignment.
How much difference will it make, you ask? Using these techniques you can save
about 15% on your subroutine calling overhead, 25% to 50% on some of your 32-bit
memory accesses (reading or writing longs, Ptrs, Handles, etc.), 60% on your
NewHandleClear calls and more. Taken in sum these things are likely to make a
noticeable difference in your code’s performance when running on an 030.
The single biggest gain you can realize in your programs when running on an
030 is to make sure your 4-byte variables (longs, Ptrs, Handles, etc.) are on 4-byte
memory boundaries (i.e., their address is evenly divisible by 4). To understand why this
makes a difference you need to know a little bit about how the data cache works.
THE DATA CACHE
The 68030 has a 256-byte data cache where it stores copies of the most recently
accessed memory locations and values. If your code reads the same memory location
twice within a short period of time, it is very likely that the 68030 can get the value
at that location from its cache (which is in the processor itself) rather than from
actual memory. This reduces the time it takes to execute the instruction accessing the
memory location because it eliminates the external bus cycle that the processor would
normally use to go out into memory and fetch the value that the instruction needs.
Of course, it’s not quite as simple as that. There are a couple of other details to
know. The 256-byte data cache is actually broken up into 16 separate caches of 16
bytes each. When the processor is trying to fetch the value at a given memory location
it first checks if it is in one of its 16 data caches. If it is, then it uses the value and
everything is cool. If it’s not, then it goes out into memory and fetches the value along
with the 16 or 32 bytes surrounding the value. If x is the memory location requested,
then the processor fetches all the bytes from location ((x >> 4) << 4) to
((((x + sizeof(fetch) - 1) >> 4) << 4) + 15), where sizeof(fetch) is 1, 2 or 4 depending on whether
this is a byte, word or long access. Basically, it fetches enough 16-byte “buckets” to
completely cover the request (and each bucket begins at an address divisible by 16).
More than one 16-byte bucket might be necessary because the first
part of the requested value could be at the end of one bucket and the second part could be
at the beginning of the next bucket. An example is the low memory global GrayRgn,
whose address is $9EE - the first half (hi-word) exists in the bucket from $9E0 to
$9EF and the second half (lo-word) exists in the bucket from $9F0 to $9FF. So,
assuming GrayRgn isn’t in the data cache to begin with, the instruction:
Move.L GrayRgn,A0
will cause 32 bytes to be read in from memory (4 of which the processor
actually uses to process the Move.L instruction). At the end of the instruction, two of
the 16 data caches are filled with the values from $9E0 to $9FF. The
processor does this because it is predicting that the code is going to want the bytes
near the most recently requested memory location. Although the first access to a
16-byte bucket is expensive, subsequent accesses to the same bucket are free (in terms
of time to fetch the data). This actually helps routines like BlockMove quite a bit (it’s
46% faster for non-overlapping moves because of it). And it’s not quite as sick as you