Efficient 68030
Volume Number: 8
Issue Number: 5
Column Tag: Efficient coding
Efficient 68030 Programming
Optimizing your code to run faster on the 68030
By Mike Scanlin, MacTutor Regular Contributing Author
So you want to optimize your code to run faster on the 68030? Well, good for
you. This article will help. Most of these optimizations apply to running on the
68020/040, too, and in no case will any of these optimizations make your code run
slower on a non-030 machine (like the 68000). These optimizations are not language
specific, although most of the examples given here are in Think C and assembly.
The two most important things you need to know in order to make your program
run fast on a 68030 are: (1) How to intelligently align your code and data and (2) How
to use the data and instruction caches effectively. Once you understand a few basic
attributes of the 030, you can do a few simple things to your code, without much work, to ensure
optimal cache use and alignment.
How much difference will it make, you ask? Using these techniques you can save
about 15% on your subroutine calling overhead, 25% to 50% on some of your 32-bit
memory accesses (reading or writing longs, Ptrs, Handles, etc.), 60% on your
NewHandleClear calls and more. Taken in sum these things are likely to make a
noticeable difference in your code’s performance when running on an 030.
The single biggest gain you can realize in your programs when running on an
030 is to make sure your 4-byte variables (longs, Ptrs, Handles, etc.) are on 4-byte
memory boundaries (i.e., their address is evenly divisible by 4). To understand why this
makes a difference you need to know a little bit about how the data cache works.
THE DATA CACHE
The 68030 has a 256-byte data cache where it stores copies of the most recently
accessed memory locations and values. If your code reads the same memory location
twice within a short period of time, it is very likely that the 68030 can get the value
at that location from its cache (which is in the processor itself) rather than from
actual memory. This reduces the time it takes to execute the instruction accessing the
memory location because it eliminates the external bus cycle that the processor would
normally use to go out into memory and fetch the value that the instruction needs.
Of course, it’s not quite as simple as that. There are a couple of other details to
know. The 256-byte data cache is actually broken up into 16 separate caches of 16
bytes each. When the processor is trying to fetch the value at a given memory location
it first checks if it is in one of its 16 data caches. If it is, then it uses the value and
everything is cool. If it’s not, then it goes out into memory and fetches the value along
with the 16 or 32 bytes surrounding the value. If x is the memory location requested,
then the processor fetches all the bytes from location ((x >> 4) << 4) to
((((x + sizeof(fetch) - 1) >> 4) << 4) + 15), where sizeof(fetch) is 1, 2 or 4 depending on whether
this is a byte, word or long access. Basically, it fetches enough 16-byte “buckets” to
completely cover the request (and each bucket begins at an address divisible by 16).
More than one 16-byte bucket might be necessary because the first
part of the requested value could be at the end of one bucket and the second part could be
at the beginning of the next bucket. An example is the low memory global GrayRgn,
whose address is $9EE - the first half (hi-word) exists in the bucket from $9E0 to
$9EF and the second half (lo-word) exists in the bucket from $9F0 to $9FF. So,
assuming GrayRgn isn’t in the data cache to begin with, the instruction:
Move.L GrayRgn,A0
will cause 32 bytes to be read in from memory (4 of which the processor
actually uses to process the Move.L instruction). At the end of the instruction, two of
the 16 data caches are filled with the values from $9E0 to $9FF. The
processor does this because it is predicting that the code is going to want the bytes
near the most recently requested memory location. Although the first access to a
16-byte bucket is expensive, subsequent accesses to the same bucket are free (in terms
of time to fetch the data). This actually helps routines like BlockMove quite a bit (it’s
46% faster for non-overlapping moves because of it). And it’s not quite as sick as you