Efficient 68040
Volume Number: 9
Issue Number: 2
Column Tag: Efficient coding
Efficient 68040 Programming
Optimizing your code to run faster on the 68040
By Mike Scanlin, MacTech Magazine Regular Contributing Author
The current trend towards more and more 68040s is clear to anyone who follows
the Macintosh. Some sources say that most, if not all, of the Mac product line will be
moved to the 68040 sometime in 1993. With QuickTime and Color QuickDraw already
requiring at least a 68020, perhaps the day when system software and applications
require a 68040 isn’t that far away. In preparation for that day, here are some tips on
how to write efficient code for the 68040.
ACCELERATE THE INSTALLED BASE
One of the goals of the 040 designers was to increase the performance of the large
installed base of 680x0 code that was already out there. They gathered 30MB of object
code from several different platforms and profiled it to gather instruction frequency
and other statistics. They used this information to influence the design of the cache
structure and memory management system as well as which parts of the instruction
set they would optimize.
From this trace data it was determined that most of the common instructions
could execute in one clock cycle if the Integer Unit were pipelined and if the
instructions weren’t larger than three words each. The resulting six stage pipeline
optimizes several of the less-complicated addressing modes: Rn, (An), (An)+, -(An),
(An, d16), $Address and #Data. These seven modes are called the optimized
effective-address modes (OEA). When writing efficient 68040 code you should stick to
these addressing modes and not use the others (i.e. don’t use instructions that are 4
words (8 bytes) or longer). Sequences of instructions composed only of these
addressing modes can be pipelined without stalls and will have a lower average
instruction time than sequences of instructions containing 8 byte instructions every so
often.
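To make the point concrete from the C side, here is a minimal sketch of the same loop written two ways. The function names are made up for illustration, and which addressing modes actually come out depends entirely on your compiler, so treat the comments as an assumption to verify in the disassembly: the indexed version may end up using an indexed mode such as (An,Rn,d8), while the pointer-walking version maps naturally onto (An)+.

```c
#include <stddef.h>

/* Indexed form: a compiler may emit an indexed addressing mode here,
   which is not one of the optimized effective-address modes. */
long sum_indexed(const long *a, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Pointer-walking form: *p++ corresponds to (An)+, one of the OEA modes. */
long sum_pointer(const long *a, size_t n)
{
    const long *p = a;
    const long *end = a + n;
    long total = 0;

    while (p != end)
        total += *p++;
    return total;
}
```

Both functions return the same result; the only difference is the shape of the memory access the compiler is invited to generate.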
Figure 1 shows a comparison of cycle times between the 68020 and 68040 that
illustrates some of the improvements made for the 68040 (RL stands for register
list).
BRANCHES
One thing to notice in Figure 1 is that branches taken are now faster than
branches not taken. This is different from every other member of the 680x0
family. It’s somewhat annoying because it means that you can’t simultaneously
optimize for both the 040 and the 030 (there are other cases of this, too, discussed a
little further on). Motorola did this because their trace data of existing
code showed that 75% of all branch instructions were taken.
In addition to switching which was the faster case, they also managed to speed up
both cases by adding a dedicated branch adder that always calculates the destination
address when it sees a branch instruction. If it turns out that the branch is not taken
then the results of the branch adder are ignored.
                                          25MHz cycles
Instruction             Addr mode         68020   68040
Move                    Rn,Rn             2       1
Move                    <ea>,Rn           6       1
Move                    Rn,<ea>           6       1
Move                    <ea>,<ea>         8       2
Move                    (An,Rn,d8),Rn     10      3
Move                    Rn,(An,Rn,d8)     6       3
Move multiple           RL,<ea>           4+2n    2+n
Move multiple           <ea>,RL           8+4n    2+n
Simple arithmetic       Rn,Rn             2       1
Simple arithmetic       Rn,<ea>           6       1
Simple arithmetic       <ea>,Rn           8       1
Shifts (1 to 31 bits)   -                 4       2
Branch taken            -                 6       2
Branch not taken        -                 4       3
Branch to subroutine    -                 6       2
Return from subroutine  -                 10      5
Figure 1
What all of this means in practical terms is that you should always write your
code so that branches are taken rather than not taken. The most commonly executed
thread should take all branches. For instance, this code:
x = 1;
if (likelyEvent)
    x = 2;
can be improved by switching the condition and forcing the branch after the Tst
instruction to be taken (assuming likelyEvent is True more than half of the time):
x = 2;
if (!likelyEvent)
    x = 1;
Be careful when doing this, though, that the compiler doesn’t generate an extra
instruction for the added “!”. If so, it’s not worth switching the condition. But in
examples like the one given, the compiler can usually just change a Beq instruction to
a Bne instruction and you’ll be better off.
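To put rough numbers on this, here is a small sketch that computes the average cost of a branch from the Figure 1 cycle counts. The function and macro names are mine, and the model ignores everything except the branch itself, so take it as back-of-the-envelope arithmetic rather than a timing tool.

```c
/* Branch cycle counts taken from Figure 1 (25MHz cycles). */
#define CYCLES_040_TAKEN      2.0
#define CYCLES_040_NOT_TAKEN  3.0
#define CYCLES_020_TAKEN      6.0
#define CYCLES_020_NOT_TAKEN  4.0

/* Average cycles per branch when a fraction p_taken of its
   executions actually take the branch. */
double expected_branch_cycles(double p_taken, double taken, double not_taken)
{
    return p_taken * taken + (1.0 - p_taken) * not_taken;
}
```

At Motorola's measured 75% taken rate this gives 2.25 cycles per branch on the 040 versus 5.5 on the 020. For a single branch whose likely case occurs 90% of the time, arranging the code so the likely case takes the branch costs 2.1 cycles on average on the 040; arranging it the other way costs 2.9.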
While we’re on the subject of branches, here’s a trick you can use to do a fast
unconditional branch on the 040 if you’re writing in assembly or using a clever C
compiler (works on the 020 and 030, too, but takes longer on those): use the Trapn #
(trap never immediate) instruction to unconditionally branch ahead by 2 or 4 bytes in
1 cycle. One example where this is useful is if you have a small clause (2 or 4 bytes)
in an else statement.
First, define these two macros:
/* Trapn.W */
#define SKIP_TWO_BYTES DC.W 0x51FA
/* Trapn.L */
#define SKIP_FOUR_BYTES DC.W 0x51FB
Now suppose you had this code:
if (x) {
    y = 1;
    z = 2;
}
else
    q = 3;
The normal assembly generated might be:
        Tst     x
        Beq.S   @1
        Moveq   #1,y
        Moveq   #2,z
        Bra.S   @2
@1      Moveq   #3,q
@2
A clever compiler (or, more likely, assembly language programmer) could
optimize this as:
        Tst     x
        Beq.S   @1
        Moveq   #1,y
        Moveq   #2,z
        SKIP_TWO_BYTES
@1      Moveq   #3,q
@2
What’s happening here is that the two bytes generated by the Moveq #3,q
instruction become the immediate data for the Trapn.W instruction in the
SKIP_TWO_BYTES macro. Trapn.W is normally a 4 byte instruction but the macro
only defines the first two bytes. Since it will never trap, the instruction decoder
always ignores its operand (the Moveq #3,q instruction) and begins decoding the next
instruction at @2 on the next clock. It works the same way for the Trapn.L instruction,
except that in that case you embed exactly 4 bytes as the immediate data that will be
skipped as part of the Trapn instruction.
Note that to take advantage of this trick you’re usually going to want the smaller
of the “if” clause and the “else” clause to be the “else” clause (to increase the
chances that the “else” clause is 4 bytes or less). It would be nice if this were the most
commonly executed of the two clauses, too, to take advantage of the faster branch-taken
time. Hopefully compilers that have a “Generate 68020 code” flag will take advantage
of this in the future (I don’t know of any at the moment that do).
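If you want to convince yourself that the two macro values really do swallow 2 and 4 bytes, the opcode words can be picked apart by hand. In Motorola's manuals the instruction is TRAPcc, and the never-true form uses condition F. The sketch below decodes the opcode word according to that encoding; the function names are hypothetical and the code is only an illustration, not part of any toolchain.

```c
#include <stdint.h>

/* Returns the number of immediate-operand bytes that follow a TRAPcc
   opcode word, or -1 if the word is not a TRAPcc instruction.
   Encoding: 0101 cccc 1111 1ooo, where the opmode ooo is
   010 = word operand, 011 = long operand, 100 = no operand. */
int trapcc_operand_bytes(uint16_t opcode)
{
    if ((opcode & 0xF0F8u) != 0x50F8u)
        return -1;

    switch (opcode & 0x0007u) {
    case 2:  return 2;   /* word operand: the two bytes that get skipped */
    case 3:  return 4;   /* long operand: the four bytes that get skipped */
    case 4:  return 0;   /* no operand */
    default: return -1;  /* other opmodes are not TRAPcc */
    }
}

/* Extracts the 4-bit condition field; 0001 is F (never true),
   which is the condition both macros use. */
int trapcc_condition(uint16_t opcode)
{
    return (opcode >> 8) & 0x0F;
}
```

Feeding it 0x51FA gives 2 operand bytes and 0x51FB gives 4, both with condition 1 (never), which matches the SKIP_TWO_BYTES and SKIP_FOUR_BYTES definitions.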
SAVING AND RESTORING REGISTERS
Optimal saving and restoring of registers on the 040 is different than on other
680x0s. When loading registers from memory using the post-increment addressing
mode:
Movem.L (SP)+,RL
you should use individual Move.L instructions instead. It will always be faster, no
matter how many registers are involved (not exactly intuitive, is it?). When storing
registers to memory with the pre-decrement addressing mode, as in:
Movem.L RL,-(SP)
you should use individual Move.L instructions unless your register list consists
of (1) exactly one data register and one address register, or (2) two or more address
registers combined with any number (0..7) of data registers.
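Since these rules are easy to get backwards, here they are written out as a pair of C predicates, the sort of thing a code generator or a macro package might consult. The function names are invented for this sketch, and the logic is nothing more than a transcription of the two rules above.

```c
#include <stdbool.h>

/* Restoring with Movem.L (SP)+,RL: individual Move.L instructions
   are always the faster choice, whatever the register list. */
bool prefer_movem_for_restore(int data_regs, int addr_regs)
{
    (void)data_regs;
    (void)addr_regs;
    return false;
}

/* Saving with Movem.L RL,-(SP): Movem only wins for the two
   register-list shapes described in the text. */
bool prefer_movem_for_save(int data_regs, int addr_regs)
{
    if (data_regs == 1 && addr_regs == 1)
        return true;    /* exactly one data and one address register */
    if (addr_regs >= 2)
        return true;    /* two or more address registers, any data registers */
    return false;       /* everything else: use individual Move.L */
}
```

For example, saving D3-D5 alone should be done with three Move.L instructions, while saving D3-D5 together with A2-A3 is a job for Movem.L.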
OTHER TIDBITS
Three-word instructions with 32-bit immediate operands are faster than trying
to use Moveq to preload the immediate value into a register first. The opposite is true
on earlier 680x0s. For example, this code:
Cmp.L #20,(A0)
is faster on an 040 than this pair of instructions:
Moveq #20,D0
Cmp.L D0,(A0)
When subtracting an immediate value from an address register it is faster to add
the negative value instead. This is because there is no complement circuit for the
address registers in the 040. This instruction:
Add #-4,A0
is faster than either of these two:
Lea -4(A0),A0
Sub #4,A0
Bsr and Bra are faster than Jsr and Jmp because the hardware can precompute
the destination address for Bsr and Bra.
DON’T USE REGISTER VARIABLES
There are some cases where it’s better to use a stack variable instead of a
register variable. The reason is that source effective addresses of the form (An, d16)
are just as fast as Rn once the data is in the data cache. So the first read access to a stack
variable will be slow compared to a register but subsequent reads of that variable will
be equal in speed. By not assigning registers to your read-only stack variables (which
includes function parameters passed on the stack) you save the overhead of
saving/restoring the register as well as the time to initialize it.
You should, however, use register variables for variables that are written to.
For instance, consider this function:
int
Foo(w, x, p)
    int w, x;
    int *p;
{
    int y, z;

    z = w;
    do {
        y += z + *p * w;
        *p += x / w + y;
    } while (--z);
    return (y);
}
In this example, w, x, y, z and *p are being read from (things on the right side of
the equations) and y, z and *p are being written to. On the 040, you should make
register variables out of those things that are being written to and leave the rest as
stack variables:
int
Foo(w, x, p)
    int w, x;
    register int *p;
{
    register int y, z;

    z = w;
    do {
        y += z + *p * w;
        *p += x / w + y;
    } while (--z);
    return (y);
}
This second version is faster than the original version (as you would expect) but
it is also faster than a version where w and x are declared as register variables (which
you might not expect).
FLOATING POINT OPERATIONS
When it came to floating point operations, the 040 designers looked at their trace
data and decided to implement in silicon any instruction that made up more than 1% of
the 68881/2 code base. The remaining [uncommon] instructions were implemented in
software. Those implemented in silicon are:
FAdd, FCmp, FDiv, FMul, FSub
FAbs, FSqrt, FNeg, FMove, FTst
FBcc, FDbcc, FScc, FTrapcc
FMovem, FSave, FRestore
They also made it so the Integer Unit and the Floating Point Unit operate in
parallel, which means you should interleave floating-point and non-floating-point
instructions as much as possible.
Here’s a table that summarizes the performance improvements made by having
the FPU instructions executed by the 040 rather than by a 68882:
                            25MHz cycles
Instruction   Addr mode     68882   68040
FMove         FPn,FPn       21      2
FMove.D       <ea>,FPn      40      3
FMove.D       FPn,<ea>      44      3
FAdd          FPn,FPn       21      3
FSub          FPn,FPn       21      3
FMul          FPn,FPn       76      5
FDiv          FPn,FPn       108     38
FSqrt         FPn,FPn       110     103
FAdd.D        <ea>,FPn      75      3
FSub.D        <ea>,FPn      75      3
FMul.D        <ea>,FPn      95      5
FDiv.D        <ea>,FPn      127     38
FSqrt.D       <ea>,FPn      129     103
Notice that on the 040 an FMul is about 7x faster than an FDiv and on a 68882
it’s only about 1.4x faster. This suggests that you should avoid FDiv on an 040 much
more than you would on a 68882. Perhaps your algorithms could be rewritten to take
advantage of this when running on an 040.
A trick that works in some cases is to multiply by 1 over a number instead of
dividing by a number. Take this code from a previous MacTutor article on random
numbers:
;1
M EQU $7FFFFFFF
quotient EQU FP0
newSeed EQU D1
result EQU 8
LocalSize EQU 0
Link A6,#LocalSize
Jsr UpdateSeed
FMove.L newSeed,quotient
FDiv.L #M,quotient
FMove.X quotient,([result,A6])
Unlk A6
Rts
By precomputing the floating point value OneOverM (1/M) and restricting
ourselves to the optimized effective addressing modes we can rewrite this code to
eliminate the Link, Unlk and FDiv:
;2
OneOverM EQU "$3FE000008000000100000002
quotient EQU FP0
newSeed EQU D1
result EQU 4
Jsr UpdateSeed
FMove.L newSeed,quotient
FMul.X #OneOverM,quotient
Move.L result(A7),A0
FMove.X quotient,(A0)
Rts
This optimized version runs about 38% faster than the original overall (the
relatively low improvement is caused by the fact that UpdateSeed is taking up most of
the time). This example points out one other interesting thing, too, and that is the
Move.L result(A7),A0 (an Integer Unit instruction) is running in parallel with the
FMul instruction (an FPU instruction). Since the FMul takes longer, the FMove.X
instruction at the end will have to wait for the FMul to finish before it does its move
but there’s nothing we can do about that in this case.
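If you work in C rather than assembly, the same idea can be sketched as follows. The function names are hypothetical and double is used in place of the extended type, so this illustrates the transformation rather than translating the listing. One thing to keep in mind is that multiplying by a rounded reciprocal can differ from a true divide in the last bit or so, which is fine for a random number generator but may matter elsewhere.

```c
/* M is the modulus from the listing: $7FFFFFFF = 2^31 - 1. */
#define M 2147483647.0

/* The reciprocal is a constant expression, so the divide that
   produces it is not paid for on every call. */
static const double kOneOverM = 1.0 / M;

/* Original form: one floating-point divide per call. */
double seed_to_unit_divide(long seed)
{
    return (double)seed / M;
}

/* Rewritten form: one floating-point multiply per call. */
double seed_to_unit_multiply(long seed)
{
    return (double)seed * kOneOverM;
}
```

Both functions map a seed in the range 0..M onto roughly 0..1; the second trades the 38-cycle FDiv for a 5-cycle FMul on the 040.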
INSTRUCTION AND DATA CACHE
The 040 has a 4K instruction cache and a 4K data cache. If you are performing
some operation on a large amount of data, try to make your code fit in 4K or less (at
least your innermost loop if nothing else) and try to operate on 4K chunks of
contiguous data at a time. Don’t randomly read single bytes from a large amount of data
if you can help it. This will avoid cache flushing and reloading as much as possible.
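As a rough illustration of the chunking advice, here is a sketch in which two passes over a buffer (a made-up "halve every byte, then sum" operation standing in for any multi-step job) are done either across the whole buffer or one 4K chunk at a time. The names and the operation are invented for the example. The idea is only that the second pass finds its data still in the cache; whether it pays off in your case is something to measure.

```c
#include <stddef.h>

#define CHUNK_BYTES 4096   /* size of the 040 data cache */

/* Two passes over the whole buffer: for a large buffer, the second
   pass has to reload data the first pass already pushed out. */
unsigned long halve_then_sum_whole(unsigned char *buf, size_t len)
{
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(buf[i] >> 1);
    for (i = 0; i < len; i++)
        total += buf[i];
    return total;
}

/* Same work, but both passes are finished on one 4K chunk
   before moving on to the next one. */
unsigned long halve_then_sum_chunked(unsigned char *buf, size_t len)
{
    unsigned long total = 0;
    size_t start, end, i;

    for (start = 0; start < len; start += CHUNK_BYTES) {
        end = (len - start < CHUNK_BYTES) ? len : start + CHUNK_BYTES;
        for (i = start; i < end; i++)
            buf[i] = (unsigned char)(buf[i] >> 1);
        for (i = start; i < end; i++)
            total += buf[i];
    }
    return total;
}
```

Both versions produce identical results; only the order in which memory is touched changes.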
Many of the things I mentioned in the Efficient 68030 Programming article
(Sept 92) about 16-byte cache lines apply to the 040 as well; it’s just that the 040
has more of them. Also, as mentioned in the 030 article, data alignment is majorly
important on the 040 as well. Rather than repeat it all here, check out that previous
article instead.
MOVE16
Most people have at least heard about the only new instruction that the 68040
provides but many people aren’t sure when they can use it. The rules are pretty
simple: the source and destination addresses must each be a multiple of 16 and you
must be moving 16 bytes at a time.
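Those rules boil down to a three-part test you can run before choosing a copy loop. Here is a minimal sketch of such a check in C; the function name is hypothetical, and it only decides eligibility (the copy loop itself would have to be written in assembly).

```c
#include <stddef.h>
#include <stdint.h>

#define MOVE16_LINE 16   /* Move16 transfers one 16-byte line at a time */

/* Returns nonzero if a copy of len bytes from src to dst could be done
   entirely with Move16: both addresses on 16-byte boundaries and the
   length a nonzero multiple of 16. */
int can_use_move16(const void *src, const void *dst, size_t len)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;

    return len != 0
        && (s % MOVE16_LINE) == 0
        && (d % MOVE16_LINE) == 0
        && (len % MOVE16_LINE) == 0;
}
```

If the check fails you simply fall back to whatever copy routine you were using before.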
So when is this useful? Well, if you know you’re running in a 68040
environment (use Gestalt) then you know that the Memory Manager only allocates
blocks on 16 byte boundaries (because that’s the way Apple implemented it). You can
use this information to your advantage if you are copying data from one memory block
to another.
Why not just use BlockMove, you ask? Three reasons: (1) Trap overhead, (2)
Job preflighting to find the optimal move instructions for the given parameters
(which we already know are Move16 compatible) and, (3) It flushes the caches for the
range of memory you moved every time you call it.
Why does it flush the caches? Because of the case where the Memory Manager has
called it to move a relocatable block that contains code (the MM doesn’t know anything
about the contents of a block so it has to assume the worst). This one case imposes an
unnecessary penalty on your non-code BlockMoves (99% of all moves, I would guess)
and it is this author’s opinion that Apple should provide a BlockMoveData trap that
doesn’t flush the caches and that would only be called when the programmer who wrote
the code knew that what was being moved was not code (and deliberately made a call to
BlockMoveData instead of BlockMove). Write your senator, maybe we can do some good
here.
One other thing to note about the Move16 instruction is that unlike other Move
instructions it doesn’t leave the data it’s moving in the data cache. This is great if
you’re moving a large amount of data that you’re not going to manipulate afterwards
(like updating a frame buffer for the screen or something) but may not be what you
want if you’re about to manipulate the data that you’re moving (where it might be
advantageous to have it in the cache after it’s been moved). There is no rule of thumb
on this because it depends on how much data you have and how much manipulation
you’re going to do on it after it’s moved. You’ll have to run some tests for your
particular case.
Well, that’s all the tips and tricks I know for programming the 68040. I’d like
to thank the friendly and efficient people at Motorola for source material in producing
this article as well as for producing such an awesome processor. I am truly a fan. With
any luck at all the 80x86 camp will wither away and die and 680x0’s will RULE THE
WORLD! Thanks also to RuleMaster Hansen for his code, clarifications, corrections and
rules.