Fixed-Point Math Speed
Volume Number: 13
Issue Number: 11
Column Tag: Programming Techniques
Getting a Speed Boost with Fixed-Point Math
by Daniel W. Rickey, Manitoba Cancer Treatment and Research Foundation
A little assembly goes a long way to improving code
execution speed
Motivation
The recent releases of the Metrowerks compilers allow Pascal programmers to include
assembly language in their programs as functions and procedures. A built-in
assembler is a very powerful tool because many of the features of the Pascal
environment are available in the assembler. For example, compiler directives enable
a routine to contain both 68k and PowerPC assembly code. In addition, the integrated
environment provides seamless source-level debugging of the assembly and Pascal
code. (C programmers can gain the same benefits, but they must write complete
functions to use the built-in assembler.)
One of the biggest advantages of assembler is that it allows the programmer to use
processor instructions that are difficult or impossible to access from a high-level
language. Because of this ability, the programmer may have the option of using a more
efficient algorithm when coding in assembler. For these reasons, the assembler is an
excellent tool for optimizing time-critical routines. This article shows how the
built-in assembler can be used to optimize some simple fixed-point math routines.
Fixed-Point Math
In many imaging applications, such as texture mapping or ray casting of
three-dimensional medical data sets (Herman 1993), approximately 105 to 106
pixel values are mapped to the screen. For interactive viewing, we need to calculate at
least 105 position and intensity values each second. These calculations can be
performed with floating-point math and the resulting values rounded to integer format
for display. Unfortunately, the round function is very slow, particularly on 68k
machines lacking a 68882 co-processor, e.g. machines using a 68LC040. In fact, all
floating-point operations on machines lacking a co-processor are very slow. Although
the PowerPC processors offer excellent floating-point performance they are
relatively slow at rounding because this operation requires a move to and from main
memory. To overcome these speed limitations, programmers can use fixed-point
calculations instead of floating-point operations. On a 68k processor, rounding a
fixed-point number can be 50 times faster than rounding a floating-point number. If a
co-processor is not present, other operations, such as multiplication, will be much
faster using fixed-point math. On PowerPC-based machines the fixed-point round is
about 3 times faster than a floating-point round. The speed advantage gives strong
motivation to use fixed-point math.
Fixed-point math is surprisingly easy to use. On Macintosh a fixed-point number is
32 bits long, that is Fixed = LongInt;. The low-order 16 bits contain the fractional
part of the number and the high-order 16 bits contain the integer portion of the
number. Addition and subtraction are handled as they normally would be: by simply
adding and subtracting the numbers. Multiplication, division and type conversions
must be performed using function calls. The Toolbox functions for performing these
operations are:
FUNCTION FixMul(a, b: Fixed): Fixed;
FUNCTION FixDiv(a, b: Fixed): Fixed;
FUNCTION Fix2X(x: Fixed): double_t;
FUNCTION X2Fix(x: double_t): Fixed;
FUNCTION Long2Fix(x: LongInt): Fixed;
FUNCTION Fix2Long(x: Fixed): LongInt;
FUNCTION FixRound(x: Fixed): Integer;
Because the extended variable type is not supported on a PowerPC, the original Fix2X
and X2Fix functions were updated to replace the Extended type with double_t, which is
equal to a Double on a PowerPC and an Extended on a 68k processor.
Goals
The primary motivation for using fixed-point calculations is speed. Unfortunately,
neither the 68k nor PowerPC Toolbox routines are well optimized. In fact, these
routines were not PowerPC native until system 7.5.3. In this article we will
concentrate on optimizations of the multiply, divide, and round functions as they are
used most often. For this work we will use the built-in assembler, available in the
Metrowerks Pascal Compiler, to write 68k and PowerPC code.
In addition to being fast, we require our functions to be well behaved. Because
fixed-point numbers must lie between 32767.9999 and -32768, there exists a
possibility of overflow during multiplication and division. If an overflow occurs, the
function returns a large value with the appropriate sign. In the case of multiplication,
an overflow condition behaves similar to the Toolbox routine: $7FFE0000 is returned
when both arguments have the same sign and -$7FFE0000 is returned when the
arguments have different signs. The largest possible fixed-point value is not returned
as this might cause an overflow in the calling code. Note that these examples differ
from the 68k assembler code of Lebedev (1994), which did not deal with overflow
conditions.
Fixed-Point Round
The algorithm for rounding a fixed-point number is very simple: one half is added to
the number and then the result is bit-shifted right by 16 bits. Shown in Listing 1 is
the function for rounding a fixed-point number. A normal function declaration is used
followed by the asm directive. Compiler directives are used to determine whether
680x0 or PowerPC instructions are used. Note that when compiled for a 68000-based
machine, the Toolbox routine is used. This is because we make use of instructions that
are available only on 68020 and newer processors.
680x0
Those unfortunate souls who program in C should note that the conventions for passing
parameters to and from a 68k function differ from the Pascal code shown here. The
function begins with the fralloc directive, which creates a stack frame using link a6,
#$0. Once x is loaded into register d0, it is added to $7FFF, which, to maintain
consistency with the PowerPC code shown below, is a tad less than one half ($8000).
An algebraic shift is used to sign extend the result. Note that the "immediate" form of
the asr bit-shift instruction can only shift a maximum of 8 bits, thus two instructions
are required. The result is placed on the stack in the location allocated by the caller,
i.e. $C(a6). Note that (a6) is the previous value of a6; $4(a6) is the return address;
$8(a6) is the value of x; and $C(a6) is the space allocated for the returned value. The
frfree directive simply undoes what fralloc did: it emits unlk a6. After executing this
instruction, a6 will contain the previous value of a6 and a7 will point to the return
address. Finally the function executes the rtd instruction to return to the caller by
loading the return address into the program counter. This instruction also causes the
stack pointer to point to the returned value thereby de-allocating the stack space
originally allocated for x.
PowerPC
Compared to the 680x0, assembly language on the PowerPC processor is much more
civilized. For example, we don't have to sort out the gruesome details of address modes
and stack pointers because parameters are passed in registers and the same
conventions are used for Pascal and C. In addition, there are only a few simple
addressing modes compared to the fourteen available on the 680x0.
As described by Evans (1996) and Kacmarcik (1995), volatile registers r3 through
r10 are used to pass parameters to the function. The PowerPC implementation of the
MyFixRound function illustrates this: x is passed to the function in register r3 and the
function result is returned in register r3. The PowerPC assembler differs from the
68k assembler because a stack frame is not required as long as the number of local
variables is kept small. The first instruction addic (add immediate with carry) adds
the sign-extended 16-bit $7FFF to r3 and places the result in r3. Because of the sign
extension, we cannot add the exact one-half value of $8000. However, this
shortcoming should not make any difference to real-world applications. The srawi
(shift right algebraic word immediate) instruction is used to perform an algebraic
shift right by 16 bits. The standard blr (branch to link register) instruction ends the
function. (The link register was loaded with the return address when the function was
called.) It should be noted that in PowerPC assembly language there are many extended
mnemonics that the assembler converts into valid instructions. However, the
Metrowerks compiler may not recognize all of them.
Listing 1: MyFixRound
Function MyFixRound(x :Fixed): LongInt;
{$IFC NOT POWERPC} {$IFC NOT OPTION(MC68020)}
INLINE $A840; {on a 68000 use the Toolbox routine}
{$ENDC} {$ENDC}
Function MyFixRound(x :Fixed): LongInt;
asm;
Begin
{$IFC OPTION(MC68020)}
fralloc {create an a6 stack frame}
move.l x, d0; {move x into register d0}
add.l #$7FFF, d0; {add 1/2 to x}
asr.l #$08, d0; {shift 32-bit word to right by 8 bits}
asr.l #$08, d0; {shift 32-bit word to right by 8 bits}
move.l d0, $000C(a6); {place the result on the stack}
frfree; {free the a6 stack frame}
rtd #$0004; {return and de-allocate parameter}
{$ENDC}
{$IFC POWERPC}
addic r3, r3, $7FFF; {add 1/2 to x}
srawi r3, r3, 16; {shift word to right by 16 bits}
blr; {return; results are returned in r3}
{$ENDC}
End; {MyFixRound}
Fixed-Point multiply
The algorithm for multiplying two fixed-point values is also simple: multiply the two
values together and then bit-shift the result to the right by 16 bits. Shown in Listing
2 is the function for multiplying two fixed-point numbers. The basic structures of the
680x0 and PowerPC functions are similar to those shown in Listing 1.
680x0
This function illustrates the use of a processor instruction (muls.l) that is not
available from Pascal. The muls.l instruction multiplies two signed 32-bit numbers
together to give a 64-bit result, which is contained in two registers. After performing
the multiplication, the routine ensures that the result is smaller than the largest
acceptable fixed-point value, i.e. the high word of the result must be smaller than
$7FFF. If the result is within range, the bit shift right is performed. Because the
result is contained in two registers, the bit-shift operation requires a couple of steps:
1) d1 is shifted right by 16 bits; 2) d2 is algebraically shifted left by 16 bits; and,
3) the high order two bytes of d2 are moved into d1. Pairs of instructions are used to
perform the 16-bit shifts. In the case of an overflow or out-of-range result, the
function returns a large number whose sign depends on the signs of the two arguments.
Note that because two parameters are passed to this function, the result is placed in a
different location on the stack than for the MyFixRound function, i.e. $10(a6) versus
$0C(a6). In addition, the rtd instruction removes two four-byte parameters from the
stack.
PowerPC
Performing a 32-bit by 32-bit multiply on PowerPC processor requires two
instructions: mulhw calculates the high-order 32 bits, and mullw the low-order 32
bits. After performing the multiplication, the cmpwi instruction is used to check if the
result is smaller than the largest acceptable fixed-point value. Note that cmpwi is an
extended form of the cmpi (compare immediate) instruction. If the result is in range,
a technique similar to the 680x0 example is used to combine the results stored in
registers r5 and r6. However, only two instructions are needed on the PowerPC: the
rlwinm (rotate left word immediate then AND with mask) instruction is used to bit
shift the low-order 32 bits of r5 and the rlwimi (rotate left word immediate then
mask insert) is used to move the lower 16 bits of r6 into the top 16 bits of r3.
Out-of-range values are handled as in the 68k code.
Listing 2: MyFixMul
Function MyFixMul(x, y :Fixed):Fixed;
{$IFC NOT POWERPC} {$IFC NOT OPTION(MC68020)}
INLINE $A868; {on a 68000 use the Toolbox routine}
{$ENDC} {$ENDC}
Function MyFixMul(x, y :Fixed):Fixed;
asm;
Begin
{$IFC OPTION(MC68020)}
fralloc {create an a6 stack frame}
move.l x, d0; {move x and y values into registers}
move.l y, d1;

muls.l d0, d2:d1; {do the multiplication d2:d1 := d0*d1}
cmpi.l #$00007FFF, d2; {check if the result is larger than the}
bge Overflow; {maximum allowed for a fixed-point number}

cmpi.l #-$00007FFF, d2; {check if the result is less than the}
ble Overflow; {minimum allowed for a fixed-point number}

lsr.l #08, d1; {want the high-order 2 bytes of d1; shift right
16 bits}
lsr.l #08, d1; {can only bit shift a maximum of 8 bits}