All Databases MacTech Vol 13-1997

Fixed-Point Math Speed

Volume Number: 13

Issue Number: 11

Column Tag: Programming Techniques

Getting a Speed Boost with Fixed-Point Math

by Daniel W. Rickey, Manitoba Cancer Treatment and Research Foundation

A little assembly goes a long way to improving code

execution speed

Motivation

The recent releases of the Metrowerks compilers allow Pascal programmers to include

assembly language in their programs as functions and procedures. A built-in

assembler is a very powerful tool because many of the features of the Pascal

environment are available in the assembler. For example, compiler directives enable

a routine to contain both 68k and PowerPC assembly code. In addition, the integrated

environment provides seamless source-level debugging of the assembly and Pascal

code. (C programmers can gain the same benefits, but they must write complete

functions to use the built-in assembler.)

One of the biggest advantages of assembler is that it allows the programmer to use

processor instructions that are difficult or impossible to access from a high-level

language. Because of this ability, the programmer may have the option of using a more

efficient algorithm when coding in assembler. For these reasons, the assembler is an

excellent tool for optimizing time-critical routines. This article shows how the

built-in assembler can be used to optimize some simple fixed-point math routines.

Fixed-Point Math

In many imaging applications, such as texture mapping or ray casting of

three-dimensional medical data sets (Herman 1993), approximately 105 to 106

pixel values are mapped to the screen. For interactive viewing, we need to calculate at

least 105 position and intensity values each second. These calculations can be

performed with floating-point math and the resulting values rounded to integer format

for display. Unfortunately, the round function is very slow, particularly on 68k

machines lacking a 68882 co-processor, e.g. machines using a 68LC040. In fact, all

floating-point operations on machines lacking a co-processor are very slow. Although

the PowerPC processors offer excellent floating-point performance they are

relatively slow at rounding because this operation requires a move to and from main

memory. To overcome these speed limitations, programmers can use fixed-point

calculations instead of floating-point operations. On a 68k processor, rounding a

fixed-point number can be 50 times faster than rounding a floating-point number. If a

co-processor is not present, other operations, such as multiplication, will be much

faster using fixed-point math. On PowerPC-based machines the fixed-point round is

about 3 times faster than a floating-point round. The speed advantage gives strong

motivation to use fixed-point math.

Fixed-point math is surprisingly easy to use. On Macintosh a fixed-point number is

32 bits long, that is Fixed = LongInt;. The low-order 16 bits contain the fractional

part of the number and the high-order 16 bits contain the integer portion of the

number. Addition and subtraction are handled as they normally would be: by simply

adding and subtracting the numbers. Multiplication, division and type conversions

must be performed using function calls. The Toolbox functions for performing these

operations are:

FUNCTION FixMul(a, b: Fixed): Fixed;

FUNCTION FixDiv(a, b: Fixed): Fixed;

FUNCTION Fix2X(x: Fixed): double_t;

FUNCTION X2Fix(x: double_t): Fixed;

FUNCTION Long2Fix(x: LongInt): Fixed;

FUNCTION Fix2Long(x: Fixed): LongInt;

FUNCTION FixRound(x: Fixed): Integer;

Because the extended variable type is not supported on a PowerPC, the original Fix2X

and X2Fix functions were updated to replace the Extended type with double_t, which is

equal to a Double on a PowerPC and an Extended on a 68k processor.

Goals

The primary motivation for using fixed-point calculations is speed. Unfortunately,

neither the 68k nor PowerPC Toolbox routines are well optimized. In fact, these

routines were not PowerPC native until system 7.5.3. In this article we will

concentrate on optimizations of the multiply, divide, and round functions as they are

used most often. For this work we will use the built-in assembler, available in the

Metrowerks Pascal Compiler, to write 68k and PowerPC code.

In addition to being fast, we require our functions to be well behaved. Because

fixed-point numbers must lie between 32767.9999 and -32768, there exists a

possibility of overflow during multiplication and division. If an overflow occurs, the

function returns a large value with the appropriate sign. In the case of multiplication,

an overflow condition behaves similar to the Toolbox routine: $7FFE0000 is returned

when both arguments have the same sign and -$7FFE0000 is returned when the

arguments have different signs. The largest possible fixed-point value is not returned

as this might cause an overflow in the calling code. Note that these examples differ

from the 68k assembler code of Lebedev (1994), which did not deal with overflow

conditions.

Fixed-Point Round

The algorithm for rounding a fixed-point number is very simple: one half is added to

the number and then the result is bit-shifted right by 16 bits. Shown in Listing 1 is

the function for rounding a fixed-point number. A normal function declaration is used

followed by the asm directive. Compiler directives are used to determine whether

680x0 or PowerPC instructions are used. Note that when compiled for a 68000-based

machine, the Toolbox routine is used. This is because we make use of instructions that

are available only on 68020 and newer processors.

680x0

Those unfortunate souls who program in C should note that the conventions for passing

parameters to and from a 68k function differ from the Pascal code shown here. The

function begins with the fralloc directive, which creates a stack frame using link a6,

#$0. Once x is loaded into register d0, it is added to $7FFF, which, to maintain

consistency with the PowerPC code shown below, is a tad less than one half ($8000).

An algebraic shift is used to sign extend the result. Note that the "immediate" form of

the asr bit-shift instruction can only shift a maximum of 8 bits, thus two instructions

are required. The result is placed on the stack in the location allocated by the caller,

i.e. $C(a6). Note that (a6) is the previous value of a6; $4(a6) is the return address;

$8(a6) is the value of x; and $C(a6) is the space allocated for the returned value. The

frfree directive simply undoes what fralloc did: it emits unlk a6. After executing this

instruction, a6 will contain the previous value of a6 and a7 will point to the return

address. Finally the function executes the rtd instruction to return to the caller by

loading the return address into the program counter. This instruction also causes the

stack pointer to point to the returned value thereby de-allocating the stack space

originally allocated for x.

PowerPC

Compared to the 680x0, assembly language on the PowerPC processor is much more

civilized. For example, we don't have to sort out the gruesome details of address modes

and stack pointers because parameters are passed in registers and the same

conventions are used for Pascal and C. In addition, there are only a few simple

addressing modes compared to the fourteen available on the 680x0.

As described by Evans (1996) and Kacmarcik (1995), volatile registers r3 through

r10 are used to pass parameters to the function. The PowerPC implementation of the

MyFixRound function illustrates this: x is passed to the function in register r3 and the

function result is returned in register r3. The PowerPC assembler differs from the

68k assembler because a stack frame is not required as long as the number of local

variables is kept small. The first instruction addic (add immediate with carry) adds

the sign-extended 16-bit $7FFF to r3 and places the result in r3. Because of the sign

extension, we cannot add the exact one-half value of $8000. However, this

shortcoming should not make any difference to real-world applications. The srawi

(shift right algebraic word immediate) instruction is used to perform an algebraic

shift right by 16 bits. The standard blr (branch to link register) instruction ends the

function. (The link register was loaded with the return address when the function was

called.) It should be noted that in PowerPC assembly language there are many extended

mnemonics that the assembler converts into valid instructions. However, the

Metrowerks compiler may not recognize all of them.

Listing 1: MyFixRound

Function MyFixRound(x :Fixed): LongInt;

{$IFC NOT POWERPC} {$IFC NOT OPTION(MC68020)}
INLINE $A840;  {on a 68000 use the Toolbox routine}
{$ENDC} {$ENDC}

Function MyFixRound(x :Fixed): LongInt;

asm;

Begin

{$IFC OPTION(MC68020)}
  fralloc              {create an a6 stack frame}
  move.l  x, d0;          {move x into register d0}
  add.l  #$7FFF, d0;      {add 1/2 to x}
  asr.l  #$08, d0;      {shift 32-bit word to right by 8 bits}
  asr.l  #$08, d0;      {shift 32-bit word to right by 8 bits}

  move.l  d0, $000C(a6);  {place the result on the stack}
  frfree;              {free the a6 stack frame}
  rtd  #$0004;          {return and de-allocate parameter}
{$ENDC}
{$IFC POWERPC}
  addic  r3, r3, $7FFF;  {add 1/2 to x}
  srawi  r3, r3, 16;    {shift word to right by 16 bits}
  blr;                  {return; results are returned in r3}
{$ENDC}

End; {MyFixRound}

Fixed-Point multiply

The algorithm for multiplying two fixed-point values is also simple: multiply the two

values together and then bit-shift the result to the right by 16 bits. Shown in Listing

2 is the function for multiplying two fixed-point numbers. The basic structures of the

680x0 and PowerPC functions are similar to those shown in Listing 1.

680x0

This function illustrates the use of a processor instruction (muls.l) that is not

available from Pascal. The muls.l instruction multiplies two signed 32-bit numbers

together to give a 64-bit result, which is contained in two registers. After performing

the multiplication, the routine ensures that the result is smaller than the largest

acceptable fixed-point value, i.e. the high word of the result must be smaller than

$7FFF. If the result is within range, the bit shift right is performed. Because the

result is contained in two registers, the bit-shift operation requires a couple of steps:

1) d1 is shifted right by 16 bits; 2) d2 is algebraically shifted left by 16 bits; and,

3) the high order two bytes of d2 are moved into d1. Pairs of instructions are used to

perform the 16-bit shifts. In the case of an overflow or out-of-range result, the

function returns a large number whose sign depends on the signs of the two arguments.

Note that because two parameters are passed to this function, the result is placed in a

different location on the stack than for the MyFixRound function, i.e. $10(a6) versus

$0C(a6). In addition, the rtd instruction removes two four-byte parameters from the

stack.

PowerPC

Performing a 32-bit by 32-bit multiply on PowerPC processor requires two

instructions: mulhw calculates the high-order 32 bits, and mullw the low-order 32

bits. After performing the multiplication, the cmpwi instruction is used to check if the

result is smaller than the largest acceptable fixed-point value. Note that cmpwi is an

extended form of the cmpi (compare immediate) instruction. If the result is in range,

a technique similar to the 680x0 example is used to combine the results stored in

registers r5 and r6. However, only two instructions are needed on the PowerPC: the

rlwinm (rotate left word immediate then AND with mask) instruction is used to bit

shift the low-order 32 bits of r5 and the rlwimi (rotate left word immediate then

mask insert) is used to move the lower 16 bits of r6 into the top 16 bits of r3.

Out-of-range values are handled as in the 68k code.

Listing 2: MyFixMul

Function MyFixMul(x, y :Fixed):Fixed;

{$IFC NOT POWERPC} {$IFC NOT OPTION(MC68020)}
INLINE $A868;    {on a 68000 use the Toolbox routine}
{$ENDC} {$ENDC}

Function MyFixMul(x, y :Fixed):Fixed;

asm;

Begin

{$IFC OPTION(MC68020)}
  fralloc      {create an a6 stack frame}
  move.l  x, d0;  {move x and y values into registers}
  move.l  y, d1;

  muls.l  d0, d2:d1;        {do the multiplication d2:d1 := d0*d1}
  cmpi.l  #$00007FFF, d2;    {check if the result is larger than the}
  bge  Overflow;          {maximum allowed for a fixed-point number}

  cmpi.l  #-$00007FFF, d2;  {check if the result is less than the}
  ble  Overflow;          {minimum allowed for a fixed-point number}

  lsr.l  #08, d1;      {want the high-order 2 bytes of d1; shift right
16 bits}
  lsr.l  #08, d1;      {can only bit shift a maximum of 8 bits}

Referenced by (6):