Efficient C
Volume Number: 9
Issue Number: 8
Column Tag: C Workshop
Efficient C Programming 
High-level optimizations
By Mike Scanlin, MacTech Magazine Regular Contributing Author
This article explains and gives examples of how to get better code generation for
common operations and constructs in C on the Macintosh. These higher level
optimizations will, on average, produce faster and smaller code in other languages (C++, Pascal) as well. Some of them will make your code harder to read and more difficult to port, and they may even hurt performance on non-680x0 CPUs.
However, for those cases where you’re optimizing your bottlenecks for the 680x0
CPUs, these tricks will help you.
There are several things you can do to your code independent of which language or
compiler you’re using in order to improve performance. Let’s start with those.
FILE INPUT/OUTPUT
There are three things to know in order to produce optimal file I/O: (1) always
read/write in sector-aligned amounts, (2) read/write in chunks as large as possible
and (3) be sure to disable Apple’s disk cache.
From the Mac file system’s point of view, files are made up of blocks. From a
disk driver point of view blocks are made up of sectors. A sector is usually 512 bytes.
On large disks a block can be 10K or larger (files are a minimum of 1 block in size).
You can get the exact block size by doing a GetVInfo and checking the
volumeParam.ioVAlBlkSiz field. Your buffers should be multiples of this amount when
reading and writing to that particular volume (and if not, then they should at least be
multiples of the sector size) and should begin on a block boundary (or, at a minimum,
a sector boundary) within the file. Sector-aligned reads and writes can be 2x faster
than unaligned ones.
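As a sketch, the usual alignment arithmetic for a power-of-2 sector or block size looks like this (512 is the typical sector size; substitute the volume’s ioVAlBlkSiz where you have it):

```c
#include <assert.h>

/* Round a byte count up to a multiple of the sector (or block) size.
 * Works for any power-of-2 size; 512 is the usual sector size. */
static long RoundUpToSector(long bytes, long sectorSize)
{
    return (bytes + sectorSize - 1) & ~(sectorSize - 1);
}
```

Request that many bytes per Read or Write call and your transfers stay aligned.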
Recently while implementing a virtual memory scheme I had to determine the
optimal VM page size for maximum disk throughput (measured in bytes/second). After
testing a variety of page sizes on a variety of CPUs and hard disks, I determined that
the optimal size was 64K. If you read and write less than 64K at a time you will not be
minimizing disk I/O time (and for very small reads and writes you will be paying a
significant throughput penalty). Here’s an experiment for the unbelievers: write a
little program that writes an 8MB file 512 bytes at a time and then writes another
8MB file 64K at a time. You should find that the 64K at a time case is 8x to 40x faster
than the 512 byte at a time case. Then try reading in that 8MB file 512 bytes at a time
and then 64K at a time. It should be about 35x to 40x faster for the 64K case (your
actual times will depend on your CPU and your hard drive).
Lastly, if you are using large aligned I/O buffers you should turn off Apple’s disk
cache for your reads and writes. IM Files pg. 2-95 says that you can do this by setting
bit 5 of ioPosMode before a Read or Write call. In those cases where the cache is
kicking in, you’ll be 13x faster by forcing it off. For a complete explanation of Apple’s
disk cache, see “Apple’s Lame Disk Cache” on pg. 75 of April 1993 MacTech.
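As a minimal sketch of the bit manipulation only (the mask name is mine, and the surrounding parameter-block setup and the actual PBRead/PBWrite calls are omitted):

```c
#include <assert.h>

/* Bit 5 of ioPosMode is the no-cache bit (mask 0x0020). OR-ing it
 * into ioPosMode before a Read or Write call tells the file system
 * to bypass Apple's disk cache for that call. */
#define kNoCacheMask 0x0020

static short AddNoCacheBit(short ioPosMode)
{
    return (short) (ioPosMode | kNoCacheMask);
}
```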
OTHER LANGUAGE-INDEPENDENT THINGS
In a previous MacTech article (Sept 1992) I droned on and on about aligning
your data structures and stack usage. That’s true in any language on the Mac (because
it’s a result of the 680x0 architecture). Do it.
One thing you can do to reduce the calling overhead of your functions is to use
fewer parameters. Sometimes you’ll find that a general purpose routine that takes
several parameters is being called in a loop where most of the parameters aren’t
changing between calls within the loop. In cases like this you should make a parameter
block and pass a pointer to the parameter block rather than passing all of the
parameters each time. Not only does this make aligned-stack calling easier to
implement and maintain but it really reduces the pushing and popping of invariant
stack data during your loop. For instance, you could change this prototype:
void MyWizzyFunction(short wizFactor, long numWizzes,
    Boolean doWiz, short fudgeFactor, short wizHorz,
    short wizVert);
to this:
void MyWizzyFunction(WizParamBlockPtr wizPBPtr);
with this parameter block:
typedef struct WizParamBlock {
    long    numWizzes;
    short   wizFactor;
    short   fudgeFactor;
    short   wizHorz;
    short   wizVert;
    Boolean doWiz;
} WizParamBlock, *WizParamBlockPtr;
You’ll save a lot of time from reduced stack operations on each call to
MyWizzyFunction.
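As a sketch of the calling pattern (the function body here is made up just to have something to compute), the invariant fields are filled in once and only the varying field changes inside the loop:

```c
#include <assert.h>

typedef unsigned char Boolean;   /* stand-in for the Mac type */

typedef struct WizParamBlock {
    long    numWizzes;
    short   wizFactor;
    short   fudgeFactor;
    short   wizHorz;
    short   wizVert;
    Boolean doWiz;
} WizParamBlock, *WizParamBlockPtr;

/* Hypothetical body: combines a few fields so the calling
 * pattern can be exercised. */
static long MyWizzyFunction(WizParamBlockPtr pb)
{
    return pb->numWizzes * pb->wizFactor + pb->wizHorz;
}

static long CallInLoop(void)
{
    WizParamBlock pb;
    long          total = 0;
    short         h;

    /* Fill in the invariant fields once, outside the loop. */
    pb.numWizzes   = 10;
    pb.wizFactor   = 2;
    pb.fudgeFactor = 0;
    pb.wizVert     = 0;
    pb.doWiz       = 1;

    /* Only the varying field changes per call; one pointer is
     * pushed instead of six parameters. */
    h = 0;
    do {
        pb.wizHorz = h;
        total += MyWizzyFunction(&pb);
    } while (++h < 4);
    return total;
}
```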
TABLE LOOKUPS
I once met someone who told me that every computer science problem could be
reduced to a table lookup. I guess that if your table was unlimited in size this might be
true (but then the table initialization time might kill you). Nonetheless, there are
many cases where code can be sped up with a relatively small table. The idea is to
precompute some data and look it up rather than recalculate it each time through a
loop. For example, this code:
register Byte n, *xPtr;
register short i, *yPtr;
Byte x[1000];
short y[1000];

yPtr = y;
xPtr = x;
i = 1000;
do {
    n = *xPtr++;
    *yPtr++ = n*n + n/5 - 7;
} while (--i);
is much slower than this code:
/* 1 */
register Byte *tablePtr;
register short tableOffset;
short table[256];

/* first pre-compute all possible
 * 256 values and store in table
 */
yPtr = table;
i = 0;
do {
    *yPtr++ = i*i + i/5 - 7;
} while (++i < 256);

tablePtr = (Byte *) table;
yPtr = y;
xPtr = x;
i = 1000;
do {
    /* we do manual scaling for speed */
    tableOffset = *xPtr++;
    tableOffset *= sizeof(short);
    /* generates Move (Ax,Dx),(Ay)+ */
    *yPtr++ = *(short *)
        (tablePtr + tableOffset);
} while (--i);
This second version, which requires only a 256-element table, contains no
multiply or divide instructions in its main loop. The tableOffset *= sizeof(short)
statement compiles down to an Add instruction since sizeof(short) evaluates to 2. The
*yPtr++ = ... statement compiles down to Move (Ax,Dx),(Ay)+, which is as optimal as you
can do (and what you would have written if you had been writing assembly).
One thing that’s really important to know when using lookup tables is that your
table element size needs to be a power of 2 in order to have fast pointer calculations
(which can be done with a shift of the index). If you only need 5 bytes per table
element then it would be better to pad each element to 8 bytes so that you can shift the
index by 3 (times 8) rather than multiplying it by 5 when trying to get a pointer to a
given element.
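For instance (field names are made up), a record that needs 5 bytes could be padded like this:

```c
#include <assert.h>
#include <stdint.h>

/* A record that needs 5 bytes, padded to 8 (a power of 2) so that
 * indexing can scale by a shift of 3 instead of a multiply by 5. */
typedef struct {
    int32_t value;       /* 4 bytes (a long on the 680x0)    */
    uint8_t flag;        /* 1 byte                           */
    uint8_t pad[3];      /* padding: 5 bytes rounded up to 8 */
} Entry;

/* Byte offset of element 'index': a shift, not a multiply. */
static long EntryOffset(short index)
{
    return (long) index << 3;
}
```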
Also, depending on the amount of data involved, you may want to declare the table
as a static initialized array so that its values are computed at compile time instead
of at startup.
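For example, the first eight values of the table computed above could be baked in at compile time like this (a throwaway program can print the full 256-entry initializer):

```c
#include <assert.h>

/* A compile-time table: the values (i*i + i/5 - 7 for i = 0..7,
 * matching the pre-compute loop above) are baked into the binary,
 * so no run-time setup loop is needed. */
static const short smallTable[8] = {
    -7, -6, -3, 2, 9, 19, 30, 43
};
```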
USE SHIFTS WHEN YOU CAN
This one is obvious enough that most programmers assume that the compiler
always does it for them. If you want to divide a value by 8, you might think that this
would generate an efficient shift right by 3:
x /= 8;
It’s true that if x is unsigned then MPW and Think do the right thing but if x is
signed they generate a divide. The reason is that you can’t simply shift a negative
number to the right to divide by 8 (if the original value is -1 you’ll get -1 as the result, too,
because of sign extension). To solve this problem, you should add 7 to x (when it’s
negative) before shifting. Use this instead of the above for signed right-shifts by 3:
/* 2 */
if (x < 0)
    x += 7;
x >>= 3;
and use a shift left instead of a multiply when multiplying by a power of 2. Also, there
may be brain-dead compilers out there that your code will be ported to some day, so you
should use the shift operator even when working with unsigned values. It’s a good habit
to get into.
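Putting the /* 2 */ snippet into a function (assuming, as on the 680x0 and essentially all current compilers, that >> on a signed value is an arithmetic shift):

```c
#include <assert.h>

/* Signed divide by 8 as a shift: bias negative values by 7 first so
 * the arithmetic shift rounds toward zero, matching what x / 8 does. */
static long DivBy8(long x)
{
    if (x < 0)
        x += 7;
    return x >> 3;
}
```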
USE & INSTEAD OF % WHEN YOU CAN
When taking a value mod a power of 2, you should ‘and’ it with (the power - 1)
instead. Don’t do this:
x = y % 8;
do this (to save a division):
x = y & (8 - 1);
As before, this may yield incorrect results if y is signed, but if the goal is just
to get the last 3 bits, it works fine. And if you want the remainder of a negative number
divided by 8 (i.e. what % would return to you if you used it) you can still save the
divide; note the sign fix-up only applies when the masked bits are nonzero:
/* 3 */
x = y & (8 - 1);
if ((y < 0) && (x != 0))
    x -= 8;
DON’T USE MULTIPLY
As you know, multiply instructions are expensive on the 680x0 and you should
avoid them wherever possible. What you may not know, though, is the extent to which
you should avoid them. For instance, some would say that this code:
x *= 20;
is acceptable. However, in a tight loop it would be much better to use:
/* 4 */
temp = x;
temp += x;
temp += temp;
x <<= 4;
x += temp;
It’s not necessarily intuitive that five instructions are better than one but,
assuming temp and x are register variables, the times for the above are:
68000: 70 cycles for first one, 30 cycles for second
68030: 28 cycles for first one, 14 cycles for second
68040: 15 cycles for first one, 6 cycles for second
This type of C programming, which I call “writing assembly language with C
syntax,” requires a detailed knowledge of your compiler and its register allocation.
It also requires a little knowledge of assembly language; if you don’t have any, it
would be a good thing to start learning (use Think’s Disassemble command and
MPW’s DumpObj tool to see what the compiler is doing with your C code).
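Wrapped as a function so it can be checked against the multiply, the /* 4 */ sequence above is:

```c
#include <assert.h>

/* x * 20 as shifts and adds: 20x = 16x + 4x. */
static long TimesTwenty(long x)
{
    long temp;

    temp = x;      /* temp = x   */
    temp += x;     /* temp = 2x  */
    temp += temp;  /* temp = 4x  */
    x <<= 4;       /* x = 16x    */
    x += temp;     /* x = 20x    */
    return x;
}
```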
DON’T USE ‘FOR’ STATEMENTS
Many people resist this optimization but it falls into the category of convenient
syntax vs. efficient syntax. The basic point is that you can always do at least as well as
a ‘for’ loop by using a ‘while’ (for 0 or more iterations) or a ‘do-while’ loop (for 1
or more iterations), and in most cases you can do better by not using a ‘for’ loop. (In
fact, Wirth removed the ‘FOR’ keyword from his latest language, Oberon, because he
considered it unnecessary.)
Here’s an example. This code:
for (i = 0; i < NUM_LOOPS; i++) {
}
is better as:
/* 5 */
i = NUM_LOOPS;
do {
} while (--i);
because the first one generates:
Moveq #0,D7
Bra.S @2
@1
Addq #1,D7
@2
Cmpi #NUM_LOOPS,D7
Blt.S @1
and the second one generates:
Moveq #NUM_LOOPS,D7
@1
Subq #1,D7
Bne.S @1
Now, it’s true that I’m comparing apples and oranges a bit here because the first
loop counts up and the second loop counts down but the first loop is representative of
how I see a lot of inexperienced programmers write their ‘for’ loops. Even if they
were to make the optimization of counting down to zero, the do-while loop is still more
efficient because of the extra branch instruction at the top of the ‘for’ loop.
As an experiment, try writing your code without ‘for’ loops for a while. I think
you’ll find that it often becomes clearer and in many cases it will become more
efficient, too.
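As a small, testable sketch of the count-down pattern (the body, summing n down to 1, is just illustrative):

```c
#include <assert.h>

/* The count-down do-while pattern from listing 5, applied to real
 * work. Requires n >= 1, as a do-while always runs at least once. */
static long SumFirstN(short n)
{
    long sum = 0;

    do {
        sum += n;
    } while (--n);
    return sum;
}
```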
USE REASONABLE REGISTER VARIABLES
While register variables are certainly a good tool for making your code faster, if
you don’t use them right you might be hurting yourself.
When writing an application on the 680x0, you have 3 address registers
(pointers) and 5 data registers to play with. Do NOT declare more register variables
than that. And if something doesn’t really need to be in a register (because it’s only
read from once or twice, for instance), then don’t put it in a register. The time spent
saving, initializing, and restoring the register will cost you more than it saves.
The most important thing is to write your functions so that they have a
reasonable number of local variables (no more than 3 pointers and 5 non-pointers,
ideally). If you just can’t split the function up or use fewer variables then try to use
register variables with restricted scope (some subset of the function) so that you can
reuse them later in the function for other things.
Even if you don’t use register variables, big functions with lots of locals make it
extremely difficult for any compiler to allocate registers efficiently. This applies to
many different machines and compilers.
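The restricted-scope idea can be sketched like this (the data and the work are illustrative; the point is the block-scoped register declarations):

```c
#include <assert.h>

/* Each phase declares its own register variables inside a block, so
 * the compiler can reuse the same machine registers for both phases
 * instead of needing one register per variable. */
static long TwoPhases(const short *a, const short *b, short count)
{
    long total = 0;

    {   /* phase 1: p and i live only in this block */
        register const short *p = a;
        register short i = count;

        do {
            total += *p++;
        } while (--i);
    }
    {   /* phase 2: the same registers can hold q and i */
        register const short *q = b;
        register short i = count;

        do {
            total += *q++;
        } while (--i);
    }
    return total;
}

/* Tiny driver with fixed data. */
static long TwoPhasesDemo(void)
{
    static const short a[3] = { 1, 2, 3 };
    static const short b[3] = { 4, 5, 6 };

    return TwoPhases(a, b, 3);
}
```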