Efficient C
Volume Number: 9
Issue Number: 8
Column Tag: C Workshop
Efficient C Programming 
High-level optimizations
By Mike Scanlin, MacTech Magazine Regular Contributing Author
This article explains and gives examples of how to get better code generation for
common operations and constructs in C on the Macintosh. These higher level
optimizations will, on average, produce faster and smaller code in other languages (C++, Pascal) as well. Some of them will make your code harder to read and more difficult to port, and they may even hurt performance on non-680x0 CPUs.
However, for those cases where you’re optimizing your bottlenecks for the 680x0
CPUs, these tricks will help you.
There are several things you can do to your code independent of which language or
compiler you’re using in order to improve performance. Let’s start with those.
FILE INPUT/OUTPUT
There are three things to know in order to produce optimal file I/O: (1) always
read/write in sector-aligned amounts, (2) read/write in chunks as large as possible
and (3) be sure to disable Apple’s disk cache.
From the Mac file system’s point of view, files are made up of blocks. From a
disk driver point of view blocks are made up of sectors. A sector is usually 512 bytes.
On large disks a block can be 10K or larger (files are a minimum of 1 block in size).
You can get the exact block size by doing a GetVInfo and checking the
volumeParam.ioVAlBlkSiz field. Your buffers should be multiples of this amount when
reading and writing to that particular volume (and if not, then they should at least be
multiples of the sector size) and should begin on a block boundary (or, at a minimum,
a sector boundary) within the file. Sector-aligned reads and writes can be 2x faster
than unaligned ones.
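As a sketch, the usual alignment arithmetic for a power-of-2 sector or block size looks like this (512 is the typical sector size; substitute the volume’s ioVAlBlkSiz where you have it):

```c
#include <assert.h>

/* Round a byte count up to a multiple of the sector (or block) size.
 * Works for any power-of-2 size; 512 is the usual sector size. */
static long RoundUpToSector(long bytes, long sectorSize)
{
    return (bytes + sectorSize - 1) & ~(sectorSize - 1);
}
```

Request that many bytes per Read or Write call and your transfers stay aligned.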
Recently while implementing a virtual memory scheme I had to determine the
optimal VM page size for maximum disk throughput (measured in bytes/second). After
testing a variety of page sizes on a variety of CPUs and hard disks, I determined that
the optimal size was 64K. If you read and write less than 64K at a time you will not be
minimizing disk I/O time (and for very small reads and writes you will be paying a
significant throughput penalty). Here’s an experiment for the unbelievers: write a
little program that writes an 8MB file 512 bytes at a time and then writes another
8MB file 64K at a time. You should find that the 64K at a time case is 8x to 40x faster
than the 512 byte at a time case. Then try reading in that 8MB file 512 bytes at a time
and then 64K at a time. It should be about 35x to 40x faster for the 64K case (your
actual times will depend on your CPU and your hard drive).
Lastly, if you are using large aligned I/O buffers you should turn off Apple’s disk
cache for your reads and writes. IM Files pg. 2-95 says that you can do this by setting
bit 5 of ioPosMode before a Read or Write call. In those cases where the cache is
kicking in, you’ll be 13x faster by forcing it off. For a complete explanation of Apple’s
disk cache, see “Apple’s Lame Disk Cache” on pg. 75 of April 1993 MacTech.
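As a minimal sketch of the bit manipulation only (the mask name is mine, and the surrounding parameter-block setup and the actual PBRead/PBWrite calls are omitted):

```c
#include <assert.h>

/* Bit 5 of ioPosMode is the no-cache bit (mask 0x0020). OR-ing it
 * into ioPosMode before a Read or Write call tells the file system
 * to bypass Apple's disk cache for that call. */
#define kNoCacheMask 0x0020

static short AddNoCacheBit(short ioPosMode)
{
    return (short) (ioPosMode | kNoCacheMask);
}
```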
OTHER LANGUAGE-INDEPENDENT THINGS
In a previous MacTech article (Sept 1992) I droned on and on about aligning
your data structures and stack usage. That’s true in any language on the Mac (because
it’s a result of the 680x0 architecture). Do it.
One thing you can do to reduce the calling overhead of your functions is to use
fewer parameters. Sometimes you’ll find that a general purpose routine that takes
several parameters is being called in a loop where most of the parameters aren’t
changing between calls within the loop. In cases like this you should make a parameter
block and pass a pointer to the parameter block rather than passing all of the
parameters each time. Not only does this make aligned-stack calling easier to
implement and maintain but it really reduces the pushing and popping of invariant
stack data during your loop. For instance, you could change this prototype:
void MyWizzyFunction(short wizFactor, long numWizzes,
    Boolean doWiz, short fudgeFactor, short wizHorz,
    short wizVert);
to this:
void MyWizzyFunction(WizParamBlockPtr wizPBPtr);
with this parameter block:
typedef struct WizParamBlock {
    long    numWizzes;
    short   wizFactor;
    short   fudgeFactor;
    short   wizHorz;
    short   wizVert;
    Boolean doWiz;
} WizParamBlock, *WizParamBlockPtr;
You’ll save a lot of time from reduced stack operations on each call to
MyWizzyFunction.
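As a sketch of the calling pattern (the function body here is made up just to have something to compute), the invariant fields are filled in once and only the varying field changes inside the loop:

```c
#include <assert.h>

typedef unsigned char Boolean;   /* stand-in for the Mac type */

typedef struct WizParamBlock {
    long    numWizzes;
    short   wizFactor;
    short   fudgeFactor;
    short   wizHorz;
    short   wizVert;
    Boolean doWiz;
} WizParamBlock, *WizParamBlockPtr;

/* Hypothetical body: combines a few fields so the calling
 * pattern can be exercised. */
static long MyWizzyFunction(WizParamBlockPtr pb)
{
    return pb->numWizzes * pb->wizFactor + pb->wizHorz;
}

static long CallInLoop(void)
{
    WizParamBlock pb;
    long          total = 0;
    short         h;

    /* Fill in the invariant fields once, outside the loop. */
    pb.numWizzes   = 10;
    pb.wizFactor   = 2;
    pb.fudgeFactor = 0;
    pb.wizVert     = 0;
    pb.doWiz       = 1;

    /* Only the varying field changes per call; one pointer is
     * pushed instead of six parameters. */
    h = 0;
    do {
        pb.wizHorz = h;
        total += MyWizzyFunction(&pb);
    } while (++h < 4);
    return total;
}
```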
TABLE LOOKUPS
I once met someone who told me that every computer science problem could be
reduced to a table lookup. I guess that if your table was unlimited in size this might be
true (but then the table initialization time might kill you). Nonetheless, there are
many cases where code can be sped up with a relatively small table. The idea is to
precompute some data and look it up rather than recalculate it each time through a
loop. For example, this code:
register Byte n, *xPtr;
register short i, *yPtr;
Byte x[1000];
short y[1000];

yPtr = y;
xPtr = x;
i = 1000;
do {
    n = *xPtr++;
    *yPtr++ = n*n + n/5 - 7;
} while (--i);
is much slower than this code:
/* 1 */
register Byte *tablePtr;
register short tableOffset;
short table[256];

/* first pre-compute all possible
 * 256 values and store in table
 */
yPtr = table;
i = 0;
do {
    *yPtr++ = i*i + i/5 - 7;
} while (++i < 256);

tablePtr = (Byte *) table;
yPtr = y;
xPtr = x;
i = 1000;
do {
    /* we do manual scaling for speed */
    tableOffset = *xPtr++;
    tableOffset *= sizeof(short);
    /* generates Move (Ax,Dx),(Ay)+ */
    *yPtr++ = *(short *)
        (tablePtr + tableOffset);
} while (--i);
This second version, which requires only a 256-element table, contains no
multiply or divide instructions in its main loop. The tableOffset *= sizeof(short)
statement compiles down to an Add instruction since sizeof(short) evaluates to 2. The
*yPtr++ = ... statement compiles down to Move (Ax,Dx),(Ay)+, which is as optimal as you
can do (and what you would have written if you had been writing assembly).
One thing that’s really important to know when using lookup tables is that your
table element size needs to be a power of 2 in order to have fast pointer calculations
(which can be done with a shift of the index). If you only need 5 bytes per table
element then it would be better to pad each element to 8 bytes so that you can shift the
index by 3 (times 8) rather than multiplying it by 5 when trying to get a pointer to a
given element.
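For instance (field names are made up), a record that needs 5 bytes could be padded like this:

```c
#include <assert.h>
#include <stdint.h>

/* A record that needs 5 bytes, padded to 8 (a power of 2) so that
 * indexing can scale by a shift of 3 instead of a multiply by 5. */
typedef struct {
    int32_t value;       /* 4 bytes (a long on the 680x0)    */
    uint8_t flag;        /* 1 byte                           */
    uint8_t pad[3];      /* padding: 5 bytes rounded up to 8 */
} Entry;

/* Byte offset of element 'index': a shift, not a multiply. */
static long EntryOffset(short index)
{
    return (long) index << 3;
}
```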
Also, depending on the amount of data involved, you may want to declare the table
as a static initialized array so that its values are computed at compile time instead
of at startup.
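For example, the first eight values of the table computed above could be baked in at compile time like this (a throwaway program can print the full 256-entry initializer):

```c
#include <assert.h>

/* A compile-time table: the values (i*i + i/5 - 7 for i = 0..7,
 * matching the pre-compute loop above) are baked into the binary,
 * so no run-time setup loop is needed. */
static const short smallTable[8] = {
    -7, -6, -3, 2, 9, 19, 30, 43
};
```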
USE SHIFTS WHEN YOU CAN
This one is obvious enough that most programmers assume that the compiler
always does it for them. If you want to divide a value by 8, you might think that this
would generate an efficient shift right by 3:
x /= 8;
It’s true that if x is unsigned then MPW and Think do the right thing but if x is
signed they generate a divide. The reason is that you can’t simply shift a negative
number to the right to divide by 8 (if the original value is -1 you’ll get -1 as the result, too,
because of sign extension). To solve this problem, you should add 7 to x (when it’s
negative) before shifting. Use this instead of the above for signed right-shifts by 3:
/* 2 */
if (x < 0)
    x += 7;
x >>= 3;
and use a shift left instead of a multiply when multiplying by a power of 2. Also, there
may be brain-dead compilers out there that your code will be ported to some day, so you
should use the shift operator even when working with unsigned values. It’s a good habit
to get into.
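Putting the /* 2 */ snippet into a function (assuming, as on the 680x0 and essentially all current compilers, that >> on a signed value is an arithmetic shift):

```c
#include <assert.h>

/* Signed divide by 8 as a shift: bias negative values by 7 first so
 * the arithmetic shift rounds toward zero, matching what x / 8 does. */
static long DivBy8(long x)
{
    if (x < 0)
        x += 7;
    return x >> 3;
}
```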
USE & INSTEAD OF % WHEN YOU CAN
When taking a value mod a power of 2, you should ‘and’ it with (the power - 1)
instead. Don’t do this:
x = y % 8;
do this (to save a division):
x = y & (8 - 1);
As before, this may yield incorrect results if y is signed, but if the goal is just
to get the last 3 bits, it works fine. And if you want the remainder of a negative number
divided by 8 (i.e. what % would return to you if you used it) you can still save the
divide; note the sign fix-up only applies when the masked bits are nonzero:
/* 3 */
x = y & (8 - 1);
if ((y < 0) && (x != 0))
    x -= 8;
DON’T USE MULTIPLY
As you know, multiply instructions are expensive on the 680x0 and you should
avoid them wherever possible. What you may not know, though, is the extent to which
you should avoid them. For instance, some would say that this code:
x *= 20;
is acceptable. However, in a tight loop it would be much better to use:
/* 4 */
temp = x;
temp += x;
temp += temp;
x <<= 4;
x += temp;
It’s not necessarily intuitive that five instructions are better than one but,
assuming temp and x are register variables, the times for the above are:
68000: 70 cycles for first one, 30 cycles for second
68030: 28 cycles for first one, 14 cycles for second
68040: 15 cycles for first one, 6 cycles for second
This type of C programming, which I call “writing assembly language with C
syntax,” requires a detailed knowledge of your compiler and its register allocation.
It also requires a little knowledge of assembly language; if you don’t have any, it
would be a good thing to start learning (use Think’s Disassemble command and
MPW’s DumpObj tool to see what the compiler is doing with your C code).
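Wrapped as a function so it can be checked against the multiply, the /* 4 */ sequence above is:

```c
#include <assert.h>

/* x * 20 as shifts and adds: 20x = 16x + 4x. */
static long TimesTwenty(long x)
{
    long temp;

    temp = x;      /* temp = x   */
    temp += x;     /* temp = 2x  */
    temp += temp;  /* temp = 4x  */
    x <<= 4;       /* x = 16x    */
    x += temp;     /* x = 20x    */
    return x;
}
```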
DON’T USE ‘FOR’ STATEMENTS
Many people resist this optimization but it falls into the category of convenient
syntax vs. efficient syntax. The basic point is that you can always do at least as well as
a ‘for’ loop by using a ‘while’ (for 0 or more iterations) or a ‘do-while’ loop (for 1
or more iterations), and in most cases you can do better by not using a ‘for’ loop. (In
fact, Wirth removed the ‘FOR’ keyword from his latest language, Oberon, because he
considered it unnecessary.)
Here’s an example. This code:
for (i = 0; i < NUM_LOOPS; i++) {
}
is better as:
/* 5 */
i = NUM_LOOPS;
do {
} while (--i);
because the first one generates:
Moveq #0,D7
Bra.S @2
@1
Addq #1,D7
@2
Cmpi #NUM_LOOPS,D7
Blt.S @1
and the second one generates:
Moveq #NUM_LOOPS,D7
@1
Subq #1,D7
Bne.S @1
Now, it’s true that I’m comparing apples and oranges a bit here because the first
loop counts up and the second loop counts down but the first loop is representative of
how I see a lot of inexperienced programmers write their ‘for’ loops. Even if they
were to make the optimization of counting down to zero, the do-while loop is still more
efficient because of the extra branch instruction at the top of the ‘for’ loop.
As an experiment, try writing your code without ‘for’ loops for a while. I think
you’ll find that it often becomes clearer and in many cases it will become more
efficient, too.
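As a small, testable sketch of the count-down pattern (the body, summing n down to 1, is just illustrative):

```c
#include <assert.h>

/* The count-down do-while pattern from listing 5, applied to real
 * work. Requires n >= 1, as a do-while always runs at least once. */
static long SumFirstN(short n)
{
    long sum = 0;

    do {
        sum += n;
    } while (--n);
    return sum;
}
```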
USE REASONABLE REGISTER VARIABLES
While register variables are certainly a good tool for making your code faster, if
you don’t use them right you might be hurting yourself.
When writing an application on the 680x0, you have 3 address registers
(pointers) and 5 data registers to play with. Do NOT declare more register variables
than that. And if something doesn’t really need to be in a register (because it’s only
read from once or twice, for instance), then don’t put it in a register. The time spent
saving, initializing, and restoring the register will cost you more than it saves.
The most important thing is to write your functions so that they have a
reasonable number of local variables (no more than 3 pointers and 5 non-pointers,
ideally). If you just can’t split the function up or use fewer variables then try to use
register variables with restricted scope (some subset of the function) so that you can
reuse them later in the function for other things.
Even if you don’t use register variables, big functions with lots of locals make it
extremely difficult for any compiler to allocate registers efficiently. This applies to
many different machines and compilers.
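The restricted-scope idea can be sketched like this (the data and the work are illustrative; the point is the block-scoped register declarations):

```c
#include <assert.h>

/* Each phase declares its own register variables inside a block, so
 * the compiler can reuse the same machine registers for both phases
 * instead of needing one register per variable. */
static long TwoPhases(const short *a, const short *b, short count)
{
    long total = 0;

    {   /* phase 1: p and i live only in this block */
        register const short *p = a;
        register short i = count;

        do {
            total += *p++;
        } while (--i);
    }
    {   /* phase 2: the same registers can hold q and i */
        register const short *q = b;
        register short i = count;

        do {
            total += *q++;
        } while (--i);
    }
    return total;
}

/* Tiny driver with fixed data. */
static long TwoPhasesDemo(void)
{
    static const short a[3] = { 1, 2, 3 };
    static const short b[3] = { 4, 5, 6 };

    return TwoPhases(a, b, 3);
}
```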