PowerPC Series
Volume Number: 10
Issue Number: 1
Column Tag: PowerPC Series
PowerPC Code Generation 
What’s the difference between PowerPC and 68K machines?
By Peter A. Jacobson, Absoft Corp.
About the author
Peter is a principle of Absoft. Along with his partner Wood Lotz, he has been
developing scientific and engineering software since 1979 for a wide variety of
micro- and mini-computers.
This article will discuss some of this issues concerning code generation by high
level language compilers for the IBM PowerPC RISC microprocessor. It will compare
and contrast typical code generation strategies employed on CISC based architectures,
such as the Motorola M68000 family of microprocessors, against the approach that
might be taken with the PowerPC. The topics addressed will include addressing modes,
register sets, instruction sets, instruction pipelines, and superscalar considerations.
It should be understood that certain features of the PowerPC will be simplified and
various aspects of code generation will be trivialized in order to facilitate this
discussion.
It is difficult to arrive at a precise definition of what constitutes a RISC
microprocessor. It rarely means that an individual machine actually has fewer
instructions than its CISC counterpart. The PowerPC has over 230 instructions while
an MC68020 has barely 100. The technology progresses so quickly that definitions are
amended before they even come into common usage. In addition, features that were once
ascribed only to RISC technology have found their way into CISC architectures.
However, one of the most significant differences affecting code generation for RISC
microprocessors is that instructions are restricted to one machine word in length and
there are consequently only a limited number of instruction formats available which
can access memory. Typically, load and store are the only memory operations provided
and usually with extremely limited effective addressing modes. The PowerPC provides
just one fundamental addressing mode: register indirect with index. The index, which
may be either an immediate operand or a general purpose register, is added to a
general purpose register to form the effective address. The immediate operand is
encoded in the instruction and consists of a 16-bit value, sign extended to 32 bits. The
register can be suppressed by specifying general purpose register R0 so that an
address is formed from just the index. In this way, absolute addresses can be formed
from the immediate operand, but they are limited to just the lowest and the highest
32768 bytes of memory.
Obviously, this restraint seriously affects code generation strategies. With a
Motorola MC680x0, an efficient code generator could add 1 to a variable by
incrementing a memory location directly with an ADDQ instruction. While for the
PowerPC, it is necessary to first load the variable from memory, perform the
addition, and then store the result. It might appear that this model leads to very
inefficient code production, but there are many other factors that must be considered
in generating code. First, most program variables are usually accessed more than once
in a given procedure. Therefore, for either type of microprocessor, it is almost always
more efficient to have the variable already available in a register, rather than
repeatedly accessing main memory for it. Since RISC microprocessors provide such a
limited number of instructions which can access memory, code generators must be
capable of performing a very sophisticated analysis of program flow and allocate
registers accordingly. RISC microprocessors typically provide a large set of registers
that can be used to maintain copies of program variables. The Motorola MC68040,
widely used in the current generation of Macintoshes, is limited to 8 data registers, 8
address registers, 8 floating point registers, and a condition code register. The
PowerPC provides 32 general purpose registers, 32 floating point registers, a
condition register divided into eight 4-bit fields, and six user-level special purpose
registers. The general purpose registers can be used for both addresses and data. Just
as with the MC68040, some of the registers are reserved for special purposes (such
as stack pointer, data space pointer, etc.), but the PowerPC still provides a large
number of registers for program use.
Compilers also create their own variables, many of which can have short, but
very active life spans. Such compiler created variables are used for loop induction,
array indexing, maintaining the intermediate results of expression evaluation, and so
on. The register set represents the fastest memory available to the microprocessor and
efficient register allocation is critical to program performance. Various register
allocation schemes are used by code generators to insure that the most appropriate
variables are allocated to registers, either temporarily within a region of code, or
permanently, for the length of the procedure. Further, compilers do not necessarily
immediately write the result of an assignment statement to memory. This is known as a
delayed store and is employed to allow efficient scheduling of the instruction stream
(discussed below). Indeed, variables which are local to a procedure may never be
written to memory. However, regardless of how efficiently the compiler allocates
variables to registers, it must provide a mechanism by which a programmer can
indicate that a variable (and its associated memory location) is volatile. Processes in
many real time systems often communicate with each other through memory locations
and use memory mapped I/O to control or react to external devices. If, by setting a
variable to a specific value, the programmer intends to control a valve or launch a
missile, it would be inappropriate (to say the least) for the code generator not to
update the associated memory location immediately.
To the programmer accustomed to the Motorola M68000 family of
microprocessors and unfamiliar with RISC architectures, the instruction set of the
PowerPC may seem initially puzzling. Nevertheless, the PowerPC architecture has
much in common with other RISC microprocessors such as the SPARC, MC88110,
R4400, and obviously POWER. The first significant difference is that most of the
instructions take three operands, two sources and a destination, and several
instructions take more. Also, there is no stack pointer, no instructions for calling
subroutines, no obvious way to move the contents of a general purpose register to
another general purpose register, and many other apparent deficiencies. (However,
programmers familiar will older mainframes and mini-computers will find nothing
new here.) Consider the following instruction:
fnmsubs 6,12,13,18
This is the “Floating Point Negative Multiply-Subtract (Single-Precision)”
instruction. Since there is no ambiguity in the instruction set, registers are indicated
by number only - register numbers cannot be confused with immediate values. This
instruction says to multiply the operand in floating point register 12 by the operand
in floating point register 13 and then subtract the operand in floating point register
18 from this intermediate value. The result is rounded, then negated, and finally
placed in floating point register 6. The latency of this instruction is just 4 clocks - the
total time it takes to execute the instruction and for the result to be available in the
destination floating point register.
Since every instruction can have a destination operand different from its
source(s), compilers are not forced to either copy or reload values (variables or
expressions) that will be used multiple times in a block of code. This is important not
only in avoiding unnecessary memory accesses, but as will be seen later, provides
opportunities for exploiting the instruction pipeline and the superscalar nature of the
PowerPC.
The problem of there being no stack pointer in the PowerPC architecture has
been addressed by the various standards bodies concerned with the PowerPC. Through
the formalization and adoption of ABIs (Application Binary Interfaces) the needs of
high-level languages for a uniform stack pointer and stack frame have been addressed.
General purpose register 1 is normally designated as the stack pointer and various
locations in the frame have been reserved for house keeping purposes. A frame is often
created by saving the current stack pointer and then subtracting the required frame
amount from the stack pointer to create the new frame. In practice it is easier to
accomplish this than it appears since one form of the store instruction will write the
effective address of the destination into the register used to calculate the effective
address:
stwu rS,d(rA)
This is the “Store Word with Update” instruction which says to store the
contents of the source register rS at an effective address equal to the contents of
general purpose register rA plus the immediate index value d and then place that
effective address in rA. To create a frame, rS and rA would be 1 and d would be
negative. The instruction would cause r1 to be stored at the location resulting from the
calculation of the effective address r1-d and then update r1 to r1-d.
One of the most important locations in the frame is naturally where the return
address for a subroutine call is stored. As stated earlier, the PowerPC does not have a
subroutine call instruction - instead the branch instruction is used. A form of this
instruction places the address of the instruction that follows the branch into a special
purpose register called the link register. Any procedure which is not a leaf (i.e. a
procedure which calls other procedures) must save the link register before calling
another procedure. A subroutine return is accomplished by simply branching to the
contents of the link register.
The so-called fused multiply-add instructions are another feature of the
PowerPC instruction set that is important enough to be mentioned here. These
instructions can perform a multiplication and an addition in the same amount of time
as just a single multiplication or a single addition alone. In other words, twice as fast
as the combined operations. Fortunately, this type of operation occurs often enough in
mathematical software that the alert code generator will find ample opportunities to
exploit them. For example, expressions of the form:
a1 = a0 + b x c
appear in matrix operations and in polynomial expansions.
The PowerPC implements a true superscalar architecture. A superscalar
machine is one which can issue multiple instructions to different execution units
during each clock cycle. The PowerPC incorporates three different execution units that
can operate independently and in parallel. They are the integer unit which affects the
general purpose registers, the floating point unit which affects the floating point
registers, and the branch unit which affects certain of the special purpose registers.
Therefore, an integer shift, a floating point addition, and a branch instruction could all
be issued during the same clock cycle. It is important to understand that not all of the
PowerPC instructions can execute in a single clock cycle and it would be extremely
difficult to schedule all three execution units for simultaneous execution on every
cycle, but with careful code generation and attention paid to data dependencies, an
exceptionally efficient throughput can be achieved.
It is not necessary for an instruction to completely finish in an individual
execution unit before another instruction can be issued. The execution of an instruction
consists of multiple stages that can be viewed (very roughly for the PowerPC is far
more complicated) as fetch, decode, execute, and writeback. Each instruction is fetched
from an instruction queue, decoded, executed, and the result is then written to the
appropriate register file. These stages are called the pipeline and it is possible and
certainly desirable for multiple instructions to be in the pipeline at once - each at a
different stage. The basic limitation which would cause an instruction to stall is data
dependency, which means that the execution of the instruction is dependant on the
result of the preceding instruction. An instruction can also be stalled if it is waiting
for an instruction with a latency greater than once clock to finish executing. That is, an
instruction takes more cycles than there are stages in the pipeline for that execution
unit. Instruction latency is determined by how complicated an instruction is (division
takes longer than addition) and by memory access considerations. An instruction may
stall while waiting for an operand to be delivered from memory. The issues of cache
arbitration, both for instructions and data, are beyond the scope of this article.
A code generator which is aware of these two features, multiple execution units
and their pipelines, attempts to schedule the instruction stream to make the most
efficient use of the resources. Scheduling consists largely of the code generator
rearranging or moving instructions to eliminate data dependencies and to keep the
individual pipelines busy. This can cause expressions to executed out of order, array
element address calculations to take place far from the memory references, and any
number of other reorderings of the instruction stream to eliminate data dependencies.
Obviously, register allocation seriously affects this scheduling process and is usually
put off as long as possible to prevent any artificial or code-generator created
dependencies.
“It projects a military coup!”