December 93 - MAKING THE LEAP TO POWERPC
DAVE RADCLIFFE
Apple will soon be introducing the first Macintosh CPU architecture not based on a
68000-family microprocessor. The entirely new architecture is built around a new
RISC CPU -- the PowerPC microprocessor jointly designed by IBM, Motorola, and
Apple. Truly taking advantage of PowerPC technology will require an ongoing effort by
both Apple and developers. Apple is making the first leap to this new platform; now it's
up to developers to make the next leap and bring the performance made possible by
PowerPC technology to their applications.
In 1984, Apple Computer offered a startling vision of the future of personal
computing by introducing the Macintosh, which radically changed the desktop. Now,
nearly ten years later, the computing world embraces graphical interfaces. Ten years
is a lifetime in computing terms; at that age, many computing architectures are
considered ancient. The Macintosh enters its second decade by looking to the future
while remembering its past -- making the transition from the sturdy Motorola
68000 family to the sleek new PowerPC processor-based family without forsaking
developers and users and their investment in the 680x0 architecture.
The PowerPC microprocessor is the most significant change to date in the Macintosh
product line. This article introduces the new PowerPC architecture and discusses the
ramifications for existing applications, as well as opportunities for new or revised
applications to take full advantage of the power of the new chip. It contrasts the new
architecture with the old and explains how this new architecture both acknowledges
the past and prepares for the future.
COMPARING CISC AND RISC
Much has been written about the differences between a CISC (complex instruction set
computer) architecture, used in Motorola's MC680x0 processors, and a RISC (reduced
instruction set computer) architecture, used in the PowerPC microprocessor. The
relative merits of the two architectures have also been widely debated. A detailed
discussion of CISC and RISC is beyond the scope of this article, but some understanding
of RISC principles is useful for understanding PowerPC architecture.
Two logical considerations motivated CISC development. The first was a desire to
simplify assembly-language programming by enriching the functionality of the
instruction set. CISC architectures did this by providing a greater variety of
instructions, as well as a wide array of addressing modes, thereby reducing the
number of steps required to perform a particular operation. Second, as writing
compilers became easier, there was a desire to provide instructions more closely
related to operations performed by high-level languages. CISC architectures were
marvelously successful at satisfying this goal also. In the early 1980s, hardware
designers began to run into the limitations inherent in CISC architectures,
particularly in their ability to streamline the flow of instructions. At the same time,
the software world was deemphasizing assembly-language programming in favor of
high-level languages with sophisticated, optimizing compilers. This allowed hardware
designers to simplify their architecture and shift much of the performance burden to
compiler writers.
The classic equation for execution time is

    ET = N × CPI × CT

where ET is the total execution time, N is the number of instructions executed, CPI is
the number of cycles per instruction, and CT is the cycle time. Both CISC and RISC
architectures benefit from reduced cycle time. Faster clock rates translate directly to
smaller cycle times, and hence shorter execution times. Where CISC and RISC
architectures differ is in their approach to N and CPI. CISC tries to shorten execution
times by minimizing N, while RISC tries to minimize CPI.
PIPELINING
The four typical stages in executing an instruction are fetch, decode, execute, and
write. In a simplistic architecture, these stages all happen in sequence, and the next
instruction can't start until the previous instruction has finished, as shown in Figure
1. Designers realized that this need not be the case and that each of these stages can
overlap. Once an instruction is fetched and passed to the decode stage, the next
instruction can be fetched without waiting for the first instruction to complete. This
technique, known as pipelining, is shown in Figure 2.
The example in Figure 2 executes the same two instructions, but in only nine cycles,
compared to 12 cycles in the nonpipelined case. There's a curious thing about this
example, though: the second instruction takes eight cycles to complete when pipelined,
but only five when it's not. This is because the various stages take different amounts of
time to complete. The overall result is better, but unnecessary delays can occur in
instruction execution.
Figure 1 Nonpipelined Stages of Execution
Figure 2 Pipelined Stages of Execution
A variable number of cycles per stage is characteristic of CISC architectures.
Complex instructions may occupy multiple words, requiring multiple cycles to fetch.
Multiple operands complicate the process of decoding. More complicated instructions
take longer to execute than simpler instructions. In Figure 2, the execute stage of the
second instruction is delayed two cycles while waiting for the first instruction to
execute. This is known as a pipeline stall. Similarly, the write stage sits idle for one
cycle between the first and second instructions while waiting for the execute stage of
the second instruction to complete. This is known as a pipeline bubble. Both stalls and
bubbles reduce the efficiency of the pipeline and increase the overall number of cycles
per instruction.
INCREASING PIPELINE EFFICIENCY
RISC architectures work very hard to eliminate inefficiencies in the instruction
pipeline and keep the pipeline jammed full. RISC architectures share most or all of the
following common features:
• Instructions are a uniform length. Variable-length instructions in CISC
architectures mean that time must be spent just figuring out how long the
instruction is and how many operands it uses. RISC architectures don't have
that problem.
• Simplified instructions, instruction formats, and addressing modes allow
for fast instruction decoding and execution.
• Relatively large numbers of registers and large amounts of fast-cache
memory reduce cycles spent for access to slower, main memory and allow
frequently used variables to be kept loaded.
• Load/store architecture is used for access to memory. The only
memory-to-register and register-to-memory operations are load and store
instructions. All other operations are register only. Register-to-memory and
memory-to-memory operations in CISC architectures require multiple cycles
to complete.
• Instructions are simple. In an ideal RISC machine, each stage requires one
cycle to complete.
• For improved performance, instructions can be implemented directly in
hardware instead of being microprogrammed as in CISC processors.
Figure 3 shows an example of executing instructions on a nonpipelined RISC machine.
When instructions are not pipelined, they complete serially, with two instructions
completing in eight cycles. The optimal case for pipelining instructions is shown in
Figure 4. Now you have the two instructions executing in just five cycles. If the
pipeline is kept full like this, the number of cycles per instruction drops to just one.
This is the goal of most RISC architectures.
Figure 3 RISC Nonpipelined Stages of Execution
Figure 4 RISC Pipelined Stages of Execution
One cycle per instruction is the ideal case for this example, but in reality, stalls and
bubbles occur, even in the best architectures. This is where the compiler comes into
play. The compiler has detailed knowledge of how the program should work. It need not
perform operations in the order specified in the source code; it need only guarantee
that the right result is obtained. If you build into the compiler some knowledge of how
to make best use of the CPU, the compiler can make a huge difference in program
performance.
Consider the following two C instructions:
b = *a + 5;
d = *c + 10;
The variables a, b, c, and d are all long or pointer-to-long variables. The compiler
might generate the following assembly instructions on the PowerPC microprocessor:
lwz r5,0(r3) ; Load value pointed to by r3 into r5
addi r5,r5,0x0005 ; Add 5 to value in r5
lwz r6,0(r4) ; Load value pointed to by r4 into r6
addi r6,r6,0x000a ; Add 10 to value in r6
The lwz instruction (Load Word and Zero) loads a register from a source value. On a
PowerPC processor, words are 32-bit values; 16-bit values are half words.
The addi instruction (Add Immediate) adds the immediate value and stores the result.
Figure 5 shows what happens when these instructions execute. Both addi instructions
stall in the decode stage because they can't enter the execute stage until the register is
available from the lwz instruction.
The compiler can prevent the stalls. Instead of following the flow of the original source
code, you can rearrange the instructions as follows:
lwz r5,0(r3) ; Load value pointed to by r3 into r5
lwz r6,0(r4) ; Load value pointed to by r4 into r6
addi r5,r5,0x0005 ; Add 5 to value in r5
addi r6,r6,0x000a ; Add 10 to value in r6
Now look at what happens to the instruction pipeline (Figure 6): there are no delays.
By moving the add instructions to later in the instruction stream, you allow the load
instructions they depend on to complete, so the add instructions can execute
immediately.
Figure 5 Stalled Pipelined Execution
Figure 6 No-Delay Pipelined Execution
BRANCHING
All pipelined architectures face the problem of branches. Any time a conditional
branch is encountered, the processor faces a dilemma because now two instruction
streams are possible. It can't pipeline both possible paths. It can guess which path to
take, but if it guesses wrong, the pipeline is disrupted.
One common approach to this problem is a technique called delayed branching. In
delayed branching, the processor always executes the instruction immediately following
the branch instruction. While starting this instruction, the CPU can be figuring out
the destination of the branch instruction and so can keep the pipeline flowing. Of
course, it's important that the instruction after the branch not affect the branch. It's
up to the compiler to find an instruction unrelated to the branch instruction to fill this
delay slot. If it can't fill the delay slot, the compiler can always put in a no-op
instruction, but this is inefficient. Some architectures allow the instruction in the
delay slot to be ignored if the branch is taken. This avoids the need to fill the delay slot
with a no-op instruction, but undermines the purpose of delayed branching. PowerPC
architecture takes a unique approach to the branching problem, as discussed later in
the section "Branch Processor."
SUPERSCALAR DESIGN
Another technique RISC designers use to increase performance is superscalar or
multi-issue design. The simpler design of RISC architectures makes it possible to
build in multiple processing units; this is superscalar design. In the same way that the
compiler can juggle instructions to avoid resource constraints, the CPU can now