December 93 - MAKING THE LEAP TO POWERPC
DAVE RADCLIFFE
Apple will soon be introducing the first Macintosh CPU architecture not based on a
68000-family microprocessor. The entirely new architecture is built around a new
RISC CPU -- the PowerPC microprocessor jointly designed by IBM, Motorola, and
Apple. Truly taking advantage of PowerPC technology will require an ongoing effort by
both Apple and developers. Apple is making the first leap to this new platform; now it's
up to developers to make the next leap and bring the performance made possible by
PowerPC technology to their applications.
In 1984, Apple Computer offered a startling vision of the future of personal
computing by introducing the Macintosh, which radically changed the desktop. Now,
nearly ten years later, the computing world embraces graphical interfaces. Ten years
is a lifetime in computing terms; at that age, many computing architectures are
considered ancient. The Macintosh enters its second decade by looking to the future
while remembering its past -- making the transition from the sturdy Motorola
68000 family to the sleek new PowerPC processor-based family without forsaking
developers and users and their investment in the 680x0 architecture.
The PowerPC microprocessor is the most significant change to date in the Macintosh
product line. This article introduces the new PowerPC architecture and discusses the
ramifications for existing applications, as well as opportunities for new or revised
applications to take full advantage of the power of the new chip. It contrasts the new
architecture with the old and explains how this new architecture both acknowledges
the past and prepares for the future.
COMPARING CISC AND RISC
Much has been written about the differences between a CISC (complex instruction set
computer) architecture, used in Motorola's MC680x0 processors, and a RISC (reduced
instruction set computer) architecture, used in the PowerPC microprocessor. The
relative merits of the two architectures have also been widely debated. A detailed
discussion of CISC and RISC is beyond the scope of this article, but some understanding
of RISC principles is useful for understanding PowerPC architecture.
Two logical considerations motivated CISC development. The first was a desire to
simplify assembly-language programming by enriching the functionality of the
instruction set. CISC architectures did this by providing a greater variety of
instructions, as well as a wide array of addressing modes, thereby reducing the
number of steps required to perform a particular operation. Second, as writing
compilers became easier, there was a desire to provide instructions more closely
related to operations performed by high-level languages. CISC architectures were
marvelously successful at satisfying this goal also. In the early 1980s, hardware
designers began to run into the limitations inherent in CISC architectures,
particularly in their ability to streamline the flow of instructions. At the same time,
the software world was deemphasizing assembly-language programming in favor of
high-level languages with sophisticated, optimizing compilers. This allowed hardware
designers to simplify their architecture and shift much of the performance burden to
compiler writers.
The classic equation for execution time is

    ET = N × CPI × CT

where ET is the total execution time, N is the number of instructions executed, CPI is
the number of cycles per instruction, and CT is the cycle time. Both CISC and RISC
architectures benefit from reduced cycle time. Faster clock rates translate directly to
smaller cycle times, and hence shorter execution times. Where CISC and RISC
architectures differ is in their approach to N and CPI. CISC tries to shorten execution
times by minimizing N, while RISC tries to minimize CPI.
PIPELINING
The four typical stages in executing an instruction are fetch, decode, execute, and
write. In a simplistic architecture, these stages all happen in sequence, and the next
instruction can't start until the previous instruction has finished, as shown in Figure
1. Designers realized that this need not be the case and that each of these stages can
overlap. Once an instruction is fetched and passed to the decode stage, the next
instruction can be fetched without waiting for the first instruction to complete. This
technique, known as pipelining, is shown in Figure 2.
The example in Figure 2 executes the same two instructions, but in only nine cycles,
compared to 12 cycles in the nonpipelined case. There's a curious thing about this
example, though: the second instruction takes eight cycles to complete when pipelined,
but only five when it's not. This is because the various stages take different amounts of
time to complete. The overall result is better, but unnecessary delays can occur in
instruction execution.
Figure 1 Nonpipelined Stages of Execution
Figure 2 Pipelined Stages of Execution
A variable number of cycles per stage is characteristic of CISC architectures.
Complex instructions may occupy multiple words, requiring multiple cycles to fetch.
Multiple operands complicate the process of decoding. More complicated instructions
take longer to execute than simpler instructions. In Figure 2, the execute stage of the
second instruction is delayed two cycles while waiting for the first instruction to
execute. This is known as a pipeline stall. Similarly, the write stage sits idle for one
cycle between the first and second instructions while waiting for the execute stage of
the second instruction to complete. This is known as a pipeline bubble. Both stalls and
bubbles reduce the efficiency of the pipeline and increase the overall number of cycles
per instruction.
INCREASING PIPELINE EFFICIENCY
RISC architectures work very hard to eliminate inefficiencies in the instruction
pipeline and keep the pipeline jammed full. RISC architectures share most or all of the
following common features:
• Instructions are a uniform length. Variable-length instructions in CISC
architectures mean that time must be spent just figuring out how long the
instruction is and how many operands it uses. RISC architectures don't have
that problem.
• Simplified instructions, instruction formats, and addressing modes allow
for fast instruction decoding and execution.
• Relatively large numbers of registers and large amounts of fast-cache
memory reduce cycles spent for access to slower, main memory and allow
frequently used variables to be kept loaded.
• Load/store architecture is used for access to memory. The only
memory-to-register and register-to-memory operations are load and store
instructions. All other operations are register only. Register-to-memory and
memory-to-memory operations in CISC architectures require multiple cycles
to complete.
• Instructions are simple. In an ideal RISC machine, each stage requires one
cycle to complete.
• For improved performance, instructions can be implemented directly in
hardware instead of being microprogrammed as in CISC processors.
Figure 3 shows an example of executing instructions on a nonpipelined RISC machine.
When instructions are not pipelined, they complete serially, with two instructions
completing in eight cycles. The optimal case for pipelining instructions is shown in
Figure 4. Now you have the two instructions executing in just five cycles. If the
pipeline is kept full like this, the number of cycles per instruction drops to just one.
This is the goal of most RISC architectures.
Figure 3 RISC Nonpipelined Stages of Execution
Figure 4 RISC Pipelined Stages of Execution
One cycle per instruction is the ideal case for this example, but in reality, stalls and
bubbles occur, even in the best architectures. This is where the compiler comes into
play. The compiler has detailed knowledge of how the program should work. It need not
perform operations in the order specified in the source code; it need only guarantee
that the right result is obtained. If you build into the compiler some knowledge of how
to make best use of the CPU, the compiler can make a huge difference in program
performance.
Consider the following two C instructions:
b = *a + 5;
d = *c + 10;
The variables a, b, c, and d are all long or pointer-to-long variables. The compiler
might generate the following assembly instructions on the PowerPC microprocessor:
lwz r5,0(r3) ; Load value pointed to by r3 into r5
addi r5,r5,0x0005 ; Add 5 to value in r5
lwz r6,0(r4) ; Load value pointed to by r4 into r6
addi r6,r6,0x000a ; Add 10 to value in r6
The lwz instruction (Load Word and Zero) loads a register from a source value. On a
PowerPC processor, words are 32-bit values; 16-bit values are half words.
The addi instruction (Add Immediate) adds the immediate value and stores the result.
Figure 5 shows what happens when these instructions execute. Both addi instructions
stall in the decode stage because they can't enter the execute stage until the register is
available from the lwz instruction.
The compiler can prevent the stalls. Instead of following the flow of the original source
code, you can rearrange the instructions as follows:
lwz r5,0(r3) ; Load value pointed to by r3 into r5
lwz r6,0(r4) ; Load value pointed to by r4 into r6
addi r5,r5,0x0005 ; Add 5 to value in r5
addi r6,r6,0x000a ; Add 10 to value in r6
Now look at what happens to the instruction pipeline (Figure 6): there are no delays.
By moving the add instructions to later in the instruction stream, you allow the load
instructions they depend on to complete, so the add instructions can execute
immediately.
Figure 5 Stalled Pipelined Execution
Figure 6 No-Delay Pipelined Execution
BRANCHING
All pipelined architectures face the problem of branches. Any time a conditional
branch is encountered, the processor faces a dilemma because now two instruction
streams are possible. It can't pipeline both possible paths. It can guess which path to
take, but if it guesses wrong, the pipeline is disrupted.
One common approach to this problem is a technique called delayed branching. In
delayed branching, the processor always executes the instruction immediately following
the branch instruction. While starting this instruction, the CPU can be figuring out
the destination of the branch instruction and so can keep the pipeline flowing. Of
course, it's important that the instruction after the branch not affect the branch. It's
up to the compiler to find an instruction unrelated to the branch instruction to fill this
delay slot. If it can't fill the delay slot, the compiler can always put in a no-op
instruction, but this is inefficient. Some architectures allow the instruction in the
delay slot to be ignored if the branch is taken. This avoids the need to fill the delay slot
with a no-op instruction, but undermines the purpose of delayed branching. PowerPC
architecture takes a unique approach to the branching problem, as discussed later in
the section "Branch Processor."
SUPERSCALAR DESIGN
Another technique RISC designers use to increase performance is superscalar or
multi-issue design. The simpler design of RISC architectures makes it possible to
build in multiple processing units; this is superscalar design. In the same way that the
compiler can juggle instructions to avoid resource constraints, the CPU can now