March 95 - Balance of Power: Introducing PowerPC Assembly Language
Dave Evans
So far I've avoided the subject of PowerPC(TM) assembly language in this column, for
fear of being struck down by the portability gods. But I also realize that a column on
PowerPC development without a discussion of this subject would be too pious. Although
today's compiler technology makes assembly language generally unnecessary, you
might find it useful for critical subroutines or program bottlenecks. In this column
I'll try to give you enough information to satisfy that occasional need.
If the thought of using assembly language still troubles you, please consider this as
useful information for debugging. Eventually you'll need to read PowerPC assembly for
tracing through code that was optimized, or when symbolic debugging just isn't
practical. Also in this column, I'll cover the runtime basics that will help you
recognize stack frames and routine calls during debugging.
USING POWERPC ASSEMBLY LANGUAGE
Assembly language on the PowerPC processor should be used only for the most
performance-critical code -- that is, when that last 5% performance improvement is
worth the extra effort. This code typically consists of tight loops or routines that are
very frequently used.
After you've carefully profiled your code and found a bottleneck routine in which your
application spends most of its time, then what do you do? First you need an assembler;
I recommend Apple's PPCAsm (part of MPW Pro or E.T.O., both available from Apple
Developer Catalog).
Next, you'll need to understand the instruction set and syntax. This column will give
you a basic summary, but for a thorough reference you'll need the PowerPC 601 RISC
Microprocessor User's Manual; to order one, call 1-800-POWERPC
(1-800-769-3772).
Finally, you need to know the basic PowerPC runtime details -- for example, that
parameters are passed in general registers R3 through R10, that the stack frame is
set up by the callee, and so on.
Once you have these tools and information, you can easily write a subroutine in
assembly language that's callable from any high-level language. Then you'll need to
review your code with the persistence of Hercules, fixing pipeline stalls and otherwise
improving your performance.
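Before rewriting a routine in assembly, it helps to keep a high-level version as a reference to test against. The following is a minimal sketch of such a routine in C. The name SumWords and its signature are hypothetical, made up for illustration. If it were hand-coded, the first parameter would arrive in R3 and the second in R4, and the result would go back in R3 (the return register is an assumption from the standard PowerPC runtime conventions; the text above states only the parameter registers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bottleneck routine: sums `count` 32-bit words.
 * In an assembly version, `words` would arrive in R3 and `count` in R4
 * (parameters are passed in R3 through R10). */
uint32_t SumWords(const uint32_t *words, size_t count)
{
    assert(words != NULL || count == 0);

    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += words[i];    /* wraps modulo 2^32, like a 32-bit add */
    }
    return sum;
}
```

A caller written in C would not change when the body is later replaced by assembly, as long as the replacement honors the same register conventions.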
THE INSTRUCTION SET AT A GLANCE
Many people think RISC processors have fewer instructions than CISC processors.
What's truer is that each RISC instruction has reduced complexity, especially in
memory addressing, but there are often many more instructions than in a CISC
instruction set. You'll be amazed at the number and variation of the instructions in the
PowerPC instruction set. The basic categories are similar to 680x0 assembly
language:
• integer arithmetic and logical instructions
• instructions to load and store data
• compare and branch instructions
• floating-point instructions
• processor state instructions
We'll go over the first three categories here; you can read more about the last two in
the PowerPC user's manual. Once you're familiar with the PowerPC mnemonics, you'll
notice the similarity with any other instruction set. But first let's look at some key
differences from 680x0 assembly: register usage, memory addressing, and branching.
KEY DIFFERENCES
Most PowerPC instructions take three registers as opposed to two, and in the reverse
order compared to 680x0 instructions. For example, the following instruction adds
the contents of registers R4 and R5 and puts the result in register R6:
add r6,r4,r5 ; r6 = r4 + r5
Note that the result is placed in the first register listed; registers R4 and R5 aren't
affected. Most instructions operate on the last two registers and place the result in the
first register listed.
Unlike the 680x0 processors, the PowerPC processor doesn't allow many instructions
to deal directly with memory. Most instructions take only registers as arguments. The
branch, load, and store instructions are the only ones with ways of effectively
addressing memory.
• The branch instructions use three addressing modes: immediate, link
register indirect, and count register indirect. The first includes relative and
absolute addresses, while the other two let you load the link or count register
and use it as a target address. (The link and count registers are
special-purpose registers used just for branching.) Using the link register is
also how you return from a subroutine call, as I'll demonstrate in a moment.
• Load and store instructions have three addressing modes: register
indirect, which uses a register as the effective address; register indirect with
index, which uses the addition of two registers as the effective address; and
register indirect with immediate index, which adds a constant offset to a
register for the effective address. I'll show examples of these later.
The more complicated 680x0 addressing modes do not have equivalents in PowerPC
assembly language.
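The load and store addressing modes described above can be sketched as effective-address calculations in C. This is only a model, not how any real tool computes addresses: the gpr array and the function names are hypothetical. Two details come from the PowerPC architecture rather than from this column: the displacement is a signed 16-bit value, and a base register field of 0 stands for the constant 0 instead of the contents of R0.

```c
#include <assert.h>
#include <stdint.h>

/* Value used as the base of an address: the register's contents, except
 * that a base field of 0 means the constant 0 rather than R0. */
static uint32_t base_value(const uint32_t gpr[32], unsigned rA)
{
    assert(rA < 32);
    return rA == 0 ? 0 : gpr[rA];
}

/* Register indirect with immediate index, e.g. lwz rD,disp(rA).
 * Plain register indirect is the same thing with disp == 0. */
uint32_t ea_immediate_index(const uint32_t gpr[32], unsigned rA, int16_t disp)
{
    return base_value(gpr, rA) + (uint32_t)(int32_t)disp;
}

/* Register indirect with index, e.g. lwzx rD,rA,rB. */
uint32_t ea_indexed(const uint32_t gpr[32], unsigned rA, unsigned rB)
{
    assert(rB < 32);
    return base_value(gpr, rA) + gpr[rB];
}
```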
On 680x0 processors, there are branch instructions and separate jump (jmp), jump
to subroutine (jsr), and return from subroutine (rts) instructions. But in PowerPC
assembly there are only branches. All branches can be conditional or unconditional;
they all have the same addressing modes, and they can choose to store the next
instruction's address in the link register. This last point is how subroutine calls are
made and then returned from. A call to a subroutine uses a branch with link (bl)
instruction, which loads the link register with the address of the next instruction and
then jumps to the effective address. To return from the subroutine, you use the branch
to link register (blr) instruction to jump to the previous code path. For example:
bl BB ; branch to "BB"
AA: cmpi cr5,r4,0 ; is r4 zero?
...
BB: addi r4,r3,-24 ; r4 = r3 - 24
blr ; return to "AA"
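The call and return sequence above can be modeled in a few lines of C. This is a simplified sketch under stated assumptions: the BranchState structure and function names are hypothetical, and only the program counter and link register are tracked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;    /* address of the instruction being executed */
    uint32_t lr;    /* link register */
} BranchState;

/* bl target: save the address of the next instruction, then jump. */
void branch_and_link(BranchState *s, uint32_t target)
{
    assert(s != NULL);
    s->lr = s->pc + 4;      /* every PowerPC instruction is 4 bytes long */
    s->pc = target;
}

/* blr: jump to whatever address the link register holds. */
void branch_to_link_register(BranchState *s)
{
    assert(s != NULL);
    s->pc = s->lr;
}
```

The model also shows why a routine that itself calls another routine has to save the link register first: a second bl overwrites the return address left by the first.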
Since conditional branches can also use the link or count register, you can have
conditional return statements like this:
bgtlr cr5 ; return if cr5 has
; greater than bit set
The instructions blr and bgtlr are simplified mnemonics for the less
attractive bclr 20,0 and bclr 12,4*n+1 (for field CRn) instructions. The PowerPC
user's manual lists these as easier-to-read alternatives to entering the
specific bit fields of the bclr instruction, and PPCAsm supports these
mnemonics. But when debugging you may see the less attractive versions in
disassemblies.
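When you meet the raw bclr form in a disassembly, the second operand is a bit index into the 32-bit condition register: four bits per field, in the order less than, greater than, equal, summary overflow. A small C sketch of that arithmetic follows; the enum and function names are hypothetical.

```c
#include <assert.h>

/* Position of each bit inside a 4-bit condition register field. */
enum { CR_LT = 0, CR_GT = 1, CR_EQ = 2, CR_SO = 3 };

/* Bit index selecting one bit of condition register field crf (0..7). */
unsigned cr_bit_index(unsigned crf, unsigned bit)
{
    assert(crf < 8 && bit < 4);
    return 4 * crf + bit;
}
```

For example, cr_bit_index(5, CR_GT) is 21, so bgtlr cr5 would show up in a disassembly as bclr 12,21.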
ARITHMETIC AND LOGICAL INSTRUCTIONS
You've already seen the add and addi instructions, but let's go over one key variation
before looking at other integer arithmetic and logical instructions. Notice the period
character "." in the following instruction:
add. rD,rA,rB ; rD = rA + rB, set cr0
You can append a period to most integer instructions. This character causes bits in the
CR0 condition register field to be set based on how the result compares to 0; you can
later use CR0 in a conditional branch. In 680x0 assembly language, this is implied in
most moves to a data register; however, PowerPC assembly instructions that move
data to a register must explicitly use the period.
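To make the effect of the period concrete, here is a rough C model of add. and the CR0 bits it sets. The names are hypothetical, and the so_in parameter stands in for the summary overflow bit of the XER register, which the architecture copies into CR0 (a detail not covered in this column).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CR0 as a 4-bit value: LT, GT, EQ, SO from most to least significant. */
enum { CR0_LT = 8, CR0_GT = 4, CR0_EQ = 2, CR0_SO = 1 };

/* Model of "add. rD,rA,rB": returns rD and writes the CR0 bits to *cr0. */
uint32_t add_record(uint32_t rA, uint32_t rB, unsigned so_in, unsigned *cr0)
{
    assert(cr0 != NULL);

    uint32_t rD = rA + rB;
    unsigned field;

    /* The result is compared to 0 as a signed 32-bit number. */
    if (rD == 0) {
        field = CR0_EQ;
    } else if (rD & 0x80000000u) {
        field = CR0_LT;
    } else {
        field = CR0_GT;
    }
    if (so_in) {
        field |= CR0_SO;
    }
    *cr0 = field;
    return rD;
}
```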
Other basic integer instructions include the following:
subf rD,rA,rB ; subtract from
; rD = rB - rA
subfic rD,rA,val ; subtract from immediate carrying
; rD = val - rA
neg rD,rA ; negate
; rD = -rA
mullw rD,rA,rB ; multiply low word
; rD = [low 32 bits] rA*rB
mulhw rD,rA,rB ; multiply high word
; rD = [high 32 bits] rA*rB
divw rD,rA,rB ; divide word
; rD = rA / rB
divwu rD,rA,rB ; divide unsigned word
; rD = rA / rB [unsigned]
and rD,rA,rB ; logical AND
; rD = rA AND rB
or rD,rA,rB ; logical OR
; rD = rA OR rB
nand rD,rA,rB ; logical NAND
; rD = rA NAND rB
srw rD,rS,rB ; shift right word
; rD = (rS >> rB)
srawi rD,rS,SH ; shift right algebraic
; word immediate
; rD = (rS >> SH) [signed]
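A few of these instructions have details that are easy to get wrong, so here is a hedged C sketch of their results. The function names simply reuse the mnemonics, and the models ignore side effects such as the carry bit that srawi also sets. The rule that srw yields 0 for shift amounts of 32 through 63 comes from the architecture, not from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* subf rD,rA,rB: note the operand order, rD = rB - rA. */
uint32_t subf(uint32_t rA, uint32_t rB) { return rB - rA; }

/* mullw rD,rA,rB: low 32 bits of the product. */
uint32_t mullw(uint32_t rA, uint32_t rB) { return rA * rB; }

/* mulhw rD,rA,rB: high 32 bits of the signed 64-bit product. */
uint32_t mulhw(int32_t rA, int32_t rB)
{
    int64_t product = (int64_t)rA * (int64_t)rB;
    return (uint32_t)((uint64_t)product >> 32);
}

/* srw rD,rS,rB: logical shift right by the low 6 bits of rB;
 * amounts of 32 through 63 give 0. */
uint32_t srw(uint32_t rS, uint32_t rB)
{
    uint32_t n = rB & 63u;
    return n > 31u ? 0u : rS >> n;
}

/* srawi rD,rS,SH: shift right, copying the sign bit into vacated bits. */
uint32_t srawi(uint32_t rS, unsigned sh)
{
    assert(sh < 32);
    uint32_t result = rS >> sh;
    if ((rS & 0x80000000u) && sh != 0) {
        result |= ~(0xFFFFFFFFu >> sh);     /* fill the top sh bits with 1s */
    }
    return result;
}
```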
Another flexible and powerful set of instructions is the rotate instructions. They allow
you to perform a number of register operations besides just rotation, including
masking, bit insertions, clearing specific bits, extracting bits, and combinations of
these. Each rotate instruction takes a source register, a destination, an amount to shift
either in a register or as immediate data, and a mask begin (MB) and mask end (ME)
value. The mask is either ANDed with the result or is used to determine which bits to
copy into the destination register. The mask is a 32-bit value with all bits between
location MB and ME set to 1 and all other bits set to 0. For example, the following
instruction will take the contents of R3, rotate it left by 5, AND it with the bit pattern
00001111 11111100 00000000 00000000, and place the result in register R4.
rlwinm r4,r3,5,4,13 ; rotate left word
; immediate, AND with mask
; r4 = rotl(r3,5) & 0x0FFC0000
Note that some assemblers allow you to specify a constant instead of the MB and ME
values.
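The mask rule is easier to check with a small model. The C sketch below builds the mask from MB and ME and applies it after the rotate; the function names are hypothetical. It assumes PowerPC bit numbering, in which bit 0 is the most significant bit, and it also handles the case where MB is greater than ME, in which the mask wraps around the ends of the word.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Mask with bits mb through me set, where bit 0 is the MSB.
 * If mb > me, the mask wraps around from bit 31 to bit 0. */
uint32_t ppc_mask(unsigned mb, unsigned me)
{
    assert(mb < 32 && me < 32);
    uint32_t from_mb = 0xFFFFFFFFu >> mb;           /* bits mb..31 */
    uint32_t to_me   = 0xFFFFFFFFu << (31u - me);   /* bits 0..me  */
    return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* rlwinm rA,rS,SH,MB,ME: rotate left, then AND with the mask. */
uint32_t rlwinm(uint32_t rS, unsigned sh, unsigned mb, unsigned me)
{
    assert(sh < 32);
    return rotl32(rS, sh) & ppc_mask(mb, me);
}
```

With these definitions, ppc_mask(4, 13) is 0x0FFC0000, the pattern used in the example above, and rlwinm(x, 8, 24, 31) extracts the top byte of x into the low byte of the result.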
MOVING DATA
Getting data to and from memory requires the load and store instructions. There are a
few variations, each with the addressing modes mentioned earlier. The amount of
memory, the address alignment, and the specific processor will also affect how much
time the operation will take. Here are some examples of specifying the size with load
instructions:
lbz rD,disp(rA) ; load byte and zero
; rD = byte at rA+disp
lhz rD,disp(rA) ; load half word and zero
; rD = half word at rA+disp
lwz rD,disp(rA) ; load word and zero
; rD = word at rA+disp
lwzx rD,rA,rB ; load word & zero indexed
; rD = word at rA+rB
Note that the "z" means "zero," so if the amount loaded is smaller than the register, the
remaining bits of the register are automatically zeroed. This is like an automatic
extend instruction in 680x0 assembly language. You can also have the effective address
register preincrement, by appending "u" for "update." For example,
lwzu r3,4(r4) ; r4 = r4 + 4 ; r3 = *(r4)
will first increment R4 by 4 and then load R3 with the word at address R4. The
preincrement doesn't exist in 680x0 assembly, but it's similar to the
predecrementing instruction move.l d3,-(a4). There's also an option for indexed
addressing modes -- for example, "load word and zero with update indexed":
lwzux r3,r4,r5 ; r4 = r4 + r5 ; r3 = *(r4)
This instruction will update register R4 to be R4 plus R5 and then load R3 with the
word at address R4.
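The zeroing and update behavior can be sketched in C against a byte array that stands in for memory. This is a model under stated assumptions: the function names are hypothetical, memory is treated as big-endian, and alignment and timing are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Read `size` bytes (1, 2, or 4) at offset ea from a big-endian memory
 * image and zero-extend to 32 bits: the "z" in lbz, lhz, and lwz. */
uint32_t load_zero(const uint8_t *mem, uint32_t ea, unsigned size)
{
    assert(size == 1 || size == 2 || size == 4);
    uint32_t value = 0;
    for (unsigned i = 0; i < size; i++) {
        value = (value << 8) | mem[ea + i];
    }
    return value;
}

/* lwzu rD,disp(rA): update rA first, then load from the new address. */
uint32_t lwzu(const uint8_t *mem, uint32_t *rA, int16_t disp)
{
    *rA += (uint32_t)(int32_t)disp;
    return load_zero(mem, *rA, 4);
}

/* lwzux rD,rA,rB: the same, with the increment taken from a register. */
uint32_t lwzux(const uint8_t *mem, uint32_t *rA, uint32_t rB)
{
    *rA += rB;
    return load_zero(mem, *rA, 4);
}
```

Calling lwzu repeatedly with the same displacement walks through an array one word at a time, which is why the update forms show up so often in compiled loops.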
Store instructions have the same options as load instructions, but start with "st"
instead of "l." (The "z" is omitted because there's no need to zero anything.) For
example:
stb rD,disp(rA) ; store byte
sthx rD,rA,rB ; store half word indexed
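As a companion sketch, the two stores above can be modeled in C the same way. The function names reuse the mnemonics, the byte array standing in for memory is hypothetical, and big-endian byte order is assumed. Only the low-order bytes of the source register are written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* stb rS,disp(rA): store the low 8 bits of rS at rA + disp. */
void stb(uint8_t *mem, uint32_t rS, uint32_t rA, int16_t disp)
{
    assert(mem != NULL);
    uint32_t ea = rA + (uint32_t)(int32_t)disp;
    mem[ea] = (uint8_t)rS;
}

/* sthx rS,rA,rB: store the low 16 bits of rS at rA + rB,
 * most significant byte first. */
void sthx(uint8_t *mem, uint32_t rS, uint32_t rA, uint32_t rB)
{
    assert(mem != NULL);
    uint32_t ea = rA + rB;
    mem[ea]     = (uint8_t)(rS >> 8);
    mem[ea + 1] = (uint8_t)rS;
}
```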