AltiVec Revealed
Volume Number: 15
Issue Number: 7
Column Tag: Into The Hardware
By Tom Thompson
This extension to the PowerPC instruction set
promises high performance for multimedia,
communication, and 3D graphics applications
Introduction
The Macintosh now handles richer and more diverse types of information than ever. A
partial list would include: displaying 3D graphics for scientific applications and
games, capturing data from a digital video camera, decoding and displaying MPEG-2
video from a training DVD, arranging and maintaining a streaming video session, and
using VoIP (voice over IP) to implement a conference call.
For these tasks, a Mac must actually perform a significant amount of real-time data
processing. Aggravating this situation is that the interfaces that supply this
time-critical information are a lot faster than just a year ago: the Mac's standard
Ethernet interface now operates at 100 Mbps and there's FireWire, which blasts data
about at 100- to 400-Mbps rates.
Thus far, the PowerPC processors in these computers have delivered the goods through
sheer computational brawn. It also helps that the PowerPC instruction set provides a
number of Digital Signal Processing (DSP)-style operations and a non-precise
floating-point mode that speeds arithmetic computations. Such capabilities allowed a
first-generation, 80 MHz PowerPC 601 to implement a V.32 modem and a speech
recognition engine entirely in software.
Today's third-generation PowerPC 750 (a.k.a. G3) has more execution units, larger
on-chip caches, support for a high-speed backside L2 cache, and operates at higher
clock speeds. (The clock speeds of today's systems hover around 450 MHz.)
These features endow the Mac with the processing muscle to handle many of the
multimedia and communications chores just described. As this type of work becomes
the norm rather than the exception, however, such demanding jobs will tax even the
capabilities of this processor.
To handle this growing category of applications, in 1997 Motorola announced the first
major extension to the PowerPC since the architecture was conceived in 1991. Termed
AltiVec, it is a technology that performs high-speed hardware-based data manipulation
of vectors. (A vector is a contiguous list of data elements that, from the programmer's
point of view, can be considered a one-dimensional array.) AltiVec works with vectors
that are a fixed 128 bits in length, a size that's sufficient to store sixteen 8-bit
numbers or four 32-bit numbers.
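To make the fixed 128-bit size concrete, here is a minimal sketch in portable C (not AltiVec code). The vec128 type and the vec128_lanes helper are hypothetical names I've made up for illustration; they simply show how the same 16 bytes can be viewed as elements of different widths.

```c
#include <stdint.h>

/* Hypothetical type: one 128-bit quantity viewed at different element widths. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit elements        */
    uint16_t u16[8];   /* eight 16-bit elements         */
    uint32_t u32[4];   /* four 32-bit elements          */
    float    f32[4];   /* four single-precision floats  */
} vec128;

/* Number of elements of a given width (in bits) that fit in one vector. */
static unsigned vec128_lanes(unsigned element_bits)
{
    return 128u / element_bits;
}
```

Calling vec128_lanes(8) gives the sixteen 8-bit elements mentioned above, and vec128_lanes(32) gives the four 32-bit elements.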
The data arrays manipulated by communications and graphics algorithms are -- at the
machine code level -- represented as vectors. Since AltiVec manipulates vectors in
hardware, the technology can significantly accelerate such algorithms. Put another
way, AltiVec enhances the PowerPC's ability to handle vector operations, similar to
how the Floating-Point Unit (FPU) boosts the speed of floating-point operations.
AltiVec has over 160 PowerPC-compatible instructions that perform various
arithmetic and logical operations on a vector's contents. In addition, AltiVec
instructions can manipulate both integer and floating-point data, unlike Intel's MMX
technology that is restricted to integer data.
The AltiVec technology will first appear in a fourth-generation PowerPC processor,
the G4. The G4 has been sampling in quantity since late last year, and should appear in
Macs early next year. Given the lead times for becoming familiar with a new
technology and revising your software to take advantage of it, now's the time for
MacTech readers to take a serious look at AltiVec.
Something Old, Something New
To fully appreciate how the AltiVec technology fits into PowerPC architecture and
operates, it's necessary to take a tour of the G4 processor itself. The G4 is a 32-bit
PowerPC processor that's basically a souped-up G3 (which itself is a souped-up
PowerPC 603e). Although the G4 borrows heavily from its G3 roots, significant
changes in the design will allow the G4 to deliver performance that is vastly
superior to that of its predecessor.
As Figure 1 shows, the G4 design starts with the G3's six concurrent execution units
and adds a new one. This seventh unit implements the AltiVec technology and is thus
called the vector unit. I'll describe the capabilities of the vector unit shortly. The
processor reuses the G3's proven integer core, which consists of two integer
Arithmetic Logic Units (ALUs), and the System Unit. (The System Unit is considered
an execution unit because it participates in certain integer calculations.) The Branch
unit, which uses the G3's prediction logic to manage branch and jump operations,
remains unchanged.
Figure 1. The G4's microarchitecture.
However, here the similarities end. The internal buses between the processor's L1
caches and many of the execution units have been expanded from 64 to 128 bits. The
wider buses were necessary to support AltiVec's vector operations, but they also boost
data transfers throughout the processor. The G4's FPU has been beefed up so that
double-precision floating-point instructions -- not just single-precision
instructions, as was the case with the G3 -- are fully pipelined. The resulting lower
latency in the FPU's processing of these instructions, in combination with the wider
internal buses, should enable the G4 to accelerate many scientific applications that
rely heavily on double-precision floating-point calculations. The processor's backside
L2 cache interface is now 128 bits wide and supports 128-bit transfers. The
maximum size of the L2 cache is doubled to 2 MB.
To effectively process real-time or multimedia data, the G4 must obtain it from main
memory at a steady rate. To ensure that this occurs, the processor has four
software-controlled prefetch engines, called Data Stream (DS) channels. Each engine
operates independently of the others, and uses the processor's empty (idle) bus cycles
to transfer data into the L1 and L2 caches. You use a set of Data Stream Touch (DST)
instructions to initiate a transfer, and it proceeds automatically without further
program intervention.
Finally, the G4 brings back the multiprocessor (MP) support that's absent in the G3.
To simplify the G3's bus design, a fourth "shared" state was removed from its cache
coherency protocols. This shared state is necessary to implement the shared data
regions that MP systems rely on for coordinating activities and exchanging
information. Building a multiprocessor system out of more than two G3s required the
addition of glue logic, which complicates an MP system design and increases its cost.
The G4's bus uses a 5-state cache coherency protocol. This includes the standard four
states known as MESI (modified/exclusive/shared/invalid), plus a new "reserved"
state. The reserved state implements direct data transfers between processor caches.
Up to four G4s and their L2 caches can be assembled into a multiprocessor array
without glue logic, which makes it easy to build MP systems.
Motorola fabricates the G4 with its HIP5 0.22-micron, six-metal layer CMOS process
that uses copper as the interconnecting material. This process technology allows
Motorola to pack the G4's 10.5 million transistors onto an 83 mm² die. (For
comparison, Intel's Deschutes version of the Pentium II die occupies 118 mm², and
the Katmai Pentium III die weighs in at 128 mm².) With the copper traces and a lower
operating voltage of 1.2 Volts, the G4 dissipates less than 8 Watts at 400 MHz. Despite
the higher clock speed and additional transistors, the G4's power consumption
compares favorably with a G3's, which operates at 3.3 Volts and dissipates 5 Watts at
250 MHz.
Initial versions of the G4 will use the same 360-pin Ball Grid Array (BGA) packaging
that houses the G3. This makes the G4 pin-compatible with the G3, and allows it to be
dropped into existing G3-based designs. However, this configuration limits the width
of the G4's bus and L2 cache interfaces to 64 bits, which crimps the processor's
throughput to external memory and peripherals. To realize its full potential, later G4s
will sport the 128-bit interfaces. Depending upon when the first G4-based Macs ship,
Apple engineers might use the G3 pin-compatible version of the G4 to speed system
design and testing. Or, they may opt for better performance by leap-frogging to a G4
equipped with the wider paths, similar to what happened with the PowerPC 740/750
versions for G3-based systems.
To sum up, the G4's microarchitecture provides a slew of new features that will
improve the performance of many multimedia applications and 3D graphics
applications. The improved FPU should also make a G4-based Mac a valuable tool for
complex simulations and data visualization. With its modest power consumption, Mac
road warriors can expect to see the G4 at the heart of future PowerBook models. Power
users also win, since the G4's multiprocessor support means that high-performance
MP systems will be available to tackle heavy-duty computing jobs.
Vector Unit Overview
As Figure 1 indicates, a separate autonomous execution unit implements AltiVec's
vector instructions. This vector unit has its own register file, status/control register
(VSCR), and a vector save/restore register (VRSAVE). The register file consists of 32
registers that are 128 bits wide. To adequately feed the vector unit, 128-bit wide data
paths link it and other execution units to the processor's L1 caches and the load/store
unit. The vector unit shares few resources and communications paths with the other
execution units. This eliminates situations where the vector unit must be tightly
synchronized to another execution unit because it depends on data from that particular
unit. Because it works in parallel with the other execution units and has its own
register file, you don't have to invoke special processor mode-switching routines
before using vector instructions. You can freely intermix integer, floating-point, and
vector instructions in the source code without impacting a program's performance.
To further improve the vector unit's throughput, it uses a simplified, streamlined
design. There's no support for misaligned data accesses and it generates only a few
hardware exceptions. Nor does the vector unit implement complex instructions: many
of them execute in a single cycle, although certain instructions can take up to three or
four cycles to execute.
Figure 2. Detail of the vector unit. All of the sub-units operate in parallel. The G4
can dispatch two vector instructions at a time to the unit.
The vector unit itself consists of parallel sub-units, as illustrated in Figure 2. Each
sub-unit is tailored to handle specific instructions. A vector simple sub-unit and
vector complex sub-unit handle integer vector operations. The vector floating-point
sub-unit deals with the floating-point operations. The vector permute sub-unit
implements a large-scale shift operation and can selectively reorder the contents of
vectors. Certain application-specific versions of the G4 might have a vector unit that
has different combinations of these sub-units, such as an array of vector simple
sub-units.
The vector simple sub-unit executes single-cycle instructions that perform addition,
subtraction, comparison, shifts, and logical operations with vectors. The vector
complex sub-unit fields the compute-intensive multiply and multiply-add
instructions that require several cycles to complete.
The vector floating-point sub-unit is equipped with four multiply-add devices that
process four single-precision floating-point numbers simultaneously. It performs
floating-point add, subtract, and multiply-add vector instructions in four cycles.
Because of the parallel multiply-add devices, properly coded algorithms should
execute up to four times faster than their scalar equivalents.
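What the four multiply-add devices compute can be sketched as a scalar loop in portable C. This is only a model of the arithmetic, not AltiVec code: the function name madd4 and the operand order are my own assumptions, and the loop makes no attempt to reproduce the hardware's rounding behavior or timing.

```c
/* Hypothetical scalar model of a four-lane multiply-add:
   d[i] = a[i] * c[i] + b[i] for each of the four lanes.
   The hardware would process all four lanes at once. */
static void madd4(const float a[4], const float c[4],
                  const float b[4], float d[4])
{
    for (int i = 0; i < 4; i++) {
        d[i] = a[i] * c[i] + b[i];
    }
}
```

The loop body has no dependency between iterations, which is exactly the property that lets the four lanes proceed in parallel.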
The floating-point sub-unit has two modes of operation: a Java mode and a non-Java
mode. The Java mode provides compliance with the Java Language Specification 1. The
non-Java mode provides faster results with less numeric accuracy. This latter mode is
useful for real-time algorithms where response times are more critical than the
data's accuracy.
In the Java mode, the floating-point sub-unit implements the default behavior for
exception handling as specified by the IEEE 754 numeric standard. In addition, it
supports only the standard's default rounding mode, round-to-nearest. This simplifies
the floating-point sub-unit's design in that arithmetic errors generate default results
and don't invoke a hardware exception, and rounding control flags aren't required. Nor
does the sub-unit set any floating-point status flags in the FPU's status register. As a
consequence of these simplifications, technical applications that require full
compliance with the IEEE 754 standard must use the G4's FPU. However, algorithms
that require only single-precision floating-point arithmetic (such as those for signal
processing and 3D graphics) will work fine within these limits.
The vector permute sub-unit implements a sophisticated data-mangling instruction
known as permute, which gives the unit its name. With the permute instruction, you
can choose individual bytes from two source vector registers and merge them into any
position within a destination vector register. In a single cycle, the permute
instruction can clean up misaligned data or slip a new destination address into a
network packet header.
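The byte-selection behavior described above can be modeled in portable C. This sketch is not AltiVec code, and the name permute16 is hypothetical. It assumes that a third "control" vector supplies one selector byte per destination byte, and that only the low 5 bits of each selector matter (enough to address the 32 source bytes).

```c
#include <stdint.h>

/* Scalar model of a byte permute. The two 16-byte sources are treated as
   one 32-byte table; each control byte picks one table entry.
   Assumption: only the low 5 bits of each control byte are used. */
static void permute16(const uint8_t a[16], const uint8_t b[16],
                      const uint8_t ctrl[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned sel = ctrl[i] & 0x1Fu;              /* 0..31 */
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];  /* a holds 0..15, b holds 16..31 */
    }
}
```

For example, filling ctrl with the values 3, 4, 5, ... 18 extracts the 16 bytes that start at offset 3 in the first source and run into the second, which is the kind of cleanup that misaligned data requires.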
All sub-units but the permute unit use the Single Instruction Multiple Data (SIMD)
technique that enables the hardware to process a vector's data elements in parallel.
This capability allows the vector unit to process 16 integer or four floating-point
operations at a time. The G4 can dispatch up to two AltiVec instructions (one
arithmetic/logic and one permute) to the vector unit per tick of the processor clock.
Therefore, for a 400 MHz G4, the vector unit can reach a
peak of 12.8 billion integer operations per second, while floating-point calculations
using multiply-add instructions can hit 3.2 GFLOPS.
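The derivation of these peak figures isn't spelled out here, but one accounting that reproduces them is to count two operations per element per clock (for instance, treating a multiply-add as a multiply plus an add). The sketch below uses that assumption; the function name peak_mops and its parameters are hypothetical.

```c
/* Peak throughput in millions of operations per second.
   Assumption: every clock tick completes ops_per_element operations
   on each of the vector's elements. */
static unsigned long peak_mops(unsigned long clock_mhz,
                               unsigned elements,
                               unsigned ops_per_element)
{
    return clock_mhz * elements * ops_per_element;
}
```

With this accounting, peak_mops(400, 16, 2) yields 12,800 million integer operations per second and peak_mops(400, 4, 2) yields 3,200 MFLOPS. These are theoretical peaks; real code will see less.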
Vector Instruction Overview
AltiVec instructions, like other PowerPC instructions, are a fixed 32 bits in length.
The instructions typically use three operands (two source registers and one
destination register), although there are the inevitable exceptions to this format.
When an instruction completes, the contents of the source operands are left intact.
AltiVec follows the principles of RISC in that the instructions only modify the contents
of the vector registers. Vector load and store instructions must be used to transfer data
between memory and these registers.
AltiVec instructions can be divided into six categories. These are:
• Vector integer arithmetic operations. These instructions implement add,
subtract, multiply and shift operations for integer computations, plus Boolean
logic and compare operations for bit masking and program control.
• Vector floating-point operations. These instructions perform calculations
on vectors that contain floating-point numbers. They support add, subtract, and
multiply-add operations, plus the obligatory conversions between integer and
floating-point values.
• Vector permute/format operations. These instructions handle
sophisticated data manipulation and data replication functions. Some of them
implement data packing and unpacking operations, including two instructions
whose specialty is format conversions between 16-bit video pixels and
32-bit graphics pixels.
• Vector load/store operations. These instructions retrieve data from
memory into a vector register, or write the contents of a vector register to
memory. Load and store operations often work with quadwords. However, they
also support scalar (non-vector) transfers.
• Memory control operations. These instructions manage the inflow of data
to the processor caches. Specifically, they are the DST instructions that
control the G4's prefetch engines.
• Processor control operations. These instructions load and store the
contents of the vector unit's control/status register.
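To give a feel for the pixel format conversions mentioned in the permute/format category, here is a minimal sketch in portable C for a single pixel. It assumes a 1-5-5-5 layout for the 16-bit pixel and an 8-8-8-8 layout for the 32-bit pixel, and it keeps the most significant bits of each channel. These layouts, the bit selection, and the function names are my assumptions for illustration, and may not match what the hardware instructions do bit for bit.

```c
#include <stdint.h>

/* Pack an 8-8-8-8 pixel (A, R, G, B from the top byte down) into 1-5-5-5
   by keeping the most significant bits of each channel. Layouts assumed. */
static uint16_t pack_8888_to_1555(uint32_t p)
{
    uint16_t a = (uint16_t)((p >> 31) & 0x01u);
    uint16_t r = (uint16_t)((p >> 19) & 0x1Fu);
    uint16_t g = (uint16_t)((p >> 11) & 0x1Fu);
    uint16_t b = (uint16_t)((p >>  3) & 0x1Fu);
    return (uint16_t)((a << 15) | (r << 10) | (g << 5) | b);
}

/* Unpack 1-5-5-5 to 8-8-8-8. Each 5-bit channel lands in the top of its
   byte (low 3 bits zero); the 1-bit channel expands to 0x00 or 0xFF. */
static uint32_t unpack_1555_to_8888(uint16_t p)
{
    uint32_t a = (p >> 15) & 0x01u;
    uint32_t r = (p >> 10) & 0x1Fu;
    uint32_t g = (p >>  5) & 0x1Fu;
    uint32_t b =  p        & 0x1Fu;
    return (a ? 0xFF000000u : 0u) | (r << 19) | (g << 11) | (b << 3);
}
```

A vector instruction would apply this kind of conversion to several pixels at once rather than one at a time.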
Most of the time, you'll work exclusively with instructions in the first four
categories. The latter two categories will be used in performance-critical applications
where data must be readily available in the processor caches, or to switch the
processor between the Java/non-Java mode.
AltiVec instructions fall into two distinct groups as to how they manipulate the vector
data: intraelement operations and interelement operations. Intraelement operations
take elements from the same location or position in the source registers, process them
in parallel, and place the results in the same location in a destination register, as
shown in Figure 3. For example, a vector add instruction takes the corresponding
elements in two source registers, adds them together, and stores the sums in a destination
register. The bulk of AltiVec's arithmetic and logical instructions process elements in
this fashion.
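An intraelement add can be modeled with a simple loop in portable C. This is a sketch, not AltiVec code: the name add16_u8 is hypothetical, and it assumes sixteen unsigned 8-bit elements whose sums wrap around modulo 256.

```c
#include <stdint.h>

/* Scalar model of an element-wise add over sixteen 8-bit lanes.
   Element i of the result depends only on element i of each source.
   Assumption: sums wrap around modulo 256. */
static void add16_u8(const uint8_t a[16], const uint8_t b[16], uint8_t d[16])
{
    for (int i = 0; i < 16; i++) {
        d[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Note that no element ever moves to a different position, which is what distinguishes this group of operations.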
Figure 3. AltiVec features vector operations that either retain the order of the
elements processed (intraelement), or rearrange them (interelement).
Interelement operations take elements from different locations in the source
registers, process them, and place the outcome into different locations in the
destination register. The permute instruction, and its variants such as merge and
splat, perform interelement operations.
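The splat and merge variants can be sketched in portable C as follows. Again, this is a scalar model rather than AltiVec code. The function names are hypothetical, and the merge shown assumes that it interleaves the first eight byte elements of each source.

```c
#include <stdint.h>

/* Splat: copy element n of the source into every element of the result. */
static void splat16_u8(const uint8_t a[16], unsigned n, uint8_t d[16])
{
    uint8_t v = a[n & 15u];   /* read first, so d may be the same array as a */
    for (int i = 0; i < 16; i++) {
        d[i] = v;
    }
}

/* Merge: interleave the first eight elements of a and b.
   Assumption: d is a separate array from a and b. */
static void merge_high16_u8(const uint8_t a[16], const uint8_t b[16],
                            uint8_t d[16])
{
    for (int i = 0; i < 8; i++) {
        d[2 * i]     = a[i];
        d[2 * i + 1] = b[i];
    }
}
```

In both cases an element's position in the result differs from its position in the source, which is the hallmark of an interelement operation.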
In short, intraelement operations fetch, process, and store data elements without
disturbing the order of the data. Interelement operations fetch elements in any order,
process them, and store the results in any order. By organizing the instructions this
way, Motorola could use separate sub-units to implement the instructions. The
separate sub-units improve the parallel processing ability of the vector unit, thus
improving its throughput.
AltiVec introduces 162 new instructions and four new data
types. These data types consist of packed data elements that occupy a 128-bit quantity
called a quadword. Quadwords represent the contents of memory or vector registers. As
shown in Figure 4, quadwords can be divided into vectors composed of 16 bytes (8