All Databases MacTech Vol 15-1999

AltiVec Revealed

Volume Number: 15

Issue Number: 7

Column Tag: Into The Hardware

By Tom Thompson

This extension to the PowerPC instruction set

promises high performance for multimedia,

communication, and 3D graphics applications

Introduction

More than ever, the Macintosh handles more diverse and richer types of information. A

partial list would include: displaying 3D graphics for scientific applications and

games, capturing data from a digital video camera, decoding and displaying MPEG-2

video from a training DVD, arranging and maintaining a streaming video session, and

using VOIP (voice over IP) to implement a conference call.

For these tasks, a Mac must actually perform a significant amount of real-time data

processing. Aggravating this situation is that the interfaces that supply this

time-critical information are a lot faster than just a year ago: the Mac's standard

Ethernet interface now operates at 100 Mbps and there's FireWire, which blasts data

about at 100- to 400-Mbps rates.

Thus far, the PowerPC processors in these computers have delivered the goods through

sheer computational brawn. It also helps that the PowerPC instruction set provides a

number of Digital Signal Processing (DSP)-style operations and a non-precise

floating-point mode that speeds arithmetic computations. Such capabilities allowed a

first-generation, 80 MHz PowerPC 601 to implement a V.32 modem and a speech

recognition engine entirely in software.

Today's third-generation PowerPC 750 (a.k.a. G3) has more execution units, larger

on-chip caches, support for a high-speed backside L2 cache, and operates at higher

clock speeds. (Clock speeds of today's systems hover around 450 MHz.)

These features endow the Mac with the processing muscle to handle many of the

multimedia and communications chores just described. As this type of work becomes

the norm rather than the exception, however, such demanding jobs will tax even the

capabilities of this processor.

To handle this growing category of applications, in 1997 Motorola announced the first major

extension to the PowerPC since the architecture was conceived in 1991. Termed

AltiVec, it is a technology that performs high-speed hardware-based data manipulation

of vectors. (A vector is a contiguous list of data elements that, from the programmer's

point of view, can be considered a one-dimensional array.) AltiVec works with vectors

that are a fixed 128 bits in length, a size that's sufficient to store sixteen 8-bit

numbers or four 32-bit numbers.
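
To make the fixed 128-bit size concrete, here is a minimal sketch in portable C that models one vector as a union of element arrays. The names vec128 and vec128_lanes are hypothetical and chosen only for illustration; they are not AltiVec types or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit vector; not an actual AltiVec type. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit elements */
    uint16_t u16[8];   /* eight 16-bit elements  */
    uint32_t u32[4];   /* four 32-bit elements   */
    float    f32[4];   /* four single-precision floating-point elements */
} vec128;

/* Number of elements of the given width (in bits) that fit in one vector. */
static unsigned vec128_lanes(unsigned element_bits)
{
    assert(element_bits == 8 || element_bits == 16 || element_bits == 32);
    return 128u / element_bits;
}
```

Whichever view is used, the union occupies 16 bytes: the same 128 bits are simply interpreted as a different number of elements.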

The data arrays manipulated by communications and graphics algorithms are -- at the

machine code level -- represented as vectors. Since AltiVec manipulates vectors in

hardware, the technology can significantly accelerate such algorithms. Put another

way, AltiVec enhances the PowerPC's ability to handle vector operations, similar to

how the Floating-Point Unit (FPU) boosts the speed of floating-point operations.

AltiVec has over 160 PowerPC-compatible instructions that perform various

arithmetic and logical operations on a vector's contents. In addition, AltiVec

instructions can manipulate both integer and floating-point data, unlike Intel's MMX

technology that is restricted to integer data.

The AltiVec technology will first appear in a fourth-generation PowerPC processor,

the G4. The G4 has been sampling in quantity since late last year, and should appear in

Macs early next year. Given the lead times for becoming familiar with a new

technology and revising your software to take advantage of it, now's the time for

MacTech readers to take a serious look at AltiVec.

Something Old, Something New

To fully appreciate how the AltiVec technology fits into PowerPC architecture and

operates, it's necessary to take a tour of the G4 processor itself. The G4 is a 32-bit

PowerPC processor that's basically a souped-up G3 (which itself is a souped-up

PowerPC 603e). Although the G4 borrows heavily from its G3 roots, significant

changes in the design will allow the G4 to deliver performance that will be vastly

superior to its predecessor.

As Figure 1 shows, the G4 design starts with the G3's six concurrent execution units

and adds a new one. This seventh unit implements the AltiVec technology and is thus

called the vector unit. I'll describe the capabilities of the vector unit shortly. The

processor reuses the G3's proven integer core, which consists of two integer

Arithmetic Logic Units (ALUs), and the System Unit. (The System Unit is considered

an execution unit because it participates in certain integer calculations.) The Branch

unit, which uses the G3's prediction logic to manage branch and jump operations,

remains unchanged.

Figure 1. The G4's microarchitecture.

However, here the similarities end. The internal buses between the processor's L1

caches and many of the execution units have been expanded from 64 to 128 bits. The

wider buses were necessary to support AltiVec's vector operations, but they also boost

data transfers throughout the processor. The G4's FPU has been beefed up so that

double-precision floating-point instructions -- not just single-precision

instructions, as was the case with the G3 -- are fully pipelined. The resulting lower

latency in the FPU's processing of these instructions, in combination with the wider

internal buses, should enable the G4 to accelerate many scientific applications that

rely heavily on double-precision floating-point calculations. The processor's backside

L2 cache interface is now 128 bits wide and supports 128-bit transfers. The

maximum size of the L2 cache is doubled to 2 MB.

To effectively process real-time or multimedia data, the G4 must obtain it from main

memory at a steady rate. To ensure that this occurs, the processor has four

software-controlled prefetch engines, called Data Stream (DS) channels. Each engine

operates independently of the others, and uses the processor's empty (idle) bus cycles

to transfer data into the L1 and L2 caches. You use a set of Data Stream Touch (DST)

instructions to initiate a transfer, and it proceeds automatically without further

program intervention.
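
The idea of software-initiated prefetching can be sketched in portable C. The example below does not use the DST instructions; as a stand-in it assumes a GCC or Clang compiler and uses their __builtin_prefetch hint, which asks the hardware to start fetching a cache line ahead of its use. The function name and the look-ahead distance are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical look-ahead distance, in elements; tune for the real workload. */
enum { PREFETCH_AHEAD = 64 };

/* Sums an array while hinting that data further along will be needed soon.
 * __builtin_prefetch is a GCC/Clang extension standing in for a DST-style
 * prefetch; it is a hint only and never changes the computed result. */
static uint32_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD]);
        sum += data[i];
    }
    return sum;
}
```

Unlike this per-iteration hint, a data stream channel as described above is started once and then proceeds on its own, so the loop body stays free of prefetch bookkeeping.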

Finally, the G4 brings back the multiprocessor (MP) support that's absent in the G3.

To simplify the G3's bus design, a fourth "shared" state was removed from its cache

coherency protocols. This shared state is necessary to implement the shared data

regions that MP systems rely on for coordinating activities and exchanging

information. Building a multiprocessor system out of more than two G3s required the

addition of glue logic, which complicates an MP system design and increases its cost.

The G4's bus uses a 5-state cache coherency protocol. This includes the standard four

states known as MESI (modified/exclusive/shared/invalid), plus a new "reserved"

state. The reserved state implements direct data transfers between processor caches.

Up to four G4s and their L2 caches can be assembled into a multiprocessor array

without glue logic, which makes it easy to build MP systems.

Motorola fabricates the G4 with its HIP5 0.22-micron, six-metal layer CMOS process

that uses copper as the interconnecting material. This process technology allows

Motorola to pack the G4's 10.5 million transistors onto an 83 mm² die. (For

comparison, Intel's Deschutes version of the Pentium II die occupies 118 mm², and

the Katmai Pentium III die weighs in at 128 mm².) With the copper traces and a lower

operating voltage of 1.2 Volts, the G4 dissipates less than 8 Watts at 400 MHz. Despite

the higher clock speed and additional transistors, the G4's power consumption

compares favorably with a G3's, which operates at 3.3 Volts and dissipates 5 Watts at

250 MHz.

Initial versions of the G4 will use the same 360-pin Ball Grid Array (BGA) packaging

that houses the G3. This makes the G4 pin-compatible with the G3, and allows it to be

dropped into existing G3-based designs. However, this configuration limits the width

of the G4's bus and L2 cache interfaces to 64 bits, which crimps the processor's

throughput to external memory and peripherals. To realize its full potential, later G4s

will sport the 128-bit interfaces. Depending upon when the first G4-based Macs ship,

Apple engineers might use the G3 pin-compatible version of the G4 to speed system

design and testing. Or, they may opt for better performance by leap-frogging to a G4

equipped with the wider paths, similar to what happened with the PowerPC 740/750

versions for G3-based systems.

To sum up, the G4's microarchitecture provides a slew of new features that will

improve the performance of many multimedia applications and 3D graphics

applications. The improved FPU should also make a G4-based Mac a valuable tool for

complex simulations and data visualization. With its modest power consumption, Mac

road warriors can expect to see the G4 at the heart of future PowerBook models. Power

users also win, since the G4's multiprocessor support means that high-performance

MP systems will be available to tackle heavy-duty computing jobs.

Vector Unit Overview

As Figure 1 indicates, a separate autonomous execution unit implements AltiVec's

vector instructions. This vector unit has its own register file, status/control register

(VSCR), and a vector save/restore register (VRSAVE). The register file consists of 32

registers that are 128 bits wide. To adequately feed the vector unit, 128-bit wide data

paths link it and other execution units to the processor's L1 caches and the load/store

unit. The vector unit shares few resources and communications paths with the other

execution units. This eliminates situations where the vector unit must be tightly

synchronized to another execution unit because it depends on data from that particular

unit. Because it works in parallel with the other execution units and has its own

register file, no mode switch is needed

before using vector instructions. You can freely intermix integer, floating-point, and

vector instructions in the source code without impacting a program's performance.

To further improve the vector unit's throughput, it uses a simplified, streamlined

design. There's no support for misaligned data accesses and it generates only a few

hardware exceptions. Nor does the vector unit implement complex instructions: many

of them execute in a single cycle, although certain instructions can take up to three or

four cycles to execute.
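
Because the vector unit does not handle misaligned accesses, software has to know where its data sits relative to 16-byte (128-bit) boundaries. The helpers below are a minimal sketch of that bookkeeping in portable C; the function names are hypothetical and are not part of any AltiVec interface.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when addr falls on a 16-byte (128-bit) boundary. */
static int is_aligned16(uintptr_t addr)
{
    return (addr & 15u) == 0;
}

/* Rounds addr down to the 16-byte boundary at or below it. */
static uintptr_t align_down16(uintptr_t addr)
{
    uintptr_t base = addr & ~(uintptr_t)15;
    assert(is_aligned16(base));
    return base;
}

/* Byte offset of addr within its 16-byte block (0 when already aligned). */
static unsigned misalignment16(uintptr_t addr)
{
    return (unsigned)(addr & 15u);
}
```

A data run that starts at a nonzero offset straddles two aligned blocks, so a program would load both blocks and then extract the bytes it wants.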

Figure 2. Detail of the vector unit. All of the sub-units operate in parallel. The G4

can dispatch two vector instructions at a time to the unit.

The vector unit itself consists of parallel sub-units, as illustrated in Figure 2. Each

sub-unit is tailored to handle specific instructions. A vector simple sub-unit and

vector complex sub-unit handle integer vector operations. The vector floating-point

sub-unit deals with the floating-point operations. The vector permute sub-unit

implements a large-scale shift operation and can selectively reorder the contents of

vectors. Certain application-specific versions of the G4 might have a vector unit that

has different combinations of these sub-units, such as an array of vector simple

sub-units.

The vector simple sub-unit executes single-cycle instructions that perform addition,

subtraction, comparison, shifts, and logical operations with vectors. The vector

complex sub-unit fields the compute-intensive multiply and multiply-add

instructions that require several cycles to complete.

The vector floating-point sub-unit is equipped with four multiply-add devices that

process four single-precision floating-point numbers simultaneously. It performs

floating-point add, subtract, and multiply-add vector instructions in four cycles.

Because of the parallel multiply-add devices, properly coded algorithms should

execute four times faster.
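
What "properly coded" means here is that the algorithm is arranged so each step works on four independent values at once. The sketch below models that in portable scalar C, under the assumption that one call to the hypothetical madd4 corresponds to one vector multiply-add; on a G4 the four lanes would be computed by the hardware in parallel, and rounding may differ from this two-step scalar version.

```c
#include <assert.h>
#include <stddef.h>

/* Models one four-lane multiply-add: out[i] = a[i] * b[i] + c[i].
 * Hypothetical helper; each lane is independent of the others. */
static void madd4(const float a[4], const float b[4], const float c[4],
                  float out[4])
{
    for (int lane = 0; lane < 4; lane++)
        out[lane] = a[lane] * b[lane] + c[lane];
}

/* Computes y[i] = alpha * x[i] + y[i], four elements per step.
 * n must be a multiple of 4 so every step fills a whole vector. */
static void saxpy4(float alpha, const float *x, float *y, size_t n)
{
    assert(n % 4 == 0);
    const float av[4] = { alpha, alpha, alpha, alpha };
    for (size_t i = 0; i < n; i += 4)
        madd4(av, &x[i], &y[i], &y[i]);
}
```

The loop makes n/4 trips instead of n, which is where the potential fourfold gain comes from; a loop whose iterations depend on one another could not be regrouped this way.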

The floating-point sub-unit has two modes of operation: a Java mode and a non-Java

mode. The Java mode provides compliance with the Java Language Specification 1. The

non-Java mode provides faster results with less numeric accuracy. This latter mode is

useful for real-time algorithms where response times are more critical than the

data's accuracy.

In the Java mode, the floating-point sub-unit implements the default behavior for

exception handling as specified by the IEEE 754 numeric standard. In addition, it

supports only the standard's default rounding mode, round-to-nearest. This simplifies

the floating-point sub-unit's design in that arithmetic errors generate default results

and don't invoke a hardware exception, and rounding control flags aren't required. Nor

does the sub-unit set any floating-point status flags in the FPU's status register. As a

consequence of these simplifications, technical applications that require full

compliance with the IEEE 754 standard must use the G4's FPU. However, algorithms

that require only single-precision floating-point arithmetic (such as those for signal

processing and 3D graphics) will work fine within these limits.

The vector permute sub-unit implements a sophisticated data-mangling instruction

known as permute, which gives the unit its name. With the permute instruction, you

can choose individual bytes from two source vector registers and merge them into any

position within a destination vector register. In a single cycle, the permute

instruction can clean up misaligned data or slip a new destination address into a

network packet header.
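
The byte-selection behavior described above can be modeled in a few lines of portable C. In this sketch, which uses hypothetical function names, each byte of a control vector picks one of the 32 bytes formed by the two sources: values 0 to 15 select from the first source and 16 to 31 from the second. The second helper builds a control vector that pulls 16 contiguous bytes out of two adjacent blocks, which is one way to clean up misaligned data.

```c
#include <assert.h>
#include <stdint.h>

/* Models a byte permute: out[i] is chosen from the 32 bytes of a and b by
 * the low five bits of control[i] (0..15 pick from a, 16..31 from b). */
static void permute_bytes(const uint8_t a[16], const uint8_t b[16],
                          const uint8_t control[16], uint8_t out[16])
{
    uint8_t result[16];   /* temporary, so out may alias a source */
    for (int i = 0; i < 16; i++) {
        unsigned sel = control[i] & 0x1Fu;
        result[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
    for (int i = 0; i < 16; i++)
        out[i] = result[i];
}

/* Builds a control vector selecting the 16 bytes that start at the given
 * offset within the 32-byte concatenation of a and b. */
static void shift_control(unsigned offset, uint8_t control[16])
{
    assert(offset < 16);
    for (unsigned i = 0; i < 16; i++)
        control[i] = (uint8_t)(offset + i);
}
```

With an offset of 3, for example, the result holds bytes 3 through 15 of the first source followed by bytes 0 through 2 of the second.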

All sub-units but the permute unit use the Single Instruction Multiple Data (SIMD)

technique that enables the hardware to process a vector's data elements in parallel.

This capability allows the vector unit to process 16 integer or four floating-point

operations at a time. The G4 can dispatch up to two AltiVec instructions (one

arithmetic/logic and one permute) to the vector unit per tick of the processor clock.

Therefore, for a 400 MHz G4, the vector unit can reach a

peak of 12.8 billion integer operations per second, while floating-point calculations

using multiply-add instructions can hit 3.2 GFLOPS.
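
The article does not spell out how these two peak figures are derived. One accounting that reproduces them, offered here as an assumption, multiplies the clock rate by the number of elements per vector and by two operations per element per cycle (for floating point, the multiply and the add of a multiply-add). The function name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Peak operations per second under the assumed accounting:
 * clock rate x elements per vector x operations per element per cycle. */
static uint64_t peak_ops_per_second(uint64_t clock_hz, unsigned lanes,
                                    unsigned ops_per_lane_per_cycle)
{
    assert(lanes > 0 && ops_per_lane_per_cycle > 0);
    return clock_hz * lanes * ops_per_lane_per_cycle;
}
```

At 400 MHz, 16 byte-sized lanes at two operations each give 12.8 billion operations per second, and four floating-point lanes at two operations each give 3.2 billion. These are ceilings that assume the unit is kept busy every cycle.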

Vector Instruction Overview

AltiVec instructions, like other PowerPC instructions, are a fixed 32 bits in length.

The instructions typically use three operands (two source registers and one

destination register), although there are the inevitable exceptions to this format.

When an instruction completes, the contents of the source operands are left intact.

AltiVec follows the principles of RISC in that the instructions only modify the contents

of the vector registers. Vector load and store instructions must be used to transfer data

between memory and these registers.

AltiVec instructions can be divided into six categories. These are:

• Vector integer arithmetic operations. These instructions implement add,

subtract, multiply and shift operations for integer computations, plus Boolean

logic and compare operations for bit masking and program control.

• Vector floating-point operations. These instructions perform calculations

on vectors that contain floating-point values. They support add, subtract, and

multiply-add operations, plus the obligatory conversions between integer and

floating-point values.

• Vector permute/format operations. These instructions handle

sophisticated data manipulation and data replication functions. Some of them

implement data packing and unpacking operations, including two instructions

whose specialty is format conversions between 16-bit video pixels and

32-bit graphics pixels.

• Vector load/store operations. These instructions retrieve data from

memory into a vector register, or write the contents of a vector register to

memory. Load and store operations often work with quadwords. However, they

also support scalar (non-vector) transfers.

• Memory control operations. These instructions manage the inflow of data

to the processor caches. Specifically, they are the DST instructions that

control the G4's prefetch engines.

• Processor control operations. These instructions load and store the

contents of the vector unit's control/status register.

Most of the time, you'll work exclusively with instructions in the first four

categories. The latter two categories will be used in performance-critical applications

where data must be readily available in the processor caches, or to switch the

processor between the Java and non-Java modes.

AltiVec instructions fall into two distinct groups as to how they manipulate the vector

data: intraelement operations and interelement operations. Intraelement operations

take elements from the same location or position in the source registers, process them

in parallel, and place the results in the same location in a destination register, as

shown in Figure 3. For example, a vector add instruction takes the corresponding

elements in two source registers, adds them together, and stores the sums in a destination

register. Most arithmetic and logical instructions operate in this fashion.

Figure 3. AltiVec features vector operations that either retain the order of the

elements processed (intraelement), or rearrange them (interelement).

Interelement operations take elements from different locations in the source

registers, process them, and place the outcome into different locations in the

destination register. The permute instruction, and its variants such as merge and

splat, perform interelement operations.
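
The two groups can be contrasted with a small model in portable C, working on vectors of sixteen bytes. The function names are hypothetical, and the exact element ordering of the real merge and splat instructions is assumed here rather than taken from the article. In the add, result position i depends only on position i of the sources; in the splat and merge, elements move to different positions.

```c
#include <assert.h>
#include <stdint.h>

/* Intraelement: position i of the result uses only position i of a and b. */
static void add_bytes(const uint8_t a[16], const uint8_t b[16],
                      uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(a[i] + b[i]);   /* wraps modulo 256 */
}

/* Interelement: splat copies element n of a into every result position. */
static void splat_byte(const uint8_t a[16], unsigned n, uint8_t out[16])
{
    assert(n < 16);
    uint8_t value = a[n];
    for (int i = 0; i < 16; i++)
        out[i] = value;
}

/* Interelement: merge interleaves the first eight bytes of a and b. */
static void merge_high_bytes(const uint8_t a[16], const uint8_t b[16],
                             uint8_t out[16])
{
    uint8_t result[16];   /* temporary, so out may alias a source */
    for (int i = 0; i < 8; i++) {
        result[2 * i]     = a[i];
        result[2 * i + 1] = b[i];
    }
    for (int i = 0; i < 16; i++)
        out[i] = result[i];
}
```

Only the last two functions need data to cross from one position to another, which is the kind of routing the permute sub-unit provides.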

In short, intraelement operations fetch, process, and store data elements without

disturbing the order of the data. Interelement operations fetch elements in any order,

process them, and store the results in any order. By organizing the instructions this

way, Motorola could use separate sub-units to implement the instructions. The

separate sub-units improve the parallel processing ability of the vector unit, thus

improving its throughput. AltiVec introduces 162 new instructions and four new data

types. These data types consist of packed data elements that occupy a 128-bit quantity

called a quadword. Quadwords represent the contents of memory or vector registers. As

shown in Figure 4, quadwords can be divided into vectors composed of 16 bytes (8
