Multiprocessing Systems
Volume Number: 12
Issue Number: 3
Column Tag: Performance Frontiers
A Look at Macintosh Multiprocessing
Three ways to build a “simultaneous screamer”.
By Jim Gochee, Contributing Editor for Performance Processing
Note: Source code files accompanying this article are located on the MacTech CD-ROM or
source code disks.
Information for this article was contributed by: Bruce Lawton, Emerson
Kennedy; Dr. Karsten Jeppesen, YARC Systems; and Chris Cooksey, DayStar Digital.
Introduction
Applications looking for more performance than a single-processor computer can
deliver often look to multiprocessing. Multiprocessing (MP) can take many forms,
from having multiple CPUs on a single motherboard, to plug-in accelerator cards, to a
network of machines. This article gives an overview of the multiprocessing options
available on the Macintosh today, which just got more interesting with the new Apple
Multiprocessor API. With this API, Apple has standardized multiprocessing for the
MacOS. However, as a developer looking for the ultimate in performance speedup, you
shouldn’t rule out other multiprocessing options just yet. For those of you who have
never considered making your application multiprocessor-aware, I would suggest
taking a good look at Apple’s Multiprocessor API. It is easy to use, runs under System
7 today, and is sure to have a sizable installed base of hardware that supports it.
Overview
Multiprocessing occurs when more than one compute engine is involved in solving a
task. These compute engines can be tightly coupled, as is the case with Symmetric
Multiprocessing (SMP), closely coupled, with Asymmetric Multiprocessing (AMP),
or loosely coupled, with Distributed Processing (DP). SMP systems have multiple
processors on the same system bus. The processors in these systems are
cache-coherent, which allows software running on any processor to share main
memory and other system resources with minimal extra support. AMP systems are
composed of multiple processors on a connected bus; however, the CPUs in this
configuration take on a master/client arrangement. Also, each CPU doesn’t necessarily
have access to the entire machine. A card plugged into an expansion slot would be a good
example of an AMP system. DP environments are composed of isolated compute engines
which exchange processing information over a local or wide area network.
Because of its flexibility and relatively low cost, SMP has become the standard
architecture for mainstream multiprocessing. Multitasking
operating systems can run processes on any CPU in a SMP system because each
processor has the same view of the machine. Several flavors of UNIX along with
Windows NT have been supporting SMP machines for a while, and with the introduction
of the Apple MP API, SMP is also the official Macintosh multiprocessing standard. The
Apple Multiprocessor API allows you to create MP tasks which are queued and run on
any available processor. If there are more tasks than processors, or if there is just
one processor, tasks are preemptively scheduled. The tasking model is a subset of the
Copland tasking model, which ensures seamless future compatibility. Coding to the
multiprocessor API signals the system that tasks should be run on multiple
processors; however, it is likely that Copland will support running non-MP aware
tasks on multiple processors as well.
One important consideration is that all of the multiprocessing solutions, as well
as Copland multitasking, have severe limits on what a task in these environments can
do. Preemptive tasks in any operating system can only access system routines which
are designed for reentrancy. Under Copland, preemptive tasks will have access to I/O,
memory management, and other kernel services. Therefore, MP tasks running under
Copland will also have access to these services. However, under System 7, MP tasks
cannot call any part of the MacOS. This may sound odd because there are parts of the
MacOS under System 7 that are reentrant, i.e. anything that you can call from
interrupt handlers. However, these calls contain 68k code, and reentrancy within 68k
code isn’t guaranteed by Apple in the current or future implementations of the MacOS.
So for now, MP tasks running under System 7 will be limited to scanning and
processing shared memory.
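To make that constraint concrete, here is a minimal sketch of what such a task can look like. The MyTaskData layout, the field names, and the squaring loop are illustrative inventions, not part of Apple's headers; stand-in typedefs for OSStatus and noErr are included only so the fragment is self-contained.

```c
/* Stand-ins for the Mac OS definitions normally supplied by
   <Types.h>; included here only so the sketch is self-contained. */
typedef long OSStatus;
enum { noErr = 0 };

/* Hypothetical shared block prepared by the application before the
   task starts; the task touches only this memory. */
typedef struct {
    float *input;    /* data prepared by the application */
    float *output;   /* results written back by the task */
    long   count;    /* number of elements to process    */
} MyTaskData;

/* Task entry point: no Toolbox calls, no Memory Manager, no I/O --
   nothing but scanning and processing shared memory. */
OSStatus MyComputeTask( void *parameter )
{
    MyTaskData *data = (MyTaskData *) parameter;
    long        i;

    for( i = 0; i < data->count; i++ )
        data->output[i] = data->input[i] * data->input[i];

    return noErr;
}
```

Everything the task needs is handed to it through the parameter block before it starts; nothing is requested from the system while it runs.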
Vendor Section
As a software developer looking for more performance, it is important to understand
what kind of multiprocessing is available and what flavor is appropriate for your
application. There are three major MP vendors for the Macintosh market: DayStar
Digital, with their Apple-compliant SMP hardware; YARC Systems, with high-speed
accelerator boards; and Emerson Kennedy, whose PowerTap package allows networked
distributed processing.
The DayStar/Apple combination is the newest, and in many ways the most
compelling, because of its simplicity, versatility, and compatibility with Copland.
DayStar did much of the design and implementation of the new API and library;
however, Apple now claims ownership for the code and guarantees its support in future
releases of the MacOS. Use of the library gives you access to SMP-compliant systems
under System 7 and Copland, while also allowing preemptive threads on uniprocessor
System 7 machines. This is something that wasn’t available with the old cooperatively
scheduled PowerPC threads package. However, the SMP architecture with tightly
coupled processors sharing the same system bus will hinder applications that are
bottlenecked on memory access.
YARC Systems has a solution for this with NuBus- and PCI-based accelerator
cards that have onboard PowerPC processors and fast local RAM. If your application is
extremely CPU intensive and you have access to a network of Macintoshes, you will
also want to look at PowerTap, a software package from Emerson Kennedy that allows
an application to tap into networked CPU resources. While YARC and PowerTap won’t
accelerate applications written to the Apple MP API, both vendors plan to internally
leverage off of the Apple MP API in order to take advantage of multiprocessing on the
host machine.
The three main vendors of Macintosh MP products have supplied sections describing
their products in more detail. Each section contains an overview, a sample fractal
algorithm coded to the vendor's API, and a short section on the cost of the product.
DayStar Digital
Overview
DayStar’s new MP systems are standard Macintoshes, with one major exception:
they contain more than one CPU. The Apple MP API, which was designed in conjunction
with DayStar, defines a set of services that allows developers to create and
communicate with multiple elements of execution called “tasks”. When tasks are run
on a multiprocessor system they are scheduled and run simultaneously on all the
available processors.
Task creation is accomplished by providing a pointer to a function already defined
within existing application code. The most obvious advantage of this approach is that
you can use existing tools and build processes to construct an MP-aware application.
No special compilers or packaging of the task code are required. Tasks have complete
access to all the memory in the system. If an application has retrieved and prepared
data for processing it can simply tell the tasks where the data is. It is not necessary to
move any data to specialized task-only memory, thus avoiding expensive transactions
over system busses.
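As a hedged sketch of what this looks like against Multiprocessing.h (MyComputeTask and myData are illustrative names; a stack size of zero requests the default):

```c
OSStatus   err;
MPTaskID   taskID;
MPQueueID  notifyQueue;

/* Queue on which the system posts a message when the task ends */
err = MPCreateQueue( &notifyQueue );

/* Point the task at an ordinary function already linked into the
   application; no special compiler or packaging is involved. */
if( err == noErr )
    err = MPCreateTask( MyComputeTask,  /* existing function      */
                        &myData,        /* pointer to shared data */
                        0,              /* 0 = default stack size */
                        notifyQueue,    /* termination queue      */
                        NULL, NULL,     /* termination parameters */
                        0,              /* task options           */
                        &taskID );
```

Because the entry point is just a function pointer, the same routine can be debugged single-threaded before it is ever handed to MPCreateTask().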
According to the Apple MP API specification, the processors in an MP system must
be cache-coherent. This means that the developer need not be concerned with the
possibility that data stored in the cache of one processor has not yet been written to
main memory. If any other processor accesses that memory, the MP hardware will
automatically ensure that the cached value is retrieved, rather than the stale value in
main memory. The MP API's assumption of cache coherency makes programming
significantly easier; programming non-cache-coherent systems is far more
error-prone and is not for the faint of heart.
Tasks run preemptively on all systems, including those with a single processor.
If an application is willing to require the presence of PowerPC hardware and the
shared library that provides the MP API services, the creation of MP-aware
applications can be greatly simplified. The application simply creates tasks and
distributes the work accordingly. The tasks created could do all the work while the
application checks for user events and controls the flow of data. The MP API is Apple
system software. It will be carried forward into Copland and is in fact a subset of the
Copland tasking model.
Even though tasks and applications share the same memory, it is very important
that they communicate, at least initially, via one of the three communication
primitives provided: message queues, semaphores, and critical regions. Communicating
via these primitives ensures that all of the sender's prior memory accesses are
completed before the recipient starts using those locations, i.e., that shared
resources are accessed atomically. The primitives also give a task a way to yield
time when it must wait for something that is not yet available.
Task Communication
There are three main inter-task communication mechanisms. The first is the
message queue: a first-in-first-out queue of 96-bit messages.
Messages are useful for telling a task what work to do and where to look for
information relevant to the request being made, such as a pointer into main memory.
They are also useful for indicating that a given request has been processed, and, if
necessary, what the results are. Message queues incur more overhead than the other
two communication primitives. If you cannot avoid frequent synchronization, at least
try to use a semaphore instead of a message queue.
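A hedged sketch of the queue calls involved (the message contents, kCommandCode, dataPtr, and count are illustrative; kDurationForever blocks until a message arrives):

```c
MPQueueID  queue;
void      *p1, *p2, *p3;

MPCreateQueue( &queue );

/* Sender: post a 96-bit message -- three 32-bit values, e.g. a
   command code, a data pointer, and a count (illustrative). */
MPNotifyQueue( queue, (void *) kCommandCode, dataPtr, (void *) count );

/* Receiver: block until a message arrives */
MPWaitOnQueue( queue, &p1, &p2, &p3, kDurationForever );
```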
Semaphores store a value between 0 and some arbitrary positive integer value.
The value in a semaphore can be raised and lowered, but never below 0 and never above
the semaphore’s maximum value. Semaphores are useful for keeping track of how
many occurrences of a particular thing are available for use. Binary semaphores,
which have a maximum value of 1, are especially efficient mechanisms for indicating
to some other task that something is ready. When a task or application has finished
preparing data at some previously agreed-upon location, it raises the value of a binary
semaphore, which the target task can be awaiting. The target task lowers the value of
the semaphore, performs any necessary processing, and raises the value of a different
binary semaphore to indicate that it is done with the data. This handshake can replace
the message-queue pairs described above in the “Divide And Conquer” technique.
MPCreateBinarySemaphore() is a macro that simplifies the creation of binary
semaphores.
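The two-semaphore handshake just described might be sketched like this (PrepareSharedData and ProcessSharedData are hypothetical helpers standing in for real work; the two halves run concurrently in the application and in the task):

```c
MPSemaphoreID  dataReady, dataDone;

/* Maximum value 1, initial value 0: nothing is ready yet */
MPCreateSemaphore( 1, 0, &dataReady );
MPCreateSemaphore( 1, 0, &dataDone );

/* Application side: prepare data at the agreed-upon location,
   raise dataReady, then await completion. */
PrepareSharedData();
MPSignalSemaphore( dataReady );
MPWaitOnSemaphore( dataDone, kDurationForever );

/* Task side (runs concurrently): await the data, process it,
   then signal completion. */
MPWaitOnSemaphore( dataReady, kDurationForever );
ProcessSharedData();
MPSignalSemaphore( dataDone );
```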
Critical regions are used to ensure that no more than one task (or the
application) is executing a given “region” of code at any given time. For example, if
part of a task’s job is to search a tree and modify it before proceeding with its primary
work, then if multiple tasks were allowed to search and try to modify the tree at the
same time, the tree would quickly become corrupted. An easy way to avoid the problem
is to form a critical region around the tree searching and modification code. When a
task tries to enter the critical region, it will be able to do so only if no other task is
currently in it, thus preserving the integrity of the tree.
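The tree example might be guarded like this (SearchAndModifyTree is a hypothetical stand-in for the tree-searching and modification code):

```c
MPCriticalRegionID  treeRegion;

MPCreateCriticalRegion( &treeRegion );

/* In each task: at most one task at a time may run the tree code */
MPEnterCriticalRegion( treeRegion, kDurationForever );
SearchAndModifyTree();
MPExitCriticalRegion( treeRegion );
```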
Cost
The cost of the DayStar Genesis system, which comes with four 604 processors, a
minimum of 16MB of RAM, and a 1GB hard drive, will range from $10,000 to $15,000.
Sample Code
The sample code uses two queues as the communication mechanism between tasks.
Each task has a receive queue for messages from the application, and the application
has a global queue for messages from the tasks. When work is being done by the tasks,
the front end could either block on its queue, or poll the queue and call
WaitNextEvent(). When a task finishes a segment of the fractal image, it sends the
results back to the front end and blocks on its queue for another segment to process.
err = noErr;
if( !MPLibraryIsLoaded() )	/* Check that the MP library is present */
	err = 1;

/* Check that the library is compatible with our header */
if( (err == noErr) && !MPLibraryIsCompatible() )
	err = 1;

if( err == noErr )
	numProcessors = MPProcessors();
else
	numProcessors = 1;	/* Only use the host processor */

/* Allocate memory for each processor (each task) */
gTaskData = (TaskData *) NewPtrClear(
	numProcessors * sizeof(TaskData) );
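The listing continues in the accompanying sources. As a hedged sketch of the next steps the text describes (FractalTask, the requestQueue and taskID fields, and gResultQueue are illustrative names, not necessarily those in the actual sources), the setup would create one receive queue per task, one global result queue, and then the tasks themselves:

```c
/* One global queue for results coming back from the tasks */
err = MPCreateQueue( &gResultQueue );

/* One receive queue and one task per processor */
for( i = 0; i < numProcessors && err == noErr; i++ )
{
    err = MPCreateQueue( &gTaskData[i].requestQueue );
    if( err == noErr )
        err = MPCreateTask( FractalTask,       /* entry point       */
                            &gTaskData[i],     /* per-task data     */
                            0,                 /* default stack     */
                            gResultQueue,      /* termination queue */
                            NULL, NULL, 0,
                            &gTaskData[i].taskID );
}
```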