Multiprocessing Systems
Volume Number: 12
Issue Number: 3
Column Tag: Performance Frontiers
A Look at Macintosh Multiprocessing
Three ways to build a “simultaneous screamer”.
By Jim Gochee, Contributing Editor for Performance Processing
Note: Source code files accompanying this article are located on the MacTech CD-ROM or
source code disks.
Information for this article was contributed by: Bruce Lawton, Emerson
Kennedy; Dr. Karsten Jeppesen, YARC Systems; and Chris Cooksey, DayStar Digital.
Introduction
Applications looking for more performance than a single-processor computer can
deliver often look to multiprocessing. Multiprocessing (MP) can take many forms,
from having multiple CPUs on a single motherboard, to plug-in accelerator cards, to a
network of machines. This article gives an overview of the multiprocessing options
available on the Macintosh today, which just got more interesting with the new Apple
Multiprocessor API. With this API, Apple has standardized multiprocessing for the
MacOS. However, as a developer looking for the ultimate in performance speedup, you
shouldn’t rule out other multiprocessing options just yet. For those of you who have
never considered making your application multiprocessor-aware, I would suggest
taking a good look at Apple’s Multiprocessor API. It is easy to use, runs under System
7 today, and is sure to have a sizable installed base of hardware that supports it.
Overview
Multiprocessing occurs when more than one compute engine is involved in solving a
task. These compute engines can be tightly coupled, as is the case with Symmetric
Multiprocessing (SMP), closely coupled, with Asymmetric Multiprocessing (AMP),
or loosely coupled, with Distributed Processing (DP). SMP systems have multiple
processors on the same system bus. The processors in these systems are
cache-coherent, which allows software running on any processor to share main
memory and other system resources with minimal extra support. AMP systems are
composed of multiple processors on a connected bus; however, the CPUs in this
configuration take on a master/client arrangement. Also, each CPU doesn’t necessarily
have access to the entire machine. A card plugged into an expansion slot would be a good
example of an AMP system. DP environments are composed of isolated compute engines
which exchange processing information over a local or wide area network.
Because of its flexibility and relatively low cost, SMP has become the standard
architecture for mainstream multiprocessing. Multitasking
operating systems can run processes on any CPU in a SMP system because each
processor has the same view of the machine. Several flavors of UNIX along with
Windows NT have been supporting SMP machines for a while, and with the introduction
of the Apple MP API, SMP is also the official Macintosh multiprocessing standard. The
Apple Multiprocessor API allows you to create MP tasks which are queued and run on
any available processor. If there are more tasks than processors, or if there is just
one processor, tasks are preemptively scheduled. The tasking model is a subset of the
Copland tasking model, which ensures seamless future compatibility. Coding to the
multiprocessor API signals the system that tasks should be run on multiple
processors; however, it is likely that Copland will support running non-MP aware
tasks on multiple processors as well.
One important consideration is that all of the multiprocessing solutions, as well
as Copland multitasking, have severe limits on what a task in these environments can
do. Preemptive tasks in any operating system can only access system routines which
are designed for reentrancy. Under Copland, preemptive tasks will have access to I/O,
memory management, and other kernel services. Therefore, MP tasks running under
Copland will also have access to these services. However, under System 7, MP tasks
cannot call any part of the MacOS. This may sound odd because there are parts of the
MacOS under System 7 that are reentrant, i.e. anything that you can call from
interrupt handlers. However, these calls contain 68k code, and reentrancy within 68k
code isn’t guaranteed by Apple in the current or future implementations of the MacOS.
So for now, MP tasks running under System 7 will be limited to scanning and
processing shared memory.
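To make that constraint concrete, here is a minimal sketch of what such a task can look like. The MyTaskData layout, the field names, and the squaring loop are illustrative inventions, not part of Apple's headers; stand-in typedefs for OSStatus and noErr are included only so the fragment is self-contained.

```c
/* Stand-ins for the Mac OS definitions normally supplied by
   <Types.h>; included here only so the sketch is self-contained. */
typedef long OSStatus;
enum { noErr = 0 };

/* Hypothetical shared block prepared by the application before the
   task starts; the task touches only this memory. */
typedef struct {
    float *input;    /* data prepared by the application */
    float *output;   /* results written back by the task */
    long   count;    /* number of elements to process    */
} MyTaskData;

/* Task entry point: no Toolbox calls, no Memory Manager, no I/O --
   nothing but scanning and processing shared memory. */
OSStatus MyComputeTask( void *parameter )
{
    MyTaskData *data = (MyTaskData *) parameter;
    long        i;

    for( i = 0; i < data->count; i++ )
        data->output[i] = data->input[i] * data->input[i];

    return noErr;
}
```

Everything the task needs is handed to it through the parameter block before it starts; nothing is requested from the system while it runs.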
Vendor Section
As a software developer looking for more performance, it is important to understand
what kind of multiprocessing is available and what flavor is appropriate for your
application. There are three major MP vendors for the Macintosh market: DayStar
Digital, with their Apple-compliant SMP hardware; YARC Systems, with high-speed
accelerator boards; and Emerson Kennedy, whose PowerTap package allows networked
distributed processing.
The DayStar/Apple combination is the newest, and in many ways the most
compelling, because of its simplicity, versatility, and compatibility with Copland.
DayStar did much of the design and implementation of the new API and library;
however, Apple now claims ownership for the code and guarantees its support in future
releases of the MacOS. Use of the library gives you access to SMP-compliant systems
under System 7 and Copland, while also allowing preemptive threads on uniprocessor
System 7 machines. This is something that wasn’t available with the old cooperatively
scheduled PowerPC threads package. However, the SMP architecture with tightly
coupled processors sharing the same system bus will hinder applications that are
bottlenecked on memory access.
YARC Systems has a solution for this with NuBus- and PCI-based accelerator
cards that have onboard PowerPC processors and fast local RAM. If your application is
extremely CPU intensive and you have access to a network of Macintoshes, you will
also want to look at PowerTap, a software package from Emerson Kennedy that allows
an application to tap into networked CPU resources. While YARC and PowerTap won’t
accelerate applications written to the Apple MP API, both vendors plan to internally
leverage off of the Apple MP API in order to take advantage of multiprocessing on the
host machine.
The three main vendors of Macintosh MP products have supplied sections describing
their products in more detail. Each section contains an overview, a sample fractal
algorithm coded to the vendor's API, and a short section on the cost of the product.
DayStar Digital
Overview
DayStar’s new MP systems are standard Macintoshes, with one major exception:
they contain more than one CPU. The Apple MP API, which was designed in conjunction
with DayStar, defines a set of services that allows developers to create and
communicate with multiple elements of execution called “tasks”. When tasks are run
on a multiprocessor system they are scheduled and run simultaneously on all the
available processors.
Task creation is accomplished by providing a pointer to a function already defined
within existing application code. The most obvious advantage of this approach is that
you can use existing tools and build processes to construct an MP-aware application.
No special compilers or packaging of the task code are required. Tasks have complete
access to all the memory in the system. If an application has retrieved and prepared
data for processing it can simply tell the tasks where the data is. It is not necessary to
move any data to specialized task-only memory, thus avoiding expensive transactions
over system busses.
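As a hedged sketch of what this looks like against Multiprocessing.h (MyComputeTask and myData are illustrative names; a stack size of zero requests the default):

```c
OSStatus   err;
MPTaskID   taskID;
MPQueueID  notifyQueue;

/* Queue on which the system posts a message when the task ends */
err = MPCreateQueue( &notifyQueue );

/* Point the task at an ordinary function already linked into the
   application; no special compiler or packaging is involved. */
if( err == noErr )
    err = MPCreateTask( MyComputeTask,  /* existing function      */
                        &myData,        /* pointer to shared data */
                        0,              /* 0 = default stack size */
                        notifyQueue,    /* termination queue      */
                        NULL, NULL,     /* termination parameters */
                        0,              /* task options           */
                        &taskID );
```

Because the entry point is just a function pointer, the same routine can be debugged single-threaded before it is ever handed to MPCreateTask().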
According to the Apple MP API specification, the processors in an MP system must
be cache-coherent. This means that the developer need not be concerned with the
possibility that data stored in the cache of one processor has not yet been written to
main memory. If any other processor accesses that memory, the MP hardware will
automatically ensure that the cached value is retrieved, rather than the stale value in
main memory. The MP API's assumption of cache coherency makes programming
significantly easier; programming non-cache-coherent systems is far more
error-prone and is not for the faint of heart.
Tasks run preemptively on all systems, including those with a single processor.
If an application is willing to require the presence of PowerPC hardware and the
shared library that provides the MP API services, the creation of MP-aware
applications can be greatly simplified. The application simply creates tasks and
distributes the work accordingly. The tasks created could do all the work while the
application checks for user events and controls the flow of data. The MP API is Apple
system software. It will be carried forward into Copland and is in fact a subset of the
Copland tasking model.
Even though tasks and applications share the same memory, it is very important
that they communicate, at least initially, via one of the three communication
primitives provided: message queues, semaphores, and critical regions. Communicating
via these primitives ensures that all of the sender's prior memory accesses are
completed before the recipient starts using those locations, i.e., that shared
resources are accessed atomically. The primitives also give a task a way to yield
time when it must wait for something that is not yet available.
Task Communication
There are three main inter-task communication mechanisms. The first is the
message queue: a first-in-first-out queue of 96-bit messages.
Messages are useful for telling a task what work to do and where to look for
information relevant to the request being made, such as a pointer into main memory.
They are also useful for indicating that a given request has been processed, and, if
necessary, what the results are. Message queues incur more overhead than the other
two communication primitives. If you cannot avoid frequent synchronization, at least
try to use a semaphore instead of a message queue.
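A hedged sketch of the queue calls involved (the message contents, kCommandCode, dataPtr, and count are illustrative; kDurationForever blocks until a message arrives):

```c
MPQueueID  queue;
void      *p1, *p2, *p3;

MPCreateQueue( &queue );

/* Sender: post a 96-bit message -- three 32-bit values, e.g. a
   command code, a data pointer, and a count (illustrative). */
MPNotifyQueue( queue, (void *) kCommandCode, dataPtr, (void *) count );

/* Receiver: block until a message arrives */
MPWaitOnQueue( queue, &p1, &p2, &p3, kDurationForever );
```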
Semaphores store a value between 0 and some arbitrary positive integer value.
The value in a semaphore can be raised and lowered, but never below 0 and never above
the semaphore’s maximum value. Semaphores are useful for keeping track of how
many occurrences of a particular thing are available for use. Binary semaphores,
which have a maximum value of 1, are especially efficient mechanisms for indicating
to some other task that something is ready. When a task or application has finished
preparing data at some previously agreed-upon location, it raises the value of a binary
semaphore, which the target task can be awaiting. The target task lowers the value of
the semaphore, performs any necessary processing, and raises the value of a different
binary semaphore to indicate that it is done with the data. This handshake can replace
the message-queue pairs described above in the “Divide And Conquer” technique.
MPCreateBinarySemaphore() is a macro that simplifies the creation of binary
semaphores.
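The two-semaphore handshake just described might be sketched like this (PrepareSharedData and ProcessSharedData are hypothetical helpers standing in for real work; the two halves run concurrently in the application and in the task):

```c
MPSemaphoreID  dataReady, dataDone;

/* Maximum value 1, initial value 0: nothing is ready yet */
MPCreateSemaphore( 1, 0, &dataReady );
MPCreateSemaphore( 1, 0, &dataDone );

/* Application side: prepare data at the agreed-upon location,
   raise dataReady, then await completion. */
PrepareSharedData();
MPSignalSemaphore( dataReady );
MPWaitOnSemaphore( dataDone, kDurationForever );

/* Task side (runs concurrently): await the data, process it,
   then signal completion. */
MPWaitOnSemaphore( dataReady, kDurationForever );
ProcessSharedData();
MPSignalSemaphore( dataDone );
```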
Critical regions are used to ensure that no more than one task (or the
application) is executing a given “region” of code at any given time. For example, if
part of a task’s job is to search a tree and modify it before proceeding with its primary
work, then if multiple tasks were allowed to search and try to modify the tree at the
same time, the tree would quickly become corrupted. An easy way to avoid the problem
is to form a critical region around the tree searching and modification code. When a
task tries to enter the critical region, it will be able to do so only if no other task is
currently in it, thus preserving the integrity of the tree.
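The tree example might be guarded like this (SearchAndModifyTree is a hypothetical stand-in for the tree-searching and modification code):

```c
MPCriticalRegionID  treeRegion;

MPCreateCriticalRegion( &treeRegion );

/* In each task: at most one task at a time may run the tree code */
MPEnterCriticalRegion( treeRegion, kDurationForever );
SearchAndModifyTree();
MPExitCriticalRegion( treeRegion );
```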
Cost
The cost of the DayStar Genesis system, which comes with four 604 processors, a
minimum of 16MB of RAM, and a 1GB hard drive, will range from $10,000 to $15,000.
Sample Code
The sample code uses two queues as the communication mechanism between tasks.
Each task has a receive queue for messages from the application, and the application
has a global queue for messages from the tasks. When work is being done by the tasks,
the front end could either block on its queue, or poll the queue and call
WaitNextEvent(). When a task finishes a segment of the fractal image, it sends the
results back to the front end and blocks on its queue for another segment to process.
err = noErr;
if( !MPLibraryIsLoaded() )	/* Check that the MP library is present */
	err = 1;

/* Check that the library is compatible with our header */
if( (err == noErr) && !MPLibraryIsCompatible() )
	err = 1;

if( err == noErr )
	numProcessors = MPProcessors();
else
	numProcessors = 1;	/* Only use the host processor */

/* Allocate memory for each processor (each task) */
gTaskData = (TaskData *) NewPtrClear(
	numProcessors * sizeof(TaskData) );
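The listing continues in the accompanying sources. As a hedged sketch of the next steps the text describes (FractalTask, the requestQueue and taskID fields, and gResultQueue are illustrative names, not necessarily those in the actual sources), the setup would create one receive queue per task, one global result queue, and then the tasks themselves:

```c
/* One global queue for results coming back from the tasks */
err = MPCreateQueue( &gResultQueue );

/* One receive queue and one task per processor */
for( i = 0; i < numProcessors && err == noErr; i++ )
{
    err = MPCreateQueue( &gTaskData[i].requestQueue );
    if( err == noErr )
        err = MPCreateTask( FractalTask,       /* entry point       */
                            &gTaskData[i],     /* per-task data     */
                            0,                 /* default stack     */
                            gResultQueue,      /* termination queue */
                            NULL, NULL, 0,
                            &gTaskData[i].taskID );
}
```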