June 94 - Exploiting Graphics Speed on the Power Macintosh
KONSTANTIN OTHMER, SHANNON HOLLAND, AND
BRIAN COX
The new QuickDraw on the PowerPC platform substantially improves graphics
performance. A study comparing the performance of QuickDraw and custom blitters on
the Power Macintosh and 680x0-based machines provides information you can use to
ensure that the user benefits from those improvements. Further analysis, detailing
where CopyBits spends its time, leads to an implementation strategy for applications
that demand the fastest possible graphics.
Understanding the motivation for and consequences of the changes to QuickDraw on the
Power Macintosh can help you write faster applications. This article presents studies
that show QuickDraw as one of the most speed-critical parts of the Macintosh
Operating System together with studies that break down how applications spend CPU
time. Knowing how much time applications actually spend in various system routines
will help you develop a strategy for writing applications that perform well on both the
Power Macintosh and 680x0-based machines.
In porting QuickDraw to the PowerPC™ platform, Apple took advantage of the
opportunity to make some changes. We'll detail these changes and their consequences
for writing code. With that foundation, we'll move on to an in-depth discussion
comparing the QuickDraw CopyBits routine with custom blitters. The goal is to write
applications using routines that result in the fastest possible graphics performance on
both platforms -- PowerPC and 680x0 -- as well as on machines equipped with
graphics accelerators such as the new Apple Macintosh Display Card 24 AC. Sample
code on this issue's CD demonstrates a method of timing blitter routines so that your
application can use the fastest routine at run time.
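The idea of timing candidate routines and keeping the winner can be sketched in portable C. This is not the sample code from the CD: the BlitProc signature, the two candidate blitters, and the pick_fastest routine are hypothetical names made up for illustration, and the standard C clock() function stands in for whatever timing service a real Macintosh application would use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical blitter signature: copy `height` rows of `rowBytes` bytes
   between two buffers that share the same layout. */
typedef void (*BlitProc)(const unsigned char *src, unsigned char *dst,
                         size_t rowBytes, size_t height);

/* Candidate 1: move the whole image with a single memcpy. */
void blit_whole(const unsigned char *src, unsigned char *dst,
                size_t rowBytes, size_t height)
{
    memcpy(dst, src, rowBytes * height);
}

/* Candidate 2: move the image one row at a time. */
void blit_by_row(const unsigned char *src, unsigned char *dst,
                 size_t rowBytes, size_t height)
{
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * rowBytes, src + y * rowBytes, rowBytes);
}

/* Run one candidate `reps` times and return the elapsed clock ticks. */
clock_t time_blitter(BlitProc proc, const unsigned char *src,
                     unsigned char *dst, size_t rowBytes, size_t height,
                     int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        proc(src, dst, rowBytes, height);
    return clock() - start;
}

/* Time every candidate and return the index of the fastest one.
   Ties go to the earlier entry in the table. */
size_t pick_fastest(const BlitProc *procs, size_t count,
                    const unsigned char *src, unsigned char *dst,
                    size_t rowBytes, size_t height, int reps)
{
    assert(count > 0 && reps > 0);
    size_t best = 0;
    clock_t bestTime = time_blitter(procs[0], src, dst, rowBytes, height, reps);
    for (size_t i = 1; i < count; i++) {
        clock_t t = time_blitter(procs[i], src, dst, rowBytes, height, reps);
        if (t < bestTime) {
            bestTime = t;
            best = i;
        }
    }
    return best;
}
```

An application would call pick_fastest once at startup with a table of its candidates (one of which could simply wrap CopyBits) and then call through the winning entry. Timing with buffers of the size and depth the application really draws matters, since the winner for a small image may not be the winner for a large one.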
HOW SPEED-CRITICAL IS QUICKDRAW?
Most of the Macintosh Operating System is written in 680x0 assembly language. In
order to reach time-to-market goals for the Power Macintosh, Apple had to focus
porting efforts on the most speed-critical parts of the system, so a study was
conducted to profile system usage of several common applications. System usage
depended largely on the operations performed in particular applications, but many
applications showed similar patterns.
Figure 1 is based on a subset of the study. It turns out that most applications spend
from 50% to 95% of their time in system code, with many spending more than 80%.
Figure 2 shows the percentage of total CPU time spent in the most frequently called
system routines for typical applications and for a pointer-based application (one that
avoids using handles).
Figure 1. CPU time breakdown: application versus system
Figure 2. System routine usage
The data made it clear that QuickDraw was one of the most critical components of
Apple's porting efforts. This article discusses QuickDraw version 1.3.5, which was
developed to run on the PowerPC platform. The new QuickDraw is based on QuickDraw
version 1.3.0, the most recent version of QuickDraw running on the Macintosh
Quadra, but with some changes (see the section "What's Different With Version
1.3.5?"). The new version, written in C, was compiled for the Power Macintosh as
QuickDraw version 1.3.5 and shipped with the new machines. The new QuickDraw C
code can also be compiled for 680x0-based machines and will be available in future
software releases.
The graphics speed comparisons made in this article compare the following:
• QuickDraw version 1.3.0 or other 680x0 code running on a 680x0-based
Macintosh (usually a Macintosh Quadra)
• QuickDraw version 1.3.0 or other 680x0 code running through the
emulator on a Power Macintosh
• QuickDraw version 1.3.5 or other PowerPC code running on a Power
Macintosh
TAKING ADVANTAGE OF THE SPEED
Figure 3 compares times of various QuickDraw routines for version 1.3.0 running on
a Macintosh Quadra and version 1.3.5 running on a Power Macintosh -- there's no
question that the new QuickDraw routines run faster. However, published surveys
comparing the speed of 680x0-based machines to the Power Macintosh haven't always
shown the dramatic results indicated by Figure 3. This is partly because some
operations speed up more than others, so overall speed varies depending on which
operations an application uses heavily. A second important factor is that the
applications surveyed are often emulated applications.
Figure 3. Comparing QuickDraw version 1.3.0 to version 1.3.5
Emulated applications are those written for 680x0-based machines that run through
the emulator on the Power Macintosh (see "Making the Leap to PowerPC," develop Issue
16). These applications don't benefit fully from the PowerPC platform: the system
code they call may run faster, but the application's own code is emulated. As a
result, an application that spends 80% of its time in system code on 680x0-based
machines spends a substantially larger share of its time in application code when
emulated on a Power Macintosh. In general, completely emulated application code
runs at about half the
speed of a Macintosh Quadra 700. Those same applications, when recompiled as
PowerPC code, usually run four or five times faster than on a Macintosh Quadra; code
that makes extensive use of floating point may be 20 or more times faster. However,
emulated graphics-intensive code, assuming it uses QuickDraw, is substantially faster
on a Power Macintosh than on a 680x0-based Macintosh because of the increased speed
of QuickDraw 1.3.5.
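The shift in where an emulated application spends its time can be worked out with a little arithmetic. The sketch below is a simplified model, not data from the study: it assumes all system time speeds up by a single factor (in reality only some system code is native) and all application time slows down by a single factor. The function and structure names are made up for illustration.

```c
#include <assert.h>

typedef struct {
    double total;        /* run time relative to a 680x0 run time of 1.0 */
    double appFraction;  /* share of that time spent in application code */
} EmulatedProfile;

/* systemFraction: share of time in system code on the 680x0 machine (0..1)
   appSlowdown:    how many times slower emulated application code runs
   systemSpeedup:  how many times faster the system code runs (an assumed
                   figure, not a measurement) */
EmulatedProfile estimate_emulated_profile(double systemFraction,
                                          double appSlowdown,
                                          double systemSpeedup)
{
    assert(systemFraction >= 0.0 && systemFraction <= 1.0);
    assert(appSlowdown > 0.0 && systemSpeedup > 0.0);

    double appTime = (1.0 - systemFraction) * appSlowdown;
    double sysTime = systemFraction / systemSpeedup;

    EmulatedProfile p;
    p.total = appTime + sysTime;
    p.appFraction = appTime / p.total;
    return p;
}
```

With systemFraction 0.8, appSlowdown 2.0 (the "half the speed" figure above), and an assumed systemSpeedup of 2.0, the application's share of the time grows from 20% to about 50%, and the total run time is about 0.8 of the original. The emulated application code becomes the bottleneck even though the system code it calls got faster.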
Clearly, to take full advantage of QuickDraw version 1.3.5, you need to write your
applications for the Power Macintosh in PowerPC code. Beyond that general strategy,
developing awesome applications for the PowerPC platform means figuring out how to
harness all that CPU power -- how to take advantage of the speed. For example, the
high speed of QuickDraw version 1.3.5 allows you to do high-quality animations.
Figure 3 shows that you can now do twice as many (or more) CopyBits operations per
second, which means that animations such as zooming, scrolling, and window dragging
(leave this one to Apple) can be done in real time without being chunky or annoying.
Text drawing is also much faster, so interactive word wrapping while positioning
objects in text is easy to do and looks better than it would on a 680x0-based
Macintosh. Overall, it's an open field for developers.
Tips for increasing the speed of PowerPC code are given in this issue's Balance of
Power column. *
Although this article focuses on QuickDraw, of course there are other, nongraphical,
ways of harnessing the power of the PowerPC processor. Floating point-intensive
applications benefit tremendously from the speed of the new processor.
The Graphing Calculator desk accessory that ships with the Power Macintosh
is an excellent example of harnessing CPU power for both the user interface and
computation-bound part of an application. As a floating point-intensive application,
Graphing Calculator benefits from the speed of the PowerPC processor. The user
interface has a number of nice touches, such as live scrolling, live zooming, and
interactive formula and graph manipulation. *
WHAT'S DIFFERENT WITH VERSION 1.3.5?
In the porting of QuickDraw to the PowerPC platform, many algorithms were
rethought and reimplemented. The result is slightly different (and we hope better!)
behavior. This section outlines some changes to keep in mind when you're writing code.
QDERROR
QuickDraw version 1.3.0 didn't do a very good job of setting and clearing QDError. In
version 1.3.5, every call sets QDError (which can cause problems for applications
that assume QDError will be preserved across most simple calls, like SetRect). In
some cases, version 1.3.0 jumps to SysError if there isn't enough memory; version
1.3.5 returns an error in QDError instead. This is usually an improvement, but it can
lead to strange behavior for applications that depend on SysError being invoked. For
example, some applications might put up a dialog asking the user to increase the
application partition size if QuickDraw invokes SysError. Since QuickDraw version
1.3.5 doesn't invoke SysError (returning a QDError instead), the application code
that puts up the dialog isn't triggered, so the user doesn't know to increase the memory
and the application might fail by not drawing anything. In choosing to always set
QDError, Apple chose the lesser of two evils.
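The practical consequence is to read the error right after the call you care about. The sketch below models that with stand-in routines so that it runs anywhere: fake_draw, fake_simple_call, and fake_qd_error are hypothetical and are not Toolbox routines. They only imitate the "every call sets the error" behavior described above, and -108 is used because it's the Memory Manager's memFullErr value.

```c
#include <assert.h>

/* Stand-ins for illustration only; NOT Toolbox routines. */
static short gFakeQDError = 0;

short fake_qd_error(void)
{
    return gFakeQDError;
}

/* Models a drawing call that fails when memory is short. */
void fake_draw(int enoughMemory)
{
    gFakeQDError = enoughMemory ? 0 : -108;  /* -108 is memFullErr */
}

/* Models a simple call that, like every call in the behavior described
   above, also sets the error (here, to "no error"). */
void fake_simple_call(void)
{
    gFakeQDError = 0;
}

/* Fragile: assumes the error survives the simple call in between. */
short draw_then_check_late(int enoughMemory)
{
    fake_draw(enoughMemory);
    fake_simple_call();
    return fake_qd_error();   /* the drawing error has been overwritten */
}

/* Robust: captures the error before making any other call. */
short draw_then_check_immediately(int enoughMemory)
{
    fake_draw(enoughMemory);
    short err = fake_qd_error();
    fake_simple_call();
    return err;
}
```

In this model draw_then_check_late(0) reports no error even though the drawing call failed, while draw_then_check_immediately(0) reports -108. An application that checks promptly can also put up its own "increase the partition size" dialog instead of depending on SysError.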
MATCHING COLOR TABLES
QuickDraw version 1.3.0 uses the color table of the pixMap for the current GDevice,
not the color table of the destination pixMap, to map colors to the destination pixMap.
QuickDraw version 1.3.5 sets up a surrogate GDevice to make sure that the
destination pixMap's and the GDevice's color tables always match. This may cause
problems for applications that relied on undefined behavior when the color tables
didn't match or for applications that were getting the right results by luck under
QuickDraw version 1.3.0. Again, Apple chose the lesser of two evils, and added the
surrogate device (known as the skank device). When QuickDraw is forced to set up the
skank device, the application pays a slight performance penalty. Also, if you do
operations such as index-to-color when your color tables don't match, and then later
use that color in a drawing, you won't necessarily draw with the index you expect. The
easiest cure: use GWorlds!
For more information on QDError, GDevices, pixMaps, and color tables, see
Inside Macintosh: Imaging With QuickDraw or Inside Macintosh Volume V. *
TRANSFER MODES
There's no way to specify the transfer space (the bit depth at which transfer occurs)
when using transfer modes in QuickDraw. (QuickDraw GX remedies this shortcoming.)
So if you're using an arithmetic mode from 8-bit to 16-bit, there are no guarantees
whether the transfer will occur at 5 bits per component (16-bit), 8 bits per
component (32-bit), or 16 bits per component (as in the 8-bit color table). It turns
out that most arithmetic modes in QuickDraw version 1.3.0 perform the transfer
operation at a resolution of 16 bits per color, while version 1.3.5 does most
operations at a resolution of 8 bits per color. This sometimes causes slight cosmetic
differences.
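To see why the transfer resolution matters, consider averaging two 8-bit color components at 5, 8, and 16 bits per component. The sketch below is an illustration only: the conversion and rounding rules are assumptions chosen for simplicity, not QuickDraw's actual arithmetic, and the function names are made up.

```c
#include <assert.h>

/* Convert an 8-bit component to the working resolution (5, 8, or 16 bits).
   Widening replicates the byte; narrowing drops the low bits. */
unsigned from_8bit(unsigned v8, unsigned bits)
{
    switch (bits) {
    case 5:  return v8 >> 3;
    case 8:  return v8;
    default: return v8 * 0x0101u;   /* 16 bits */
    }
}

/* Convert a component at the working resolution back to 8 bits. */
unsigned to_8bit(unsigned v, unsigned bits)
{
    switch (bits) {
    case 5:  return (v << 3) | (v >> 2);
    case 8:  return v;
    default: return v >> 8;         /* 16 bits */
    }
}

/* Average two 8-bit components, doing the arithmetic at `bits` per
   component, and return the result as an 8-bit component. */
unsigned average_component(unsigned a8, unsigned b8, unsigned bits)
{
    assert(bits == 5 || bits == 8 || bits == 16);
    assert(a8 <= 255 && b8 <= 255);

    unsigned a = from_8bit(a8, bits);
    unsigned b = from_8bit(b8, bits);
    return to_8bit((a + b) / 2, bits);
}
```

Averaging the components 100 and 201 this way gives 148 at 5 bits, 150 at 8 bits, and 151 at 16 bits. The results differ by a level or two, which is the kind of slight cosmetic difference you might see between the two QuickDraw versions.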
DITHERING
The dithering algorithm in QuickDraw version 1.3.5 is slightly different. This makes
it a nightmare to programmatically determine whether version 1.3.5 is generating the
same results as version 1.3.0, but visually the results are nearly identical.
STRETCHING AND SHRINKING IMAGES
The way CopyBits stretches and shrinks images for nonintegral ratios has been
improved in QuickDraw version 1.3.5 (integral ratios still produce the same results).
The advantage of this new algorithm is that it's symmetrical: if you stretch an image
and then shrink it back to the original size, the same pixels that were replicated in the
stretch are combined in the shrink.
The disadvantage of the new algorithm is that some applications stretch or shrink
without knowing it (the classic off-by-one error, resulting in a destination rectangle
that's smaller or larger than the source rectangle by one pixel). Such applications
may now drop (or replicate) a different scan line. This can cause slight cosmetic
blemishes in some applications.
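The off-by-one case is easy to model. The sketch below maps destination rows back to source rows using two common sampling rules and reports which source row gets dropped in a shrink. Neither rule is claimed to be the one used by either QuickDraw version; they only show that a change of rule changes the dropped scan line. All names are hypothetical.

```c
#include <assert.h>

/* Maps a destination row to the source row it samples. */
typedef int (*RowMapProc)(int dstRow, int srcRows, int dstRows);

/* Rule A: sample at the top edge of each destination row. */
int src_row_top(int dstRow, int srcRows, int dstRows)
{
    return (int)((long)dstRow * srcRows / dstRows);
}

/* Rule B: sample at the center of each destination row. */
int src_row_center(int dstRow, int srcRows, int dstRows)
{
    return (int)((2L * dstRow + 1) * srcRows / (2L * dstRows));
}

/* For a shrink (dstRows <= srcRows), return the first source row that no
   destination row samples, or -1 if every source row is used. */
int first_dropped_row(RowMapProc map, int srcRows, int dstRows)
{
    assert(dstRows > 0 && dstRows <= srcRows);

    int next = 0;   /* the next source row we expect to see sampled */
    for (int d = 0; d < dstRows; d++) {
        int s = map(d, srcRows, dstRows);
        if (s > next)
            return next;
        next = s + 1;
    }
    return next < srcRows ? next : -1;
}
```

Shrinking 10 rows to 9 drops the last row (row 9) under rule A but a row in the middle (row 4) under rule B. An application that shrinks by one pixel without meaning to sees a different blemish depending on the rule, so the real fix is to make the source and destination rectangles the same size.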
UNEXPECTED REGISTER CONTENTS
Because QuickDraw version 1.3.5 runs PowerPC code, all emulated 680x0 registers
are preserved across calls. Thus, applications that expect volatile registers
(A0, A1, D0, D1, D2) to contain specific values on exit from a QuickDraw
call will break. (Conversely, don't rely on 680x0 registers being preserved, either!)
There's one exception: for compatibility with some existing applications, CopyBits
always sets D0 to 0.
PATCHING
Patching any QuickDraw version 1.3.5 routine with 680x0 code degrades performance
because of mode-switch overhead. A mode switch occurs when 680x0 code calls
PowerPC code, or vice versa. 680x0 patches on ShieldCursor are particularly
expensive because ShieldCursor is called by nearly every QuickDraw drawing routine.
For more information on the Mixed Mode Manager and mode switching, see
"Making the Leap to PowerPC" in develop Issue 16. *
DISABLED ACCELERATOR CARDS
QuickDraw version 1.3.0 makes calls through many low-level (undocumented)
vectors. Version 1.3.5 doesn't use these vectors, which disables the acceleration
provided by most accelerator cards. Of course, the frame buffer on these cards
continues to work.
THE COPYBITS/CUSTOM BLITTER RACE
A favorite developer sport is complaining about how slow CopyBits is and writing
custom blit loops to replace it. A favorite sport among QuickDraw engineers is working
all night trying to speed up some part of CopyBits. This competition is healthy so long
as speed-critical applications call the faster code.
"Blitter" informally refers to any routine that moves memory, usually visual
information to the screen or an off-screen buffer; the operation is called a "blit."
These terms derive from the PDP-10 block transfer instruction, BLT. *
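For readers who haven't written one, here's a minimal sketch of what a custom 8-bit blitter boils down to. It's portable C, not Macintosh code: the Pixels8 structure is a hypothetical stand-in that borrows only the base address and rowBytes ideas from a pixMap, and the routine does no clipping, color translation, or transfer modes, which is exactly the work a special-case blitter gets to skip.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of an 8-bit pixel buffer (not the Toolbox
   PixMap record). rowBytes may be larger than width because of padding. */
typedef struct {
    unsigned char *base;   /* address of the first pixel */
    size_t rowBytes;       /* bytes from one row to the next */
    size_t width;          /* pixels per row */
    size_t height;         /* number of rows */
} Pixels8;

/* Copy all of `src` into `dst` with its top-left corner at (dstX, dstY).
   The source must fit entirely inside the destination. */
void blit8(const Pixels8 *src, Pixels8 *dst, size_t dstX, size_t dstY)
{
    assert(dstX + src->width <= dst->width);
    assert(dstY + src->height <= dst->height);

    const unsigned char *s = src->base;
    unsigned char *d = dst->base + dstY * dst->rowBytes + dstX;

    for (size_t y = 0; y < src->height; y++) {
        memcpy(d, s, src->width);   /* one row of 8-bit pixels */
        s += src->rowBytes;         /* step by rowBytes, not width */
        d += dst->rowBytes;
    }
}
```

Stepping by rowBytes rather than by width is the detail that most often goes wrong in hand-written blit loops, since rows are frequently padded.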
Through the years, Apple engineers have yearned for a way to get a substantial lead in
the race with the speed-hungry special-case developer. The answer lies in the Power
Macintosh: raw 680x0 code runs substantially slower through the emulator, while
QuickDraw version 1.3.5 CopyBits takes advantage of the lightning-fast RISC
processor.
Figure 4 compares various ways of moving the memory used by an 8-bit, 32-by-32
pixMap and an 8-bit, 400-by-400 pixMap to the screen. BlockMove gives a baseline:
the typical amount of time needed to move that much raw memory. The 680x0 blitter
is a custom blitter written for 680x0-based machines and emulated on the Power
Macintosh. The PowerPC blitter is a custom blitter written for the Power Macintosh
(it can't be run on a 680x0 machine).
Figure 4. CopyBits versus custom blitters
As you can see, custom blitters running as native code beat QuickDraw's CopyBits for
the small image hands down on both 680x0-based machines and the Power Macintosh. (With the
small image the constant overhead of CopyBits has a big impact on the overall time.)
However, the 680x0 blitter is much slower than CopyBits on a Power Macintosh. This
is due to the overhead of emulation.
The interesting case is the custom PowerPC blitter versus CopyBits for the large
image on the Power Macintosh. Here CopyBits wins. This is due to optimizations that
CopyBits has for large images that the PowerPC blitter doesn't have. In this case,
CopyBits is also faster than BlockMove, because of optimizations in CopyBits for the
PowerPC processor's frame buffer (which has a 64-bit data path). BlockMove is