Emulation
Volume Number: 10
Issue Number: 9
Column Tag: MacHack Best Paper
Winner of Best Paper at MacHack ’94
The Technology of Emulation
68K on PowerPC
By Tom Pittman, Itty Bitty Computers
Note: Source code files accompanying this article are located on the MacTech CD-ROM or
source code disks.
About the Author
Tom Pittman - Even if you’ve never met Tom, you’ve almost certainly heard of
him. He’s the guy who asks all the questions at WWDC. Or you may have heard of his
HyperTalk compiler, CompileIt!, or his stand-alone shell for XCMDs, Double-XX, both
published by Heizer Software. Even if you still think you haven’t heard of him, you
may very well have used his book The Art of Compiler Design in school. Tom gave this
award-winning paper to a receptive crowd at this year’s MacHack.
He’s working on an as-yet unnamed new product for accelerating the PowerMac
emulator, based on the technology in this paper; probably to be released before this
issue hits the streets. For info, e-mail to Itty.Bitty@applelink.apple.com or
CompuServe [72457,2237]. Or send him a postcard at Itty Bitty Computers, P.O. Box
7278, Spreckels, CA 93962.
The technological aspects of processor emulation in software are discussed with
particular reference to efficient emulation of the 68000 family on the PowerPC 601.
Special attention is given to the instruction fetch & decode loop and to condition code
handling. A skeleton emulator is presented, with a statistical performance comparison
to Apple’s.
Introduction
It is well known that virtual machine (VM) emulation generally costs about one
order of magnitude in performance, as compared to native-code execution. The
published figures for Apple’s first-generation Power Macintosh are quite consistent
with this. In this paper we discuss how to get the best performance in VM emulation,
using the PowerPC as host and the 68000 as target. We also look briefly at the way
Apple did it and consider how they might improve it in future products.
Elements of Emulation
An interpreter consists of three essential parts. First is the native
representation of the VM state, including any registers that are accessible to the
emulated VM code. The 68000 family contains eight general data registers, eight
address registers (one of which is defined in the hardware as a stack pointer), a
program counter and a machine status register (mostly, for our purposes, the
condition codes) [1]. There are other registers controlling things like virtual
memory, but programmers are discouraged from accessing them directly so they can
be readily ignored. Similarly, some 68040 processors contain on-chip floating-point
hardware, but Apple explicitly declined to emulate this functionality, and it only
complicates our analysis without adding any understanding; we ignore this also. The
601 processor [2] has 32 general-purpose registers, of which all but perhaps three
or four are available for the emulator. The VM state is therefore easily consigned
entirely to 18 GP registers, leaving another ten for intermediate values required for
emulation.
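Those 18 words of state can be pictured as a C struct, though in the real emulator each field would be pinned to one of the 601’s general-purpose registers rather than kept in memory; the struct and its field names are purely illustrative:

```c
#include <stdint.h>

/* The 18 words of 68K machine state an emulator must track.  In practice
   each field lives in a dedicated host GP register; the struct is just a
   picture of the inventory. */
typedef struct M68KState {
    uint32_t d[8];    /* data registers D0-D7 */
    uint32_t a[8];    /* address registers A0-A7; A7 is the stack pointer */
    uint32_t pc;      /* program counter */
    uint32_t ccr;     /* condition codes (X N Z V C) from the status register */
} M68KState;
```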
The second essential part of the VM interpreter is the execution engine, a very
tight loop that fetches each successive instruction from the VM code and dispatches
execution to the specific code to emulate that instruction.
The third part is the specific routines for each supported VM instruction. Except
where the two architectures differ in their results and/or side effects, the actual code
in the third part will closely resemble the cognate code of the emulated machine. There
is also a lot of repetitious similarity in this third part, so apart from a few
representative examples here, we spend most of our time looking at the critical second
part, where most of the emulation overhead is spent.
The Execution Engine
The execution engine, also sometimes known as the Fetch Loop, may be further
subdivided into five elements:
1. Fetching the VM instruction,
2. Incrementing the VM program counter,
3. Extracting the operation code (normally abbreviated “opcode”) from the
instruction,
4. Using it to select and jump to the emulation code for that instruction, and
5. Returning to step 1 in the loop.
Depending on the time/space trade-off desired, the execution engine may also
decode the operand addressing modes and possibly fetch operands before jumping to the
specific emulation code.
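The five steps above can be given a minimal shape in C (the names and the full 65536-entry handler table are illustrative choices, not Apple’s; the table trades space for a decode-free dispatch):

```c
#include <stdint.h>

typedef struct VM VM;
typedef void (*OpHandler)(VM *vm, uint16_t op);

struct VM {
    uint32_t d[8], a[8];      /* 68K data and address registers */
    const uint8_t *mem;       /* emulated memory image */
    uint32_t pc;              /* virtual program counter */
    uint16_t ccr;             /* condition codes */
    int running;
};

static OpHandler dispatch[65536];   /* one entry per 16-bit opcode word */

void run(VM *vm) {
    while (vm->running) {
        /* steps 1-2: fetch the big-endian opcode word, bump the PC */
        uint16_t op = (uint16_t)((vm->mem[vm->pc] << 8) | vm->mem[vm->pc + 1]);
        vm->pc += 2;
        /* steps 3-4: the opcode word itself indexes the handler table;
           the handler's return is step 5, closing the loop */
        dispatch[op](vm, op);
    }
}

/* one concrete third-part routine: MOVEQ #imm,Dn (0111 rrr 0 dddddddd) */
static void op_moveq(VM *vm, uint16_t op) {
    vm->d[(op >> 9) & 7] = (uint32_t)(int32_t)(int8_t)(op & 0xFF);
}

static void op_halt(VM *vm, uint16_t op) { (void)op; vm->running = 0; }
```

Filling the table once at startup and letting every iteration be fetch, increment, load, and indirect call is as lean as the loop gets in C; the rest of this paper is about doing the same work in even fewer host cycles.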
In our particular context, we observe that each 68000 instruction is always
some multiple of two bytes, and the first two-byte piece uniquely defines both the
operation to be performed and (with two infrequent exceptions) how many more bytes
of operand address are required by this instruction. This forms a natural bound for
step 1: the VM fetch should get two bytes at a time. Both the 68000 and the 601 -
indeed most modern processors - can combine the fetch and increment of steps 1 and 2
in a single instruction. In the 68000, it is a post-increment operation, that is to say,
the fetch occurs on the unincremented address, then the register (whether the
program counter in the case of the hardware instruction fetch, or the address register
in the auto-increment addressing modes) is updated with the incremented address. In
the 601 and other PowerPC processors, the reverse is true (at least of registers used
in the auto-increment mode): the increment is added first, then the operand is accessed
with the new address. Therefore to combine the fetch and increment, the virtual
program counter must be emulated by an actual register that is always 2 less than the
virtual PC. This decision will also impact branch, jump, and return instructions, but
not by much.
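The 601’s load-with-update (lhzu) adds the displacement to the base register first and then loads, so the host register shadowing the virtual PC must sit 2 low. A sketch of the same update-then-load ordering in C (helper name hypothetical):

```c
#include <stdint.h>

/* Mimics PowerPC lhzu rOp,2(rPC): bump the base register first, then load
   the halfword (big-endian, as the 68K expects).  Because of this ordering,
   pc_m2 always holds (virtual PC - 2) between fetches. */
static uint16_t fetch_with_update(const uint8_t *mem, uint32_t *pc_m2) {
    *pc_m2 += 2;                                              /* update first... */
    return (uint16_t)((mem[*pc_m2] << 8) | mem[*pc_m2 + 1]);  /* ...then load */
}
```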
We can further improve performance by prefetching the next opcode before
decoding and executing the current one. Thus the memory latency for fetching each VM
instruction (which is often not in the primary on-chip cache) can be overlapped with
the execution of the previous instruction. This may cost extra time in the case of short
branches and returns, which must discard the prefetched but unused opcode following
the current instruction, and some extra complexity (but no extra cost) for all
sequence-control instructions in the VM, which must initiate a prefetch on the target
address. However, this is no more novel than instruction pipelining, which computer
hardware has been doing for years.
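The prefetch discipline might look like this in C, with a toy instruction set standing in for the real one (all names and opcodes here are invented for the sketch); note how a taken branch discards the stale prefetch and refills it from the target:

```c
#include <stdint.h>

typedef struct PVM {
    const uint8_t *mem;
    uint32_t pc;         /* virtual PC: address of the word after 'next' */
    uint16_t next;       /* the prefetched opcode word */
    uint32_t d0;         /* one toy register, enough for the sketch */
    int running;
} PVM;

static uint16_t pfetch(PVM *vm) {
    uint16_t w = (uint16_t)((vm->mem[vm->pc] << 8) | vm->mem[vm->pc + 1]);
    vm->pc += 2;
    return w;
}

/* sequence-control instructions discard the stale prefetch and initiate
   a fetch on the target address */
static void branch_to(PVM *vm, uint32_t target) {
    vm->pc = target;
    vm->next = pfetch(vm);
}

/* toy instruction set: 0x0001 = INC d0, 0x60nn = BRA nn, 0x4AFC = halt */
static void exec1(PVM *vm, uint16_t op) {
    if (op == 0x0001)            vm->d0++;
    else if ((op >> 8) == 0x60)  branch_to(vm, op & 0xFF);
    else if (op == 0x4AFC)       vm->running = 0;
}

void run_prefetch(PVM *vm) {
    vm->next = pfetch(vm);          /* prime the pipeline */
    while (vm->running) {
        uint16_t op = vm->next;
        vm->next = pfetch(vm);      /* overlap the fetch of the following
                                       instruction with executing 'op' */
        exec1(vm, op);
    }
}
```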
Extracting the opcode can be simple and fast, or complex and relatively slow (but
compact). Let’s consider the latter first. The major opcode of the 68000 family
instruction set is the most significant four bits (see Table 1). This also selects one of
two subdivisions of the remaining 12 bits. In two cases (opcodes 0110 and 0111), the
bits are divided 4+8, with the four-bit piece being used to select a register or a
condition to test (sort of a minor opcode, but not quite), while the lower byte is
immediate data. In most of the other cases the bits are divided into two 6-bit register
and addressing mode selectors, where one or both of these addressing mode fields often
doubles for a minor opcode - usually as determined by the upper four bits. Two of the
major opcodes were originally reserved for linkage to co-processors, but Apple
subverted one of them for system calls. Both are defined as illegal opcodes with their
own interrupt vector address, and are therefore not proper instructions. Many of the
other opcodes are also defined as illegal, or else redefined as special instructions.
Table 1 ignores these special instructions, but of course the emulator must deal with
them.
Table 1. The 16 Major Opcodes of the 68000
Bits 15-12  Format  General Description
0000        A1      Immediate operations
0001        A2      Move Byte
0010        A2      Move Long
0011        A2      Move Short
0100        A1      Miscellaneous
0101        A1      Add/Sub Quick & CC
0110        D       Branch
0111        D       Move Quick
1000        A1      Or & Divide
1001        A1      Subtract
1010        -       Illegal (A-Traps)
1011        A1      Compare & Xor
1100        A1      And & Multiply
1101        A1      Add
1110        A1      Shift & Rotate
1111        -       Illegal (Floating-point)
Key to Instruction formats:
-   Undefined
A1  Single-Operand Address Mode in 5-0
A2  Double-Operand Address Modes in 11-0
D   Data or Offset in 7-0
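The two field layouts just described carve the low 12 bits differently. A C sketch of both (struct and helper names are hypothetical): the A2 MOVE formats pack two 6-bit operand specifiers into bits 11-0, while the D formats (Branch, Move Quick) split them 4+8.

```c
#include <stdint.h>

typedef struct MoveFields {
    unsigned dst_reg, dst_mode;   /* bits 11-9 and 8-6 */
    unsigned src_mode, src_reg;   /* bits 5-3 and 2-0 */
} MoveFields;

/* A2 layout: two 6-bit register/addressing-mode selectors */
static MoveFields carve_move(uint16_t op) {
    MoveFields f;
    f.dst_reg  = (op >> 9) & 7;
    f.dst_mode = (op >> 6) & 7;
    f.src_mode = (op >> 3) & 7;
    f.src_reg  =  op       & 7;
    return f;
}

/* D layout: 4-bit condition (or register) plus an 8-bit immediate/offset */
static unsigned branch_cond(uint16_t op) { return (op >> 8) & 0xF; }
static int32_t  branch_disp(uint16_t op) { return (int8_t)(op & 0xFF); }
```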
The compact (and more complex) instruction decode would extract the four-bit
major opcode and use it to choose from a table of 16 instruction formats, which
further select the subroutines to decode the addressing modes (and special instructions
where they occur). In the 601, a single instruction can extract and position the
four-bit major opcode piece for indexing the major decode table; another instruction
fetches the entry from that table. A table entry typically would be the address of the
code to do the operation, and that code normally begins with either a subroutine call or
inline code to decode the addressing mode and fetch the operands. Memory data fetches
cost a minimum one-cycle penalty even if the data is in the cache (much more if it is
not, but for the interpreter’s decode table we can generally assume the best case). So
the conventional approach here turns out to be less efficient than we might hope for.
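The extract-and-index step of that compact decode is trivial to express in C (table contents here are placeholders; only the extraction is the point):

```c
#include <stdint.h>

typedef void (*FormatDecoder)(uint16_t op);

/* 16-entry first-level table, indexed by bits 15-12 of the opcode word.
   Each entry would point at the routine that decodes that format's
   addressing modes; the entries are left empty in this sketch. */
static FormatDecoder major_table[16];

static unsigned major_opcode(uint16_t op) {
    /* on the 601, a single rlwinm both extracts these four bits and
       scales them for indexing; a lwzx then fetches the table entry */
    return (op >> 12) & 0xF;
}
```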
Branch instructions in the 601 are free if we can set up the branch predictor
well enough in advance, so it makes more sense to make the decode table entries big