Emulation
Volume Number: 10
Issue Number: 9
Column Tag: MacHack Best Paper
Winner of Best Paper at MacHack ’94
The Technology of Emulation
68K on PowerPC
By Tom Pittman, Itty Bitty Computers
Note: Source code files accompanying this article are located on the MacTech CD-ROM or
source code disks.
About the Author
Tom Pittman - Even if you’ve never met Tom, you’ve almost certainly heard of
him. He’s the guy who asks all the questions at WWDC. Or you may have heard of his
HyperTalk compiler, CompileIt!, or his stand-alone shell for XCMDs, Double-XX, both
published by Heizer Software. Even if you still think you haven’t heard of him, you
may very well have used his book The Art of Compiler Design in school. Tom gave this
award-winning paper to a receptive crowd at this year’s MacHack.
He’s working on an as-yet unnamed new product for accelerating the PowerMac
emulator, based on the technology in this paper; probably to be released before this
issue hits the streets. For info, e-mail to Itty.Bitty@applelink.apple.com or
CompuServe [72457,2237]. Or send him a postcard at Itty Bitty Computers, P.O. Box
7278, Spreckels, CA 93962.
The technological aspects of processor emulation in software are discussed with
particular reference to efficient emulation of the 68000 family on the PowerPC 601.
Special attention is given to the instruction fetch & decode loop and to condition code
handling. A skeleton emulator is presented, with a statistical performance comparison
to Apple’s.
Introduction
It is well known that virtual machine (VM) emulation generally costs about one
order of magnitude in performance, as compared to native-code execution. The
published figures for Apple’s first-generation Power Macintosh are quite consistent
with this. In this paper we discuss how to get the best performance in VM emulation,
using the PowerPC as host and the 68000 as target. We also look briefly at the way
Apple did it and consider how they might improve it in future products.
Elements of Emulation
An interpreter consists of three essential parts. First is the native
representation of the VM state, including any registers that are accessible to the
emulated VM code. The 68000 family contains eight general data registers, eight
address registers (one of which is defined in the hardware as a stack pointer), a
program counter and a machine status register (mostly, for our purposes, the
condition codes) [1]. There are other registers controlling things like virtual
memory, but programmers are discouraged from accessing them directly so they can
be readily ignored. Similarly, some 68040 processors contain on-chip floating-point
hardware, but Apple explicitly declined to emulate this functionality, and it only
complicates our analysis without adding any understanding; we ignore this also. The
601 processor [2] has 32 general-purpose registers, of which all but perhaps three
or four are available for the emulator. The VM state is therefore easily consigned
entirely to 18 GP registers, leaving another ten for intermediate values required for
emulation.
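Those 18 words of state can be pictured as a C struct, though in the real emulator each field would be pinned to one of the 601’s general-purpose registers rather than kept in memory; the struct and its field names are purely illustrative:

```c
#include <stdint.h>

/* The 18 words of 68K machine state an emulator must track.  In practice
   each field lives in a dedicated host GP register; the struct is just a
   picture of the inventory. */
typedef struct M68KState {
    uint32_t d[8];    /* data registers D0-D7 */
    uint32_t a[8];    /* address registers A0-A7; A7 is the stack pointer */
    uint32_t pc;      /* program counter */
    uint32_t ccr;     /* condition codes (X N Z V C) from the status register */
} M68KState;
```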
The second essential part of the VM interpreter is the execution engine, a very
tight loop that fetches each successive instruction from the VM code and dispatches
execution to the specific code to emulate that instruction.
The third part is the specific routines for each supported VM instruction. Except
where the two architectures differ in their results and/or side effects, the actual code
in the third part will closely resemble the cognate code of the emulated machine. There
is also a lot of repetitious similarity in this third part, so apart from a few
representative examples here, we spend most of our time looking at the critical second
part, where most of the emulation overhead is spent.
The Execution Engine
The execution engine, also sometimes known as the Fetch Loop, may be further
subdivided into five elements:
1. Fetching the VM instruction,
2. Incrementing the VM program counter,
3. Extracting the operation code (normally abbreviated “opcode”) from the
instruction,
4. Using it to select and jump to the emulation code for that instruction, and
5. Returning to step 1 in the loop.
Depending on the time/space trade-off desired, the execution engine may also
decode the operand addressing modes and possibly fetch operands before jumping to the
specific emulation code.
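The five steps above can be given a minimal shape in C (the names and the full 65536-entry handler table are illustrative choices, not Apple’s; the table trades space for a decode-free dispatch):

```c
#include <stdint.h>

typedef struct VM VM;
typedef void (*OpHandler)(VM *vm, uint16_t op);

struct VM {
    uint32_t d[8], a[8];      /* 68K data and address registers */
    const uint8_t *mem;       /* emulated memory image */
    uint32_t pc;              /* virtual program counter */
    uint16_t ccr;             /* condition codes */
    int running;
};

static OpHandler dispatch[65536];   /* one entry per 16-bit opcode word */

void run(VM *vm) {
    while (vm->running) {
        /* steps 1-2: fetch the big-endian opcode word, bump the PC */
        uint16_t op = (uint16_t)((vm->mem[vm->pc] << 8) | vm->mem[vm->pc + 1]);
        vm->pc += 2;
        /* steps 3-4: the opcode word itself indexes the handler table;
           the handler's return is step 5, closing the loop */
        dispatch[op](vm, op);
    }
}

/* one concrete third-part routine: MOVEQ #imm,Dn (0111 rrr 0 dddddddd) */
static void op_moveq(VM *vm, uint16_t op) {
    vm->d[(op >> 9) & 7] = (uint32_t)(int32_t)(int8_t)(op & 0xFF);
}

static void op_halt(VM *vm, uint16_t op) { (void)op; vm->running = 0; }
```

Filling the table once at startup and letting every iteration be fetch, increment, load, and indirect call is as lean as the loop gets in C; the rest of this paper is about doing the same work in even fewer host cycles.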
In our particular context, we observe that each 68000 instruction is always
some multiple of two bytes, and the first two-byte piece uniquely defines both the
operation to be performed and (with two infrequent exceptions) how many more bytes
of operand address are required by this instruction. This forms a natural bound for
step 1: the VM fetch should get two bytes at a time. Both the 68000 and the 601 -
indeed most modern processors - can combine the fetch and increment of steps 1 and 2
in a single instruction. In the 68000, it is a post-increment operation, that is to say,
the fetch occurs on the unincremented address, then the register (whether the
program counter in the case of the hardware instruction fetch, or the address register
in the auto-increment addressing modes) is updated with the incremented address. In
the 601 and other PowerPC processors, the reverse is true (at least of registers used
in the auto-increment mode): the increment is added first, then the operand is accessed
with the new address. Therefore to combine the fetch and increment, the virtual
program counter must be emulated by an actual register that is always 2 less than the
virtual PC. This decision will also impact branch, jump, and return instructions, but
not by much.
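The 601’s load-with-update (lhzu) adds the displacement to the base register first and then loads, so the host register shadowing the virtual PC must sit 2 low. A sketch of the same update-then-load ordering in C (helper name hypothetical):

```c
#include <stdint.h>

/* Mimics PowerPC lhzu rOp,2(rPC): bump the base register first, then load
   the halfword (big-endian, as the 68K expects).  Because of this ordering,
   pc_m2 always holds (virtual PC - 2) between fetches. */
static uint16_t fetch_with_update(const uint8_t *mem, uint32_t *pc_m2) {
    *pc_m2 += 2;                                              /* update first... */
    return (uint16_t)((mem[*pc_m2] << 8) | mem[*pc_m2 + 1]);  /* ...then load */
}
```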
We can further improve performance by prefetching the next opcode before
decoding and executing the current one. Thus the memory latency for fetching each VM
instruction (which is often not in the primary on-chip cache) can be overlapped with
the execution of the previous instruction. This may cost extra time in the case of short
branches and returns, which must discard the prefetched but unused opcode following
the current instruction, and some extra complexity (but no extra cost) for all
sequence-control instructions in the VM, which must initiate a prefetch on the target
address. However, this is no more novel than instruction pipelining, which computer
hardware has been doing for years.
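The prefetch discipline might look like this in C, with a toy instruction set standing in for the real one (all names and opcodes here are invented for the sketch); note how a taken branch discards the stale prefetch and refills it from the target:

```c
#include <stdint.h>

typedef struct PVM {
    const uint8_t *mem;
    uint32_t pc;         /* virtual PC: address of the word after 'next' */
    uint16_t next;       /* the prefetched opcode word */
    uint32_t d0;         /* one toy register, enough for the sketch */
    int running;
} PVM;

static uint16_t pfetch(PVM *vm) {
    uint16_t w = (uint16_t)((vm->mem[vm->pc] << 8) | vm->mem[vm->pc + 1]);
    vm->pc += 2;
    return w;
}

/* sequence-control instructions discard the stale prefetch and initiate
   a fetch on the target address */
static void branch_to(PVM *vm, uint32_t target) {
    vm->pc = target;
    vm->next = pfetch(vm);
}

/* toy instruction set: 0x0001 = INC d0, 0x60nn = BRA nn, 0x4AFC = halt */
static void exec1(PVM *vm, uint16_t op) {
    if (op == 0x0001)            vm->d0++;
    else if ((op >> 8) == 0x60)  branch_to(vm, op & 0xFF);
    else if (op == 0x4AFC)       vm->running = 0;
}

void run_prefetch(PVM *vm) {
    vm->next = pfetch(vm);          /* prime the pipeline */
    while (vm->running) {
        uint16_t op = vm->next;
        vm->next = pfetch(vm);      /* overlap the fetch of the following
                                       instruction with executing 'op' */
        exec1(vm, op);
    }
}
```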
Extracting the opcode can be simple and fast, or complex and relatively slow (but
compact). Let’s consider the latter first. The major opcode of the 68000 family
instruction set is the most significant four bits (see Table 1). This also selects one of
two subdivisions of the remaining 12 bits. In two cases (opcodes 0110 and 0111), the
bits are divided 4+8, with the four-bit piece being used to select a register or a
condition to test (sort of a minor opcode, but not quite), while the lower byte is
immediate data. In most of the other cases the bits are divided into two 6-bit register
and addressing mode selectors, where one or both of these addressing mode fields often
doubles for a minor opcode - usually as determined by the upper four bits. Two of the
major opcodes were originally reserved for linkage to co-processors, but Apple
subverted one of them for system calls. Both are defined as illegal opcodes with their
own interrupt vector address, and are therefore not proper instructions. Many of the
other opcodes are also defined as illegal, or else redefined as special instructions.
Table 1 ignores these special instructions, but of course the emulator must deal with
them.
Table 1. The 16 Major Opcodes of the 68000
Bits 15-12  Format  General Description
0000        A1      Immediate operations
0001        A2      Move Byte
0010        A2      Move Long
0011        A2      Move Short
0100        A1      Miscellaneous
0101        A1      Add/Sub Quick & CC
0110        D       Branch
0111        D       Move Quick
1000        A1      Or & Divide
1001        A1      Subtract
1010        -       Illegal (A-Traps)
1011        A1      Compare & Xor
1100        A1      And & Multiply
1101        A1      Add
1110        A1      Shift & Rotate
1111        -       Illegal (Floating-point)
Key to Instruction formats:
-   Undefined
A1  Single-Operand Address Mode in 5-0
A2  Double-Operand Address Modes in 11-0
D   Data or Offset in 7-0
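The two field layouts just described carve the low 12 bits differently. A C sketch of both (struct and helper names are hypothetical): the A2 MOVE formats pack two 6-bit operand specifiers into bits 11-0, while the D formats (Branch, Move Quick) split them 4+8.

```c
#include <stdint.h>

typedef struct MoveFields {
    unsigned dst_reg, dst_mode;   /* bits 11-9 and 8-6 */
    unsigned src_mode, src_reg;   /* bits 5-3 and 2-0 */
} MoveFields;

/* A2 layout: two 6-bit register/addressing-mode selectors */
static MoveFields carve_move(uint16_t op) {
    MoveFields f;
    f.dst_reg  = (op >> 9) & 7;
    f.dst_mode = (op >> 6) & 7;
    f.src_mode = (op >> 3) & 7;
    f.src_reg  =  op       & 7;
    return f;
}

/* D layout: 4-bit condition (or register) plus an 8-bit immediate/offset */
static unsigned branch_cond(uint16_t op) { return (op >> 8) & 0xF; }
static int32_t  branch_disp(uint16_t op) { return (int8_t)(op & 0xFF); }
```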
The compact (and more complex) instruction decode would extract the four-bit
major opcode and use it to choose from a table of 16 instruction formats, which
further select the subroutines to decode the addressing modes (and special instructions
where they occur). In the 601, a single instruction can extract and position the
four-bit major opcode piece for indexing the major decode table; another instruction
fetches the entry from that table. A table entry typically would be the address of the
code to do the operation, and that code normally begins with either a subroutine call or
inline code to decode the addressing mode and fetch the operands. Memory data fetches
cost a minimum one-cycle penalty even if the data is in the cache (much more if it is
not, but for the interpreter’s decode table we can generally assume the best case). So
the conventional approach here turns out to be less efficient than we might hope for.
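The extract-and-index step of that compact decode is trivial to express in C (table contents here are placeholders; only the extraction is the point):

```c
#include <stdint.h>

typedef void (*FormatDecoder)(uint16_t op);

/* 16-entry first-level table, indexed by bits 15-12 of the opcode word.
   Each entry would point at the routine that decodes that format's
   addressing modes; the entries are left empty in this sketch. */
static FormatDecoder major_table[16];

static unsigned major_opcode(uint16_t op) {
    /* on the 601, a single rlwinm both extracts these four bits and
       scales them for indexing; a lwzx then fetches the table entry */
    return (op >> 12) & 0xF;
}
```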
Branch instructions in the 601 are free if we can set up the branch predictor
well enough in advance, so it makes more sense to make the decode table entries big