Associative Arrays
Volume Number: 16
Issue Number: 10
Column Tag: Programming Techniques
How to Harness the Power of Associative Arrays
by Kas Thomas
A widely underutilized technique can go a long way
toward making your code smaller, more reliable, and
easier to maintain, whether you program in C, Java,
or JavaScript
Fostering reliability, maintainability, compactness, and good performance in code has
been a constant quest for programmers and language designers over the years. It's
rare that one technique can give you all of the above benefits, concurrently. But
intelligent use of associative arrays can do that. If you haven't yet tapped into the
power of associative arrays, you might want to give the issues involved some thought.
The issues are widely applicable to a variety of programming tasks, cutting across all
major languages.
One thing's for sure: if you use a lot of switch statements in your code (to "case out"
user-selected actions, for example), you're sitting on a great opportunity to improve
the reliability and maintainability of your code. Just follow along and see.
What Is an Associative Array?
An associative array is a collection of data values in which access to any particular
value is obtained via a (single) key. Generally speaking, it's a one-to-one mapping of
data entities ("values") to keys, such that no two values map to the same key (although
two different keys could, coincidentally, contain the same value). For example, in an
array of TV listings, you might very well find that two different channels are showing
the same episode of Gilligan's Island at 4:00 a.m. Tuesday, but you would never find
Channel 2 listed twice, showing two different shows in the same time slot.
In a larger sense, associative-array mapping is an application of set theory, and
languages that have a robust collections package will invariably implement some form
of associative-array mapping.
For a quick example of an associative array, consider the hypothetical entity in Figure
1.
Figure 1. A hypothetical associative array.
In this example, we have an anonymous (nameless) array of tuples shown as keys
mapped to values. The itemType key, for example, maps to the value 'auto'. If this were
a conventional (ordered) array, we would say that the zeroth item in the array is the
(string) value 'auto', the first item is '4WD', etc.
For the array shown in Figure 1 to be useful, we need to be able to hold a reference to
the array in a variable, and we need to have a syntax for obtaining the value of each
member, given the value of the corresponding key. It would also be nice if we could
find out how many items are in the array (since it might be arbitrarily large) and to
know how to iterate through the members. With an ordered array, you can step
through the members in sequence. But in an array like this one, where the keys aren't
integers, how do you enumerate the array's contents?
Associative Arrays in JavaScript
Core JavaScript (or ECMAScript) has a convenient syntax for dealing with
constructions such as this. (Even if you're not a JavaScript programmer, bear with
me here, because I'll be discussing other languages in a moment.) In the world of
object-oriented programming, an arbitrary collection of data entities is often called
an object. In JavaScript, a generic object is, in fact, an associative array (whose
members can be references to functions, in which case those particular members are
called methods). In the OOP world, dot notation is commonly used to "index into" an
object. For example:
var myObject = new Object();
myObject.itemType = 'auto';
myObject.subtype = '4WD';
myObject.make = 'Jeep';
myObject.year = '1998';
myObject.color = 'green';
In JavaScript, this is an entirely acceptable way to implement the object suggested by
Figure 1. An equivalent syntax that uses literal notation is:
var myObject = { 'itemType' : 'auto',
                 'subtype'  : '4WD',
                 'make'     : 'Jeep',
                 'year'     : '1998',
                 'color'    : 'green' };
But are we justified in calling this an associative array? Yes, we are, because it turns
out we can also use array notation (square brackets) in JavaScript to refer to object
properties:
var myObject = new Object();
myObject['itemType'] = 'auto';
myObject['subtype'] = '4WD';
myObject['make'] = 'Jeep';
myObject['year'] = '1998';
myObject['color'] = 'green';
if (myObject['itemType'] == 'auto'
    && myObject['make'] == 'Jeep')  // true
    myObject['subtype'] = '4WD';    // assign '4WD' to myObject.subtype
In JavaScript, square brackets can substitute for dot notation when referring to object
properties. The only proviso is that the enclosed property name must be supplied as a
string: a quoted literal, as shown above, or any expression that evaluates to a string.
This turns out to be an extremely handy syntactic construct, because it means we are
not limited to one-word property names. We can
now consider constructions like:
myObject['extended warranty'] = true;           // legal
myObject['original delivery date'] = '1/1/98';  // legal
myObject.'original delivery date' = '1/1/98';   // ERROR!
Iterating through the contents of an associative array (or the properties of an object)
would seem to present a problem, since the contents aren't necessarily ordered and
can't be "indexed into" numerically. In JavaScript, however, there is a very handy
loop syntax for doing this:
var bigstring = "";
for (var key in myObject)
    bigstring += key + "\n";
The above code builds a list of all the property names in myObject (separated by
newlines), concatenating them into a single big string.
Perl
Perl has a built-in associative-array construct called the hash. The syntax for
creating a hash looks like:
%myObject = ('itemType', 'auto', 'subtype', '4WD',
             'make', 'Jeep', 'year', '1998',
             'color', 'green');
Notice that plain commas separate keys from values throughout the pairs list, rather
than the more readable JavaScript technique of putting colons between keys and values
(with commas separating complete tuples). Perl does, however, offer the "fat comma,"
=>, which can stand in for the comma between a key and its value and reads much like
the JavaScript colon.
To access a hash value in Perl, you use a notation that looks like:
$myObject{ 'color' } = 'red';
Notice the seemingly inconsistent use of the dollar sign (instead of the percent sign)
and curly braces (instead of parentheses) for accessing individual values of a hash.
This is a standard Perl idiom. The percent sign refers only to complete hashes.
Java
With the advent of the Java 2 platform, the java.util package has a powerful
Collections Framework (not to be confused with JGL, the Generic Collections Library
for Java by ObjectSpace, which was first to market and widely distributed in
IDEs before Sun came out with the J2SE Collections Framework). It encompasses sorted
and unsorted sets, maps, and lists, plus iterators for same, in thread-safe and
non-thread-safe versions. Prior to Java 2, the language had the Vector and Hashtable
classes to accommodate this need, but those classes have been relegated to legacy status.
Today, if you need an associative array, you call on the HashMap class, which
implements the Map interface:
Map map = new HashMap();
map.put("itemType", "auto");
map.put("subtype" , "4WD");
map.put("make" , "Jeep");
map.put("year" , "1998");
map.put("color" , "green");
For sorted maps, you can create a TreeMap (using an existing Map as input, if
necessary), which stores its elements in an always-balanced red-black tree.
C and C++
In C++, the Standard Template Library provides rich support for collections idioms of
all kinds. The map is an associative container that accomplishes the kind of
functionality we've been talking about:
map<string, string> myMap;

myMap["itemType"] = "auto";
myMap["subtype"] = "4WD";
// etc. etc.
To use it, be sure to #include <map> (and <string>, for the example above). See any
good reference on the STL for further details.
ANSI C, on the other hand, has no built-in support for associative arrays. You can
craft your own support, of course, with a little effort. But how? As you might guess
from the repeated use of the strange term "hash" in this context, implementing an
associative array from scratch requires the use of hash techniques.
What, exactly, is a hash? A hash is basically a mathematical function in the canonical
sense: a "black box" that maps domain values to range values, with the requirement
that any one domain value maps to exactly one range value (although two different
domain values may well land on the same range value: a collision, about which more
later). This is exactly the mapping behavior we're looking for, of course.
But what does a hash function really look like? The answer is, it can look like
whatever you want it to look like, or (more commonly) whatever you need it to look
like. If that sounds a bit vague, that's because hash techniques fall into a curious
category of computational heuristics with a long background in computer-science lore.
In the old days of limited machine resources, you'd use hash techniques to maximize
performance while minimizing storage requirements. Good hash functions weren't
often published, and if they were, their workings were rarely self-evident.
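Before going further, a concrete taste may help. About the simplest hash imaginable
(a sketch for illustration only; real hash functions are chosen far more carefully)
just sums the character codes of a string and folds the sum into a fixed range:

```c
/* A toy hash: sum the character codes of a string and fold the sum
   into a fixed range (here, the 40,960 one-bit slots that 5 Kbytes
   of table storage would give you). Simple to write, but poorly
   distributed: because addition ignores letter order, anagrams such
   as "eat" and "tea" always collide. */
#define TABLE_SLOTS 40960

unsigned int toy_hash(const char *s)
{
    unsigned int sum = 0;
    while (*s)
        sum += (unsigned char)*s++;
    return sum % TABLE_SLOTS;
}
```

No matter how long the input string is, toy_hash() always returns an offset in the
range 0..40959; the anagram problem is exactly the sort of weakness that better hash
functions are designed to avoid.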
A Hash-Code Example
A quick example may help. Suppose you are implementing an exception dictionary for a
spellchecker: a special dictionary that stores user-entered words. When a user types
a word that is not in the regular dictionary, the exception dictionary is consulted;
at that point, you simply need to know whether the word exists in that dictionary.
(If not, you flag the word as a possible misspelling and notify the user.) You have a
maximum of 5 Kbytes of storage available for this task, because you're running on a
1970s machine with severely limited resources.
A typical answer to this challenge would be to insert exception words into a linked list,
and do a lookup by traversing the list. With 5K of storage and an average word length of
five characters, you could store as many as 1,000 words in the list (more than
adequate for most exception dictionaries). But lookup time will be proportional to
the number of words in the list, and there will be significant overhead in terms of
list management, because each time a new word is entered, you have to be sure it
doesn't already exist.
This approach gets a D for ingenuity.
A bright young programmer in your department comes to you at this point and says
"Wait! I have a better idea." He explains that if you keep the entries sorted
alphabetically (in an array rather than a linked list, so that binary search is
possible), lookups speed up because they occur in time proportional to the log of the
number of entries. Maintenance overhead is reduced as well, although storage
requirements haven't changed.
This represents a significant improvement. Let's give it a C+ (or maybe even a C++)
for effort.
At this point, a consultant walks in the door and offers to solve the problem for you in
such a way that up to 40,000 words can be stored in your exception dictionary and
lookups are instantaneous, with zero table maintenance. You say to him "Thank you,
but that's clearly impossible. Don't let the doorknob hit you in the butt on the way
out." He goes straight to your competitor, implements the solution, and eventually the
competitor puts you out of business. You end up face-down in the gutter, next to an
empty bottle of Ripple.
How did the consultant do it? First of all, since you are only concerned with whether a
given word has an entry in the dictionary (true) or not (false), you're really looking
for a binary (Boolean) answer to a question; hence your 5-Kbyte dictionary can really
store 40K one-bit entries (assuming 8-bit bytes, of course). The dictionary needs to be
implemented as a bit table.
To implement a direct lookup of words, we resort to the following heuristic. To store a
word in the dictionary, we submit the word string to a hash function, which in turn
converts the word from a string to a number in the range of zero to 5K-minus-one.
We then index into the bit table at the offset thus obtained, and store the word by
flipping the bit "on." (A zero bit means the word doesn't exist in the table.) To look up
a word and see if it exists in the exception dictionary, we simply hash the word and do a
direct lookup. If the entry in the table at that lookup is true, the word exists. If not, it
doesn't. Problem solved.
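In C, the consultant's scheme might be sketched as follows. (The hash function shown
here, the well-known djb2 string hash, is my stand-in; the article doesn't specify
one, and a real spellchecker would want a hash chosen to minimize collisions.)

```c
#define TABLE_BYTES 5120                 /* 5 Kbytes -> 40,960 one-bit entries */
#define TABLE_BITS  (TABLE_BYTES * 8)

static unsigned char table[TABLE_BYTES]; /* zero-initialized: empty dictionary */

/* Stand-in hash (djb2): folds a word into an offset 0..TABLE_BITS-1. */
static unsigned int hash(const char *word)
{
    unsigned int h = 5381;
    while (*word)
        h = h * 33u + (unsigned char)*word++;
    return h % TABLE_BITS;
}

/* Store a word: hash it and flip the corresponding bit on. */
void dict_add(const char *word)
{
    unsigned int bit = hash(word);
    table[bit / 8] |= (unsigned char)(1u << (bit % 8));
}

/* Look a word up: one hash, one direct read -- no list to traverse. */
int dict_contains(const char *word)
{
    unsigned int bit = hash(word);
    return (table[bit / 8] >> (bit % 8)) & 1;
}
```

Note the trade-off hiding in this scheme: two different words can hash to the same
bit, so the table can occasionally report a false "yes." A well-distributed hash
keeps that risk small.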
Finally, an A+ solution!
Hash Functions
But we still haven't resolved the question of what a hash function looks like. Earlier we
merely said that it could look like whatever it needs to look like. Let's take the
exception-dictionary example we just discussed. How many bits' worth of information
does a 5-character string really hold? If the string represents an English word, each
character can have one of 26 possible values (except that the first character can be
upper or lower case, giving 52 possible values). Encoding 26 possibilities requires
about 4.7 bits; 52 possibilities, about 5.7 bits. The average five-character word
therefore carries 5.7 + 4 * 4.7, or just under 25, bits of information. Every
5-character English word can therefore be converted (mapped directly and
unambiguously) to a 25-bit number: which is to say, a number between zero and
33,554,431.
Clearly, if we had 33.5 million bits (4 megabytes) of storage available, and words
could never be longer than five characters, we could convert words directly to binary
numbers and use those numbers as offsets into a bit table.
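The direct mapping just described is easy to make concrete. The encoding below is one
hypothetical choice (first letter drawn from 52 possibilities, the remaining four
from 26 each, assuming exactly five letters per word):

```c
#include <ctype.h>

/* Map a 5-letter word (only the first letter may be upper case) to a
   unique number. 52 * 26^4 = 23,762,752 combinations, so the result
   always fits in 25 bits. */
unsigned long word_to_number(const char *word)
{
    /* First letter: 52 possibilities (a-z = 0..25, A-Z = 26..51). */
    unsigned long n = islower((unsigned char)word[0])
                          ? (unsigned long)(word[0] - 'a')
                          : (unsigned long)(word[0] - 'A') + 26;
    int i;
    /* Remaining four letters: 26 possibilities each. */
    for (i = 1; i < 5; i++)
        n = n * 26 + (unsigned long)(word[i] - 'a');
    return n;
}
```

Distinct words get distinct numbers, so the number could serve directly as a
bit-table offset, if only we had the 4 megabytes.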
Unfortunately, in the real world, words are not constrained to five characters in
length and four megabytes of table storage are not always available. In fact, our
example calls for no more than 5 Kbytes (40Kbits) of storage.
Why not take the bottom 15 bits of each word's number (discarding the rest) and use
that as an index into a 40-Kbit table? The answer should be obvious: words often share
the same endings. Taking the last 15 bits of 'eating' and 'drinking' would leave both
words contending for the same spot in the exception dictionary. Totally unsatisfactory.
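The suffix problem is easy to demonstrate. Under a straight base-26 encoding (a
sketch, assuming all-lower-case input), the low-order digits of a word's number are
determined entirely by its last few letters, so any two words sharing an ending agree
in those digits:

```c
/* Encode a lower-case word in base 26 (a=0 .. z=25). The low-order
   base-26 digits of the result depend only on the word's final
   letters -- which is why truncating such a number to its bottom
   bits makes words with common endings fight over the same slot. */
unsigned long long encode(const char *word)
{
    unsigned long long n = 0;
    while (*word)
        n = n * 26 + (unsigned long long)(*word++ - 'a');
    return n;
}
```

Since 'eating' and 'drinking' both end in "ing", their encodings agree modulo 26^3
(= 17,576): every letter before the last three contributes only multiples of 17,576.
A truncated index throws away exactly the high-order information that distinguishes
the two words.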