Associative Arrays
Volume Number: 16
Issue Number: 10
Column Tag: Programming Techniques
How to Harness the Power of Associative Arrays
by Kas Thomas
A widely underutilized technique can go a long way
toward making your code smaller, more reliable, and
easier to maintain, whether you program in C, Java,
or JavaScript
Fostering reliability, maintainability, compactness, and good performance in code has
been a constant quest for programmers and language designers over the years. It's
rare that one technique can give you all of the above benefits, concurrently. But
intelligent use of associative arrays can do that. If you haven't yet tapped into the
power of associative arrays, you might want to give the issues involved some thought.
The issues are widely applicable to a variety of programming tasks, cutting across all
major languages.
One thing's for sure: if you use a lot of switch statements in your code (to "case out"
user-selected actions, for example), you're sitting on a great opportunity to improve
the reliability and maintainability of your code. Just follow along and see.
What Is an Associative Array?
An associative array is a collection of data values in which access to any particular
value is obtained via a (single) key. Generally speaking, it's a one-to-one mapping of
data entities ("values") to keys, such that no two values map to the same key (although
two different keys could, coincidentally, contain the same value). For example, in an
array of TV listings, you might very well find that two different channels are showing
the same episode of Gilligan's Island at 4:00 a.m. Tuesday, but you would never find
Channel 2 listed twice, showing two different shows in the same time slot.
In a larger sense, associative-array mapping is an application of set theory, and
languages that have a robust collections package will invariably implement some form
of associative-array mapping.
For a quick example of an associative array, consider the hypothetical entity in Figure
1.
Figure 1. A hypothetical associative array.
In this example, we have an anonymous (nameless) array of tuples shown as keys
mapped to values. The itemType key, for example, maps to the value 'auto'. If this were
a conventional (ordered) array, we would say that the zeroth item in the array is the
(string) value 'auto', the first item is '4WD', etc.
For the array shown in Figure 1 to be useful, we need to be able to hold a reference to
the array in a variable, and we need to have a syntax for obtaining the value of each
member, given the value of the corresponding key. It would also be nice if we could
find out how many items are in the array (since it might be arbitrarily large) and to
know how to iterate through the members. With an ordered array, you can step
through the members in sequence. But in an array like this one, where the keys aren't
integers, how do you enumerate the array's contents?
Associative Arrays in JavaScript
Core JavaScript (or ECMAScript) has a convenient syntax for dealing with
constructions such as this. (Even if you're not a JavaScript programmer, bear with
me here, because I'll be discussing other languages in a moment.) In the world of
object-oriented programming, an arbitrary collection of data entities is often called
an object. In JavaScript, a generic object is, in fact, an associative array (whose
members can be references to functions, in which case those particular members are
called methods). In the OOP world, dot notation is commonly used to "index into" an
object. For example:
var myObject = new Object();
myObject.itemType = 'auto';
myObject.subtype = '4WD';
myObject.make = 'Jeep';
myObject.year = '1998';
myObject.color = 'green';
In JavaScript, this is an entirely acceptable way to implement the object suggested by
Figure 1. An equivalent syntax that uses literal notation is:
var myObject = { 'itemType' : 'auto',
                 'subtype'  : '4WD',
                 'make'     : 'Jeep',
                 'year'     : '1998',
                 'color'    : 'green' };
But are we justified in calling this an associative array? Yes, we are, because it turns
out we can also use array notation (square brackets) in JavaScript to refer to object
properties:
var myObject = new Object();
myObject['itemType'] = 'auto';
myObject['subtype'] = '4WD';
myObject['make'] = 'Jeep';
myObject['year'] = '1998';
myObject['color'] = 'green';
if (myObject['itemType'] == 'auto'
    && myObject['make'] == 'Jeep')  // true
    myObject['subtype'] = '4WD';    // assign '4WD' to myObject.subtype
In JavaScript, square brackets can substitute for dot notation when referring to object
properties. The only proviso is that the enclosed property name must be supplied as a
string: a quoted literal, as shown above, or any expression that evaluates to a string.
This turns out to be an extremely handy syntactic construct, because it means we are
not limited to one-word property names. We can
now consider constructions like:
myObject['extended warranty'] = true;           // legal
myObject['original delivery date'] = '1/1/98';  // legal
myObject.'original delivery date' = '1/1/98';   // ERROR!
Iterating through the contents of an associative array (or the properties of an object)
would seem to present a problem, since the contents aren't necessarily ordered and
can't be "indexed into" numerically. In JavaScript, however, there is a very handy
loop syntax for doing this:
var bigstring = "";
for (var key in myObject)
    bigstring += key + "\n";
The above code builds a list of all the property names in myObject (separated by
newlines), concatenating them into a single big string.
Perl
Perl has a built-in associative-array construct called the hash. The syntax for
creating a hash looks like:
%myObject = ('itemType', 'auto', 'subtype', '4WD',
             'make', 'Jeep', 'year', '1998',
             'color', 'green');
Notice that plain commas separate keys from values throughout the pairs list, rather
than the more readable JavaScript technique of putting colons between keys and values
(with commas separating complete tuples). Perl does, however, offer the "fat comma,"
=>, which can stand in for the comma between a key and its value and reads much like
the JavaScript colon.
To access a hash value in Perl, you use a notation that looks like:
$myObject{ 'color' } = 'red';
Notice the seemingly inconsistent use of the dollar sign (instead of the percent sign)
and curly braces (instead of parentheses) for accessing individual values of a hash.
This is a standard Perl idiom. The percent sign refers only to complete hashes.
Java
With the advent of the Java 2 platform, the java.util package has a powerful
Collections Framework (not to be confused with JGL, the Generic Collections Library
for Java by ObjectSpace, which was first to market and widely distributed in
IDEs before Sun came out with the J2SE Collections Framework). It encompasses sorted
and unsorted sets, maps, and lists, plus iterators for same, in thread-safe and
non-thread-safe versions. Prior to Java 2, the language had the Vector and Hashtable
classes to accommodate this need, but those classes have been relegated to legacy status.
Today, if you need an associative array, you call on the HashMap class, which
implements the Map interface:
Map map = new HashMap();
map.put("itemType", "auto");
map.put("subtype" , "4WD");
map.put("make" , "Jeep");
map.put("year" , "1998");
map.put("color" , "green");
For sorted maps, you can create a TreeMap (using an existing Map as input, if
necessary), which stores its elements in an always-balanced red-black tree.
C and C++
In C++, the Standard Template Library provides rich support for collections idioms of
all kinds. The map is an associative container that accomplishes the kind of
functionality we've been talking about:
map<string, string> myMap;

myMap["itemType"] = "auto";
myMap["subtype"] = "4WD";
// etc. etc.
To use it, be sure to #include <map> (and <string>, for the example above). See any
good reference on the STL for further details.
ANSI C, on the other hand, has no built-in support for associative arrays. You can
craft your own support, of course, with a little effort. But how? As you might guess
from the repeated use of the strange term "hash" in this context, implementing an
associative array from scratch requires the use of hash techniques.
What, exactly, is a hash? A hash is basically a mathematical function in the canonical
sense: a "black box" that maps domain values to range values, with the requirement
that any one domain value maps to exactly one range value (although two different
domain values may well land on the same range value: a collision, about which more
later). This is exactly the mapping behavior we're looking for, of course.
But what does a hash function really look like? The answer is, it can look like
whatever you want it to look like, or (more commonly) whatever you need it to look
like. If that sounds a bit vague, that's because hash techniques fall into a curious
category of computational heuristics with a long background in computer-science lore.
In the old days of limited machine resources, you'd use hash techniques to maximize
performance while minimizing storage requirements. Good hash functions weren't
often published, and if they were, their workings were rarely self-evident.
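Before going further, a concrete taste may help. About the simplest hash imaginable
(a sketch for illustration only; real hash functions are chosen far more carefully)
just sums the character codes of a string and folds the sum into a fixed range:

```c
/* A toy hash: sum the character codes of a string and fold the sum
   into a fixed range (here, the 40,960 one-bit slots that 5 Kbytes
   of table storage would give you). Simple to write, but poorly
   distributed: because addition ignores letter order, anagrams such
   as "eat" and "tea" always collide. */
#define TABLE_SLOTS 40960

unsigned int toy_hash(const char *s)
{
    unsigned int sum = 0;
    while (*s)
        sum += (unsigned char)*s++;
    return sum % TABLE_SLOTS;
}
```

No matter how long the input string is, toy_hash() always returns an offset in the
range 0..40959; the anagram problem is exactly the sort of weakness that better hash
functions are designed to avoid.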
A Hash-Code Example
A quick example may help. Suppose you are implementing an exception dictionary for a
spellchecker: a special dictionary that stores user-entered words. When a user types
a word that is not in the regular dictionary, the exception dictionary is consulted;
at that point, you simply need to know whether the word exists in that dictionary.
(If not, you flag the word as a possible misspelling and notify the user.) You have a
maximum of 5 Kbytes of storage available for this task, because you're running on a
1970s machine with severely limited resources.
A typical answer to this challenge would be to insert exception words into a linked list,
and do a lookup by traversing the list. With 5K of storage and an average word length of
five characters, you could store as many as 1,000 words in the list (more than
adequate for most exception dictionaries). But lookup time will be proportional to
the number of words in the list, and there will be significant overhead in terms of
list management, because each time a new word is entered, you have to be sure it
doesn't already exist.
This approach gets a D for ingenuity.
A bright young programmer in your department comes to you at this point and says
"Wait! I have a better idea." He explains that if you keep the entries sorted
alphabetically (in an array rather than a linked list, so that binary search is
possible), lookups speed up because they occur in time proportional to the log of the
number of entries. Maintenance overhead is reduced as well, although storage
requirements haven't changed.
This represents a significant improvement. Let's give it a C+ (or maybe even a C++)
for effort.
At this point, a consultant walks in the door and offers to solve the problem for you in
such a way that up to 40,000 words can be stored in your exception dictionary and
lookups are instantaneous, with zero table maintenance. You say to him "Thank you,
but that's clearly impossible. Don't let the doorknob hit you in the butt on the way
out." He goes straight to your competitor, implements the solution, and eventually the
competitor puts you out of business. You end up face-down in the gutter, next to an
empty bottle of Ripple.
How did the consultant do it? First of all, since you are only concerned with whether a
given word has an entry in the dictionary (true) or not (false), you're really looking
for a binary (Boolean) answer to a question; hence your 5-Kbyte dictionary can really
store 40K one-bit entries (assuming 8-bit bytes, of course). The dictionary needs to be
implemented as a bit table.
To implement a direct lookup of words, we resort to the following heuristic. To store a
word in the dictionary, we submit the word string to a hash function, which in turn
converts the word from a string to a number in the range of zero to 5K-minus-one.
We then index into the bit table at the offset thus obtained, and store the word by
flipping the bit "on." (A zero bit means the word doesn't exist in the table.) To look up
a word and see if it exists in the exception dictionary, we simply hash the word and do a
direct lookup. If the entry in the table at that lookup is true, the word exists. If not, it
doesn't. Problem solved.
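In C, the consultant's scheme might be sketched as follows. (The hash function shown
here, the well-known djb2 string hash, is my stand-in; the article doesn't specify
one, and a real spellchecker would want a hash chosen to minimize collisions.)

```c
#define TABLE_BYTES 5120                 /* 5 Kbytes -> 40,960 one-bit entries */
#define TABLE_BITS  (TABLE_BYTES * 8)

static unsigned char table[TABLE_BYTES]; /* zero-initialized: empty dictionary */

/* Stand-in hash (djb2): folds a word into an offset 0..TABLE_BITS-1. */
static unsigned int hash(const char *word)
{
    unsigned int h = 5381;
    while (*word)
        h = h * 33u + (unsigned char)*word++;
    return h % TABLE_BITS;
}

/* Store a word: hash it and flip the corresponding bit on. */
void dict_add(const char *word)
{
    unsigned int bit = hash(word);
    table[bit / 8] |= (unsigned char)(1u << (bit % 8));
}

/* Look a word up: one hash, one direct read -- no list to traverse. */
int dict_contains(const char *word)
{
    unsigned int bit = hash(word);
    return (table[bit / 8] >> (bit % 8)) & 1;
}
```

Note the trade-off hiding in this scheme: two different words can hash to the same
bit, so the table can occasionally report a false "yes." A well-distributed hash
keeps that risk small.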
Finally, an A+ solution!
Hash Functions
But we still haven't resolved the question of what a hash function looks like. Earlier we
merely said that it could look like whatever it needs to look like. Let's take the
exception-dictionary example we just discussed. How many bits' worth of information
does a 5-character string really hold? If the string represents an English word, each
character can have one of 26 possible values (except that the first character can be
upper or lower case, giving 52 possible values). Encoding 26 possibilities requires
about 4.7 bits; 52 possibilities, about 5.7 bits. The average five-character word
therefore carries 5.7 + 4 * 4.7, or just under 25, bits of information. Every
5-character English word can therefore be converted (mapped directly and
unambiguously) to a 25-bit number: which is to say, a number between zero and
33,554,431.
Clearly, if we had 33.5 million bits (4 megabytes) of storage available, and words
could never be longer than five characters, we could convert words directly to binary
numbers and use those numbers as offsets into a bit table.
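The direct mapping just described is easy to make concrete. The encoding below is one
hypothetical choice (first letter drawn from 52 possibilities, the remaining four
from 26 each, assuming exactly five letters per word):

```c
#include <ctype.h>

/* Map a 5-letter word (only the first letter may be upper case) to a
   unique number. 52 * 26^4 = 23,762,752 combinations, so the result
   always fits in 25 bits. */
unsigned long word_to_number(const char *word)
{
    /* First letter: 52 possibilities (a-z = 0..25, A-Z = 26..51). */
    unsigned long n = islower((unsigned char)word[0])
                          ? (unsigned long)(word[0] - 'a')
                          : (unsigned long)(word[0] - 'A') + 26;
    int i;
    /* Remaining four letters: 26 possibilities each. */
    for (i = 1; i < 5; i++)
        n = n * 26 + (unsigned long)(word[i] - 'a');
    return n;
}
```

Distinct words get distinct numbers, so the number could serve directly as a
bit-table offset, if only we had the 4 megabytes.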
Unfortunately, in the real world, words are not constrained to five characters in
length and four megabytes of table storage are not always available. In fact, our
example calls for no more than 5 Kbytes (40Kbits) of storage.
Why not take the bottom 15 bits of each word's number (discarding the rest) and use
that as an index into a 40-Kbit table? The answer should be obvious: words often share
the same endings. Taking the last 15 bits of 'eating' and 'drinking' would leave both
words contending for the same spot in the exception dictionary. Totally unsatisfactory.
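The suffix problem is easy to demonstrate. Under a straight base-26 encoding (a
sketch, assuming all-lower-case input), the low-order digits of a word's number are
determined entirely by its last few letters, so any two words sharing an ending agree
in those digits:

```c
/* Encode a lower-case word in base 26 (a=0 .. z=25). The low-order
   base-26 digits of the result depend only on the word's final
   letters -- which is why truncating such a number to its bottom
   bits makes words with common endings fight over the same slot. */
unsigned long long encode(const char *word)
{
    unsigned long long n = 0;
    while (*word)
        n = n * 26 + (unsigned long long)(*word++ - 'a');
    return n;
}
```

Since 'eating' and 'drinking' both end in "ing", their encodings agree modulo 26^3
(= 17,576): every letter before the last three contributes only multiples of 17,576.
A truncated index throws away exactly the high-order information that distinguishes
the two words.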