Nick Bit and Dick Byte are aspiring computer scientists, always up to something. This time they get to grips with hash-based collections.
Nick: “Hey dude, let’s build a data structure in Java that does everything in O(1)”
Dick: “Sounds cool. You can’t get any better than constant time”
[Nick and Dick are of course talking about algorithm time complexity measure and the Big O notation]
Dick: “Where do we start?”
Nick: “We could build it on top of an array. If you know the index of an element in the array you can do everything in O(1)”
Dick: “What do you mean?”
Nick: “Take a look at this code. Knowing the index allows me to insert, access and delete an entry in constant time”
String[] fruits = new String[16];
int apple_index = 4;
String apple = "Apple";

// insert
fruits[apple_index] = apple;

// access
String lunch = fruits[apple_index];

// remove
fruits[apple_index] = null;
Dick: “I see… but how do we relate an object with an integer number that we can use as the index into the array without hard-coding it?”
Nick: “As it happens, the java.lang.Object class already provides this mechanism for us with the hashCode() method. It attempts to return distinct integers for distinct objects”
Dick: “Cool. So if we have an array big enough we can store any object in it using the hash code as the index and get constant time performance for all operations!”
Nick: “Let’s do it. Why don’t you start writing the code while I get us a coffee?”
[Dick gets started with the code, but don’t try this at home]
Object[] objArray = new Object[Integer.MAX_VALUE];
String banana = "Banana";
objArray[banana.hashCode()] = banana;
Nick: “I am back”
Dick: “Wow! I think I just blew the JVM…”
Nick: “Let me see”
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at com.nickanddick.App.main(App.java:7)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Nick: “That’s daft. We are asking the JVM to allocate an array of Integer.MAX_VALUE, that is 2^31 - 1, object references. That’s more than 2 billion!”
Dick: “Wow! That’s a lot of memory”
Nick: “We have been concentrating on time complexity but forgot to take space complexity into consideration”
Dick: “That’s how much memory an algorithm needs, right?”
Nick: “Right. We don’t really need an array this big. We just need an array big enough to contain the number of elements required by a particular application”
Dick: “But if we use a smaller array, what happens if the hash code returned for an object is outside the array bounds?”
[Good question. It is time we give Nick and Dick some help]
Bucket Arrays and Hash Functions
To efficiently implement their data structure they need the two fundamental components of a hash-based collection: a bucket array and a hash function.
A bucket array is an ordinary array of a pre-defined capacity (number of elements).
A hash function is a function that translates the hash code of an object into a bucket array index. A simple hash function could for example determine the index by finding the remainder of the division of the hash code by the bucket array length like this:
index = hashCode % array.length
[The % operator is called the remainder or modulus operator]
There are a number of considerations to take into account when implementing a hash-based collection.
Hash-based collections do not store duplicate values: equal objects produce the same hash code, and therefore the same index into the bucket array, where the duplicate can be detected and discarded.
There is however a problem. It is possible that a hash function returns the same index for different objects. Suppose we have a bucket array of capacity 128 and we want to add two objects obj1 and obj2 whose hash codes are 138 and 266 respectively. Applying the simple hash function defined above yields:
Array index for obj1 = 138 % 128 = 10
Array index for obj2 = 266 % 128 = 10
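The arithmetic above can be checked in code. Here is a minimal sketch of such a hash function (bucketIndex is a name chosen for illustration); note that it uses Math.floorMod rather than the % operator, because hashCode() may return a negative value in Java and a plain remainder would then produce a negative, out-of-bounds index.

```java
public class HashFunctionDemo {
    // Maps an object's hash code to a bucket array index.
    // Math.floorMod always yields a non-negative result, unlike %.
    static int bucketIndex(Object obj, int capacity) {
        return Math.floorMod(obj.hashCode(), capacity);
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int value itself, so these
        // reproduce the worked example: 138 % 128 == 266 % 128 == 10.
        System.out.println(bucketIndex(138, 128)); // 10
        System.out.println(bucketIndex(266, 128)); // 10
    }
}
```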
When the hash function returns the same index location for different objects we have what is called a collision. One way to deal with collisions is to use a simple technique called chaining. It works by storing a linked list of objects allocated to the same bucket array index. We will see how that works in practice in the next post.
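As a rough preview of the idea (the class and method names here are placeholders, not the implementation we will build in the next post), chaining can be sketched like this: each bucket holds a linked list of the elements whose hash codes map to that bucket's index.

```java
import java.util.LinkedList;

public class ChainedSet {
    private final LinkedList<Object>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedSet(int capacity) {
        buckets = (LinkedList<Object>[]) new LinkedList[capacity];
    }

    private int indexFor(Object obj) {
        // floorMod guards against negative hash codes
        return Math.floorMod(obj.hashCode(), buckets.length);
    }

    public void add(Object obj) {
        int i = indexFor(obj);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        if (!buckets[i].contains(obj)) buckets[i].add(obj); // no duplicates
    }

    public boolean contains(Object obj) {
        int i = indexFor(obj);
        return buckets[i] != null && buckets[i].contains(obj);
    }
}
```

With a capacity of 128, adding the colliding objects from the example above simply grows the list in bucket 10; both remain retrievable.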
The probability of collisions is inversely proportional to the capacity of the bucket array: larger bucket arrays will produce fewer collisions for a given hash function. The ratio between the number of elements in a hash collection and the capacity of the bucket array is called the load factor of the hash collection.
A hash collection with a smaller load factor will produce fewer collisions and therefore be more efficient, at the cost of more memory.
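As an illustration (the numbers are made up): 96 elements stored in a bucket array of capacity 128 give a load factor of 0.75, which happens to be the default load factor at which java.util.HashMap resizes its bucket array.

```java
public class LoadFactorDemo {
    public static void main(String[] args) {
        int elements = 96;   // elements stored in the collection
        int capacity = 128;  // bucket array capacity
        double loadFactor = (double) elements / capacity;
        System.out.println(loadFactor); // 0.75
    }
}
```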
Object.hashCode() and Object.equals()
Classes whose instances are meant to be used in a hash-based collection should override both the hashCode() and equals() methods of java.lang.Object and obey their general contracts.
The implementation of hashCode() in java.lang.Object returns distinct integers for distinct objects (as far as is reasonably practical), even when the objects are logically equal, and therefore may not produce the expected result when used in a hash-based collection.
Likewise, the java.lang.Object implementation of equals() is not ideal, as an object is only equal to itself, and should be overridden.
The general contract and algorithms for overriding hashCode() and equals() are described at length in Effective Java by Joshua Bloch. The key rule is that equal objects must have equal hash codes.
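Here is a minimal sketch of such a pair of overrides for a hypothetical Fruit class (the class and its field are invented for illustration). The hash code is derived from exactly the fields compared by equals(), which is what guarantees the key rule.

```java
import java.util.Objects;

public final class Fruit {
    private final String name;

    public Fruit(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Fruit)) return false;
        return Objects.equals(name, ((Fruit) o).name);
    }

    @Override
    public int hashCode() {
        // Computed from the same field used by equals(),
        // so equal objects always have equal hash codes.
        return Objects.hash(name);
    }
}
```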
In practice all the major IDEs are capable of generating reasonably correct hashCode() and equals() methods for a class.
Classes whose instances are meant to be used as keys in a hash map should also be immutable: if a key's state changes after insertion, its hash code may change with it, and the entry can no longer be found. It is also recommended that the class implement the java.lang.Comparable interface.
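To see why immutability matters, consider this hypothetical MutableKey class (invented for illustration). Mutating the key after it has been inserted changes its hash code, so a subsequent lookup searches the wrong bucket and the entry is effectively lost.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class MutableKey {
    String name;
    MutableKey(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof MutableKey && Objects.equals(name, ((MutableKey) o).name);
    }
    @Override public int hashCode() { return Objects.hash(name); }
}

public class MutableKeyDemo {
    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey("Apple");
        map.put(key, "fruit");

        key.name = "Banana"; // the key's hash code has now changed

        System.out.println(map.containsKey(key)); // false: wrong bucket searched
    }
}
```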
A Simple Hash Set Implementation
In the second part of this post we will look at a simple hash set implementation to see how it all hangs together in practice. See you then.