Category Archives: Algorithms

How to Implement a Simple HashSet in Java

In the first part of this post we discussed the design of hash-based collections and why classes must obey the general contract stipulated by the equals() and hashCode() methods of the java.lang.Object class. Now we will put in practice what we have discussed by writing a simple implementation of a HashSet in Java.

Our simple HashSet will implement the following interface:

import java.util.Iterator;

public interface SimpleSet {

    // Adds the specified element to this set if it is not already present
    boolean add(Object element);
    // Removes the specified element from this set if it is present
    boolean remove(Object element);
    // Returns true if this set contains the specified element
    boolean contains(Object element);
    // Returns an iterator over the elements in this set
    Iterator iterator();
    // Returns the number of elements in this set
    int size();
}

 
Now let’s move to the implementation details.

Since we will be adopting chaining to resolve collisions, we need an inner class that we can use to build a linked list of collided objects. We will call the inner class Entry.

public class SimpleHashSet implements SimpleSet {

    private static class Entry {
        Object key;
        Entry next;
    }

    // ...

 
The bucket array is an array of Entry instances. We also need a private property to store the size of the Set.

    private Entry[] buckets;

    private int size;

    // ...

 
Our simple HashSet constructor takes a single int argument which defines the capacity of the bucket array:

    public SimpleHashSet(int capacity) {

        buckets = new Entry[capacity];
        size = 0;
    }
    
    // ...

 
Next we need to define a hash function that computes the index into the bucket array giving the hash value of an object:

     private int hashFunction(int hashCode) {

        int index = hashCode;
        if (index < 0) { index = -index; }
        return index % buckets.length;
    }

    // ...

 
We can now start implementing the interface methods.

The algorithm for the add method first computes the index into the bucket array using our hash function.

If there is already one or more entries in that bucket location they are compared for equality with the object to add. If any of the comparisons is positive then the method returns false as a Set cannot store duplicates.

If no duplicates are found than a new Entry is created and assigned to the computed bucket location and linked to the chain if one exists. The size property is incremented and the method returns true to indicate that the object has been added to the Set.

    @Override
    public boolean add(Object element) {

        int index = hashFunction(element.hashCode());
        Entry current = buckets[index];

        while (current != null) {
            // element is already in set
            if (current.key.equals(element)) { return false; }
            // otherwise visit next entry in the bucket
            current = current.next;
        }
        // no element found so add new entry
        Entry entry = new Entry();
        entry.key = element;
        // current Entry is null if bucket is empty
        // if it is not null it becomes next Entry
        entry.next  = buckets[index];
        buckets[index] = entry;
        size++;
        return true;
    }

    // ...

 
The algorithm for the remove method also first computes the index into the bucket array using our hash function.

If there is already one or more entries in that bucket location they are compared for equality with the object to remove. If a match is found the object is removed and the chain, if present, adjusted. The size property is decremented and the method returns true to indicate that the object has been removed.

If no match is found then the method returns false, as there is nothing to remove.

    @Override
    public boolean remove(Object element) {

        int index = hashFunction(element.hashCode());
        Entry current = buckets[index];
        Entry previous = null;

        while (current != null) {
            // element found so remove it
            if (current.key.equals(element)) {

                if (previous == null) {
                    buckets[index] = current.next;
                } else {
                    previous.next = current.next;
                }
                size--;
                return true;
            }

            previous = current;
            current = current.next;
        }
        // no element found nothing to remove
        return false;
    }

    // ...

 
The algorithm for the contains method again first computes the index into the bucket array using our hash function.

If there is already one or more entries in that bucket location they are compared for equality with the object passed as argument. If a match is found the method returns true, otherwise it returns false.

    @Override
    public boolean contains(Object element) {

        int index = hashFunction(element.hashCode());
        Entry current = buckets[index];

        while (current != null) {
            // check if node contains element
            if (current.key.equals(element)) { return true; }
            // otherwise visit next node in the bucket
            current = current.next;
        }
        // no element found
        return false;
    }

    // ...

 
The iterator method simply returns an instance of an inner class that implements the java.util.Iterator interface.

    @Override
    public Iterator iterator() {
        return new SimpleHashSetIterator();
    }

    // ...

 
The code for the Iterator implementation follows:

    class SimpleHashSetIterator implements Iterator {

        private int currentBucket;
        private int previousBucket;
        private Entry currentEntry;
        private Entry previousEntry;

        public SimpleHashSetIterator() {
            currentEntry = null;
            previousEntry = null;
            currentBucket = -1;
            previousBucket = -1;
        }

        @Override
        public boolean hasNext() {

            // currentEntry node has next
            if (currentEntry != null && currentEntry.next != null) { return true; }

            // there are still nodes
            for (int index = currentBucket+1; index < buckets.length; index++) {
                if (buckets[index] != null) { return true; }
            }

            // nothing left
            return false;
        }

        @Override
        public Object next() {

            previousEntry = currentEntry;
            previousBucket = currentBucket;

            // if either the current or next node are null
            if (currentEntry == null || currentEntry.next == null) {

                // go to next bucket
                currentBucket++;

                // keep going until you find a bucket with a node
                while (currentBucket < buckets.length &&
                        buckets[currentBucket] == null) {
                    // go to next bucket
                    currentBucket++;
                }

                // if bucket array index still in bounds
                // make it the current node
                if (currentBucket < buckets.length) {
                    currentEntry = buckets[currentBucket];
                }
                // otherwise there are no more elements
                else {
                    throw new NoSuchElementException();
                }
            }
            // go to the next element in bucket
            else {

                currentEntry = currentEntry.next;
            }

            // return the element in the current node
            return currentEntry.key;

        }

    }

    // ...

 

The implementation of the size method simply returns the value of the size property:

    public int size() {
        return size;
    }

    // ...

 
Finally we override the java.lang.Object toString method to produce a visual representation of our HashSet.

    @Override
    public String toString() {

        Entry currentEntry = null;
        StringBuffer sb = new StringBuffer();

        // loop through the array
        for (int index=0; index < buckets.length; index++) {
            // we have an entry
            if (buckets[index] != null) {
                currentEntry = buckets[index];
                sb.append("[" + index + "]");
                sb.append(" " + currentEntry.key.toString());
                while (currentEntry.next != null) {
                    currentEntry = currentEntry.next;
                    sb.append(" -> " + currentEntry.key.toString());
                }
                sb.append('\n');
            }
        }

        return sb.toString();
    }

    // ...

 
The full source code can be found on GitHub at the following URLs:

SimpleSet (interface)
SimpleHashSet (implementation)
SimpleHashSetTest (unit tests)

This is of course a very basic implementation of a HashSet in Java. It is useful to study and understand the main principles behind hash-based collections, or to code on interviews to demonstrate your knowledge of a hash-based data structure and its algorithms.

If you want to study a production quality implementation with all the required optimizations, I suggest that you take a look at the source code of java.util.HashSet. To find out how to look at the Java library source code in Eclipse please read this post.

Thank you for reading, feel free to post any questions or comments.

Hash-based collections and the Object class

Nick Bit and Dick Byte are aspiring computer scientists, always up to something. This time they get to grips with hash-based collections.

Nick: “Hey dude, let’s build a data structure in Java that does everything in O(1)”
Dick: “Sounds cool. You can’t get any better than constant time”

[Nick and Dick are of course talking about algorithm time complexity measure and the Big O notation]

Dick: “Where do we start?”
Nick: “We could build it on top of an array. If you know the index of an element in the array you can do everything in O(1)”
Dick: “What do you mean?”
Nick: “Take a look at this code. Knowing the index allows me to insert, access and delete an entry in constant time”

int apple_index = 4;

String[] fruits = new String[24];
String apple = "Apple";
// insert
fruits[apple_index] = apple;
// access
String lunch = fruits[apple_index];
// remove
fruits[apple_index] = null;

 

Dick: “I see… but how do we relate an object with an integer number that we can use as the index into the array without hard-coding it?”
Nick: “ As it happens the java.lang.Object class already provides this mechanism for us with the hashCode() method. It attempts to return distinct integers for distinct objects”
Dick: “Cool. So if we have an array big enough we can store any object in it using the hash code as the index and get constant time performance for all operations!”
Nick: “Let’s do it. Why don’t you start writing the code while I get us a coffee?”
Dick: “OK”

[Dick gets started with the code, but don’t try this at home]

Object[] objArray = new Object[Integer.MAX_VALUE];
String banana = "Banana";
objArray[banana.hashCode()] = banana;

 

Nick: “I am back”
Dick: “Wow! I think I just blew the JVM…”
Nick: “Let me see”

Exception in thread "main" java.lang.OutOfMemoryError: 
Requested array size exceeds VM limit

   at com.nickanddick.App.main(App.java:7)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 

Nick: “That’s daft. We are asking the JVM to allocate memory for 2^16 object instances. That’s more than 2 billion objects!”
Dick: “Wow! That’s a lot of memory”
Nick: “We have been concentrating on time complexity but forgot to take space complexity into consideration”
Dick: “That’s efficient memory usage, right?”
Nick: “Right. We don’t really need an array this big. We just need an array big enough to contain the number of elements required by a particular application”
Dick: “But if we use a smaller array, what happens if the hash code returned for an object is outside the array bounds?”

[Good question. It is time we give Nick and Dick some help]

Bucket Arrays and Hash Functions

To efficiently implement their data structure they need the two fundamental components of a hash-based collection: a bucket array and a hash function.

A bucket array is an ordinary array of a pre-defined capacity (number of elements).

A hash function is a function that translates the hash code of an object into a bucket array index. A simple hash function could for example determine the index by finding the remainder of the division of the hash code by the bucket array length like this:

index = hashCode % array.length

[The % operator is called the remainder or modulus operator]

There a number of considerations to take into account when implementing a hash-based collection.

Collisions

Hash-based collections do not store duplicate values, as duplicate values yield the same index into the bucket array.

There is however a problem. It is possible that a hash function returns the same index for different objects. Suppose we have a bucket size of 128 and we want to add to it two objects obj1 and obj2 whose hash codes are respectively 138 and 266. Applying the simple hash function defined above yields:

Array index for obj1 = 138 % 128 = 10
Array index for obj2 = 266 % 128 = 10

When the hash function returns the same index location for different objects we have what is called a collision. One way to deal with collisions is to use a simple technique called chaining. It works by storing a linked list of objects allocated to the same bucket array index. We will see how that works in practice in the next post.

Load Factor

The probability of collisions is proportional to the capacity of the bucket array. Larger bucket arrays will produce a smaller number of collisions for a given hash function. The relation between the number of elements in a hash collection and the size of the bucket array is called the load factor of the hash collection.

A hash collection with a smaller load factor will produce a smaller number of collisions and therefore will be more efficient.

Object.hashCode() and Object.equals()

Classes whose instances are meant to be used in a hash-based collection should override both the hashCode() and equals() methods of java.lang.Object and obey their general contracts.

The implementation of hashCode() in java.lang.Object returns distinct integers for distinct objects (as much as it is reasonably practical) even when the objects are logically equal and therefore might not produce the expected result when used in a hash-based collection.

https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#hashCode–

Likewise, the java.lang.Object implementation of equals() is not ideal as an object is only equals to itself, and should be overriden.

https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#equals-java.lang.Object-

The general contract and algorithms for overriding hashCode() and equals() are described at length in Effective Java by Joshua Bloch. The key rule is that equal objects must have equal hash codes.

In practice all the major IDEs are capable to generate reasonably correct hashCode() and equals() methods for a class.

Other Considerations

Classes whose instances are meant to be used as keys in a Hash Map should also be immutable. It is recommended that the class also implements the java.lang.Comparable interface.

A Simple Hash Set Implementation

In the second part of this post we will look at a simple HashSet implementation to see how it all hangs together in practice. See you then.