-8

I have a very large ArrayList in Java, and I often need to check whether it contains a particular value. This has proven very slow.

Then I discovered that you can use a hash-based data structure instead, because apparently a method like containsKey() is O(1).

Nice result, but how is this achieved? Clearly it doesn't go key by key checking for a match.

I imagine it is similar to arrays (Why is the complexity of fetching a value from an array O(1)?). I understand it for arrays, since you only have to do some arithmetic to get the address of the desired data, but I am not quite sure how this applies to hash tables.

Saturn

3 Answers

9

The reason hash access is O(1) is actually quite different from the reason array access is.

Arrays are contiguous memory areas. To read element 15, you only have to multiply the element size by 15 and add the start address. Both addition and multiplication are O(1) (for fixed-size machine integers, it takes just as long to add two big numbers as two small ones), and since computers provide O(1) memory access, the overall complexity is still O(1).
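
As a purely conceptual sketch (Java does not expose raw addresses, and the base address and element size below are made-up numbers; real JVMs also add object headers and alignment):

    public class ArrayAddressDemo {
        public static void main(String[] args) {
            // Conceptual only: the idea is address(a[i]) = baseAddress + i * elementSize.
            long baseAddress = 0x1000L; // hypothetical start of the array's data
            int elementSize = 4;        // e.g. 4 bytes per int
            int index = 15;

            long address = baseAddress + (long) index * elementSize; // one multiply, one add: O(1)
            System.out.println(Long.toHexString(address));           // prints "103c"
        }
    }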

A hash table works quite differently. It stores things at predictable places, but those indexed places are not user-visible. A string that you put into a hash table isn't stored at an address you specify; instead the table takes the content of that string and computes a fitting address by hashing that content to a number drawn from a small set of possibilities. Then it stores the value at that place, and if you ask for that key again, it re-computes the hash and looks up that cell.
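
A minimal sketch of that computation, assuming a table of 16 buckets and a plain modulo step (real implementations such as java.util.HashMap additionally mix the hash bits before reducing them to an index):

    public class BucketIndexDemo {
        public static void main(String[] args) {
            String key = "hello";
            int capacity = 16; // illustrative number of buckets

            // Reduce the key's content to a number...
            int hash = key.hashCode();
            // ...and map that number into the small set of bucket indices.
            int bucket = Math.floorMod(hash, capacity);

            System.out.println("hash = " + hash + ", bucket = " + bucket);
            // Looking the key up later re-does exactly this computation
            // and therefore lands in the same bucket.
        }
    }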

Since the set of possible hash values is smaller than the set of possible keys, you can have collisions, so you might have to spend a little more time finding the right value when more than one of them has been put into the same bucket; but it can be shown that this happens infrequently and doesn't affect the overall expected complexity, which is still O(1).

So you see that an array can find things fast because you tell it where to load from; a hash table returns things fast because it knows where it put them, and can reconstruct that computation efficiently.

Kilian Foth
2

In Java, every object has a hashCode() method that returns a 32-bit integer value, which is always the same for the same object. In the simplest version, a hashtable just contains an array of size 2^32, and a key-value pair is stored at the index corresponding to the hash code of the key. Since array access by index is O(1), hashtable access by key (for storing or retrieving) can also be O(1).

Of course it is a bit more complex in reality. First, you can always have collisions, that is, two different objects giving the same hash code. So the items are not stored directly in the array; rather, each array index contains a "bucket", which is an ordinary list of key-value pairs. (In the Java hashtable the buckets are implemented as linked lists.) You have to search through the bucket to find the item, and this search is O(n), but unless your hashtable contains an extreme number of items (or your hash algorithm is bad), the items will be distributed pretty evenly across the array and each bucket will contain only a few items. (Only one in the best case.)
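
A stripped-down sketch of that scheme (this is not the real java.util.HashMap, just the idea: an array of buckets, each bucket a singly linked list of entries):

    import java.util.Objects;

    // A toy hash map: an array of buckets, each bucket a singly linked list of entries.
    public class ToyHashMap<K, V> {
        private static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;
            Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
        }

        @SuppressWarnings("unchecked")
        private Node<K, V>[] buckets = (Node<K, V>[]) new Node[16];

        private int indexFor(Object key) {
            // Squeeze the 32-bit hash code into the small index range of the array.
            return Math.floorMod(Objects.hashCode(key), buckets.length);
        }

        public void put(K key, V value) {
            int i = indexFor(key);
            for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
                if (Objects.equals(n.key, key)) { n.value = value; return; } // key already present
            }
            buckets[i] = new Node<>(key, value, buckets[i]); // prepend a new entry to the bucket
        }

        public V get(Object key) {
            int i = indexFor(key);                                   // O(1): jump straight to the bucket
            for (Node<K, V> n = buckets[i]; n != null; n = n.next) { // short walk through the bucket
                if (Objects.equals(n.key, key)) return n.value;
            }
            return null;
        }
    }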

Second, you will not initially create an array of size 2^32, since that would be a crazy waste of space. Instead you initially create a smaller array, where each entry maps to multiple hash codes. This of course leads to a higher risk of collisions. You keep track of the number of entries, and when it reaches a certain threshold you double the size of the array and re-distribute the items. Of course this also has a performance cost. There is a design tradeoff in deciding when to resize the array: the bigger the array relative to the number of items, the fewer collisions and hence the better the performance, but also the more wasted space.
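
A sketch of that threshold logic, assuming fields like those in the toy map above and the 0.75 load factor that Java's maps happen to use by default (this is meant to be read as additions inside such a class, not as a standalone program):

    // Additions to the toy map above: grow once size exceeds capacity * load factor.
    private int size = 0;                           // incremented by put() for each new key
    private static final double LOAD_FACTOR = 0.75; // design tradeoff: space vs. collision rate

    private void resizeIfNeeded() {
        if (size <= buckets.length * LOAD_FACTOR) return;

        Node<K, V>[] old = buckets;
        @SuppressWarnings("unchecked")
        Node<K, V>[] bigger = (Node<K, V>[]) new Node[old.length * 2]; // double the array
        buckets = bigger;

        // Re-distribute every entry: its bucket index depends on the new length.
        for (Node<K, V> head : old) {
            for (Node<K, V> n = head; n != null; n = n.next) {
                int i = indexFor(n.key);
                buckets[i] = new Node<>(n.key, n.value, buckets[i]);
            }
        }
    }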

So finding an item is O(n) in the worst case, where all items happen to end up in the same bucket, but O(1) in the common case (given a well-behaved hash function; since anybody can override hashCode(), this is of course not guaranteed: if you write int hashCode(){return 17;} you consistently get worst-case performance). And if the number of items grows much larger than the number of possible hash codes (2^32), the buckets start to grow and again you get O(n) lookup. On 32-bit systems you would run out of memory before this ever happened, but with 64-bit memory it could theoretically become an issue.
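
For illustration, a key type like the one mentioned above forces every entry into a single bucket (BadKey is a made-up class; note that recent JDKs soften this particular worst case by turning very large buckets into trees):

    import java.util.HashMap;
    import java.util.Map;

    public class DegenerateKeyDemo {
        // Every instance hashes to 17, so all entries share one bucket.
        static final class BadKey {
            final int id;
            BadKey(int id) { this.id = id; }
            @Override public int hashCode() { return 17; }
            @Override public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).id == id;
            }
        }

        public static void main(String[] args) {
            Map<BadKey, String> map = new HashMap<>();
            for (int i = 0; i < 10_000; i++) {
                map.put(new BadKey(i), "value " + i);
            }
            // With the classic linked-list buckets described above, this lookup
            // has to walk past thousands of colliding entries.
            System.out.println(map.get(new BadKey(9_999)));
        }
    }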

Adding an item is also O(1) in the common case, but O(n) if the add triggers a resize of the array. However, the aggregate cost of the resize operations is predictable and proportional to the number of items, so the amortized cost of adds is still O(1). This is not the case for the worst case with lookups: if we are unlucky and all items end up in the same bucket, every lookup has worst-case performance and there is no way to amortize this cost.
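
A quick way to see the amortized claim, assuming the table doubles each time it fills up while growing to n items:

    // Summing the copies done by successive doublings: 1 + 2 + 4 + ... + n is just under 2n,
    // i.e. O(n) total work spread over n insertions, or O(1) per add on average.
    public class ResizeCostDemo {
        public static void main(String[] args) {
            int n = 1 << 20; // grow to about a million items
            long copies = 0;
            for (int cap = 1; cap <= n; cap *= 2) {
                copies += cap; // a resize at capacity 'cap' copies about 'cap' items
            }
            System.out.println(copies + " copies for " + n + " items (2n = " + (2L * n) + ")");
        }
    }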

Of course both the worst case and the common or average case may be relevant. In a real-time system it is pretty important to know the worst-case performance of an operation. For most business applications, the average case is the more important metric.

JacquesB
1

When talking about measurements (even abstract measurements such as "algorithmic complexity") you should always specify exactly what you are measuring, otherwise what you say is completely meaningless.

In this particular case, you are simply saying "hash tables are O(1)", but you are not saying what exactly you are measuring.

In particular, accessing a value by key in a (properly designed) hash table has

  • a worst-case step complexity of O(n) (or more precisely, the worst-case step complexity of whatever data structure is used for the "bucket", which usually is a simple linked list)
  • an expected (average-case) step complexity of O(1), assuming a reasonably well-behaved hash function

In other words, your whole confusion is due to the fact that you are thinking about the worst case while the others are talking about the expected (average) case, and except for @Kilian Foth nobody bothered to mention that.

The distinction is similar to the one for adding an element to a dynamically-sized array, which is O(n) in the worst case but O(1) amortized; @JacquesB explains how that amortization applies to hash table resizing.

Jörg W Mittag