Saturday, September 13

When Are Sorted Lists Better Than Hashmaps?

Maps are wonderful things. Unlike arrays or lists, they are keyed and so allow O(1) access, which means their size doesn't affect how long it takes to retrieve a specific (that is, known) item. In an array (of key/value pairs) you'd have to check each and every item for what you're looking for (which is in O(n), or depends on the length of the list).

This can be improved upon if the array is sorted in which case a Binary Search (check the middle value and see if what you're looking for is in the first or second half and then repeat) which would give an access time of O(log n). Unfortunately sorting takes some effort too - either by searching for where to insert a new value (using binary search, above) in which case you'd also have to worry about shifting existing values up a space (if we've implemented using an array, say) or sorting on retrieval (which then only has to be done the one time after each insert). Either way, sorting takes a massive O(n log n) time.

So to recap:

Hashmap: Insert O(1); Retrieval O(1)
Sorted List: Insert O(log n); Retrieval O(log n)


Hmm. It's not looking very good for a sorted list then, is it?

There are issues with maps though. First up, their access speed comes at a cost of space - generally to manage the keys and values store. This may not be an issue for those of us with plenty of memory, although for enterprise applications dealing with millions of items it may be something to think about.

Keys in a hashmap also need to be hashable, or in other words reducible to a unique (or at least relatively rare) number or index, and this can be a hard problem if keys are technically very similar. Having said that, we can define our own hashing functions if we know what kind of keys we'll have.

But there is another use of a hashmap where it doesn't do as well as a list would, and that's in iteration. Although traversing items takes O(n) time in both hashmaps and sorted lists, the straightforwardness of a list will make it faster in most, if not all, cases which require processing of each item in it.

Finally, generally in the real world talking about big-oh or O(n) time can be a bit misleading as the hidden constants involved may outweigh the benefits taken from a good algorithm. So if it takes ages to figure out the hash of a key, it may just be worth a binary search instead (provided the cost of comparing is low too).

So to recap, although a hashmap is generally the best way to store data you'll need random access to, there are times when a sorted list might be better. Things to consider are:

  1. The number of times you'll be randomly accessing items in the collection.
  2. The amount of available space.
  3. The cost of hashing a key versus comparing them during a binary search.
  4. The number of items being stored (a binary search on a hundred items won't take very long).
  5. The relative staticness of the data - if a list isn't added to after initialisation, it only has to be sorted once.
  6. The need to iterate through each item of a data structure.
  7. The need to have instant access to a sorted view of the data.
In my current project, I will be using regular hashmaps for static and globally accessible caches, and sorted lists for those situations where a collection will be repeatedly created (to save space), regularly iterated over, and rarely randomly accessed.

No comments:

Post a Comment