What does it mean if a data structure's members are stored by hash value rather than by index?

by Austin Conlon   Last Updated September 11, 2019 19:26 PM - source

From Picking the right data structure in Swift:

Like we took a look at in “The power of sets in Swift”, one of the big advantages that sets have over arrays is that both inserts and removals can always be performed in constant (O(1)) time, since members are stored by hash value, rather than by index.

Answers 2

You are really asking about the difference between an array and a hash-based collection (hash map, hash table, hash set). This is covered in any computer science "Data Structures" course, and a high-level overview of each is easy to find online. Highly recommended :)

In short: You can imagine an array as a long shelf with cells, where each cell has a sequence number (a.k.a. index):

Array: [ dog ][ cat ][ mouse ][ fox ]...
where dog is at cell #0, cat is at #1 and so on.

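Now, in an array you can retrieve objects by cell index, like "Give me the content of cell #1". But to find out whether you have a "mouse" in your array, you have to check the cells one by one until you find it or run out of cells. That takes time proportional to the length of the array (O(n)), which is inefficient for large arrays.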
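As a minimal Swift sketch of that difference, the snippet below reads a cell by index and then searches for a value. The `linearSearch` helper is hypothetical and only exists to count comparisons; it is not part of the standard library.

```swift
let animals = ["dog", "cat", "mouse", "fox"]

// Access by index: jumps straight to the cell.
let second = animals[1]

// Membership test: Array.contains compares elements one by one.
let hasMouse = animals.contains("mouse")

// The same search written out by hand, counting comparisons.
// (Hypothetical helper, for illustration only.)
func linearSearch(_ items: [String], for target: String) -> (found: Bool, comparisons: Int) {
    var comparisons = 0
    for item in items {
        comparisons += 1
        if item == target {
            return (true, comparisons)
        }
    }
    return (false, comparisons)
}

let hit = linearSearch(animals, for: "mouse")   // found after 3 comparisons
let miss = linearSearch(animals, for: "owl")    // not found after checking all 4 cells
```

A miss is the worst case: every cell has to be checked before the search can give up.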

Sets (which, like hash maps, are built on a hash table) store objects using another kind of index, the "hash code": a number computed from the object itself by a hash function (without going into details). Different objects usually, though not always, get different hash codes. So cat and mouse will most likely have different hash codes, and because the hash code tells the Set where to look, it is very efficient to find out if you have a "mouse" in the Set.
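Here is a short sketch of the same animals in a Swift `Set`. It uses only standard library calls (`contains`, `insert`, `remove`, `hashValue`). Note that Swift seeds its hashing randomly per process, so the actual hash numbers differ from run to run and are not shown.

```swift
var animals: Set<String> = ["dog", "cat", "mouse", "fox"]

// Membership is answered via the hash code, without scanning every member.
let hasMouse = animals.contains("mouse")
let hasOwl = animals.contains("owl")

// insert reports whether the member was actually added.
let firstInsert = animals.insert("owl").inserted    // true: "owl" was new
let secondInsert = animals.insert("owl").inserted   // false: duplicates are ignored

// remove returns the removed member, or nil if it was not present.
let removed = animals.remove("dog")

// Every Hashable value exposes a hash code. The number itself is an
// implementation detail and changes between program runs.
let code = "mouse".hashValue
```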

Dima Gershman
September 09, 2019 14:13 PM

Arrays are allocated as single, large blocks of memory and entries are accessed by their indexes. The order of entries is fixed and they need have no particular identity apart from their position in the array.

Other more complex data structures allow one to store objects identified and accessed using some sort of key. (Hash tables, sets, dictionaries, ...) Let's call these "keyed collections". Some objects have a natural key e.g. "SocialSecurityNumber" but what should one do if a key is needed and there are no obvious candidate field/s in our data object?

Hashing is a technique which sets out to derive a "fairly unique identity" to associate with an object. Think of it as mapping (arbitrary) data to numbers.

  • Although there are some "standard hashing techniques", this is still a field that is evolving and it involves some interesting mathematics.
  • Hashes have purposes including secure hashing (intended to detect and prevent deliberate tampering with data), error detection and - in this case - keyed (or hashed) data access.
  • A (non-secure) hash algorithm generally has to be as fast as possible, BUT optimising for speed usually involves some sort of trade-off against the "fairly unique" part of the mapping requirement (while secure hashing is unavoidably, and sometimes deliberately, depending on usage, slower and more expensive).
  • Hashing cannot (ever) guarantee that a given hash value is unique to an object and so the focus has to be on minimising the occurrence of "collisions" and optimising how to deal with them when they occur. This is a difficult subject on its own, when you consider that data has to be treated as "arbitrary" - either appearing to be random, to contain sequences/patterns and/or with duplication.
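To make the points above concrete, here is a toy sketch of a fast, non-secure string hash and of how collisions arise. The `toyHash` and `bucketIndex` functions are hypothetical helpers written for this illustration (a djb2-style multiply-and-add); this is not how Swift's own `Hasher` works.

```swift
// A deliberately simple, non-secure string hash (djb2 style).
// Hypothetical helper for illustration only.
func toyHash(_ text: String) -> UInt64 {
    var hash: UInt64 = 5381
    for byte in text.utf8 {
        // &* and &+ wrap around on overflow instead of trapping.
        hash = hash &* 33 &+ UInt64(byte)
    }
    return hash
}

// A hash table reduces the hash to one of a fixed number of buckets.
func bucketIndex(for text: String, bucketCount: Int) -> Int {
    return Int(toyHash(text) % UInt64(bucketCount))
}

// The same input always produces the same hash...
let stable = toyHash("dog") == toyHash("dog")

// ...but different inputs can produce the same hash. For this function,
// "ab" and "bA" collide because 33 * 1 + (65 - 98) == 0.
let fullCollision = toyHash("ab") == toyHash("bA")

// Even with a perfect hash, 5 keys cannot fit into 4 buckets without
// at least two of them sharing a bucket.
let words = ["dog", "cat", "mouse", "fox", "owl"]
var buckets = [[String]](repeating: [], count: 4)
for word in words {
    buckets[bucketIndex(for: word, bucketCount: 4)].append(word)
}
let largestBucket = buckets.map { $0.count }.max() ?? 0
```

Keys that land in the same bucket have to be told apart by comparing them directly, which is the extra work that collision handling has to keep small.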

With that said, assuming we have a "good" hash function, we can - in principle at least - store arbitrary objects in keyed collections.
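In Swift, a type can be stored in a `Set` or used as a `Dictionary` key once it conforms to `Hashable`. The sketch below shows two options under assumed, made-up types: `Point` relies on the conformance the compiler synthesizes from its fields, while `Person` hashes only a natural key. Both type names, their fields, and the placeholder values are hypothetical.

```swift
// No natural key: let the compiler derive == and hashing from all fields.
struct Point: Hashable {
    let x: Int
    let y: Int
}

// Natural key: identity and hash are based on one field only.
struct Person: Hashable {
    let socialSecurityNumber: String
    let name: String

    static func == (lhs: Person, rhs: Person) -> Bool {
        return lhs.socialSecurityNumber == rhs.socialSecurityNumber
    }

    func hash(into hasher: inout Hasher) {
        // Must use the same field(s) as ==, so equal values hash equally.
        hasher.combine(socialSecurityNumber)
    }
}

var visited: Set<Point> = []
visited.insert(Point(x: 1, y: 2))
visited.insert(Point(x: 1, y: 2))   // equal to the first point, so not added again

var people: Set<Person> = []
people.insert(Person(socialSecurityNumber: "000-00-0000", name: "A. Example"))
let sameKey = Person(socialSecurityNumber: "000-00-0000", name: "Different Name")
let alreadyKnown = people.contains(sameKey)
```

The one rule to keep in mind is that two values that compare equal must produce the same hash; the reverse is not required.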

Important considerations

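  1. Arrays offer extremely fast sequential and random access (by index), while inserting or deleting anywhere but the end is slow because the later elements must be shifted, and growing the array may require reallocating and copying the whole block.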
  2. Keyed collections have the advantage you quote of offering extremely fast inserts and deletes, but they are very granular in nature and often introduce complexities such as memory fragmentation (memory management is an overhead, added complexity means added cost).
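  3. Performance degrades rapidly when collisions start occurring, because colliding entries have to be searched one by one; in the worst case a lookup is no faster than scanning an array.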
  4. There is no such thing as a free lunch and calculating hashes is relatively expensive (compared to simply using an index value or stored value).
  5. There is a specific downside to hashes that "natural keys" and indexes do not have, namely that they do not offer a natural ordering/sequence. Processing objects sequentially according to their hash values is tantamount to processing them randomly.
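The ordering point in particular is easy to see in Swift. In this small sketch, the array keeps its order, while the set's iteration order is unspecified and can change between runs, so an explicit sort is needed whenever order matters.

```swift
let ordered = ["dog", "cat", "mouse", "fox"]
let unordered = Set(ordered)

// Array iteration always follows the order in which elements were stored.
let firstInArray = ordered.first

// Set iteration order is unspecified; do not rely on it.
// Both collections still hold the same members.
let sameMembers = unordered == Set(ordered)

// To process set members in a predictable sequence, sort them first.
// This costs O(n log n) and produces a new array.
let sortedNames = unordered.sorted()   // ["cat", "dog", "fox", "mouse"]
```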

It is always important to choose data structures appropriate to their intended use (but that's what the link you quote is all about;-)

September 11, 2019 07:12 AM
