Hashing & Consistent Hashing
In real-time applications data storing, searching, and retrieving is the very common use case. So the way that we choose to store/search/retrieve must be very efficient.
Let's proceed with an example — We have a list of items and the task is to search for an item in this list. Now we have two scenarios here —
- When the item list is sorted
- When the item list is not sorted
When the list is sorted then we can use binary search and get the item easily. But when the items list is not sorted then we can not use the binary search and searching of an item is not easy as we have to check each item in the list which is not very efficient way to do when we have large data sets(thousands or millions in size).
The problem statement
In the unsorted list, we don’t know the location of the item that’s why we need to go through each item to compare and find the location. If somehow we know the location of the item then item search could be completed very easily.
This is where hashing comes into picture
What is Hashing?
Hash Function is any function that can be used to map data of arbitrary size into fixed-size values. These values are used to index a fixed-size table called Hash Table. The use of the Hash Function to index a Hash Table is called Hashing.
How Hashing works and helps in searching
To work with Hash Function we have to prepare data in key-value format. And each value will be stored in the table against the respective key. The Hash function will be applied to this key to map it with the fixed size table.
In this example, we have got three components
- Key-value pair of items
- Hash Function
- Fixed-size table which is also known as Hash Table
Let’s say we have a hash table of size 10 (the numbering or say indexing in the hash table is sequential)
Now the task is to apply the hash function on each key of items and map it to the hash table.
1 >> Hash Function >> 324
we have a table size of 10 so to get the exact location on the table we have to map the above hash code in such a way that it fits there. To do so we have to apply modulo which will give us the exact location on the table
1 >> Hash Function >> 324 >> %10 >> 4
Now store entry1 at the 4th location in the table.
Each key pair in the list is Entry
In the same way, we can apply the hash function for other entries
So far so good, we have mapped with the first three entries. Now apply same with the 4th one
Here we got the same index 7 where we already have data stored. This is what knows as Hash Collision
What to do now??
The solution here is the linked list — link the new entry with the previous one
This solved the hash collision issue, any time we got the same location that is already occupied we have to replace that single entry with the linked list and link each item to the same location
If you have used HashMap, then I hope you are able to relate it well — When you get the question “How HashMap works internally”
And the same mechanism of mapping the key with a hash table is applied when we are going to search/retrieve the particular item in the table. The search is performed using the Key, so you have to know the key in advance to find the particular pair/entry in the table.
Problems with Hashing?
The indexes in the hash table are pointing towards some actual physical memory location of the computer. As long as we have all the memory locations intact there is no issue but when some locations get corrupted then we won’t be able to fetch those memory locations with the above function as the size of the table is now changed and to make it right we have to re-map the entire hash table to accommodate the new hash table size.
If we have very large data set then it’s very difficult to do. And main problem here is even a single location corruption can make the entire hash table not reusable without remapping.
Let's take another real-time example when we have distributed systems and table index location is distributed over multiple computers, in this case, if any of the computers is down then those locations would be inaccessible and the above hashing technique will fail drastically here.
In all these cases the main issue is re-mapping of entire hash table
Consistent hashing comes to the rescue in this scenario which saves us from remapping the table entries.
We will try to understand this with the above example and will make changes to see how consistent hashing helps —
Change in Hash table — The hash table indexing has to be changed, earlier the indexing was sequential not it has to be any random number increasing order
Hash Function — It will be still the same. We’ll apply the hash function on the key which gives us some random number(hash code)
Modulo operator to find the exact index on the table — This is where we had tight coupling with the size of the hash table. Now we no longer need it. The new logic to find the location will be directly from the hash code —
New logic to map the keys with the table indexes We will compare the hash code of each key with the indexes on the hash table and the first index which is greater than the hash code will be the location where the respective entry will be stored.
We will repeat this step for all the key-pairs
Now coming to the last key-pair which has given us hash code as 9893 and we have no entry in the hash table which is larger than this number.
What to do now?
Go back to the first index of the table and store the entry there.
This way we got the ring-like structure which is also known as the Consistent Hashing Ring.
One scenario when we have got the same location for other keys also — If you remember this case handing in the hashing then here also it’s same — just link the item with the one already present
Now come to the problem when we have some memory locations down or corrupted then only those location’s data will be disturbed all other locations will be the same and we’ll be able to locate those elements properly with the above logic.
Popular Applications using consistent hashing concept
There are a number of live systems which use consistent hashing including:
- Couchbase automated data partitioning
- Partitioning component of Amazon’s storage system Dynamo
- Data partitioning in Apache Cassandra
- Riak, a distributed key-value database
- Akamai Content Delivery Network
- Discord chat application
That’s all about Consistent hashing.
Do check out the youtube channel Green Learner for more such topics.