In recent years, distributed systems have become popular due to concepts like cloud computing and big data. Distributed caches, which power many high-traffic websites, use a type of distributed hashing that takes advantage of an algorithm called consistent hashing.
Before defining consistent hashing, let's quickly review the concept of hashing.
Simple hashing
Simple hashing is a basic method of generating a hash value from a given input using a hash function. The hash function takes an input of any length and returns a fixed-length output, known as the hash value or hash code. The hash function is designed to be deterministic, meaning that the same input will always produce the same hash value.
Simple hashing is often used for tasks such as password storage, where the hash value is stored instead of the actual password to improve security. However, simple hashing does not address the issues that arise in distributed systems, such as the need to distribute data evenly across multiple servers.
Distributed hashing
In a distributed system, it is often necessary to route an object to a specific server based on their keys, such as a client IP or session ID. To accomplish this, a modulo operation is performed on the hash value of the key to derive a server index where the object should be routed.
The index is calculated as:
server = hash(key) mod N
Where N is the size of the server pool. Common identifiers such as a client IP, API key, or session ID can be used as the key. To store or retrieve an object, the client or load balancer calculates the hash value, applies the modulo N operation, and uses the resulting index to route the object to the appropriate server. This ensures that all objects with the same key are routed to the same server, facilitating efficient data management and retrieval in a distributed system.
Now, this approach works well when the size of the cluster is fixed, and the data distribution is even. But what happen if most requests went to server 1 and few to the rest of servers due to previous equation? Or when new servers are added or deleted?
Drawbacks of distributed Hashing
One disadvantage of distributed hashing is uneven distribution, where a hash function may not distribute keys evenly among servers, leading to some servers being overloaded with data while others are underutilized. This can result in poor performance and system failures. See figure 1
Another disadvantage is that the hash value can change based on the number of servers added or deleted in a distributed system. We need to rehash every single key as the new mapping is dependent on the new number of servers leading to redistribution of keys and increased latency and system downtime. See figure 2.
Consistent hashing
Consistent hashing is an effective technique for addressing the previously described issues of uneven distribution and changes in hash values when the number of servers changes. The goal of consistent hashing is to ensure that almost all objects remain assigned to the same server, regardless of changes in the number of servers, while also maintaining even distribution.
In the consistent hashing we represent the hash key of every object into a virtual ring structure called hashring. Then, we perform hashing for the servers addresses using the same hashing function.
Using the same hash function for servers and object will produce values in the same range called hash space. In opposite to simple hash as described before, we don’t need to perform the modulo operation since the number of servers is not fixed. The hashring is considered to have an infinite number of points, and the server nodes can be placed at random locations on this ring.
Now, to locate the server for a particular object. Let’s assume the ring is ordered so that clockwise traversal of the ring corresponds to increasing order of servers addresses (hashes), each object can be served by the server node that first appears in this clockwise traversal. That is, the first server node with an address greater (hash) than that of the object gets to serve it. If the address of the object is higher than the highest addressed node, it is served by the server node with the lowest address, as the traversal through the ring goes in a circle. Theoretically, each server node ‘owns’ a range of the hashring, and any requests coming in at this range will be served by the same server node.
Let’s see what is happening in case of server failures!
It is obvious that all objects of this server will be reassigned to the next server (which has a greater address hash). And any coming objects will served by this new server as they travels the hashring as described previously. Means the range of the next server widens. Same thing will happen if new server added. The range of the next server will divided, so the only objects with hashes smaller than the new server address will be assigned to the new server and the rest remain with their server.
This approach still not solving the uneven distribution issue as the servers on the ring are likely to be uneven. So, consistent hash introduces the concept of virtual node. The idea is to have each server appear at multiple locations on the ring. Each location is a virtual node representing a server. With virtual nodes, each server handles multiple segments on the ring. As the number of virtual nodes increases, the distribution of objects becomes more balanced.
Concrete Implementation
Steps to implement consistent hashing:
1. Create a variable named 'circle' of mapTree data structure, where circle objects are entries consisting of a hash and PhysicalNode.
var circle = new mapTree<string,PhysicalNode>
2. Store the hashes of each virtual node and map them to their respective physical nodes in the 'circle' map.
3. When a client sends a request, use the Get() function to determine which server the request should be mapped to.
4. Compute the hash of the client IP.
5. Conduct a binary search in the 'circle' map to find the first virtual node in the ring whose hash value is greater than or equal to the hash value of the client IP.
6. Return the physical node associated with that virtual node.
7. If there is no such virtual node, return the physical node associated with the first virtual node in the ring.
In summary, we have explored the concepts of simple and consistent hashing, delving into their respective mechanisms, benefits, and use cases. Simple hashing provides a straightforward approach to distributing data across a set of nodes, while consistent hashing offers a more efficient and balanced data distribution solution that alleviates the challenges of node addition and removal in distributed systems.
I hope you enjoyed this article and found it informative ♡
Comments