Sharp Learning Curve

Consistent Hashing

I’ve recently started working on a new set of architectural challenges. The core requirement, at a high level, is to process millions of transactions a day.

After reading about consistent hashing, it seems to me that it’s one of the best ways to implement APIs that can dynamically scale out and rebalance. Consistent hashing isn’t the entire solution, though; it’s just the algorithm used to make assignments or relationships between different sets of data in such a way that if we add or remove items, the algorithm can be recalculated on any machine and produce the same results (hence, consistent).

Implementation Challenges

  • Given the same inputs, every client using the algorithm must produce the same results
  • A good implementation will balance associations evenly across the available nodes
  • A good implementation should be capable of quickly rebalancing the associations if a node is removed or added

Calculating Consistent Results

This is the simple part of consistent hashing: hash against the same value with a hashing method that produces the same result given identical inputs, and you’re done. I’ve been using MD5; it’s fast, and it makes it simple to convert strings into 64-bit integers.
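A minimal sketch of that step, just to make it concrete: MD5 the input string and take the first 8 bytes of the digest as a 64-bit integer. The class and method names here are mine for illustration, not Symbiote’s actual API.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ConsistentHash
{
    // Hash a string to a 64-bit ring position using MD5.
    public static ulong HashKey(string key)
    {
        using (var md5 = MD5.Create())
        {
            // MD5 yields 16 bytes; the first 8 are enough for a 64-bit position.
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            return BitConverter.ToUInt64(digest, 0);
        }
    }
}
```

Any deterministic hash would do here; MD5 is just a convenient, widely available choice.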

The downside to this solution is that there’s no way to guarantee a nice, even distribution across the key space, since hashing produces consistent yet unpredictable results. That leads us to the next point:

Producing Balanced Distributions

This part is significantly more difficult. Fortunately, it’s been solved many times by other developers who have shared their knowledge. The solution is to assign virtual keys for each key (see the fourth slide for an example) so that we decrease the distance between keys and even out the key space. The trick is finding the right number of virtual keys to add per key.
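Here’s a rough sketch of what virtual keys look like in practice, reusing the MD5 helper from above. The suffix scheme and the replica count of 100 are arbitrary choices on my part, not anything prescribed by the algorithm.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class VirtualKeys
{
    // Same MD5-to-ulong helper as the earlier sketch.
    static ulong HashKey(string key)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToUInt64(md5.ComputeHash(Encoding.UTF8.GetBytes(key)), 0);
    }

    // Generate ring positions for one node: one hash per virtual key.
    // 100 replicas is a starting point; tuning that count is the
    // "right balance" mentioned above.
    public static IEnumerable<ulong> For(string node, int replicas = 100)
    {
        for (var i = 0; i < replicas; i++)
            yield return HashKey(node + ":" + i);
    }
}
```

With enough virtual keys per node, the gaps between ring positions shrink and the share of key space owned by each node evens out.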

Rebalancing

The rebalancing should actually happen automatically. When removing a key, all of its virtual keys are removed; when adding a key, new virtual keys get added. The catch is making this happen in a performant manner.

Need For Speed

If we keep the keys in a flat list, adding, removing, and even looking up the correct key becomes costly: the time for each operation grows linearly as the set of virtual keys grows. What we want is a way to handle all of these operations in logarithmic time. As it happens, binary search trees are a great way to improve look-up speed, and red-black trees re-balance themselves after adds and removes so the tree stays optimized for look-ups. I went with a red-black tree because it’s an algorithm I could handle, and I found some good material discussing interesting approaches to implementing it.
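To tie the last two sections together, here’s a sketch of a complete ring: virtual keys stored in a SortedSet&lt;ulong&gt; (which .NET backs with a red-black tree), a dictionary mapping each virtual key back to its node, add/remove operations that handle the rebalancing described above, and a clockwise lookup via GetViewBetween. The class and member names are mine, not Symbiote’s Distributor, and hash collisions between virtual keys are ignored for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public class HashRing
{
    // SortedSet<T> is backed by a red-black tree, so adds, removes, and
    // range queries over the virtual keys are O(log n).
    readonly SortedSet<ulong> ring = new SortedSet<ulong>();
    readonly Dictionary<ulong, string> owners = new Dictionary<ulong, string>();
    readonly int replicas;

    public HashRing(int replicas = 100)
    {
        this.replicas = replicas;
    }

    static ulong HashKey(string key)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToUInt64(md5.ComputeHash(Encoding.UTF8.GetBytes(key)), 0);
    }

    // Adding a node adds all of its virtual keys: the "rebalance on add".
    public void AddNode(string node)
    {
        for (var i = 0; i < replicas; i++)
        {
            var point = HashKey(node + ":" + i);
            ring.Add(point);
            owners[point] = node;
        }
    }

    // Removing a node removes all of its virtual keys. Only the items that
    // hashed to those points move to a neighbor; everything else stays put.
    public void RemoveNode(string node)
    {
        for (var i = 0; i < replicas; i++)
        {
            var point = HashKey(node + ":" + i);
            ring.Remove(point);
            owners.Remove(point);
        }
    }

    // Find the first virtual key clockwise from the item's hash,
    // wrapping around to the ring's minimum if we fall off the end.
    public string GetNode(string item)
    {
        if (ring.Count == 0)
            throw new InvalidOperationException("The ring has no nodes.");

        var hash = HashKey(item);
        var clockwise = ring.GetViewBetween(hash, ulong.MaxValue);
        var point = clockwise.Count > 0 ? clockwise.Min : ring.Min;
        return owners[point];
    }
}
```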

Applications In Symbiote

There are actually several places I plan to introduce consistent hashing behind the scenes. I’ve already written a class called Distributor, which encapsulates a consistent hash and behaves like a strongly typed collection. This approach will allow me to easily add connection balancing to the Couch, Memcached, Redis, and RabbitMQ APIs. It will also enable a few other essential features (blog posts to follow).
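To give a feel for that kind of connection balancing, here’s a hypothetical usage snippet built on the HashRing sketch above; the node names and key are made up, and none of this reflects Distributor’s real API.

```csharp
// Every client that hashes the same key against the same set of nodes
// picks the same connection, without any shared coordination.
var ring = new HashRing();
ring.AddNode("memcached-01:11211");
ring.AddNode("memcached-02:11211");
ring.AddNode("memcached-03:11211");

Console.WriteLine("user:42 -> " + ring.GetNode("user:42"));

// Dropping a node only remaps the keys that lived on it.
ring.RemoveNode("memcached-02:11211");
Console.WriteLine("user:42 -> " + ring.GetNode("user:42"));
```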

I put together a few slides to illustrate one way to conceptualize how consistent hashing works.

Slide 1 Slide 2 Slide 3 Slide 4
