Working with huge hashes in Ruby

Suppose that while working on our current project, we need to store a lot of people's names in an Elastic Search index for fast searching (this just happened to me while working on Baila, the Czech-Slovak book lovers network). And when I say a lot, I really mean a lot: the number of items may reach 20 million. Also, to make the index more efficient, we need these names to be unique in it.

OK, so how do we deal with that? We'll use a "cache" to store all the names already indexed, and before we save a new document into the index, we'll first ask the cache whether it already knows the name and continue saving only if it doesn't. There's nothing fancy about the cache, so we'll use a simple hash. After all, hashes are fast and efficient, aren't they?

Hash is the answer...

The first-try implementation could look like something as basic as this:

# index_cache.rb
class IndexCache
  def initialize
    @_index_cache = {}
  end

  def contains?(string)
    @_index_cache.has_key?(string)
  end

  def put(string)
    @_index_cache[string] = 1
  end
end
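
To make the intended usage concrete, here is a sketch of how the indexing loop consults the cache (the names source and the index_document helper are made up for illustration):

cache = IndexCache.new

names.each do |name|
  next if cache.contains?(name)   # already indexed, skip the duplicate
  index_document(name)            # save the document into the Elastic Search index
  cache.put(name)                 # remember the name for the next time
end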

So we are using a plain old Ruby hash object as the cache. OK, let's run the code. The following are excerpts from the debug messages logged during the index creation:

... cache_size=66096, elapsed=0.3%, run=2.9s, ...
... cache_size=136828, elapsed=0.6%, run=3.3s, ...
... cache_size=196350, elapsed=0.9%, run=3.1s, ...

In the beginning, everything looks great, each run takes around 3 seconds and the cache size grows happily. But hang on, what's going on after a couple of minutes?

... cache_size=6475100, elapsed=30.8%, run=11.6s, ...
... cache_size=6542401, elapsed=31.1%, run=11.6s, ...
... cache_size=6607802, elapsed=31.4%, run=11.4s, ...

Now it takes more than 11 seconds to index each batch, and we are still only about a third of the way through the whole process!

How is that possible? At first I thought that maybe Elastic Search slows down when importing into a bigger index than into a fresh one, but a quick check showed that assumption was wrong. I tried to search for a possible bug in my Ruby code and then it struck me: it is not my code, it's Ruby itself!

... but due to Ruby's garbage collection ...

If you are guessing the garbage collector, you are guessing right. Ruby's GC is a simple, naive mark-and-sweep collector, which means that during a collection cycle it walks through all objects in your program's memory and marks (and later sweeps) those that are no longer in use. While the garbage is being collected, all user code is frozen: your application has to wait until the collection is complete. Have a look at the great presentation by Joe Damato and Aman Gupta for a more in-depth view of the Ruby GC.

Ruby's GC is barely noticeable when you have a few tens of thousands of objects around. But once you start using a few million of them (such as when you keep a few million strings in a hash), garbage collection becomes a pain in the ass for sure.

Using Ruby's built-in GC profiler, I found out that Ruby ran one to three collection cycles during each indexing run. First I tried playing with the GC environment variables, which MRI Ruby has understood since version 1.9.3. This helped a bit, as the collection cycles became less frequent, but the application still slowed down gradually until it became unbearable.
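
For reference, enabling the profiler and playing with the tuning variables can look roughly like this (just a sketch: the script name and the particular values below are made-up examples, not the exact settings I used):

# Enable the built-in GC profiler at the start of the indexing script:
GC::Profiler.enable

# ... run one indexing batch ...

puts GC::Profiler.result   # how many collection cycles ran and how long they took
GC::Profiler.clear         # reset the counters before the next batch

# The 1.9.3 tuning knobs are plain environment variables set when launching
# the script, for example:
#   RUBY_HEAP_MIN_SLOTS=2000000 RUBY_FREE_MIN=500000 RUBY_GC_MALLOC_LIMIT=100000000 ruby indexer.rb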

We have to tell Ruby not to even try to garbage collect our cache. How can we do that?

... not the Ruby hash

Luckily, there are simple solutions. One of them, perfectly suited for our purpose, is Google's sparse hash, a hash implementation specifically fine-tuned for the best memory or speed efficiency. In Ruby, we can use the google_hash gem, a simple interface to the sparse hash C++ library. But it's not only about the hash efficiency. The nice side-effect we get (and the one we want most right now) is that the hash data lives outside the Ruby heap, in the library's own memory, so Ruby no longer sees it and does not have to garbage collect it.

The change in our cache code is minimal: just require the gem and replace the Ruby hash with the sparse hash in the constructor (in this case I am actually using the speed-optimized "dense" variant):

# index_cache.rb
require 'google_hash'

class IndexCache
  def initialize
    @_index_cache = GoogleHashDenseRubyToInt.new
  end
  # ... the rest of the class stays the same ...
end

Surely we can't wait to run our indexing code again. Here are the results now:

... cache_size=6888861, elapsed=32.6%, run=3.1s, ...
... cache_size=6959339, elapsed=32.9%, run=5.5s, ...
... cache_size=7030223, elapsed=33.2%, run=3.5s, ...
                         ...
... cache_size=19316920, elapsed=99.1%, run=3.5s, ...
... cache_size=19366026, elapsed=99.4%, run=3.4s, ...
... cache_size=19418205, elapsed=99.7%, run=3.3s, ...

Hey, the indexing speed is pretty stable now, even with nearly 20 million items in the cache, yay! This way the whole app finishes in the shortest possible and, even more importantly, predictable time.

Can we do any better?

Still not enough? Speed-wise we have probably reached our limit, but we can surely be better memory consumers. What is the problem? The current code stores up to about 20 million people and organization names from all over the world in a hash. Each of those UTF-8 encoded strings carries its own object overhead on top of the raw bytes, plus an entry in the hash table, so the app easily allocates over 6 GB of memory on the server, causing quite some trouble for the other programs and pushing their data into swap.

Can we make the strings consume less memory? Sure we can: let's hash them before storing them! (Terminology note: I'm sure you noticed, but this hash is not the same as the hash above; this time I am talking about a hash function, not the hash map structure.) We will sacrifice a bit of speed (on the hash calculations) to hopefully save a lot of memory. Let's have a look at the final code:

# index_cache.rb
require 'google_hash'

class IndexCache
  def initialize
    @_index_cache = GoogleHashDenseLongToInt.new
  end

  def contains?(string)
    @_index_cache.has_key?(string.hash)
  end

  def put(string)
    @_index_cache[string.hash] = 1
  end
end

Ruby's hash function is MurmurHash, a very fast and simple function, so the speed decrease should be barely noticeable. It returns an integer of up to 64 bits, which fits nicely into a long, so we can use the "LongToInt" variant of the Google sparse hash for even better memory efficiency.
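
A quick check in irb illustrates what the keys look like now (I'm not showing concrete values because String#hash is seeded and differs between processes; only the shape of the result matters):

"book lover".hash                         # => a (possibly negative) Fixnum fitting into 64 bits
"book lover".hash == "book lover".hash    # => true, stable within a single process
"book lover".hash == "book lovers".hash   # => false, different strings almost surely map to different keys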

And hey, after running the code again, I get a peak memory consumption of only 1.5 GB instead of the 6 GB before; that's four times better! How grateful the server and all the other programs are! The total number of entries in the hash is the same as before, meaning that the hash function produced no collisions, which is a must: a collision would make two different names look like duplicates and one of them would be silently left out of the index.
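
If you'd rather verify that explicitly than rely on the entry count, a one-off offline check is easy to sketch (a hypothetical script; names stands for an enumerable yielding all the, not necessarily unique, names):

# collision_check.rb: a hypothetical one-off check, not part of the indexer
require 'set'

seen_names  = Set.new
seen_hashes = Set.new

names.each do |name|
  seen_names  << name
  seen_hashes << name.hash
end

puts "unique names: #{seen_names.size}, unique hashed keys: #{seen_hashes.size}"
puts "collisions:   #{seen_names.size - seen_hashes.size}"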

And you know what? The total time is even a few minutes shorter than before. Not only does the hash function not slow the app down at all, it actually speeds it up, because the app no longer has to fight with the other server applications for memory and the swap. How cool is that?

Feel free to discuss this on Google+.

tags: ruby, en