Thursday, April 13, 2017

Reducing system load on cache servers by using Bloom Filter

       In this post, I want to share my experience on how bloom filter was used to reduce system load (CPU, RAM, Disk operations..) on our cache servers at CDNetworks.

How it all started?
       While working at CDNetworks, I got contacted by a recruiter to apply to Japanese company named Rakuten. It was an interesting challenge, so I tried. I had a skype interview with a technical recruiter and he asked me "what is Bloom Filter?", I did not know what it is. I failed the interview, but it taught me what is Bloom Filter.
Bloom filter is a probabilistic data structure, which is similar to HashMap, but insanely memory optimal. If you hold a million URLs in HashMap, it can reach up to 500Mb, whereas BloomFilter can make it with 16Mb (More info here: .

In other words, Bloom Filter is a clown with a bag full of balls marked with random integer numbers. if you ask him whether some ball with number 'X'  is in the bag, he can either tell you 'No!' or 'Yes, But I am sure 90%'" 

Research and analysis
      BloomFilter was awesome, which I thought could be used in our Cache server. So I googled "BloomFilter in CDN industry" which led me to Akamai's research paper ""
Akamai used bloom filter to avoid caching contents which were requested only once in certain period of time. This optimized their cache server and achieved empirical benefits. It increased hit-rate, reduced disk writes, improved latency and etc. The simple idea: Just don't cache all contents, as 70% of them are not requested again! The research paper was amazing, surprisingly nobody in our company knew about this paper. As I was a developer in cache server team, I saw so many possible optimizations that could be applied to our project. But when I talked about bloom filter and one-hit phenomena, people were not so impressed, as they needed proofs and results.
So, I downloaded 400Gb of HTTP request logs which were accumulated in 10 randomly chosen physical machines for the period of 3 days. The log files were zipped, so I had to develop a script to compute the number of unique URLs and their occurrence. It was fun to do it, I faced a challenge to analyze 15 million unique URLs which did not fit into RAM, so I computed on hard-drive with divide and conquer algorithm that led me to the same results which Akamai had:

You can see in the graph that requests which were served only once is taking up 70% of total unique requests. I checked on more servers, but I got the similar results every time, 70% ~ 80% of contents were not requested again. This results proved that our cache servers were just wasting so much energy caching meaningless contents.

Development & Simulation
     "Cache on second time" was the solution which we needed. I quickly implemented it with Bloom Filter, the logic was simple, when the cache server receives the response from upstream, we check if bloom filter contains the content's URL , if it does not contain then we add it the bloom filter and skip caching, otherwise, we do follow caching. I used two bloom filters (primary/secondary) to achieve stable accuracy of false positive by rotating them in turn when primary was 70% full:

Simulation was just replaying previous request logs and sending them to Cache server, custom origin server was used that generates requested contents with their size. We sent 1 day accumulated request logs of a single machine to CS during 3 hour. Every time, we ran Cache server without new logic (as it is), then we switched on new logic with bloom filter.

       Due to certain issues it took long time to get concluding results i.e. proof of concept, I was lucky to share my ideas with my Russian co-worker and achieve results together. He helped to build resource monitoring environment with CollectD and we simulated requests with WRK HTTP benchmarking tool. Test results were just jaw dropping:

With Bloom we reduced system load by 10 times,  and Cache-server(CS) did not die with heavy load. It was stable as it was not busy writing to disk all the time. We can clearly see it in disk metrics:

This graphs show clearly that if we do not cache meaningless contents(70%), our cache server becomes more productive and stable. Cache hit-ratio also increased as expected by 25%, RAM cache usage decreased by 2 times, Disk operations reduced by 4 times.


      We estimated that upstream traffic will increase by around 30%(As 70% requested only once). It happened as expected in the beginning as you see in below graph (network traffic on upstream server):
Amusingly, With a new logic (on the right side) the upstream traffic decreases as time passes. This is because more meaningful contents are cached as they were not evicted from the cache as it was before.

No comments:

Post a Comment