Skip to main content

Reducing system load on cache servers by using Bloom Filter

Intro
       In this post, I want to share my experience on how bloom filter was used to reduce system load (CPU, RAM, Disk operations..) on our cache servers at CDNetworks.

How it all started?
       While working at CDNetworks, I got contacted by a recruiter to apply to Japanese company named Rakuten. It was an interesting challenge, so I tried. I had a skype interview with a technical recruiter and he asked me "what is Bloom Filter?", I did not know what it is. I failed the interview, but it taught me what is Bloom Filter.
Bloom filter is a probabilistic data structure, which is similar to HashMap, but insanely memory optimal. If you hold a million URLs in HashMap, it can reach up to 500Mb, whereas BloomFilter can make it with 16Mb (More info here: http://ahikmat.blogspot.kr/2016/07/intro-bloom-filter-is-probabilistic.html) .

In other words, Bloom Filter is a clown with a bag full of balls marked with random integer numbers. if you ask him whether some ball with number 'X'  is in the bag, he can either tell you 'No!' or 'Yes, But I am sure 90%'" 

Research and analysis
      BloomFilter was awesome, which I thought could be used in our Cache server. So I googled "BloomFilter in CDN industry" which led me to Akamai's research paper "https://www.akamai.com/us/en/multimedia/documents/technical-publication/algorithmic-nuggets-in-content-delivery-technical-publication.pdf"
Akamai used bloom filter to avoid caching contents which were requested only once in certain period of time. This optimized their cache server and achieved empirical benefits. It increased hit-rate, reduced disk writes, improved latency and etc. The simple idea: Just don't cache all contents, as 70% of them are not requested again! The research paper was amazing, surprisingly nobody in our company knew about this paper. As I was a developer in cache server team, I saw so many possible optimizations that could be applied to our project. But when I talked about bloom filter and one-hit phenomena, people were not so impressed, as they needed proofs and results.
So, I downloaded 400Gb of HTTP request logs which were accumulated in 10 randomly chosen physical machines for the period of 3 days. The log files were zipped, so I had to develop a script to compute the number of unique URLs and their occurrence. It was fun to do it, I faced a challenge to analyze 15 million unique URLs which did not fit into RAM, so I computed on hard-drive with divide and conquer algorithm that led me to the same results which Akamai had:

You can see in the graph that requests which were served only once is taking up 70% of total unique requests. I checked on more servers, but I got the similar results every time, 70% ~ 80% of contents were not requested again. This results proved that our cache servers were just wasting so much energy caching meaningless contents.

Development & Simulation
     "Cache on second time" was the solution which we needed. I quickly implemented it with Bloom Filter, the logic was simple, when the cache server receives the response from upstream, we check if bloom filter contains the content's URL , if it does not contain then we add it the bloom filter and skip caching, otherwise, we do follow caching. I used two bloom filters (primary/secondary) to achieve stable accuracy of false positive by rotating them in turn when primary was 70% full:

Simulation was just replaying previous request logs and sending them to Cache server, custom origin server was used that generates requested contents with their size. We sent 1 day accumulated request logs of a single machine to CS during 3 hour. Every time, we ran Cache server without new logic (as it is), then we switched on new logic with bloom filter.

Results
       Due to certain issues it took long time to get concluding results i.e. proof of concept, I was lucky to share my ideas with my Russian co-worker and achieve results together. He helped to build resource monitoring environment with CollectD and we simulated requests with WRK HTTP benchmarking tool. Test results were just jaw dropping:

With Bloom we reduced system load by 10 times,  and Cache-server(CS) did not die with heavy load. It was stable as it was not busy writing to disk all the time. We can clearly see it in disk metrics:


This graphs show clearly that if we do not cache meaningless contents(70%), our cache server becomes more productive and stable. Cache hit-ratio also increased as expected by 25%, RAM cache usage decreased by 2 times, Disk operations reduced by 4 times.

Considerations

      We estimated that upstream traffic will increase by around 30%(As 70% requested only once). It happened as expected in the beginning as you see in below graph (network traffic on upstream server):
Amusingly, With a new logic (on the right side) the upstream traffic decreases as time passes. This is because more meaningful contents are cached as they were not evicted from the cache as it was before.


Comments

Popular posts from this blog

NLP for Uzbek language

    Natural language processing is an essential tool for text mining in data analysis field. In this post, I want to share my approach in developing stemmer for Uzbek language.      Uzbek language is spoken by 27 million people  around the world and there are a lot of textual materials in internet in uzbek language and it is growing. As I was doing my weekend project " FlipUz " (which is news aggregator for Uzbek news sites) I stumbled on a problem of automatic tagging news into different categories. As this requires a good NLP library, I was not able to find one for Uzbek language. That is how I got a motive to develop a stemmer for Uzbek language.       In short,  Stemming  is an algorithm to remove meaningless suffixes at the end, thus showing the core part of the word. For example: rabbits -> rabbit. As Uzbek language is similar to Turkish, I was curious if there is stemmer for Turkish. And I found this: Turkish St...

Three essential things to do while building Hadoop environment

Last year I setup Hadoop environment by using Cloudera manager. (Basically I followed this video tutorial :  http://www.youtube.com/watch?v=CobVqNMiqww ) I used CDH4(cloudera hadoop)  that included HDFS, MapReduce, Hive, ZooKeeper HBase, Flume and other essential components. It also included YARN (MapReduce 2) but it was not stable so I used MapReduce instead. I installed CDH4 on 10 centos nodes, and I set the Flume to collect twitter data, and by using "crontab" I scheduled the indexing the twitter data in Hive. Anyways, I want to share some of my experiences  and challenges that I faced. First, let me give some problem solutions that everyone must had faced while using Hadoop. 1. vm.swappiness warning on hadoop nodes It is easy to get rid of this warning by just simply running this shell command on nodes: >sysctl -w vm.swappiness=0 More details are written on cloudera's site 2. Make sure to synchronize time on all nodes (otherwise it will give error on n...

NAT Traversal or how to make P2P on Android

Many of us used BitTorrent(or uTorrent) to download files on internet in a short time. Their download speed is high due to Peer-to-peer technology. That means, rather than downloading file from server, we are getting the file from another computer. But how two computers that have a local IP and are behind NAT, how they can connect each other? For that, NAT Traversal methodologies come for help. Note that there are mainly 2 types of NAT: Symmetrical(complex NATs:carrier-grade NAT) and Full (home network or small enterprises). let us consider Full NATs first. Methodologies of NAT traversal are: UPnP - old and hardware oriented method NAT-PMP (later succeeded by PCP)- introduced by Apple, also hardware oriented(i.e: not all routers have it, and even if it had, it is turned off by default) UDP Punching  - this is done by STUN which uses public server to discover NAT public IP & port TCP Punching -  similar to UDP punching but more complicated Symmetrical NATs are...