Thursday, April 13, 2017

Reducing system load on cache servers by using a Bloom filter

Intro
       In this post, I want to share my experience of how a Bloom filter was used to reduce system load (CPU, RAM, disk operations, etc.) on our cache servers at CDNetworks.

How it all started
       While working at CDNetworks, I was contacted by a recruiter who invited me to apply to a Japanese company named Rakuten. It sounded like an interesting challenge, so I tried. During a Skype interview, the technical recruiter asked me, "What is a Bloom filter?" I did not know. I failed the interview, but it taught me what a Bloom filter is.
A Bloom filter is a probabilistic data structure, similar to a HashMap but insanely memory-efficient. Holding a million URLs in a HashMap can take up to 500 MB, whereas a Bloom filter can manage with 16 MB (more info here: http://ahikmat.blogspot.kr/2016/07/intro-bloom-filter-is-probabilistic.html).

In other words, a Bloom filter is like a clown with a bag full of balls marked with random integer numbers. If you ask him whether the ball with number 'X' is in the bag, he can either tell you "No!" or "Yes, but I am only 90% sure."
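As a quick illustration, here is how that behavior looks with Guava's BloomFilter (just a sketch, not the code from our cache server; the capacity and the 1% false-positive rate are assumptions):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomDemo {
    public static void main(String[] args) {
        // 1,000,000 expected URLs, 1% false-positive probability.
        BloomFilter<String> urls = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        urls.put("http://example.com/video.mp4");

        // "No" is always correct; "yes" is only probably correct.
        System.out.println(urls.mightContain("http://example.com/video.mp4")); // true
        System.out.println(urls.mightContain("http://example.com/other.mp4")); // false (almost surely)
    }
}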

Research and analysis
      The Bloom filter seemed awesome, and I thought it could be used in our cache server. So I googled "Bloom filter in CDN industry", which led me to Akamai's research paper: https://www.akamai.com/us/en/multimedia/documents/technical-publication/algorithmic-nuggets-in-content-delivery-technical-publication.pdf
Akamai used a Bloom filter to avoid caching contents that were requested only once within a certain period of time. This optimized their cache servers and brought empirical benefits: it increased the hit rate, reduced disk writes, improved latency, and more. The simple idea: just don't cache every content, as 70% of them are never requested again! The paper was amazing, and surprisingly nobody in our company knew about it. As a developer on the cache server team, I saw many possible optimizations that could be applied to our project. But when I talked about the Bloom filter and the one-hit phenomenon, people were not impressed; they needed proof and results.
So I downloaded 400 GB of HTTP request logs accumulated on 10 randomly chosen physical machines over a period of 3 days. The log files were zipped, so I had to develop a script to compute the number of unique URLs and their occurrences. It was fun to do: I faced the challenge of analyzing 15 million unique URLs which did not fit into RAM, so I did the computation on the hard drive with a divide-and-conquer algorithm, and it led me to the same results Akamai had:

You can see in the graph that requests served only once make up 70% of the total unique requests. I checked more servers, and I got similar results every time: 70%~80% of contents were never requested again. These results proved that our cache servers were wasting a lot of energy caching meaningless contents.
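For the curious, the external counting step can be sketched like this (a simplified reconstruction, not the original script; it assumes the zipped logs have been reduced to one URL per line, and the partition count is made up):

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.GZIPInputStream;

// Hash-partition the URLs into files small enough to fit in RAM,
// then count occurrences per partition (divide and conquer on disk).
public class UrlCounter {
    static final int PARTITIONS = 64; // tune so each partition fits in RAM

    public static void main(String[] args) throws IOException {
        // Phase 1: split the gzipped log into partitions by hash(url).
        PrintWriter[] parts = new PrintWriter[PARTITIONS];
        for (int i = 0; i < PARTITIONS; i++)
            parts[i] = new PrintWriter(new FileWriter("part" + i + ".txt"));
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0]))))) {
            String url;
            while ((url = in.readLine()) != null)
                parts[Math.floorMod(url.hashCode(), PARTITIONS)].println(url);
        }
        for (PrintWriter p : parts) p.close();

        // Phase 2: count each partition independently; the same URL always
        // lands in the same partition, so the counts are exact.
        long unique = 0, oneHit = 0;
        for (int i = 0; i < PARTITIONS; i++) {
            Map<String, Integer> counts = new HashMap<>();
            for (String url : Files.readAllLines(Paths.get("part" + i + ".txt")))
                counts.merge(url, 1, Integer::sum);
            unique += counts.size();
            oneHit += counts.values().stream().filter(c -> c == 1).count();
        }
        System.out.printf("unique=%d, one-hit=%d (%.0f%%)%n",
                unique, oneHit, 100.0 * oneHit / unique);
    }
}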

Development & Simulation
     "Cache on second time" was the solution which we needed. I quickly implemented it with Bloom Filter, the logic was simple, when the cache server receives the response from upstream, we check if bloom filter contains the content's URL , if it does not contain then we add it the bloom filter and skip caching, otherwise, we do follow caching. I used two bloom filters (primary/secondary) to achieve stable accuracy of false positive by rotating them in turn when primary was 70% full:

The simulation simply replayed previous request logs against the cache server; a custom origin server generated the requested contents with their sizes. We sent one day's accumulated request logs from a single machine to the cache server over 3 hours. Each time, we first ran the cache server without the new logic (as is), then switched on the new logic with the Bloom filter.

Results
       Due to certain issues it took a long time to get conclusive results, i.e. a proof of concept. I was lucky to share my ideas with my Russian co-worker, and we achieved the results together. He helped build a resource monitoring environment with collectd, and we simulated requests with the wrk HTTP benchmarking tool. The test results were just jaw-dropping:

With the Bloom filter we reduced system load by 10 times, and the cache server (CS) did not die under heavy load. It stayed stable because it was no longer busy writing to disk all the time. We can see this clearly in the disk metrics:


These graphs clearly show that if we do not cache meaningless contents (the 70%), our cache server becomes more productive and stable. As expected, the cache hit ratio also increased by 25%, RAM cache usage was halved, and disk operations were reduced by 4 times.

Considerations

      We estimated that upstream traffic would increase by around 30% (as 70% of contents are requested only once). It happened as expected in the beginning, as you can see in the graph below (network traffic on the upstream server):
Amusingly, with the new logic (on the right side) the upstream traffic decreases as time passes. This is because more meaningful contents stay cached; they are no longer evicted from the cache the way they were before.


Monday, September 19, 2016

How to use VisualVM

      VisualVM can be very helpful for discovering performance lags in a Java application.
It is one of the easiest profiling tools for Java.


Download VisualVM
https://visualvm.github.io/


Run VisualVM and check the locally running Java apps:


Remote Profiling
Run your Java application with the following JVM arguments (shown here as a full command line; "yourapp.jar" is a placeholder):
java \
 -Djavax.management.builder.initial= \
 -Dcom.sun.management.jmxremote \
 -Dcom.sun.management.jmxremote.port=9010 \
 -Dcom.sun.management.jmxremote.local.only=false \
 -Dcom.sun.management.jmxremote.authenticate=false \
 -Dcom.sun.management.jmxremote.ssl=false \
 -jar yourapp.jar
These parameters make your remote Java application listen for JMX connections on port 9010 (note that authentication and SSL are disabled, so use this only on a trusted network).
Then you can connect to it from VisualVM via File -> Add JMX Connection.
Type the hostname and port, for example: 192.168.10.10:9010
(the IP address of the remote machine and the port)

Performance Profiling
      After you connect to your app from VisualVM, go to the "Sampler" tab and press the "CPU" button.
It is important to sort by "Total Time (CPU)" to see the heaviest CPU consumers at the top of the list.
This gives you a rough idea, but not the details. To get detailed information,
press the "Snapshot" button, which opens the following view:

VisualVM lets you monitor in real time which functions are taking up the most CPU.

This window is very important. From it, you can find which functions, classes,
or packages are making your Java application slow.
It is the key to resolving performance issues in your Java application.
You can play with the sorting options, navigate through the callers,
and check the other tabs ("Hot Spots", etc.).
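If you want something to practice on, a toy program with a deliberate hotspot works well (a made-up example, not from a real project):

// A deliberately slow method that should appear at the top of the
// Sampler results when sorting by "Total Time (CPU)".
public class HotSpotDemo {
    static double slowSum() {
        double sum = 0;
        for (int i = 1; i < 50_000_000; i++) {
            sum += Math.sqrt(i); // cheap, but repeated 50 million times
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            System.out.println(slowSum());
            Thread.sleep(100); // give the sampler time to keep up
        }
    }
}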

Memory Profiling
      Memory usage analysis is similar to the above. Press the "Memory" button in the Sampler window.
Sort by "Bytes" to see the data types (or classes) that consume the most memory at the top.
You can also take a "Snapshot" to see more details.

Conclusion
   VisualVM can be very helpful for monitoring, analysing, and tuning Java application performance.
This is an essential task when developing scalable, distributed, high-performance applications.

Performance tuning for a Web engine


Install Tsung on CentOS

Prerequisites:
1. Install Erlang:

sudo yum -y update && sudo yum -y upgrade
sudo yum install epel-release

sudo yum -y install erlang perl perl-RRD-Simple.noarch perl-Log-Log4perl-RRDs.noarch gnuplot perl-Template-Toolkit


2. Get Tsung
wget http://tsung.erlang-projects.org/dist/tsung-1.6.0.tar.gz

3. Extract and Install
tar zxvf tsung-1.6.0.tar.gz
cd tsung-1.6.0
./configure && make && sudo make install

Note: A sample XML configuration is located at /usr/share/doc/tsung/examples/http_simple.xml
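For reference, a minimal scenario looks roughly like this (adapted from the style of the bundled http_simple.xml; the target server, duration, and arrival rate below are placeholders):

<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="notice" version="1.0">
  <clients>
    <client host="localhost" use_controller_vm="true"/>
  </clients>
  <servers>
    <server host="192.168.10.20" port="80" type="tcp"/>
  </servers>
  <load>
    <arrivalphase phase="1" duration="2" unit="minute">
      <users arrival_rate="50" unit="second"/>
    </arrivalphase>
  </load>
  <sessions>
    <session name="http-get" probability="100" type="ts_http">
      <request><http url="/" method="GET" version="1.1"/></request>
    </session>
  </sessions>
</tsung>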

Set up Cluster Testing with Tsung

1. Add the cluster node info to each node's "/etc/hosts":
sudo vi /etc/hosts
# cluster nodes
192.168.10.10       n1
192.168.10.11       n2
192.168.10.12       n3
192.168.10.13       n4

2. Set up the ~/.ssh/config file so the master can reach every node without a password:
vi ~/.ssh/config
Host n1
  Hostname n1
  User tsung
  Port 722
  IdentityFile /home/tsung/.ssh/my_key_rsa7
Host n2
  Hostname n2
  User tsung
  Port 722
  IdentityFile /home/tsung/.ssh/my_key_rsa7
.....
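Once the nodes are reachable over passwordless SSH, list them as load generators in the scenario's clients section, for example (the weight and maxusers values are placeholders):

<clients>
  <client host="n1" weight="1" maxusers="30000"/>
  <client host="n2" weight="1" maxusers="30000"/>
  <client host="n3" weight="1" maxusers="30000"/>
  <client host="n4" weight="1" maxusers="30000"/>
</clients>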

Test and Visualize Results
1. Start Tsung on the master server:
tsung -f /home/tsung/test/selected_scenario.xml start

2. Plot graphs with the bundled Perl script:
/usr/lib/tsung/bin/tsung_stats.pl --stats /home/tsung/.tsung/log/$tsung_path/tsung.log

Change Kernel params

vi /etc/sysctl.conf

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65000
fs.file-max = 65000

Then apply them with: sudo sysctl -p

Source: http://tsung.erlang-projects.org/user_manual/faq.html#why-do-i-have-error-connect-emfile-errors

References:
https://gist.github.com/huberflores/2827890
https://github.com/ngocdaothanh/tsart
https://gist.github.com/clasense4/47438a884cabca9e66c8
http://www.jeramysingleton.com/install-erlang-and-elixir-on-centos-7-minimal/
https://jackiechen.org/2015/12/04/use-tsung-to-test-https-site/

Tuesday, July 19, 2016

Java: BloomFilter Benchmark

Intro

A Bloom filter is a probabilistic data structure for testing whether an element is in a data set.
It is similar to a HashSet in that it tells us whether the set contains a certain element or not. The difference is that the answer contains(element)=TRUE is probabilistic.
In our example we set the false-positive probability to 0.01, which means the answer "it contains" is 99% correct.
Read more about Bloom filter from here: https://en.wikipedia.org/wiki/Bloom_filter
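As a back-of-the-envelope check: the theoretical number of bits a Bloom filter needs for n elements at false-positive probability p is m = -n * ln(p) / (ln 2)^2, i.e. about 9.6 bits (1.2 bytes) per element at p = 0.01. For a million elements that is roughly 1.2 MB before any implementation overhead, which explains why the footprint below is so much smaller than a HashSet's.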

Scenario

We create two arrays of random elements, with 1,000,000 elements in each.
Then we insert the first array into the BloomFilter, iterate over the first array, and check whether each item is contained in the BloomFilter. The second array is used only for checking non-existing elements.
We do the same for a HashSet, as described above.

Benchmarking code

We used a customized version of the Bloom filter which can accept a byte array.
(A previous version of this post encoded a string for every put and contains call, which misrepresented the performance of the Bloom filter.)

Source code:
(The source code is not organized for compilation; please adapt it for your use.)
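If you want a self-contained approximation of the same benchmark, here is a sketch using Guava's BloomFilter instead of our customized version (so the exact numbers will differ):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class BloomBenchmark {
    static final int N = 1_000_000;

    public static void main(String[] args) {
        String[] existing = randomStrings(new Random(42));

        BloomFilter<String> bloom = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), N, 0.01);
        long t = System.nanoTime();
        for (String s : existing) bloom.put(s);
        report("BloomFilter add()", t);

        t = System.nanoTime();
        for (String s : existing) bloom.mightContain(s);
        report("BloomFilter contains(), existing", t);

        Set<String> set = new HashSet<>(N * 2);
        t = System.nanoTime();
        for (String s : existing) set.add(s);
        report("HashSet add()", t);

        t = System.nanoTime();
        for (String s : existing) set.contains(s);
        report("HashSet contains(), existing", t);
    }

    static String[] randomStrings(Random rnd) {
        String[] a = new String[N];
        for (int i = 0; i < N; i++) a[i] = Long.toHexString(rnd.nextLong());
        return a;
    }

    static void report(String label, long startNanos) {
        double sec = (System.nanoTime() - startNanos) / 1e9;
        System.out.printf("%s: %.3fs, %.0f elements/s%n", label, sec, N / sec);
    }
}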

Performance output:
Testing BloomFilter  1000000 elements
add(): 0.176s, 5681818.181818183 elements/s
contains(), existing: 0.171s, 5847953.216374269 elements/s
Testing HashSet  1000000 elements
add(): 0.181s, 5524861.878453039 elements/s
contains(), existing: 0.08s, 1.25E7 elements/s

Memory size:


The BloomFilter is the winner here. With 99% correctness, its memory footprint is almost 40 times smaller than the HashSet's.


If we reduce the correctness to 90%, the memory footprint becomes about 80 times smaller.

Conclusion

We saw that the BloomFilter is as fast as the HashSet, while being far more space-efficient.
If we keep a list of URLs in a HashSet in memory, a BloomFilter can shrink it by about 40 times. For example, 500 MB of occupied memory can be reduced to about 12 MB at 99% correctness.

Sunday, May 8, 2016

NLP for Uzbek language

    Natural language processing is an essential tool for text mining in the data analysis field. In this post, I want to share my approach to developing a stemmer for the Uzbek language.
     Uzbek is spoken by 27 million people around the world, and there is a lot of textual material on the internet in Uzbek, with more appearing all the time.
While working on my weekend project "FlipUz" (a news aggregator for Uzbek news sites), I stumbled on the problem of automatically tagging news into different categories. This requires a good NLP library, and I was not able to find one for the Uzbek language.
That is how I got the motivation to develop a stemmer for Uzbek.
      In short, stemming is an algorithm that removes suffixes from the end of a word, exposing its core part. For example: rabbits -> rabbit.
As Uzbek is similar to Turkish, I was curious whether there was a stemmer for Turkish, and I found one: the Turkish stemmer in Snowball. Its key approach was to use a finite state machine.
It was an interesting approach, and I liked its simplicity. The only thing I had to do was model the suffix transformations as a state machine.
Since a stemmer should eventually handle all kinds of words, nouns were the target in the first step.
Therefore, I created the state machine for nouns:
While drawing this diagram, I referenced the Uzbek phonetics and word-formation rules from an Uzbek language textbook. The book was very helpful, though I still have not used much of it yet.

I used Python for its simplicity and its rich ecosystem of external libraries.
Here is the source code:
from fysom import Fysom
def stem(word):
    # Finite state machine over Uzbek noun suffixes. Each event is
    # (suffix, from_state, to_state); we scan the word from the end,
    # stripping a suffix whenever the FSM allows the transition.
    fsm = Fysom(initial='start',
                    events=[
                    ('dir', 'start', 'b'),
                    ('dirda', 'start', 'b'),
                    ('ku', 'start', 'b'),
                    ('mi', 'start', 'b'),
                    ('mikan', 'start', 'b'),
                    ('siz', 'start', 'b'),
                    ('day', 'start', 'b'),
                    ('dek', 'start', 'b'),
                    ('niki', 'start', 'b'),
                    ('dagi', 'start', 'b'),
                    ('mas', 'start', 'd'),
                    ('ning', 'start', 'f'),
                    ('lar', 'start', 'g'),
                    ('lar', 'e', 'g'),
                    ('dan', 'd', 'e'),
                    ('da', 'd', 'e'),
                    ('ga', 'd', 'e'),
                    ('ni', 'd', 'e'),
                    ('dan', 'start', 'e'),
                    ('da', 'start', 'e'),
                    ('ga', 'start', 'e'),
                    ('ni', 'start', 'e'),
                    ('lar', 'f', 'g'),
                    ('miz', 'start', 'h'),
                    ('ngiz', 'start', 'h'),
                    ('m', 'start', 'h'),
                    ('si', 'start', 'h'),
                    ('i', 'start', 'h'),
                    ('ng', 'start', 'h'),
                    ('miz', 'f', 'h'),
                    ('ngiz', 'f', 'h'),
                    ('m', 'f', 'h'),
                    ('si', 'f', 'h'),
                    ('i', 'f', 'h'),
                    ('ng', 'f', 'h'),
                    ('miz', 'e', 'h'),
                    ('ngiz', 'e', 'h'),
                    ('m', 'e', 'h'),
                    ('si', 'e', 'h'),
                    ('i', 'e', 'h'),
                    ('ng', 'e', 'h'),
                    ('lar', 'h', 'g'),
                    ('dagi', 'g', 'start')
                    ]
                )
    i = len(word) - 1   # start index of the current suffix candidate
    j = len(word)       # current end of the (unstripped) word
    while True:
        if i <= 0:
            break
        v = word[i:j]   # suffix candidate
        if fsm.can(v):
            # prefer the longer suffix when a bare 'i' would also match
            if v == 'i' and fsm.can(word[i-1:j]):
                i = i - 1
                continue
            fsm.trigger(v)
            if fsm.current == 'h':
                if word[i-1:i] == 'i':
                    i = i - 1  # skip the linking 'i'
                    if word[i-1:i] == 'n':
                        # this is the 'ning' suffix
                        fsm.current = 'start'
                        continue
            elif fsm.current == 'b':
                fsm.current = 'start'
            j = i       # strip the suffix: shrink the word boundary
        i = i - 1
    return word[:j]
It is also available on GitHub.
We are collaborating with fellow Uzbek developers to build a full-featured NLP library for the Uzbek language.
The next step is to apply stemming to verbs.
Let me know if you have any ideas on this. Thanks.

Test output:

print(stem('mahallamizdagilardanmisiz'))
mahalla

Tuesday, December 8, 2015

How to use Docker

Docker
Docker official website: https://www.docker.com/
Setup Docker
Prepare a fresh installation of CentOS; I am using CentOS 6.7.
Update the yum repositories:
> rpm -iUvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
> yum update -y
 
Install Docker
> yum -y install docker-io

Pull a container image; I am going to use a CentOS container.
To pull the latest (CentOS 7):
>  docker pull centos
Or
> docker pull centos:centos6

Check which container images are installed:
> docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
centos              centos6             3bbbf0aca359        2 weeks ago         190.6 MB
centos              latest              ce20c473cd8a        2 weeks ago

Run docker from image:
> docker run -i -t centos:centos6 /bin/bash
Note: this creates a container from the image (you can see the container ID as the hostname)

List containers:
>docker ps
>docker ps -a (to list all containers, running and stopped)

Stop/Start/Remove container:
>docker start ContainerID
>docker stop ContainerID
>docker rm ContainerID
 
Reconnect to a container:
>docker attach ContainerID
Or
>docker exec -it ContainerID bash

Run a docker container in the background:
> docker run -itd --name cs1 --net=none --hostname=cs1.csteam.net -v /csdata/cs1:/csdata/cs1:rw -v /root/.ssh:/root/.ssh:rw --privileged=true centos:centos6 /bin/bash
To detach from the container's console without stopping it:
Ctrl+P => Ctrl+Q

Networking

For the following steps, you need to install the pipework script:
#sudo bash -c "curl https://raw.githubusercontent.com/jpetazzo/pipework/master/pipework > /usr/local/bin/pipework"
#sudo chmod +x /usr/local/bin/pipework

If you want to assign the container an IP address via DHCP, you don't need the next section; the DHCP setup is easier:
#pipework eth0 ContainerID dhclient
How to expose a container with a private IP address on the local network: the MacVLan method
The host network type is easy to set up (--net=host), but then the container uses the same network interfaces as the host and cannot have its own IP address. That is why we will use a bridged network.
The easiest way to set up a bridged network is the "pipework" script installed above, which automates the procedure.

Install dependencies:
>yum -y install bridge-utils net-tools
Note:
If your host server is CentOS 6, you need to upgrade the iproute RPM to support the "ip netns" command.
Download the RPM from:
https://repos.fedorapeople.org/repos/openstack/EOL/openstack-havana/epel-6/iproute-2.6.32-130.el6ost.netns.2.x86_64.rpm
>rpm -Uvh iproute-2.6.32-130.el6ost.netns.2.x86_64.rpm

Create the bridge (suppose your host's IP is 10.40.198.150 on eth0):
>ip addr del 10.40.198.150/24 dev eth0
>ip link add link eth0 dev eth0m type macvlan mode bridge
>ip link set eth0m up
>ip addr add 10.40.198.150/24 dev eth0m
>route add default gw 10.40.198.1
>service network restart
You may need to wait a few minutes for the settings to apply.
Finally, assign the container $CID its new private (local) IP:
>pipework eth0 $CID 10.40.198.155/24@10.40.198.1
Done!
You can now ping it from another PC on the local network: ping 10.40.198.155

How to remove an unused virtual bridge:
>ifconfig br0 down
>brctl delbr br0

Save a container as an image:
>docker commit $CID myimage:newcs

Thursday, November 26, 2015

NAT Traversal or how to make P2P on Android

Many of us have used BitTorrent (or uTorrent) to download files from the internet quickly. The download speed is high thanks to peer-to-peer technology: rather than downloading a file from a server, we get it from other computers.
But how can two computers that have local IPs and sit behind NAT connect to each other?
That is where NAT traversal methodologies come in.
Note that there are mainly 2 types of NAT:
symmetric NAT (complex NATs, e.g. carrier-grade NAT) and full-cone NAT (home networks or small enterprises).
Let us consider full-cone NATs first.

The methodologies of NAT traversal are:
UPnP - an old, hardware-oriented method
NAT-PMP (later succeeded by PCP) - introduced by Apple; also hardware-oriented (i.e. not all routers support it, and even when they do, it is often turned off by default)
UDP punching - done with STUN, which uses a public server to discover the NAT's public IP and port (see the sketch after this list)
TCP punching - similar to UDP punching, but more complicated
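To make UDP punching concrete, here is a bare-bones sketch in Java (the rendezvous protocol is made up: it assumes a public server that replies to a "HELLO" datagram with the other peer's public "ip:port"; real implementations use STUN):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class UdpPunch {
    public static void main(String[] args) throws Exception {
        String serverHost = args[0]; // public rendezvous server
        int serverPort = Integer.parseInt(args[1]);
        DatagramSocket socket = new DatagramSocket(); // the NAT maps this port

        // 1. Register with the server so it learns our public endpoint.
        byte[] hello = "HELLO".getBytes(StandardCharsets.UTF_8);
        socket.send(new DatagramPacket(hello, hello.length,
                InetAddress.getByName(serverHost), serverPort));

        // 2. The server replies with the other peer's public "ip:port".
        DatagramPacket reply = new DatagramPacket(new byte[64], 64);
        socket.receive(reply);
        String[] peer = new String(reply.getData(), 0, reply.getLength(),
                StandardCharsets.UTF_8).split(":");
        InetAddress peerAddr = InetAddress.getByName(peer[0]);
        int peerPort = Integer.parseInt(peer[1]);

        // 3. Both peers fire packets at each other's public endpoint.
        //    The outgoing packets open (punch) the NAT mappings; once
        //    both mappings exist, packets start getting through.
        byte[] ping = "PING".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < 10; i++) {
            socket.send(new DatagramPacket(ping, ping.length, peerAddr, peerPort));
            Thread.sleep(300);
        }
        DatagramPacket incoming = new DatagramPacket(new byte[64], 64);
        socket.receive(incoming); // hearing from the peer means the hole is open
        System.out.println("Punched! Heard from " + incoming.getAddress());
    }
}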

Symmetric NATs are a big issue. They are hard to punch because they assign router ports unpredictably, so there is only a tiny chance of establishing a connection.
There are some approaches that can help, but they are difficult to implement in practice:
"Large Scale Symmetric NAT Traversal with Two Stage Hole Punching":
https://drive.google.com/file/d/0B1IimJ20gG0SY2NvaE4wRVVMbG8/view
Fortunately, symmetric NATs are used mostly in security-restricted areas, and they are becoming less popular as people come to understand how important P2P is.

So, how can we practically make a P2P connection on Android?
I found 2 ways: using libraries (harder) and WebRTC (easier).
Libraries:
https://github.com/jitsi/ice4j
https://github.com/htwg/UCE

WebRTC:
As you know, WebRTC uses P2P, and internally it uses ICE (which combines STUN and TURN) to establish the P2P connection.
This option is easier because the WebRTC library takes care of future updates, and it is a cool new standard.
https://github.com/pchab/AndroidRTC


References:
http://stackoverflow.com/questions/9656198/java-udp-hole-punching-example-connecting-through-firewall?rq=1
https://en.wikipedia.org/wiki/STUN
https://en.wikipedia.org/wiki/UDP_hole_punching
http://chimera.labs.oreilly.com/books/1230000000545/ch04.html
http://stackoverflow.com/questions/12359502/udp-hole-punching-not-going-through-on-3g?rq=1
Theory: http://www.bford.info/pub/net/p2pnat/index.html#SECTION00040000000000000000
Tutorial on ice4j: http://blog.sharedmemory.fr/en/2014/06/22/gsoc-2014-ice4j-tutorial/