Thursday, April 13, 2017

Reducing system load on cache servers by using a Bloom filter

Intro
       In this post, I want to share my experience of how a Bloom filter was used to reduce system load (CPU, RAM, disk operations, etc.) on our cache servers at CDNetworks.

How it all started
       While working at CDNetworks, I was contacted by a recruiter who invited me to apply to a Japanese company named Rakuten. It sounded like an interesting challenge, so I tried. During a Skype interview, the technical recruiter asked me, "What is a Bloom filter?" I did not know. I failed the interview, but it taught me what a Bloom filter is.
A Bloom filter is a probabilistic data structure, similar to a HashMap but insanely memory-efficient. Holding a million URLs in a HashMap can take up to 500 MB, whereas a Bloom filter can manage with 16 MB (more info here: http://ahikmat.blogspot.kr/2016/07/intro-bloom-filter-is-probabilistic.html).

In other words, a Bloom filter is like a clown with a bag full of balls marked with random integer numbers. If you ask him whether the ball with number 'X' is in the bag, he can either tell you "No!" or "Yes, but I am only 90% sure."
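As a quick illustration, here is how that behavior looks with Guava's BloomFilter (just a sketch, not the code from our cache server; the capacity and the 1% false-positive rate are assumptions):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomDemo {
    public static void main(String[] args) {
        // 1,000,000 expected URLs, 1% false-positive probability.
        BloomFilter<String> urls = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        urls.put("http://example.com/video.mp4");

        // "No" is always correct; "yes" is only probably correct.
        System.out.println(urls.mightContain("http://example.com/video.mp4")); // true
        System.out.println(urls.mightContain("http://example.com/other.mp4")); // false (almost surely)
    }
}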

Research and analysis
      The Bloom filter seemed awesome, and I thought it could be used in our cache server. So I googled "Bloom filter in CDN industry", which led me to Akamai's research paper: https://www.akamai.com/us/en/multimedia/documents/technical-publication/algorithmic-nuggets-in-content-delivery-technical-publication.pdf
Akamai used a Bloom filter to avoid caching contents that were requested only once within a certain period of time. This optimized their cache servers and brought empirical benefits: it increased the hit rate, reduced disk writes, improved latency, and more. The simple idea: just don't cache every content, as 70% of them are never requested again! The paper was amazing, and surprisingly nobody in our company knew about it. As a developer on the cache server team, I saw many possible optimizations that could be applied to our project. But when I talked about the Bloom filter and the one-hit phenomenon, people were not impressed; they needed proof and results.
So I downloaded 400 GB of HTTP request logs accumulated on 10 randomly chosen physical machines over a period of 3 days. The log files were zipped, so I had to develop a script to compute the number of unique URLs and their occurrences. It was fun to do: I faced the challenge of analyzing 15 million unique URLs which did not fit into RAM, so I did the computation on the hard drive with a divide-and-conquer algorithm, and it led me to the same results Akamai had:

You can see in the graph that requests served only once make up 70% of the total unique requests. I checked more servers, and I got similar results every time: 70%~80% of contents were never requested again. These results proved that our cache servers were wasting a lot of energy caching meaningless contents.
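For the curious, the external counting step can be sketched like this (a simplified reconstruction, not the original script; it assumes the zipped logs have been reduced to one URL per line, and the partition count is made up):

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.GZIPInputStream;

// Hash-partition the URLs into files small enough to fit in RAM,
// then count occurrences per partition (divide and conquer on disk).
public class UrlCounter {
    static final int PARTITIONS = 64; // tune so each partition fits in RAM

    public static void main(String[] args) throws IOException {
        // Phase 1: split the gzipped log into partitions by hash(url).
        PrintWriter[] parts = new PrintWriter[PARTITIONS];
        for (int i = 0; i < PARTITIONS; i++)
            parts[i] = new PrintWriter(new FileWriter("part" + i + ".txt"));
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0]))))) {
            String url;
            while ((url = in.readLine()) != null)
                parts[Math.floorMod(url.hashCode(), PARTITIONS)].println(url);
        }
        for (PrintWriter p : parts) p.close();

        // Phase 2: count each partition independently; the same URL always
        // lands in the same partition, so the counts are exact.
        long unique = 0, oneHit = 0;
        for (int i = 0; i < PARTITIONS; i++) {
            Map<String, Integer> counts = new HashMap<>();
            for (String url : Files.readAllLines(Paths.get("part" + i + ".txt")))
                counts.merge(url, 1, Integer::sum);
            unique += counts.size();
            oneHit += counts.values().stream().filter(c -> c == 1).count();
        }
        System.out.printf("unique=%d, one-hit=%d (%.0f%%)%n",
                unique, oneHit, 100.0 * oneHit / unique);
    }
}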

Development & Simulation
     "Cache on second time" was the solution which we needed. I quickly implemented it with Bloom Filter, the logic was simple, when the cache server receives the response from upstream, we check if bloom filter contains the content's URL , if it does not contain then we add it the bloom filter and skip caching, otherwise, we do follow caching. I used two bloom filters (primary/secondary) to achieve stable accuracy of false positive by rotating them in turn when primary was 70% full:

The simulation simply replayed previous request logs against the cache server; a custom origin server generated the requested contents with their sizes. We sent one day's accumulated request logs from a single machine to the cache server over 3 hours. Each time, we first ran the cache server without the new logic (as is), then switched on the new logic with the Bloom filter.

Results
       Due to certain issues it took a long time to get conclusive results, i.e. a proof of concept. I was lucky to share my ideas with my Russian co-worker, and we achieved the results together. He helped build a resource monitoring environment with collectd, and we simulated requests with the wrk HTTP benchmarking tool. The test results were just jaw-dropping:

With the Bloom filter we reduced system load by 10 times, and the cache server (CS) did not die under heavy load. It stayed stable because it was no longer busy writing to disk all the time. We can see this clearly in the disk metrics:


These graphs clearly show that if we do not cache meaningless contents (the 70%), our cache server becomes more productive and stable. As expected, the cache hit ratio also increased by 25%, RAM cache usage was halved, and disk operations were reduced by 4 times.

Considerations

      We estimated that upstream traffic would increase by around 30% (as 70% of contents are requested only once). It happened as expected in the beginning, as you can see in the graph below (network traffic on the upstream server):
Amusingly, with the new logic (on the right side) the upstream traffic decreases as time passes. This is because more meaningful contents stay cached; they are no longer evicted from the cache the way they were before.


Monday, September 19, 2016

How to use VisualVM

      VisualVM can be very helpful for discovering performance lags in a Java application.
It is one of the easiest profiling tools for Java.


Download VisualVM
https://visualvm.github.io/


Run VisualVM and check the locally running Java apps:


Remote Profiling
Run your Java application with the following JVM arguments (shown here as a full command line; "yourapp.jar" is a placeholder):
java \
 -Djavax.management.builder.initial= \
 -Dcom.sun.management.jmxremote \
 -Dcom.sun.management.jmxremote.port=9010 \
 -Dcom.sun.management.jmxremote.local.only=false \
 -Dcom.sun.management.jmxremote.authenticate=false \
 -Dcom.sun.management.jmxremote.ssl=false \
 -jar yourapp.jar
These parameters make your remote Java application listen for JMX connections on port 9010 (note that authentication and SSL are disabled, so use this only on a trusted network).
Then you can connect to it from VisualVM via File -> Add JMX Connection.
Type the hostname and port, for example: 192.168.10.10:9010
(the IP address of the remote machine and the port)

Performance Profiling
      After you connect to your app from VisualVM, go to the "Sampler" tab and press the "CPU" button.
It is important to sort by "Total Time (CPU)" to see the heaviest CPU consumers at the top of the list.
This gives you a rough idea, but not the details. To get detailed information,
press the "Snapshot" button, which opens the following view:

VisualVM lets you monitor in real time which functions are taking up the most CPU.

This window is very important. From it, you can find which functions, classes,
or packages are making your Java application slow.
It is the key to resolving performance issues in your Java application.
You can play with the sorting options, navigate through the callers,
and check the other tabs ("Hot Spots", etc.).
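If you want something to practice on, a toy program with a deliberate hotspot works well (a made-up example, not from a real project):

// A deliberately slow method that should appear at the top of the
// Sampler results when sorting by "Total Time (CPU)".
public class HotSpotDemo {
    static double slowSum() {
        double sum = 0;
        for (int i = 1; i < 50_000_000; i++) {
            sum += Math.sqrt(i); // cheap, but repeated 50 million times
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            System.out.println(slowSum());
            Thread.sleep(100); // give the sampler time to keep up
        }
    }
}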

Memory Profiling
      Memory usage analysis is similar to the above. Press the "Memory" button in the Sampler window.
Sort by "Bytes" to see the data types (or classes) that consume the most memory at the top.
You can also take a "Snapshot" to see more details.

Conclusion
   VisualVM can be very helpful for monitoring, analysing, and tuning Java application performance.
This is an essential task when developing scalable, distributed, high-performance applications.

Performance tuning for a Web engine


Install Tsung on CentOS

Prerequisites:
1. Install Erlang:

sudo yum -y update && sudo yum -y upgrade
sudo yum install epel-release

sudo yum -y install erlang perl perl-RRD-Simple.noarch perl-Log-Log4perl-RRDs.noarch gnuplot perl-Template-Toolkit


2. Get Tsung
wget http://tsung.erlang-projects.org/dist/tsung-1.6.0.tar.gz

3. Extract and Install
tar zxvf tsung-1.6.0.tar.gz
cd tsung-1.6.0
./configure && make && sudo make install

Note: A sample XML configuration is located at /usr/share/doc/tsung/examples/http_simple.xml
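For reference, a minimal scenario looks roughly like this (adapted from the style of the bundled http_simple.xml; the target server, duration, and arrival rate below are placeholders):

<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="notice" version="1.0">
  <clients>
    <client host="localhost" use_controller_vm="true"/>
  </clients>
  <servers>
    <server host="192.168.10.20" port="80" type="tcp"/>
  </servers>
  <load>
    <arrivalphase phase="1" duration="2" unit="minute">
      <users arrival_rate="50" unit="second"/>
    </arrivalphase>
  </load>
  <sessions>
    <session name="http-get" probability="100" type="ts_http">
      <request><http url="/" method="GET" version="1.1"/></request>
    </session>
  </sessions>
</tsung>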

Set up Cluster Testing with Tsung

1. Add the cluster node info to each node's "/etc/hosts":
sudo vi /etc/hosts
# cluster nodes
192.168.10.10       n1
192.168.10.11       n2
192.168.10.12       n3
192.168.10.13       n4

2. Set up the ~/.ssh/config file so the master can reach every node without a password:
vi ~/.ssh/config
Host n1
  Hostname n1
  User tsung
  Port 722
  IdentityFile /home/tsung/.ssh/my_key_rsa7
Host n2
  Hostname n2
  User tsung
  Port 722
  IdentityFile /home/tsung/.ssh/my_key_rsa7
.....
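Once the nodes are reachable over passwordless SSH, list them as load generators in the scenario's clients section, for example (the weight and maxusers values are placeholders):

<clients>
  <client host="n1" weight="1" maxusers="30000"/>
  <client host="n2" weight="1" maxusers="30000"/>
  <client host="n3" weight="1" maxusers="30000"/>
  <client host="n4" weight="1" maxusers="30000"/>
</clients>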

Test and Visualize Results
1. Start Tsung on the master server:
tsung -f /home/tsung/test/selected_scenario.xml start

2. Plot graphs with the bundled Perl script:
/usr/lib/tsung/bin/tsung_stats.pl --stats /home/tsung/.tsung/log/$tsung_path/tsung.log

Change Kernel params

vi /etc/sysctl.conf

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65000
fs.file-max = 65000

Then apply them with: sudo sysctl -p

Source: http://tsung.erlang-projects.org/user_manual/faq.html#why-do-i-have-error-connect-emfile-errors

References:
https://gist.github.com/huberflores/2827890
https://github.com/ngocdaothanh/tsart
https://gist.github.com/clasense4/47438a884cabca9e66c8
http://www.jeramysingleton.com/install-erlang-and-elixir-on-centos-7-minimal/
https://jackiechen.org/2015/12/04/use-tsung-to-test-https-site/

Tuesday, July 19, 2016

Java: BloomFilter Benchmark

Intro

A Bloom filter is a probabilistic data structure for testing whether an element is in a data set.
It is similar to a HashSet in that it tells us whether the set contains a certain element or not. The difference is that the answer contains(element)=TRUE is probabilistic.
In our example we set the false-positive probability to 0.01, which means the answer "it contains" is 99% correct.
Read more about Bloom filter from here: https://en.wikipedia.org/wiki/Bloom_filter
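As a back-of-the-envelope check: the theoretical number of bits a Bloom filter needs for n elements at false-positive probability p is m = -n * ln(p) / (ln 2)^2, i.e. about 9.6 bits (1.2 bytes) per element at p = 0.01. For a million elements that is roughly 1.2 MB before any implementation overhead, which explains why the footprint below is so much smaller than a HashSet's.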

Scenario

We create two arrays of random elements, with 1,000,000 elements in each.
Then we insert the first array into the BloomFilter, iterate over the first array, and check whether each item is contained in the BloomFilter. The second array is used only for checking non-existing elements.
We do the same for a HashSet, as described above.

Benchmarking code

We used a customized version of the Bloom filter which can accept a byte array.
(A previous version of this post encoded a string for every put and contains call, which misrepresented the performance of the Bloom filter.)

Source code:
(The source code is not organized for compilation; please adapt it for your use.)
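If you want a self-contained approximation of the same benchmark, here is a sketch using Guava's BloomFilter instead of our customized version (so the exact numbers will differ):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class BloomBenchmark {
    static final int N = 1_000_000;

    public static void main(String[] args) {
        String[] existing = randomStrings(new Random(42));

        BloomFilter<String> bloom = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), N, 0.01);
        long t = System.nanoTime();
        for (String s : existing) bloom.put(s);
        report("BloomFilter add()", t);

        t = System.nanoTime();
        for (String s : existing) bloom.mightContain(s);
        report("BloomFilter contains(), existing", t);

        Set<String> set = new HashSet<>(N * 2);
        t = System.nanoTime();
        for (String s : existing) set.add(s);
        report("HashSet add()", t);

        t = System.nanoTime();
        for (String s : existing) set.contains(s);
        report("HashSet contains(), existing", t);
    }

    static String[] randomStrings(Random rnd) {
        String[] a = new String[N];
        for (int i = 0; i < N; i++) a[i] = Long.toHexString(rnd.nextLong());
        return a;
    }

    static void report(String label, long startNanos) {
        double sec = (System.nanoTime() - startNanos) / 1e9;
        System.out.printf("%s: %.3fs, %.0f elements/s%n", label, sec, N / sec);
    }
}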

Performance output:
Testing BloomFilter  1000000 elements
add(): 0.176s, 5681818.181818183 elements/s
contains(), existing: 0.171s, 5847953.216374269 elements/s
Testing HashSet  1000000 elements
add(): 0.181s, 5524861.878453039 elements/s
contains(), existing: 0.08s, 1.25E7 elements/s

Memory size:


The BloomFilter is the winner here. With 99% correctness, its memory footprint is almost 40 times smaller than the HashSet's.


If we reduce the correctness to 90%, the memory footprint becomes about 80 times smaller.

Conclusion

We saw that the BloomFilter is as fast as the HashSet, while being far more space-efficient.
If we keep a list of URLs in a HashSet in memory, a BloomFilter can shrink it by about 40 times. For example, 500 MB of occupied memory can be reduced to about 12 MB at 99% correctness.

Sunday, May 8, 2016

NLP for Uzbek language

    Natural language processing is an essential tool for text mining in the data analysis field. In this post, I want to share my approach to developing a stemmer for the Uzbek language.
     Uzbek is spoken by 27 million people around the world, and there is a lot of textual material on the internet in Uzbek, with more appearing all the time.
While working on my weekend project "FlipUz" (a news aggregator for Uzbek news sites), I stumbled on the problem of automatically tagging news into different categories. This requires a good NLP library, and I was not able to find one for the Uzbek language.
That is how I got the motivation to develop a stemmer for Uzbek.
      In short, stemming is an algorithm that removes suffixes from the end of a word, exposing its core part. For example: rabbits -> rabbit.
As Uzbek is similar to Turkish, I was curious whether there was a stemmer for Turkish, and I found one: the Turkish stemmer in Snowball. Its key approach was to use a finite state machine.
It was an interesting approach, and I liked its simplicity. The only thing I had to do was model the suffix transformations as a state machine.
Since a stemmer should eventually handle all kinds of words, nouns were the target in the first step.
Therefore, I created the state machine for nouns:
While drawing this diagram, I referenced the Uzbek phonetics and word-formation rules from an Uzbek language textbook. The book was very helpful, though I still have not used much of it yet.

I used Python for its simplicity and its rich ecosystem of external libraries.
Here is the source code:
from fysom import Fysom
def stem(word):
    # Finite state machine over Uzbek noun suffixes. Each event is
    # (suffix, from_state, to_state); we scan the word from the end,
    # stripping a suffix whenever the FSM allows the transition.
    fsm = Fysom(initial='start',
                    events=[
                    ('dir', 'start', 'b'),
                    ('dirda', 'start', 'b'),
                    ('ku', 'start', 'b'),
                    ('mi', 'start', 'b'),
                    ('mikan', 'start', 'b'),
                    ('siz', 'start', 'b'),
                    ('day', 'start', 'b'),
                    ('dek', 'start', 'b'),
                    ('niki', 'start', 'b'),
                    ('dagi', 'start', 'b'),
                    ('mas', 'start', 'd'),
                    ('ning', 'start', 'f'),
                    ('lar', 'start', 'g'),
                    ('lar', 'e', 'g'),
                    ('dan', 'd', 'e'),
                    ('da', 'd', 'e'),
                    ('ga', 'd', 'e'),
                    ('ni', 'd', 'e'),
                    ('dan', 'start', 'e'),
                    ('da', 'start', 'e'),
                    ('ga', 'start', 'e'),
                    ('ni', 'start', 'e'),
                    ('lar', 'f', 'g'),
                    ('miz', 'start', 'h'),
                    ('ngiz', 'start', 'h'),
                    ('m', 'start', 'h'),
                    ('si', 'start', 'h'),
                    ('i', 'start', 'h'),
                    ('ng', 'start', 'h'),
                    ('miz', 'f', 'h'),
                    ('ngiz', 'f', 'h'),
                    ('m', 'f', 'h'),
                    ('si', 'f', 'h'),
                    ('i', 'f', 'h'),
                    ('ng', 'f', 'h'),
                    ('miz', 'e', 'h'),
                    ('ngiz', 'e', 'h'),
                    ('m', 'e', 'h'),
                    ('si', 'e', 'h'),
                    ('i', 'e', 'h'),
                    ('ng', 'e', 'h'),
                    ('lar', 'h', 'g'),
                    ('dagi', 'g', 'start')
                    ]
                )
    i = len(word) - 1   # start index of the current suffix candidate
    j = len(word)       # current end of the (unstripped) word
    while True:
        if i <= 0:
            break
        v = word[i:j]   # suffix candidate
        if fsm.can(v):
            # prefer the longer suffix when a bare 'i' would also match
            if v == 'i' and fsm.can(word[i-1:j]):
                i = i - 1
                continue
            fsm.trigger(v)
            if fsm.current == 'h':
                if word[i-1:i] == 'i':
                    i = i - 1  # skip the linking 'i'
                    if word[i-1:i] == 'n':
                        # this is the 'ning' suffix
                        fsm.current = 'start'
                        continue
            elif fsm.current == 'b':
                fsm.current = 'start'
            j = i       # strip the suffix: shrink the word boundary
        i = i - 1
    return word[:j]
It is also available on GitHub.
We are collaborating with fellow Uzbek developers to build a full-featured NLP library for the Uzbek language.
The next step is to apply stemming to verbs.
Let me know if you have any ideas on this. Thanks.

Test output:

print(stem('mahallamizdagilardanmisiz'))
mahalla

Tuesday, December 8, 2015

How to use Docker

Docker
Docker official website: https://www.docker.com/
Setup Docker
Prepare a fresh installation of CentOS; I am using CentOS 6.7.
Update the yum repositories:
> rpm -iUvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
> yum update -y
 
Install Docker
> yum -y install docker-io

Pull a container image; I am going to use a CentOS container.
To pull the latest (CentOS 7):
>  docker pull centos
Or
> docker pull centos:centos6

Check which container images are installed:
> docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
centos              centos6             3bbbf0aca359        2 weeks ago         190.6 MB
centos              latest              ce20c473cd8a        2 weeks ago

Run docker from image:
> docker run -i -t centos:centos6 /bin/bash
Note: this creates a container from the image (you can see the container ID as the hostname)

List containers:
>docker ps
>docker ps -a (to list all containers, running and stopped)

Stop/Start/Remove container:
>docker start ContainerID
>docker stop ContainerID
>docker rm ContainerID
 
Reconnect to a container:
>docker attach ContainerID
Or
>docker exec -it ContainerID bash

Run a docker container in the background:
> docker run -itd --name cs1 --net=none --hostname=cs1.csteam.net -v /csdata/cs1:/csdata/cs1:rw -v /root/.ssh:/root/.ssh:rw --privileged=true centos:centos6 /bin/bash
To detach from the container's console without stopping it:
Ctrl+P => Ctrl+Q

Networking

For the following steps, you need to install the pipework script:
#sudo bash -c "curl https://raw.githubusercontent.com/jpetazzo/pipework/master/pipework > /usr/local/bin/pipework"
#sudo chmod +x /usr/local/bin/pipework

If you want to assign the container an IP address via DHCP, you don't need the next section; the DHCP setup is easier:
#pipework eth0 ContainerID dhclient
How to expose a container with a private IP address on the local network: the MacVLan method
The host network type is easy to set up (--net=host), but then the container uses the same network interfaces as the host and cannot have its own IP address. That is why we will use a bridged network.
The easiest way to set up a bridged network is the "pipework" script installed above, which automates the procedure.

Install dependencies:
>yum -y install bridge-utils net-tools
Note:
If your host server is CentOS 6, you need to upgrade the iproute RPM to support the "ip netns" command.
Download the RPM from:
https://repos.fedorapeople.org/repos/openstack/EOL/openstack-havana/epel-6/iproute-2.6.32-130.el6ost.netns.2.x86_64.rpm
>rpm -Uvh iproute-2.6.32-130.el6ost.netns.2.x86_64.rpm

Create the bridge (suppose your host's IP is 10.40.198.150 on eth0):
>ip addr del 10.40.198.150/24 dev eth0
>ip link add link eth0 dev eth0m type macvlan mode bridge
>ip link set eth0m up
>ip addr add 10.40.198.150/24 dev eth0m
>route add default gw 10.40.198.1
>service network restart
You may need to wait a few minutes for the settings to apply.
Finally, assign the container $CID its new private (local) IP:
>pipework eth0 $CID 10.40.198.155/24@10.40.198.1
Done!
You can now ping it from another PC on the local network: ping 10.40.198.155

How to remove an unused virtual bridge:
>ifconfig br0 down
>brctl delbr br0

Save a container as an image:
>docker commit $CID myimage:newcs

Thursday, November 26, 2015

NAT Traversal or how to make P2P on Android

Many of us have used BitTorrent (or uTorrent) to download files from the internet quickly. The download speed is high thanks to peer-to-peer technology: rather than downloading a file from a server, we get it from other computers.
But how can two computers that have local IPs and sit behind NAT connect to each other?
That is where NAT traversal methodologies come in.
Note that there are mainly 2 types of NAT:
symmetric NAT (complex NATs, e.g. carrier-grade NAT) and full-cone NAT (home networks or small enterprises).
Let us consider full-cone NATs first.

The methodologies of NAT traversal are:
UPnP - an old, hardware-oriented method
NAT-PMP (later succeeded by PCP) - introduced by Apple; also hardware-oriented (i.e. not all routers support it, and even when they do, it is often turned off by default)
UDP punching - done with STUN, which uses a public server to discover the NAT's public IP and port (see the sketch after this list)
TCP punching - similar to UDP punching, but more complicated
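To make UDP punching concrete, here is a bare-bones sketch in Java (the rendezvous protocol is made up: it assumes a public server that replies to a "HELLO" datagram with the other peer's public "ip:port"; real implementations use STUN):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class UdpPunch {
    public static void main(String[] args) throws Exception {
        String serverHost = args[0]; // public rendezvous server
        int serverPort = Integer.parseInt(args[1]);
        DatagramSocket socket = new DatagramSocket(); // the NAT maps this port

        // 1. Register with the server so it learns our public endpoint.
        byte[] hello = "HELLO".getBytes(StandardCharsets.UTF_8);
        socket.send(new DatagramPacket(hello, hello.length,
                InetAddress.getByName(serverHost), serverPort));

        // 2. The server replies with the other peer's public "ip:port".
        DatagramPacket reply = new DatagramPacket(new byte[64], 64);
        socket.receive(reply);
        String[] peer = new String(reply.getData(), 0, reply.getLength(),
                StandardCharsets.UTF_8).split(":");
        InetAddress peerAddr = InetAddress.getByName(peer[0]);
        int peerPort = Integer.parseInt(peer[1]);

        // 3. Both peers fire packets at each other's public endpoint.
        //    The outgoing packets open (punch) the NAT mappings; once
        //    both mappings exist, packets start getting through.
        byte[] ping = "PING".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < 10; i++) {
            socket.send(new DatagramPacket(ping, ping.length, peerAddr, peerPort));
            Thread.sleep(300);
        }
        DatagramPacket incoming = new DatagramPacket(new byte[64], 64);
        socket.receive(incoming); // hearing from the peer means the hole is open
        System.out.println("Punched! Heard from " + incoming.getAddress());
    }
}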

Symmetric NATs are a big issue. They are hard to punch because they assign router ports unpredictably, so there is only a tiny chance of establishing a connection.
There are some approaches that can help, but they are difficult to implement in practice:
"Large Scale Symmetric NAT Traversal with Two Stage Hole Punching":
https://drive.google.com/file/d/0B1IimJ20gG0SY2NvaE4wRVVMbG8/view
Fortunately, symmetric NATs are used mostly in security-restricted areas, and they are becoming less popular as people come to understand how important P2P is.

So, how can we practically make a P2P connection on Android?
I found 2 ways: using libraries (harder) and WebRTC (easier).
Libraries:
https://github.com/jitsi/ice4j
https://github.com/htwg/UCE

WebRTC:
As you know, WebRTC uses P2P, and internally it uses ICE (which combines STUN and TURN) to establish the P2P connection.
This option is easier because the WebRTC library takes care of future updates, and it is a cool new standard.
https://github.com/pchab/AndroidRTC


References:
http://stackoverflow.com/questions/9656198/java-udp-hole-punching-example-connecting-through-firewall?rq=1
https://en.wikipedia.org/wiki/STUN
https://en.wikipedia.org/wiki/UDP_hole_punching
http://chimera.labs.oreilly.com/books/1230000000545/ch04.html
http://stackoverflow.com/questions/12359502/udp-hole-punching-not-going-through-on-3g?rq=1
Theory: http://www.bford.info/pub/net/p2pnat/index.html#SECTION00040000000000000000
Tutorial on ice4j: http://blog.sharedmemory.fr/en/2014/06/22/gsoc-2014-ice4j-tutorial/