Friday, May 30, 2014

Running Mahout K-Means example

   CDH4 comes with the Mahout library by default, so you don't need to install Mahout unless you want to upgrade to the latest version.
Mahout is a scalable machine learning library that supports many algorithms for clustering, classification, topic modeling, prediction and recommendation systems. It can take terabytes of input data and finish clustering or classification in less than an hour, depending on how powerful your Hadoop cluster is. In my case I am using 4 powerful PCs running virtual nodes.
   I had a task to model topics from two months of Twitter data. I tried to run LDA (Latent Dirichlet Allocation, a topic modeling algorithm) with R on one PC, and it took several hours just to build a document matrix for one day of tweets (around 800,000). And to improve the results, the number of topics K should be larger than 500, which increases the LDA processing time dramatically. That is how I turned to Mahout.
   Besides Mahout there are other parallel machine learning libraries; notable ones are MLlib, SparkR and hiveR. Initially I tried SparkR and MLlib, but since they depend on Apache Spark (an alternative to Hadoop MapReduce), they were not mature enough at the time to handle huge datasets. Apache Spark is a very fast and promising in-memory map-reduce framework, which also makes it something of a memory eater. There were many cases where I tried to run two queries in parallel with Apache Shark and Spark just died by itself; it could not handle the heavy load.
I have not tried hiveR yet; I will post about it later.
The three reasons why I chose Mahout are stability, maturity and good support.

Let us run the Mahout k-means example on a CDH4 node.
Go to the "/opt/cloudera/parcels/CDH/share/doc/mahout-doc-0.7+22/examples/bin" folder and list its contents. There is a "cluster-reuters.sh" file that downloads the input data set (news articles from Reuters), generates sequence files and converts them into vectors (tf, tf-idf, word counts, etc.).
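
Under the hood the script essentially chains a few Mahout jobs together. Here is a rough sketch of the equivalent commands (directory names and parameters are illustrative; the exact ones are in the script):

mahout seqdirectory -i reuters-out -o reuters-out-seqdir -c UTF-8
mahout seq2sparse -i reuters-out-seqdir -o reuters-out-sparse
mahout kmeans -i reuters-out-sparse/tfidf-vectors -c reuters-kmeans-clusters -o reuters-kmeans -k 20 -x 10 -cl
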
Note that before running the shell script you need to define HADOOP_HOME, MAHOUT_HOME and CLASSPATH.

Add this to the beginning of the "cluster-reuters.sh" file:

# CDH parcel locations for Hadoop and Mahout
HADOOP_HOME="/opt/cloudera/parcels/CDH/lib/hadoop"
MAHOUT_HOME="/opt/cloudera/parcels/CDH/lib/mahout"
MAHOUT=$MAHOUT_HOME/bin/mahout
# put hadoop-common and all the Mahout jars on the classpath
CLASSPATH=${CLASSPATH}:$MAHOUT_HOME/lib/hadoop/hadoop-common.jar
for f in $MAHOUT_HOME/lib/*.jar; do
        CLASSPATH=${CLASSPATH}:$f;
done

Make sure to run the script as the "hdfs" user, and all the folders it accesses must be accessible to the hdfs user.
>su hdfs
This makes sure you don't mess up account privileges. After the dataset is downloaded, the sequence files are created in a local directory.
You can change the script to generate the sequence files directly on HDFS: just remove "MAHOUT_LOCAL=true" from the script.

One last thing: change the "dfs" commands to "fs"; "dfs" is the old, deprecated form of the HDFS command.
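
For example, assuming the script calls them as "$HADOOP dfs ..." (check what the file actually contains and adjust the pattern accordingly), a quick way to do it is:

>sed -i 's/dfs -/fs -/g' cluster-reuters.sh

After that, run ./cluster-reuters.sh as the hdfs user and, if I remember correctly, pick the k-means option when the script asks which clustering algorithm to run.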

Tuesday, May 27, 2014

How I visualized the issues related to Sewol for 04.16 ~ 05.05



세월호 이슈 타임라인 (Sewol issue timeline) from Hikmat on Vimeo.

On April 16 there was a big tragedy in Korea: the ferry Sewol (세월호) sank in just a few hours, causing the deaths of hundreds of students.
After the tragedy, there were millions of tweets talking about Sewol-related issues. By analyzing these tweets we could see how people felt about the tragedy, how their opinions changed over time, whom they blamed and how much they hoped for the survival of the missing students.

  We calculated TF-IDF values for Sewol-related keywords, and from those we selected the top issues related to this tragedy. We exported the results to Excel, where the rows represented the date and time and the columns represented the words.
  I used 3ds Max for the visualization. It has the MaxScript programming environment. I wrote a script to import the data from Excel and draw the words in 3D space, then animated their size by their TF-IDF values. The positions of the words are distributed randomly. Watch it and feel free to comment on what you observed in the video.

Here is the MaxScript code:

Which devices are popular for tweeting in Korea

I was wondering how many people use Android to tweet and how many use iPhone, so I did a little research on this.
Fortunately, there is a "source" field in the raw JSON tweet data, so I grouped tweets by that field for a random day.

hive>  select source, count(*) as cnt from twitter where datehour>=2014052000 and datehour<=2014052023 group by source order by cnt desc;  

Result:


Android 593671
Web 204334
Twittbot.net 153035
iPhone 144256
Tweetdeck 65767
iPad 19933

Note that the tweets are for May 20, 2014 and include only those containing Korean syllables.
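
By the way, in the raw JSON the "source" field usually comes as an HTML anchor tag (something like <a href="...">Twitter for Android</a>). If your table stores it in that raw form, a query along these lines (an untested sketch against my table layout) pulls out just the application name:

>hive -e "select regexp_extract(source, '>(.*?)<', 1) as device, count(*) as cnt from twitter where datehour>=2014052000 and datehour<=2014052023 group by regexp_extract(source, '>(.*?)<', 1) order by cnt desc;"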

Monday, May 26, 2014

NLP (Natural Language Processing) libraries for Korean language

There are mainly two NLP libraries available for the Korean language. One is closed source with a limited-period license, and the other is fully open source.

1. Kookmin NLP library
(http://nlp.kookmin.ac.kr/)

This library has a better dictionary and an automatic word-spacing feature that gives nice output. It has a long development history and is considered one of the best NLP libraries for the Korean language. Its features include automatic word spacing, a morphological analyzer, noun extraction and others.
However, the license is free only for non-commercial purposes.
The download page is at http://nlp.kookmin.ac.kr/HAM/kor/download.html
but the latest version can be found on this blog: http://cafe.daum.net/nlpk


2. Hannanum project (http://kldp.net/projects/hannanum)

The good thing about this project is that it is fully open source.
It was developed by KAIST graduates in the Java programming language.
The dictionaries and grammatical rules are open for anyone to change and improve.
It is also available in the R programming language (you can use it by installing the KoNLP package in R).
I liked the simplicity and openness of this project; however, its dictionary lacks many common Korean words (I added the word 세월호 manually as it was not there), it has no automatic word spacing, and some grammar rules are missing.
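
For reference, this is roughly how I called it through R from the shell. This is a sketch from memory: the function names (extractNoun, mergeUserDic) and the "ncn" noun tag are from the KoNLP version I used, so they may differ in newer releases.

>Rscript -e 'library(KoNLP); mergeUserDic(data.frame("세월호", "ncn")); print(extractNoun("세월호 관련 트윗"))'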

Conclusion:


Kookmin NLP library

Pros: better automatic word spacing, rich dictionary, better recognition of nouns, verbs, adjectives, etc.
Cons: closed source, license not free for commercial use, command-line based

Hannanum project

Pros: open-source, Java library, free, simplicity, available in R
Cons: small dictionary, no automatic word spacing, missing grammar rules (e.g. 학생들 was not recognized even though the word 학생 was in the dictionary)

Thursday, May 22, 2014

Three essential things to do while building Hadoop environment

Last year I set up a Hadoop environment using Cloudera Manager. (Basically I followed this video tutorial: http://www.youtube.com/watch?v=CobVqNMiqww)
I used CDH4 (Cloudera's Hadoop distribution), which included HDFS, MapReduce, Hive, ZooKeeper, HBase, Flume and other essential components. It also included YARN (MapReduce 2), but it was not stable at the time, so I used classic MapReduce instead.
I installed CDH4 on 10 CentOS nodes, set up Flume to collect Twitter data, and used crontab to schedule indexing of the Twitter data into Hive.
Anyway, I want to share some of the experiences and challenges I faced.
First, let me give solutions to some problems that almost everyone runs into while using Hadoop.

1. vm.swappiness warning on hadoop nodes

It is easy to get rid of this warning by simply running this shell command on every node:
>sysctl -w vm.swappiness=0
More details are available on Cloudera's site.
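To make the setting survive a reboot, you can also persist it in the standard sysctl config file:
>echo "vm.swappiness = 0" >> /etc/sysctl.conf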

2. Make sure to synchronize time on all nodes (otherwise it will give error on nodes)

In Linux, the ntpd service syncs time over the internet. Do the following on all nodes.
Install ntp:
>yum install ntp
Run these commands in order:
>chkconfig ntpd on
>ntpdate pool.ntp.org
>/etc/init.d/ntpd start
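To check that a node is actually syncing, run:
>ntpq -p
It should list the NTP servers the node is polling.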

3. Problem uploading files to HDFS (hadoop fs -put localfiles destinationOnHdfs)

The error message looks like this while uploading a file to HDFS: "INFO hdfs.DFSClient: Exception in createBlockOutputStream".

This is typically caused by firewall settings blocking the Hadoop ports. Here is the quick solution:
>service iptables save
>service iptables stop
>chkconfig iptables off
Do this on all nodes. Now you should be able to upload files to HDFS without any problems.
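
Disabling the firewall completely is the quick fix. If your cluster is reachable from an outside network, a safer alternative is to open only the Hadoop ports instead; for example, 50010 is the default DataNode data-transfer port in CDH4 (check your configuration for the full list of ports you need):
>iptables -I INPUT -p tcp --dport 50010 -j ACCEPT
>service iptables save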