Posts

Showing posts from May, 2014

Running Mahout K-Means example

CDH4 comes with the Mahout library by default, so you don't need to install Mahout unless you want to upgrade to the latest version. Mahout is a scalable machine learning library that supports many algorithms for clustering, classification, topic modeling, prediction, and recommendation systems. It can take terabytes of input data and finish clustering or classification in under an hour, depending on how powerful your Hadoop cluster is. In my case I am using 4 powerful PCs with virtual nodes.

I had a task to model topics from two months of Twitter data. I tried to run LDA (Latent Dirichlet Allocation, a topic modeling algorithm) with R on one PC, and it took several hours just to build the document matrix for a single day of tweets (around 800,000). Furthermore, to improve the results the number of topics K should be larger than 500, which increases the LDA processing time exponentially. That is how I turned to Mahout.

Besides Mahout, there are other parallel machine learning libraries. Notable on
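For reference, here is a minimal sketch of the usual Mahout k-means pipeline on the command line. The HDFS paths and the value of k are placeholders of my own, and the flags are the ones I know from the Mahout 0.7 CLI that ships with CDH4, so double-check them against your version:

> mahout seqdirectory -i tweets_raw -o tweets_seq -c UTF-8    # raw text files -> SequenceFiles
> mahout seq2sparse -i tweets_seq -o tweets_vec -wt tfidf     # tokenize and build TF-IDF vectors
> mahout kmeans -i tweets_vec/tfidf-vectors -c kmeans_seed -o kmeans_out -k 20 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cl    # seed 20 random clusters, 10 iterations max, then assign points
> mahout clusterdump -i kmeans_out/clusters-*-final -d tweets_vec/dictionary.file-0 -dt sequencefile -o clusters.txt    # dump top terms per cluster to a local file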

How I visualized the issues related to Sewol for 04.16 ~ 05.05

Sewol Issue Timeline (세월호 이슈 타임라인) from Hikmat on Vimeo. On April 16 there was a big tragedy in Korea: the ferry named Sewol (세월호) sank in just a few hours, causing the deaths of hundreds of students. After the tragedy, there were millions of tweets talking about Sewol-related issues. By analyzing these tweets we could see how people felt about the Sewol tragedy, how their opinions changed over time, whom they blamed, and how much they hoped for the survival of the missing students.

We calculated TF-IDF values for Sewol keywords, and from those we selected the top issues related to this tragedy. We exported the results to Excel, where the rows represented the date and time and the columns represented the words.

I used 3ds Max for the visualization. It has the MaxScript programming environment, so I wrote a script to import the data from Excel and draw the texts in 3D space, then animated their size by their TF-IDF values. The positions of the words are distributed randomly. Watch it and feel free to comment on what you
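For reference, the standard TF-IDF weighting looks like this (a sketch of the most common formulation; the post does not spell out which variant was used):

$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log \frac{N}{\mathrm{df}(t)}$

where tf(t,d) is how often term t occurs in document d, df(t) is the number of documents containing t, and N is the total number of documents. Terms that are frequent in one document but rare across the collection get the highest weights, which is why it surfaces the distinctive issue words for each time slot.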

Which devices are popular for tweeting in Korea

I was wondering how many people use Android to tweet and how many use iPhone, so I did a bit of research on this. Fortunately, there is a "source" field in the raw JSON tweet data, so I grouped tweets by that field for a random day:

hive> select source, count(*) as cnt from twitter where datehour>=2014052000 and datehour<=2014052023 group by source order by cnt desc;

Result:
Android 593671
Web 204334
Twittbot.net 153035
iPhone 144256
Tweetdeck 65767
iPad 19933

Note that the tweets are from May 20 and include only the ones containing Korean syllables.
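As a side note, in raw Twitter JSON the source field usually arrives as a full HTML anchor tag (e.g. <a href="...">Twitter for Android</a>). If your table stores it unprocessed, a variant of the same query can strip the markup with Hive's regexp_extract; this is a sketch reusing the table and column names above, which assumes the raw anchor-tag format:

hive> select regexp_extract(source, '>(.*)<', 1) as device, count(*) as cnt from twitter where datehour>=2014052000 and datehour<=2014052023 group by regexp_extract(source, '>(.*)<', 1) order by cnt desc;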

NLP (Natural Language Processing) libraries for Korean language

There are mainly two NLP libraries available for the Korean language: one is fully open source, and the other is free only for non-commercial use.

1. Kookmin NLP library ( http://nlp.kookmin.ac.kr/ )
This library has the better dictionary, and its automatic word spacing feature gives nice output. It has a very long development history and is considered one of the best NLP libraries for Korean. Its features include automatic word spacing, a morphological analyzer, noun extraction, and others. However, the license is free only for non-commercial purposes. The download page is at http://nlp.kookmin.ac.kr/HAM/kor/download.html , but the latest version can be found on this blog: http://cafe.daum.net/nlpk

2. Hannanum project ( http://kldp.net/projects/hannanum )
The good thing about this project is that it is fully open source. It was developed by KAIST graduates in the Java programming language. The dictionaries and grammatical rules are open to change and improve. Also it

Three essential things to do while building Hadoop environment

Last year I set up a Hadoop environment using Cloudera Manager. (Basically I followed this video tutorial: http://www.youtube.com/watch?v=CobVqNMiqww ) I used CDH4 (Cloudera's Hadoop distribution), which included HDFS, MapReduce, Hive, ZooKeeper, HBase, Flume, and other essential components. It also included YARN (MapReduce 2), but it was not stable yet, so I used classic MapReduce instead. I installed CDH4 on 10 CentOS nodes, set up Flume to collect Twitter data, and used "crontab" to schedule the indexing of the Twitter data into Hive. Anyway, I want to share some of the experiences and challenges I faced. First, let me give solutions to some problems that everyone must have faced while using Hadoop.

1. vm.swappiness warning on Hadoop nodes
It is easy to get rid of this warning by simply running this shell command on the nodes:
>sysctl -w vm.swappiness=0
More details are written on Cloudera's site.

2. Make sure to synchronize time on all nodes (otherwise it will give errors on the n
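Here is a minimal sketch of both fixes on CentOS, assuming root access and the stock ntp package; the step that persists vm.swappiness across reboots is my own addition, not from Cloudera's instructions:

# 1. Turn swapping down now, and keep the setting after a reboot
> sysctl -w vm.swappiness=0
> echo "vm.swappiness = 0" >> /etc/sysctl.conf

# 2. Keep clocks in sync across the cluster with NTP
> yum install -y ntp
> service ntpd start
> chkconfig ntpd on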