Posts

Showing posts from July, 2014

ThreeJs : South Korea's Entry statistics from Asia

     Using Three.js (a JavaScript 3D library that runs in HTML5 browsers), I modified existing code to visualize the number of people entering South Korea from various Asian countries. Japan and China are at the top.

Building Term Matrix on HBase and Calculating TF-IDF

     Finding a term's related terms in real time is not easy when you have 2 million tweets and 10 million words. Although Hadoop can process huge amounts of data in parallel, it does not work in real time. What if we want to see the top related words for the term "축구" (soccer) in real time? Top related words can be found by calculating TF (term frequency) values for each word in a certain set of documents. Then we calculate IDF (inverse document frequency) values by looking at all documents. IDF helps to push down meaningless words (이, 다, 나, 너, 니, 하고, 보다 and so on) that appear in almost every document. Then we calculate TF-IDF and sort the words by that value. Check out TF-IDF on Wikipedia: http://en.wikipedia.org/wiki/Tf%E2%80%93idf The process is quite complicated, but I will try to explain it as simply as possible. Here are the steps that I went through: 1. Extracting the tweetId and the term from Hive into a text file.    By using Hive I split each tweet into words and separated eac...
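The TF, IDF and sorting steps described in this excerpt can be sketched in a few lines of plain Java. This is only a small in-memory illustration, not the HBase-based pipeline from the post: the class name `TfIdfSketch`, the methods `relatedTerms` and `topTerms`, and the sample documents are all made up for the example. It also assumes that the "certain set of documents" means the tweets containing the query term, and that IDF is log(N / df).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch; the post itself stores the term matrix in HBase.
class TfIdfSketch {

    /** Scores every term that co-occurs with `query` (assumption: TF is counted over documents containing the query). */
    static Map<String, Double> relatedTerms(List<List<String>> docs, String query) {
        Map<String, Integer> df = new HashMap<>(); // number of documents containing each term
        Map<String, Integer> tf = new HashMap<>(); // term counts within documents that contain the query
        for (List<String> doc : docs) {
            Set<String> unique = new HashSet<>(doc);
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
            if (unique.contains(query)) {
                for (String term : doc) {
                    if (!term.equals(query)) {
                        tf.merge(term, 1, Integer::sum);
                    }
                }
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // IDF = log(N / df): a term present in every document gets IDF 0
            double idf = Math.log((double) docs.size() / df.get(e.getKey()));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    /** Returns the k highest-scoring terms; ties are broken alphabetically. */
    static List<String> topTerms(Map<String, Double> scores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample "tweets", already split into words
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("soccer", "worldcup", "the"),
                Arrays.asList("soccer", "goal", "the"),
                Arrays.asList("weather", "the"),
                Arrays.asList("soccer", "worldcup", "the"),
                Arrays.asList("weather", "rain", "the"));
        System.out.println(topTerms(relatedTerms(docs, "soccer"), 2));
    }
}
```

In this toy data the word "the" appears in every document, so its IDF is 0 and it drops to the bottom of the ranking no matter how often it occurs, which is the effect IDF has on the Korean particles listed above.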

Solving java.lang.ClassNotFoundException problem while exporting Jar file from Eclipse

While running my compiled jar files on a Hadoop cluster, I hit this error many times: Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration Usually I don't use Maven; I attach libraries to the project manually. The libraries I use are mostly Apache libraries related to Hadoop, Hive, HBase, etc. After a successful compilation I export the project as a jar file. I tried exporting it in many ways, but I got this ClassNotFoundException every time. So I realized that exporting my project together with the attached jar libraries is not the best solution to the problem. Then I found this solution on StackOverflow: http://stackoverflow.com/questions/2096283/including-jars-in-classpath-on-commandline-javac-or-apt It describes how to add the required (dependency) jar files when running the main jar. It is simple! You just put all the jars into the same folder as your main jar file and run: >java -cp *:. com.kh.demo.RealtimeTFIDF Note ...
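When this exception shows up, it can help to check what the JVM actually sees before changing how the jar is exported. The following is a minimal diagnostic sketch, not part of the original post: the class name `ClasspathCheck` and its method `isLoadable` are hypothetical. It prints each classpath entry and reports whether a given class can be loaded.

```java
import java.io.File;

// Hypothetical helper for diagnosing ClassNotFoundException; not from the original post.
class ClasspathCheck {

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the error message above
        String name = args.length > 0 ? args[0] : "org.apache.hadoop.conf.Configuration";
        String classpath = System.getProperty("java.class.path");
        for (String entry : classpath.split(File.pathSeparator)) {
            System.out.println("classpath entry: " + entry);
        }
        System.out.println(name + (isLoadable(name) ? " is loadable" : " is NOT on the classpath"));
    }
}
```

Compiled into the folder that holds the jars and started with the same `-cp *:.` option as the main class, it should list every jar the `*` wildcard picked up. The wildcard matches only jar files, which is why `.` is added for loose class files in the current folder.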