Friday, June 13, 2014

Korean NLP on Hive

These days I am building a small platform for doing "Topic Modeling" with Apache Mahout.
While working on this project I tried to make the NLP processing as fast as possible. Previously, I was running NLP on a single node after retrieving the tweet messages from Hive. Then I found out about UDFs (User Defined Functions), which make it possible to run custom libraries on the Hadoop nodes during the map phase of MapReduce.
Using a UDF, I wrapped the open-source Korean NLP library Hannanum and built a library that runs NLP and text processing in parallel.
Here is what the Hive query looks like:

>select extractNoun(text) from twitter;
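
Before a query like this can run, the UDF has to be registered in the Hive session. Here is a minimal sketch of that step. The jar path and the Java class name are hypothetical placeholders, not the actual names from my repository, so substitute the ones from your own build:

```sql
-- Hypothetical jar path: point this at the jar that bundles the UDF and Hannanum.
ADD JAR /path/to/korean-nlp-udf.jar;

-- Hypothetical class name: use the fully qualified name of the UDF class in the jar.
CREATE TEMPORARY FUNCTION extractNoun AS 'com.example.hive.ExtractNounUDF';

-- The function is now available for the rest of the session.
SELECT extractNoun(text) FROM twitter;
```

A temporary function only lives for the current session, so these statements need to be run again (or placed in an init script) each time a new session starts.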

I put the source code on GitHub.
Feel free to use it and ask questions.

Here is my dumped K-Means cluster output (k=100 topics, one day of tweet data):
Output.txt
