These days I am building a small platform for topic modeling with Apache Mahout.
While working on this project, I tried to make the NLP processing as fast as possible. Previously, I ran the NLP on a single node after retrieving the tweet messages from Hive. Then I found out about UDFs (User-Defined Functions), which make it possible to run custom libraries on the Hadoop nodes during the map phase of MapReduce.
Using a UDF, I plugged in the open-source Korean NLP library Hannanum and built a library that runs the NLP and text processing in parallel.
Here is what the Hive query looks like:
>select extractNoun(text) from twitter;
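Before that query can run, the jar containing the UDF has to be shipped to the cluster and the function registered. It goes roughly like this (the jar path and class name here are placeholders, not the actual ones from my repo):
>add jar hannanum-udf.jar;
>create temporary function extractNoun as 'com.example.ExtractNoun';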
I put the source code on GitHub.
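The real implementation is in the repo; the sketch below only shows the general shape of such a UDF. The class name matches the query above, but the noun extraction itself is a stand-in: the actual code hands the text to Hannanum's noun-extraction workflow instead of splitting on whitespace.

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Sketch of a Hive UDF for Korean noun extraction.
public class ExtractNoun extends UDF {

    // Hive calls evaluate() once per row during the map phase,
    // so every Hadoop node runs the extraction on its own split.
    public Text evaluate(Text tweet) {
        if (tweet == null) {
            return null;
        }
        StringBuilder nouns = new StringBuilder();
        for (String noun : extractNouns(tweet.toString())) {
            if (nouns.length() > 0) {
                nouns.append(' ');
            }
            nouns.append(noun);
        }
        return new Text(nouns.toString());
    }

    // Placeholder: the real UDF feeds the text to Hannanum's
    // noun-extraction workflow (initialized once per JVM, since
    // setting it up is expensive). Splitting on whitespace just
    // keeps this sketch self-contained and runnable.
    private List<String> extractNouns(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }
}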
Feel free to use it and to ask questions.
Here is my dumped K-Means cluster output (clustering k=100 topics from one day of tweet data):
Output.txt
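For reference, a dump like this can be produced with Mahout's clusterdump utility; the paths and flag values below are placeholders for my own:
>mahout clusterdump -i tweet-kmeans/clusters-final -d tweet-vectors/dictionary.file-0 -dt sequencefile -n 20 -o output.txt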