Finding a term's related terms in real time is not easy when you have 2 million tweets and 10 million words. Although Hadoop can process huge amounts of data in parallel, it is not real-time. What if we want to see the top related words for the term "축구" (soccer) in real time?

Top related words can be found by calculating the TF (term frequency) value for each word in a certain set of documents. Then we calculate IDF (inverse document frequency) values by looking at all documents. IDF helps to exclude meaningless words (이, 다, 나, 너, 니, 하고, 보다 and so on). Then we calculate TF-IDF (TF multiplied by IDF) for each word and sort the words by those values. Check out TF-IDF on Wikipedia: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

The process is quite complicated, but I will try to explain it as simply as possible. Here are the steps that I went through:

1. Extracting tweetId and the term from Hive into a text file. By using Hive I split each tweet into words and separated eac...