CDH4 comes with Mahout library by default. You don't need to install mahout unless if you want to upgrade to the latest version. Mahout is a scalable machine learning library which supports many algorithms for clustering, classification, topic modeling, prediction and recommendation systems. It can have terabytes of input data and process the clustering or classification in less than an hour depending on how powerful is your hadoop cluster. In my case I am using 4 powerful PCs with virtual nodes. I had a task to model topics from twitter data for 2 months. I tried to run LDA(Latent Dirichlet Allocation - topic modeling algorithm) with R on one PC and it took several hours to build a document matrix of one day tweets(around 800,000) . And for improving results the number of topics K should be bigger than 500 thus exponentially increasing the LDA processing time. It is how I turned into Mahout. Except Mahout there are other parallel machine learn...
experiences with BIG data collection, analysis and visualization