Classifying documents using Apache Mahout

Written by Jenaiz on 21 Feb 2015

I was wondering how to do some text classification with Java and Apache Mahout. Isabel Drost-Fromm gave a talk at the Lucene/Solr Revolution conference (Dublin, 2013) on exactly this topic: how Apache Mahout and Lucene can help you.

It is a good introduction to the topic, and I really enjoyed what was presented in the talk.

Lucene, Mahout and Hadoop (only a little bit) sound like a really great combination for a talk about text classification.

The complete process for classifying documents follows the steps below. Each line shows the input and the tool that processes it, and the output of one step is the input of the next:

HTML >> Apache Tika

Fulltext >> Lucene Analyzer

Tokenstream >> FeatureVectorEncoder

Vector >> Online Learner
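
To make the four steps more concrete, here is a minimal sketch in plain Java. It does not use Tika, Lucene or Mahout at all: every method is a simplified stand-in that I made up for illustration (a regex instead of Tika, a naive tokenizer instead of a Lucene Analyzer, the hashing trick instead of Mahout's FeatureVectorEncoder, and a tiny logistic regression trained one example at a time instead of Mahout's online learner). The class name, the toy documents and the parameters are all assumptions, so take it as a way to see the data flow and not as the code from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TextPipelineSketch {
    // Size of the hashed feature space (assumed value, large to make collisions unlikely).
    static final int DIM = 1 << 16;

    // Step 1, stand-in for Apache Tika: crudely strip tags to get the full text.
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Step 2, stand-in for a Lucene Analyzer: lowercase and split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 3, stand-in for a feature vector encoder: hash each token into a fixed-size vector.
    static double[] encode(List<String> tokens) {
        double[] vector = new double[DIM];
        for (String t : tokens) {
            vector[Math.floorMod(t.hashCode(), DIM)] += 1.0;
        }
        return vector;
    }

    // Steps 1 to 3 chained together.
    static double[] vectorize(String html) {
        return encode(tokenize(extractText(html)));
    }

    // Step 4, stand-in for an online learner: binary logistic regression
    // updated with one stochastic gradient step per training example.
    static class OnlineLogistic {
        final double[] weights = new double[DIM];
        final double learningRate;

        OnlineLogistic(double learningRate) {
            this.learningRate = learningRate;
        }

        // Estimated probability that the vector belongs to class 1.
        double score(double[] x) {
            double z = 0.0;
            for (int i = 0; i < DIM; i++) {
                z += weights[i] * x[i];
            }
            return 1.0 / (1.0 + Math.exp(-z));
        }

        // Sees a single labeled example (label is 0 or 1) and adjusts the weights.
        void train(int label, double[] x) {
            double error = label - score(x);
            for (int i = 0; i < DIM; i++) {
                weights[i] += learningRate * error * x[i];
            }
        }
    }

    // Trains on four toy documents: label 1 = sports, label 0 = politics.
    static OnlineLogistic trainDemoModel() {
        String[] docs = {
            "<p>The team won the football match</p>",
            "<p>Great goal in the football game</p>",
            "<p>The election results and the new parliament</p>",
            "<p>Parliament votes on the election law</p>"
        };
        int[] labels = {1, 1, 0, 0};
        OnlineLogistic model = new OnlineLogistic(0.1);
        for (int epoch = 0; epoch < 50; epoch++) {
            for (int i = 0; i < docs.length; i++) {
                model.train(labels[i], vectorize(docs[i]));
            }
        }
        return model;
    }

    public static void main(String[] args) {
        OnlineLogistic model = trainDemoModel();
        System.out.println("sports-like:   " + model.score(vectorize("<b>football match</b>")));
        System.out.println("politics-like: " + model.score(vectorize("<b>election parliament</b>")));
    }
}
```

The point of the "online" part is that train() only ever looks at one document at a time, so the model never needs the whole corpus in memory. With the real libraries each stand-in would be replaced by the corresponding tool from the list above.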

Of course, Isabel advised reusing the libraries you already have at hand, taking a look inside the algorithms they use and improving them if you need to. As a first approach, it is really good for me to see how things work.

Mahout is a really good library for machine learning. It used MapReduce to integrate nicely with Hadoop (v1.0), although in April 2014 the project decided to move forward:

"The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce." (You can read that on their website.)

At the end of the video there is an invitation for everyone to participate in the project: fixing bugs, writing documentation, reporting bugs... There is always a lot to do in open source projects. If you are using these libraries and are interested in the project, I recommend subscribing to the mailing lists.

I really recommend watching the video if you are interested in the field; I think she gave a good talk about a good topic. You can take a look at the slides too.

Tags: Machine Learning Apache Lucene Apache Mahout Text Classification



