Naive Bayesian Classification of Text Categories in Scikit-Learn
I've been playing around a bit with text classifiers for work. In particular, we have a relatively large (in transactional terms, not in data science terms) data set of relationships used as metadata on objects that we report on. Simple recommendation engines are easy to build, since we're usually just counting word frequency, and you can always play around with a few tweaks to get better accuracy.
Building one from scratch is pretty simple, but you could also use a pre-constructed one from any number of libraries. It's good to know how they work, but once you've built one from scratch, there's no need to reinvent the wheel. If you want to learn how to build one from scratch, I recommend Hilary Mason's O'Reilly tutorial.
I'm a big fan of Scikit-Learn because it's built on top of a foundation (e.g., numpy, scipy) that you're probably already using, and that foundation is rock solid. Scikit-Learn offers three naive Bayesian classifiers: Gaussian, Multinomial, and Bernoulli, and they all can be implemented in very few lines of code.
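All three share the same fit() and predict() interface, so swapping one for another is a one-line change. Here's a minimal sketch with made-up toy data, purely to show the shared API (the numbers mean nothing):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Hypothetical feature matrix and labels, just to illustrate the interface.
X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4]])
y = ['a', 'b', 'a']

for clf in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    print(type(clf).__name__, clf.fit(X, y).predict([[0, 2, 1]]))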
Let's take a look at the Gaussian.
I'm going to assume that you already have your data set loaded into a Pandas data frame. Start by importing your libraries.
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
This gives you the GaussianNB() classifier and the term frequency/inverse document frequency (TF-IDF) vectorizer needed to create feature vectors from the data.
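If you haven't used a TF-IDF vectorizer before, it learns a vocabulary from the documents you give it and turns each document into a row of term weights. A quick sketch on an invented three-document corpus:

corpus = ['the cat sat', 'the dog sat', 'the dog barked']
v = TfidfVectorizer()
m = v.fit_transform(corpus)
print(v.get_feature_names_out())  # the learned vocabulary: 5 unique terms
print(m.shape)                    # (3 documents, 5 terms)

Note that get_feature_names_out() is the current name in scikit-learn; older versions called it get_feature_names().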
You then need to get your data ready. Again, I'm going to assume you already have some data prepared, but as an example, here I've pulled out some relationships from a Pandas data frame that I've been working with:
r = relationships[(relationships['Attribute'] == 'Category') & (relationships['Title'].notnull())]
x = r.loc[:, 'Title']      # features: the titles
y = r.loc[:, 'Attribute']  # labels: what we want to predict
We now need to vectorize this data:
v = TfidfVectorizer(use_idf=False)
x = v.fit_transform(x.astype('U')).toarray()
Note that we are using the TfidfVectorizer to vectorize the data, but we do not want inverse document frequency to be used for this example, which is why use_idf is set to False. In the second line, we convert the Pandas selection to Unicode prior to the fit_transform(), since the column may contain non-string values. The toarray() on the result then creates a dense array that the Gaussian fit() method (see below) can accept.
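The dense conversion matters: fit_transform() returns a sparse matrix, and GaussianNB only accepts dense arrays, which can eat a lot of memory on a large vocabulary. The other two classifiers accept the sparse matrix directly, so if memory becomes a problem, something like this (reusing r and y from above) avoids the conversion:

from sklearn.naive_bayes import MultinomialNB

# No .toarray() needed -- MultinomialNB works on the sparse matrix as-is.
sparse_x = v.fit_transform(r.loc[:, 'Title'].astype('U'))
m = MultinomialNB().fit(sparse_x, y)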
g = GaussianNB()
g = g.fit(x, y)
Very simply, we create the Gaussian naive Bayesian classifier and then call fit(), passing in the features and labels.
From this, we can then test the classifier.
test = v.transform(['Which one of the following foods has the highest bioavailability of vitamin A?']).toarray()
prediction = g.predict(test)
print(prediction)
This will result in an output of ['Nutrition'], which is the correct attribute label for the feature we entered.
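If you want to see how confident the classifier is rather than just the winning label, the naive Bayesian classifiers also expose predict_proba(), which returns one probability per class:

# Per-class probabilities for the same test vector; columns line up with g.classes_.
probabilities = g.predict_proba(test)
for label, p in zip(g.classes_, probabilities[0]):
    print(label, round(p, 3))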
If you want to experiment a little, take a look at Scikit-Learn's naive Bayesian documentation and try some of the other algorithms. Which one works best for your situation? You could also play around with the parameters of the TfidfVectorizer() class to see what yields the best results.
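As a starting point for that kind of experimentation, here's a rough sketch that holds out a quarter of the data and compares all three classifiers across a few vectorizer settings (the parameter values are just examples, not recommendations, and r and y are the selections from earlier):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

for params in ({'use_idf': False}, {'use_idf': True}, {'ngram_range': (1, 2)}):
    v = TfidfVectorizer(**params)
    x = v.fit_transform(r.loc[:, 'Title'].astype('U')).toarray()
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
    for clf in (GaussianNB(), MultinomialNB(), BernoulliNB()):
        print(params, type(clf).__name__, clf.fit(x_train, y_train).score(x_test, y_test))

Strictly speaking, you'd want to fit the vectorizer on the training split only (a Pipeline makes that easy), but this is close enough for a quick comparison.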