
Naive Bayes

Naive Bayes is a simple, well-known, and effective classification algorithm for discrete features. Although it is rarely the best-performing model, it often achieves reasonable performance with very little effort, which makes it a good "first thing to try".

The Naive Bayes Assumption

As a generative learning algorithm, Naive Bayes performs classification by modelling \(p(x|y)\) and \(p(y)\). The prediction is then made using Bayes' rule: \[ p(y|x) = \frac{p(x|y)p(y)}{p(x)}~. \] While \(p(y)\) can easily be modelled by counting how many training samples belong to each class, the main difficulty lies in \(p(x|y)\). To model this conditional probability, Naive Bayes makes a relatively strong assumption, known (unsurprisingly) as the Naive Bayes assumption: the features of \(x\) are conditionally independent given \(y\). Under this assumption, we can write \(p(x|y)\) as: \[ p(x|y)=p(x_1,x_2,\dots,x_n|y)=\prod_{j=1}^n p(x_j|y)~. \] Therefore, the likelihood of a training sample \((x_i,y_i)\) is given by: \[ p(x_i,y_i)=p(y_i)p(x_i|y_i)=p(y_i)\prod_{j=1}^n p(x_{i}^j|y_i)~. \]
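
To make this concrete, here is a minimal NumPy sketch (not Pudding's API; the priors, conditionals, and feature vector are made up) that applies Bayes' rule to a single sample with binary features, multiplying the per-feature conditionals exactly as the factorization above suggests:

import numpy as np

# Assumed toy numbers: two classes, three binary features
prior = np.array([0.6, 0.4])                       # p(y)
cond = np.array([[0.2, 0.7, 0.1],                  # p(x_j = 1 | y = 0)
                 [0.5, 0.3, 0.9]])                 # p(x_j = 1 | y = 1)
x = np.array([1, 0, 1])                            # one feature vector

# p(x|y) = prod_j p(x_j|y); computed in log space to avoid underflow
log_likelihood = (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1)
log_joint = np.log(prior) + log_likelihood         # log p(x, y) for each class
posterior = np.exp(log_joint - np.logaddexp.reduce(log_joint))  # Bayes' rule
print(posterior, posterior.sum())                  # the posterior sums to 1

Note that the denominator \(p(x)\) only acts as a normalizer, so for prediction it is enough to pick the class with the largest joint \(p(x|y)p(y)\).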

The Event Model

Even though we have rewritten the likelihood as a product of per-feature terms, we still don't know how to compute \(p(x_{i}^j|y_i)\). Besides the Naive Bayes assumption, one more piece is needed to make the algorithm work: the underlying event model. You can think of the event model as an assumption about how the data is generated. To explain the different event models, we will use the text classification task as a running example.

The goal of the text classification task is to predict the label of a text (e.g. spam or non-spam) given its content. The event model specifies how a spam (or non-spam) email is generated. There are many available event models; Pudding currently only supports the multinomial event model. Under the multinomial event model, to generate an email we first determine whether it is spam or not. After that, each word occurring in the email is sampled sequentially but independently (the Naive Bayes assumption). It is easy to see that this amounts to assuming a multinomial distribution over the vocabulary for each label. Under this assumption, we have: \[p(x_{i}^j|y_i)={\phi^j_{y_i}}^{x_{i}^j}~,\] where \(x_{i}^j\) is the number of occurrences of the j'th word of the vocabulary and \(\phi^j_{y_i}\) is the corresponding parameter of the multinomial model for label \(y_i\). Another popular event model is the Bernoulli event model. Under the Bernoulli event model, each email is assumed to be generated by first determining whether it is spam or not, and then deciding independently, for each word in the vocabulary, whether it appears in the email according to a Bernoulli distribution. Under this assumption, we have: \[p(x_{i}^j|y_i)={\phi^j_{y_i}}^{1\{x_{i}^j\ge 1\}}\left(1-\phi^j_{y_i}\right)^{1-1\{x_{i}^j\ge 1\}}~,\] where \(\phi^j_{y_i}\) is the probability that the j'th word appears in an email of label \(y_i\), and \(1\{\cdot\}=1\) if and only if the condition \(\cdot\) is satisfied.
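
To see how the two event models differ in practice, here is a small sketch that evaluates \(\log p(x|y)\) for a single document under both models, assuming the per-class word parameters \(\phi\) are already estimated; the variable names are illustrative and not part of Pudding's API:

import numpy as np

phi = np.array([0.5, 0.3, 0.2])      # assumed phi^j_y for one class y (3-word vocabulary)
x = np.array([2, 0, 1])              # word counts of one document

# Multinomial event model: log p(x|y) = sum_j x_j * log(phi^j_y)
# (up to the multinomial coefficient, which is the same for every class)
log_px_multinomial = (x * np.log(phi)).sum()

# Bernoulli event model: only presence/absence matters, one coin flip per vocabulary word
present = (x >= 1).astype(float)
log_px_bernoulli = (present * np.log(phi) + (1 - present) * np.log(1 - phi)).sum()

print(log_px_multinomial, log_px_bernoulli)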

Using the multinomial event model, we can estimate the parameters by maximum likelihood estimation, which gives: \begin{align} &\phi_{c}^k=\frac{\sum_{i=1}^m \left (x_{i}^k \cdot 1\{y_i=c\}\right )}{\sum_{i=1}^m\left(\sum_{j=1}^{n}x_{i}^j \cdot 1\{y_i=c\}\right)}\\
&p(y=c)=\frac{\sum_{i=1}^m 1\{y_i=c\}}{m}~. \end{align}
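
As an illustration, the maximum likelihood estimates above can be computed with a few lines of NumPy. The counts below are made up and the code is only a sketch, not Pudding's internal implementation:

import numpy as np

X = np.array([[2, 0, 1],
              [0, 1, 3],
              [1, 1, 0]])            # m = 3 documents, n = 3 vocabulary words (word counts)
y = np.array([0, 1, 0])              # labels

n_classes = 2
phi = np.zeros((n_classes, X.shape[1]))
prior = np.zeros(n_classes)
for c in range(n_classes):
    Xc = X[y == c]                          # documents with label c
    phi[c] = Xc.sum(axis=0) / Xc.sum()      # phi_c^k: share of word k among all words in class c
    prior[c] = (y == c).mean()              # p(y = c)

print(phi)       # each row sums to 1
print(prior)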

Laplace Smoothing

Are we done? Not yet. Consider a word that never occurs in the training data: the estimate of the parameters corresponding to that word would be zero. If we then encounter a text containing this word during inference, the (unnormalized) probability \(p(y|x)\) would be zero for every class, and we would not be able to make any prediction. To avoid this, we can apply Laplace smoothing, which modifies the above estimate into: \begin{align} &\phi_{c}^k=\frac{\alpha+\sum_{i=1}^m \left (x_{i}^k \cdot 1\{y_i=c\}\right )}{\alpha\times n+\sum_{i=1}^m\left(\sum_{j=1}^{n}x_{i}^j \cdot 1\{y_i=c\}\right)}~, \end{align} where \(\alpha\) is the smoothing factor and \(n\) is the size of the vocabulary. It is easy to verify that the resulting estimate is still a valid probability distribution. It is also worth noting that we usually don't apply Laplace smoothing to the class prior \(p(y)\), since most machine learning setups make the closed-world assumption (no new classes appear during testing).
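
A sketch of the smoothed estimate, reusing the toy counts from the previous snippet (again illustrative, not Pudding's code): adding \(\alpha\) to every word count and \(\alpha\times n\) to the normalizer guarantees that no word ever gets probability zero.

import numpy as np

alpha = 1.0                          # smoothing factor
X = np.array([[2, 0, 1],
              [0, 1, 3],
              [1, 1, 0]])            # word counts
y = np.array([0, 1, 0])
n = X.shape[1]                       # vocabulary size

phi = np.zeros((2, n))
for c in range(2):
    Xc = X[y == c]
    phi[c] = (alpha + Xc.sum(axis=0)) / (alpha * n + Xc.sum())

print(phi)       # rows still sum to 1, and no entry is exactly zero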

One Application: Text Classification

Naive Bayes is commonly used for text classification tasks. Here we provide a simple example of using Naive Bayes with the multinomial event model to classify the 20 newsgroups dataset. You can find the code in examples/classification/naive_bayes/multinomial_text_classification.py.

'''
This example demonstrates how to use Naive Bayes with the multinomial event model to perform text classification on the 20 newsgroups dataset provided by scikit-learn.
'''

from sklearn.naive_bayes import MultinomialNB
from pudding.classification import NaiveBayesMultinomial

from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

from time import time

# Prepare the data
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
vectorizer = TfidfVectorizer()
label_encoder = LabelEncoder()
X_train = vectorizer.fit_transform(newsgroups_train.data)
y_train = label_encoder.fit_transform(newsgroups_train.target)
X_train = X_train.todense()

newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
X_test = vectorizer.transform(newsgroups_test.data)
y_test = label_encoder.transform(newsgroups_test.target)
X_test = X_test.todense()

# Fit the model
print('Fitting and predicting using scikit-learn...')
t0 = time()
sklearn_model = MultinomialNB(alpha=0.01)
sklearn_model.fit(X_train, y_train)
sklearn_pred = sklearn_model.predict(X_test)
print(f'Done in {time() - t0:0.3f}s.')
sklearn_f1 = metrics.f1_score(y_test, sklearn_pred, average='macro')

print('Fitting and predicting using Pudding...')
t1 = time()
pudding_model = NaiveBayesMultinomial(n_classes=len(label_encoder.classes_), alpha=0.01)
pudding_model.fit(X_train, y_train)
pudding_pred = pudding_model.predict(X_test)
print(f'Done in {time() - t1:0.3f}s.')
pudding_f1 = metrics.f1_score(y_test, pudding_pred, average='macro')

print(f"Scikit-learn's F1 score: {sklearn_f1:0.3f}")
print(f"Pudding's F1 score: {pudding_f1:0.3f}")