Semi-supervised Naive Bayes in Python
The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning for estimating the parameters of a model that depends on latent variables. (A latent variable is a variable that is not observed in the dataset and thus cannot be counted directly. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use the current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm provably converges to a local maximum of the likelihood that the model generated the data.
Naive Bayes trained with EM
In their paper “Semi-supervised Text Classification Using EM”, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a semi-supervised fashion, that is, with both labeled and unlabeled data. The algorithm is very intuitive:
- Train a classifier with your labeled data
- While the model likelihood increases:
  - E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought of as probabilistic/fractional labels.
  - M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.
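The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from the repository: the function names (`fit_nb`, `predict_proba`, `em_semi_supervised`) are hypothetical, and it runs a fixed number of iterations instead of monitoring the model likelihood.

```python
import numpy as np

def fit_nb(X, P, alpha=1.0):
    """M-step: estimate log-priors and log word probabilities from
    word counts X (n_docs x n_words) and class responsibilities
    P (n_docs x n_classes), with Laplace smoothing alpha."""
    class_counts = P.sum(axis=0)                       # soft class counts
    log_prior = np.log(class_counts / class_counts.sum())
    word_counts = P.T @ X + alpha                      # smoothed class-word counts
    log_pwc = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_pwc

def predict_proba(X, log_prior, log_pwc):
    """E-step: compute P(c|x) in the log domain, then a stable softmax."""
    a = X @ log_pwc.T + log_prior                      # log P(x|c) + log P(c)
    a = a - a.max(axis=1, keepdims=True)               # subtract the max for stability
    p = np.exp(a)
    return p / p.sum(axis=1, keepdims=True)

def em_semi_supervised(X_l, y_l, X_u, n_classes, n_iter=20):
    """EM for semi-supervised Naive Bayes: init on labeled data,
    then alternate E- and M-steps over labeled + unlabeled data."""
    P_l = np.eye(n_classes)[y_l]                       # hard labels as one-hot
    log_prior, log_pwc = fit_nb(X_l, P_l)              # train on labeled data only
    for _ in range(n_iter):
        P_u = predict_proba(X_u, log_prior, log_pwc)   # E-step: fractional labels
        X = np.vstack([X_l, X_u])
        P = np.vstack([P_l, P_u])
        log_prior, log_pwc = fit_nb(X, P)              # M-step: retrain on the union
    return log_prior, log_pwc
```

On a toy corpus where class 0 documents mostly contain word 0 and class 1 documents mostly contain word 1, a single labeled example per class plus a couple of unlabeled ones is enough for the EM-trained model to separate the classes.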
The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.
I made a simple implementation of it in Python + Numpy. The code is fairly optimized.
$ git clone http://www.mblondel.org/code/seminb.git
Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.
Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:

$$P(x|c) = \prod_{i=1}^{|x|} P(w_i|c)$$
However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly underflow the precision of the computer and become zero. The solution is to perform the computations in the log domain:

$$\log P(x|c) = \sum_{i=1}^{|x|} \log P(w_i|c)$$
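A quick illustration of why the log domain is necessary, with made-up word counts and probabilities (the values are purely illustrative):

```python
import numpy as np

# Word counts of one document over a 4-word vocabulary, and P(w|c)
# for one class c -- illustrative values only.
x = np.array([3, 0, 1, 2])
P_wc = np.array([0.5, 0.2, 0.2, 0.1])

# log P(x|c) = sum_w x_w * log P(w|c): a sum of logs instead of a
# product of small numbers.
log_p = np.sum(x * np.log(P_wc))

# The direct product underflows to exactly zero for long documents,
# while the log-domain sum stays finite.
tiny = np.full(200, 1e-3)
assert np.prod(tiny) == 0.0
assert np.isfinite(np.sum(np.log(tiny)))
```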
To obtain P(c|x) from P(x|c), we use Bayes’ rule:

$$P(c|x) = \frac{P(x|c)\,P(c)}{\sum_{c'} P(x|c')\,P(c')}$$
By setting $a_c = \log P(x|c) + \log P(c)$, we get:

$$P(c|x) = \frac{e^{a_c}}{\sum_{c'} e^{a_{c'}}}$$
This is the softmax function. However, we are back to our initial problem: since the $a_c$ are log-probabilities, they are likely to tend to $-\infty$, so the exponentials are likely to underflow in turn. The trick is to multiply the numerator and denominator by the same constant $e^{-m}$:

$$P(c|x) = \frac{e^{a_c - m}}{\sum_{c'} e^{a_{c'} - m}}$$
Setting $m = \max_{c'} a_{c'}$, the values $a_c - m$ are at most zero, i.e., closer to zero. The rationale for this is that the computation of the exponential overflows or underflows earlier and is less precise for big values (positive or negative) than for small values.
In case this is not enough, we can further define a truncated exponential, for some negative threshold $t$:

$$\widetilde{\exp}(v) = \begin{cases} e^{v} & \text{if } v \ge t \\ 0 & \text{if } v < t \end{cases}$$
This truncates the exponentials to zero when $a_c - m < t$. For $t = -10$, this corresponds to $e^{-10} \approx 0.000045$. Equivalently, we can see it as setting the exponentials to zero when $a_c < m + t$. Since both $t$ and $m$ are negative, this shows that subtracting the maximum $m$, as explained before, does help improve the precision.
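Putting the two tricks together, here is a sketch of a numerically stable softmax over the log-scores $a_c$. The function name is hypothetical; the max-subtraction and the truncation threshold follow the derivation above.

```python
import numpy as np

def stable_softmax(a, t=-10.0):
    """Softmax over log-scores a, using max-subtraction (m = max_c a_c)
    and truncating exponentials whose exponent falls below t."""
    a = a - a.max()                          # exponents are now <= 0
    e = np.where(a < t, 0.0, np.exp(a))      # truncated exponential
    return e / e.sum()
```

With scores like `[-1000, -1001]`, a naive `np.exp(a) / np.exp(a).sum()` yields 0/0, while the stable version returns sensible probabilities.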
Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Schölkopf, B. (Eds.), Semi-Supervised Learning. MIT Press: Cambridge, MA. 2006.