Semi-supervised Naive Bayes in Python


The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can’t directly count. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use our current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.

Naive Bayes trained with EM

In their paper “Semi-supervised Text Classification Using EM”, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a semi-supervised fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:

  • Train a classifier with your labeled data
  • While the model likelihood increases:
    • E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought as probabilistic/fractional labels.
    • M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.

The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.


I made a simple implementation of it in Python + Numpy. The code is fairly optimized.

$ git clone

web interface

Implementation details

Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.

Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:

P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}

However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly exceed the precision of the computer and become zero. The solution is to perform the computations in the log domain:

\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)

To turn around P(x|c), we use Bayes’rule:

P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}

By posing z_j = \log P(c_j) + \log P(x_i|c_j), we get:

P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}

This is the softmax function. However, we are back to our initial problem because, since the z_j are likely to tend to -inf, the exponentials are likely to in turn underflow. The trick is to multiply the numerator and denominator by the same constant e^{-m}:

P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}

Setting m to max_j~z_j, the z_j-m values will get closer to zero. The rationale for this is that computation of the exponential overflows earlier and is less precise for big values (positive or negative) than for small values.

In case this is not enough, we can further define:

P(y_i=c_j|x_i) = \begin{cases} 0, & \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m > t\}} e^{z_k-m}},  & \mbox{otherwise} \end{cases}

This truncates the exponentials to zero when e^{z_j-m} \le e^t. For t=-10, this corresponds to 0.000045. Equivalently we can see it as setting the exponentials to zero when e^{z_j} \le e^{t+m}. Since both t and and m are negative, this shows that subtracting the maximum m, as explained before, does help improving the precision.


Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.

19 Responses to “Semi-supervised Naive Bayes in Python”

  1. muriloq Says:

    Hi. I was reading Nigam’s paper and while looking for references I found your blog. Thanks a lot for your code!

    Do you have a complete usage example or at least code for preprocessing the texts and generating the input matrix ? Thanks a lot in advance!

  2. muriloq Says:

    Hello again.

    I’ve written code to generate the input data.

    The standard Naïve Bayes training (using the method SemiNB.train) works fine, but when I try to use unlabeled data the assertion assert(lik_diff >= -1e-10) in train_semi inner loop fails…

    Any clues about how to fix that ?

  3. Mathieu Says:

    Hi. Thanks for giving the code a try. It’s hard to tell without seeing your code but the assertion always passes for me… What you can do is comment the assertion and set debug to true so that you see the likelihood difference at every iteration.

    I’ll upload a code sample later when I have time. For the preprocessing code, you can use;a=tree

  4. muriloq Says:

    Hello again!

    The first value of loglikelihood is already -inf, so at every iteration the difference is nan…

    Maybe the problem is in my TFIDF code (I’m using the values produced by WEKA). I’ll be very glad if you upload the code sample when you have the time to do so. Thanks a lot in advance!

  5. rick wei Says:

    Hi, thanks for your code. I would like to apply your method and refer your paper to handle a texts classification issue. I don’t familiar to python. There is no any output when I executed the May you provide an example of input data? Your help will be appreciated. : )

  6. rick wei Says:

    Hi again~

    I found the way to use training and test data. It can work now. Thank you.

  7. hatice Says:


    Is it possible to to provide an example of input data? Thank you for your assistance in advance.

  8. rick wei Says:


    I have a question about the train_semi. I used the train_semi to train a classifier, but the accuracy is decrease after each iteration.
    Should I use all the unlabeled data in the first iteration?
    or add a part of unlabeled data increasely?
    (I have a large unlabeled data set and a small labeled data set.)

    Thank you.


    To hatice,

    The input data is shown as below. I hope it may help you.

    td,tdu,test data (Each column indicates a document, each row indicates a feature)
    0 0 0 1 0
    1 0 1 1 1
    0 1 0 0 1

    delta(Each column indicates a class, each row indicates a document)
    0 1
    1 0
    1 0
    0 1
    0 1

    the row of “delta” must be the same to the column of “td”.

    If you have any question, you may e-mail me. We can discussed together.

  9. rick wei Says:


    The function-train_semi of confused me.

    This is a part of code.
    # E-step: find the probabilistic labels for unlabeled data
    delta_u = self.predict_proba_all(tdu)

    # M-step: train classifier with the union of
    # labeled and unlabeled data
    self.train(tdu, delta_u, normalize=False, sparse=False)
    The M-step trains the classifier with the union of labeled and unlabeled data. But the input of self.train is just “tdu”(unlabeled).
    Is there anything I miss or ?

  10. Mathieu Says:

    In the E-step, I use the current model to label unlabeled data, i.e., compute the fractional labels for the unlabeled data.
    In the M-step, I compute the counts for unlabeled data, I add the counts for labeled data (those don’t change so they are computed outside of the loop), and finally I normalize to get probabilities.

    You can find a more recent version of my code here:

    You can find unit tests here:

    (Sorry I don’t have time to provide support about these programs)

  11. rick wei Says:

    Very appreciate for your response. It’s very helpful. Thank you.

  12. Rick wei Says:

    Thanks for your program share.

    I met an error where the number of row of numpy.array more than 1600.
    “”" ValueError: setting an array element with a sequence. “”"

    Does the numpy limit the size of array? If the anwer is yes, do you know how to expend the array size?

  13. Mathieu Says:

    I think it’s a bug in your program. It means that you’re doing something like this: arr[i, j] = [1, 2, 3, ...]. You cannot set the element of an array with a sequence.

  14. Quora Says:

    How big the training set should be in the Naive Bayes text classification?…

    * First make  sure that data is balanced. Since simple naive Bayesian algorithm  won’t work for unbalanced dataset. If dataset is unbalanced, then I suggest you to try out complement Bayesian algorithm. Weka( and m…

  15. Abhaya Says:

    Hi Mathieu,

    I am trying to use the Kamal Nigam’s algorithm for the Multi-Variate Burnoulli NB. I am running into the strange problem of likelihood of data decreasing with every EM step! I believe that I might have made a mistake in adopting the algorithm for the Burnoulli case. Do you have any thoughts on how that can be done correctly?


  16. Mathieu Says:

    Just quickly looking at the equations in this post, there doesn’t seem to be any reason that it shouldn’t work. So I guess it’s probably a bug in your implementation. Try to keep in mind the big picture: first train your classifier on the labeled data, then use your current classifier to find probabilistic labels of the unlabeled data, retrain with all the data (labeled and probabilistically labeled) and repeat. Also keep your implementation simple until you get it right and optimize performance afterwards.

  17. hoh Says:

    Does weka support Semi-Supervised Learning?

  18. Purvi Says:

    thanks for your code..
    Can anyone give complete working code with input matrix ? Thanks in advance..

  19. Purvi Says:

    Hello rick wei,

    As you suggest, I have provided input data for td and delta in train function, but I am getting error of objects are not aligned. Here is output of that

    File “/var/test/”, line 156, in train
    self.p_w_c[w,c] += td[w,d] * delta[d,c]
    File “/usr/lib/python2.6/dist-packages/numpy/core/”, line 290, in __mul__
    return, asmatrix(other))
    ValueError: objects are not aligned

    I am running it on ubuntu linux system.

    Please advise what should i do to successfully run train and other functions.