Semi-supervised Naive Bayes in Python
Expectation-Maximization
The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can’t directly count. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use our current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.
Naive Bayes trained with EM
In their paper “Semi-supervised Text Classification Using EM”, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a semi-supervised fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:
- Train a classifier with your labeled data
- While the model likelihood increases:
- E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought as probabilistic/fractional labels.
- M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.
The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.
Code
I made a simple implementation of it in Python + Numpy. The code is fairly optimized.
$ git clone http://www.mblondel.org/code/seminb.git
Implementation details
Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.
Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:
However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly exceed the precision of the computer and become zero. The solution is to perform the computations in the log domain:
To turn around P(x|c), we use Bayes’rule:
By posing , we get:
This is the softmax function. However, we are back to our initial problem because, since the are likely to tend to -inf, the exponentials are likely to in turn underflow. The trick is to multiply the numerator and denominator by the same constant
:
Setting m to , the
values will get closer to zero. The rationale for this is that computation of the exponential overflows earlier and is less precise for big values (positive or negative) than for small values.
In case this is not enough, we can further define:
This truncates the exponentials to zero when . For t=-10, this corresponds to 0.000045. Equivalently we can see it as setting the exponentials to zero when
. Since both t and and m are negative, this shows that subtracting the maximum m, as explained before, does help improving the precision.
Reference
Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.
September 2nd, 2010 at 8:00 pm
Hi. I was reading Nigam’s paper and while looking for references I found your blog. Thanks a lot for your code!
Do you have a complete usage example or at least code for preprocessing the texts and generating the input matrix ? Thanks a lot in advance!
September 2nd, 2010 at 11:55 pm
Hello again.
I’ve written code to generate the input data.
The standard Naïve Bayes training (using the method SemiNB.train) works fine, but when I try to use unlabeled data the assertion assert(lik_diff >= -1e-10) in train_semi inner loop fails…
Any clues about how to fix that ?
September 4th, 2010 at 5:39 pm
Hi. Thanks for giving the code a try. It’s hard to tell without seeing your code but the assertion always passes for me… What you can do is comment the assertion and set debug to true so that you see the likelihood difference at every iteration.
I’ll upload a code sample later when I have time. For the preprocessing code, you can use http://www.mblondel.org/gitweb?p=tfidf.git;a=tree
September 9th, 2010 at 3:41 pm
Hello again!
The first value of loglikelihood is already -inf, so at every iteration the difference is nan…
Maybe the problem is in my TFIDF code (I’m using the values produced by WEKA). I’ll be very glad if you upload the code sample when you have the time to do so. Thanks a lot in advance!
March 28th, 2011 at 12:04 am
Hi, thanks for your code. I would like to apply your method and refer your paper to handle a texts classification issue. I don’t familiar to python. There is no any output when I executed the seminb.py. May you provide an example of input data? Your help will be appreciated. : )
April 1st, 2011 at 11:01 pm
Hi again~
I found the way to use training and test data. It can work now. Thank you.
April 7th, 2011 at 12:56 am
Hi,
Is it possible to to provide an example of input data? Thank you for your assistance in advance.
April 12th, 2011 at 4:09 am
Hi,
I have a question about the train_semi. I used the train_semi to train a classifier, but the accuracy is decrease after each iteration.
Should I use all the unlabeled data in the first iteration?
or add a part of unlabeled data increasely?
(I have a large unlabeled data set and a small labeled data set.)
Thank you.
——————————————————————————————
To hatice,
The input data is shown as below. I hope it may help you.
td,tdu,test data (Each column indicates a document, each row indicates a feature)
0 0 0 1 0
1 0 1 1 1
0 1 0 0 1
delta(Each column indicates a class, each row indicates a document)
0 1
1 0
1 0
0 1
0 1
the row of “delta” must be the same to the column of “td”.
If you have any question, you may e-mail me. We can discussed together.
April 12th, 2011 at 10:27 pm
Hi,
The function-train_semi of seminb.py confused me.
This is a part of code.
#———————————————————————
# E-step: find the probabilistic labels for unlabeled data
delta_u = self.predict_proba_all(tdu)
# M-step: train classifier with the union of
# labeled and unlabeled data
self.train(tdu, delta_u, normalize=False, sparse=False)
#———————————————————————
The M-step trains the classifier with the union of labeled and unlabeled data. But the input of self.train is just “tdu”(unlabeled).
Is there anything I miss or ?
April 13th, 2011 at 4:11 pm
In the E-step, I use the current model to label unlabeled data, i.e., compute the fractional labels for the unlabeled data.
In the M-step, I compute the counts for unlabeled data, I add the counts for labeled data (those don’t change so they are computed outside of the loop), and finally I normalize to get probabilities.
You can find a more recent version of my code here:
https://github.com/mblondel/scikit-learn/blob/semisupervised/scikits/learn/naive_bayes.py
You can find unit tests here:
https://github.com/mblondel/scikit-learn/blob/semisupervised/scikits/learn/tests/test_naive_bayes_semi.py
(Sorry I don’t have time to provide support about these programs)
April 15th, 2011 at 11:08 pm
Hi,
Very appreciate for your response. It’s very helpful. Thank you.
April 22nd, 2011 at 6:14 am
Hi,
Thanks for your program share.
I met an error where the number of row of numpy.array more than 1600.
“”" ValueError: setting an array element with a sequence. “”"
Does the numpy limit the size of array? If the anwer is yes, do you know how to expend the array size?
April 22nd, 2011 at 6:09 pm
I think it’s a bug in your program. It means that you’re doing something like this: arr[i, j] = [1, 2, 3, ...]. You cannot set the element of an array with a sequence.
May 6th, 2011 at 1:38 am
How big the training set should be in the Naive Bayes text classification?…
* First make sure that data is balanced. Since simple naive Bayesian algorithm won’t work for unbalanced dataset. If dataset is unbalanced, then I suggest you to try out complement Bayesian algorithm. Weka(http://www.cs.waikato.ac.nz/ml/weka/) and m…
September 20th, 2011 at 8:18 pm
Hi Mathieu,
I am trying to use the Kamal Nigam’s algorithm for the Multi-Variate Burnoulli NB. I am running into the strange problem of likelihood of data decreasing with every EM step! I believe that I might have made a mistake in adopting the algorithm for the Burnoulli case. Do you have any thoughts on how that can be done correctly?
Regards,
Abhaya
September 22nd, 2011 at 4:42 pm
Just quickly looking at the equations in this post, there doesn’t seem to be any reason that it shouldn’t work. So I guess it’s probably a bug in your implementation. Try to keep in mind the big picture: first train your classifier on the labeled data, then use your current classifier to find probabilistic labels of the unlabeled data, retrain with all the data (labeled and probabilistically labeled) and repeat. Also keep your implementation simple until you get it right and optimize performance afterwards.
November 11th, 2011 at 12:56 am
Does weka support Semi-Supervised Learning?
November 24th, 2011 at 7:26 pm
Hello..
thanks for your code..
Can anyone give complete working code with input matrix ? Thanks in advance..
November 27th, 2011 at 1:27 am
Hello rick wei,
As you suggest, I have provided input data for td and delta in train function, but I am getting error of objects are not aligned. Here is output of that
File “/var/test/seminb.py”, line 156, in train
self.p_w_c[w,c] += td[w,d] * delta[d,c]
File “/usr/lib/python2.6/dist-packages/numpy/core/defmatrix.py”, line 290, in __mul__
return N.dot(self, asmatrix(other))
ValueError: objects are not aligned
I am running it on ubuntu linux system.
Please advise what should i do to successfully run train and other functions.