Archive for June, 2010

Semi-supervised Naive Bayes in Python

Monday, June 21st, 2010

Expectation-Maximization

The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can’t directly count. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use our current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.

Naive Bayes trained with EM

In their paper “Semi-supervised Text Classification Using EM”, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a semi-supervised fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:

  • Train a classifier with your labeled data
  • While the model likelihood increases:
    • E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought as probabilistic/fractional labels.
    • M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.

The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.

Code

I made a simple implementation of it in Python + Numpy. The code is fairly optimized.

$ git clone http://www.mblondel.org/code/seminb.git

web interface

Implementation details

Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.

Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:

P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}

However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly exceed the precision of the computer and become zero. The solution is to perform the computations in the log domain:

\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)

To turn around P(x|c), we use Bayes’rule:

P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}

By posing z_j = \log P(c_j) + \log P(x_i|c_j), we get:

P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}

This is the softmax function. However, we are back to our initial problem because, since the z_j are likely to tend to -inf, the exponentials are likely to in turn underflow. The trick is to multiply the numerator and denominator by the same constant e^{-m}:

P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}

Setting m to max_j~z_j, the z_j-m values will get closer to zero. The rationale for this is that computation of the exponential overflows earlier and is less precise for big values (positive or negative) than for small values.

This trick will improve the situation quite a lot but in case this is not enough:

P(y_i=c_j|x_i) = \begin{cases} 0, & \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m > t\}} e^{z_k-m}},  & \mbox{otherwise} \end{cases}

This sets the exponentials to zero when e^{z_j-m} \le e^t. For t=-10, this is 0.000045. Equivalently this corresponds to setting the exponentials to zero when e^{z_j} \le e^{t+m}. Since both t and and m are negative, this shows that subtracting the maximum m, as explained before, does help improving the precision.

Reference

Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.

LSA and pLSA in Python

Sunday, June 13th, 2010

Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are thus unsupervised dimensionality reduction techniques. Applications include information retrieval, document classification and collaborative filtering.

Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing.

Comparison

  LSA pLSA
1. Theoretical background Linear Algebra Probabilities and Statistics
2. Objective function Frobenius norm Likelihood function
3. Polysemy No Yes
4. Folding-in Straightforward Complicated

1. LSA stems from Linear Algebra as it is nothing more than a Singular Value Decomposition. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).

2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the Frobenius norm of the difference with the original matrix). Moreover, as it is well known in Machine Learning, the least squares solution corresponds to the Maximum Likelihood solution when experimental errors are gaussian. Therefore, LSA makes an implicit assumption of gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling.

The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.

Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).

3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.

4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected in the lower-dimensional space too. This is called “folding-in”. Clearly, new documents folded-in don’t contribute to learning the factored representation so it is necessary to rebuild the model using all the documents from time to time.

In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm.

Implementation in Python

LSA is straightforward to implement as it is nothing more than a SVD and Numpy’s Linear Algebra module has a function “svd” already. This function has an argument full_matrices which when set to False greatly reduces the time required. This argument doesn’t mean that the SVD is not full, just that the returned matrices don’t contain vectors corresponding to zero singular values. Scipy’s Linear Algebra package unfortunately doesn’t seem to have a sparse SVD. Likewise, there’s no truncated SVD (there exists fast algorithms to directly compute a truncated SVD rather than computing the full SVD then taking the top K singular values).

pLSA’s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.

The implementation is sparse. It works with both Numpy’s ndarrays and Scipy’s sparse matrices.

$ git clone http://www.mblondel.org/code/plsa.git

web interface

Next, I would like to explore Fisher Kernels as there seems to have nice interactions with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it’s more challenging. LDA is a Bayesian extension of pLSA : pLSA is equivalent to LDA under a uniform Dirichlet prior distribution.