## LSA and pLSA in Python

Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are thus unsupervised dimensionality reduction techniques. Applications include information retrieval, document classification and collaborative filtering.

Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing.

### Comparison

LSA pLSA
1. Theoretical background Linear Algebra Probabilities and Statistics
2. Objective function Frobenius norm Likelihood function
3. Polysemy No Yes
4. Folding-in Straightforward Complicated

1. LSA stems from Linear Algebra as it is nothing more than a Singular Value Decomposition. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).

2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the Frobenius norm of the difference with the original matrix). Moreover, as it is well known in Machine Learning, the least squares solution corresponds to the Maximum Likelihood solution when experimental errors are gaussian. Therefore, LSA makes an implicit assumption of gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling.

The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.

Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).

3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.

4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected in the lower-dimensional space too. This is called “folding-in”. Clearly, new documents folded-in don’t contribute to learning the factored representation so it is necessary to rebuild the model using all the documents from time to time.

In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm.

### Implementation in Python

LSA is straightforward to implement as it is nothing more than a SVD and Numpy’s Linear Algebra module has a function “svd” already. This function has an argument full_matrices which when set to False greatly reduces the time required. This argument doesn’t mean that the SVD is not full, just that the returned matrices don’t contain vectors corresponding to zero singular values. Scipy’s Linear Algebra package unfortunately doesn’t seem to have a sparse SVD. Likewise, there’s no truncated SVD (there exists fast algorithms to directly compute a truncated SVD rather than computing the full SVD then taking the top K singular values).

pLSA’s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.

The implementation is sparse. It works with both Numpy’s ndarrays and Scipy’s sparse matrices.

$git clone http://www.mblondel.org/code/plsa.git web interface Next, I would like to explore Fisher Kernels as there seems to have nice interactions with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it’s more challenging. LDA is a Bayesian extension of pLSA : pLSA is equivalent to LDA under a uniform Dirichlet prior distribution. ### 26 Responses to “LSA and pLSA in Python” 1. Eugene Says: Thank you for the interesting post. The straightforward SVD is not often used for collaborative filtering nowadays: the problem is that matrixes in collaborative filtering have *missing values* and it’s necessary to deal with them somehow. It is believed that setting them to zero is no very good idea. Instead, it is popular to find matrix factorization$X \approx U*(M^T)$that minimizes L2 error on observed cells:$(U,M) = argmin \sum_{x_ij is observed} (x_ij – )^2\$.

Usually stochastic gradient descent (SGD) is used to minimize this function. It is easy to implement and very fast.

2. Mathieu Says:

Eugene, that sounds interesting! Thank you. Any reference paper ?

I guess that one advantage of SGD is that documents can be added one at a time.

3. Mathieu’s log » Blog Archive » Semi-supervised Naive Bayes in Python Says:

[...] Mathieu’s log Computer science, Chinese, Japanese, random thoughts… « LSA and pLSA in Python [...]

4. Mathieu’s log » Blog Archive » Latent Dirichlet Allocation in Python Says:

[...] Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) – see my previous post “LSA and pLSA in Python“, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents [...]

5. jnhwkim Says:

Note that mistype, sparce->sparse

6. Carlos Castro Says:

Just wanted to let you know that I have been playing around with your PLSA code, and I really like it! Great job!

Also, you mentioned that you wanted a paper reference for the matrix factorization. I would recommend any of the papers from the guys that won the Netflix prize, such as:
Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk. 2008. Matrix factorization and neighbor based algorithms for the netflix prize problem. In Proceedings of the 2008 ACM conference on Recommender systems (RecSys ’08). ACM, New York, NY, USA, 267-274. DOI=10.1145/1454008.1454049 http://doi.acm.org/10.1145/1454008.1454049

7. Passerby D Says:

I have a question that why you choose the symmetric form( P(w|z)P(z)P(d|z) ) to express the likelihood, but not the form P(w|z)P(z|d)P(d), as the later is more directly to computer document topics( P(z|d) ).

8. nick Says:

By the way if you are looking for a good sparse SVD implementation for python, check out http://pypi.python.org/pypi/sparsesvd#downloads which is a Python wrapper around SVDLIBC. It calculates truncated SVD on sparse matrices and is very fast for even very large matrices (depending on sparsity of course).

9. Dan Says:

Great writeup, unusually lucid prose for a math guy.

Thanks for a code ref too. What is the license on it? I thought about trying it for a commercial application. Would you be willing to use a BSD license?

10. Mathieu Says:

Will you ship the software to customers or will it remain server-side? If you don’t ship the software, you can use it even if it’s GPL.

11. Pietro Says:

Hi Mathieu,
thanks for the code reference!

I need a fast implementation of plsa, do you know, if it exists, where can i find a C/C++ implementation??

12. André Caron Says:

Hi Mathieu, long time no see!

I was browsing StackOverflow today, and this post was mentioned! Check it out: http://stackoverflow.com/questions/5783518/c-c-plsa-libraries

13. Mathieu Says:

@Pietro: My code is written in C and wrapped to be usable from Python.

@André: En effet ! Il faut qu’on se contacte par email !

14. Dan Says:

“Will you ship the software to customers or will it remain server-side? If you don’t ship the software, you can use it even if it’s GPL.”

It will not be shipped, not even a webapp. It’s an internal application. Are you saying this code is GPL?

15. Mathieu Says:

You said “commercial application”. If it’s for internal use, you can use open-source software without complying to the license. (And yes this code is GPL)

16. Dan Says:

Thanks!

17. Ania Says:

Great job, saved me a lot of time!
But I have one question. I have learnt model, topic sets looks very good, but I have problem when I try to use model for new query.
Just for testing now, I use one of training documents as ‘new’ query, so I have counted exactly the same words in exactly the same order.
When I use folding_in method on this query, result contains nothing but zeros.
When I use folding_in method, but with train without folding_in parameter, result contains random topic assignment.
When I force to use python version (not wrapped c), assertion assert(lik_diff >= -1e-10) don’t pass.
When I use python version with this assertion commented I get random result again.

I guess I must be doing something wrong, code looks very good… What can I do wrong?

18. Mathieu Says:

There may be a bug in the folding_in method, I haven’t touched this code in a while. What you can do in the meantime is to compute the test topics together with the train topics, i.e. you need to concatenate the training and test examples into a single matrix (n_features * n_train + n_test). This is a kind of transductive learning.

The assertion can be commented out safely, apparently it is too strict.

BTW, the comments in my blog are in moderation mode (I get a lot of comment spam).

19. pietro Says:

Hey,
hi Mathieu!

I implemented a plsa module for my univesity and know i’m writing an article about it.
I have to write something about existent plsa implementation and since yours is one of the best i would like to quote it…
Does it have a ufficial name?

thanks,
pietro

20. Mathieu Says:

I think there are probably much better implementations than mine (and more maintained; I don’t have time to maintain mine anymore) but if you would like to quote it, go for it! Thanks! You can post a link to yours in these comments if you want.

21. pietro Says:

Hey Mathieu,
i have a question, maybe you can help me.
How the “right” number of topics can be estimated?
I tried with a k-fold cross-validation, considering the Area Under a ROC Curve as the metric to evaluate how good is a topic number, but the model seems to have a serious problem of overfitting…:(
Do you know if does exist some way to estimate this parameter correctly?

Thanks you!!
byee

22. Mathieu Says:

Cross-validation is the right way to choose the number of topics. Maybe you didn’t use enough folds or maybe you’re using too many iterations during the EM procedure (early stopping can help with overfitting)?

23. ali Says:

dear can any body help me to give the how to construct the corpus of source code to be read from lsi

24. Mathieu Says:

To Andrew Polar: Any message from you will now directly go to the trash. Save us some time and stop spamming my blog.

25. PLSA and PLSM | Recognizing Attack Patterns: Clustering of Optical Flow Vectors in RoboCup Soccer Says:

[...] solution to this is the use of PLSA combined with connected component analysis. Thanks to M. Blondel, I now have an implementation of the PLSA algorithm. To use it, I will have to map the current [...]

26. aeweiwi Says:

Hi,
I have been looking at your pLSA code in python, and debugging your program as some has noticed that there exist some failure in the folding-in process for new sample. I think I have resolved it by noticing that your normalize function have a strange behavior when you ask for normalizing an array of shape (1,Z) in the query. I noticed it set all values almost to zero except one (Thats why the maximization step sometimes goes below zero and sometimes have only one iteration to converge-which is strange), which is strange for a random initializer. Probably you may like to fix this error some day as so many guys ( including me ) are frequently reading your post.