LSA and pLSA in Python
Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are thus unsupervised dimensionality reduction techniques. Applications include information retrieval, document classification and collaborative filtering.
Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing.
Comparison
| LSA | pLSA | |
|---|---|---|
| 1. Theoretical background | Linear Algebra | Probabilities and Statistics |
| 2. Objective function | Frobenius norm | Likelihood function |
| 3. Polysemy | No | Yes |
| 4. Folding-in | Straightforward | Complicated |
1. LSA stems from Linear Algebra as it is nothing more than a Singular Value Decomposition. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).
2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the Frobenius norm of the difference with the original matrix). Moreover, as it is well known in Machine Learning, the least squares solution corresponds to the Maximum Likelihood solution when experimental errors are gaussian. Therefore, LSA makes an implicit assumption of gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling.
The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.
Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).
3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.
4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected in the lower-dimensional space too. This is called “folding-in”. Clearly, new documents folded-in don’t contribute to learning the factored representation so it is necessary to rebuild the model using all the documents from time to time.
In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm.
Implementation in Python
LSA is straightforward to implement as it is nothing more than a SVD and Numpy’s Linear Algebra module has a function “svd” already. This function has an argument full_matrices which when set to False greatly reduces the time required. This argument doesn’t mean that the SVD is not full, just that the returned matrices don’t contain vectors corresponding to zero singular values. Scipy’s Linear Algebra package unfortunately doesn’t seem to have a sparse SVD. Likewise, there’s no truncated SVD (there exists fast algorithms to directly compute a truncated SVD rather than computing the full SVD then taking the top K singular values).
pLSA’s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.
The implementation is sparse. It works with both Numpy’s ndarrays and Scipy’s sparse matrices.
$ git clone http://www.mblondel.org/code/plsa.git
Next, I would like to explore Fisher Kernels as there seems to have nice interactions with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it’s more challenging. LDA is a Bayesian extension of pLSA : pLSA is equivalent to LDA under a uniform Dirichlet prior distribution.
June 13th, 2010 at 7:18 pm
Thank you for the interesting post.
The straightforward SVD is not often used for collaborative filtering nowadays: the problem is that matrixes in collaborative filtering have *missing values* and it’s necessary to deal with them somehow. It is believed that setting them to zero is no very good idea.
Instead, it is popular to find matrix factorization $X \approx U*(M^T)$ that minimizes L2 error on observed cells:
$(U,M) = argmin \sum_{x_ij is observed} (x_ij – )^2$.
Usually stochastic gradient descent (SGD) is used to minimize this function. It is easy to implement and very fast.
June 13th, 2010 at 10:09 pm
Eugene, that sounds interesting! Thank you. Any reference paper ?
I guess that one advantage of SGD is that documents can be added one at a time.
June 21st, 2010 at 6:47 pm
[...] Mathieu’s log Computer science, Chinese, Japanese, random thoughts… « LSA and pLSA in Python [...]
August 21st, 2010 at 9:52 pm
[...] Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) – see my previous post “LSA and pLSA in Python“, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents [...]
September 10th, 2010 at 3:37 am
I have been always enjoying reading your articles, really appreciating.
Note that mistype, sparce->sparse
November 17th, 2010 at 10:17 pm
Just wanted to let you know that I have been playing around with your PLSA code, and I really like it! Great job!
Also, you mentioned that you wanted a paper reference for the matrix factorization. I would recommend any of the papers from the guys that won the Netflix prize, such as:
Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk. 2008. Matrix factorization and neighbor based algorithms for the netflix prize problem. In Proceedings of the 2008 ACM conference on Recommender systems (RecSys ’08). ACM, New York, NY, USA, 267-274. DOI=10.1145/1454008.1454049 http://doi.acm.org/10.1145/1454008.1454049
February 23rd, 2011 at 5:26 pm
Thanks for your code.
I have a question that why you choose the symmetric form( P(w|z)P(z)P(d|z) ) to express the likelihood, but not the form P(w|z)P(z|d)P(d), as the later is more directly to computer document topics( P(z|d) ).
March 9th, 2011 at 2:02 am
By the way if you are looking for a good sparse SVD implementation for python, check out http://pypi.python.org/pypi/sparsesvd#downloads which is a Python wrapper around SVDLIBC. It calculates truncated SVD on sparse matrices and is very fast for even very large matrices (depending on sparsity of course).
April 15th, 2011 at 4:15 am
Great writeup, unusually lucid prose for a math guy.
Thanks for a code ref too. What is the license on it? I thought about trying it for a commercial application. Would you be willing to use a BSD license?
April 15th, 2011 at 5:54 pm
Will you ship the software to customers or will it remain server-side? If you don’t ship the software, you can use it even if it’s GPL.
April 26th, 2011 at 7:07 am
Hi Mathieu,
thanks for the code reference!
I need a fast implementation of plsa, do you know, if it exists, where can i find a C/C++ implementation??
April 26th, 2011 at 7:08 am
Hi Mathieu, long time no see!
I was browsing StackOverflow today, and this post was mentioned! Check it out: http://stackoverflow.com/questions/5783518/c-c-plsa-libraries
April 26th, 2011 at 4:25 pm
@Pietro: My code is written in C and wrapped to be usable from Python.
@André: En effet ! Il faut qu’on se contacte par email !
June 11th, 2011 at 7:38 am
“Will you ship the software to customers or will it remain server-side? If you don’t ship the software, you can use it even if it’s GPL.”
It will not be shipped, not even a webapp. It’s an internal application. Are you saying this code is GPL?
June 12th, 2011 at 1:47 am
You said “commercial application”. If it’s for internal use, you can use open-source software without complying to the license. (And yes this code is GPL)
June 15th, 2011 at 3:00 am
Thanks!
July 4th, 2011 at 2:49 pm
Great job, saved me a lot of time!
But I have one question. I have learnt model, topic sets looks very good, but I have problem when I try to use model for new query.
Just for testing now, I use one of training documents as ‘new’ query, so I have counted exactly the same words in exactly the same order.
When I use folding_in method on this query, result contains nothing but zeros.
When I use folding_in method, but with train without folding_in parameter, result contains random topic assignment.
When I force to use python version (not wrapped c), assertion assert(lik_diff >= -1e-10) don’t pass.
When I use python version with this assertion commented I get random result again.
I guess I must be doing something wrong, code looks very good… What can I do wrong?
Thanks in advance
July 5th, 2011 at 3:05 pm
There may be a bug in the folding_in method, I haven’t touched this code in a while. What you can do in the meantime is to compute the test topics together with the train topics, i.e. you need to concatenate the training and test examples into a single matrix (n_features * n_train + n_test). This is a kind of transductive learning.
The assertion can be commented out safely, apparently it is too strict.
BTW, the comments in my blog are in moderation mode (I get a lot of comment spam).
September 21st, 2011 at 3:02 am
Hey,
hi Mathieu!
I implemented a plsa module for my univesity and know i’m writing an article about it.
I have to write something about existent plsa implementation and since yours is one of the best i would like to quote it…
Does it have a ufficial name?
thanks,
pietro
September 22nd, 2011 at 4:34 pm
I think there are probably much better implementations than mine (and more maintained; I don’t have time to maintain mine anymore) but if you would like to quote it, go for it! Thanks! You can post a link to yours in these comments if you want.
October 6th, 2011 at 6:20 pm
Hey Mathieu,
i have a question, maybe you can help me.
How the “right” number of topics can be estimated?
I tried with a k-fold cross-validation, considering the Area Under a ROC Curve as the metric to evaluate how good is a topic number, but the model seems to have a serious problem of overfitting…:(
Do you know if does exist some way to estimate this parameter correctly?
Thanks you!!
byee
October 10th, 2011 at 2:19 am
Cross-validation is the right way to choose the number of topics. Maybe you didn’t use enough folds or maybe you’re using too many iterations during the EM procedure (early stopping can help with overfitting)?
November 2nd, 2011 at 10:11 am
dear can any body help me to give the how to construct the corpus of source code to be read from lsi
November 16th, 2011 at 3:56 am
To Andrew Polar: Any message from you will now directly go to the trash. Save us some time and stop spamming my blog.
June 6th, 2012 at 12:45 am
[...] solution to this is the use of PLSA combined with connected component analysis. Thanks to M. Blondel, I now have an implementation of the PLSA algorithm. To use it, I will have to map the current [...]
July 18th, 2012 at 5:17 pm
Hi,
I have been looking at your pLSA code in python, and debugging your program as some has noticed that there exist some failure in the folding-in process for new sample. I think I have resolved it by noticing that your normalize function have a strange behavior when you ask for normalizing an array of shape (1,Z) in the query. I noticed it set all values almost to zero except one (Thats why the maximization step sometimes goes below zero and sometimes have only one iteration to converge-which is strange), which is strange for a random initializer. Probably you may like to fix this error some day as so many guys ( including me ) are frequently reading your post.