Latent Dirichlet Allocation in Python
Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) (see my previous post, “LSA and pLSA in Python”), Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by the documents in that collection. LDA can be seen as a Bayesian extension of pLSA.
As Blei, the author of LDA, points out, the topic proportions in pLSA are tied to the training documents. This is problematic: 1) the number of parameters grows linearly with the number of training documents, which can cause serious overfitting, and 2) it is difficult to generalize to new documents, which requires so-called “folding-in”. LDA fixes these issues by being a fully generative model: where pLSA uses a matrix of P(topic|document) probabilities estimated for the training documents, LDA places a Dirichlet prior on the topic proportions.
To date, several parameter estimation schemes exist for LDA: variational Bayes, expectation propagation and Gibbs sampling. I’ve chosen to implement the latter. It was first described in a paper entitled “Finding scientific topics”, by Griffiths and Steyvers.
As with all model-based algorithms, during the early development phase it is useful to work with artificial data, generated by following the model assumptions. In the case of LDA (and pLSA), the core assumption is that words (w) in documents are generated by a mixture of topics (z). In other words, the probability of a word is:

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$

where $T$ is the number of topics.
The generative process can be summarized as follows: 1) set the topic proportions once and for all when the collection is instantiated and 2) for each document and for as many words as needed, sample a topic from the topic distribution and sample a word from the word distribution of the selected topic. Obviously, this is only an approximation of how documents are created in reality.
To generate an artificial dataset, we can fix the word distribution of each topic and then generate documents as explained above. Since we generated documents by sticking to the generative assumption of the model, if the algorithm is correctly implemented, it should be able to recover the word distribution of each topic, from the generated documents.
To gain insight and intuition, we can reuse the graphical example from Griffiths and Steyvers’ paper.
In the bag-of-words model, documents are represented by vectors of dimension V, where V is the vocabulary size. Moreover, an image of size N × N has N² pixels: it can thus be stored as a string/vector of length/size N². This means that a document over a vocabulary of size V = N² can be represented as an N × N image, where pixels correspond to words and pixel intensities correspond to word counts!
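As a concrete sketch of this correspondence (the word indices below are made up for illustration), a count vector over a 25-word vocabulary can be viewed as a 5×5 image with NumPy:

```python
import numpy as np

vocab_size = 25                            # a 5x5 vocabulary
counts = np.zeros(vocab_size, dtype=int)   # bag-of-words vector of one document
counts[[2, 7, 12, 17, 22]] = 4             # hypothetical counts: 4 occurrences of 5 words
image = counts.reshape(5, 5)               # pixel intensity = word count
# The chosen words form the third column of the image, i.e. a "vertical bar".
```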
As explained previously, we first need to fix the word distribution of each topic. Let’s arbitrarily create 10 topics.
5 with “vertical” bars:
and another 5 with “horizontal” bars:
Each topic distribution is represented by a 5×5 image, so the vocabulary is of size 25. Black pixels correspond to words that the topic can never generate; white pixels correspond to words that the topic generates with probability 1/5.
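One possible way to construct such bar topics with NumPy (the helper name is mine, not the post’s):

```python
import numpy as np

def bar_topics(width=5):
    # Build 2 * width topic-word distributions over a width x width vocabulary:
    # `width` topics with a vertical bar, then `width` with a horizontal bar.
    topics = []
    for i in range(width):
        t = np.zeros((width, width))
        t[:, i] = 1.0 / width       # vertical bar: 5 words, each with probability 1/5
        topics.append(t.ravel())
    for i in range(width):
        t = np.zeros((width, width))
        t[i, :] = 1.0 / width       # horizontal bar
        topics.append(t.ravel())
    return np.array(topics)         # shape (10, 25), each row sums to 1
```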
Now let’s generate 500 documents using the generative process previously described. Here are 3 examples of such generated documents.
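A minimal sketch of the generative process (function name and document length are my own choices; following the simplified process described above, the topic proportions are drawn once for the whole collection):

```python
import numpy as np

def generate_docs(topics, n_docs=500, doc_len=100, seed=0):
    # topics: (n_topics, vocab_size) matrix of topic-word distributions.
    rng = np.random.RandomState(seed)
    n_topics, vocab_size = topics.shape
    # 1) Set the topic proportions once for the whole collection.
    theta = rng.dirichlet(np.ones(n_topics))
    docs = np.zeros((n_docs, vocab_size), dtype=int)
    for d in range(n_docs):
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)         # 2a) sample a topic
            w = rng.choice(vocab_size, p=topics[z])   # 2b) sample a word from it
            docs[d, w] += 1
    return docs
```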
We clearly see bars emerging from the documents and can thus confirm that documents are mixtures of topics.
We can now use the generated documents as training data. If the Gibbs sampler is correctly implemented, we should be able to recover the original topics. Here are the results for the 1st, 5th and 26th iterations. The number in parentheses is the log-likelihood.
1st iteration (-278541.7835):
5th iteration (-165139.56193):
26th iteration (-129272.328181):
After a few iterations, we see that the algorithm has recovered the topics correctly. The log-likelihood also increases: as the number of iterations grows, the data become more and more likely under the model. The fact that it works so well is not surprising: the data were generated by sticking to the model assumptions.
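For reference, the log-likelihood log P(w|z) reported above can be computed in closed form from the topic-word counts, following Griffiths and Steyvers (a sketch assuming SciPy is available; the hyperparameter value below is arbitrary):

```python
import numpy as np
from scipy.special import gammaln

def log_likelihood(nzw, beta=0.1):
    # nzw: (n_topics, vocab_size) matrix of topic-word counts; beta: symmetric
    # Dirichlet hyperparameter on the topic-word distributions.
    n_topics, vocab_size = nzw.shape
    ll = n_topics * (gammaln(vocab_size * beta) - vocab_size * gammaln(beta))
    ll += gammaln(nzw + beta).sum()                           # per-word terms
    ll -= gammaln(nzw.sum(axis=1) + vocab_size * beta).sum()  # per-topic terms
    return ll
```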
The Gibbs sampler used here is said to be collapsed: the parameters of interest are not sampled directly. Instead, we sample the topic assignments, and the parameters can then be computed from them.
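Concretely, in Griffiths and Steyvers’ notation, each assignment $z_i$ is drawn from the full conditional, and the topic-word and document-topic parameters are then recovered from the counts:

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
\qquad
\hat{\phi}^{(w)}_j = \frac{n^{(w)}_j + \beta}{n^{(\cdot)}_j + W\beta},
\quad
\hat{\theta}^{(j)}_d = \frac{n^{(j)}_d + \alpha}{n^{(\cdot)}_d + T\alpha}
```

where $n^{(w)}_{-i,j}$ is the number of times word $w$ is assigned to topic $j$ excluding the current position $i$, $W$ is the vocabulary size and $T$ the number of topics.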
It is not necessarily obvious from the equation of the full conditional distribution (from which the topic assignments are sampled), but the sampler is naturally sparse: it only needs to iterate over words with non-zero counts. This is a nice property, given that sampling algorithms are often considered slow.
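A sketch of one such sweep (a toy re-implementation in the spirit of the post, not the post’s actual code; all names are mine), iterating only over the non-zero entries of the document-word count matrix:

```python
import numpy as np

def init_state(docs, n_topics, rng):
    # Randomly assign a topic to every word occurrence and build the counts.
    n_docs, vocab_size = docs.shape
    ndz = np.zeros((n_docs, n_topics), dtype=int)      # document-topic counts
    nzw = np.zeros((n_topics, vocab_size), dtype=int)  # topic-word counts
    nz = np.zeros(n_topics, dtype=int)                 # total words per topic
    assign = {}
    for d, w in zip(*docs.nonzero()):
        ts = rng.randint(n_topics, size=docs[d, w])
        assign[d, w] = ts
        for t in ts:
            ndz[d, t] += 1
            nzw[t, w] += 1
            nz[t] += 1
    return assign, ndz, nzw, nz

def gibbs_sweep(docs, assign, ndz, nzw, nz, alpha, beta, rng):
    # One collapsed Gibbs sweep over all topic assignments.
    n_topics, vocab_size = nzw.shape
    for d, w in zip(*docs.nonzero()):   # sparsity: zero-count words are skipped
        for i, old in enumerate(assign[d, w]):
            # Remove the current assignment from the counts...
            ndz[d, old] -= 1
            nzw[old, w] -= 1
            nz[old] -= 1
            # ...compute the full conditional, up to normalization...
            p = (nzw[:, w] + beta) / (nz + vocab_size * beta) * (ndz[d] + alpha)
            # ...and resample the assignment.
            new = rng.choice(n_topics, p=p / p.sum())
            assign[d, w][i] = new
            ndz[d, new] += 1
            nzw[new, w] += 1
            nz[new] += 1
```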
The code is fairly readable and compact, but it should be considered a toy implementation.

References
- “MCMC lecture at MLSS09” (Iain Murray). Nice for a first general overview and the insights.
- “Gibbs sampling for the uninitiated” (Resnik and Hardisty). Nice for a first general overview and the insights.
- “Pattern Recognition and Machine Learning” (Bishop), Chapters 8 and 11 on graphical models and sampling methods. Excellent chapters.
- “Review Course: Markov Chains and Monte Carlo Methods” (Cosma and Evers). Very nice free online course and solutions to exercises in Python and R!
- “Latent Dirichlet Allocation” (Blei et al., 2003). The original LDA paper, by Blei himself.
- “Finding scientific topics” (Griffiths and Steyvers). Insightful comments and nice intuitive graphical example.
- “Parameter Estimation for text analysis” (Heinrich). Very nice introduction to Bayesian thinking. Pseudo-code for the LDA Gibbs sampler.
- “On an equivalence between PLSI and LDA” (Girolami and Kaban). Connections between pLSA and LDA.
- “Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naive Bayes for Collapsed Gibbs Sampling” (Carpenter). Very detailed, step-by-step derivation of the collapsed Gibbs samplers for LDA and NB.
- “Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details” (Wang). Insightful comments and pseudo-code of the LDA Gibbs sampler.
Other Python implementations
- nrolland’s pyLDA. Works fine but mixes plain-Python and NumPy styles.
- alextp’s pylda. NumPy-style, but I haven’t tested it.