Transparent system-wide proxy

June 27th, 2011

Proxies can be a powerful way to enforce anonymity or to bypass various kinds of restrictions on Internet (government censorship, regional contents, …). In this post, I’ll describe a simple technique to create a transparent proxy at the system level. It’s especially useful in cases when you want to make sure that all connections make it through the proxy or when your application of interest doesn’t have proxy support. I don’t think this technique is that well-known, hence this post.
Read the rest of this entry »

Regularized Least Squares

February 9th, 2011

Recently, I’ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I’m receiving good feedback regarding my posts on Machine Learning, I’m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.

Ridge regression

Given a centered input \mathbf{x}^* \in \mathbf{R}^D, a linear regression model predicts the output y^* \in \mathbf{R} by

y^* = f(\mathbf{x}^*) = \mathbf{x}^* \cdot \mathbf{w}

where \mathbf{w} \in \mathbf{R}^D is a weight vector.

Given a set of training instances X = (\mathbf{x}_1,\dots,\mathbf{x}_N)^T and its associated set of outputs \mathbf{y} = (y_1,\dots,y_N)^T, the method of least-squares finds the weight vector \mathbf{w} which minimizes the sum of square differences between the predictions and the actual labels.

L_{LS}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2

However, minizing the sum of square errors on the training set doesn’t necessarily mean that the model will do well on new data. For this reason, Ridge regression adds a regularization term, which controls the complexity of the weight vector. The objective function becomes

L_{Ridge}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2

where \lambda is a parameter which controls the regularization strength. Minimizing L_{Ridge} is thus a trade-off between minimizing the sum of errors and keeping the weight vector simple (i.e., small). Put differently, by introducing some bias (the regularization), we are purposely limiting the freedom of the model and thus reducing the variance of the weight vector. Regularization is a simple way to avoid overfitting and often improves the prediction accuracy on new data, especially if the training set is small.

Calculating the gradient of L_{Ridge} with respect to \mathbf{w} and solving for \mathbf{w}, we find that \mathbf{w} has a closed-form solution

\mathbf{w} = H^{-1} X^T \mathbf{y}

where H = (X^T X + \lambda I). Note that the matrix inverse is just notational convenience. In practice, we simply solve the corresponding system of linear equations H\mathbf{w}= X^T \mathbf{y}. Since H is symmetric positive-definite, one typically first finds the Cholesky decomposition H=LL^T. Then since L is lower triangular, solving the system is simply a matter of applying forward and backward substitution. Another commonly used method is the conjugate gradient.

Note that not only does regularization improve accuracy, it is often necessary because it makes H invertible when X^T X is not. See Tikhonov regularization.

Kernel ridge

As per the representer theorem, just like we did for kernel perceptron, we can replace \mathbf{w} with X^T\boldsymbol{\alpha}. Substituting into L_{Ridge} then deriving with respect to \boldsymbol{\alpha} and solving for \boldsymbol{\alpha}, we find

\boldsymbol{\alpha} = G^{-1} \mathbf{y}

where G =(K + \lambda I) and K is the kernel matrix. The prediction function becomes

y^* = f(\mathbf{x}^*) = \sum_{i=1}^N \alpha_i k(\mathbf{x}^*,\mathbf{x}_i)

where k is the kernel function. However, in contrast to SVM, \boldsymbol{\alpha} is not sparse. This means that the whole training set needs to be used at prediction time.

Multiple output

So far we have considered the case when there is only one output y per instance. We can easily extend to the M output case. To do that, we simply need to solve HW= X^T Y, where W is D \times M and Y is N \times M. Note that the Cholesky decomposition needs to be computed only once. Since it takes O(D^3) time, it is much more efficient to solve HW=X^T Y directly than to solve H\mathbf{w}=X^T \mathbf{y}, M times.

Classification

M-class classification can easily be achieved by combining multiple-output ridge regression and the 1-of-M coding scheme. This is sometimes referred by some authors as Regularized Least Square Classifier (RLSC). Concretely, we just need to construct a matrix Y by setting Y_{ij} to 1 if the i^{th} instance is associated with the j^{th} label and to -1 otherwise. Then prediction is just a matter of taking the most confident label for each instance, just like one-vs-all SVM.

Note that the square error is sometimes criticized for classification because it penalizes overconfident predictions. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not. However, in practice, RLSC often achieves accuracy on a par with other classifiers such as SVM.

Efficient leave-one-out cross-validation

Estimating \lambda is a model selection problem, usually performed by k-fold or leave-one-out cross-validation. The naive approach to leave-one-out is to learn N models by holding out one instance at a time. Obviously, this is computationally very expensive. Fortunately, there exist nice computational tricks. Below, I show how to use them in the kernel case but they can equally be applied to the normal case.

The first trick is to take the eigendecomposition

K=Q \Lambda Q^T.

Using basic linear algebra, we obtain

G^{-1}=Q(\Lambda + \lambda I)^{-1} Q^T.

Since (\Lambda + \lambda I) is diagonal, it can easily be inverted by taking its reciprocal. Therefore by first paying the price of an eigendecomposition, G can thus be inverted inexpensively for many different \lambda.

Let f_i(\mathbf{x}_i) be the prediction value when the model was fitted to all training instances but \mathbf{x}_i. The second trick, sometimes called generalized cross-validation, will allow us to calculate the f_i(\mathbf{x}_i) almost for free.

First, let \mathbf{y}^i be the label vector of the model when \mathbf{x}_i is held out.

\mathbf{y}^i = (y_1,\dots,f_i(\mathbf{x}_i),\dots,y_N)^T

When we replace \mathbf{y} by \mathbf{y}^i in L_{Ridge}, clearly, \mathbf{x_i} doesn’t affect the solution (as desired) and we have

\boldsymbol{\alpha}^i = G^{-1} \mathbf{y}^i


f_i(\mathbf{x}_i) = (K \boldsymbol{\alpha}^i)_i.

However, this is a circular definition, since \mathbf{y}^i itself depends on f_i(\mathbf{x}_i). We can calculate the difference with f(\mathbf{x}_i), which will allow us to factorize f_i(\mathbf{x}_i).

f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1} \mathbf{y}^i)_i - (KG^{-1}\mathbf{y})_i

By definition, the i^{th} element of \mathbf{y}^i and \mathbf{y} are f_i(\mathbf{x}_i) and y_i, respectively. Therefore,

f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1})_{ii} (f_i(\mathbf{x}_i) - y_i)

and finally, by simplifying, we obtain

f_i(\mathbf{x}_i) = \frac{f(\mathbf{x}_i) -(K G^{-1})_{ii} y_i}{1 - (K G^{-1})_{ii}}.

Similar computational tricks exist for Gaussian Process Regression as well.

References

Regularized Least Squares, Ryan Rifkin

Kernel Perceptron in Python

October 31st, 2010

The Perceptron (Rosenblatt, 1957) is one of the oldest and simplest Machine Learning algorithms. It’s also trivial to kernelize, which makes it an ideal candidate to gain insights on kernel methods.

Perceptron

The Perceptron predicts the class of an input \mathbf{x} \in \mathcal{X} with the function

f(\mathbf{x}) = sign(\mathbf{w}\cdot\phi(\mathbf{x}))

where sign(y) = +1 if y > 0 and -1 if y < 0, \phi is a feature-space transformation and \mathbf{w} is a feature weight vector. If \mathbf{x} is already a feature vector and a projection to another space is not needed, then \phi(\mathbf{x})=\mathbf{x}.

Given a data set comprising n training examples \mathbf{x}_1,\dots,\mathbf{x}_n and their corresponding labels y_1,\dots,y_n, where y_n \in \{-1,+1\}, the Perceptron makes a prediction for each (\mathbf{x}_i, y_i) using the current estimate of \mathbf{w}. When the prediction is correct (equal to the label y_i), the algorithm jumps to the next example. When the prediction is incorrect, in order to correct for the mistake, if y_i = +1 then \mathbf{w} is incremented by \phi(\mathbf{x}_i), otherwise it is decremented by \phi(\mathbf{x}_i). Since y_i \in \{-1,+1\}, the update rule is thus:

\mathbf{w} = \mathbf{w} + y_i \phi(\mathbf{x}_i)

This procedure can be repeated by doing T passes over the data set. Ideally T must be optimized. This can be done efficiently by measuring the performance after each pass on a validation set.


Figure 1: The decision boundary (hyperplane), to which \mathbf{w} is normal, partitions the vector space into two sets, one for each class. An update changes the direction of the decision boundary to correct for the incorrect prediction. (From this document)


Figure 2: Decision boundary in the linear case.

Kernel Perceptron

From the update rule above, we clearly see that, if \mathbf{w} is initialized to the zero vector, it is a linear combination of the training examples.

\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)

Injecting \mathbf{w} into the prediction function f(\mathbf{x}), we get

f(\mathbf{x}) = sign(\sum_i \alpha_i y_i \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})) = sign(\sum_i \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}))

where K(\mathbf{x}_i,\mathbf{x}) = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}) is a Mercer kernel.

The update rule for when a mistake is made predicting \mathbf{x}_i now simply becomes

\alpha_i = \alpha_i + 1

i.e., \alpha_i is the number of times a mistake has been made with respect to \mathbf{x}_i.

A few remarks:

  • Instead of learning a weight vector \mathbf{w} with respect to features, we’re now learning a weight vector \boldsymbol{\alpha} with respect to examples. The feature and kernel representations are duals (they are interchangeable).
  • To predict the class of an input \mathbf{x}, we need to store the training examples \mathbf{x}_i. Kernel methods are memory-based methods, like k-NN. However, we only need to store examples for which a mistake has been made, i.e. {\alpha}_i \neq 0 . In the context of Support Vector Machines (SVMs), these are called support vectors. SVMs, however, not only store examples for which a mistake has been made, they also store examples that lie inside the margin, i.e. y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i)) \le 1. (See Figure 4 from my post on SVMs)
  • In the online learning setting, the number of support vectors can grow and grow as more mistakes are made. The Forgetron is an extension of the Kernel Perceptron which can learn with a “memory budget”. When the budget is exceeded, some support vectors are “forgotten”.
  • For some kinds of objects like sequences, trees and graphs, it might be difficult to map objects to feature vectors while it is easy to come up with a similarity measure K(\mathbf{x}_i,\mathbf{x}_j) between any two objects \mathbf{x}_i and \mathbf{x}_j. In this case, kernel methods are very useful, since they can be used “as is”.

For a brief and intuitive introduction to kernel classifiers, I highly recommend these slides, by Mark Johnson.


Figure 3: Decision boundary when using a gaussian kernel. Green dots indicate support vectors.

The voted and average Perceptrons are two straightforward extensions to the Perceptron, which for some applications are competitive with SVMs. For details, see “Large Margin Classification Using the Perceptron Algorithm” by Freund and Schapire.

Source

http://gist.github.com/656147

Support Vector Machines in Python

September 19th, 2010

Support Vector Machines (SVM) are the state-of-the-art classifier in many applications and have become ubiquitous thanks to the wealth of open-source libraries implementing them. However, you learn a lot more by actually doing than by just reading, so let’s play a little bit with SVM in Python! To make it easier to read, we will use the same notation as in Christopher Bishop’s book “Pattern Recognition and Machine Learning”.

Maximum margin

In SVM, the class of an input vector \mathbf{x} can be decided by evaluating the sign of y(\mathbf{x}).

(7.1)~y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})+b

If y(\mathbf{x}) > 0 we assign \mathbf{x} to class +1 and if y(\mathbf{x}) < 0, we assign it to class -1. Here \phi(\mathbf{x}) is a feature-space transformation, which can map \mathbf{x} to a space of higher, possibly infinite, dimensions.

Given a data set comprising N input vectors \mathbf{x}_1,\dots,\mathbf{x}_n and their corresponding labels t_1,\dots,t_n, where t_n \in \{-1,+1\}, we would like to find \mathbf{w} and b such that it explains the training data: y(\mathbf{x}_n) \ge 1 when t_n=+1 and y(\mathbf{x}_n) \le 1 when t_n=-1. This can be rewritten in a single constraint:

(7.5)~t_n (\mathbf{w}^T\phi(\mathbf{x})+b) \ge 1,~n=1,\dots,N

In addition, \mathbf{w} and b are chosen so that the distance between the decision boundary \mathbf{w}^T\phi(\mathbf{x})+b=0 (a line in the 2-d case, a plane in the 3-d case, a hyperplane in the n-d case) and the closest points to it is maximized. This distance is called the margin, hence the name maximum margin classifier. Geometrically, the margin is found to be 2/\|\mathbf{w}\|^2 and so the maximum margin problem can be equivalently expressed as the minimization problem:

(7.6)~arg min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2

subject to constraint (7.5).

Figure 1: The linearly separable case. The decision line is the plain line and the margin is the gap between the two dotted lines. Only 3 support vectors (green dots) out of 180 training examples are necessary.

Dual representation

Since this is a constrained optimization problem, we can introduce Lagrange multipliers a_n \ge 0 (one per training example), differentiate the Lagrangian function with respect to \mathbf{w} and b and inject the solution back in the Lagrangian function (equations (7.7) to (7.9) in Bishop’s book), so that it depends only on the Lagrange multipliers. Doing so, we find that the maximum margin equivalently emerges from the maximization of

(7.10)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)

subject to the constraints

(7.11)~a_n \ge 0,~n=1,\dots,N


(7.12)~\sum_{n=1}^Na_nt_n = 0

This is the so-called dual representation and is a quadratic programming (QP) problem. k(\mathbf{x}_n,\mathbf{x}_m)= \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m) is called the kernel function.

Similarly to the objective function, y(\mathbf{x}) can also be re-expressed solely in terms of the Lagrange multipliers.

(7.13)~y(\mathbf{x})= \sum_{n=1}^N a_n t_n k(\mathbf{x},\mathbf{x}_n)

The important thing to notice here is that we’ve gone from a sum over M dimensions (the dot product in equation (7.1)) to a sum over N points. This may seem like a disadvantage as the number of training examples N is usually bigger than the number of dimensions M. However, this is very useful and is called the kernel trick: this allows to use SVM, originally a linear classifier, to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, but without explicitly computing \phi(\mathbf{x}).

In many situations, only a small proportion of the Lagrange multipliers a_n will be non-zero, therefore we only need to store the corresponding training examples \mathbf{x}_n. These are called the support vectors and this is why SVMs are sometimes called sparse kernel machines.

That being said, in the linear case, i.e. when \phi(\mathbf{x})=\mathbf{x}, it is faster to directly compute y(\mathbf{x}) from equation (7.1). \mathbf{w} and b can be computed in terms of the Lagrange multipliers by equations (7.8) and (7.18) in Bishop’s book.

Figure 2: The non-linearly separable case. Example of a gaussian kernel with parameter sigma=5.0. Perfect prediction is achieved on the held-out 20 data points.

QP solver

We want to find the Lagrange multipliers a_n maximizing equation (7.10) subject to the constraints (7.11) and (7.11). This can be done by a standard QP solver such as cvxopt.

Minimize

\frac{1}{2} \mathbf{x}^T P\mathbf{x} + \mathbf{q}^T \mathbf{x}

subject to

G\mathbf{x} \leq \mathbf h (inequality constraint)


A\mathbf{x} = \mathbf b (equality constraint)

The unknow is \mathbf{x}, which in our case correspond to the Lagrange multipliers \mathbf{a}=a_1,\dots,a_n. We just need to rework the formulation a little bit to use matrix notation and be a minimization (hence the -1 multiplicative factors).

# Gram matrix
K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
    for j in range(n_samples):
        K[i,j] = self.kernel(X[i], X[j])
 
P = cvxopt.matrix(np.outer(y,y) * K)
q = cvxopt.matrix(np.ones(n_samples) * -1)
A = cvxopt.matrix(y, (1,n_samples))
b = cvxopt.matrix(0.0)
G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
h = cvxopt.matrix(np.zeros(n_samples))
 
# Solve QP problem
solution = cvxopt.solvers.qp(P, q, G, h, A, b)
 
# Lagrange multipliers
a = np.ravel(solution['x'])

Note here that P is a N \times N matrix. Thus, a standard QP solver can’t be used for a large number of training examples, as P needs to be stored in memory. There exists a number of algorithms in order to decompose the original QP problem into smaller QP problems that target only a few training samples at a time. One such algorithm is Sequential Minimal Optimization (SMO). One advantage of SMO is that the smaller QP problems can be solved analytically and so SMO doesn’t even need a QP solver.

Soft margin

The problem with the formulation we have used thus far is that it doesn’t allow for misclassification of the training examples. This can lead to poor generalization if there is overlap between the distributions of the two classes. To solve this problem, we can rework constraint (7.5) as

(7.20)~t_n y(\mathbf{x}_n) \ge 1 - \xi_n,~n=1,\dots,N


\xi_n are called slack variables and are introduced to allow the misclassification of some examples. If \xi_n=0, the corresponding training example is correctly classified. If 0 < \xi_n \le 1, the training example lies inside the margin but is still on the correct side of the decision boundary. If \xi_n > 1, the training example is misclassified. Equation (7.6) then becomes

(7.21)~C \sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|^2

C > 0 is the parameter which controls the trade-off between the slack variable penality and the margin. Again, we can introduce Lagrange multipliers, derive the Lagrangian function with respect to \mathbf{w}, b and \xi_n, and inject the solutions back in the Lagragian function (equations (7.22) to (7.31) in Bishop’s book).

(7.32)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)

Which is identical to the hard margin case! The constraints become:

(7.33)~0 \le a_n \le C


(7.34)~\sum_{n=1}^N a_n t_n = 0

Interestingly, the slack variables \xi_n have vanished and the only difference with the hard margin is that the inequality constraint now has an upper bound, C.

The attentive reader will have noticed that the inequality constraints in cvxopt have an upper bound but no lower bound.

G\mathbf{x} \leq \mathbf h

The trick is to rewrite constraint (7.33) as a system of inequations, in matrix notation. Example with 2 training examples:

\begin{pmatrix}-1 & 0 \\ 0 & -1 \\ 1 & 0  \\ 0 & 1\end{pmatrix}\begin{pmatrix}a_1\\ a_2\end{pmatrix} \le \begin{pmatrix}0 \\ 0 \\ C \\ C\end{pmatrix}

Figure 3: The hard margin case. 180 vectors out of 180 are support vectors! And the classifier only achieves 11 correct predictions out of 20, on held-out data.

Figure 4: The soft margin case (C=0.1). 36 vectors out of 180 are support vectors. The classifier achieves 19 correct predictions out of 20!

Source

http://gist.github.com/586753

References

Pattern Recognition and Machine Learning, Christopher Bishop, 2006.

ソフトマージンSVM, 人工知能に関する断想録

Latent Dirichlet Allocation in Python

August 21st, 2010

Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) – see my previous post “LSA and pLSA in Python“, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. LDA can be seen as a Bayesian extension of pLSA.

As Blei, the author of LDA, points out, the topic proportions in pLSA are tied with the training documents. This is problematic: 1) the number of parameters grows linearly with the number of training documents, which can cause serious overfitting 2) it is difficult to generalize to new documents and requires so-called “folding-in”. LDA fixes those issues by being a fully generative model: where pLSA uses a matrix of P(topic|document) probabilities, LDA uses a distribution over topics.

To date, there exists several parameter estimation schemes for LDA: variational Bayes, expectation propagation and Gibbs sampling. I’ve chosen to implement the latter. It has first been described in a paper entitled “Finding scientific topics”, by Griffiths and Steyvers.

Artificial data

As with all model-based algorithms, during the early development phase, it is useful to work with artificial data, generated by following the model assumptions. In the case of LDA (and pLSA), the core assumption is that words (w) in documents are generated by mixture of topics (z). In other words, the probability of a word is:

P(w) = \sum_{z} P(w|z) P(z)

The generative process can be summarized as follows: 1) set the topic proportions once for all when the collection is instantiated and 2) for each document and for as many words as needed, sample a topic from the topic distribution and sample a word from the word distribution of the selected topic. Obviously, this is only an approximation of how documents are created in reality.

To generate an artificial dataset, we can fix the word distribution of each topic and then generate documents as explained above. Since we generated documents by sticking to the generative assumption of the model, if the algorithm is correctly implemented, it should be able to recover the word distribution of each topic, from the generated documents.

Graphical example

To gain insight and intuition, we can reuse the graphical example from Griffiths and Steyvers’ paper.

In the bag-of-words model, documents are represented by vectors of dimension V, where V is the vocabulary size. Moreover, an image of size \sqrt{V} \times \sqrt{V} has V pixels: it can thus be stored as a string/vector of length/size V. This means that a document in the bag-of-words model can be represented as an image, where pixels correspond to words and pixel intensities correspond to word counts!

As put previously, we first need to fix the word distribution of each topic. Let’s arbitrarily create 10 topics.

5 with “vertical” bars:

and another 5 with “horizontal” bars:

Each topic distribution is represented by a 5×5 image, so the vocabulary is of size 25. Black pixels correspond to words that the topic will never possibly generate. White pixels correspond to words that the topic can generate with probability 1/5.

Now let’s generate 500 documents using the generative process previously described. Here are 3 examples of such generated documents.

We clearly see bars emerging from the documents and can thus confirm that documents are mixtures of topics.

We can now use the generated documents as training data. If the Gibbs sampler is correctly implemented, we should be able to recover the original topics. Here are the results for the 1st, 6th and 26th iterations. The number between brackets is the log-likelihood.

1st iteration (-278541.7835):

5th iteration (-165139.56193):

[...]

26th iteration (-129272.328181):

After a few iterations, we see that the algorithm recovered the topics correctly. Also, the log-likelihood increases: as the number of iterations increases, it becomes more and more likely that the model generated the data. The fact that it works pretty well is not surprising: the data used were generated by sticking to the model assumptions.

Gibbs sampling

The Gibbs sampler used is said to be collapsed: the parameters of interest are not sampled directly. Instead we sample the topic assignments and the parameters can be computed in terms of those.

It is not necessarily obvious from the equation of the full conditional distribution (from which the topic assignments are sampled) but the sampler is naturally sparse: it doesn’t need to iterate over words with zero-count. This is a nice property, given that sampling algorithms are often considered slow.

Source code

http://gist.github.com/542786

Fairly readable and compact code but to be considered a toy implementation.

Useful Resources

MCMC

- “MCMC lecture at MLSS09” (Iain Murray). Nice for a first general overview and the insights.

- “Gibbs sampling for the uninitiated” (Resnik and Hardisty). Nice for a first general overview and the insights.

- “Pattern Recognition and Machine Learning” (Bishop), Chapters 8 and 11 on graphical models and sampling methods. Excellent chapters.

- “Review Course: Markov Chains and Monte Carlo Methods” (Cosma and Evers). Very nice free online course and solutions to exercises in Python and R!

LDA

- “Latent Dirichlet Allocation” (Blei et al, 2003). By Blei himself.

- “Finding scientific topics” (Griffiths and Steyvers). Insightful comments and nice intuitive graphical example.

- “Parameter Estimation for text analysis” (Heinrich). Very nice introduction to Bayesian thinking. Pseudo-code for the LDA Gibbs sampler.

- “On an equivalence between PLSI and LDA” (Girolami and Kaban). Connections between pLSA and LDA.

- “Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naive Bayes for Collapsed Gibbs Sampling” (Carpenter). Very detailed, step-by-step derivation of the collapsed Gibbs samplers for LDA and NB.

- “Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details” (Wang). Insightful comments and pseudo-code of the LDA Gibbs sampler.

Other Python implementations

- nrolland’s pyLDA. Works fine but mixes Python-style and Numpy-style.

- alextp’s pylda. Numpy-style but not tested.

Semi-supervised Naive Bayes in Python

June 21st, 2010

Expectation-Maximization

The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can’t directly count. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use our current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.

Naive Bayes trained with EM

In their paper “Semi-supervised Text Classification Using EM”, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a semi-supervised fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:

  • Train a classifier with your labeled data
  • While the model likelihood increases:
    • E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought as probabilistic/fractional labels.
    • M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.

The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.

Code

I made a simple implementation of it in Python + Numpy. The code is fairly optimized.

$ git clone http://www.mblondel.org/code/seminb.git

web interface

Implementation details

Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.

Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:

P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}

However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly exceed the precision of the computer and become zero. The solution is to perform the computations in the log domain:

\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)

To turn around P(x|c), we use Bayes’rule:

P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}

By posing z_j = \log P(c_j) + \log P(x_i|c_j), we get:

P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}

This is the softmax function. However, we are back to our initial problem because, since the z_j are likely to tend to -inf, the exponentials are likely to in turn underflow. The trick is to multiply the numerator and denominator by the same constant e^{-m}:

P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}

Setting m to max_j~z_j, the z_j-m values will get closer to zero. The rationale for this is that computation of the exponential overflows earlier and is less precise for big values (positive or negative) than for small values.

This trick will improve the situation quite a lot but in case this is not enough:

P(y_i=c_j|x_i) = \begin{cases} 0, & \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m > t\}} e^{z_k-m}},  & \mbox{otherwise} \end{cases}

This sets the exponentials to zero when e^{z_j-m} \le e^t. For t=-10, this is 0.000045. Equivalently this corresponds to setting the exponentials to zero when e^{z_j} \le e^{t+m}. Since both t and and m are negative, this shows that subtracting the maximum m, as explained before, does help improving the precision.

Reference

Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.

LSA and pLSA in Python

June 13th, 2010

Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are thus unsupervised dimensionality reduction techniques. Applications include information retrieval, document classification and collaborative filtering.

Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing.

Comparison

  LSA pLSA
1. Theoretical background Linear Algebra Probabilities and Statistics
2. Objective function Frobenius norm Likelihood function
3. Polysemy No Yes
4. Folding-in Straightforward Complicated

1. LSA stems from Linear Algebra as it is nothing more than a Singular Value Decomposition. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).

2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the Frobenius norm of the difference with the original matrix). Moreover, as it is well known in Machine Learning, the least squares solution corresponds to the Maximum Likelihood solution when experimental errors are gaussian. Therefore, LSA makes an implicit assumption of gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling.

The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.

Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).

3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.

4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected in the lower-dimensional space too. This is called “folding-in”. Clearly, new documents folded-in don’t contribute to learning the factored representation so it is necessary to rebuild the model using all the documents from time to time.

In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm.

Implementation in Python

LSA is straightforward to implement as it is nothing more than a SVD and Numpy’s Linear Algebra module has a function “svd” already. This function has an argument full_matrices which when set to False greatly reduces the time required. This argument doesn’t mean that the SVD is not full, just that the returned matrices don’t contain vectors corresponding to zero singular values. Scipy’s Linear Algebra package unfortunately doesn’t seem to have a sparse SVD. Likewise, there’s no truncated SVD (there exists fast algorithms to directly compute a truncated SVD rather than computing the full SVD then taking the top K singular values).

pLSA’s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.

The implementation is sparse. It works with both Numpy’s ndarrays and Scipy’s sparse matrices.

$ git clone http://www.mblondel.org/code/plsa.git

web interface

Next, I would like to explore Fisher Kernels as there seems to have nice interactions with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it’s more challenging. LDA is a Bayesian extension of pLSA : pLSA is equivalent to LDA under a uniform Dirichlet prior distribution.

The Little Machine Learner

February 18th, 2010

The idea

I’ve been having this idea on my mind for quite some time: wouldn’t it be nice to write a book about Machine Learning where each chapter is a literate program?

From Wikipedia:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

From the PyLit homepage:

The idea is that you do not document programs (after the fact), but write documents that contain the programs.

There are plenty of great textbooks about Machine Learning out there, so the point would not be to write yet another one, but write something different. Here’s what I had been thinking.

  • Each chapter written as a literate program, organized so as to maximize understanding
  • Code in Python (+Numpy + Scipy but without any additional dependencies)
  • Readability over Performance
  • Intuitions, nice figures, useful tips or tricks
  • Real-world applications at the end of each chapter
  • Don’t shy away from the maths, especially if at high-school or undergraduate level…

I bet that quite a few algorithms can be written this way, yet remain very concise!

Except for the maths part, the closest book to this idea that I know of is probably “Programming Collective Intelligence: Building Smart Web 2.0 Applications”, by Toby Segaran.

An example with logistic regression

So, in order to experiment with what such a book could look like, I’ve decided to write a chapter about Logistic Regression. Topics I cover include Maximum Likelihood Estimation, Regularization and Cross-validation. At the end, I use heart disease prediction as an example of real-world application. Probably many things could be improved or added but the point for now is mainly to show what it could look like.

Tools

For the documentation tool, I’ve decided to go for Sphinx, which seems to be emerging as the de-facto documentation tool in the Python community. It has nice features like syntax highlighting, latex support and matplotlib plots support and can output to HTML and PDF.

Normally, in literate programming, there’s the literate source, which uses some kind of markup-language and tools are used to generate either code or documentation from it. I took a different approach. In my case, the source file is the code and the documentation is extracted from the comments in the code. Technically, it’s therefore closer to extensively documented code than actual literate programming. It has some limitations but the main advantages are that the program is runnable directly (since Python is interpreted) and the programmer can benefit from syntax highlighting. I wrote a simple program that converts Python source code to reStructuredText, as necessary for integration in Sphinx.

Interested?

It took quite some time to collect the information and do the actual writing but I feel like I improved my own understanding in the process, so I’m thinking of writing a chapter from time to time. If I do so, at the end of my PhD, I may have gathered enough material to make it a real book! The book could affectionately be entitled “The Little Machine Learner”, hence the title of this post.

Since Machine Learning is a very large field and to write a better book than I could possibly write alone, I’m also thinking that it could actually be a collaborative effort (by researchers, students and practitioners). If you’re interested, please leave a comment. I will create a discussion group if there’s enough interest.

As usual, the source code is available in my git repo:

$ git clone http://www.mblondel.org/code/tlml.git

web interface

Seam Carving in Python

February 9th, 2010

Seam Carving is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper “Seam Carving for Content-Aware Image Resizing“.


Miyako Island, Okinawa, Japan.

The principle is very simple. Find the connected paths of low energy pixels (“the seams”). This can be done efficiently by dynamic programming (see my post on DTW).


Same image in the gradient domain showing the vertical and horizontal seams of lowest cumulated energy.

The seams of lowest cumulated energy can be seen as the pixels contributing the least to an image. By repeatedly removing or adding seams, it is thus possible to perform “content-aware” image reduction or extension. The resulting images feel more natural, less “streched”.


Height reduced by 50% by seam carving.


Height reduced by 50% by traditional rescaling.

Although seam carving doesn’t need human intervention, in the original paper, a graphical user interface (GUI) was also developed to let the user define areas that can’t be removed, or conversely, that must be removed.

In my opinion, seam carving is simple and elegant. No sophisticated object recognition algorithm was used, yet the results are quite impressive.

You can find my implementation in 250 lines of Python in my git repo:

$ git clone http://www.mblondel.org/code/seam-carving.git

web interface

Unfortunately, it’s too slow to be real-time.

Caching computation tasks

January 27th, 2010

When I work on computationally expensive projects (e.g., Machine Learning), I always find myself in the same situation: my programs can be broken down into a chain of tasks, where tasks may depend on the results of other tasks. A typical such chain would be:

preprocessing -> feature-extraction -> training -> evaluation

If I make a modification in my training algorithm and want to re-evaluate it, I do need to re-run the “training” and “evaluation” tasks, but I don’t need and don’t want to re-run the “processing” and “feature-extraction” tasks, especially if they take time to compute.

At first, I tried to save and load task results manually. This quickly proved unmanageable so I started to think of ways to automate this. Since I had quite a precise idea of what I wanted, I’ve decided to write my own tool, at the risk of reinventing the wheel. (I suspect it’s quite hard to come up with a universal tool, though) To keep things simple, I’ve decided to limit the tool’s scope to projects that can be run on a single computer, typically with multi-cores. In particular, it won’t support any kind of distributed computing.
Read the rest of this entry »