Regularized Least Squares
Recently, I’ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I’m receiving good feedback regarding my posts on Machine Learning, I’m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.
Given a centered input $\mathbf{x} \in \mathbb{R}^d$, a linear regression model predicts the output by

$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector.
Given a set of training instances $\mathbf{x}_1, \dots, \mathbf{x}_n$ and its associated set of outputs $y_1, \dots, y_n$, the method of least squares finds the weight vector $\mathbf{w}$ which minimizes the sum of squared differences between the predictions and the actual labels:

$$E(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^T \mathbf{x}_i - y_i)^2$$
However, minimizing the sum of squared errors on the training set doesn’t necessarily mean that the model will do well on new data. For this reason, Ridge regression adds a regularization term, which controls the complexity of the weight vector. The objective function becomes

$$E(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^T \mathbf{x}_i - y_i)^2 + \alpha \|\mathbf{w}\|^2$$
where $\alpha > 0$ is a parameter which controls the regularization strength. Minimizing $E(\mathbf{w})$ is thus a trade-off between minimizing the sum of squared errors and keeping the weight vector simple (i.e., small). Put differently, by introducing some bias (the regularization), we are purposely limiting the freedom of the model and thus reducing the variance of the weight vector. Regularization is a simple way to avoid overfitting and often improves the prediction accuracy on new data, especially if the training set is small.
Calculating the gradient of $E(\mathbf{w})$ with respect to $\mathbf{w}$ and solving for $\mathbf{w}$, we find that $\mathbf{w}$ has a closed-form solution

$$\mathbf{w} = (X^T X + \alpha I)^{-1} X^T \mathbf{y}$$
where $X$ is the $n \times d$ matrix whose rows are the training instances and $\mathbf{y} = (y_1, \dots, y_n)^T$. Note that the matrix inverse is just notational convenience. In practice, we simply solve the corresponding system of linear equations $(X^T X + \alpha I)\,\mathbf{w} = X^T \mathbf{y}$. Since $X^T X + \alpha I$ is symmetric positive-definite, one typically first finds the Cholesky decomposition $X^T X + \alpha I = L L^T$. Then, since $L$ is lower triangular, solving the system is simply a matter of applying forward and backward substitution. Another commonly used method is the conjugate gradient method.
Note that not only does regularization improve accuracy, it is often necessary because it makes $X^T X + \alpha I$ invertible when $X^T X$ is not. See Tikhonov regularization.
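The Cholesky route described above can be sketched in a few lines of NumPy/SciPy. The function name `ridge_fit` is hypothetical (this is not the scikits.learn API), and it assumes the inputs are already centered:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def ridge_fit(X, y, alpha=1.0):
    """Solve (X^T X + alpha I) w = X^T y via a Cholesky factorization.

    Hypothetical helper for illustration, not scikits.learn's API.
    """
    n_features = X.shape[1]
    G = X.T @ X + alpha * np.eye(n_features)  # symmetric positive-definite
    c, low = cho_factor(G)                    # G = L L^T
    return cho_solve((c, low), X.T @ y)       # forward + backward substitution

# Usage: recover the true weights on noiseless synthetic data
# (alpha is kept tiny so the solution is essentially unregularized).
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
w_true = np.array([1.0, -2.0, 0.5])
w = ridge_fit(X, X @ w_true, alpha=1e-8)
```

Only the factorization changes with the data; the same factor can be reused to solve for many right-hand sides, which is exactly what the multiple-output case below exploits.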
As per the representer theorem, just like we did for the kernel perceptron, we can replace $\mathbf{w}$ with $\sum_{i=1}^n \beta_i \mathbf{x}_i = X^T \boldsymbol{\beta}$. Substituting into $E$, then deriving with respect to $\boldsymbol{\beta}$ and solving for $\boldsymbol{\beta}$, we find

$$\boldsymbol{\beta} = (K + \alpha I)^{-1} \mathbf{y}$$
where $\boldsymbol{\beta} = (\beta_1, \dots, \beta_n)^T$ and $K = X X^T$ is the kernel matrix. The prediction function becomes

$$f(\mathbf{x}) = \sum_{i=1}^n \beta_i \, k(\mathbf{x}_i, \mathbf{x})$$
where $k$ is the kernel function, i.e., $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. However, in contrast to the SVM, $\boldsymbol{\beta}$ is not sparse. This means that the whole training set needs to be used at prediction time.
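A minimal sketch of the dual (kernel) solution, using a linear kernel so it can be checked against the primal closed form. The helper names are hypothetical, not a real library API:

```python
import numpy as np

def kernel_ridge_fit(K, y, alpha=1.0):
    # Dual coefficients: beta = (K + alpha I)^{-1} y
    return np.linalg.solve(K + alpha * np.eye(K.shape[0]), y)

def kernel_ridge_predict(K_test, beta):
    # f(x_j) = sum_i beta_i k(x_i, x_j); K_test[j, i] = k(x_i, x_j)
    return K_test @ beta

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.sin(X[:, 0])
K = X @ X.T                         # linear kernel: k(a, b) = a^T b
beta = kernel_ridge_fit(K, y, alpha=0.1)
y_train_pred = kernel_ridge_predict(K, beta)

# With a linear kernel, the dual solution must agree with the primal one,
# since w = X^T beta and (X^T X + aI)^{-1} X^T = X^T (X X^T + aI)^{-1}.
w_primal = np.linalg.solve(X.T @ X + 0.1 * np.eye(2), X.T @ y)
```

Swapping in an RBF or polynomial kernel only changes how `K` is built; the solve is identical.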
So far we have considered the case when there is only one output per instance. We can easily extend to the $m$-output case. To do that, we simply need to solve $(X^T X + \alpha I)\,W = X^T Y$, where $Y$ is $n \times m$ and $W$ is $d \times m$. Note that the Cholesky decomposition needs to be computed only once. Since it takes $O(d^3)$ time, it is much more efficient to solve $(X^T X + \alpha I)\,W = X^T Y$ directly than to solve $(X^T X + \alpha I)\,\mathbf{w} = X^T \mathbf{y}$, $m$ times.
$M$-class classification can easily be achieved by combining multiple-output ridge regression and the 1-of-$M$ coding scheme. This is sometimes referred to by some authors as the Regularized Least Squares Classifier (RLSC). Concretely, we just need to construct an $n \times M$ matrix $Y$ by setting $Y_{ij}$ to $1$ if the $i$-th instance is associated with the $j$-th label and to $-1$ otherwise. Prediction is then just a matter of taking the most confident label for each instance, just like one-vs-all SVM.
Note that the square error is sometimes criticized for classification because it penalizes overconfident predictions. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not. However, in practice, RLSC often achieves accuracy on a par with other classifiers such as SVM.
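The 1-of-$M$ scheme can be sketched as follows. The function names and the two-blob toy data are made up for illustration, and a plain dense solve stands in for the Cholesky factorization:

```python
import numpy as np

def rlsc_fit(X, labels, n_classes, alpha=1.0):
    """Multi-output ridge with 1-of-M targets (hypothetical sketch)."""
    n, d = X.shape
    Y = -np.ones((n, n_classes))
    Y[np.arange(n), labels] = 1.0        # +1 for the true class, -1 otherwise
    G = X.T @ X + alpha * np.eye(d)
    return np.linalg.solve(G, X.T @ Y)   # one solve for all M outputs

def rlsc_predict(X, W):
    return np.argmax(X @ W, axis=1)      # most confident label wins

# Usage: two well-separated Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) + 4, rng.randn(30, 2) - 4])
labels = np.array([0] * 30 + [1] * 30)
W = rlsc_fit(X, labels, n_classes=2, alpha=1.0)
pred = rlsc_predict(X, W)
```

Because the targets enter only through the right-hand side $X^T Y$, adding classes costs almost nothing once the factorization is done.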
Efficient leave-one-out cross-validation
Estimating $\alpha$ is a model selection problem, usually performed by $k$-fold or leave-one-out cross-validation. The naive approach to leave-one-out is to learn $n$ models by holding out one instance at a time. Obviously, this is computationally very expensive. Fortunately, there exist nice computational tricks. Below, I show how to use them in the kernel case, but they can equally be applied to the primal case.
The first trick is to take the eigendecomposition of the kernel matrix

$$K = Q \Lambda Q^T$$

where $\Lambda$ is the diagonal matrix of eigenvalues and $Q$ the orthogonal matrix of eigenvectors.
Using basic linear algebra, we obtain

$$(K + \alpha I)^{-1} = Q (\Lambda + \alpha I)^{-1} Q^T.$$
Since $\Lambda + \alpha I$ is diagonal, it can easily be inverted by taking the reciprocal of its diagonal elements. Therefore, by first paying the price of an eigendecomposition, $K + \alpha I$ can be inverted inexpensively for many different values of $\alpha$.
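The trick amounts to one `eigh` call followed by cheap diagonal rescalings, one per candidate $\alpha$. A minimal sketch on synthetic data (a linear kernel here, purely for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 3)
y = rng.randn(30)
K = X @ X.T                        # kernel matrix (symmetric PSD)

# Pay once for the eigendecomposition: K = Q diag(evals) Q^T
evals, Q = np.linalg.eigh(K)

# ...then invert (K + alpha I) cheaply for every candidate alpha.
for alpha in (0.01, 0.1, 1.0, 10.0):
    # (Q / (evals + alpha)) scales column j of Q by 1/(lambda_j + alpha),
    # giving Q (Lambda + alpha I)^{-1} Q^T without any further solve.
    inv = (Q / (evals + alpha)) @ Q.T
    beta = inv @ y                 # dual coefficients for this alpha
```

Each iteration is $O(n^2)$ instead of the $O(n^3)$ a fresh solve would cost.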
Let $\hat{y}^i$ be the prediction for the $i$-th instance when the model was fitted to all training instances but the $i$-th. The second trick, sometimes called generalized cross-validation, will allow us to calculate the $\hat{y}^i$ almost for free.
First, let $\mathbf{y}^i = (y_1, \dots, y_{i-1}, \hat{y}^i, y_{i+1}, \dots, y_n)^T$ be the label vector $\mathbf{y}$ with its $i$-th element replaced by the prediction of the model when the $i$-th instance is held out.
When we replace $\mathbf{y}$ by $\mathbf{y}^i$ in $E$, clearly, the $i$-th instance doesn’t affect the solution (as desired): the held-out model already predicts $\hat{y}^i$ at $\mathbf{x}_i$ exactly, so that term contributes zero error. We thus have

$$\hat{y}^i = \left( K (K + \alpha I)^{-1} \mathbf{y}^i \right)_i.$$
However, this is a circular definition, since $\mathbf{y}^i$ itself depends on $\hat{y}^i$. We can calculate the difference with $\hat{y}_i = \left( K (K + \alpha I)^{-1} \mathbf{y} \right)_i$, the prediction of the model fitted to the whole training set, which will allow us to factorize $\hat{y}^i$:

$$\hat{y}^i - \hat{y}_i = \left( K (K + \alpha I)^{-1} (\mathbf{y}^i - \mathbf{y}) \right)_i.$$
By definition, the $i$-th elements of $\mathbf{y}$ and $\mathbf{y}^i$ are $y_i$ and $\hat{y}^i$, respectively. Therefore,

$$\hat{y}^i - \hat{y}_i = \left( K (K + \alpha I)^{-1} \right)_{ii} \, (\hat{y}^i - y_i)$$
and finally, by simplifying, we obtain

$$\hat{y}^i = \frac{\hat{y}_i - H_{ii} \, y_i}{1 - H_{ii}}, \qquad \text{where } H = K (K + \alpha I)^{-1}.$$
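The formula can be sanity-checked against the naive leave-one-out loop it replaces. This is a self-contained sketch on random data (using an explicit matrix inverse for clarity, not efficiency):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 2)
y = rng.randn(20)
alpha = 0.5
K = X @ X.T
n = K.shape[0]

H = K @ np.linalg.inv(K + alpha * np.eye(n))   # "hat" matrix
y_hat = H @ y                                  # predictions of the full model
# Closed-form leave-one-out predictions, all n at once:
loo = (y_hat - np.diag(H) * y) / (1 - np.diag(H))

# Naive leave-one-out for comparison: refit n times on n-1 points.
naive = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.solve(K[np.ix_(mask, mask)] + alpha * np.eye(n - 1),
                           y[mask])
    naive[i] = K[i, mask] @ beta
```

Combined with the eigendecomposition trick, the diagonal of $H$ is cheap to recompute for each candidate $\alpha$, so the full leave-one-out error curve costs little more than a single fit.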
Similar computational tricks exist for Gaussian Process Regression as well.
Reference: Regularized Least Squares, Ryan Rifkin.