## Regularized Least Squares

Recently, I’ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I’ve been receiving good feedback on my Machine Learning posts, I’m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.

### Ridge regression

Given a centered input $\mathbf{x} \in \mathbb{R}^d$, a linear regression model predicts the output by

$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector.

Given a set of training instances $\mathbf{x}_1, \dots, \mathbf{x}_n$ and its associated set of outputs $y_1, \dots, y_n$, the method of least squares finds the weight vector $\mathbf{w}$ which minimizes the sum of squared differences between the predictions and the actual labels

$$\sum_{i=1}^n (\mathbf{w}^T \mathbf{x}_i - y_i)^2.$$

However, minimizing the sum of squared errors on the training set doesn’t necessarily mean that the model will do well on new data. For this reason, Ridge regression adds a regularization term, which controls the complexity of the weight vector. The objective function becomes

$$E(\mathbf{w}) = \sum_{i=1}^n (\mathbf{w}^T \mathbf{x}_i - y_i)^2 + \alpha \|\mathbf{w}\|^2$$

where $\alpha > 0$ is a parameter which controls the regularization strength. Minimizing $E(\mathbf{w})$ is thus a trade-off between minimizing the sum of errors and keeping the weight vector simple (i.e., small). Put differently, by introducing some *bias* (the regularization), we are purposely limiting the freedom of the model and thus reducing the *variance* of the weight vector. Regularization is a simple way to avoid overfitting and often improves the prediction accuracy on new data, especially if the training set is small.

Calculating the gradient of the objective with respect to $\mathbf{w}$, setting it to zero and solving for $\mathbf{w}$, we find that $\mathbf{w}$ has a closed-form solution

$$\mathbf{w} = (X^T X + \alpha I)^{-1} X^T \mathbf{y}$$

where $X$ is the $n \times d$ matrix whose rows are the training instances and $\mathbf{y} = (y_1, \dots, y_n)^T$. Note that the matrix inverse is just notational convenience. In practice, we simply solve the corresponding system of linear equations $(X^T X + \alpha I)\mathbf{w} = X^T \mathbf{y}$. Since $X^T X + \alpha I$ is symmetric positive-definite, one typically first finds the Cholesky decomposition $X^T X + \alpha I = L L^T$. Then, since $L$ is lower triangular, solving the system is simply a matter of applying forward and backward substitution. Another commonly used method is the conjugate gradient.
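A minimal NumPy/SciPy sketch of this recipe (toy data and the `ridge_fit` name are my own, not scikits.learn's API):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def ridge_fit(X, y, alpha):
    """Solve (X^T X + alpha I) w = X^T y via a Cholesky factorization."""
    d = X.shape[1]
    A = X.T @ X + alpha * np.eye(d)      # symmetric positive-definite for alpha > 0
    factor = cho_factor(A)               # A = L L^T
    return cho_solve(factor, X.T @ y)    # forward + backward substitution

# Toy usage: recover a known weight vector from slightly noisy data.
rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.randn(20)
w = ridge_fit(X, y, alpha=1e-3)
```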

Note that not only does regularization improve accuracy, it is often necessary because it makes $X^T X + \alpha I$ invertible when $X^T X$ is not. See Tikhonov regularization.

### Kernel ridge

As per the representer theorem, just like we did for the kernel perceptron, we can replace $\mathbf{w}$ with $\mathbf{w} = X^T \boldsymbol{\beta} = \sum_{i=1}^n \beta_i \mathbf{x}_i$. Substituting into the objective, then deriving with respect to $\boldsymbol{\beta}$ and solving, we find

$$\boldsymbol{\beta} = (K + \alpha I)^{-1} \mathbf{y}$$

where $\boldsymbol{\beta} = (\beta_1, \dots, \beta_n)^T$ and $K = X X^T$ is the kernel matrix, with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. The prediction function becomes

$$f(\mathbf{x}) = \sum_{i=1}^n \beta_i k(\mathbf{x}_i, \mathbf{x})$$

where $k$ is the kernel function. However, in contrast to SVM, $\boldsymbol{\beta}$ is not sparse. This means that the whole training set needs to be used at prediction time.
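A sketch of kernel ridge in NumPy, using a linear kernel for illustration (function names are mine):

```python
import numpy as np

def kernel_ridge_fit(K, y, alpha):
    # beta = (K + alpha I)^{-1} y, computed by solving the linear system.
    return np.linalg.solve(K + alpha * np.eye(K.shape[0]), y)

def kernel_ridge_predict(K_test, beta):
    # K_test[i, j] = k(test instance i, training instance j).  Every training
    # point contributes: beta is dense, unlike the SVM dual coefficients.
    return K_test @ beta

rng = np.random.RandomState(0)
X = rng.randn(15, 2)
y = rng.randn(15)
K = X @ X.T                              # linear kernel matrix
beta = kernel_ridge_fit(K, y, alpha=0.1)
y_fit = kernel_ridge_predict(K, beta)    # predictions on the training set
```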

### Multiple output

So far we have considered the case when there is only one output per instance. We can easily extend to the $M$-output case. To do that, we simply need to solve $(X^T X + \alpha I) W = X^T Y$, where $W$ is $d \times M$ and $Y$ is $n \times M$. Note that the Cholesky decomposition needs to be computed only once. Since it takes $O(d^3)$ time, it is much more efficient to solve $(X^T X + \alpha I) W = X^T Y$ directly than to solve $(X^T X + \alpha I)\mathbf{w} = X^T \mathbf{y}$, $M$ times.
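The single-factorization trick might look like this in SciPy (toy data, my own naming):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.RandomState(0)
n, d, M = 50, 4, 3
X = rng.randn(n, d)
Y = rng.randn(n, M)              # one column of targets per output
alpha = 1.0

A = X.T @ X + alpha * np.eye(d)
factor = cho_factor(A)           # O(d^3) Cholesky, computed once
W = cho_solve(factor, X.T @ Y)   # cheap triangular solves for all M columns
```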

### Classification

M-class classification can easily be achieved by combining multiple-output ridge regression and the 1-of-M coding scheme. This is sometimes referred to by some authors as the Regularized Least Squares Classifier (RLSC). Concretely, we just need to construct an $n \times M$ matrix $Y$ by setting $Y_{ij}$ to $1$ if the $i$-th instance is associated with the $j$-th label and to $-1$ otherwise. Then prediction is just a matter of taking the most confident label for each instance, just like one-vs-all SVM.
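A sketch of the coding scheme and the decision rule (helper names are my own):

```python
import numpy as np

def one_of_m(labels, M):
    # Build the n x M target matrix: +1 for the true class, -1 everywhere else.
    Y = -np.ones((len(labels), M))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def rlsc_predict(X, W):
    # The most confident output wins, as in one-vs-all SVM.
    return np.argmax(X @ W, axis=1)

labels = np.array([0, 2, 1, 0])
Y = one_of_m(labels, M=3)
```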

Note that the square error is sometimes criticized for classification because it penalizes overconfident predictions. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not. However, in practice, RLSC often achieves accuracy on a par with other classifiers such as SVM.

### Efficient leave-one-out cross-validation

Estimating $\alpha$ is a model selection problem, usually performed by k-fold or leave-one-out cross-validation. The naive approach to leave-one-out is to learn $n$ models by holding out one instance at a time. Obviously, this is computationally very expensive. Fortunately, there exist nice computational tricks. Below, I show how to use them in the kernel case, but they can equally be applied to the linear case.

The first trick is to take the eigendecomposition of the kernel matrix

$$K = Q \Lambda Q^T$$

where $Q$ contains the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. Using basic linear algebra, we obtain

$$(K + \alpha I)^{-1} = Q (\Lambda + \alpha I)^{-1} Q^T$$

Since $\Lambda + \alpha I$ is diagonal, it can easily be inverted by taking the reciprocal of its diagonal elements. Therefore, by first paying the price of an eigendecomposition, $(K + \alpha I)^{-1}$ can be computed inexpensively for many different values of $\alpha$.
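A sketch of reusing one eigendecomposition across several values of $\alpha$ (toy data, my own naming):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 3)
K = X @ X.T                      # symmetric PSD kernel matrix
y = rng.randn(10)

evals, Q = np.linalg.eigh(K)     # K = Q diag(evals) Q^T, paid once

def solve_regularized(alpha):
    # (K + alpha I)^{-1} y = Q diag(1 / (evals + alpha)) Q^T y
    # Only O(n^2) per alpha once the eigendecomposition is known.
    return Q @ ((Q.T @ y) / (evals + alpha))

betas = {a: solve_regularized(a) for a in (0.01, 0.1, 1.0)}
```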

Let $\hat{y}_i$ be the prediction value for $\mathbf{x}_i$ when the model was fitted to all training instances but the $i$-th. The second trick, sometimes called generalized cross-validation, will allow us to calculate the $\hat{y}_i$ almost for free.

First, let $\mathbf{y}^i$ be the label vector obtained from $\mathbf{y}$ by replacing its $i$-th element with $\hat{y}_i$, the prediction of the model when $\mathbf{x}_i$ is held out.

When we replace $\mathbf{y}$ by $\mathbf{y}^i$ in the objective, clearly, the $i$-th instance doesn’t affect the solution (as desired), since the held-out model incurs zero loss on it, and we have

$$\hat{y}_i = \left[ K (K + \alpha I)^{-1} \mathbf{y}^i \right]_i$$

However, this is a circular definition, since $\mathbf{y}^i$ itself depends on $\hat{y}_i$. We can calculate the difference with $\left[ K (K + \alpha I)^{-1} \mathbf{y} \right]_i$, which will allow us to factorize $\hat{y}_i$ out.

By definition, the $i$-th elements of $\mathbf{y}^i$ and $\mathbf{y}$ are $\hat{y}_i$ and $y_i$, respectively, and all other elements are equal. Therefore,

$$\hat{y}_i - \left[ K (K + \alpha I)^{-1} \mathbf{y} \right]_i = \left[ K (K + \alpha I)^{-1} (\mathbf{y}^i - \mathbf{y}) \right]_i = \left[ K (K + \alpha I)^{-1} \right]_{ii} (\hat{y}_i - y_i)$$

and finally, by simplifying, we obtain

$$\hat{y}_i = \frac{\left[ K (K + \alpha I)^{-1} \mathbf{y} \right]_i - \left[ K (K + \alpha I)^{-1} \right]_{ii} \, y_i}{1 - \left[ K (K + \alpha I)^{-1} \right]_{ii}}$$
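Putting the formula to work: a sketch that computes all $n$ leave-one-out predictions from a single fit and checks them against the naive refitting loop (toy data, my own naming):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(8, 2)
K = X @ X.T
y = rng.randn(8)
alpha = 1.0
n = len(y)

# Hat matrix H = K (K + alpha I)^{-1}.  Since K and K + alpha I commute,
# solving (K + alpha I) H = K gives the same matrix.
H = np.linalg.solve(K + alpha * np.eye(n), K)
h = np.diag(H)
loo = (H @ y - h * y) / (1.0 - h)        # all LOO predictions at once

# Naive leave-one-out: refit with each instance held out.
naive = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.solve(K[np.ix_(mask, mask)] + alpha * np.eye(n - 1), y[mask])
    naive[i] = K[i, mask] @ beta
```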

Similar computational tricks exist for Gaussian Process Regression as well.

### References

Regularized Least Squares, Ryan Rifkin

### Comments

February 12th, 2011 at 7:19 pm

Nice post (as usual).

Any chance that you can contribute a very condensed version of it to the scikit’s documentation? I am thinking in particular that the GCV should be mentioned in the documentation.

February 13th, 2011 at 4:11 am

Sure, will do that next week!

February 13th, 2011 at 9:05 pm

This is awesome, it really motivates me to write something similar about the Least Angle regression implementation in the scikit.

Cheers,

Fabian.

February 14th, 2011 at 1:12 am

That would be nice! I have much to learn about LARS.

October 31st, 2011 at 8:15 pm

Hi!

Very interesting post. I have one question though: is there any error analysis w.r.t. ridge regression? What happens when, for example, the parameter vector w gets less and less sparse?

October 3rd, 2012 at 7:30 am

Was a great post. I have been struggling with kernel ridge for a while. I have a question regarding the regularizer. As far as I have read, there are two methods for regularization of the kernel matrix. One is adding a small value to the diagonal of the kernel matrix, and the second one is using a truncated eigendecomposition, which means just leaving out the small eigenvalues. But I am kind of stuck in the second case; I actually can’t understand why that works. I know it is related to matrix inversion. Do you have any idea about that?