Regularized Least Squares
Wednesday, February 9th, 2011Recently, I’ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I’m receiving good feedback regarding my posts on Machine Learning, I’m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.
Ridge regression
Given a centered input , a linear regression model predicts the output
by
where is a weight vector.
Given a set of training instances and its associated set of outputs
, the method of least-squares finds the weight vector
which minimizes the sum of square differences between the predictions and the actual labels.
However, minizing the sum of square errors on the training set doesn’t necessarily mean that the model will do well on new data. For this reason, Ridge regression adds a regularization term, which controls the complexity of the weight vector. The objective function becomes
where is a parameter which controls the regularization strength. Minimizing
is thus a trade-off between minimizing the sum of errors and keeping the weight vector simple (i.e., small). Put differently, by introducing some bias (the regularization), we are purposely limiting the freedom of the model and thus reducing the variance of the weight vector. Regularization is a simple way to avoid overfitting and often improves the prediction accuracy on new data, especially if the training set is small.
Calculating the gradient of with respect to
and solving for
, we find that
has a closed-form solution
where . Note that the matrix inverse is just notational convenience. In practice, we simply solve the corresponding system of linear equations
. Since
is symmetric positive-definite, one typically first finds the Cholesky decomposition
. Then since
is lower triangular, solving the system is simply a matter of applying forward and backward substitution. Another commonly used method is the conjugate gradient.
Note that not only does regularization improve accuracy, it is often necessary because it makes invertible when
is not. See Tikhonov regularization.
Kernel ridge
As per the representer theorem, just like we did for kernel perceptron, we can replace with
. Substituting into
then deriving with respect to
and solving for
, we find
where and
is the kernel matrix. The prediction function becomes
where is the kernel function. However, in contrast to SVM,
is not sparse. This means that the whole training set needs to be used at prediction time.
Multiple output
So far we have considered the case when there is only one output per instance. We can easily extend to the
output case. To do that, we simply need to solve
, where
is
and
is
. Note that the Cholesky decomposition needs to be computed only once. Since it takes
time, it is much more efficient to solve
directly than to solve
,
times.
Classification
M-class classification can easily be achieved by combining multiple-output ridge regression and the 1-of-M coding scheme. This is sometimes referred by some authors as Regularized Least Square Classifier (RLSC). Concretely, we just need to construct a matrix by setting
to
if the
instance is associated with the
label and to
otherwise. Then prediction is just a matter of taking the most confident label for each instance, just like one-vs-all SVM.
Note that the square error is sometimes criticized for classification because it penalizes overconfident predictions. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not. However, in practice, RLSC often achieves accuracy on a par with other classifiers such as SVM.
Efficient leave-one-out cross-validation
Estimating is a model selection problem, usually performed by k-fold or leave-one-out cross-validation. The naive approach to leave-one-out is to learn
models by holding out one instance at a time. Obviously, this is computationally very expensive. Fortunately, there exist nice computational tricks. Below, I show how to use them in the kernel case but they can equally be applied to the normal case.
The first trick is to take the eigendecomposition
Using basic linear algebra, we obtain
Since is diagonal, it can easily be inverted by taking its reciprocal. Therefore by first paying the price of an eigendecomposition,
can thus be inverted inexpensively for many different
.
Let be the prediction value when the model was fitted to all training instances but
. The second trick, sometimes called generalized cross-validation, will allow us to calculate the
almost for free.
First, let be the label vector of the model when
is held out.
When we replace by
in
, clearly,
doesn’t affect the solution (as desired) and we have
However, this is a circular definition, since itself depends on
. We can calculate the difference with
, which will allow us to factorize
.
By definition, the element of
and
are
and
, respectively. Therefore,
and finally, by simplifying, we obtain
Similar computational tricks exist for Gaussian Process Regression as well.
References
Regularized Least Squares, Ryan Rifkin

















































