## Regularized Least Squares

Recently, I’ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I’m receiving good feedback regarding my posts on Machine Learning, I’m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.

### Ridge regression

Given a centered input $\mathbf{x}^* \in \mathbf{R}^D$, a linear regression model predicts the output $y^* \in \mathbf{R}$ by

$y^* = f(\mathbf{x}^*) = \mathbf{x}^* \cdot \mathbf{w}$

where $\mathbf{w} \in \mathbf{R}^D$ is a weight vector.

Given a set of training instances $X = (\mathbf{x}_1,\dots,\mathbf{x}_N)^T$ and its associated set of outputs $\mathbf{y} = (y_1,\dots,y_N)^T$, the method of least-squares finds the weight vector $\mathbf{w}$ which minimizes the sum of square differences between the predictions and the actual labels.

$L_{LS}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2$

However, minimizing the sum of square errors on the training set doesn’t necessarily mean that the model will do well on new data. For this reason, Ridge regression adds a regularization term, which controls the complexity of the weight vector. The objective function becomes

$L_{Ridge}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 + \lambda \|\mathbf{w}\|^2$

where $\lambda$ is a parameter which controls the regularization strength. Minimizing $L_{Ridge}$ is thus a trade-off between minimizing the sum of errors and keeping the weight vector simple (i.e., small). Put differently, by introducing some bias (the regularization), we are purposely limiting the freedom of the model and thus reducing the variance of the weight vector. Regularization is a simple way to avoid overfitting and often improves the prediction accuracy on new data, especially if the training set is small.

Setting the gradient of $L_{Ridge}$ with respect to $\mathbf{w}$ to zero and solving for $\mathbf{w}$, we find that $\mathbf{w}$ has the closed-form solution

$\mathbf{w} = H^{-1} X^T \mathbf{y}$

where $H = (X^T X + \lambda I)$. Note that the matrix inverse is just a notational convenience. In practice, we simply solve the corresponding system of linear equations $H\mathbf{w}= X^T \mathbf{y}$. Since $H$ is symmetric positive-definite, one typically first computes the Cholesky decomposition $H=LL^T$. Since $L$ is lower triangular, solving the system then reduces to a forward substitution followed by a backward substitution. Another commonly used method is conjugate gradient.
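As a quick illustration (a sketch on made-up data, not the scikits.learn implementation), the Cholesky approach looks like this with NumPy and SciPy:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y via a Cholesky factorization."""
    D = X.shape[1]
    H = X.T @ X + lam * np.eye(D)
    factor = cho_factor(H)               # H = L L^T
    return cho_solve(factor, X.T @ y)    # forward + backward substitution

# Toy data: recover a known weight vector from noiseless outputs.
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
w_true = np.arange(1.0, 6.0)
y = X @ w_true
w = ridge_fit(X, y, lam=1e-6)            # tiny lambda: close to plain LS
```

With such a small $\lambda$ and noiseless labels, the recovered `w` is essentially `w_true`; larger values of $\lambda$ shrink it towards zero.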

Note that not only does regularization improve accuracy, it is often necessary because it makes $H$ invertible when $X^T X$ is not. See Tikhonov regularization.
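A tiny sketch of this effect: with two perfectly correlated columns, $X^T X$ is rank-deficient, but adding $\lambda I$ restores full rank. The data here is made up purely for illustration:

```python
import numpy as np

# Second column is exactly twice the first, so X^T X is singular.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
lam = 0.1

XtX = X.T @ X
H = XtX + lam * np.eye(2)   # regularization makes H invertible
```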

### Kernel ridge

As per the representer theorem, just like we did for the kernel perceptron, we can replace $\mathbf{w}$ with $X^T\boldsymbol{\alpha}$. Substituting into $L_{Ridge}$, then differentiating with respect to $\boldsymbol{\alpha}$ and solving for $\boldsymbol{\alpha}$, we find

$\boldsymbol{\alpha} = G^{-1} \mathbf{y}$

where $G =(K + \lambda I)$ and $K$ is the kernel matrix. The prediction function becomes

$y^* = f(\mathbf{x}^*) = \sum_{i=1}^N \alpha_i k(\mathbf{x}^*,\mathbf{x}_i)$

where $k$ is the kernel function. However, in contrast to SVM, $\boldsymbol{\alpha}$ is not sparse. This means that the whole training set needs to be used at prediction time.
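To make this concrete, here is a small kernel ridge sketch on synthetic data; the Gaussian kernel and all the numbers are illustrative assumptions, not a prescription:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.RandomState(0)
X = rng.randn(30, 2)                  # training inputs
y = np.sin(X[:, 0]) + X[:, 1]         # training outputs
lam = 1e-2

K = rbf_kernel(X, X)                  # kernel (Gram) matrix
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y)   # alpha = G^{-1} y

# Predicting requires kernel evaluations against ALL training points.
X_new = rng.randn(5, 2)
y_new = rbf_kernel(X_new, X) @ alpha  # sum_i alpha_i k(x*, x_i)
```

Note how `alpha` has one entry per training instance and none of them are zero in general, which is exactly the non-sparsity issue mentioned above.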

### Multiple output

So far we have considered the case when there is only one output $y$ per instance. We can easily extend to the $M$ output case. To do that, we simply need to solve $HW= X^T Y$, where $W$ is $D \times M$ and $Y$ is $N \times M$. Note that the Cholesky decomposition needs to be computed only once. Since it takes $O(D^3)$ time, it is much more efficient to solve $HW=X^T Y$ directly than to solve $H\mathbf{w}=X^T \mathbf{y}$, $M$ times.
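A sketch of the multiple-output case on random data, reusing one factorization for all $M$ right-hand sides:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.RandomState(0)
N, D, M = 40, 6, 3
X = rng.randn(N, D)                 # toy inputs
Y = rng.randn(N, M)                 # M outputs per instance
lam = 0.1

H = X.T @ X + lam * np.eye(D)
factor = cho_factor(H)              # one O(D^3) factorization...
W = cho_solve(factor, X.T @ Y)      # ...shared by all M columns (W is D x M)
```

Each column of `W` equals the single-output solution for the corresponding column of `Y`, but the expensive factorization is paid only once.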

### Classification

M-class classification can easily be achieved by combining multiple-output ridge regression with the 1-of-M coding scheme. This is sometimes referred to as the Regularized Least Squares Classifier (RLSC). Concretely, we just need to construct a matrix $Y$ by setting $Y_{ij}$ to $1$ if the $i^{th}$ instance is associated with the $j^{th}$ label and to $-1$ otherwise. Prediction is then just a matter of taking the most confident label for each instance, just like in one-vs-all SVM.
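A minimal RLSC sketch, assuming two well-separated Gaussian blobs as toy data (the helper names are mine, not scikits.learn's):

```python
import numpy as np

def one_of_m(labels, M):
    """Encode integer labels as an N x M matrix of +1 / -1 targets."""
    Y = -np.ones((len(labels), M))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def rlsc_fit_predict(X, labels, M, lam=1.0):
    """Multiple-output ridge on 1-of-M targets; predict the argmax column."""
    D = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(D),
                        X.T @ one_of_m(labels, M))
    return (X @ W).argmax(axis=1)    # most confident label per instance

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 5, rng.randn(20, 2) - 5])
labels = np.array([0] * 20 + [1] * 20)
pred = rlsc_fit_predict(X, labels, 2, lam=0.1)
```

On such trivially separable data, the training predictions match the labels exactly.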

Note that the square error is sometimes criticized for classification because it penalizes overconfident predictions. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not. However, in practice, RLSC often achieves accuracy on a par with other classifiers such as SVM.

### Efficient leave-one-out cross-validation

Estimating $\lambda$ is a model selection problem, usually addressed by k-fold or leave-one-out cross-validation. The naive approach to leave-one-out is to learn $N$ models, holding out one instance at a time. Obviously, this is computationally very expensive. Fortunately, there exist nice computational tricks. Below, I show how to use them in the kernel case, but they can equally be applied to the linear case.

The first trick is to take the eigendecomposition

$K=Q \Lambda Q^T.$

Using basic linear algebra, we obtain

$G^{-1}=Q(\Lambda + \lambda I)^{-1} Q^T.$

Since $(\Lambda + \lambda I)$ is diagonal, it can easily be inverted by taking the reciprocal of its diagonal entries. By first paying the price of an eigendecomposition, $G$ can thus be inverted inexpensively for many different values of $\lambda$.
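A sketch of this trick, with a random positive semi-definite matrix standing in for $K$:

```python
import numpy as np

rng = np.random.RandomState(0)
N = 20
A = rng.randn(N, N)
K = A @ A.T                        # stand-in symmetric PSD kernel matrix
y = rng.randn(N)

# Pay for one eigendecomposition of K...
evals, Q = np.linalg.eigh(K)       # K = Q diag(evals) Q^T

# ...then alpha = (K + lam I)^{-1} y costs only O(N^2) per lambda:
# invert the diagonal by taking reciprocals, no new factorization needed.
def alpha_for(lam):
    return Q @ ((Q.T @ y) / (evals + lam))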

Let $f_i(\mathbf{x}_i)$ be the prediction value when the model was fitted to all training instances but $\mathbf{x}_i$. The second trick, sometimes called generalized cross-validation, will allow us to calculate the $f_i(\mathbf{x}_i)$ almost for free.

First, let $\mathbf{y}^i$ be the label vector with the $i^{th}$ entry replaced by the held-out prediction $f_i(\mathbf{x}_i)$.

$\mathbf{y}^i = (y_1,\dots,f_i(\mathbf{x}_i),\dots,y_N)^T$

When we replace $\mathbf{y}$ by $\mathbf{y}^i$ in $L_{Ridge}$, the $i^{th}$ error term vanishes at the held-out solution, so $\mathbf{x}_i$ doesn’t affect the minimizer (as desired) and we have

$\boldsymbol{\alpha}^i = G^{-1} \mathbf{y}^i$

$f_i(\mathbf{x}_i) = (K \boldsymbol{\alpha}^i)_i.$

However, this is a circular definition, since $\mathbf{y}^i$ itself depends on $f_i(\mathbf{x}_i)$. Subtracting $f(\mathbf{x}_i)$, the prediction of the model fitted on the full training set, will allow us to isolate $f_i(\mathbf{x}_i)$.

$f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1} \mathbf{y}^i)_i - (KG^{-1}\mathbf{y})_i$

By definition, $\mathbf{y}^i$ and $\mathbf{y}$ differ only in their $i^{th}$ elements, which are $f_i(\mathbf{x}_i)$ and $y_i$, respectively. Therefore,

$f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1})_{ii} (f_i(\mathbf{x}_i) - y_i)$

and finally, by simplifying, we obtain

$f_i(\mathbf{x}_i) = \frac{f(\mathbf{x}_i) -(K G^{-1})_{ii} y_i}{1 - (K G^{-1})_{ii}}.$
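To convince ourselves, we can compare this closed form against the naive approach of actually refitting without one instance. Again, a random positive semi-definite matrix stands in for $K$:

```python
import numpy as np

rng = np.random.RandomState(0)
N = 15
A = rng.randn(N, N)
K = A @ A.T                                   # stand-in PSD kernel matrix
y = rng.randn(N)
lam = 1.0

P = K @ np.linalg.inv(K + lam * np.eye(N))    # maps labels to predictions: f = P y
f = P @ y
d = np.diag(P)                                # the (K G^{-1})_{ii} terms
loo = (f - d * y) / (1.0 - d)                 # closed-form LOO predictions

# Naive LOO for instance 0: refit on the other N-1 instances,
# then predict on the held-out one.
rest = np.arange(1, N)
alpha = np.linalg.solve(K[np.ix_(rest, rest)] + lam * np.eye(N - 1), y[rest])
pred0 = K[0, rest] @ alpha
```

The two agree up to floating-point error, but the closed form gives all $N$ held-out predictions from one fit.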

Similar computational tricks exist for Gaussian Process Regression as well.

### References

Regularized Least Squares, Ryan Rifkin

### 6 Responses to “Regularized Least Squares”

1. Gael Varoquaux Says:

Nice post (as usual).

Any chance that you can contribute a very condensed version of it to the scikit’s documentation? I am thinking in particular that the GCV should be mentioned in the documentation.

2. Mathieu Says:

Sure, will do that next week!

3. Fabian Pedregosa Says:

This is awesome, it really motivates me to write something similar about the Least Angle regression implementation in the scikit.

Cheers,

Fabian.

4. Mathieu Says:

That would be nice! I have much to learn about LARS.

5. Stratis Says:

Hi!

Very interesting post. I have one question though: is there any error analysis for ridge regression? What happens when, for example, the parameter vector w gets less and less sparse?

6. Mohsen Says:

Great post. I have been struggling with kernel ridge for a while, and I have a question regarding the regularizer. As far as I have read, there are two methods for regularizing the kernel matrix: one is adding a small value to the diagonal of the kernel matrix, and the second is using a truncated eigendecomposition, which means leaving out the small eigenvalues. I'm stuck on the second case: I can't understand why it works. I know it is somehow related to matrix inversion. Do you have any idea about that?