Support Vector Machines in Python

Support Vector Machines (SVM) are state-of-the-art classifiers in many applications and have become ubiquitous thanks to the wealth of open-source libraries implementing them. However, you learn a lot more by actually doing than by just reading, so let’s play a little bit with SVM in Python! To keep things easy to follow, we will use the same notation as in Christopher Bishop’s book “Pattern Recognition and Machine Learning”.

Maximum margin

In SVM, the class of an input vector \mathbf{x} can be decided by evaluating the sign of y(\mathbf{x}).

(7.1)~y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})+b

If y(\mathbf{x}) > 0 we assign \mathbf{x} to class +1 and if y(\mathbf{x}) < 0, we assign it to class -1. Here \phi(\mathbf{x}) is a feature-space transformation, which can map \mathbf{x} to a space of higher, possibly infinite, dimensions.
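As a concrete illustration, in the linear case (\phi(\mathbf{x})=\mathbf{x}) the decision rule is just the sign of a dot product plus a bias. The snippet below is a minimal sketch; the function name predict and the array shapes are my own choices, not something from the post’s code.

import numpy as np

def predict(X, w, b):
    # Assign each row of X to class +1 or -1 according to the sign of y(x) = w^T x + b.
    return np.sign(np.dot(X, w) + b)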

Given a data set comprising N input vectors \mathbf{x}_1,\dots,\mathbf{x}_N and their corresponding labels t_1,\dots,t_N, where t_n \in \{-1,+1\}, we would like to find \mathbf{w} and b such that they correctly classify the training data: y(\mathbf{x}_n) \ge 1 when t_n=+1 and y(\mathbf{x}_n) \le -1 when t_n=-1. These two conditions can be rewritten as a single constraint:

(7.5)~t_n (\mathbf{w}^T\phi(\mathbf{x}_n)+b) \ge 1,~n=1,\dots,N

In addition, \mathbf{w} and b are chosen so that the distance between the decision boundary \mathbf{w}^T\phi(\mathbf{x})+b=0 (a line in the 2-d case, a plane in the 3-d case, a hyperplane in the n-d case) and the closest points to it is maximized. This distance is called the margin, hence the name maximum margin classifier. Geometrically, the distance from the closest points to the boundary is found to be 1/\|\mathbf{w}\|, so maximizing the margin is equivalent to the minimization problem:

(7.6)~\arg\min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2

subject to constraint (7.5).

Figure 1: The linearly separable case. The decision line is the plain line and the margin is the gap between the two dotted lines. Only 3 support vectors (green dots) out of 180 training examples are necessary.

Dual representation

Since this is a constrained optimization problem, we can introduce Lagrange multipliers a_n \ge 0 (one per training example), differentiate the Lagrangian function with respect to \mathbf{w} and b, and substitute the solutions back into the Lagrangian (equations (7.7) to (7.9) in Bishop’s book), so that it depends only on the Lagrange multipliers.
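Setting these derivatives to zero gives the two conditions

(7.8)~\mathbf{w} = \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n)

(7.9)~0 = \sum_{n=1}^N a_n t_n

Substituting them back, we find that the maximum margin solution equivalently emerges from the maximization of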

(7.10)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)

subject to the constraints

(7.11)~a_n \ge 0,~n=1,\dots,N


(7.12)~\sum_{n=1}^Na_nt_n = 0

This is the so-called dual representation and is a quadratic programming (QP) problem. k(\mathbf{x}_n,\mathbf{x}_m)= \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m) is called the kernel function.

Similarly to the objective function, y(\mathbf{x}) can also be re-expressed in terms of the Lagrange multipliers and the kernel function.

(7.13)~y(\mathbf{x})= \sum_{n=1}^N a_n t_n k(\mathbf{x},\mathbf{x}_n) + b

The important thing to notice here is that we’ve gone from a sum over M dimensions (the dot product in equation (7.1)) to a sum over N points. This may seem like a disadvantage, as the number of training examples N is usually bigger than the number of dimensions M. However, this is very useful and is called the kernel trick: it allows us to use SVM, originally a linear classifier, to solve a non-linear problem by mapping the original observations into a higher-dimensional space, without ever explicitly computing \phi(\mathbf{x}).
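For illustration, here are a few common kernels written in Python; the Gaussian kernel with sigma=5.0 is the one used in Figure 2. The function names and signatures below are mine, not necessarily those of the linked gist.

import numpy as np

def linear_kernel(x1, x2):
    # k(x1, x2) = x1^T x2, i.e. phi(x) = x
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, degree=3):
    # k(x1, x2) = (1 + x1^T x2)^degree
    return (1 + np.dot(x1, x2)) ** degree

def gaussian_kernel(x1, x2, sigma=5.0):
    # k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)), whose phi is infinite-dimensional
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))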

In many situations, only a small proportion of the Lagrange multipliers a_n will be non-zero; we therefore only need to store the corresponding training examples \mathbf{x}_n. These are called the support vectors and this is why SVMs are sometimes called sparse kernel machines.
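In code, evaluating (7.13) over the support vectors only could look like the following sketch; a, sv_t and sv_x are assumed to hold the non-zero multipliers, their labels and the corresponding training inputs, and kernel is any kernel function such as the ones above.

def project(x, a, sv_t, sv_x, b, kernel):
    # y(x) = sum_n a_n t_n k(x, x_n) + b, restricted to the support vectors
    s = b
    for a_n, t_n, x_n in zip(a, sv_t, sv_x):
        s += a_n * t_n * kernel(x, x_n)
    return s

The predicted class of \mathbf{x} is then the sign of project(x, a, sv_t, sv_x, b, kernel).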

That being said, in the linear case, i.e. when \phi(\mathbf{x})=\mathbf{x}, it is faster to directly compute y(\mathbf{x}) from equation (7.1). \mathbf{w} and b can be computed in terms of the Lagrange multipliers by equations (7.8) and (7.18) in Bishop’s book.
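In the linear case, equation (7.8) gives \mathbf{w} explicitly as a weighted sum of the support vectors, so prediction reduces to a single dot product. A sketch, reusing the (assumed) variable names above:

import numpy as np

def linear_weights(a, sv_t, sv_x):
    # w = sum_n a_n t_n x_n, summed over the support vectors only (Bishop's (7.8) with phi(x) = x)
    w = np.zeros(sv_x.shape[1])
    for a_n, t_n, x_n in zip(a, sv_t, sv_x):
        w += a_n * t_n * x_n
    return w

Prediction is then the sign of np.dot(x, w) + b, as in equation (7.1).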

Figure 2: The non-linearly separable case. Example of a Gaussian kernel with parameter sigma=5.0. Perfect prediction is achieved on the 20 held-out data points.

QP solver

We want to find the Lagrange multipliers a_n maximizing equation (7.10) subject to the constraints (7.11) and (7.12). This can be done with a standard QP solver such as cvxopt, which solves problems of the form:

Minimize

\frac{1}{2} \mathbf{x}^T P\mathbf{x} + \mathbf{q}^T \mathbf{x}

subject to

G\mathbf{x} \leq \mathbf h (inequality constraint)


A\mathbf{x} = \mathbf b (equality constraint)

The unknown is \mathbf{x}, which in our case corresponds to the Lagrange multipliers \mathbf{a}=(a_1,\dots,a_N). We just need to rework the formulation a little bit: write it in matrix notation and turn the maximization into a minimization (hence the -1 multiplicative factors).

# Imports needed by this snippet
import numpy as np
import cvxopt

# X: (n_samples, n_features) array of training inputs, y: array of labels
# in {-1.0, +1.0} stored as floats, self.kernel: the chosen kernel function.

# Gram matrix
K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
    for j in range(n_samples):
        K[i,j] = self.kernel(X[i], X[j])
 
# Quadratic and linear terms of (7.10): P_nm = t_n t_m k(x_n, x_m), q_n = -1
P = cvxopt.matrix(np.outer(y,y) * K)
q = cvxopt.matrix(np.ones(n_samples) * -1)
# Equality constraint (7.12): sum_n a_n t_n = 0
A = cvxopt.matrix(y, (1,n_samples))
b = cvxopt.matrix(0.0)
# Inequality constraint (7.11): -a_n <= 0
G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
h = cvxopt.matrix(np.zeros(n_samples))
 
# Solve QP problem
solution = cvxopt.solvers.qp(P, q, G, h, A, b)
 
# Lagrange multipliers
a = np.ravel(solution['x'])
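Once the multipliers are available, only those above a small numerical threshold are kept as support vectors, and the bias can be computed by averaging Bishop’s equation (7.18) over them. The following is a sketch continuing from the code above, not necessarily the exact code of the linked gist.

# Support vectors have non-zero (in practice: not numerically tiny) multipliers
sv = a > 1e-5
ind = np.arange(len(a))[sv]
a_sv, t_sv, x_sv = a[sv], y[sv], X[sv]

# bias = average over support vectors of t_n - sum_m a_m t_m k(x_n, x_m)  (7.18)
# (named bias here to avoid clashing with the cvxopt b above; it is b in (7.1))
bias = 0.0
for i in range(len(a_sv)):
    bias += t_sv[i] - np.sum(a_sv * t_sv * K[ind[i], sv])
bias /= len(a_sv)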

Note here that P is an N \times N matrix. Thus, a standard QP solver can’t be used for a large number of training examples, as P needs to be stored in memory. A number of algorithms exist to decompose the original QP problem into smaller QP problems that target only a few training samples at a time. One such algorithm is Sequential Minimal Optimization (SMO). One advantage of SMO is that the smaller QP problems can be solved analytically, so SMO doesn’t even need a QP solver.

Soft margin

The problem with the formulation we have used thus far is that it doesn’t allow for misclassification of the training examples. This can lead to poor generalization if there is overlap between the distributions of the two classes. To solve this problem, we can rework constraint (7.5) as

(7.20)~t_n y(\mathbf{x}_n) \ge 1 - \xi_n,~n=1,\dots,N


\xi_n are called slack variables and are introduced to allow the misclassification of some examples. If \xi_n=0, the corresponding training example is correctly classified and lies on or outside the margin. If 0 < \xi_n \le 1, the training example lies inside the margin but is still on the correct side of the decision boundary. If \xi_n > 1, the training example is misclassified. Equation (7.6) then becomes

(7.21)~C \sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|^2

C > 0 is the parameter which controls the trade-off between the slack variable penalty and the margin. Again, we can introduce Lagrange multipliers, differentiate the Lagrangian with respect to \mathbf{w}, b and \xi_n, and substitute the solutions back into the Lagrangian (equations (7.22) to (7.31) in Bishop’s book).

(7.32)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)

This is identical to the hard margin case! The constraints become:

(7.33)~0 \le a_n \le C


(7.34)~\sum_{n=1}^N a_n t_n = 0

Interestingly, the slack variables \xi_n have vanished and the only difference with the hard margin is that the inequality constraint now has an upper bound, C.

The attentive reader will have noticed that the inequality constraints in cvxopt have an upper bound but no lower bound.

G\mathbf{x} \leq \mathbf h

The trick is to rewrite constraint (7.33) as a system of inequalities, in matrix notation. Example with 2 training examples:

\begin{pmatrix}-1 & 0 \\ 0 & -1 \\ 1 & 0  \\ 0 & 1\end{pmatrix}\begin{pmatrix}a_1\\ a_2\end{pmatrix} \le \begin{pmatrix}0 \\ 0 \\ C \\ C\end{pmatrix}
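In code, this amounts to stacking the two blocks of the inequality. The following is a sketch consistent with the earlier QP code; C is the soft-margin parameter.

G_std = np.diag(np.ones(n_samples) * -1)   # -a_n <= 0
h_std = np.zeros(n_samples)
G_slack = np.identity(n_samples)           #  a_n <= C
h_slack = np.ones(n_samples) * C
G = cvxopt.matrix(np.vstack((G_std, G_slack)))
h = cvxopt.matrix(np.hstack((h_std, h_slack)))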

Figure 3: The hard margin case. 180 vectors out of 180 are support vectors! And the classifier only achieves 11 correct predictions out of 20, on held-out data.

Figure 4: The soft margin case (C=0.1). 36 vectors out of 180 are support vectors. The classifier achieves 19 correct predictions out of 20!

Source

http://gist.github.com/586753

References

Pattern Recognition and Machine Learning, Christopher Bishop, 2006.

Soft Margin SVM (ソフトマージンSVM), 人工知能に関する断想録 blog.
