Support Vector Machines in Python
Support Vector Machines (SVMs) are state-of-the-art classifiers in many applications and have become ubiquitous thanks to the wealth of open-source libraries implementing them. However, you learn a lot more by actually doing than by just reading, so let's play a little bit with SVMs in Python! To make it easier to read, we will use the same notation as in Christopher Bishop's book "Pattern Recognition and Machine Learning".
In an SVM, the class of an input vector $\mathbf{x}$ can be decided by evaluating the sign of

$$y(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b \qquad (7.1)$$

If $y(\mathbf{x}) \geq 0$ we assign $\mathbf{x}$ to class $+1$, and if $y(\mathbf{x}) < 0$ we assign it to class $-1$. Here $\phi(\mathbf{x})$ is a feature-space transformation, which can map $\mathbf{x}$ to a space of higher, possibly infinite, dimension.
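This decision rule can be sketched in a few lines. The sketch below assumes the linear case where $\phi$ is the identity, and the weight vector and bias values are purely illustrative (in practice they come out of the training procedure described next):

```python
import numpy as np

def decision(x, w, b):
    """Assign x to class +1 if y(x) >= 0, else to class -1.

    phi is taken to be the identity here (linear case); w and b
    are assumed to have been learned already.
    """
    y = np.dot(w, x) + b
    return 1 if y >= 0 else -1

# Illustrative weight vector and bias (not learned from data):
w = np.array([1.0, -1.0])
b = 0.5
print(decision(np.array([2.0, 1.0]), w, b))   # w.x + b = 1.5  -> +1
print(decision(np.array([0.0, 2.0]), w, b))   # w.x + b = -1.5 -> -1
```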
Given a data set comprising input vectors $\mathbf{x}_1, \dots, \mathbf{x}_N$ and their corresponding labels $t_1, \dots, t_N$, where $t_n \in \{-1, +1\}$, we would like to find $\mathbf{w}$ and $b$ such that they explain the training data: $y(\mathbf{x}_n) > 0$ when $t_n = +1$ and $y(\mathbf{x}_n) < 0$ when $t_n = -1$. This can be rewritten as a single constraint:

$$t_n \, y(\mathbf{x}_n) = t_n \left( \mathbf{w}^\top \phi(\mathbf{x}_n) + b \right) \geq 1, \qquad n = 1, \dots, N \qquad (7.5)$$
In addition, $\mathbf{w}$ and $b$ are chosen so that the distance between the decision boundary (a line in the 2-d case, a plane in the 3-d case, a hyperplane in the n-d case) and the closest points to it is maximized. This distance is called the margin, hence the name maximum margin classifier. Geometrically, the margin is found to be $1 / \|\mathbf{w}\|$, so the maximum margin problem can equivalently be expressed as the minimization problem:

$$\arg\min_{\mathbf{w}, b} \; \frac{1}{2} \|\mathbf{w}\|^2 \qquad (7.6)$$
subject to constraint (7.5).
Figure 1: The linearly separable case. The decision line is the plain line and the margin is the gap between the two dotted lines. Only 3 support vectors (green dots) out of 180 training examples are necessary.
Since this is a constrained optimization problem, we can introduce Lagrange multipliers $a_n \geq 0$ (one per training example), differentiate the Lagrangian function with respect to $\mathbf{w}$ and $b$, and inject the solutions back into the Lagrangian function (equations (7.7) to (7.9) in Bishop's book), so that it depends only on the Lagrange multipliers. Doing so, we find that the maximum margin equivalently emerges from the maximization of

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, k(\mathbf{x}_n, \mathbf{x}_m) \qquad (7.10)$$

subject to the constraints

$$a_n \geq 0, \qquad n = 1, \dots, N \qquad (7.11)$$
$$\sum_{n=1}^{N} a_n t_n = 0 \qquad (7.12)$$
This is the so-called dual representation and is a quadratic programming (QP) problem. $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$ is called the kernel function.
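Some common kernel functions can be written as plain Python functions. The parameter defaults below (`p` for the polynomial degree, `sigma` for the Gaussian width) are illustrative choices:

```python
import numpy as np

def linear_kernel(x1, x2):
    # k(x, x') = x . x'  (phi is the identity)
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, p=3):
    # k(x, x') = (1 + x . x')^p
    return (1 + np.dot(x1, x2)) ** p

def gaussian_kernel(x1, x2, sigma=5.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
    # whose implicit phi maps to an infinite-dimensional space
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))
```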
Similarly to the objective function, $y(\mathbf{x})$ can also be re-expressed solely in terms of the Lagrange multipliers:

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n \, k(\mathbf{x}, \mathbf{x}_n) + b \qquad (7.13)$$
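Equation (7.13) translates directly into code. In this sketch the argument names are our own: `a` holds the Lagrange multipliers, `sv_y` and `sv_x` the labels and inputs they correspond to:

```python
import numpy as np

def project(x, a, sv_y, sv_x, b, kernel):
    # y(x) = sum_n a_n t_n k(x, x_n) + b; the sum can run over the
    # support vectors only, since the other a_n are zero anyway
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a, sv_y, sv_x)) + b

def predict(x, a, sv_y, sv_x, b, kernel):
    # The predicted class is the sign of y(x)
    return np.sign(project(x, a, sv_y, sv_x, b, kernel))
```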
The important thing to notice here is that we've gone from a sum over dimensions (the dot product $\mathbf{w}^\top \phi(\mathbf{x})$ in equation (7.1)) to a sum over points. This may seem like a disadvantage, as the number of training examples $N$ is usually bigger than the number of dimensions $M$. However, this is very useful and is called the kernel trick: it allows us to use an SVM, originally a linear classifier, to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, without ever explicitly computing $\phi(\mathbf{x})$.
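To make the kernel trick concrete, take the homogeneous quadratic kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$ in two dimensions. Its feature map can be written out explicitly as $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and both routes give the same number, with the kernel never forming $\phi$:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous quadratic kernel in 2-d
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def quadratic_kernel(x, z):
    # k(x, z) = (x . z)^2 -- computed without ever building phi
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(quadratic_kernel(x, z))    # 121.0
print(np.dot(phi(x), phi(z)))    # 121.0 -- same value, no explicit mapping
```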
In many situations, only a small proportion of the Lagrange multipliers $a_n$ will be non-zero, so we only need to store the corresponding training examples $\mathbf{x}_n$. These are called the support vectors, and this is why SVMs are sometimes called sparse kernel machines.
That being said, in the linear case, i.e. when $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$, it is faster to directly compute $y(\mathbf{x})$ from equation (7.1). $\mathbf{w}$ and $b$ can be computed in terms of the Lagrange multipliers by equations (7.8) and (7.18) in Bishop's book.
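A minimal sketch of that recovery in the linear case. The array names are our own: `a` holds the non-zero multipliers, `sv_y` and `sv_x` the matching support-vector labels and inputs; $b$ is averaged over the support vectors as in (7.18), which in the linear case reduces to the mean of $t_n - \mathbf{w}^\top \mathbf{x}_n$:

```python
import numpy as np

def compute_w_b(a, sv_y, sv_x):
    """Recover w (eq. 7.8) and b (eq. 7.18) in the linear case.

    a    : non-zero Lagrange multipliers of the support vectors
    sv_y : labels t_n of the support vectors (+1 or -1)
    sv_x : the support vectors themselves, one per row
    """
    # w = sum_n a_n t_n x_n  (phi is the identity)
    w = np.sum((a * sv_y)[:, None] * sv_x, axis=0)
    # b averaged over the support vectors: mean of t_n - w . x_n
    b = np.mean(sv_y - sv_x @ w)
    return w, b
```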
Figure 2: The non-linearly separable case. Example of a Gaussian kernel with parameter sigma = 5.0. Perfect prediction is achieved on the 20 held-out data points.
We want to find the Lagrange multipliers maximizing equation (7.10) subject to the constraints (7.11) and (7.12). This can be done by a standard QP solver such as cvxopt, which solves problems of the form:
$$\min_{\mathbf{x}} \; \frac{1}{2} \mathbf{x}^\top P \mathbf{x} + \mathbf{q}^\top \mathbf{x}$$

subject to

$$G \mathbf{x} \preceq \mathbf{h} \quad \text{(inequality constraint)}$$
$$A \mathbf{x} = \mathbf{b} \quad \text{(equality constraint)}$$
The unknown is $\mathbf{x}$, which in our case corresponds to the Lagrange multipliers $\mathbf{a}$. We just need to rework the formulation a little bit to use matrix notation and to turn the maximization into a minimization (hence the $-1$ multiplicative factors).
```python
import numpy as np
import cvxopt

# Gram matrix
K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
    for j in range(n_samples):
        K[i, j] = self.kernel(X[i], X[j])

P = cvxopt.matrix(np.outer(y, y) * K)
q = cvxopt.matrix(np.ones(n_samples) * -1)
A = cvxopt.matrix(y, (1, n_samples))
b = cvxopt.matrix(0.0)
G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
h = cvxopt.matrix(np.zeros(n_samples))

# Solve QP problem
solution = cvxopt.solvers.qp(P, q, G, h, A, b)

# Lagrange multipliers
a = np.ravel(solution['x'])
```
Note here that $P$ is an $N \times N$ matrix. Thus, a standard QP solver can't be used for a large number of training examples, as $P$ needs to be stored in memory. There exist a number of algorithms that decompose the original QP problem into smaller QP problems targeting only a few training examples at a time. One such algorithm is Sequential Minimal Optimization (SMO). One advantage of SMO is that the smaller QP problems can be solved analytically, so SMO doesn't even need a QP solver.
The problem with the formulation we have used thus far is that it doesn't allow for misclassification of the training examples. This can lead to poor generalization if there is overlap between the distributions of the two classes. To solve this problem, we can rework constraint (7.5) as

$$t_n \, y(\mathbf{x}_n) \geq 1 - \xi_n, \qquad n = 1, \dots, N \qquad (7.20)$$
The $\xi_n \geq 0$ are called slack variables and are introduced to allow the misclassification of some examples. If $\xi_n = 0$, the corresponding training example is correctly classified. If $0 < \xi_n \leq 1$, the training example lies inside the margin but is still on the correct side of the decision boundary. If $\xi_n > 1$, the training example is misclassified. Equation (7.6) then becomes

$$C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|\mathbf{w}\|^2 \qquad (7.21)$$
$C > 0$ is the parameter which controls the trade-off between the slack variable penalty and the margin. Again, we can introduce Lagrange multipliers, differentiate the Lagrangian function with respect to $\mathbf{w}$, $b$ and $\xi_n$, and inject the solutions back into the Lagrangian function (equations (7.22) to (7.31) in Bishop's book). We obtain

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, k(\mathbf{x}_n, \mathbf{x}_m) \qquad (7.32)$$

which is identical to the hard margin case! The constraints become:

$$0 \leq a_n \leq C \qquad (7.33)$$
$$\sum_{n=1}^{N} a_n t_n = 0 \qquad (7.34)$$
Interestingly, the slack variables have vanished, and the only difference with the hard margin case is that the inequality constraint now has an upper bound, $C$.
The attentive reader will have noticed that the inequality constraints in cvxopt have an upper bound but no lower bound.
The trick is to rewrite constraint (7.33) as a system of inequalities in matrix notation. Example with 2 training examples:

$$G = \begin{pmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \mathbf{h} = \begin{pmatrix} 0 \\ 0 \\ C \\ C \end{pmatrix}$$

so that $G \mathbf{a} \preceq \mathbf{h}$ encodes both $-a_n \leq 0$ (i.e. $a_n \geq 0$) and $a_n \leq C$.
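In code, this amounts to stacking the negated identity (lower bounds) on top of the identity (upper bounds). A sketch with plain NumPy arrays, which would then be wrapped in `cvxopt.matrix` before calling the solver:

```python
import numpy as np

n_samples = 2
C = 0.1

# Lower bounds: -a_n <= 0  (i.e. a_n >= 0)
G_lower = np.diag(np.ones(n_samples) * -1)
h_lower = np.zeros(n_samples)

# Upper bounds: a_n <= C
G_upper = np.identity(n_samples)
h_upper = np.ones(n_samples) * C

# Stack both into a single inequality G a <= h
G = np.vstack((G_lower, G_upper))
h = np.hstack((h_lower, h_upper))
```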
Figure 3: The hard margin case. 180 vectors out of 180 are support vectors! And the classifier achieves only 11 correct predictions out of 20 on held-out data.
Figure 4: The soft margin case (C=0.1). 36 vectors out of 180 are support vectors. The classifier achieves 19 correct predictions out of 20!
Pattern Recognition and Machine Learning, Christopher Bishop, 2006.