Kernel Perceptron in Python

The Perceptron (Rosenblatt, 1957) is one of the oldest and simplest Machine Learning algorithms. It’s also trivial to kernelize, which makes it an ideal candidate to gain insights on kernel methods.


The Perceptron predicts the class of an input \mathbf{x} \in \mathcal{X} with the function

f(\mathbf{x}) = sign(\mathbf{w}\cdot\phi(\mathbf{x}))

where sign(y) = +1 if y > 0 and -1 if y < 0, \phi is a feature-space transformation and \mathbf{w} is a feature weight vector. If \mathbf{x} is already a feature vector and a projection to another space is not needed, then \phi(\mathbf{x})=\mathbf{x}.

Given a data set comprising n training examples \mathbf{x}_1,\dots,\mathbf{x}_n and their corresponding labels y_1,\dots,y_n, where y_n \in \{-1,+1\}, the Perceptron makes a prediction for each (\mathbf{x}_i, y_i) using the current estimate of \mathbf{w}. When the prediction is correct (equal to the label y_i), the algorithm jumps to the next example. When the prediction is incorrect, in order to correct for the mistake, if y_i = +1 then \mathbf{w} is incremented by \phi(\mathbf{x}_i), otherwise it is decremented by \phi(\mathbf{x}_i). Since y_i \in \{-1,+1\}, the update rule is thus:

\mathbf{w} = \mathbf{w} + y_i \phi(\mathbf{x}_i)

Figure 1: The decision boundary (hyperplane), to which \mathbf{w} is normal, partitions the vector space into two sets, one for each class. An update changes the direction of the decision boundary to correct for the incorrect prediction. (From this document)

Figure 2: Decision boundary in the linear case.

Kernel Perceptron

From the update rule above, we clearly see that, if \mathbf{w} is initialized to the zero vector, it is a linear combination of the training examples.

\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)

Injecting \mathbf{w} into the prediction function f(\mathbf{x}), we get

f(\mathbf{x}) = sign(\sum_i \alpha_i y_i \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})) = sign(\sum_i \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}))

where K(\mathbf{x}_i,\mathbf{x}) = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}) is a Mercer kernel.

The update rule for when a mistake is made predicting \mathbf{x}_i now simply becomes

\alpha_i = \alpha_i + 1

i.e., \alpha_i is the number of times a mistake has been made with respect to \mathbf{x}_i.

A few remarks:

  • Instead of learning a weight vector \mathbf{w} with respect to features, we’re now learning a weight vector \boldsymbol{\alpha} with respect to examples.
  • To predict the class of an input \mathbf{x}, we need to store the training examples \mathbf{x}_i. Kernel methods are memory-based methods, like k-NN. However, we only need to store examples for which a mistake has been made, i.e. {\alpha}_i \neq 0 . In the context of Support Vector Machines (SVMs), these are called support vectors. SVMs, however, not only store examples for which a mistake has been made, they also store examples that lie inside the margin, i.e. y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i)) \le 1. (See Figure 4 from my post on SVMs)
  • In the online learning setting, the number of support vectors can grow and grow as more mistakes are made. The Forgetron is an extension of the Kernel Perceptron which can learn with a “memory budget”. When the budget is exceeded, some support vectors are “forgotten”.
  • For some kinds of objects like sequences, trees and graphs, it might be difficult to map objects to feature vectors while it is easy to come up with a similarity measure K(\mathbf{x}_i,\mathbf{x}_j) between any two objects \mathbf{x}_i and \mathbf{x}_j. In this case, kernel methods are very useful, since they can be used “as is”.

For a brief and intuitive introduction to kernel classifiers, I highly recommend these slides, by Mark Johnson.

Figure 3: Decision boundary when using a gaussian kernel. Green dots indicate support vectors.

The voted and average Perceptrons are two straightforward extensions to the Perceptron, which for some applications are competitive with SVMs. For details, see “Large Margin Classification Using the Perceptron Algorithm” by Freund and Schapire.


3 Responses to “Kernel Perceptron in Python”

  1. Daniel Says:

    Hi Mathieu,

    love your Blog =)
    The link to Johnsons slides does not work. There is an empty space between the http:// and the IP adress in the link.


  2. Mathieu Says:

    Fixed, thanks!

  3. Mathieu's log » Blog Archive » Regularized Least Squares Says:

    [...] Mathieu's log Machine Learning, Data Mining, Natural Language Processing… « Kernel Perceptron in Python [...]