<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mathieu&#039;s log</title>
	<atom:link href="http://www.mblondel.org/journal/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mblondel.org/journal</link>
	<description>Machine Learning, Data Mining, Natural Language Processing…</description>
	<lastBuildDate>Sun, 12 May 2013 12:52:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<item>
		<title>Large-scale sparse multiclass classification</title>
		<link>http://www.mblondel.org/journal/2013/05/12/large-scale-sparse-multiclass-classification/</link>
		<comments>http://www.mblondel.org/journal/2013/05/12/large-scale-sparse-multiclass-classification/#comments</comments>
		<pubDate>Sun, 12 May 2013 12:52:56 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=334</guid>
		<description><![CDATA[I&#8217;m thrilled to announce that my paper &#8220;Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classiﬁcation&#8221; (published in the Machine Learning journal) is now online: PDF, BibTeX [*]. Abstract Over the past decade, l1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., l1/l2) regularization has [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m thrilled to announce that my paper &#8220;Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classiﬁcation&#8221; (published in the Machine Learning journal) is now online: <a href="http://www.mblondel.org/publications/mblondel-mlj2013.pdf">PDF</a>, <a href="http://www.mblondel.org/publications/bib/mblondel-mlj2013.txt">BibTeX</a> [*].</p>
<h3>Abstract</h3>
<p>Over the past decade, l1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., l1/l2) regularization has been utilized as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs l1/l2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun, 2009) and the other without (Richtárik and Takáč, 2012). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a couple of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably to other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms l1/l2- regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.
</p></blockquote>
<h3>Code</h3>
<p>The code of the proposed multiclass method is available in my Python/Cython machine learning library, <a href="https://github.com/mblondel/lightning">lightning</a>. Below is an example of how to use it on the News20 dataset.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> sklearn.<span style="color: black;">datasets</span> <span style="color: #ff7700;font-weight:bold;">import</span> fetch_20newsgroups_vectorized
<span style="color: #ff7700;font-weight:bold;">from</span> lightning.<span style="color: black;">primal_cd</span> <span style="color: #ff7700;font-weight:bold;">import</span> CDClassifier
&nbsp;
bunch = fetch_20newsgroups_vectorized<span style="color: black;">&#40;</span>subset=<span style="color: #483d8b;">&quot;all&quot;</span><span style="color: black;">&#41;</span>
X = bunch.<span style="color: black;">data</span>
y = bunch.<span style="color: black;">target</span>
&nbsp;
clf = CDClassifier<span style="color: black;">&#40;</span>penalty=<span style="color: #483d8b;">&quot;l1/l2&quot;</span>,
                   loss=<span style="color: #483d8b;">&quot;squared_hinge&quot;</span>,
                   multiclass=<span style="color: #008000;">True</span>,
                   max_iter=<span style="color: #ff4500;">20</span>,
                   alpha=1e-4,
                   C=<span style="color: #ff4500;">1.0</span> / X.<span style="color: black;">shape</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>,
                   tol=1e-3<span style="color: black;">&#41;</span>
clf.<span style="color: black;">fit</span><span style="color: black;">&#40;</span>X, y<span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># accuracy</span>
<span style="color: #ff7700;font-weight:bold;">print</span> clf.<span style="color: black;">score</span><span style="color: black;">&#40;</span>X, y<span style="color: black;">&#41;</span> 
<span style="color: #808080; font-style: italic;"># percentage of selected features</span>
<span style="color: #ff7700;font-weight:bold;">print</span> clf.<span style="color: black;">n_nonzero</span><span style="color: black;">&#40;</span>percentage=<span style="color: #008000;">True</span><span style="color: black;">&#41;</span></pre></div></div>

<p>To use the variant without line search (as presented in the paper), add the max_steps=0 option to CDClassifier.</p>
<h3>Data</h3>
<p>I also released the <a href="http://www.mblondel.org/data/">Amazon7</a> dataset used in the paper. It contains 1,362,109 reviews of Amazon products. Each review may belong to one of 7 categories (apparel, book, dvd, electronics, kitchen &#038; housewares, music, video) and is represented as a 262,144-dimensional vector. It is, to my knowledge, one of the largest publically available multiclass classification dataset.</p>
<p>[*] The final publication is available <a href="http://link.springer.com/article/10.1007%2Fs10994-013-5367-2">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2013/05/12/large-scale-sparse-multiclass-classification/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Transparent system-wide proxy</title>
		<link>http://www.mblondel.org/journal/2011/06/27/transparent-system-wide-proxy/</link>
		<comments>http://www.mblondel.org/journal/2011/06/27/transparent-system-wide-proxy/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 23:18:29 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Sysadmin]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=172</guid>
		<description><![CDATA[Proxies can be a powerful way to enforce anonymity or to bypass various kinds of restrictions on Internet (government censorship, regional contents, &#8230;). In this post, I&#8217;ll describe a simple technique to create a transparent proxy at the system level. It&#8217;s especially useful in cases when you want to make sure that all connections make [...]]]></description>
			<content:encoded><![CDATA[<p>Proxies can be a powerful way to enforce anonymity or to bypass various kinds of restrictions on Internet (government censorship, regional contents, &#8230;). In this post, I&#8217;ll describe a simple technique to create a transparent proxy at the system level. It&#8217;s especially useful in cases when you want to make sure that all connections make it through the proxy or when your application of interest doesn&#8217;t have proxy support. I don&#8217;t think this technique is that well-known, hence this post.<br />
<span id="more-172"></span></p>
<h3>1. Create SOCKS proxy with OpenSSH</h3>
<p>If you have a user account on a remote machine, the simplest way to create a proxy is to use the following command.</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">$ ssh -N -D 1080 username@serverhost</pre></div></div>

<p>It creates a <a href="http://en.wikipedia.org/wiki/SOCKS">SOCKS</a> proxy listening to port 1080 on your local machine. The main advantage of the SOCKS protocol is that it can route connections from any port between the client and the server.</p>
<h3>2. Forward connection transparently with iptables or ipfw</h3>
<p>Most modern browsers can let the user define a SOCKS proxy in their advanced networking preference section. Yet, many applications may not support the SOCKS protocol at all. The solution is to use system tools such as iptables (Linux) or ipfw (FreeBSD, OS X) to enforce the routing at the system level. For example, on OS X, I use the following command to redirect port 80 (HTTP).</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">$ sudo ipfw add 100 fwd 127.0.0.1,12345 dst-port 80</pre></div></div>

<h3>3. Redirect connections to SOCKS proxy with redsocks</h3>
<p>iptables and ipfw don&#8217;t have built-in support for the SOCKS protocol. It is thus necessary to use an additional program to transform the connections on the fly. This is where <a href="https://github.com/darkk/redsocks">redsocks</a> comes in.</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">$ redsocks -c config_file</pre></div></div>

<p>In the configuration file, you need to configure redsocks to listen to port 12345 and to redirect to port 1080. &#8220;generic&#8221; can be used as the redirector option. The github repository includes a sample configuration file. </p>
<p>And voila! We now have configured the system to transparently route connections for us. To summarize, here is the big picture:</p>

<div class="wp_syntax"><div class="code"><pre class="shell" style="font-family:monospace;">local machine -&gt; redsocks -&gt; SOCKS proxy -&gt; target server</pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2011/06/27/transparent-system-wide-proxy/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Regularized Least Squares</title>
		<link>http://www.mblondel.org/journal/2011/02/09/regularized-least-squares/</link>
		<comments>http://www.mblondel.org/journal/2011/02/09/regularized-least-squares/#comments</comments>
		<pubDate>Wed, 09 Feb 2011 14:20:28 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=135</guid>
		<description><![CDATA[Recently, I&#8217;ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I&#8217;m receiving good feedback regarding my posts on Machine Learning, I&#8217;m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression. Ridge regression Given a centered [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I&#8217;ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in <a href="http://scikit-learn.sourceforge.net/">scikits.learn</a>. Since I&#8217;m receiving good feedback regarding my posts on Machine Learning, I&#8217;m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.<br />
<span id="more-135"></span></p>
<h3>Ridge regression</h3>
<p>Given a centered input <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D%5E%2A%20%5Cin%20%5Cmathbf%7BR%7D%5ED&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}^* \in \mathbf{R}^D' title='\mathbf{x}^* \in \mathbf{R}^D' class='latex' />, a linear regression model predicts the output <img src='http://s.wordpress.com/latex.php?latex=y%5E%2A%20%5Cin%20%5Cmathbf%7BR%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y^* \in \mathbf{R}' title='y^* \in \mathbf{R}' class='latex' /> by</p>
<img src='http://s.wordpress.com/latex.php?latex=y%5E%2A%20%3D%20f%28%5Cmathbf%7Bx%7D%5E%2A%29%20%3D%20%5Cmathbf%7Bx%7D%5E%2A%20%5Ccdot%20%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y^* = f(\mathbf{x}^*) = \mathbf{x}^* \cdot \mathbf{w}' title='y^* = f(\mathbf{x}^*) = \mathbf{x}^* \cdot \mathbf{w}' class='latex' />
<p>where <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D%20%5Cin%20%5Cmathbf%7BR%7D%5ED&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w} \in \mathbf{R}^D' title='\mathbf{w} \in \mathbf{R}^D' class='latex' /> is a weight vector.</p>
<p>Given a set of training instances <img src='http://s.wordpress.com/latex.php?latex=X%20%3D%20%28%5Cmathbf%7Bx%7D_1%2C%5Cdots%2C%5Cmathbf%7Bx%7D_N%29%5ET&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X = (\mathbf{x}_1,\dots,\mathbf{x}_N)^T' title='X = (\mathbf{x}_1,\dots,\mathbf{x}_N)^T' class='latex' /> and its associated set of outputs <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D%20%3D%20%28y_1%2C%5Cdots%2Cy_N%29%5ET&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y} = (y_1,\dots,y_N)^T' title='\mathbf{y} = (y_1,\dots,y_N)^T' class='latex' />, the method of least-squares finds the weight vector <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> which minimizes the sum of square differences between the predictions and the actual labels. </p>
<img src='http://s.wordpress.com/latex.php?latex=L_%7BLS%7D%28%5Cmathbf%7Bw%7D%29%20%3D%20%5Csum_%7Bi%3D1%7D%5EN%20%28y_i%20-%20f%28%5Cmathbf%7Bx%7D_i%29%29%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='L_{LS}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2' title='L_{LS}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2' class='latex' />
<p>However, minizing the sum of square errors on the training set doesn&#8217;t necessarily mean that the model will do well on new data. For this reason, Ridge regression adds a regularization term, which controls the complexity of the weight vector. The objective function becomes </p>
<img src='http://s.wordpress.com/latex.php?latex=L_%7BRidge%7D%28%5Cmathbf%7Bw%7D%29%20%3D%20%5Csum_%7Bi%3D1%7D%5EN%20%28y_i%20-%20f%28%5Cmathbf%7Bx%7D_i%29%29%5E2%20%2B%20%5Cfrac%7B%5Clambda%7D%7B2%7D%20%5C%7C%5Cmathbf%7Bw%7D%5C%7C%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='L_{Ridge}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2' title='L_{Ridge}(\mathbf{w}) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2' class='latex' />
<p>where <img src='http://s.wordpress.com/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\lambda' title='\lambda' class='latex' /> is a parameter which controls the regularization strength. Minimizing <img src='http://s.wordpress.com/latex.php?latex=L_%7BRidge%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='L_{Ridge}' title='L_{Ridge}' class='latex' /> is thus a trade-off between minimizing the sum of errors and keeping the weight vector simple (i.e., small). Put differently, by introducing some <em>bias</em> (the regularization), we are purposely limiting the freedom of the model and thus reducing the <em>variance</em> of the weight vector.  Regularization is a simple way to avoid overfitting and often improves the prediction accuracy on new data, especially if the training set is small.</p>
<p>Calculating the gradient of <img src='http://s.wordpress.com/latex.php?latex=L_%7BRidge%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='L_{Ridge}' title='L_{Ridge}' class='latex' /> with respect to <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> and solving for <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' />, we find that <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> has a closed-form solution</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D%20%3D%20H%5E%7B-1%7D%20X%5ET%20%5Cmathbf%7By%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w} = H^{-1} X^T \mathbf{y}' title='\mathbf{w} = H^{-1} X^T \mathbf{y}' class='latex' />
<p>where <img src='http://s.wordpress.com/latex.php?latex=H%20%3D%20%28X%5ET%20X%20%2B%20%5Clambda%20I%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='H = (X^T X + \lambda I)' title='H = (X^T X + \lambda I)' class='latex' />. Note that the matrix inverse is just notational convenience. In practice, we simply solve the corresponding system of linear equations <img src='http://s.wordpress.com/latex.php?latex=H%5Cmathbf%7Bw%7D%3D%20X%5ET%20%5Cmathbf%7By%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='H\mathbf{w}= X^T \mathbf{y}' title='H\mathbf{w}= X^T \mathbf{y}' class='latex' />. Since <img src='http://s.wordpress.com/latex.php?latex=H&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='H' title='H' class='latex' /> is symmetric positive-definite, one typically first finds the <a href="http://en.wikipedia.org/wiki/Cholesky_decomposition">Cholesky decomposition</a> <img src='http://s.wordpress.com/latex.php?latex=H%3DLL%5ET&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='H=LL^T' title='H=LL^T' class='latex' />. Then since <img src='http://s.wordpress.com/latex.php?latex=L&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='L' title='L' class='latex' /> is lower triangular, solving the system is simply a matter of applying <a href="http://en.wikipedia.org/wiki/Triangular_matrix#Forward_and_back_substitution">forward and backward substitution</a>. Another commonly used method is the <a href="http://en.wikipedia.org/wiki/Conjugate_gradient_method">conjugate gradient</a>.</p>
<p>Note that not only does regularization improve accuracy, it is often necessary because it makes <img src='http://s.wordpress.com/latex.php?latex=H&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='H' title='H' class='latex' /> invertible when <img src='http://s.wordpress.com/latex.php?latex=X%5ET%20X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X^T X' title='X^T X' class='latex' /> is not. See <a href="http://en.wikipedia.org/wiki/Tikhonov_regularization">Tikhonov regularization</a>.</p>
<h3>Kernel ridge</h3>
<p>As per the representer theorem, just like we did for <a href="http://www.mblondel.org/journal/2010/10/31/kernel-perceptron-in-python/">kernel perceptron</a>, we can replace <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> with <img src='http://s.wordpress.com/latex.php?latex=X%5ET%5Cboldsymbol%7B%5Calpha%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X^T\boldsymbol{\alpha}' title='X^T\boldsymbol{\alpha}' class='latex' />. Substituting into <img src='http://s.wordpress.com/latex.php?latex=L_%7BRidge%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='L_{Ridge}' title='L_{Ridge}' class='latex' /> then deriving with respect to <img src='http://s.wordpress.com/latex.php?latex=%5Cboldsymbol%7B%5Calpha%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\boldsymbol{\alpha}' title='\boldsymbol{\alpha}' class='latex' /> and solving for <img src='http://s.wordpress.com/latex.php?latex=%5Cboldsymbol%7B%5Calpha%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\boldsymbol{\alpha}' title='\boldsymbol{\alpha}' class='latex' />, we find</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cboldsymbol%7B%5Calpha%7D%20%3D%20G%5E%7B-1%7D%20%5Cmathbf%7By%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\boldsymbol{\alpha} = G^{-1} \mathbf{y}' title='\boldsymbol{\alpha} = G^{-1} \mathbf{y}' class='latex' />
<p>where <img src='http://s.wordpress.com/latex.php?latex=G%20%3D%28K%20%2B%20%5Clambda%20I%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G =(K + \lambda I)' title='G =(K + \lambda I)' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=K&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='K' title='K' class='latex' /> is the kernel matrix. The prediction function becomes</p>
<img src='http://s.wordpress.com/latex.php?latex=y%5E%2A%20%3D%20f%28%5Cmathbf%7Bx%7D%5E%2A%29%20%3D%20%5Csum_%7Bi%3D1%7D%5EN%20%5Calpha_i%20k%28%5Cmathbf%7Bx%7D%5E%2A%2C%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y^* = f(\mathbf{x}^*) = \sum_{i=1}^N \alpha_i k(\mathbf{x}^*,\mathbf{x}_i)' title='y^* = f(\mathbf{x}^*) = \sum_{i=1}^N \alpha_i k(\mathbf{x}^*,\mathbf{x}_i)' class='latex' />
<p>where <img src='http://s.wordpress.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> is the kernel function. However, in contrast to SVM, <img src='http://s.wordpress.com/latex.php?latex=%5Cboldsymbol%7B%5Calpha%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\boldsymbol{\alpha}' title='\boldsymbol{\alpha}' class='latex' /> is not sparse. This means that the whole training set needs to be used at prediction time. </p>
<h3>Multiple output</h3>
<p>So far we have considered the case when there is only one output <img src='http://s.wordpress.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' /> per instance. We can easily extend to the <img src='http://s.wordpress.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> output case. To do that, we simply need to solve <img src='http://s.wordpress.com/latex.php?latex=HW%3D%20X%5ET%20Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='HW= X^T Y' title='HW= X^T Y' class='latex' />, where <img src='http://s.wordpress.com/latex.php?latex=W&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='W' title='W' class='latex' /> is <img src='http://s.wordpress.com/latex.php?latex=D%20%5Ctimes%20M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='D \times M' title='D \times M' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' /> is <img src='http://s.wordpress.com/latex.php?latex=N%20%5Ctimes%20M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N \times M' title='N \times M' class='latex' />. Note that the Cholesky decomposition needs to be computed only once. Since it takes <img src='http://s.wordpress.com/latex.php?latex=O%28D%5E3%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='O(D^3)' title='O(D^3)' class='latex' /> time, it is much more efficient to solve <img src='http://s.wordpress.com/latex.php?latex=HW%3DX%5ET%20Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='HW=X^T Y' title='HW=X^T Y' class='latex' /> directly than to solve <img src='http://s.wordpress.com/latex.php?latex=H%5Cmathbf%7Bw%7D%3DX%5ET%20%5Cmathbf%7By%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='H\mathbf{w}=X^T \mathbf{y}' title='H\mathbf{w}=X^T \mathbf{y}' class='latex' />, <img src='http://s.wordpress.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> times. </p>
<h3>Classification</h3>
<p>M-class classification can easily be achieved by combining multiple-output ridge regression and the 1-of-M coding scheme. This is sometimes referred by some authors as Regularized Least Square Classifier (RLSC). Concretely, we just need to construct a matrix <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' /> by setting <img src='http://s.wordpress.com/latex.php?latex=Y_%7Bij%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y_{ij}' title='Y_{ij}' class='latex' /> to <img src='http://s.wordpress.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' /> if the <img src='http://s.wordpress.com/latex.php?latex=i%5E%7Bth%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i^{th}' title='i^{th}' class='latex' /> instance is associated with the <img src='http://s.wordpress.com/latex.php?latex=j%5E%7Bth%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j^{th}' title='j^{th}' class='latex' /> label and to <img src='http://s.wordpress.com/latex.php?latex=-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='-1' title='-1' class='latex' /> otherwise. Then prediction is just a matter of  taking the most confident label for each instance, just like one-vs-all SVM.</p>
<p>Note that the square error is sometimes criticized for classification because it penalizes overconfident predictions. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not. However, in practice, RLSC often achieves accuracy on a par with other classifiers such as SVM.</p>
<h3>Efficient leave-one-out cross-validation</h3>
<p>Estimating <img src='http://s.wordpress.com/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\lambda' title='\lambda' class='latex' /> is a model selection problem, usually performed by k-fold or leave-one-out cross-validation. The naive approach to leave-one-out is to learn <img src='http://s.wordpress.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> models by holding out one instance at a time. Obviously, this is computationally very expensive. Fortunately, there exist nice computational tricks. Below, I show how to use them in the kernel case but they can equally be applied to the normal case.</p>
<p>The first trick is to take the eigendecomposition </p>
<img src='http://s.wordpress.com/latex.php?latex=K%3DQ%20%5CLambda%20Q%5ET.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='K=Q \Lambda Q^T.' title='K=Q \Lambda Q^T.' class='latex' />
<p>Using basic linear algebra, we obtain</p>
<img src='http://s.wordpress.com/latex.php?latex=G%5E%7B-1%7D%3DQ%28%5CLambda%20%2B%20%5Clambda%20I%29%5E%7B-1%7D%20Q%5ET.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G^{-1}=Q(\Lambda + \lambda I)^{-1} Q^T.' title='G^{-1}=Q(\Lambda + \lambda I)^{-1} Q^T.' class='latex' />
<p>Since <img src='http://s.wordpress.com/latex.php?latex=%28%5CLambda%20%2B%20%5Clambda%20I%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(\Lambda + \lambda I)' title='(\Lambda + \lambda I)' class='latex' /> is diagonal, it can easily be inverted by taking its reciprocal. Therefore by first paying the price of an eigendecomposition, <img src='http://s.wordpress.com/latex.php?latex=G&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G' title='G' class='latex' /> can thus be inverted inexpensively for many different <img src='http://s.wordpress.com/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\lambda' title='\lambda' class='latex' />.</p>
<p>Let <img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i)' title='f_i(\mathbf{x}_i)' class='latex' /> be the prediction value when the model was fitted to all training instances but <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_i' title='\mathbf{x}_i' class='latex' />. The second trick, sometimes called generalized cross-validation, will allow us to calculate the <img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i)' title='f_i(\mathbf{x}_i)' class='latex' /> almost for free.</p>
<p>First, let <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D%5Ei&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y}^i' title='\mathbf{y}^i' class='latex' /> be the label vector of the model when <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_i' title='\mathbf{x}_i' class='latex' /> is held out.</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D%5Ei%20%3D%20%28y_1%2C%5Cdots%2Cf_i%28%5Cmathbf%7Bx%7D_i%29%2C%5Cdots%2Cy_N%29%5ET&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y}^i = (y_1,\dots,f_i(\mathbf{x}_i),\dots,y_N)^T' title='\mathbf{y}^i = (y_1,\dots,f_i(\mathbf{x}_i),\dots,y_N)^T' class='latex' />
<p>When we replace <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y}' title='\mathbf{y}' class='latex' /> by <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D%5Ei&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y}^i' title='\mathbf{y}^i' class='latex' /> in <img src='http://s.wordpress.com/latex.php?latex=L_%7BRidge%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='L_{Ridge}' title='L_{Ridge}' class='latex' />, clearly, <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx_i%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x_i}' title='\mathbf{x_i}' class='latex' /> doesn&#8217;t affect the solution (as desired) and we have</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cboldsymbol%7B%5Calpha%7D%5Ei%20%3D%20G%5E%7B-1%7D%20%5Cmathbf%7By%7D%5Ei&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\boldsymbol{\alpha}^i = G^{-1} \mathbf{y}^i' title='\boldsymbol{\alpha}^i = G^{-1} \mathbf{y}^i' class='latex' /><br />
<br/><br />
<img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29%20%3D%20%28K%20%5Cboldsymbol%7B%5Calpha%7D%5Ei%29_i.&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i) = (K \boldsymbol{\alpha}^i)_i.' title='f_i(\mathbf{x}_i) = (K \boldsymbol{\alpha}^i)_i.' class='latex' />
<p>However, this is a circular definition, since <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D%5Ei&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y}^i' title='\mathbf{y}^i' class='latex' /> itself depends on <img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i)' title='f_i(\mathbf{x}_i)' class='latex' />. We can calculate the difference with <img src='http://s.wordpress.com/latex.php?latex=f%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(\mathbf{x}_i)' title='f(\mathbf{x}_i)' class='latex' />, which will allow us to factorize <img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i)' title='f_i(\mathbf{x}_i)' class='latex' />.</p>
<img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29%20-%20f%28%5Cmathbf%7Bx%7D_i%29%20%3D%20%28K%20G%5E%7B-1%7D%20%5Cmathbf%7By%7D%5Ei%29_i%20-%20%28KG%5E%7B-1%7D%5Cmathbf%7By%7D%29_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1} \mathbf{y}^i)_i - (KG^{-1}\mathbf{y})_i' title='f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1} \mathbf{y}^i)_i - (KG^{-1}\mathbf{y})_i' class='latex' />
<p>By definition, the <img src='http://s.wordpress.com/latex.php?latex=i%5E%7Bth%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i^{th}' title='i^{th}' class='latex' /> element of <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D%5Ei&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y}^i' title='\mathbf{y}^i' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7By%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{y}' title='\mathbf{y}' class='latex' /> are <img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i)' title='f_i(\mathbf{x}_i)' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=y_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_i' title='y_i' class='latex' />, respectively. Therefore,</p>
<img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29%20-%20f%28%5Cmathbf%7Bx%7D_i%29%20%3D%20%28K%20G%5E%7B-1%7D%29_%7Bii%7D%20%28f_i%28%5Cmathbf%7Bx%7D_i%29%20-%20y_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1})_{ii} (f_i(\mathbf{x}_i) - y_i)' title='f_i(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1})_{ii} (f_i(\mathbf{x}_i) - y_i)' class='latex' />
<p>and finally, by simplifying, we obtain</p>
<img src='http://s.wordpress.com/latex.php?latex=f_i%28%5Cmathbf%7Bx%7D_i%29%20%3D%20%5Cfrac%7Bf%28%5Cmathbf%7Bx%7D_i%29%20-%28K%20G%5E%7B-1%7D%29_%7Bii%7D%20y_i%7D%7B1%20-%20%28K%20G%5E%7B-1%7D%29_%7Bii%7D%7D.&#038;bg=ffffff&#038;fg=000000&#038;s=1' alt='f_i(\mathbf{x}_i) = \frac{f(\mathbf{x}_i) -(K G^{-1})_{ii} y_i}{1 - (K G^{-1})_{ii}}.' title='f_i(\mathbf{x}_i) = \frac{f(\mathbf{x}_i) -(K G^{-1})_{ii} y_i}{1 - (K G^{-1})_{ii}}.' class='latex' />
<p>Similar computational tricks exist for Gaussian Process Regression as well.</p>
<h3>References</h3>
<p><a href="http://www.mit.edu/~9.520/spring07/Classes/rlsslides.pdf">Regularized Least Squares</a>, Ryan Rifkin</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2011/02/09/regularized-least-squares/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Kernel Perceptron in Python</title>
		<link>http://www.mblondel.org/journal/2010/10/31/kernel-perceptron-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/10/31/kernel-perceptron-in-python/#comments</comments>
		<pubDate>Sun, 31 Oct 2010 05:36:32 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=129</guid>
		<description><![CDATA[The Perceptron (Rosenblatt, 1957) is one of the oldest and simplest Machine Learning algorithms. It&#8217;s also trivial to kernelize, which makes it an ideal candidate to gain insights on kernel methods. Perceptron The Perceptron predicts the class of an input with the function where sign(y) = +1 if y > 0 and -1 if y [...]]]></description>
			<content:encoded><![CDATA[<p>The Perceptron (Rosenblatt, 1957) is one of the oldest and simplest Machine Learning algorithms. It&#8217;s also trivial to <em>kernelize</em>, which makes it an ideal candidate to gain insights on kernel methods.<br />
<span id="more-129"></span></p>
<h3>Perceptron</h3>
<p>The Perceptron predicts the class of an input <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D%20%5Cin%20%5Cmathcal%7BX%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x} \in \mathcal{X}' title='\mathbf{x} \in \mathcal{X}' class='latex' /> with the function</p>
<img src='http://s.wordpress.com/latex.php?latex=f%28%5Cmathbf%7Bx%7D%29%20%3D%20sign%28%5Cmathbf%7Bw%7D%5Ccdot%5Cphi%28%5Cmathbf%7Bx%7D%29%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(\mathbf{x}) = sign(\mathbf{w}\cdot\phi(\mathbf{x}))' title='f(\mathbf{x}) = sign(\mathbf{w}\cdot\phi(\mathbf{x}))' class='latex' />
<p>where sign(y) = +1 if y > 0 and -1 if y < 0, <img src='http://s.wordpress.com/latex.php?latex=%5Cphi&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\phi' title='\phi' class='latex' /> is a feature-space transformation and  <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> is a feature weight vector. If  <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' /> is already a feature vector and a projection to another space is not needed, then <img src='http://s.wordpress.com/latex.php?latex=%5Cphi%28%5Cmathbf%7Bx%7D%29%3D%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\phi(\mathbf{x})=\mathbf{x}' title='\phi(\mathbf{x})=\mathbf{x}' class='latex' />.</p>
<p>Given a data set comprising <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> training examples <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_1%2C%5Cdots%2C%5Cmathbf%7Bx%7D_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_1,\dots,\mathbf{x}_n' title='\mathbf{x}_1,\dots,\mathbf{x}_n' class='latex' /> and their corresponding labels <img src='http://s.wordpress.com/latex.php?latex=y_1%2C%5Cdots%2Cy_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_1,\dots,y_n' title='y_1,\dots,y_n' class='latex' />, where <img src='http://s.wordpress.com/latex.php?latex=y_n%20%5Cin%20%5C%7B-1%2C%2B1%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_n \in \{-1,+1\}' title='y_n \in \{-1,+1\}' class='latex' />, the Perceptron makes a prediction for each <img src='http://s.wordpress.com/latex.php?latex=%28%5Cmathbf%7Bx%7D_i%2C%20y_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(\mathbf{x}_i, y_i)' title='(\mathbf{x}_i, y_i)' class='latex' /> using the current estimate of <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' />. When the prediction is correct (equal to the label <img src='http://s.wordpress.com/latex.php?latex=y_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_i' title='y_i' class='latex' />), the algorithm jumps to the next example. When the prediction is incorrect, in order to correct for the mistake, if <img src='http://s.wordpress.com/latex.php?latex=y_i%20%3D%20%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_i = +1' title='y_i = +1' class='latex' /> then <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> is incremented by <img src='http://s.wordpress.com/latex.php?latex=%5Cphi%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\phi(\mathbf{x}_i)' title='\phi(\mathbf{x}_i)' class='latex' />, otherwise it is decremented by <img src='http://s.wordpress.com/latex.php?latex=%5Cphi%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\phi(\mathbf{x}_i)' title='\phi(\mathbf{x}_i)' class='latex' />. Since <img src='http://s.wordpress.com/latex.php?latex=y_i%20%5Cin%20%5C%7B-1%2C%2B1%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_i \in \{-1,+1\}' title='y_i \in \{-1,+1\}' class='latex' />, the update rule is thus:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D%20%3D%20%5Cmathbf%7Bw%7D%20%2B%20y_i%20%5Cphi%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w} = \mathbf{w} + y_i \phi(\mathbf{x}_i)' title='\mathbf{w} = \mathbf{w} + y_i \phi(\mathbf{x}_i)' class='latex' />
<p><center><img src="http://www.mblondel.org/images/perceptron_update.png" /></center><br />
Figure 1: The decision boundary (hyperplane), to which <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> is normal, partitions the vector space into two sets, one for each class. An update changes the direction of the decision boundary to correct for the incorrect prediction. (From this <a href="http://www.cse.wustl.edu/~mgeorg/readPapers/kernelTrick/kernels.pdf">document</a>)</p>
<p><center><img src="http://www.mblondel.org/images/perceptron_linear.png" /></center><br />
Figure 2: Decision boundary in the linear case. </p>
<h3>Kernel Perceptron</h3>
<p>From the update rule above, we clearly see that, if <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> is initialized to the zero vector, it is a linear combination of the training examples.</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D%20%3D%20%5Csum_i%20%5Calpha_i%20y_i%20%5Cphi%28%5Cmathbf%7Bx%7D_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)' title='\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)' class='latex' />
<p>Injecting <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> into the prediction function <img src='http://s.wordpress.com/latex.php?latex=f%28%5Cmathbf%7Bx%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(\mathbf{x})' title='f(\mathbf{x})' class='latex' />, we get</p>
<img src='http://s.wordpress.com/latex.php?latex=f%28%5Cmathbf%7Bx%7D%29%20%3D%20sign%28%5Csum_i%20%5Calpha_i%20y_i%20%5Cphi%28%5Cmathbf%7Bx%7D_i%29%5Ccdot%5Cphi%28%5Cmathbf%7Bx%7D%29%29%20%3D%20sign%28%5Csum_i%20%5Calpha_i%20y_i%20K%28%5Cmathbf%7Bx%7D_i%2C%5Cmathbf%7Bx%7D%29%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(\mathbf{x}) = sign(\sum_i \alpha_i y_i \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})) = sign(\sum_i \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}))' title='f(\mathbf{x}) = sign(\sum_i \alpha_i y_i \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})) = sign(\sum_i \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}))' class='latex' />
<p>where <img src='http://s.wordpress.com/latex.php?latex=K%28%5Cmathbf%7Bx%7D_i%2C%5Cmathbf%7Bx%7D%29%20%3D%20%5Cphi%28%5Cmathbf%7Bx%7D_i%29%5Ccdot%5Cphi%28%5Cmathbf%7Bx%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='K(\mathbf{x}_i,\mathbf{x}) = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})' title='K(\mathbf{x}_i,\mathbf{x}) = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})' class='latex' /> is a Mercer kernel.</p>
<p>The update rule for when a mistake is made predicting <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_i' title='\mathbf{x}_i' class='latex' /> now simply becomes</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Calpha_i%20%3D%20%5Calpha_i%20%2B%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha_i = \alpha_i + 1' title='\alpha_i = \alpha_i + 1' class='latex' />
<p>i.e., <img src='http://s.wordpress.com/latex.php?latex=%5Calpha_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha_i' title='\alpha_i' class='latex' /> is the number of times a mistake has been made with respect to <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_i' title='\mathbf{x}_i' class='latex' />.</p>
<p>A few remarks:</p>
<ul>
<li>Instead of learning a weight vector <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> with respect to features, we&#8217;re now learning a weight vector <img src='http://s.wordpress.com/latex.php?latex=%5Cboldsymbol%7B%5Calpha%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\boldsymbol{\alpha}' title='\boldsymbol{\alpha}' class='latex' /> with respect to examples.</li>
<li>To predict the class of an input <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' />, we need to store the training examples <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_i' title='\mathbf{x}_i' class='latex' />. Kernel methods are memory-based methods, like k-NN. However, we only need to store examples for which a mistake has been made, i.e. <img src='http://s.wordpress.com/latex.php?latex=%7B%5Calpha%7D_i%20%5Cneq%200%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='{\alpha}_i \neq 0 ' title='{\alpha}_i \neq 0 ' class='latex' />. In the context of Support Vector Machines (SVMs), these are called support vectors. SVMs, however, not only store examples for which a mistake has been made, they also store examples that lie inside the margin, i.e. <img src='http://s.wordpress.com/latex.php?latex=y_i%20%28%5Cmathbf%7Bw%7D%20%5Ccdot%20%5Cphi%28%5Cmathbf%7Bx%7D_i%29%29%20%5Cle%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i)) \le 1' title='y_i (\mathbf{w} \cdot \phi(\mathbf{x}_i)) \le 1' class='latex' />. (See Figure 4 from my <a href="http://www.mblondel.org/journal/2010/09/19/support-vector-machines-in-python/">post on SVMs</a>)</li>
<li>In the online learning setting, the number of support vectors can grow and grow as more mistakes are made. The Forgetron is an extension of the Kernel Perceptron which can learn with a &#8220;memory budget&#8221;. When the budget is exceeded, some support vectors are &#8220;forgotten&#8221;.</li>
<li>For some kinds of objects like sequences, trees and graphs, it might be difficult to map objects to feature vectors while it is easy to come up with a similarity measure <img src='http://s.wordpress.com/latex.php?latex=K%28%5Cmathbf%7Bx%7D_i%2C%5Cmathbf%7Bx%7D_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='K(\mathbf{x}_i,\mathbf{x}_j)' title='K(\mathbf{x}_i,\mathbf{x}_j)' class='latex' /> between any two objects <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_i' title='\mathbf{x}_i' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_j' title='\mathbf{x}_j' class='latex' />. In this case, kernel methods are very useful, since they can be used &#8220;as is&#8221;.</li>
</ul>
<p>For a brief and intuitive introduction to kernel classifiers, I highly recommend <a href="http://128.148.32.110/courses/csci1950-f/docs/lecture_10-27.pdf">these slides</a>, by Mark Johnson.</p>
<p><center><img src="http://www.mblondel.org/images/perceptron_gaussian.png" /></center><br />
Figure 3: Decision boundary when using a gaussian kernel. Green dots indicate support vectors.</p>
<p>The voted and average Perceptrons are two straightforward extensions to the Perceptron, which for some applications are competitive with SVMs. For details, see &#8220;Large Margin Classiﬁcation Using the Perceptron Algorithm&#8221; by Freund and Schapire.</p>
<h3>Source</h3>
<p><a href="http://gist.github.com/656147">http://gist.github.com/656147</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/10/31/kernel-perceptron-in-python/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Support Vector Machines in Python</title>
		<link>http://www.mblondel.org/journal/2010/09/19/support-vector-machines-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/09/19/support-vector-machines-in-python/#comments</comments>
		<pubDate>Sun, 19 Sep 2010 14:07:15 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=128</guid>
		<description><![CDATA[Support Vector Machines (SVM) are the state-of-the-art classifier in many applications and have become ubiquitous thanks to the wealth of open-source libraries implementing them. However, you learn a lot more by actually doing than by just reading, so let&#8217;s play a little bit with SVM in Python! To make it easier to read, we will [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machines</a> (SVM) are the state-of-the-art classifier in many applications and have become ubiquitous thanks to the wealth of open-source libraries implementing them. However, you learn a lot more by actually doing than by just reading, so let&#8217;s play a little bit with SVM in Python! To make it easier to read, we will use the same notation as in Christopher Bishop&#8217;s book &#8220;Pattern Recognition and Machine Learning&#8221;.<br />
<span id="more-128"></span></p>
<h3>Maximum margin</h3>
<p>In SVM, the class of an input vector <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' /> can be decided by evaluating the sign of <img src='http://s.wordpress.com/latex.php?latex=y%28%5Cmathbf%7Bx%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y(\mathbf{x})' title='y(\mathbf{x})' class='latex' />.</p>
<img src='http://s.wordpress.com/latex.php?latex=%287.1%29~y%28%5Cmathbf%7Bx%7D%29%20%3D%20%5Cmathbf%7Bw%7D%5ET%5Cphi%28%5Cmathbf%7Bx%7D%29%2Bb&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.1)~y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})+b' title='(7.1)~y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})+b' class='latex' />
<p>If <img src='http://s.wordpress.com/latex.php?latex=y%28%5Cmathbf%7Bx%7D%29%20%3E%200&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y(\mathbf{x}) &gt; 0' title='y(\mathbf{x}) &gt; 0' class='latex' /> we assign <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' />  to class +1 and if <img src='http://s.wordpress.com/latex.php?latex=y%28%5Cmathbf%7Bx%7D%29%20%3C%200&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y(\mathbf{x}) &lt; 0' title='y(\mathbf{x}) &lt; 0' class='latex' />, we assign it to class -1. Here <img src='http://s.wordpress.com/latex.php?latex=%5Cphi%28%5Cmathbf%7Bx%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\phi(\mathbf{x})' title='\phi(\mathbf{x})' class='latex' /> is a feature-space transformation, which can map <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' /> to a space of higher, possibly infinite, dimensions.</p>
<p>Given a data set comprising <img src='http://s.wordpress.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> input vectors <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_1%2C%5Cdots%2C%5Cmathbf%7Bx%7D_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_1,\dots,\mathbf{x}_n' title='\mathbf{x}_1,\dots,\mathbf{x}_n' class='latex' /> and their corresponding labels <img src='http://s.wordpress.com/latex.php?latex=t_1%2C%5Cdots%2Ct_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t_1,\dots,t_n' title='t_1,\dots,t_n' class='latex' />, where <img src='http://s.wordpress.com/latex.php?latex=t_n%20%5Cin%20%5C%7B-1%2C%2B1%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t_n \in \{-1,+1\}' title='t_n \in \{-1,+1\}' class='latex' />, we would like to find <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=b&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='b' title='b' class='latex' /> such that it explains the training data: <img src='http://s.wordpress.com/latex.php?latex=y%28%5Cmathbf%7Bx%7D_n%29%20%5Cge%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y(\mathbf{x}_n) \ge 1' title='y(\mathbf{x}_n) \ge 1' class='latex' /> when <img src='http://s.wordpress.com/latex.php?latex=t_n%3D%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t_n=+1' title='t_n=+1' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=y%28%5Cmathbf%7Bx%7D_n%29%20%5Cle%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y(\mathbf{x}_n) \le 1' title='y(\mathbf{x}_n) \le 1' class='latex' /> when <img src='http://s.wordpress.com/latex.php?latex=t_n%3D-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t_n=-1' title='t_n=-1' class='latex' />. This can be rewritten in a single constraint:</p>
<img src='http://s.wordpress.com/latex.php?latex=%287.5%29~t_n%20%28%5Cmathbf%7Bw%7D%5ET%5Cphi%28%5Cmathbf%7Bx%7D%29%2Bb%29%20%5Cge%201%2C~n%3D1%2C%5Cdots%2CN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.5)~t_n (\mathbf{w}^T\phi(\mathbf{x})+b) \ge 1,~n=1,\dots,N' title='(7.5)~t_n (\mathbf{w}^T\phi(\mathbf{x})+b) \ge 1,~n=1,\dots,N' class='latex' />
<p>In addition, <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=b&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='b' title='b' class='latex' /> are chosen so that the distance between the decision boundary <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D%5ET%5Cphi%28%5Cmathbf%7Bx%7D%29%2Bb%3D0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}^T\phi(\mathbf{x})+b=0' title='\mathbf{w}^T\phi(\mathbf{x})+b=0' class='latex' /> (a line in the 2-d case, a plane in the 3-d case, a hyperplane in the n-d case) and the closest points to it is maximized. This distance is called the margin, hence the name <em>maximum margin</em> classifier. Geometrically, the margin is found to be <img src='http://s.wordpress.com/latex.php?latex=2%2F%5C%7C%5Cmathbf%7Bw%7D%5C%7C%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2/\|\mathbf{w}\|^2' title='2/\|\mathbf{w}\|^2' class='latex' /> and so the maximum margin problem can be equivalently expressed as the minimization problem:</p>
<img src='http://s.wordpress.com/latex.php?latex=%287.6%29~arg%20min_%7B%5Cmathbf%7Bw%7D%2Cb%7D%5Cfrac%7B1%7D%7B2%7D%5C%7C%5Cmathbf%7Bw%7D%5C%7C%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.6)~arg min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2' title='(7.6)~arg min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2' class='latex' />
<p>subject to constraint (7.5).</p>
<p><img src="http://www.mblondel.org/images/svm_linear.png" /></p>
<p>Figure 1: The linearly separable case. The decision line is the plain line and the margin is the gap between the two dotted lines. Only 3 support vectors (green dots) out of 180 training examples are necessary.</p>
<h3>Dual representation</h3>
<p>Since this is a constrained optimization problem, we can introduce <a href="http://en.wikipedia.org/wiki/Lagrange_multipliers">Lagrange multipliers</a> <img src='http://s.wordpress.com/latex.php?latex=a_n%20%5Cge%200&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='a_n \ge 0' title='a_n \ge 0' class='latex' /> (one per training example), differentiate the Lagrangian function with respect to <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=b&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='b' title='b' class='latex' /> and inject the solution back in the Lagrangian function (equations (7.7) to (7.9) in Bishop&#8217;s book), so that it depends only on the Lagrange multipliers. Doing so, we find that the maximum margin equivalently emerges from the maximization of</p>
<img src='http://s.wordpress.com/latex.php?latex=%287.10%29~%5Ctilde%7BL%7D%28%5Cmathbf%7Ba%7D%29%20%3D%20%5Csum_%7Bn%3D1%7D%5ENa_n-%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bn%3D1%7D%5EN%5Csum_%7Bm%3D1%7D%5EN%20a_n%20a_m%20t_n%20t_m%20k%28%5Cmathbf%7Bx%7D_n%2C%5Cmathbf%7Bx%7D_m%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.10)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)' title='(7.10)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)' class='latex' />
<p>subject to the constraints</p>
<img src='http://s.wordpress.com/latex.php?latex=%287.11%29~a_n%20%5Cge%200%2C~n%3D1%2C%5Cdots%2CN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.11)~a_n \ge 0,~n=1,\dots,N' title='(7.11)~a_n \ge 0,~n=1,\dots,N' class='latex' /><br />
</br><br />
<img src='http://s.wordpress.com/latex.php?latex=%287.12%29~%5Csum_%7Bn%3D1%7D%5ENa_nt_n%20%3D%200&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.12)~\sum_{n=1}^Na_nt_n = 0' title='(7.12)~\sum_{n=1}^Na_nt_n = 0' class='latex' />
<p>This is the so-called <em>dual representation</em> and is a <a href="http://en.wikipedia.org/wiki/Quadratic_programming">quadratic programming</a> (QP) problem. <img src='http://s.wordpress.com/latex.php?latex=k%28%5Cmathbf%7Bx%7D_n%2C%5Cmathbf%7Bx%7D_m%29%3D%20%5Cphi%28%5Cmathbf%7Bx%7D_n%29%5ET%5Cphi%28%5Cmathbf%7Bx%7D_m%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k(\mathbf{x}_n,\mathbf{x}_m)= \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m)' title='k(\mathbf{x}_n,\mathbf{x}_m)= \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m)' class='latex' /> is called the kernel function.</p>
<p>Similarly to the objective function, <img src='http://s.wordpress.com/latex.php?latex=y%28%5Cmathbf%7Bx%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y(\mathbf{x})' title='y(\mathbf{x})' class='latex' /> can also be re-expressed solely in terms of the Lagrange multipliers.</p>
<img src='http://s.wordpress.com/latex.php?latex=%287.13%29~y%28%5Cmathbf%7Bx%7D%29%3D%20%5Csum_%7Bn%3D1%7D%5EN%20a_n%20t_n%20k%28%5Cmathbf%7Bx%7D%2C%5Cmathbf%7Bx%7D_n%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.13)~y(\mathbf{x})= \sum_{n=1}^N a_n t_n k(\mathbf{x},\mathbf{x}_n)' title='(7.13)~y(\mathbf{x})= \sum_{n=1}^N a_n t_n k(\mathbf{x},\mathbf{x}_n)' class='latex' />
<p>The important thing to notice here is that we&#8217;ve gone from a sum over <img src='http://s.wordpress.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> dimensions (the <a href="http://en.wikipedia.org/wiki/Dot_product">dot product</a> in equation (7.1)) to a sum over <img src='http://s.wordpress.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> points. This may seem like a disadvantage as the number of training examples <img src='http://s.wordpress.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> is usually bigger than the number of dimensions <img src='http://s.wordpress.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' />. However, this is very useful and is called the <a href="http://en.wikipedia.org/wiki/Kernel_trick">kernel trick</a>: this allows to use SVM, originally a linear classifier, to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, but without explicitly computing <img src='http://s.wordpress.com/latex.php?latex=%5Cphi%28%5Cmathbf%7Bx%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\phi(\mathbf{x})' title='\phi(\mathbf{x})' class='latex' />.</p>
<p>In many situations, only a small proportion of the Lagrange multipliers <img src='http://s.wordpress.com/latex.php?latex=a_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='a_n' title='a_n' class='latex' /> will be non-zero, therefore we only need to store the corresponding training examples <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}_n' title='\mathbf{x}_n' class='latex' />. These are called the <em>support vectors</em> and this is why SVMs are sometimes called sparse kernel machines.</p>
<p>That being said, in the linear case, i.e. when <img src='http://s.wordpress.com/latex.php?latex=%5Cphi%28%5Cmathbf%7Bx%7D%29%3D%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\phi(\mathbf{x})=\mathbf{x}' title='\phi(\mathbf{x})=\mathbf{x}' class='latex' />, it is faster to directly compute <img src='http://s.wordpress.com/latex.php?latex=y%28%5Cmathbf%7Bx%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y(\mathbf{x})' title='y(\mathbf{x})' class='latex' /> from equation (7.1). <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=b&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='b' title='b' class='latex' /> can be computed in terms of the Lagrange multipliers by equations (7.8) and (7.18) in Bishop&#8217;s book.</p>
<p><img src="http://www.mblondel.org/images/svm_non_linear.png" /></p>
<p>Figure 2: The non-linearly separable case. Example of a gaussian kernel with parameter sigma=5.0. Perfect prediction is achieved on the held-out 20 data points.</p>
<h3>QP solver</h3>
<p>We want to find the Lagrange multipliers <img src='http://s.wordpress.com/latex.php?latex=a_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='a_n' title='a_n' class='latex' /> maximizing equation (7.10) subject to the constraints (7.11) and (7.11). This can be done by a standard QP solver such as <a href="http://abel.ee.ucla.edu/cvxopt/">cvxopt</a>.</p>
<p>Minimize</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7D%20%5Cmathbf%7Bx%7D%5ET%20P%5Cmathbf%7Bx%7D%20%2B%20%5Cmathbf%7Bq%7D%5ET%20%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{1}{2} \mathbf{x}^T P\mathbf{x} + \mathbf{q}^T \mathbf{x}' title='\frac{1}{2} \mathbf{x}^T P\mathbf{x} + \mathbf{q}^T \mathbf{x}' class='latex' />
<p>subject to</p>
<img src='http://s.wordpress.com/latex.php?latex=G%5Cmathbf%7Bx%7D%20%5Cleq%20%5Cmathbf%20h&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G\mathbf{x} \leq \mathbf h' title='G\mathbf{x} \leq \mathbf h' class='latex' /> (inequality constraint)<br />
</br><br />
<img src='http://s.wordpress.com/latex.php?latex=A%5Cmathbf%7Bx%7D%20%3D%20%5Cmathbf%20b&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='A\mathbf{x} = \mathbf b' title='A\mathbf{x} = \mathbf b' class='latex' /> (equality constraint)</p>
<p>The unknow is <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' />, which in our case correspond to the Lagrange multipliers <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Ba%7D%3Da_1%2C%5Cdots%2Ca_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{a}=a_1,\dots,a_n' title='\mathbf{a}=a_1,\dots,a_n' class='latex' />. We just need to rework the formulation a little bit to use matrix notation and be a minimization (hence the -1 multiplicative factors).</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;"># Gram matrix</span>
K = np.<span style="color: black;">zeros</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>n_samples, n_samples<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>n_samples<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">for</span> j <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>n_samples<span style="color: black;">&#41;</span>:
        K<span style="color: black;">&#91;</span>i,j<span style="color: black;">&#93;</span> = <span style="color: #008000;">self</span>.<span style="color: black;">kernel</span><span style="color: black;">&#40;</span>X<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>, X<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
P = cvxopt.<span style="color: black;">matrix</span><span style="color: black;">&#40;</span>np.<span style="color: black;">outer</span><span style="color: black;">&#40;</span>y,y<span style="color: black;">&#41;</span> <span style="color: #66cc66;">*</span> K<span style="color: black;">&#41;</span>
q = cvxopt.<span style="color: black;">matrix</span><span style="color: black;">&#40;</span>np.<span style="color: black;">ones</span><span style="color: black;">&#40;</span>n_samples<span style="color: black;">&#41;</span> <span style="color: #66cc66;">*</span> -<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
A = cvxopt.<span style="color: black;">matrix</span><span style="color: black;">&#40;</span>y, <span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>,n_samples<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
b = cvxopt.<span style="color: black;">matrix</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0.0</span><span style="color: black;">&#41;</span>
G = cvxopt.<span style="color: black;">matrix</span><span style="color: black;">&#40;</span>np.<span style="color: black;">diag</span><span style="color: black;">&#40;</span>np.<span style="color: black;">ones</span><span style="color: black;">&#40;</span>n_samples<span style="color: black;">&#41;</span> <span style="color: #66cc66;">*</span> -<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
h = cvxopt.<span style="color: black;">matrix</span><span style="color: black;">&#40;</span>np.<span style="color: black;">zeros</span><span style="color: black;">&#40;</span>n_samples<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Solve QP problem</span>
solution = cvxopt.<span style="color: black;">solvers</span>.<span style="color: black;">qp</span><span style="color: black;">&#40;</span>P, q, G, h, A, b<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Lagrange multipliers</span>
a = np.<span style="color: black;">ravel</span><span style="color: black;">&#40;</span>solution<span style="color: black;">&#91;</span><span style="color: #483d8b;">'x'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>Note here that <img src='http://s.wordpress.com/latex.php?latex=P&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P' title='P' class='latex' /> is a <img src='http://s.wordpress.com/latex.php?latex=N%20%5Ctimes%20N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N \times N' title='N \times N' class='latex' /> matrix. Thus, a standard QP solver can&#8217;t be used for a large number of training examples, as P needs to be stored in memory. There exists a number of algorithms in order to decompose the original QP problem into smaller QP problems that target only a few training samples at a time. One such algorithm is <a href="http://en.wikipedia.org/wiki/Sequential_Minimal_Optimization">Sequential Minimal Optimization</a> (SMO). One advantage of SMO is that the smaller QP problems can be solved analytically and so SMO doesn&#8217;t even need a QP solver.</p>
<h3>Soft margin</h3>
<p>The problem with the formulation we have used thus far is that it doesn&#8217;t allow for misclassification of the training examples. This can lead to poor generalization if there is overlap between the distributions of the two classes. To solve this problem, we can rework constraint (7.5) as</p>
<p><img src='http://s.wordpress.com/latex.php?latex=%287.20%29~t_n%20y%28%5Cmathbf%7Bx%7D_n%29%20%5Cge%201%20-%20%5Cxi_n%2C~n%3D1%2C%5Cdots%2CN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.20)~t_n y(\mathbf{x}_n) \ge 1 - \xi_n,~n=1,\dots,N' title='(7.20)~t_n y(\mathbf{x}_n) \ge 1 - \xi_n,~n=1,\dots,N' class='latex' /><br />
</br><br />
<img src='http://s.wordpress.com/latex.php?latex=%5Cxi_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\xi_n' title='\xi_n' class='latex' /> are called slack variables and are introduced to allow the misclassification of some examples. If <img src='http://s.wordpress.com/latex.php?latex=%5Cxi_n%3D0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\xi_n=0' title='\xi_n=0' class='latex' />, the corresponding training example is correctly classified. If <img src='http://s.wordpress.com/latex.php?latex=0%20%3C%20%5Cxi_n%20%5Cle%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0 &lt; \xi_n \le 1' title='0 &lt; \xi_n \le 1' class='latex' />, the training example lies inside the margin but is still on the correct side of the decision boundary. If <img src='http://s.wordpress.com/latex.php?latex=%5Cxi_n%20%3E%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\xi_n &gt; 1' title='\xi_n &gt; 1' class='latex' />, the training example is misclassified. Equation (7.6) then becomes</p>
<p><img src='http://s.wordpress.com/latex.php?latex=%287.21%29~C%20%5Csum_%7Bn%3D1%7D%5EN%20%5Cxi_n%20%2B%20%5Cfrac%7B1%7D%7B2%7D%5C%7C%5Cmathbf%7Bw%7D%5C%7C%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.21)~C \sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|^2' title='(7.21)~C \sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|^2' class='latex' /><br />
</br></p>
<p><img src='http://s.wordpress.com/latex.php?latex=C%20%3E%200&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='C &gt; 0' title='C &gt; 0' class='latex' /> is the parameter which controls the trade-off between the slack variable penality and the margin. Again, we can introduce Lagrange multipliers, derive the Lagrangian function with respect to <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{w}' title='\mathbf{w}' class='latex' />, <img src='http://s.wordpress.com/latex.php?latex=b&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='b' title='b' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Cxi_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\xi_n' title='\xi_n' class='latex' />, and inject the solutions back in the Lagragian function (equations (7.22) to (7.31) in Bishop&#8217;s book).</p>
<p><img src='http://s.wordpress.com/latex.php?latex=%287.32%29~%5Ctilde%7BL%7D%28%5Cmathbf%7Ba%7D%29%20%3D%20%5Csum_%7Bn%3D1%7D%5ENa_n-%5Cfrac%7B1%7D%7B2%7D%5Csum_%7Bn%3D1%7D%5EN%5Csum_%7Bm%3D1%7D%5EN%20a_n%20a_m%20t_n%20t_m%20k%28%5Cmathbf%7Bx%7D_n%2C%5Cmathbf%7Bx%7D_m%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.32)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)' title='(7.32)~\tilde{L}(\mathbf{a}) = \sum_{n=1}^Na_n-\frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)' class='latex' />
<p>Which is identical to the hard margin case! The constraints become:</p>
<img src='http://s.wordpress.com/latex.php?latex=%287.33%29~0%20%5Cle%20a_n%20%5Cle%20C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.33)~0 \le a_n \le C' title='(7.33)~0 \le a_n \le C' class='latex' /><br />
</br><br />
<img src='http://s.wordpress.com/latex.php?latex=%287.34%29~%5Csum_%7Bn%3D1%7D%5EN%20a_n%20t_n%20%3D%200&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(7.34)~\sum_{n=1}^N a_n t_n = 0' title='(7.34)~\sum_{n=1}^N a_n t_n = 0' class='latex' />
<p>Interestingly, the slack variables <img src='http://s.wordpress.com/latex.php?latex=%5Cxi_n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\xi_n' title='\xi_n' class='latex' /> have vanished and the only difference with the hard margin is that the inequality constraint now has an upper bound, <img src='http://s.wordpress.com/latex.php?latex=C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='C' title='C' class='latex' />.</p>
<p>The attentive reader will have noticed that the inequality constraints in cvxopt have an upper bound but no lower bound.</p>
<img src='http://s.wordpress.com/latex.php?latex=G%5Cmathbf%7Bx%7D%20%5Cleq%20%5Cmathbf%20h&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G\mathbf{x} \leq \mathbf h' title='G\mathbf{x} \leq \mathbf h' class='latex' />
<p>The trick is to rewrite constraint (7.33) as a system of inequations, in matrix notation. Example with 2 training examples:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cbegin%7Bpmatrix%7D-1%20%26%200%20%5C%5C%200%20%26%20-1%20%5C%5C%201%20%26%200%20%20%5C%5C%200%20%26%201%5Cend%7Bpmatrix%7D%5Cbegin%7Bpmatrix%7Da_1%5C%5C%20a_2%5Cend%7Bpmatrix%7D%20%5Cle%20%5Cbegin%7Bpmatrix%7D0%20%5C%5C%200%20%5C%5C%20C%20%5C%5C%20C%5Cend%7Bpmatrix%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\begin{pmatrix}-1 &amp; 0 \\ 0 &amp; -1 \\ 1 &amp; 0  \\ 0 &amp; 1\end{pmatrix}\begin{pmatrix}a_1\\ a_2\end{pmatrix} \le \begin{pmatrix}0 \\ 0 \\ C \\ C\end{pmatrix}' title='\begin{pmatrix}-1 &amp; 0 \\ 0 &amp; -1 \\ 1 &amp; 0  \\ 0 &amp; 1\end{pmatrix}\begin{pmatrix}a_1\\ a_2\end{pmatrix} \le \begin{pmatrix}0 \\ 0 \\ C \\ C\end{pmatrix}' class='latex' />
<p><img src="http://www.mblondel.org/images/svm_hard.png" /></p>
<p>Figure 3: The hard margin case. 180 vectors out of 180 are support vectors! And the classifier only achieves 11 correct predictions out of 20, on held-out data.</p>
<p><img src="http://www.mblondel.org/images/svm_soft.png" /></p>
<p>Figure 4: The soft margin case (C=0.1). 36 vectors out of 180 are support vectors. The classifier achieves 19 correct predictions out of 20!</p>
<h3>Source</h3>
<p><a href="http://gist.github.com/586753">http://gist.github.com/586753</a></p>
<h3>References</h3>
<p>Pattern Recognition and Machine Learning, Christopher Bishop, 2006.</p>
<p><a href="http://d.hatena.ne.jp/aidiary/20100503/1272889097">ソフトマージンSVM</a>, 人工知能に関する断想録</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/09/19/support-vector-machines-in-python/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Latent Dirichlet Allocation in Python</title>
		<link>http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 20:52:00 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=127</guid>
		<description><![CDATA[Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) &#8211; see my previous post &#8220;LSA and pLSA in Python&#8220;, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the &#8220;topics&#8221; expressed by documents in that collection. LDA can be seen as a Bayesian [...]]]></description>
			<content:encoded><![CDATA[<p>Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) &#8211; see my previous post &#8220;<a href="http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/">LSA and pLSA in Python</a>&#8220;, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the &#8220;topics&#8221; expressed by documents in that collection. LDA can be seen as a <a href="http://en.wikipedia.org/wiki/Bayesian_inference">Bayesian</a> extension of pLSA.<br />
<span id="more-127"></span><br />
As Blei, the author of LDA, points out, the topic proportions in pLSA are tied with the training documents. This is problematic: 1) the number of parameters grows linearly with the number of training documents, which can cause serious <a href="http://en.wikipedia.org/wiki/Overfitting">overfitting</a> 2) it is difficult to generalize to new documents and requires so-called &#8220;folding-in&#8221;. LDA fixes those issues by being a fully generative model: where pLSA uses a matrix of P(topic|document) probabilities, LDA uses a distribution over topics.</p>
<p>To date, there exists several parameter estimation schemes for LDA: variational Bayes, expectation propagation and Gibbs sampling. I&#8217;ve chosen to implement the latter. It has first been described in a paper entitled &#8220;Finding scientific topics&#8221;, by Griffiths and Steyvers.</p>
<h3>Artificial data</h3>
<p>As with all model-based algorithms, during the early development phase, it is useful to work with artificial data, generated by following the model assumptions. In the case of LDA (and pLSA), the core assumption is that words (w) in documents are generated by mixture of topics (z). In other words, the probability of a word is:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28w%29%20%3D%20%5Csum_%7Bz%7D%20P%28w%7Cz%29%20P%28z%29&#038;bg=ffffff&#038;fg=000000&#038;s=1' alt='P(w) = \sum_{z} P(w|z) P(z)' title='P(w) = \sum_{z} P(w|z) P(z)' class='latex' />
<p>The generative process can be summarized as follows: 1) set the topic proportions once for all when the collection is instantiated and 2) for each document and for as many words as needed, sample a topic from the topic distribution and sample a word from the word distribution of the selected topic. Obviously, this is only an approximation of how documents are created in reality.</p>
<p>To generate an artificial dataset, we can fix the word distribution of each topic and then generate documents as explained above. Since we generated documents by sticking to the generative assumption of the model, if the algorithm is correctly implemented, it should be able to recover the word distribution of each topic, from the generated documents.</p>
<h3>Graphical example</h3>
<p>To gain insight and intuition, we can reuse the graphical example from Griffiths and Steyvers&#8217; paper. </p>
<p>In the <a href="http://en.wikipedia.org/wiki/Bag_of_words_model">bag-of-words model</a>, documents are represented by vectors of dimension <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' />, where <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' /> is the vocabulary size. Moreover, an image of size <img src='http://s.wordpress.com/latex.php?latex=%5Csqrt%7BV%7D%20%5Ctimes%20%5Csqrt%7BV%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sqrt{V} \times \sqrt{V}' title='\sqrt{V} \times \sqrt{V}' class='latex' /> has <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' /> pixels: it can thus be stored as a string/vector of length/size <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' />. This means that a document in the bag-of-words model can be represented as an image, where pixels correspond to words and pixel intensities correspond to word counts!</p>
<p>As put previously, we first need to fix the word distribution of each topic. Let&#8217;s arbitrarily create <strong>10</strong> topics.</p>
<p><strong>5</strong> with &#8220;vertical&#8221; bars:</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic4.png" style="border:1px solid" /></td>
</tr>
</table>
<p>and another <strong>5</strong> with &#8220;horizontal&#8221; bars:</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>Each topic distribution is represented by a <strong>5&#215;5</strong> image, so the vocabulary is of size <strong>25</strong>. Black pixels correspond to words that the topic will never possibly generate. White pixels correspond to words that the topic can generate with probability 1/5.</p>
<p>Now let&#8217;s generate <strong>500</strong> documents using the generative process previously described. Here are 3 examples of such generated documents.</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/doc0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/doc1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/doc2.png" style="border:1px solid" /></td>
</tr>
</table>
<p>We clearly see bars emerging from the documents and can thus confirm that documents are mixtures of topics.</p>
<p>We can now use the generated documents as training data. If the Gibbs sampler is correctly implemented, we should be able to recover the original topics. Here are the results for the 1st, 6th and 26th iterations. The number between brackets is the log-likelihood.</p>
<p>1st iteration (-278541.7835):</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic0-0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-4.png" style="border:1px solid" /></td>
</tr>
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic0-5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>5th iteration (-165139.56193):</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic5-0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-4.png" style="border:1px solid" /></td>
</tr>
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic5-5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>[...]</p>
<p>26th iteration (-129272.328181):</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic25-0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-4.png" style="border:1px solid" /></td>
</tr>
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic25-5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>After a few iterations, we see that the algorithm recovered the topics correctly. Also, the log-likelihood increases: as the number of iterations increases, it becomes more and more likely that the model generated the data. The fact that it works pretty well is not surprising: the data used were generated by sticking to the model assumptions.</p>
<h3>Gibbs sampling</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampler</a> used is said to be collapsed: the parameters of interest are not sampled directly. Instead we sample the topic assignments and the parameters can be computed in terms of those.</p>
<p>It is not necessarily obvious from the equation of the full conditional distribution (from which the topic assignments are sampled) but the sampler is naturally sparse: it doesn&#8217;t need to iterate over words with zero-count. This is a nice property, given that sampling algorithms are often considered slow.</p>
<h3>Source code</h3>
<p><a href="http://gist.github.com/542786">http://gist.github.com/542786</a></p>
<p>Fairly readable and compact code but to be considered a toy implementation.</p>
<h3>Useful Resources</h3>
<h4>MCMC</h4>
<p>- &#8220;<a href="http://videolectures.net/mlss09uk_murray_mcmc/">MCMC lecture at MLSS09</a>&#8221; (Iain Murray). Nice for a first general overview and the insights.</p>
<p>- &#8220;Gibbs sampling for the uninitiated&#8221; (Resnik and Hardisty). Nice for a first general overview and the insights.</p>
<p>- &#8220;Pattern Recognition and Machine Learning&#8221; (Bishop), Chapters 8 and 11 on graphical models and sampling methods. Excellent chapters.</p>
<p>- &#8220;<a href="http://users.aims.ac.za/~ioana/">Review Course: Markov Chains and Monte Carlo Methods</a>&#8221; (Cosma and Evers). Very nice free online course and solutions to exercises in Python and R!</p>
<h4>LDA</h4>
<p>- &#8220;Latent Dirichlet Allocation&#8221; (Blei et al, 2003). By Blei himself.</p>
<p>- &#8220;Finding scientific topics&#8221; (Griffiths and Steyvers). Insightful comments and nice intuitive graphical example.</p>
<p>- &#8220;Parameter Estimation for text analysis&#8221; (Heinrich). Very nice introduction to Bayesian thinking. Pseudo-code for the LDA Gibbs sampler.</p>
<p>- &#8220;On an equivalence between PLSI and LDA&#8221; (Girolami and Kaban). Connections between pLSA and LDA.</p>
<p>- &#8220;Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naive Bayes for Collapsed Gibbs Sampling&#8221; (Carpenter). Very detailed, step-by-step derivation of the collapsed Gibbs samplers for LDA and NB.</p>
<p>- &#8220;Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details&#8221; (Wang). Insightful comments and pseudo-code of the LDA Gibbs sampler.</p>
<h4>Other Python implementations</h4>
<p>- <a href="http://github.com/nrolland/pyLDA">nrolland&#8217;s pyLDA</a>. Works fine but mixes Python-style and Numpy-style.</p>
<p>- <a href="http://github.com/alextp/pylda">alextp&#8217;s pylda</a>. Numpy-style but not tested.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Semi-supervised Naive Bayes in Python</title>
		<link>http://www.mblondel.org/journal/2010/06/21/semi-supervised-naive-bayes-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/06/21/semi-supervised-naive-bayes-in-python/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 17:47:50 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=126</guid>
		<description><![CDATA[Expectation-Maximization The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can&#8217;t directly count. For example, in pLSA, the document topics z are [...]]]></description>
			<content:encoded><![CDATA[<h3>Expectation-Maximization</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Expectation-maximization_algorithm">Expectation-Maximization</a> (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can&#8217;t directly count. For example, in <a href="http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/">pLSA</a>, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we&#8217;re looking for: the model parameters. First, we make an initial guess, which can be either random or &#8220;our best bet&#8221;. Then, in the E-step, we use our current model parameters to estimate some &#8220;measures&#8221;, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.<br />
<span id="more-126"></span></p>
<h3>Naive Bayes trained with EM</h3>
<p>In their paper &#8220;Semi-supervised Text Classification Using EM&#8221;, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a <a href="http://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised</a> fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:</p>
<ul>
<li>Train a classifier with your labeled data</li>
<li>While the model likelihood increases:
<ul>
<li>E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought as probabilistic/fractional labels.</li>
<li>M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.</li>
</ul>
</ul>
<p>The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.</p>
<h3>Code</h3>
<p>I made a simple implementation of it in Python + Numpy. The code is fairly optimized.</p>
<p>$ git clone http://www.mblondel.org/code/seminb.git</p>
<p><a href="http://www.mblondel.org/gitweb?p=seminb.git">web interface</a></p>
<h3>Implementation details</h3>
<p>Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.</p>
<p>Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28x_i%7Cc_j%29%20%3D%20%5Cprod_%7Bt%7D%5EV%20P%28w_t%7Cc_j%29%5E%7Bx_%7Bit%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}' title='P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}' class='latex' />
<p>However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly exceed the precision of the computer and become zero. The solution is to perform the computations in the log domain:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Clog%20P%28x_i%7Cc_j%29%20%3D%20%5Csum_%7Bt%7D%5EV%20x_%7Bit%7D%20%5Clog%20P%28w_t%7Cc_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)' title='\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)' class='latex' />
<p>To turn around P(x|c), we use Bayes&#8217;rule:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cfrac%7BP%28c_j%29P%28x_i%7Cc_j%29%7D%7BP%28x_i%29%7D%20%3D%20%5Cfrac%7BP%28c_j%29P%28x_i%7Cc_j%29%7D%7B%5Csum_k%20P%28c_k%29P%28x_i%7Cc_k%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}' title='P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}' class='latex' />
<p>By posing <img src='http://s.wordpress.com/latex.php?latex=z_j%20%3D%20%5Clog%20P%28c_j%29%20%2B%20%5Clog%20P%28x_i%7Cc_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='z_j = \log P(c_j) + \log P(x_i|c_j)' title='z_j = \log P(c_j) + \log P(x_i|c_j)' class='latex' />, we get:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cfrac%7Be%5E%7Bz_j%7D%7D%7B%5Csum_k%20e%5E%7Bz_k%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}' title='P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}' class='latex' />
<p>This is the <a href="http://en.wikipedia.org/wiki/Softmax_activation_function">softmax</a> function. However, we are back to our initial problem because, since the <img src='http://s.wordpress.com/latex.php?latex=z_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='z_j' title='z_j' class='latex' /> are likely to tend to -inf, the exponentials are likely to in turn underflow. The trick is to multiply the numerator and denominator by the same constant <img src='http://s.wordpress.com/latex.php?latex=e%5E%7B-m%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e^{-m}' title='e^{-m}' class='latex' />:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cfrac%7Be%5E%7Bz_j-m%7D%7D%7B%5Csum_k%20e%5E%7Bz_k-m%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}' title='P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}' class='latex' />
<p>Setting m to <img src='http://s.wordpress.com/latex.php?latex=max_j~z_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='max_j~z_j' title='max_j~z_j' class='latex' />, the <img src='http://s.wordpress.com/latex.php?latex=z_j-m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='z_j-m' title='z_j-m' class='latex' /> values will get closer to zero. The rationale for this is that computation of the exponential overflows earlier and is less precise for big values (positive or negative) than for small values.</p>
<p>In case this is not enough, we can further define:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cbegin%7Bcases%7D%200%2C%20%26%20%5Cmbox%7Bif%20%7D%20%20z_j%20-%20m%20%20%5Cle%20t%20%20%5C%5C%20%5Cfrac%7Be%5E%7Bz_j-m%7D%7D%7B%5Csum_%7B%5C%7Bk~%3A~z_k-m%20%3E%20t%5C%7D%7D%20e%5E%7Bz_k-m%7D%7D%2C%20%20%26%20%5Cmbox%7Botherwise%7D%20%5Cend%7Bcases%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \begin{cases} 0, &amp; \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m &gt; t\}} e^{z_k-m}},  &amp; \mbox{otherwise} \end{cases}' title='P(y_i=c_j|x_i) = \begin{cases} 0, &amp; \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m &gt; t\}} e^{z_k-m}},  &amp; \mbox{otherwise} \end{cases}' class='latex' />
<p>This truncates the exponentials to zero when <img src='http://s.wordpress.com/latex.php?latex=e%5E%7Bz_j-m%7D%20%5Cle%20e%5Et&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e^{z_j-m} \le e^t' title='e^{z_j-m} \le e^t' class='latex' />. For t=-10, this corresponds to 0.000045. Equivalently we can see it as setting the exponentials to zero when <img src='http://s.wordpress.com/latex.php?latex=e%5E%7Bz_j%7D%20%5Cle%20e%5E%7Bt%2Bm%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e^{z_j} \le e^{t+m}' title='e^{z_j} \le e^{t+m}' class='latex' />. Since both t and and m are negative, this shows that subtracting the maximum m, as explained before, does help improving the precision.</p>
<h3>Reference</h3>
<blockquote><p>Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/06/21/semi-supervised-naive-bayes-in-python/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>LSA and pLSA in Python</title>
		<link>http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/#comments</comments>
		<pubDate>Sun, 13 Jun 2010 17:42:58 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=125</guid>
		<description><![CDATA[Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">Latent Semantic Analysis</a> (LSA) and its <a href="http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">probabilistic counterpart</a> pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are thus <a href="http://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised</a> <a href="http://en.wikipedia.org/wiki/Dimension_reduction">dimensionality reduction</a> techniques. Applications include <a href="http://en.wikipedia.org/wiki/Information_Retrieval">information retrieval</a>, document classification and <a href="http://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering</a>.<br />
<span id="more-125"></span><br />
Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing. </p>
<h3>Comparison</h3>
<table border="1" style="margin-top:10px">
<tr>
<th>&nbsp;</th>
<th>LSA</th>
<th>pLSA</th>
</tr>
<tr>
<td>1. Theoretical background</td>
<td>Linear Algebra</td>
<td>Probabilities and Statistics</td>
</tr>
<tr>
<td>2. Objective function</td>
<td>Frobenius norm</td>
<td>Likelihood function</td>
</tr>
<tr>
<td>3. Polysemy</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>4. Folding-in</td>
<td>Straightforward</td>
<td>Complicated</td>
</tr>
</table>
<p>1. LSA stems from Linear Algebra as it is nothing more than a <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a>. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).</p>
<p>2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the <a href="http://en.wikipedia.org/wiki/Matrix_norm">Frobenius norm</a> of the difference with the original matrix). Moreover, as it is well known in Machine Learning, the <a href="http://en.wikipedia.org/wiki/Least_squares">least squares</a> solution corresponds to the <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">Maximum Likelihood</a> solution when experimental errors are gaussian. Therefore, LSA makes an implicit assumption of gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling. </p>
<p>The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.</p>
<p>Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).</p>
<p>3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.</p>
<p>4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected in the lower-dimensional space too. This is called &#8220;folding-in&#8221;. Clearly, new documents folded-in don&#8217;t contribute to learning the factored representation so it is necessary to rebuild the model using all the documents from time to time. </p>
<p>In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm. </p>
<h3>Implementation in Python</h3>
<p>LSA is straightforward to implement as it is nothing more than a SVD and Numpy&#8217;s Linear Algebra module has a function &#8220;svd&#8221; already. This function has an argument full_matrices which when set to False greatly reduces the time required. This argument doesn&#8217;t mean that the SVD is not full, just that the returned matrices don&#8217;t contain vectors corresponding to zero singular values. Scipy&#8217;s Linear Algebra package unfortunately doesn&#8217;t seem to have a sparse SVD. Likewise, there&#8217;s no truncated SVD (there exists fast algorithms to directly compute a truncated SVD rather than computing the full SVD then taking the top K singular values).</p>
<p>pLSA&#8217;s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.</p>
<p>The implementation is sparse. It works with both Numpy&#8217;s ndarrays and Scipy&#8217;s sparse matrices.</p>
<p>$ git clone http://www.mblondel.org/code/plsa.git</p>
<p><a href=" http://www.mblondel.org/gitweb?p=plsa.git">web interface</a></p>
<p>Next, I would like to explore Fisher Kernels as there seems to have nice interactions with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it&#8217;s more challenging. LDA is a Bayesian extension of pLSA : pLSA is equivalent to LDA under a uniform Dirichlet prior distribution.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>Seam Carving in Python</title>
		<link>http://www.mblondel.org/journal/2010/02/09/seam-carving-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/02/09/seam-carving-in-python/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 15:57:20 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Image Processing]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=123</guid>
		<description><![CDATA[Seam Carving is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper &#8220;Seam Carving for Content-Aware Image Resizing&#8220;. Miyako Island, Okinawa, Japan. The principle is very simple. Find the connected paths of low energy pixels (&#8220;the seams&#8221;). This can be done efficiently by dynamic programming (see my [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Seam_carving">Seam Carving</a> is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper &#8220;<a href="http://www.shaiavidan.org/papers/imretFinal.pdf">Seam Carving for Content-Aware Image Resizing</a>&#8220;.</p>
<p><a href="http://www.flickr.com/photos/ippei-janine/2165260667/"><img src="http://www.mblondel.org/images/seam-carving/okinawa.jpg" style="border: 1px solid black" /></a><br />
<em>Miyako Island, Okinawa, Japan.</em><br />
<span id="more-123"></span><br />
The principle is very simple. Find the connected paths of low energy pixels (&#8220;the seams&#8221;). This can be done efficiently by <a href="http://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a> (see my post on <a href="http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/">DTW</a>). </p>
<p><img src="http://www.mblondel.org/images/seam-carving/okinawa_gradient.jpg" /><br />
<em>Same image in the gradient domain showing the vertical and horizontal seams of lowest cumulated energy.</em></p>
<p>The seams of lowest cumulated energy can be seen as the pixels contributing the least to an image. By repeatedly removing or adding seams, it is thus possible to perform &#8220;content-aware&#8221; image reduction or extension. The resulting images feel more natural, less &#8220;streched&#8221;.</p>
<p><img src="http://www.mblondel.org/images/seam-carving/okinawa_good.jpg" style="border: 1px solid black" /><br />
<em>Height reduced by 50% by seam carving.</em></p>
<p><img src="http://www.mblondel.org/images/seam-carving/okinawa_bad.jpg" style="border: 1px solid black"  /><br />
<em>Height reduced by 50% by traditional rescaling.</em></p>
<p>Although seam carving doesn&#8217;t need human intervention, in the original paper, a graphical user interface (GUI) was also developed to let the user define areas that can&#8217;t be removed, or conversely, that must be removed.</p>
<p>In my opinion, seam carving is simple and elegant. No sophisticated object recognition algorithm was used, yet the results are quite impressive.</p>
<p>You can find my implementation in 250 lines of Python in my git repo:</p>
<p>$ git clone http://www.mblondel.org/code/seam-carving.git</p>
<p><a href="http://www.mblondel.org/gitweb?p=seam-carving.git;a=tree">web interface</a></p>
<p>Unfortunately, it&#8217;s too slow to be real-time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/02/09/seam-carving-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Easy parallelization with data decomposition</title>
		<link>http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/</link>
		<comments>http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 18:17:28 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
				<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=120</guid>
		<description><![CDATA[Recently I came across this blog post which introduced me to the new multiprocessing module in Python 2.6, a module to execute multiple concurrent processes. It makes parallelizing your programs very easy. The author also provided a smart code snippet that makes using multiprocessing even easier. I studied how the snippet works and I came [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I came across this <a href="http://gael-varoquaux.info/blog/?p=119">blog post</a> which introduced me to the new <a href="http://docs.python.org/library/multiprocessing.html">multiprocessing</a> module in Python 2.6, a module to execute multiple concurrent processes.  It makes parallelizing your programs very easy. The author also provided a smart code snippet that makes using multiprocessing even easier. I studied how the snippet works and I came up with an alternative solution which is in my opinion very elegant and easy to read. I&#8217;m so excited about the new possibilities provided by this module that I had to spread the word. But first, off to some background.</p>
<p><span id="more-120"></span></p>
<h3>The multi-core trend</h3>
<p>Moore&#8217;s law states that:</p>
<blockquote><p>The density of transistors on chips doubles every 24 month. </p></blockquote>
<p>Although <a href="http://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s law</a>, contrary to what is often thought <a href="http://www.thinkingparallel.com/2007/07/09/moores-law-is-dead-long-live-moores-law/">still holds true</a>, the exponential processor transistor growth predicted by Moore does not always translate into exponentially greater practical computing performance. Therefore parallel computation has recently become necessary to take full advantage of the gains allowed by Moore&#8217;s law. This explains the recent <a href="http://en.wikipedia.org/wiki/Multi-core">multi-core</a> trend: most recent computers are now equipped with 2 or more cores.</p>
<p>The problem is that you can&#8217;t just use multi-core equipped computers and hope that your programs will run faster on them. Programs need be modified to operate in a parallel fashion as opposed to a sequential fashion.</p>
<p>At the same time, languages like Ruby and Python are famous for their GIL (<a href="http://en.wikipedia.org/wiki/Global_Interpreter_Lock">Global Interpreter Lock</a>). Because of the GIL, even programs that are designed to be parallel can effectively use only one core at a time, resulting in no speed improvement. Parallelism here is just an illusion: the processor switches between threads but does so frequently that the user perceive the operations as being performed in parallel. </p>
<p>The novelty of the multiprocessing module in Python 2.6 is that is uses processes instead of threads (see <a href="http://en.wikipedia.org/wiki/Thread_%28computer_science%29#Threads_compared_with_processes">Threads compared with processes</a>) and it does not suffer from the GIL. Programs running on multi-cores can therefore operate in a truly parallel fashion.</p>
<h3>Parallelizing programs</h3>
<p>To make things simpler, let me quote the excellent blog post <a href="http://www.thinkingparallel.com/2007/09/06/how-to-split-a-problem-into-tasks/">How-to Split a Problem into Tasks</a>.</p>
<blockquote><p>
The very first step in every successful parallelization effort is always the same: you take a look at the problem that needs to be solved and start splitting it into tasks that can be computed in parallel. [...] what I am describing here is also called problem decomposition. The goal here is to divide the problem into several smaller subproblems, called tasks that can be computed in parallel later on. The tasks can be of different size and must not necessarily be independent.
</p></blockquote>
<p>And, about data decomposition:</p>
<blockquote><p>
When data structures with large amounts of similar data need to be processed, data decomposition is usually a well-performing decomposition technique. The tasks in this strategy consist of groups of data. These can be either input data, output data or even intermediate data, decompositions to all varieties are possible and may be useful. All processors perform the same operations on these data, which are often independent from one another. This is my favorite decomposition technique, because it is usually easy to do, often has no dependencies in between tasks and scales really well.
</p></blockquote>
<p>Data decomposition is so straightforward that it can without any doubt be called <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a>.</p>
<h3>Map</h3>
<p>If you are a Python user, you most probably know <a href="http://en.wikipedia.org/wiki/List_comprehension">list comprehensions</a>:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">math</span> <span style="color: #ff7700;font-weight:bold;">import</span> sqrt
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: black;">&#91;</span>sqrt<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span> 
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>In this example, sqrt is applied to each element of the list and a list is returned. The resulting list and the input list are therefore the same size.</p>
<p>Probably less known are generator comprehensions, which can be written by replacing the outer brackets with parentheses:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> gen = <span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> 
<span style="color: #66cc66;">&lt;</span>generator <span style="color: #008000;">object</span> at 0xb7cec56c<span style="color: #66cc66;">&gt;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> gen: <span style="color: #ff7700;font-weight:bold;">print</span> i
<span style="color: #ff4500;">1.0</span>
<span style="color: #ff4500;">2.0</span>
<span style="color: #ff4500;">3.0</span>
<span style="color: #ff4500;">4.0</span></pre></div></div>

<p>The difference between list and generator comprehensions is that list comprehensions are evaluated entirely before returning, while generator comprehensions yield results one by one. Generators are therefore more &#8220;lazy&#8221; and can results in big memory savings when iterating over large lists.</p>
<p>The outer parentheses can even be omitted when calling functions with only 1 argument:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> 
<span style="color: #66cc66;">&lt;</span>generator <span style="color: #008000;">object</span> at 0xb7cec68c<span style="color: #66cc66;">&gt;</span></pre></div></div>

<p>For those more familiar with functional programming, list comprehensions are similar to the <a href="http://en.wikipedia.org/wiki/Map_%28higher-order_function%29">map</a> higher-order function.</p>

<div class="wp_syntax"><div class="code"><pre class="scheme" style="font-family:monospace;">&nbsp;</pre></div></div>

<p>In fact, Python has a built-in map function too.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span>sqrt, <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<h3>Reduce</h3>
<p>While map applies a function to each element of a list and returns the resulting list, <a href="http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29">reduce</a> is a higher-order function that uses another function to combine the elements of a list in some way.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> plus = concatenate = <span style="color: #ff7700;font-weight:bold;">lambda</span> x,y: x+y
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #008000;">reduce</span><span style="color: black;">&#40;</span>plus, <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span>,<span style="color: #ff4500;">3</span>,<span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: #ff4500;">10</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #008000;">reduce</span><span style="color: black;">&#40;</span>concatenate, <span style="color: black;">&#91;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>, <span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>,<span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span></pre></div></div>

<h3>multiprocessing.Pool &#8216;s map</h3>
<p>Applying a function to each element of a list with map kind of assumes that the function is <a href="http://en.wikipedia.org/wiki/Pure_function">pure</a>, i.e. that the result output by the function is only function of its input arguments. Although nothing prevents you from giving an impure function as argument to map, it is dirty, potentially dangerous and not the functional philosophy. Concretely, it means that you&#8217;d better not use global variables or anything the state of which may be changed during the program execution, in your functions. This thus also includes instance methods (an object, in essence, encapsulates a state).</p>
<p>To reuse the terminology above, if we think of applying our function to each element of the list as tasks, then our tasks are independent from each other and so there&#8217;s is no reason to operate over the list sequentially. Independence is also very nice because communication and collaboration between threads/processes happen to be one of the most difficult aspect of concurrent programming. Here, no communication between threads/processes is required.</p>
<p>And here comes the new multiprocessing module and more particularly its Pool class. This class represents pools of worker processes and has a map method, which is similar to the map built-in function.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">from</span> multiprocessing <span style="color: #ff7700;font-weight:bold;">import</span> Pool
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool = Pool<span style="color: black;">&#40;</span>processes=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>sqrt, <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">4</span>,<span style="color: #ff4500;">9</span>,<span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>The difference with the built-in map here is that 4 processes are used. This will result in about a 4x speedup if the computer running the program has at least 4 cores. Of course, sqrt is a toy example but here&#8217;s a real-life example in a Machine Learning context.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> image_sets = <span style="color: black;">&#91;</span>set1, ..., setn<span style="color: black;">&#93;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> preprocessed = pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>preprocess, images_sets<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> feat_sets = pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>feat_extract, preprocessed<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> models = pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>train, feat_sets<span style="color: black;">&#41;</span></pre></div></div>

<p>As long as you can write your code as list comprehensions, you can apply the data decomposition approach. It&#8217;s easy, abuse it!</p>
<p>However, spawning a process has a cost because of context switching. Therefore, when the function to be applied on each element returns quasi instantaneously, it may be worth splitting the data into larger chunks, run each chunk in a separate process and then recombine the results with reduce. (See also <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a>)</p>
<h3>Helpers</h3>
<p>Here are some helpers which make parallelizing your list comprehensions even more straightforward and easy to read.</p>
<p>As mentioned before, the <a href="http://gael-varoquaux.info/blog/?p=119">blog post</a> that introduced me to this new multiprocessing module also came with a smart code snippet. I reworked it to fit my liking and this is what it looks like now:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> sqrtd = delayed<span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> powd = delayed<span style="color: black;">&#40;</span><span style="color: #008000;">pow</span><span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> squares = <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool_parallelize<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>sqrtd<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> squares<span style="color: black;">&#93;</span>, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool_parallelize<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>powd<span style="color: black;">&#40;</span>i, <span style="color: #ff4500;">0.5</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> squares<span style="color: black;">&#93;</span>, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>Contrary to Pool&#8217;s map, this supports parallelizing functions of any arity.</p>
<p>Then I came up with this solution, which reduces the typing and is quite elegant.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> sqrtp = parallelized<span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> powp = parallelized<span style="color: black;">&#40;</span><span style="color: #008000;">pow</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> sqrtp<span style="color: black;">&#40;</span>squares, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> powp<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: black;">&#40;</span>i, <span style="color: #ff4500;">0.5</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> squares<span style="color: black;">&#93;</span>, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>Code available <strong><a href="http://www.mblondel.org/files/parallel.py.txt">here</a></strong>.</p>
<h3>Conclusion</h3>
<p>You really gotta love functional programming in Python! (By the way, see also Charming Python: Functional programming in Python <a href="http://www.ibm.com/developerworks/library/l-prog.html">part 1</a> and <a href="http://www.ibm.com/developerworks/library/l-prog2.html">part 2</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
