<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Mathieu's log</title>
	<atom:link href="http://www.mblondel.org/journal/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mblondel.org/journal</link>
	<description>Computer science, Chinese, Japanese, random thoughts…</description>
	<pubDate>Sat, 21 Aug 2010 20:52:00 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>Latent Dirichlet Allocation in Python</title>
		<link>http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 20:52:00 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Machine Learning]]></category>

		<category><![CDATA[Natural Language Processing]]></category>

		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=127</guid>
		<description><![CDATA[Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) - see my previous post &#8220;LSA and pLSA in Python&#8221;, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the &#8220;topics&#8221; expressed by documents in that collection. LDA can be seen as a Bayesian [...]]]></description>
			<content:encoded><![CDATA[<p>Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA), both covered in my previous post &#8220;<a href="http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/">LSA and pLSA in Python</a>&#8221;, Latent Dirichlet Allocation (LDA) is an algorithm that, given a collection of documents and nothing more (no supervision needed), can uncover the &#8220;topics&#8221; expressed by documents in that collection. LDA can be seen as a <a href="http://en.wikipedia.org/wiki/Bayesian_inference">Bayesian</a> extension of pLSA.</p>
<p>As Blei, the author of LDA, points out, the topic proportions in pLSA are tied to the training documents. This is problematic: 1) the number of parameters grows linearly with the number of training documents, which can cause serious <a href="http://en.wikipedia.org/wiki/Overfitting">overfitting</a>, and 2) generalizing to new documents is difficult and requires so-called &#8220;folding-in&#8221;. LDA fixes those issues by being a fully generative model: where pLSA uses a matrix of P(topic|document) probabilities, LDA uses a distribution over topics.</p>
<p>To date, there exist several parameter estimation schemes for LDA: variational Bayes, expectation propagation and Gibbs sampling. I&#8217;ve chosen to implement the last one. It was first described in a paper entitled &#8220;Finding scientific topics&#8221;, by Griffiths and Steyvers.</p>
<h3>Artificial data</h3>
<p>As with all model-based algorithms, during the early development phase, it is useful to work with artificial data, generated by following the model assumptions. In the case of LDA (and pLSA), the core assumption is that words (w) in documents are generated by a mixture of topics (z). In other words, the probability of a word is:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28w%29%20%3D%20%5Csum_%7Bz%7D%20P%28w%7Cz%29%20P%28z%29&#038;bg=ffffff&#038;fg=000000&#038;s=1' alt='P(w) = \sum_{z} P(w|z) P(z)' title='P(w) = \sum_{z} P(w|z) P(z)' class='latex' />
<p>The generative process can be summarized as follows: 1) set the topic proportions once and for all when the collection is instantiated and 2) for each document and for as many words as needed, sample a topic from the topic distribution and sample a word from the word distribution of the selected topic. Obviously, this is only an approximation of how documents are created in reality.</p>
<p>To generate an artificial dataset, we can fix the word distribution of each topic and then generate documents as explained above. Since we generated documents by sticking to the generative assumption of the model, if the algorithm is correctly implemented, it should be able to recover the word distribution of each topic, from the generated documents.</p>
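<p>The generative process above can be sketched in a few lines of NumPy. This is only an illustration and not the implementation linked in this post: the names <code>generate_documents</code>, <code>topic_word</code> and <code>topic_proportions</code> are hypothetical, and a fixed document length is assumed for simplicity.</p>

```python
import numpy as np

def generate_documents(topic_word, topic_proportions, n_docs, doc_length, seed=0):
    """Sample bag-of-words count vectors from a mixture of topics.

    topic_word: array (n_topics, vocab_size), row z holds P(w|z).
    topic_proportions: array (n_topics,), P(z), fixed for the whole collection.
    """
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = topic_word.shape
    docs = np.zeros((n_docs, vocab_size), dtype=int)
    for d in range(n_docs):
        # Sample one topic per word position...
        topics = rng.choice(n_topics, size=doc_length, p=topic_proportions)
        for z in topics:
            # ...then one word from the word distribution of that topic.
            w = rng.choice(vocab_size, p=topic_word[z])
            docs[d, w] += 1
    return docs

# Toy example (assumed values): 2 topics over a vocabulary of 4 words.
topic_word = np.array([[0.5, 0.5, 0.0, 0.0],
                       [0.0, 0.0, 0.5, 0.5]])
docs = generate_documents(topic_word, np.array([0.5, 0.5]), n_docs=3, doc_length=20)
```

<p>Each row of <code>docs</code> is one document in the bag-of-words representation; the row sums equal the chosen document length.</p>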
<h3>Graphical example</h3>
<p>To gain insight and intuition, we can reuse the graphical example from Griffiths and Steyvers&#8217; paper. </p>
<p>In the <a href="http://en.wikipedia.org/wiki/Bag_of_words_model">bag-of-words model</a>, documents are represented by vectors of dimension <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' />, where <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' /> is the vocabulary size. Moreover, an image of size <img src='http://s.wordpress.com/latex.php?latex=%5Csqrt%7BV%7D%20%5Ctimes%20%5Csqrt%7BV%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sqrt{V} \times \sqrt{V}' title='\sqrt{V} \times \sqrt{V}' class='latex' /> has <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' /> pixels: it can thus be stored as a string/vector of length/size <img src='http://s.wordpress.com/latex.php?latex=V&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='V' title='V' class='latex' />. This means that a document in the bag-of-words model can be represented as an image, where pixels correspond to words and pixel intensities correspond to word counts!</p>
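<p>As a small illustration of this correspondence (the word counts below are made up), a count vector of length V can be reshaped into a square image and flattened back without losing anything:</p>

```python
import math
import numpy as np

V = 25                                # vocabulary size, assumed to be a perfect square
side = math.isqrt(V)                  # image side length, here 5
counts = np.arange(V)                 # hypothetical word counts for one document
image = counts.reshape(side, side)    # pixel (r, c) holds the count of word r * side + c
restored = image.ravel()              # back to the bag-of-words vector
```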
<p>As mentioned previously, we first need to fix the word distribution of each topic. Let&#8217;s arbitrarily create <strong>10</strong> topics.</p>
<p><strong>5</strong> with &#8220;vertical&#8221; bars:</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic4.png" style="border:1px solid" /></td>
</tr>
</table>
<p>and another <strong>5</strong> with &#8220;horizontal&#8221; bars:</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>Each topic distribution is represented by a <strong>5&#215;5</strong> image, so the vocabulary is of size <strong>25</strong>. Black pixels correspond to words that the topic can never generate. White pixels correspond to words that the topic can generate with probability 1/5.</p>
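<p>The ten bar topics can be built programmatically. The sketch below is one possible way to do it; the helper name <code>bar_topics</code> and the ordering (vertical bars first, then horizontal ones) are assumptions made for the example.</p>

```python
import numpy as np

def bar_topics(side=5):
    """Return an array (2 * side, side * side) of word distributions.

    The first `side` topics are vertical bars, the next `side` are horizontal bars.
    """
    topics = np.zeros((2 * side, side, side))
    for i in range(side):
        topics[i, :, i] = 1.0 / side           # vertical bar: column i is "white"
        topics[side + i, i, :] = 1.0 / side    # horizontal bar: row i is "white"
    return topics.reshape(2 * side, side * side)

topic_word = bar_topics()
```

<p>Every row of <code>topic_word</code> has exactly five non-zero entries equal to 1/5 and sums to one, so it is a valid word distribution.</p>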
<p>Now let&#8217;s generate <strong>500</strong> documents using the generative process previously described. Here are 3 examples of such generated documents.</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/doc0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/doc1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/doc2.png" style="border:1px solid" /></td>
</tr>
</table>
<p>We clearly see bars emerging from the documents and can thus confirm that documents are mixtures of topics.</p>
<p>We can now use the generated documents as training data. If the Gibbs sampler is correctly implemented, we should be able to recover the original topics. Here are the results for the 1st, 6th and 26th iterations. The number in parentheses is the log-likelihood.</p>
<p>1st iteration (-278541.7835):</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic0-0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-4.png" style="border:1px solid" /></td>
</tr>
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic0-5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic0-9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>6th iteration (-165139.56193):</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic5-0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-4.png" style="border:1px solid" /></td>
</tr>
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic5-5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic5-9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>[...]</p>
<p>26th iteration (-129272.328181):</p>
<table border="0" cellspacing="10">
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic25-0.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-1.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-2.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-3.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-4.png" style="border:1px solid" /></td>
</tr>
<tr>
<td><img src="http://www.mblondel.org/images/lda/topic25-5.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-6.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-7.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-8.png" style="border:1px solid" /></td>
<td><img src="http://www.mblondel.org/images/lda/topic25-9.png" style="border:1px solid" /></td>
</tr>
</table>
<p>After a few iterations, we see that the algorithm recovered the topics correctly. Also, the log-likelihood increases: as the number of iterations increases, it becomes more and more likely that the model generated the data. The fact that it works pretty well is not surprising: the data used were generated by sticking to the model assumptions.</p>
<h3>Gibbs sampling</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampler</a> used is said to be collapsed: the parameters of interest are not sampled directly. Instead we sample the topic assignments and the parameters can be computed in terms of those.</p>
<p>It is not necessarily obvious from the equation of the full conditional distribution (from which the topic assignments are sampled) but the sampler is naturally sparse: it doesn&#8217;t need to iterate over words with a zero count. This is a nice property, given that sampling algorithms are often considered slow.</p>
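<p>To make this concrete, here is a minimal sketch of a collapsed Gibbs sampler, assuming the standard full conditional from Griffiths and Steyvers with symmetric hyperparameters <code>alpha</code> and <code>beta</code>. It is not the code linked in this post; all function names and the toy data are hypothetical. Documents are stored as lists of word ids, so words that never occur in a document are never visited.</p>

```python
import numpy as np

def init_state(docs, n_topics, vocab_size, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [rng.integers(n_topics, size=len(words)) for words in docs]
    n_dk = np.zeros((len(docs), n_topics))    # tokens of document d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))   # occurrences of word w assigned to topic k
    for d, words in enumerate(docs):
        for w, k in zip(words, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
    return z, n_dk, n_kw

def gibbs_sweep(docs, z, n_dk, n_kw, alpha, beta, rng):
    """Resample the topic of every token once (one collapsed Gibbs iteration)."""
    vocab_size = n_kw.shape[1]
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            n_dk[d, k] -= 1                   # remove the current assignment
            n_kw[k, w] -= 1
            # Full conditional over topics, up to a normalizing constant.
            p = (n_kw[:, w] + beta) / (n_kw.sum(axis=1) + vocab_size * beta)
            p = p * (n_dk[d] + alpha)
            k = rng.choice(len(p), p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1                   # record the new assignment
            n_kw[k, w] += 1

def topic_word_estimate(n_kw, beta):
    """Compute P(w|z) from the counts; the parameters are never sampled directly."""
    smoothed = n_kw + beta
    return smoothed / smoothed.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2], [0, 1, 2, 3]]   # toy documents as word ids
z, n_dk, n_kw = init_state(docs, n_topics=2, vocab_size=4, rng=rng)
for _ in range(20):
    gibbs_sweep(docs, z, n_dk, n_kw, alpha=0.1, beta=0.01, rng=rng)
phi = topic_word_estimate(n_kw, beta=0.01)
```

<p>Recomputing <code>n_kw.sum(axis=1)</code> for every token keeps the sketch short; a real implementation would more likely maintain the per-topic totals incrementally.</p>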
<h3>Source code</h3>
<p><a href="http://gist.github.com/542786">http://gist.github.com/542786</a></p>
<p>The code is fairly readable and compact, but it should be considered a toy implementation.</p>
<h3>Useful Resources</h3>
<h4>MCMC</h4>
<p>- &#8220;<a href="http://videolectures.net/mlss09uk_murray_mcmc/">MCMC lecture at MLSS09</a>&#8221; (Iain Murray). Nice for a first general overview and the insights.</p>
<p>- &#8220;Gibbs sampling for the uninitiated&#8221; (Resnik and Hardisty). Nice for a first general overview and the insights.</p>
<p>- &#8220;Pattern Recognition and Machine Learning&#8221; (Bishop), Chapters 8 and 11 on graphical models and sampling methods. Excellent chapters.</p>
<p>- &#8220;<a href="http://users.aims.ac.za/~ioana/">Review Course: Markov Chains and Monte Carlo Methods</a>&#8221; (Cosma and Evers). Very nice free online course and solutions to exercises in Python and R!</p>
<h4>LDA</h4>
<p>- &#8220;Latent Dirichlet Allocation&#8221; (Blei et al, 2003). By Blei himself.</p>
<p>- &#8220;Finding scientific topics&#8221; (Griffiths and Steyvers). Insightful comments and nice intuitive graphical example.</p>
<p>- &#8220;Parameter Estimation for text analysis&#8221; (Heinrich). Very nice introduction to Bayesian thinking. Pseudo-code for the LDA Gibbs sampler.</p>
<p>- &#8220;On an equivalence between PLSI and LDA&#8221; (Girolami and Kaban). Connections between pLSA and LDA.</p>
<p>- &#8220;Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naive Bayes for Collapsed Gibbs Sampling&#8221; (Carpenter). Very detailed, step-by-step derivation of the collapsed Gibbs samplers for LDA and NB.</p>
<p>- &#8220;Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details&#8221; (Wang). Insightful comments and pseudo-code of the LDA Gibbs sampler.</p>
<h4>Other Python implementations</h4>
<p>- <a href="http://github.com/nrolland/pyLDA">nrolland&#8217;s pyLDA</a>. Works fine but mixes Python-style and Numpy-style.</p>
<p>- <a href="http://github.com/alextp/pylda">alextp&#8217;s pylda</a>. Numpy-style but not tested.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Semi-supervised Naive Bayes in Python</title>
		<link>http://www.mblondel.org/journal/2010/06/21/semi-supervised-naive-bayes-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/06/21/semi-supervised-naive-bayes-in-python/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 17:47:50 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=126</guid>
		<description><![CDATA[Expectation-Maximization
The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can&#8217;t directly count. For example, in pLSA, the document topics z are latent [...]]]></description>
			<content:encoded><![CDATA[<h3>Expectation-Maximization</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Expectation-maximization_algorithm">Expectation-Maximization</a> (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can&#8217;t directly count. For example, in <a href="http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/">pLSA</a>, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we&#8217;re looking for: the model parameters. First, we make an initial guess, which can be either random or &#8220;our best bet&#8221;. Then, in the E-step, we use our current model parameters to estimate some &#8220;measures&#8221;, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.</p>
<h3>Naive Bayes trained with EM</h3>
<p>In their paper &#8220;Semi-supervised Text Classification Using EM&#8221;, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a <a href="http://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised</a> fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:</p>
<ul>
<li>Train a classifier with your labeled data</li>
<li>While the model likelihood increases:
<ul>
<li>E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought of as probabilistic/fractional labels.</li>
<li>M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.</li>
</ul>
</li>
</ul>
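<p>The loop above can be sketched as follows. This is a simplified illustration, not the code from the repository below: it assumes a multinomial Naive Bayes with add-one smoothing, at least one labeled example per class, and a fixed number of EM iterations instead of a check on the likelihood. All function names are hypothetical.</p>

```python
import numpy as np

def train_nb(X, R, smoothing=1.0):
    """M-step: fit multinomial Naive Bayes from counts X (n, V) and soft labels R (n, C)."""
    class_weight = R.sum(axis=0)
    log_prior = np.log(class_weight / class_weight.sum())
    word_counts = R.T @ X + smoothing                        # (C, V), smoothed
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posterior(X, log_prior, log_word):
    """E-step: P(c|x) for every row of X, computed in the log domain."""
    z = X @ log_word.T + log_prior                           # (n, C)
    z = z - z.max(axis=1, keepdims=True)                     # for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def semi_supervised_nb(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    R_lab = np.eye(n_classes)[y_lab]                         # one-hot labels
    params = train_nb(X_lab, R_lab)                          # initial classifier
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = posterior(X_unl, *params)                    # fractional labels
        params = train_nb(X_all, np.vstack([R_lab, R_unl]))
    return params

# Toy data (made up): 2 words, 2 classes.
X_lab = np.array([[3, 0], [0, 3]])
y_lab = np.array([0, 1])
X_unl = np.array([[2, 0], [0, 2], [4, 1]])
params = semi_supervised_nb(X_lab, y_lab, X_unl, n_classes=2)
probs = posterior(np.array([[5, 0]]), *params)
```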
<p>The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.</p>
<h3>Code</h3>
<p>I made a simple implementation of it in Python + Numpy. The code is fairly optimized.</p>
<p>$ git clone http://www.mblondel.org/code/seminb.git</p>
<p><a href="http://www.mblondel.org/gitweb?p=seminb.git">web interface</a></p>
<h3>Implementation details</h3>
<p>Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.</p>
<p>Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28x_i%7Cc_j%29%20%3D%20%5Cprod_%7Bt%7D%5EV%20P%28w_t%7Cc_j%29%5E%7Bx_%7Bit%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}' title='P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}' class='latex' />
<p>However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly underflow the floating-point precision of the computer and become zero. The solution is to perform the computations in the log domain:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Clog%20P%28x_i%7Cc_j%29%20%3D%20%5Csum_%7Bt%7D%5EV%20x_%7Bit%7D%20%5Clog%20P%28w_t%7Cc_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)' title='\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)' class='latex' />
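<p>A quick way to see the problem, using made-up values (a uniform P(w|c) over 5000 words and a document in which each word occurs once): the direct product underflows to zero, while the sum of logs stays finite.</p>

```python
import numpy as np

V = 5000
p_w = np.full(V, 1.0 / V)            # hypothetical P(w_t|c_j): uniform over the vocabulary
x = np.ones(V)                       # hypothetical document: each word occurs once

direct = np.prod(p_w ** x)           # product of 5000 tiny numbers: underflows to 0.0
log_lik = np.sum(x * np.log(p_w))    # same quantity in the log domain: -V * log(V)
```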
<p>To turn P(x|c) into P(c|x), we use Bayes&#8217; rule:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cfrac%7BP%28c_j%29P%28x_i%7Cc_j%29%7D%7BP%28x_i%29%7D%20%3D%20%5Cfrac%7BP%28c_j%29P%28x_i%7Cc_j%29%7D%7B%5Csum_k%20P%28c_k%29P%28x_i%7Cc_k%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}' title='P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}' class='latex' />
<p>By setting <img src='http://s.wordpress.com/latex.php?latex=z_j%20%3D%20%5Clog%20P%28c_j%29%20%2B%20%5Clog%20P%28x_i%7Cc_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='z_j = \log P(c_j) + \log P(x_i|c_j)' title='z_j = \log P(c_j) + \log P(x_i|c_j)' class='latex' />, we get:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cfrac%7Be%5E%7Bz_j%7D%7D%7B%5Csum_k%20e%5E%7Bz_k%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}' title='P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}' class='latex' />
<p>This is the <a href="http://en.wikipedia.org/wiki/Softmax_activation_function">softmax</a> function. However, we are back to our initial problem: since the <img src='http://s.wordpress.com/latex.php?latex=z_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='z_j' title='z_j' class='latex' /> are likely to be large negative numbers, the exponentials are likely to underflow in turn. The trick is to multiply the numerator and denominator by the same constant <img src='http://s.wordpress.com/latex.php?latex=e%5E%7B-m%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e^{-m}' title='e^{-m}' class='latex' />:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cfrac%7Be%5E%7Bz_j-m%7D%7D%7B%5Csum_k%20e%5E%7Bz_k-m%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}' title='P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}' class='latex' />
<p>Setting m to <img src='http://s.wordpress.com/latex.php?latex=max_j%7Ez_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='max_j~z_j' title='max_j~z_j' class='latex' />, the <img src='http://s.wordpress.com/latex.php?latex=z_j-m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='z_j-m' title='z_j-m' class='latex' /> values will get closer to zero. The rationale is that the exponential overflows or underflows sooner, and is less precise, for arguments of large magnitude (positive or negative) than for arguments close to zero.</p>
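<p>Here is a small sketch of the max-subtraction trick with made-up log scores. With the naive formula every exponential underflows to zero, whereas the shifted version gives a proper distribution.</p>

```python
import numpy as np

def softmax(z):
    """Softmax computed after subtracting m = max_j z_j from every z_j."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([-1000.0, -1001.0, -1002.0])   # hypothetical log P(c_j) + log P(x_i|c_j)
naive = np.exp(z)                           # all zeros: exp(-1000) underflows
p = softmax(z)                              # approximately [0.665, 0.245, 0.090]
```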
<p>This trick will improve the situation quite a lot but in case this is not enough:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28y_i%3Dc_j%7Cx_i%29%20%3D%20%5Cbegin%7Bcases%7D%200%2C%20%26%20%5Cmbox%7Bif%20%7D%20%20z_j%20-%20m%20%20%5Cle%20t%20%20%5C%5C%20%5Cfrac%7Be%5E%7Bz_j-m%7D%7D%7B%5Csum_%7B%5C%7Bk%7E%3A%7Ez_k-m%20%3E%20t%5C%7D%7D%20e%5E%7Bz_k-m%7D%7D%2C%20%20%26%20%5Cmbox%7Botherwise%7D%20%5Cend%7Bcases%7D&#038;bg=ffffff&#038;fg=000000&#038;s=2' alt='P(y_i=c_j|x_i) = \begin{cases} 0, &#038; \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m &gt; t\}} e^{z_k-m}},  &#038; \mbox{otherwise} \end{cases}' title='P(y_i=c_j|x_i) = \begin{cases} 0, &#038; \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m &gt; t\}} e^{z_k-m}},  &#038; \mbox{otherwise} \end{cases}' class='latex' />
<p>This sets the exponentials to zero when <img src='http://s.wordpress.com/latex.php?latex=e%5E%7Bz_j-m%7D%20%5Cle%20e%5Et&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e^{z_j-m} \le e^t' title='e^{z_j-m} \le e^t' class='latex' />. For t=-10, the threshold is about 0.000045. Equivalently, this corresponds to setting the exponentials to zero when <img src='http://s.wordpress.com/latex.php?latex=e%5E%7Bz_j%7D%20%5Cle%20e%5E%7Bt%2Bm%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e^{z_j} \le e^{t+m}' title='e^{z_j} \le e^{t+m}' class='latex' />. Since both t and m are negative, this shows that subtracting the maximum m, as explained before, does help improve the precision.</p>
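<p>The thresholded version can be sketched like this, assuming a negative threshold t (so that the largest score is always kept); the function name and the example scores are hypothetical.</p>

```python
import numpy as np

def thresholded_softmax(z, t=-10.0):
    """Softmax where classes with z_j - m at or below t get probability exactly zero."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()
    keep = shifted > t                  # the maximum itself is always kept when t is negative
    e = np.zeros_like(shifted)
    e[keep] = np.exp(shifted[keep])     # only exponentiate the scores above the threshold
    return e / e.sum()

p = thresholded_softmax([0.0, -5.0, -20.0])   # the last class falls below t = -10
```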
<h3>Reference</h3>
<blockquote><p>Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/06/21/semi-supervised-naive-bayes-in-python/feed/</wfw:commentRss>
		</item>
		<item>
		<title>LSA and pLSA in Python</title>
		<link>http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/#comments</comments>
		<pubDate>Sun, 13 Jun 2010 17:42:58 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Machine Learning]]></category>

		<category><![CDATA[Natural Language Processing]]></category>

		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=125</guid>
		<description><![CDATA[Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">Latent Semantic Analysis</a> (LSA) and its <a href="http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">probabilistic counterpart</a> pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually much smaller than the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are <a href="http://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised</a> <a href="http://en.wikipedia.org/wiki/Dimension_reduction">dimensionality reduction</a> techniques. Applications include <a href="http://en.wikipedia.org/wiki/Information_Retrieval">information retrieval</a>, document classification and <a href="http://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering</a>.</p>
<p>Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing. </p>
<h3>Comparison</h3>
<table border="1" style="margin-top:10px">
<tr>
<th>&nbsp;</th>
<th>LSA</th>
<th>pLSA</th>
</tr>
<tr>
<td>1. Theoretical background</td>
<td>Linear Algebra</td>
<td>Probabilities and Statistics</td>
</tr>
<tr>
<td>2. Objective function</td>
<td>Frobenius norm</td>
<td>Likelihood function</td>
</tr>
<tr>
<td>3. Polysemy</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>4. Folding-in</td>
<td>Straightforward</td>
<td>Complicated</td>
</tr>
</table>
<p>1. LSA stems from Linear Algebra as it is nothing more than a <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a>. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).</p>
<p>2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the <a href="http://en.wikipedia.org/wiki/Matrix_norm">Frobenius norm</a> of the difference with the original matrix). Moreover, as is well known in Machine Learning, the <a href="http://en.wikipedia.org/wiki/Least_squares">least squares</a> solution corresponds to the <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">Maximum Likelihood</a> solution when experimental errors are Gaussian. Therefore, LSA makes an implicit assumption of Gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling. </p>
<p>The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.</p>
<p>Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).</p>
<p>3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.</p>
<p>4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected in the lower-dimensional space too. This is called &#8220;folding-in&#8221;. Clearly, new documents folded-in don&#8217;t contribute to learning the factored representation so it is necessary to rebuild the model using all the documents from time to time. </p>
<p>In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm. </p>
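<p>For LSA, folding-in can be sketched as below, assuming a terms-by-documents matrix and the usual projection of a new term-count vector onto the top k left singular vectors, scaled by the inverse singular values. The matrix is random and the helper name <code>fold_in</code> is hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 5))                 # hypothetical term-document matrix (terms x documents)
k = 2                                  # number of latent dimensions kept

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k]

def fold_in(q):
    """Project a term-count vector q (one entry per term) into the k-dimensional space."""
    return (U_k.T @ q) / s_k           # a single matrix-vector product, then a rescaling

new_doc = rng.random(8)                # a document that was not in the corpus
coords = fold_in(new_doc)
```

<p>As a sanity check, folding in a column of <code>A</code> gives back the coordinates that the SVD already assigned to that document, that is, the matching column of <code>Vt_k</code>.</p>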
<h3>Implementation in Python</h3>
<p>LSA is straightforward to implement as it is nothing more than an SVD and Numpy&#8217;s Linear Algebra module has a function &#8220;svd&#8221; already. This function has an argument full_matrices which when set to False greatly reduces the time required. This argument doesn&#8217;t mean that the SVD is not full, just that the returned matrices don&#8217;t contain vectors corresponding to zero singular values. Scipy&#8217;s Linear Algebra package unfortunately doesn&#8217;t seem to have a sparse SVD. Likewise, there&#8217;s no truncated SVD (there exist fast algorithms to directly compute a truncated SVD rather than computing the full SVD then taking the top K singular values).</p>
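<p>A short example of this approach on a random matrix (the sizes and the value of K are arbitrary): compute the SVD with full_matrices=False, then keep the top K singular values to get the rank-K approximation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 20))                           # hypothetical term-document matrix

# With full_matrices=False, U is (100, 20) instead of (100, 100).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

K = 5
A_K = (U[:, :K] * s[:K]) @ Vt[:K]                   # rank-K approximation of A
error = np.linalg.norm(A - A_K)                     # Frobenius norm of the residual
```

<p>The residual <code>error</code> equals the square root of the sum of the squared discarded singular values, which is the least squares property of the SVD.</p>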
<p>pLSA&#8217;s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.</p>
<p>The implementation is sparse. It works with both Numpy&#8217;s ndarrays and Scipy&#8217;s sparse matrices.</p>
<p>$ git clone http://www.mblondel.org/code/plsa.git</p>
<p><a href="http://www.mblondel.org/gitweb?p=plsa.git">web interface</a></p>
<p>Next, I would like to explore Fisher Kernels as they seem to interact nicely with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it&#8217;s more challenging. LDA is a Bayesian extension of pLSA: pLSA is equivalent to LDA under a uniform Dirichlet prior distribution.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/06/13/lsa-and-plsa-in-python/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Little Machine Learner</title>
		<link>http://www.mblondel.org/journal/2010/02/18/the-little-machine-learner/</link>
		<comments>http://www.mblondel.org/journal/2010/02/18/the-little-machine-learner/#comments</comments>
		<pubDate>Thu, 18 Feb 2010 11:57:16 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=124</guid>
		<description><![CDATA[The idea
I&#8217;ve been having this idea on my mind for quite some time: wouldn&#8217;t it be nice to write a book about Machine Learning where each chapter is a literate program?
From Wikipedia:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, [...]]]></description>
			<content:encoded><![CDATA[<h3>The idea</h3>
<p>I&#8217;ve had this idea in mind for quite some time: wouldn&#8217;t it be nice to write a book about <a href="http://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a> where each chapter is a <a href="http://en.wikipedia.org/wiki/Literate_programming">literate program</a>?</p>
<p>From Wikipedia:</p>
<blockquote><p>
The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.
</p></blockquote>
<p>From the <a href="http://pylit.berlios.de/">PyLit</a> homepage:</p>
<blockquote><p>
The idea is that you do not document programs (after the fact), but write documents that contain the programs.
</p></blockquote>
<p>There are plenty of great textbooks about Machine Learning out there, so the point would not be to write yet another one, but write something <em>different</em>.  Here&#8217;s what I had been thinking.</p>
<ul>
<li>Each chapter written as a literate program, organized so as to maximize understanding</li>
<li>Code in Python (+Numpy + Scipy but without any additional dependencies)</li>
<li>Readability over Performance</li>
<li>Intuitions, nice figures, useful tips or tricks</li>
<li>Real-world applications at the end of each chapter</li>
<li>Don&#8217;t shy away from the maths, especially if at high-school or undergraduate level&#8230;</li>
</ul>
<p>I bet that quite a few algorithms can be written this way, yet remain very concise!</p>
<p>Except for the maths part, the closest book to this idea that I know of is probably &#8220;Programming Collective Intelligence: Building Smart Web 2.0 Applications&#8221;, by Toby Segaran.</p>
<h3>An example with logistic regression</h3>
<p>So, in order to experiment with what such a book could look like, I&#8217;ve decided to write a chapter about <a href="http://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a>. Topics I cover include Maximum Likelihood Estimation, Regularization and Cross-validation. At the end, I use heart disease prediction as an example of real-world application. Probably many things could be improved or added but the point for now is mainly to show <em>what it could look like</em>.</p>
<ul>
<li><strong><a href="http://www.mblondel.org/tlml/logreg.py.html">HTML</a></strong></li>
<li><strong><a href="http://www.mblondel.org/tlml/tlml.pdf">PDF</a></strong></li>
<li><strong><a href="http://www.mblondel.org/gitweb?p=tlml.git;a=blob;f=logreg.py">Code</a></strong></li>
</ul>
<h3>Tools</h3>
<p>For the documentation tool, I&#8217;ve decided to go for <a href="http://sphinx.pocoo.org/">Sphinx</a>, which seems to be emerging as the de facto documentation tool in the Python community. It has nice features like syntax highlighting, LaTeX support and <a href="http://matplotlib.sourceforge.net/">matplotlib</a> plot support, and it can output to HTML and PDF.</p>
<p>Normally, in literate programming, there&#8217;s the literate source, which uses some kind of markup language, and tools generate either code or documentation from it. I took a different approach. In my case, the source file is the code and the documentation is extracted from the comments in the code. Technically, it&#8217;s therefore closer to extensively documented code than actual literate programming. It has some limitations but the main advantages are that the program is runnable directly (since Python is interpreted) and the programmer can benefit from syntax highlighting. I wrote a simple program that converts Python source code to reStructuredText, as required for integration with Sphinx.</p>
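<p>The conversion step can be sketched roughly as follows. This is not the actual converter from the repository, only a minimal illustration under simplifying assumptions: the hypothetical function py_to_rst treats every line that starts with a # in the first column as documentation, and emits everything else as a reStructuredText literal block.</p>

```python
def py_to_rst(source):
    """Turn commented Python source into reStructuredText (illustrative only)."""
    out = []
    code = []

    def flush_code():
        # Drop blank lines surrounding the pending code chunk.
        while code and not code[0].strip():
            code.pop(0)
        while code and not code[-1].strip():
            code.pop()
        if code:
            # "::" on its own paragraph introduces a literal (code) block.
            out.extend(["", "::", ""])
            out.extend("    " + line if line.strip() else "" for line in code)
            out.append("")
        del code[:]

    for line in source.splitlines():
        if line.startswith("#"):
            flush_code()
            # Strip the comment marker and one optional space.
            text = line[1:]
            out.append(text[1:] if text.startswith(" ") else text)
        else:
            code.append(line)
    flush_code()
    return "\n".join(out)


example = "# Compute the mean.\ndef mean(xs):\n    return sum(xs) / float(len(xs))\n"
rst = py_to_rst(example)
print(rst)
```

<p>Indented comments stay inside the code blocks in this sketch, so only top-level comments become prose.</p>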
<h3>Interested?</h3>
<p>It took quite some time to collect the information and do the actual writing but I feel like I improved my own understanding in the process, so I&#8217;m thinking of writing a chapter from time to time. If I do so, at the end of my PhD, I may have gathered enough material to make it a real book! The book could affectionately be entitled &#8220;The Little Machine Learner&#8221;, hence the title of this post.</p>
<p>Since Machine Learning is a very large field and to write a better book than I could possibly write alone, I&#8217;m also thinking that it could actually be a collaborative effort (by researchers, students and practitioners). If you&#8217;re interested, please leave a comment. I will create a discussion group if there&#8217;s enough interest.</p>
<p>As usual, the source code is available in my git repo:</p>
<p>$ git clone http://www.mblondel.org/code/tlml.git</p>
<p><a href="http://www.mblondel.org/gitweb?p=tlml.git;a=tree">web interface</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/02/18/the-little-machine-learner/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Seam Carving in Python</title>
		<link>http://www.mblondel.org/journal/2010/02/09/seam-carving-in-python/</link>
		<comments>http://www.mblondel.org/journal/2010/02/09/seam-carving-in-python/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 15:57:20 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[Image Processing]]></category>

		<category><![CDATA[In English]]></category>

		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=123</guid>
		<description><![CDATA[Seam Carving is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper &#8220;Seam Carving for Content-Aware Image Resizing&#8220;.

Miyako Island, Okinawa, Japan.
The principle is very simple. Find the connected paths of low energy pixels (&#8221;the seams&#8221;). This can be done efficiently by dynamic programming (see my post on [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Seam_carving">Seam Carving</a> is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper &#8220;<a href="http://www.shaiavidan.org/papers/imretFinal.pdf">Seam Carving for Content-Aware Image Resizing</a>&#8221;.</p>
<p><a href="http://www.flickr.com/photos/ippei-janine/2165260667/"><img src="http://www.mblondel.org/images/seam-carving/okinawa.jpg" style="border: 1px solid black" /></a><br />
<em>Miyako Island, Okinawa, Japan.</em></p>
<p>The principle is very simple. Find the connected paths of low-energy pixels (&#8220;the seams&#8221;). This can be done efficiently by <a href="http://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a> (see my post on <a href="http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/">DTW</a>).</p>
<p><img src="http://www.mblondel.org/images/seam-carving/okinawa_gradient.jpg" /><br />
<em>Same image in the gradient domain showing the vertical and horizontal seams of lowest cumulated energy.</em></p>
<p>The seams of lowest cumulated energy can be seen as the pixels contributing the least to an image. By repeatedly removing or adding seams, it is thus possible to perform &#8220;content-aware&#8221; image reduction or extension. The resulting images feel more natural, less &#8220;stretched&#8221;.</p>
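<p>To give an idea of the dynamic programming step, here is a minimal sketch in plain Python. It is not the implementation from my repository: the function name find_vertical_seam and the tiny energy grid are made up for illustration, and the energy map is assumed to be given as a list of rows.</p>

```python
def find_vertical_seam(energy):
    """Return one column index per row for the vertical seam of lowest total energy.

    `energy` is a list of rows of numbers (for instance gradient magnitudes).
    """
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = lowest cumulated energy of any seam ending at pixel (r, c)
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            # A seam may come from the pixel above or from its two diagonal neighbours.
            best = min(prev[max(c - 1, 0):c + 2])
            row.append(energy[r][c] + best)
        cost.append(row)

    # Backtrack from the cheapest pixel of the last row.
    c = cost[-1].index(min(cost[-1]))
    seam = [c]
    for r in range(rows - 2, -1, -1):
        lo = max(c - 1, 0)
        window = cost[r][lo:c + 2]
        c = lo + window.index(min(window))
        seam.append(c)
    seam.reverse()
    return seam


energy = [
    [5, 1, 9],
    [9, 8, 1],
    [1, 9, 9],
]
print(find_vertical_seam(energy))  # [1, 1, 0]
```

<p>Removing the returned pixel from each row shrinks the width by one; horizontal seams work the same way on the transposed image.</p>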
<p><img src="http://www.mblondel.org/images/seam-carving/okinawa_good.jpg" style="border: 1px solid black" /><br />
<em>Height reduced by 50% by seam carving.</em></p>
<p><img src="http://www.mblondel.org/images/seam-carving/okinawa_bad.jpg" style="border: 1px solid black"  /><br />
<em>Height reduced by 50% by traditional rescaling.</em></p>
<p>Although seam carving doesn&#8217;t need human intervention, in the original paper, a graphical user interface (GUI) was also developed to let the user define areas that can&#8217;t be removed, or conversely, that must be removed.</p>
<p>In my opinion, seam carving is simple and elegant. No sophisticated object recognition algorithm was used, yet the results are quite impressive.</p>
<p>You can find my implementation in 250 lines of Python in my git repo:</p>
<p>$ git clone http://www.mblondel.org/code/seam-carving.git</p>
<p><a href="http://www.mblondel.org/gitweb?p=seam-carving.git;a=tree">web interface</a></p>
<p>Unfortunately, it&#8217;s too slow to be real-time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/02/09/seam-carving-in-python/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Caching computation tasks</title>
		<link>http://www.mblondel.org/journal/2010/01/27/caching-computation-tasks/</link>
		<comments>http://www.mblondel.org/journal/2010/01/27/caching-computation-tasks/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 14:12:37 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=121</guid>
		<description><![CDATA[When I work on computationally expensive projects (e.g., Machine Learning), I always find myself in the same situation: my programs can be broken down into a chain of tasks, where tasks may depend on the results of other tasks. A typical such chain would be:
preprocessing -> feature-extraction -> training -> evaluation
If I make a modification [...]]]></description>
			<content:encoded><![CDATA[<p>When I work on computationally expensive projects (e.g., Machine Learning), I always find myself in the same situation: my programs can be broken down into a chain of tasks, where tasks may depend on the results of other tasks. A typical such chain would be:</p>
<p>preprocessing -> feature-extraction -> training -> evaluation</p>
<p>If I make a modification in my training algorithm and want to re-evaluate it, I do need to re-run the &#8220;training&#8221; and &#8220;evaluation&#8221; tasks, but I don&#8217;t need and don&#8217;t want to re-run the &#8220;preprocessing&#8221; and &#8220;feature-extraction&#8221; tasks, especially if they take time to compute.</p>
<p>At first, I tried to save and load task results manually. This quickly proved unmanageable so I started to think of ways to automate this. Since I had quite a precise idea of what I wanted, I&#8217;ve decided to write my own tool, at the risk of reinventing the wheel. (I suspect it&#8217;s quite hard to come up with a universal tool, though.) To keep things simple, I&#8217;ve decided to limit the tool&#8217;s scope to projects that can be run on a single computer, typically one with multiple cores. In particular, it won&#8217;t support any kind of distributed computing.<br />
<span id="more-121"></span></p>
<h3>Dependency resolution &#038; Object persistence</h3>
<p>Basically, the tool boils down to <a href="http://en.wikipedia.org/wiki/Topological_sorting">dependency resolution</a> and <a href="http://en.wikipedia.org/wiki/Object_persistence">object persistence</a>. <em>make</em> is an obvious possibility for dependency resolution, but it can only use file modification time to decide whether to recompute tasks or not. In my tool, I check cache availability for a task based on the task inputs (these can be outputs from previous tasks, files, algorithm parameters&#8230;) as well as source code. The source code is also taken into consideration because a task&#8217;s result is likely to change if the task source code has changed.</p>
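<p>The dependency resolution part can be illustrated with a small depth-first topological sort. This is only a sketch and not the code of my tool: the function resolve_order and the deps dictionary are hypothetical, with task names borrowed from the chain above.</p>

```python
def resolve_order(deps, target):
    """Return the tasks needed for `target`, dependencies first.

    `deps` maps each task name to the list of task names it depends on.
    """
    order = []
    visiting = set()

    def visit(task):
        if task in order:
            return
        if task in visiting:
            raise ValueError("circular dependency involving %r" % task)
        visiting.add(task)
        for dep in deps.get(task, []):
            visit(dep)
        visiting.discard(task)
        order.append(task)

    visit(target)
    return order


deps = {
    "feature-extraction": ["preprocessing"],
    "training": ["feature-extraction"],
    "evaluation": ["feature-extraction", "training"],
}
print(resolve_order(deps, "evaluation"))
```

<p>Each task appears once, after everything it depends on, so the tasks can be run (or loaded from cache) in that order.</p>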
<p>To store objects on the filesystem or in a database, you need a way to serialize and deserialize objects. In the Python world, the obvious choice is <a href="http://docs.python.org/library/pickle.html">pickle</a>, which is also used by the module <a href="http://docs.python.org/library/shelve.html">shelve</a> to store objects in a dbm database with a dict-like interface. Pickle is quite slow to load and save big lists of objects though, so I created two sqlite-based stores, called KeyListStore and ListStore, to address this issue. Another difficulty is how to efficiently compute a hash that identifies objects uniquely. To keep things simple, I just took the hash of the pickled objects. Strictly speaking this is wrong, since pickle doesn&#8217;t guarantee that two equal objects serialize to the same string. However, while this can lead to a cache being incorrectly invalidated, forcing the task to be recomputed, it&#8217;s very unlikely that a cache is mistaken for the cache of another object. In practice, I haven&#8217;t had any problems with the cache so far.</p>
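<p>As a rough illustration of the idea (not the exact code of my tool), a cache key can be built by hashing the task&#8217;s source code together with its pickled inputs. The function cache_key, the choice of SHA-1 and the sample source strings below are assumptions made for the example.</p>

```python
import hashlib
import pickle


def cache_key(source_code, inputs):
    """Hash a task's source text together with its pickled inputs."""
    digest = hashlib.sha1(source_code.encode("utf-8"))
    # Protocol 2 is available on both Python 2 and Python 3.
    digest.update(pickle.dumps(inputs, 2))
    return digest.hexdigest()


# In a real program the source text could come from inspect.getsource(task_function).
source_v1 = u"def train(features, maxiter=10):\n    return fit(features, maxiter)\n"
source_v2 = u"def train(features, maxiter=10):\n    return fit_faster(features, maxiter)\n"

key = cache_key(source_v1, (10, 0.0001))
print(key == cache_key(source_v1, (10, 0.0001)))  # same code, same inputs
print(key == cache_key(source_v1, (5, 0.0001)))   # a parameter changed
print(key == cache_key(source_v2, (10, 0.0001)))  # the code changed
```

<p>The first comparison is expected to hold for simple inputs like these, while a change in either the inputs or the source yields a different key and therefore a cache miss.</p>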
<p>One feature in Python that was particularly useful for this tool was decorators. They can be used to change the behavior of a function, in a declarative style.</p>
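<p>For readers who haven&#8217;t used decorators, here is a minimal sketch of the mechanism. The decorator cached below is a made-up, in-memory stand-in and much simpler than what my tool does, but it shows how a function&#8217;s behavior can be changed just by adding one line above its definition.</p>

```python
import functools


def cached(func):
    """Decorator that remembers results per argument tuple (in memory only)."""
    results = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in results:
            results[args] = func(*args)
        return results[args]

    return wrapper


calls = []


@cached
def square(x):
    calls.append(x)  # record every real execution
    return x * x


print(square(4))   # computed
print(square(4))   # served from the cache
print(len(calls))  # the function body ran only once
```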
<h3>Example</h3>
<p>Here&#8217;s a concrete example of a program written with my tool:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span>
<span style="color: #ff7700;font-weight:bold;">import</span> numpy <span style="color: #ff7700;font-weight:bold;">as</span> np
<span style="color: #ff7700;font-weight:bold;">import</span> taskmanager <span style="color: #ff7700;font-weight:bold;">as</span> tm
&nbsp;
DFLT_TRAIN = <span style="color: #483d8b;">&quot;/path/to/...&quot;</span>
DFLT_EVAL = <span style="color: #483d8b;">&quot;/path/to/...&quot;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Preprocessing</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> preprocess<span style="color: black;">&#40;</span>img_folder, normalize<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>preprocess_img<span style="color: black;">&#40;</span>img, normalize<span style="color: black;">&#41;</span>
                  <span style="color: #ff7700;font-weight:bold;">for</span> img <span style="color: #ff7700;font-weight:bold;">in</span> img_folder.<span style="color: black;">get_files</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
&nbsp;
@tm.<span style="color: black;">task</span><span style="color: black;">&#40;</span>tm.<span style="color: black;">directory</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;*.jpg&quot;</span><span style="color: black;">&#41;</span>, <span style="color: #008000;">bool</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">def</span> preproc_train<span style="color: black;">&#40;</span>img_folder=DFLT_TRAIN, normalize=<span style="color: #008000;">False</span><span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> preprocess<span style="color: black;">&#40;</span>img_folder, normalize<span style="color: black;">&#41;</span>
&nbsp;
@tm.<span style="color: black;">task</span><span style="color: black;">&#40;</span>tm.<span style="color: black;">directory</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;*.jpg&quot;</span><span style="color: black;">&#41;</span>, <span style="color: #008000;">bool</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">def</span> preproc_eval<span style="color: black;">&#40;</span>img_folder=DFLT_EVAL, normalize=<span style="color: #008000;">False</span><span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> preprocess<span style="color: black;">&#40;</span>img_folder, normalize<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Feature extraction</span>
&nbsp;
@tm.<span style="color: black;">task</span><span style="color: black;">&#40;</span>preproc_train<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">def</span> fextract_train<span style="color: black;">&#40;</span>images<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>...<span style="color: black;">&#93;</span>
&nbsp;
@tm.<span style="color: black;">task</span><span style="color: black;">&#40;</span>preproc_eval<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">def</span> fextract_eval<span style="color: black;">&#40;</span>images<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>...<span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Training</span>
&nbsp;
@tm.<span style="color: black;">task</span><span style="color: black;">&#40;</span>fextract_train, <span style="color: #008000;">int</span>, <span style="color: #008000;">float</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">def</span> train<span style="color: black;">&#40;</span>features, maxiter=<span style="color: #ff4500;">10</span>, esp=<span style="color: #ff4500;">0.0001</span><span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>...<span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Evaluation results</span>
&nbsp;
@tm.<span style="color: black;">task</span><span style="color: black;">&#40;</span>fextract_eval, train<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">def</span> evaluate<span style="color: black;">&#40;</span>features, models<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>...<span style="color: black;">&#93;</span>
&nbsp;
@tm.<span style="color: black;">task</span><span style="color: black;">&#40;</span>evaluate<span style="color: black;">&#41;</span>
@nocache
<span style="color: #ff7700;font-weight:bold;">def</span> results<span style="color: black;">&#40;</span>eval_res<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: black;">&#91;</span>...<span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">try</span>:
        tm.<span style="color: black;">TaskManager</span>.<span style="color: black;">OUTPUT_FOLDER</span> = <span style="color: #483d8b;">&quot;./tmp&quot;</span>
        tm.<span style="color: black;">run_command</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">except</span> tm.<span style="color: black;">TaskManagerError</span>, m:
        <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #66cc66;">&gt;&gt;</span>sys.<span style="color: black;">stderr</span>, m
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> __name__ == <span style="color: #483d8b;">&quot;__main__&quot;</span>:
    main<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<ul>
<li>The tool is quite unobtrusive.</li>
<li>There&#8217;s no need to deal with file names or file versions, this is all done transparently for you.</li>
<li>You get a command-line interface for free. Here, since every parameter that is not the output of another task has a default value, you could just run &#8220;./mypgm.py results&#8221; and it would work. If you wanted to try out different parameters for, e.g., &#8220;train&#8221;, you could run &#8220;./mypgm.py train:5 results&#8221;.</li>
</ul>
<h3>Code</h3>
<p>Code available <a href="http://www.mblondel.org/gitweb?p=taskmanager.git;a=tree">here</a>. </p>
<p>Everything is kept in one file to make it easy to copy the tool to another project. The tool is quite usable already but of course, it&#8217;s a work in progress.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2010/01/27/caching-computation-tasks/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Easy parallelization with data decomposition</title>
		<link>http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/</link>
		<comments>http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 18:17:28 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=120</guid>
		<description><![CDATA[Recently I came across this blog post which introduced me to the new multiprocessing module in Python 2.6, a module to execute multiple concurrent processes.  It makes parallelizing your programs very easy. The author also provided a smart code snippet that makes using multiprocessing even easier. I studied how the snippet works and I [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I came across this <a href="http://gael-varoquaux.info/blog/?p=119">blog post</a> which introduced me to the new <a href="http://docs.python.org/library/multiprocessing.html">multiprocessing</a> module in Python 2.6, a module to execute multiple concurrent processes.  It makes parallelizing your programs very easy. The author also provided a smart code snippet that makes using multiprocessing even easier. I studied how the snippet works and I came up with an alternative solution which is in my opinion very elegant and easy to read. I&#8217;m so excited about the new possibilities provided by this module that I had to spread the word. But first, off to some background.</p>
<p><span id="more-120"></span></p>
<h3>The multi-core trend</h3>
<p>Moore&#8217;s law states that:</p>
<blockquote><p>The density of transistors on chips doubles every 24 months.</p></blockquote>
<p>Although <a href="http://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s law</a>, contrary to what is often thought, <a href="http://www.thinkingparallel.com/2007/07/09/moores-law-is-dead-long-live-moores-law/">still holds true</a>, the exponential growth in transistor count predicted by Moore does not always translate into exponentially greater practical computing performance. Therefore, parallel computation has recently become necessary to take full advantage of the gains allowed by Moore&#8217;s law. This explains the recent <a href="http://en.wikipedia.org/wiki/Multi-core">multi-core</a> trend: most recent computers are now equipped with two or more cores.</p>
<p>The problem is that you can&#8217;t just use multi-core equipped computers and hope that your programs will run faster on them. Programs need to be modified to operate in a parallel fashion as opposed to a sequential fashion.</p>
<p>At the same time, languages like Ruby and Python are famous for their GIL (<a href="http://en.wikipedia.org/wiki/Global_Interpreter_Lock">Global Interpreter Lock</a>). Because of the GIL, even multi-threaded programs that are designed to be parallel can effectively use only one core at a time, resulting in no speed improvement. Parallelism here is just an illusion: the processor switches between threads so frequently that the user perceives the operations as being performed in parallel.</p>
<p>The novelty of the multiprocessing module in Python 2.6 is that it uses processes instead of threads (see <a href="http://en.wikipedia.org/wiki/Thread_%28computer_science%29#Threads_compared_with_processes">Threads compared with processes</a>) and therefore does not suffer from the GIL. Programs running on multi-core machines can therefore operate in a truly parallel fashion.</p>
<h3>Parallelizing programs</h3>
<p>To make things simpler, let me quote the excellent blog post <a href="http://www.thinkingparallel.com/2007/09/06/how-to-split-a-problem-into-tasks/">How-to Split a Problem into Tasks</a>.</p>
<blockquote><p>
The very first step in every successful parallelization effort is always the same: you take a look at the problem that needs to be solved and start splitting it into tasks that can be computed in parallel. [...] what I am describing here is also called problem decomposition. The goal here is to divide the problem into several smaller subproblems, called tasks that can be computed in parallel later on. The tasks can be of different size and must not necessarily be independent.
</p></blockquote>
<p>And, about data decomposition:</p>
<blockquote><p>
When data structures with large amounts of similar data need to be processed, data decomposition is usually a well-performing decomposition technique. The tasks in this strategy consist of groups of data. These can be either input data, output data or even intermediate data, decompositions to all varieties are possible and may be useful. All processors perform the same operations on these data, which are often independent from one another. This is my favorite decomposition technique, because it is usually easy to do, often has no dependencies in between tasks and scales really well.
</p></blockquote>
<p>Data decomposition is so straightforward that it can without any doubt be called <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a>.</p>
<h3>Map</h3>
<p>If you are a Python user, you most probably know <a href="http://en.wikipedia.org/wiki/List_comprehension">list comprehensions</a>:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">math</span> <span style="color: #ff7700;font-weight:bold;">import</span> sqrt
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: black;">&#91;</span>sqrt<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span> 
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>In this example, sqrt is applied to each element of the list and a list is returned. The resulting list and the input list are therefore the same size.</p>
<p>Probably less well known are generator expressions (sometimes called generator comprehensions), which can be written by replacing the outer brackets with parentheses:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> gen = <span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> gen
<span style="color: #66cc66;">&lt;</span>generator <span style="color: #008000;">object</span> at 0xb7cec56c<span style="color: #66cc66;">&gt;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> gen: <span style="color: #ff7700;font-weight:bold;">print</span> i
<span style="color: #ff4500;">1.0</span>
<span style="color: #ff4500;">2.0</span>
<span style="color: #ff4500;">3.0</span>
<span style="color: #ff4500;">4.0</span></pre></div></div>

<p>The difference between list and generator comprehensions is that list comprehensions are evaluated entirely before returning, while generator comprehensions yield results one by one. Generators are therefore more &#8220;lazy&#8221; and can result in big memory savings when iterating over large lists.</p>
<p>The outer parentheses can even be omitted when the generator comprehension is the only argument of a call:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> 
<span style="color: #66cc66;">&lt;</span>generator <span style="color: #008000;">object</span> at 0xb7cec68c<span style="color: #66cc66;">&gt;</span></pre></div></div>
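<p>To make the laziness more concrete, here is a small sketch. The helper traced_sqrt and the log list are introduced only for this illustration: they record which elements have actually been evaluated at each point.</p>

```python
from math import sqrt

log = []


def traced_sqrt(x):
    log.append(x)  # remember which inputs were really evaluated
    return sqrt(x)


squares = [1, 4, 9, 16]

eager = [traced_sqrt(i) for i in squares]  # evaluated immediately
evaluated_by_list = len(log)               # all 4 elements

del log[:]
lazy = (traced_sqrt(i) for i in squares)   # nothing evaluated yet
evaluated_before_iteration = len(log)      # 0 elements
first = next(lazy)                         # evaluates only the first element
evaluated_after_next = len(log)            # 1 element

# A generator passed as the sole argument of a call needs no extra parentheses.
total = sum(sqrt(i) for i in squares)

print(evaluated_by_list)
print(evaluated_before_iteration)
print(evaluated_after_next)
print(total)
```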

<p>For those more familiar with functional programming, list comprehensions are similar to the <a href="http://en.wikipedia.org/wiki/Map_%28higher-order_function%29">map</a> higher-order function.</p>

<div class="wp_syntax"><div class="code"><pre class="scheme" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">map</span> <span style="color: #b1b100;">sqrt</span> '<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1</span> <span style="color: #cc66cc;">4</span> <span style="color: #cc66cc;">9</span> <span style="color: #cc66cc;">16</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1.0</span> <span style="color: #cc66cc;">2.0</span> <span style="color: #cc66cc;">3.0</span> <span style="color: #cc66cc;">4.0</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>In fact, Python has a built-in map function too.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span>sqrt, <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<h3>Reduce</h3>
<p>While map applies a function to each element of a list and returns the resulting list, <a href="http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29">reduce</a> is a higher-order function that uses another function to combine the elements of a list in some way.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> plus = concatenate = <span style="color: #ff7700;font-weight:bold;">lambda</span> x,y: x+y
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #008000;">reduce</span><span style="color: black;">&#40;</span>plus, <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span>,<span style="color: #ff4500;">3</span>,<span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: #ff4500;">10</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #008000;">reduce</span><span style="color: black;">&#40;</span>concatenate, <span style="color: black;">&#91;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>, <span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>,<span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span></pre></div></div>

<h3>multiprocessing.Pool &#8217;s map</h3>
<p>Applying a function to each element of a list with map implicitly assumes that the function is <a href="http://en.wikipedia.org/wiki/Pure_function">pure</a>, i.e. that its result depends only on its input arguments. Although nothing prevents you from passing an impure function to map, doing so is dirty, potentially dangerous and against the functional philosophy. Concretely, it means that your functions had better not use global variables or anything else whose state may change during program execution. This also includes instance methods (an object, in essence, encapsulates state).</p>
<p>To reuse the terminology above, if we think of applying our function to each element of the list as tasks, then our tasks are independent of each other and so there&#8217;s no reason to operate over the list sequentially. Independence is also very nice because communication and collaboration between threads/processes are among the most difficult aspects of concurrent programming. Here, no communication between threads/processes is required.</p>
<p>And here comes the new multiprocessing module and more particularly its Pool class. This class represents pools of worker processes and has a map method, which is similar to the map built-in function.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">from</span> multiprocessing <span style="color: #ff7700;font-weight:bold;">import</span> Pool
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool = Pool<span style="color: black;">&#40;</span>processes=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>sqrt, <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">4</span>,<span style="color: #ff4500;">9</span>,<span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

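<p>The difference with the built-in map here is that 4 processes are used. For a CPU-bound function that takes much longer to run than it takes to send its arguments and results between processes, this can approach a 4x speedup if the computer running the program has at least 4 cores. Of course, sqrt is a toy example (it returns so quickly that the overhead dominates) but here&#8217;s a real-life example in a Machine Learning context.</p>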

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> image_sets = <span style="color: black;">&#91;</span>set1, ..., setn<span style="color: black;">&#93;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> preprocessed = pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>preprocess, images_sets<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> feat_sets = pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>feat_extract, preprocessed<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> models = pool.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>train, feat_sets<span style="color: black;">&#41;</span></pre></div></div>

<p>As long as you can write your code as list comprehensions, you can apply the data decomposition approach. It&#8217;s easy, abuse it!</p>
<p>However, distributing work over processes has a cost: the worker processes have to be started, and every argument and result has to be serialized and sent between processes. Therefore, when the function to be applied to each element returns almost instantaneously, it may be worth splitting the data into larger chunks, running each chunk in a separate process and then recombining the results with reduce. (See also <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a>)</p>
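<p>Here is a minimal sketch of the chunking idea, under a few assumptions: the helper names chunked, sqrt_chunk, concatenate and chunked_sqrt are hypothetical, and sqrt merely stands in for any function that is too fast to be worth one task per element.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">from functools import reduce
from math import sqrt
from multiprocessing import Pool

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def sqrt_chunk(chunk):
    """Process a whole chunk inside a single worker process."""
    return [sqrt(x) for x in chunk]

def concatenate(x, y):
    return x + y

def chunked_sqrt(data, size, njobs):
    """Map over chunks in parallel, then recombine the results with reduce."""
    pool = Pool(processes=njobs)
    try:
        partial_results = pool.map(sqrt_chunk, chunked(data, size))
    finally:
        pool.close()
        pool.join()
    # The empty list is the initial value, so empty input is handled too.
    return reduce(concatenate, partial_results, [])

if __name__ == "__main__":
    print(chunked_sqrt([1, 4, 9, 16, 25, 36], size=2, njobs=2))</pre></div></div>

<p>Note that Pool.map also accepts an optional chunksize argument that batches elements in a similar way, which may be enough when no custom recombination step is needed.</p>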
<h3>Helpers</h3>
<p>Here are some helpers which make parallelizing your list comprehensions even more straightforward and easy to read.</p>
<p>As mentioned before, the <a href="http://gael-varoquaux.info/blog/?p=119">blog post</a> that introduced me to this new multiprocessing module also came with a smart code snippet. I reworked it to fit my liking and this is what it looks like now:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> sqrtd = delayed<span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> powd = delayed<span style="color: black;">&#40;</span><span style="color: #008000;">pow</span><span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> squares = <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">9</span>, <span style="color: #ff4500;">16</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool_parallelize<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>sqrtd<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> squares<span style="color: black;">&#93;</span>, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> pool_parallelize<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>powd<span style="color: black;">&#40;</span>i, <span style="color: #ff4500;">0.5</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> squares<span style="color: black;">&#93;</span>, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>Contrary to Pool&#8217;s map, this supports parallelizing functions of any arity.</p>
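<p>The actual helpers are in the file linked at the end of this section. As a rough illustration only, here is one way such helpers could be written; this is an assumption about the general idea, not the code from that file. delayed records a call as a (function, args, kwargs) tuple instead of performing it, and pool_parallelize hands the recorded calls to Pool.apply_async.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">from math import sqrt
from multiprocessing import Pool

def delayed(func):
    """Return a function that records a call instead of performing it."""
    def record(*args, **kwargs):
        return (func, args, kwargs)
    return record

def pool_parallelize(calls, njobs=2):
    """Run recorded (func, args, kwargs) calls in a pool of worker processes."""
    pool = Pool(processes=njobs)
    try:
        pending = [pool.apply_async(f, a, k) for f, a, k in calls]
        # Collect the results in the same order as the recorded calls.
        return [p.get() for p in pending]
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    sqrtd = delayed(sqrt)
    powd = delayed(pow)
    squares = [1, 4, 9, 16]
    print(pool_parallelize([sqrtd(i) for i in squares], njobs=4))
    print(pool_parallelize([powd(i, 0.5) for i in squares], njobs=4))</pre></div></div>

<p>In this sketch the functions must be picklable (module-level functions or built-ins), since they are sent to the worker processes.</p>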
<p>Then I came up with this solution, which reduces the typing and is quite elegant.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> sqrtp = parallelized<span style="color: black;">&#40;</span>sqrt<span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> powp = parallelized<span style="color: black;">&#40;</span><span style="color: #008000;">pow</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> sqrtp<span style="color: black;">&#40;</span>squares, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #66cc66;">&gt;&gt;&gt;</span> powp<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: black;">&#40;</span>i, <span style="color: #ff4500;">0.5</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> squares<span style="color: black;">&#93;</span>, njobs=<span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1.0</span>, <span style="color: #ff4500;">2.0</span>, <span style="color: #ff4500;">3.0</span>, <span style="color: #ff4500;">4.0</span><span style="color: black;">&#93;</span></pre></div></div>

<p>Code available <strong><a href="http://www.mblondel.org/files/parallel.py.txt">here</a></strong>.</p>
<h3>Conclusion</h3>
<p>You really gotta love functional programming in Python! (By the way, see also Charming Python: Functional programming in Python <a href="http://www.ibm.com/developerworks/library/l-prog.html">part 1</a> and <a href="http://www.ibm.com/developerworks/library/l-prog2.html">part 2</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/feed/</wfw:commentRss>
		</item>
		<item>
		<title>First look at Cython</title>
		<link>http://www.mblondel.org/journal/2009/11/27/first-look-at-cython/</link>
		<comments>http://www.mblondel.org/journal/2009/11/27/first-look-at-cython/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 08:14:22 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=119</guid>
		<description><![CDATA[The Python and C/C++ duo
Lately, Python and C/C++ are becoming my language combination of choice for my research. It&#8217;s a pragmatic choice. 
Regarding Python:
- It has interesting packages for scientific computing such as NumPy (fast multi-dimensional arrays and vectorized code), SciPy (reusable scientific packages), Matplotlib (plotting), IPython (Matlab-like interactive environment).
- It has many libraries and [...]]]></description>
			<content:encoded><![CDATA[<h3>The Python and C/C++ duo</h3>
<p>Lately, Python and C/C++ are becoming my language combination of choice for my research. It&#8217;s a pragmatic choice. </p>
<p>Regarding Python:</p>
<p>- It has interesting packages for scientific computing such as <a href="http://numpy.scipy.org/">NumPy</a> (fast multi-dimensional arrays and vectorized code), <a href="http://www.scipy.org/">SciPy</a> (reusable scientific packages), <a href="http://matplotlib.sourceforge.net/">Matplotlib</a> (plotting), <a href="http://ipython.scipy.org">IPython</a> (Matlab-like interactive environment).<br />
- It has many libraries and many bindings/wrappers for C/C++ libraries, including in my fields of interest such as Machine Learning, Natural Language Processing and Image Processing.<br />
- It has many users, meaning that more people can contribute to your projects.<br />
- It&#8217;s a full-fledged language, with powerful features and a large standard library.</p>
<p>Regarding C/C++:</p>
<p>- They are the most commonly used languages to write native extensions for Python. Even though it&#8217;s possible to get huge speedups by vectorizing your code with NumPy (avoid for loops like the plague!), code that cannot be vectorized stays far from the speed of native programs.<br />
- They are pretty much the fastest languages out there, although Fortran can be faster.</p>
<p>In a nutshell, I try to use Python and NumPy as much as possible and when necessary, I rewrite selected portions in C or C++.</p>
<p><span id="more-119"></span></p>
<h3>Wrapping with SWIG</h3>
<p>Wrappers/bindings need to be created in order to call C/C++ code from Python. Although it&#8217;s possible to write such bindings by hand, using the Python C API, it&#8217;s quite a tedious task and a lot of it can be automated. </p>
<p><a href="http://www.swig.org">SWIG</a> is a tool that reads a .i interface file and can generate a wrapper automatically. The interface file defines the signature of the functions to wrap, together with hand-made rules to handle cases that SWIG cannot process automatically. SWIG can generate bindings for many languages besides Python.</p>
<p>SWIG can usually manage to generate bindings automatically for functions with simple types. However, when functions require arrays, pointers or structures, hand-made rules usually need to be added to the interface file to tell SWIG how to process these types. In that case, SWIG becomes, in my opinion, a complicated tool that takes quite some time to master.</p>
<p>So far I have never had to wrap existing third-party libraries; the only code I have had to wrap is my own, to speed up selected portions of my Python code. In that case, my approach has been to specifically <em>design</em> my code to be easy to wrap with SWIG. Basically, the trick is to avoid all types that SWIG cannot handle directly and instead use C++ objects as a facade to your data. For example, to represent a handwritten stroke, instead of using a 2-dimensional array of integer coordinates (x, y), I can create a Stroke class, which defines an add_point(int x, int y) method.</p>
<p>C++ side:</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">void</span> do_something <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> <span style="color: #000040;">**</span>stroke<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> 
<span style="color: #0000ff;">void</span> do_something2 <span style="color: #008000;">&#40;</span>Stroke <span style="color: #000040;">*</span>stroke<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span></pre></div></div>

<p>Python side:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">mydata = <span style="color: black;">&#91;</span><span style="color: black;">&#40;</span>x1,y1<span style="color: black;">&#41;</span>, ..., <span style="color: black;">&#40;</span>xn,yn<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># won't work unless SWIG is told</span>
<span style="color: #808080; font-style: italic;"># how to convert a list of int pairs to an int **</span>
do_something<span style="color: black;">&#40;</span>mydata<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># OK</span>
stroke = Stroke<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">for</span> x,y <span style="color: #ff7700;font-weight:bold;">in</span> mydata:
    stroke.<span style="color: black;">add_point</span><span style="color: black;">&#40;</span>x, y<span style="color: black;">&#41;</span>
do_something2<span style="color: black;">&#40;</span>stroke<span style="color: black;">&#41;</span></pre></div></div>

<p>As can be seen, this approach is far from ideal but at least the code can be wrapped without adding any hand-made rules to SWIG&#8217;s interface file. However, it does require you to design your code specifically to make it easy for SWIG to wrap. The resulting Python code feels object-oriented but not necessarily as pythonic as it could be.</p>
<h3>Wrapping with Cython</h3>
<p><a href="http://www.cython.org">Cython</a> (not to be confused with CPython, the default implementation of the Python interpreter) is a tool which compiles Python code into native Python extensions. It generates a C file which is then compiled to a Python extension with a C compiler such as gcc. The compilation is completely transparent to the user thanks to the integration of Cython in distutils (you know, the module used by setup.py).</p>
<p>Compiling Python to a native extension won&#8217;t, in itself, provide much of a speed-up compared to purely interpreted code, although it does provide an effective way to distribute closed-source Python. However, Cython also adds features to the Python language, such as static types, that give hints to the compiler to optimize the extension (Cython is therefore a new language, similar to Python). Concretely, Cython feels like Python with additional C-like features. The programmer can thus optimize selected portions of code by providing additional hints to the compiler, while still being able to use Python objects in other portions if willing to pay the speed cost.</p>
<p>Moreover, there are mechanisms in Cython to easily call external C/C++ functions. So with Cython, one can write extensions running at native speed either entirely in Cython or in a combination of C/C++ and Cython.</p>
<p>To try Cython, I&#8217;ve written a small extension to perform Dynamic Time Warping (see my <a href="http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/">recent post</a>). Rather than writing my extension entirely in Cython, I&#8217;ve opted for the solution of writing it in C and wrapping it in Cython. There were several reasons for that: </p>
<p>- Cython is a language in its own right, meaning that you need to learn it. Although Cython does seem easy to learn, I didn&#8217;t want to spend too much time on it for now.<br />
- If I am to write C-style code, I prefer to write it in C directly. Mixing C-like code and Python-like code in the same functions confused me. I found it much easier to reason about my program if I split the C-like code and the Python-like code into two separate files.<br />
- My editor doesn&#8217;t have Cython syntax highlighting ;-)</p>
<p>When wrapping C code with Cython, a useful trick is to leave memory allocation of both your input and output data to the Cython side. For example, my dtw C function takes a matrix as input and outputs a new matrix as result. Rather than allocating the output matrix on the C side, I allocate it from the Cython side and pass it as an additional argument to my function. It makes the C code clearer because it can focus on processing the data, and you don&#8217;t need to care about memory de-allocation since the arrays are reclaimed by Python&#8217;s garbage collector just like any other object.</p>
<p>When using NumPy to allocate your arrays and matrices, it&#8217;s important to know the internal ndarray memory model: how elements are organized in memory. The <a href="http://docs.scipy.org/doc/">Guide to NumPy</a> has all the information you need for that.</p>
<p>This DTW implementation is trivial: it doesn&#8217;t support features commonly found in other implementations like window constraints. However, it was a nice way to try out Cython and I got a 100x speed-up compared to the pure Python and NumPy version. The problem with DTW and dynamic programming algorithms in general is that they are difficult to vectorize, so writing a native extension makes a lot of sense here. </p>
<p>The source code is available <a href="http://www.mblondel.org/gitweb?p=fdtw.git;a=summary">here</a>.</p>
<p>Finally, for completeness, let me mention <a href="http://docs.python.org/library/ctypes.html">ctypes</a> and <a href="http://www.scipy.org/Weave">Weave</a>, which are respectively a way to call C functions from pure Python and a way to inline C in your Python programs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2009/11/27/first-look-at-cython/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Fantasdic on Mac OS X install how-to</title>
		<link>http://www.mblondel.org/journal/2009/09/13/fantasdic-on-mac-os-x-install-how-to/</link>
		<comments>http://www.mblondel.org/journal/2009/09/13/fantasdic-on-mac-os-x-install-how-to/#comments</comments>
		<pubDate>Sun, 13 Sep 2009 12:14:52 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[In English]]></category>

		<category><![CDATA[Projects]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=118</guid>
		<description><![CDATA[This is how you can install Fantasdic, my (self-proclaimed ;-)) versatile dictionary application in Mac OS X. Windows users can download an application bundle from the official website and Linux users can probably install it from their distro&#8217;s package manager (at least on Debian, Ubuntu and Fedora).
1. Macports
Install Macports.
2. X11
Install X11 for Mac OS X.
3. [...]]]></description>
			<content:encoded><![CDATA[<p>This is how you can install <a href="http://projects.gnome.org/fantasdic/">Fantasdic</a>, my (self-proclaimed ;-)) versatile dictionary application, on Mac OS X. Windows users can download an application bundle from the official website and Linux users can probably install it from their distro&#8217;s package manager (at least on Debian, Ubuntu and Fedora).</p>
<h3>1. MacPorts</h3>
<p>Install <a href="http://www.macports.org/">MacPorts</a>.</p>
<h3>2. X11</h3>
<p>Install <a href="http://www.apple.com/downloads/macosx/apple/macosx_updates/x11formacosx.html">X11 for Mac OS X</a>.</p>
<h3>3. Fantasdic</h3>
<p>Install dependencies:<br />
$ sudo port install rb-gtk2 rb-libglade2 git-core</p>
<p>Retrieve latest source code:<br />
$ git clone git://git.gnome.org/fantasdic</p>
<p>Install fantasdic:<br />
$ cd fantasdic/<br />
$ ruby setup.rb config<br />
$ ruby setup.rb setup<br />
$ sudo ruby setup.rb install</p>
<p>You can now launch fantasdic by running the &#8220;fantasdic&#8221; command.</p>
<p>You can use <a href="http://www.sveinbjorn.org/platypus">Platypus</a> to make it a dock application. In that case, you need to input the full path to the ruby interpreter and fantasdic: /opt/local/bin/ruby and /opt/local/bin/fantasdic, respectively.</p>
<h3>4. Kinput2 and canna</h3>
<p>You can safely skip this if you don&#8217;t need to input Japanese.</p>
<p>Install kinput2 and canna (kana-kanji conversion server):<br />
$ sudo port install kinput2 canna</p>
<p>Activate canna on startup:<br />
$ sudo launchctl load -w /opt/local/etc/LaunchDaemons/org.macports.canna/org.macports.canna.plist</p>
<p>Activate kinput2 on X&#8217;s startup:<br />
$ cp /usr/X11/lib/X11/xinit/xinitrc ~/.xinitrc<br />
$ vi ~/.xinitrc</p>
<p>And add the following line below &#8220;# start some nice programs&#8221;:<br />
test -x /opt/local/bin/kinput2 &#038;&#038; /opt/local/bin/kinput2 &#038;</p>
<p>The command to launch fantasdic is now:<br />
XMODIFIERS=&quot;@im=kinput2&quot; GTK_IM_MODULE=&quot;xim&quot; LANG=&quot;ja_JP.UTF-8&quot; fantasdic</p>
<p>And the obligatory screenshot ;-)</p>
<p><a href="http://www.mblondel.org/images/fantasdic-osx.jpg"><img src="http://www.mblondel.org/images/fantasdic-osx.jpg" width="400" style="border: 0px" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2009/09/13/fantasdic-on-mac-os-x-install-how-to/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Dynamic Time Warping : theory</title>
		<link>http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/</link>
		<comments>http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 05:28:55 +0000</pubDate>
		<dc:creator>Mathieu</dc:creator>
		
		<category><![CDATA[Dynamic Time Warping]]></category>

		<category><![CDATA[Handwriting Recognition]]></category>

		<category><![CDATA[In English]]></category>

		<guid isPermaLink="false">http://www.mblondel.org/journal/?p=116</guid>
		<description><![CDATA[Recently, I&#8217;ve been working on a new handwriting recognition engine for Tegaki based on Dynamic Time Warping and I figured it would be interesting to make a short, informal introduction to it.
Dynamic Time Warping (DTW) is a well-known algorithm which aims at comparing and aligning two sequences of data points (a.k.a time series). Although it [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I&#8217;ve been working on a new handwriting recognition engine for <a href="http://www.tegaki.org">Tegaki</a> based on <a href="http://en.wikipedia.org/wiki/Dynamic_time_warping">Dynamic Time Warping</a> and I figured it would be interesting to make a short, informal introduction to it.</p>
<p>Dynamic Time Warping (DTW) is a well-known algorithm for comparing and aligning two sequences of data points (a.k.a. <a href="http://en.wikipedia.org/wiki/Time_series">time series</a>). Although it was originally developed for speech recognition (see [1]), it has also been applied to many other fields like <a href="http://en.wikipedia.org/wiki/Bioinformatics">bioinformatics</a>, <a href="http://en.wikipedia.org/wiki/Econometrics">econometrics</a> and, of course, handwriting recognition.</p>
<p>Consider two sequences A and B, composed respectively of n and m <a href="http://en.wikipedia.org/wiki/Feature_vector">feature vectors</a>. </p>
<p><img src="http://www.mblondel.org/images/dtw3.gif" /></p>
<p>Each feature vector is d-dimensional and can thus be represented as a point in a d-dimensional <a href="http://en.wikipedia.org/wiki/Feature_space">space</a>. For example, in handwriting recognition, we could directly use the raw (x,y) coordinates of the pen movement and that would give us sequences of 2-dimensional vectors. In practice however, one would extract more useful features from (x,y) and create vectors of dimension possibly greater than 2. It&#8217;s also worth noting that the sequences A and B can be of different lengths.</p>
<h3>Time warping</h3>
<p>DTW works by warping (hence the name) the time axis iteratively until an optimal match between the two sequences is found.</p>
<p><img src="http://www.mblondel.org/images/dtw1.gif" /></p>
<p>In the figure above, which is an example of two sequences of data points with only 1 dimension, the time axis is warped so that each data point in the green sequence is optimally aligned to a point in the blue sequence.</p>
<h3>Best path</h3>
<p>We can construct an n x m distance matrix. In this matrix, each cell (i,j) represents the distance between the i-th element of sequence A and the j-th element of sequence B. The distance <a href="http://en.wikipedia.org/wiki/Metric_%28mathematics%29">metric</a> used depends on the application but a common metric is the <a href="http://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>.</p>
<p><img src="http://www.mblondel.org/images/dtw2.gif" /></p>
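<p>As a small illustration, the distance matrix can be built in a few lines of Python. This is only a sketch: the function names euclidean and distance_matrix and the sample sequences are made up for the example, and indices start at 0 rather than 1.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(seq_a, seq_b):
    """Return the n x m matrix whose cell (i, j) is dist(seq_a[i], seq_b[j])."""
    return [[euclidean(p, q) for q in seq_b] for p in seq_a]

# Two short sequences of 2-dimensional feature vectors, of different lengths.
A = [(0, 0), (3, 4)]
B = [(0, 0), (0, 4), (3, 4)]
D = distance_matrix(A, B)  # 2 rows (one per element of A), 3 columns</pre></div></div>

<p>Any other local distance can be swapped in for euclidean without changing the rest.</p>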
<p>Finding the best alignment between two sequences can be seen as finding the shortest path to go from the bottom-left cell to the top-right cell of that matrix. The length of a path is simply the sum of the values of all the cells visited along that path. The further away the optimal path wanders from the diagonal, the more the two sequences need to be warped to match together.</p>
<p>The brute force approach to finding the shortest path would be to try each path one by one and finally select the shortest one. However it&#8217;s apparent that it would result in an explosion of paths to explore, especially if the two sequences are long. To solve this problem, DTW uses two things: constraints and dynamic programming.</p>
<h3>Constraints</h3>
<p>DTW can impose several kinds of reasonable constraints, to limit the number of paths to explore.</p>
<ul>
<li>Monotonicity: The alignment path doesn&#8217;t go back in time index. This guarantees that features are not repeated in the alignment.</li>
<li>Continuity: The alignment doesn&#8217;t jump in time index. This guarantees that important features are not omitted.</li>
<li>Boundary: The alignment starts at the bottom-left and ends at the top-right. This guarantees that the sequences are not considered only partially.</li>
<li>Warping window: A good alignment path is unlikely to wander too far from the diagonal. This guarantees that the alignment doesn&#8217;t try to skip different features or get stuck at similar features.</li>
<li>Shape: Aligned paths shouldn&#8217;t be too steep or too shallow. This prevents very short sequences from being aligned with very long ones.</li>
</ul>
<p>These constraints are best visualized in [3].</p>
<h3>Dynamic Programming</h3>
<p>Taking advantage of such constraints, DTW uses <a href="http://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a> to find the best alignment in a recursive way. Previously, the cell (i,j) of the distance matrix was defined as &#8220;the distance between the i-th element of sequence A and the j-th element of sequence B&#8221;. In the dynamic programming way of thinking, this definition is changed, and instead, the cell (i,j) is defined as the length of the shortest path <strong>up to</strong> that cell. Assuming local constraints like below, </p>
<p><img src="http://www.mblondel.org/images/dtw5.jpg" /></p>
<p>it allows us to define the cell (i,j) recursively:</p>
<p>cell(i,j) = local_distance(i,j) + MIN(cell(i-1,j), cell(i-1,j-1), cell(i, j-1))</p>
<p>Here, recursively means that the shortest path up to the cell (i,j) is defined in terms of the shortest path up to the adjacent cells. A lot of different local constraints can be defined (see this <a href="http://www.mblondel.org/images/dtw4.jpg">table</a>) and thus there are many variations in the way DTW can be implemented.</p>
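<p>The recurrence above can be sketched directly in Python. This is a minimal, unoptimized version under a few assumptions: the names dtw and absolute are hypothetical, only the three local moves of the formula are used (no warping window or shape constraint), and the matrix is indexed from the first cell (0,0) to the last cell (n,m) instead of bottom-left to top-right, with an extra border row and column to start the recursion.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">def dtw(seq_a, seq_b, local_distance):
    """Fill and return the matrix of shortest path lengths up to each cell."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # Row 0 and column 0 form a border; only the corner is a valid start.
    cell = [[inf] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(seq_a[i - 1], seq_b[j - 1])
            # Shortest path up to (i, j), in terms of the adjacent cells.
            cell[i][j] = d + min(cell[i - 1][j],
                                 cell[i - 1][j - 1],
                                 cell[i][j - 1])
    return cell

absolute = lambda x, y: abs(x - y)

# 1-dimensional example: the second sequence repeats one value.
cell = dtw([1, 2, 3], [1, 2, 2, 3], absolute)
distance = cell[-1][-1]  # the last cell holds the length of the best path</pre></div></div>

<p>Here the two sequences differ only by a repeated value, which warping absorbs, so the resulting distance is 0.</p>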
<h3>DTW as a distance metric</h3>
<p>Once the algorithm has reached the top-right cell, we can use <a href="http://en.wikipedia.org/wiki/Backtracking">backtracking</a> in order to retrieve the best alignment. If we&#8217;re just interested in comparing the two sequences however, then the top-right cell of the matrix already holds the length of the shortest path. We can therefore use the value stored in this cell as the distance between the two sequences. With a symmetric local distance and the local constraints above, DTW has the nice property of being symmetric, so DTW(a,b) = DTW(b,a). DTW doesn&#8217;t fulfill the <a href="http://en.wikipedia.org/wiki/Triangle_inequality">triangle inequality</a>, so it isn&#8217;t a metric in the strict mathematical sense, but this isn&#8217;t a problem in practice.</p>
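<p>Backtracking can be sketched as follows: starting from the last cell, repeatedly step to the adjacent cell with the smallest accumulated length until the first cell is reached. The function name dtw_path is hypothetical, the matrix is indexed from (0,0) to (n,m) with a border row and column, and ties are broken in favour of the diagonal, which is just one possible choice.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">def dtw_path(seq_a, seq_b, local_distance):
    """Return (distance, path); path lists the aligned (i, j) index pairs."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cell = [[inf] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(seq_a[i - 1], seq_b[j - 1])
            cell[i][j] = d + min(cell[i - 1][j],
                                 cell[i - 1][j - 1],
                                 cell[i][j - 1])
    # Backtrack from the last cell towards the first one.
    i, j = n, m
    path = []
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))  # 0-based indices into the sequences
        # Diagonal listed first, so that ties prefer the diagonal move.
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda c: cell[c[0]][c[1]])
    path.reverse()
    return cell[n][m], path

absolute = lambda x, y: abs(x - y)
a, b = [1, 2, 3], [1, 2, 2, 3]
dist_ab, path_ab = dtw_path(a, b, absolute)
dist_ba, path_ba = dtw_path(b, a, absolute)</pre></div></div>

<p>In this example the second element of a is aligned with both the second and third elements of b, and swapping the two sequences gives the same distance.</p>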
<h3>Related algorithms</h3>
<p>DTW looks almost identical to the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein algorithm</a>, an algorithm to compare strings, and is very similar to the <a href="http://en.wikipedia.org/wiki/Smith-Waterman_algorithm">Smith-Waterman algorithm</a>, an algorithm for <a href="http://en.wikipedia.org/wiki/Sequence_alignment">sequence alignment</a>.</p>
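<p>To make the resemblance concrete, here is a minimal sketch of the Levenshtein distance written in the same style (the function name levenshtein is just a label for this example). The cell update has the same shape as the DTW recurrence: a minimum over the three adjacent cells, with edit costs added instead of a local distance.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    n, m = len(s), len(t)
    cell = [[0] * (m + 1) for _ in range(n + 1)]
    # Border: turning a prefix into the empty string costs its length.
    for i in range(n + 1):
        cell[i][0] = i
    for j in range(m + 1):
        cell[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if s[i - 1] == t[j - 1] else 1
            cell[i][j] = min(cell[i - 1][j] + 1,          # deletion
                             cell[i][j - 1] + 1,          # insertion
                             cell[i - 1][j - 1] + substitution)
    return cell[n][m]

edits = levenshtein("kitten", "sitting")</pre></div></div>

<p>The main differences are the border initialization and the fact that each move has its own cost, whereas DTW adds the same local distance whichever adjacent cell the path comes from.</p>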
<h3>References</h3>
<p>[1] Sakoe, H. and Chiba, S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1), pp. 43-49, 1978</p>
<p>[2]  <a href="http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWalgorithm.htm">DTW algorithm @ GenTχWarper</a></p>
<p>[3] <a href="http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWAlgorithm.ppt">PowerPoint presentation by Elena Tsiporkova</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
