Tweeter button

Archive for February, 2010

The Little Machine Learner

Thursday, February 18th, 2010

The idea

I’ve been having this idea on my mind for quite some time: wouldn’t it be nice to write a book about Machine Learning where each chapter is a literate program?

From Wikipedia:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

From the PyLit homepage:

The idea is that you do not document programs (after the fact), but write documents that contain the programs.

There are plenty of great textbooks about Machine Learning out there, so the point would not be to write yet another one, but write something different. Here’s what I had been thinking.

  • Each chapter written as a literate program, organized so as to maximize understanding
  • Code in Python (+Numpy + Scipy but without any additional dependencies)
  • Readability over Performance
  • Intuitions, nice figures, useful tips or tricks
  • Real-world applications at the end of each chapter
  • Don’t shy away from the maths, especially if at high-school or undergraduate level…

I bet that quite a few algorithms can be written this way, yet remain very concise!

Except for the maths part, the closest book to this idea that I know of is probably “Programming Collective Intelligence: Building Smart Web 2.0 Applications”, by Toby Segaran.

An example with logistic regression

So, in order to experiment with what such a book could look like, I’ve decided to write a chapter about Logistic Regression. Topics I cover include Maximum Likelihood Estimation, Regularization and Cross-validation. At the end, I use heart disease prediction as an example of real-world application. Probably many things could be improved or added but the point for now is mainly to show what it could look like.

Tools

For the documentation tool, I’ve decided to go for Sphinx, which seems to be emerging as the de-facto documentation tool in the Python community. It has nice features like syntax highlighting, latex support and matplotlib plots support and can output to HTML and PDF.

Normally, in literate programming, there’s the literate source, which uses some kind of markup-language and tools are used to generate either code or documentation from it. I took a different approach. In my case, the source file is the code and the documentation is extracted from the comments in the code. Technically, it’s therefore closer to extensively documented code than actual literate programming. It has some limitations but the main advantages are that the program is runnable directly (since Python is interpreted) and the programmer can benefit from syntax highlighting. I wrote a simple program that converts Python source code to reStructuredText, as necessary for integration in Sphinx.

Interested?

It took quite some time to collect the information and do the actual writing but I feel like I improved my own understanding in the process, so I’m thinking of writing a chapter from time to time. If I do so, at the end of my PhD, I may have gathered enough material to make it a real book! The book could affectionately be entitled “The Little Machine Learner”, hence the title of this post.

Since Machine Learning is a very large field and to write a better book than I could possibly write alone, I’m also thinking that it could actually be a collaborative effort (by researchers, students and practitioners). If you’re interested, please leave a comment. I will create a discussion group if there’s enough interest.

As usual, the source code is available in my git repo:

$ git clone http://www.mblondel.org/code/tlml.git

web interface

Seam Carving in Python

Tuesday, February 9th, 2010

Seam Carving is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper “Seam Carving for Content-Aware Image Resizing“.


Miyako Island, Okinawa, Japan.

The principle is very simple. Find the connected paths of low energy pixels (”the seams”). This can be done efficiently by dynamic programming (see my post on DTW).


Same image in the gradient domain showing the vertical and horizontal seams of lowest cumulated energy.

The seams of lowest cumulated energy can be seen as the pixels contributing the least to an image. By repeatedly removing or adding seams, it is thus possible to perform “content-aware” image reduction or extension. The resulting images feel more natural, less “streched”.


Height reduced by 50% by seam carving.


Height reduced by 50% by traditional rescaling.

Although seam carving doesn’t need human intervention, in the original paper, a graphical user interface (GUI) was also developed to let the user define areas that can’t be removed, or conversely, that must be removed.

In my opinion, seam carving is simple and elegant. No sophisticated object recognition algorithm was used, yet the results are quite impressive.

You can find my implementation in 250 lines of Python in my git repo:

$ git clone http://www.mblondel.org/code/seam-carving.git

web interface

Unfortunately, it’s too slow to be real-time.