
The Little Machine Learner

The idea

I’ve had this idea on my mind for quite some time: wouldn’t it be nice to write a book about Machine Learning where each chapter is a literate program?

From Wikipedia:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

From the PyLit homepage:

The idea is that you do not document programs (after the fact), but write documents that contain the programs.

There are plenty of great textbooks about Machine Learning out there, so the point would not be to write yet another one, but to write something different. Here’s what I have in mind.

  • Each chapter written as a literate program, organized so as to maximize understanding
  • Code in Python (plus NumPy and SciPy, but without any additional dependencies)
  • Readability over Performance
  • Intuitions, nice figures, useful tips or tricks
  • Real-world applications at the end of each chapter
  • Don’t shy away from the maths, especially when it is at high-school or undergraduate level…

I bet that quite a few algorithms can be written this way, yet remain very concise!

Except for the maths part, the closest book to this idea that I know of is probably “Programming Collective Intelligence: Building Smart Web 2.0 Applications”, by Toby Segaran.

An example with logistic regression

So, in order to experiment with what such a book could look like, I’ve decided to write a chapter about Logistic Regression. The topics I cover include Maximum Likelihood Estimation, Regularization and Cross-validation. At the end, I use heart disease prediction as an example of a real-world application. Many things could probably be improved or added, but the point for now is mainly to show what it could look like.
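To give a flavour of how concise this can be with NumPy alone, here is a minimal sketch of L2-regularized logistic regression fitted by maximum likelihood. It is not the code from the chapter: the function names (`sigmoid`, `neg_log_likelihood`, `fit`), the parameters `lam`, `step` and `n_iter`, and the use of plain batch gradient descent are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps real-valued scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def neg_log_likelihood(w, X, y, lam):
    """L2-regularized negative log-likelihood, for labels y in {0, 1}."""
    z = X @ w
    # log(1 + exp(z)) is computed in a numerically stable way by logaddexp.
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (w @ w)


def fit(X, y, lam=1.0, step=0.1, n_iter=2000):
    """Minimize the objective above with plain batch gradient descent.

    lam, step and n_iter are illustrative defaults, not tuned values.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        # Gradient of the objective: X^T (p - y) + lam * w
        grad = X.T @ (sigmoid(X @ w) - y) + lam * w
        # Scale by the number of samples to keep the step size stable.
        w -= step * grad / n_samples
    return w
```

In such a sketch, a constant column appended to `X` plays the role of the intercept, and `lam` would typically be chosen by cross-validation rather than fixed by hand.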

Tools

For the documentation tool, I’ve decided to go for Sphinx, which seems to be emerging as the de facto documentation tool in the Python community. It has nice features such as syntax highlighting, LaTeX support and support for matplotlib plots, and it can output both HTML and PDF.

Normally, in literate programming, there is a literate source written in some kind of markup language, and tools generate either code or documentation from it. I took a different approach: the source file is the code, and the documentation is extracted from the comments in the code. Technically, it is therefore closer to extensively documented code than to actual literate programming. This has some limitations, but the main advantages are that the program can be run directly (since Python is interpreted) and that the programmer can benefit from syntax highlighting. I wrote a simple program that converts Python source code to reStructuredText, which is what Sphinx needs as input.
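The conversion step can be sketched roughly as follows. This is a hypothetical minimal version, not the actual program in the repository: the name `py_to_rst` is made up, and it assumes that comment lines starting in the first column are prose while everything else (including indented comments) is code to be wrapped in a `code-block` directive.

```python
def py_to_rst(source):
    """Convert Python source text to reStructuredText (illustrative sketch).

    Comment lines starting in column 0 become prose; all other lines are
    grouped into ``code-block`` directives.
    """
    out = []   # lines of the reST document being built
    code = []  # pending lines of the current code chunk

    def flush_code():
        # Trim blank lines around the chunk, then emit it as a directive.
        had_lines = bool(code)
        while code and not code[0].strip():
            code.pop(0)
        while code and not code[-1].strip():
            code.pop()
        if had_lines and out and out[-1] != "":
            out.append("")  # paragraph break before whatever follows
        if code:
            out.append(".. code-block:: python")
            out.append("")
            out.extend("    " + line if line.strip() else "" for line in code)
            out.append("")
        del code[:]

    for line in source.splitlines():
        if line.startswith("#!"):
            continue  # skip a shebang line
        if line.startswith("#"):
            flush_code()
            # Remove the comment marker and one following space, if any.
            text = line[1:]
            out.append(text[1:] if text.startswith(" ") else text)
        else:
            code.append(line)
    flush_code()
    return "\n".join(out) + "\n"
```

With this scheme, a comment block such as a title underlined with `=` characters passes straight through as a reST section heading, and the module remains an ordinary Python file that can be run as is.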

Interested?

It took quite some time to collect the information and do the actual writing, but I feel like I improved my own understanding in the process, so I’m thinking of writing a chapter from time to time. If I do so, by the end of my PhD I may have gathered enough material to make it a real book! The book could affectionately be entitled “The Little Machine Learner”, hence the title of this post.

Machine Learning is a very large field, and a group of authors could write a better book than I possibly could alone, so I’m also thinking that this could be a collaborative effort (by researchers, students and practitioners). If you’re interested, please leave a comment. I will create a discussion group if there’s enough interest.

As usual, the source code is available in my git repo:

$ git clone http://www.mblondel.org/code/tlml.git

web interface

2 Responses to “The Little Machine Learner”

  1. enj Says:

    This is an awesome idea. I really like the format for learning the math and the implementation at the same time. The hands on aspect of it is great!

    I could see this being a great way to teach numerical analysis or other computational science tools (interpolation, integration, signal processing) as well.

    Really sweet!

  2. Mathieu Says:

    Thanks for your comment, Ian. In my case, I guess what I like the most is that the implementation is right below the corresponding explanations. This makes the flow of explanations very natural.

    Speaking of numerical analysis, I’ve been looking into optimization lately since it’s so important in Machine Learning. It’s a very interesting field.
