Archive for the ‘In English’ Category

Java + JRuby or Jython for scientific computing: a test-case with Hidden Markov Models

Tuesday, May 19th, 2009

When it comes to programming languages for scientific computing, researchers are usually faced with a trade-off between ease of programming and runtime performance. On the one hand, you have languages/toolboxes like Matlab or R which are easy to use but slow. On the other hand, you have C and C++ which take more time to develop but usually perform the best. There exist several alternative solutions that share a little bit of both worlds. Among others:

- Implement the computation-intensive parts in C or C++ and use a scripting language like Python or Ruby for all the rest. A wrapper is necessary in order to be able to use the library from the scripting language. SWIG can be used for that. The number of existing scientific libraries written in C or C++ is big.

- Implement the computation-heavy parts in Java (Java is not bad at number crunching!) and use another JVM-based language for the rest. Popular choices recently are JRuby, Jython, Scala, Groovy and Clojure but there exist many others. These languages are usually designed from the ground-up to integrate well with Java so a wrapper is not necessary in order to interact with the library. The number of existing Java packages for scientific computing is huge.

- Use Python with Numpy. Numpy is a package for fast manipulation of arrays and matrices. Since most of the time in computation-heavy algorithms is spent in calculations on arrays or matrices, there is a huge performance gain compared to plain Python. Together with matplotlib and scipy, the Python / Numpy duo is becoming more and more popular as a replacement for Matlab and thus the number of available scientific libraries is growing fast.

- Use OCaml. OCaml is a functional programming language which is said to have near the performance of C and near the ease of use of scripting languages. It’s pretty popular in the scientific community in France since this is where OCaml is originating from. However, the number of existing libraries is smaller than in C or Java.

I gave a shot to the Java solution today. I tried to use Jahmm (a Hidden Markov Models package written in Java) from both JRuby and Jython. It’s very nice to be able to use a Java library in Ruby or Python syntax! You can edit your source and try it right away without having to recompile.

One thing that I didn’t like in JRuby is that Ruby arrays must be explicitly converted to Java arrays in order to be used in a Java method. Say you have a Java method “foo(double[] n)”. If you want to use the Ruby array [1.0, 2.0, ...] as a parameter for foo, you need to convert it to a Java array with [1.0, 2.0, ...].to_java(:double). Otherwise, you get an error telling you that the parameters don’t match the signature of foo. Java supports method overloading. That is, the same method can be redefined with different number of parameters or different types of parameters. Jython has some heuristics to do the conversion transparently for you most of the time. This makes the Jython script feels more natural and easier to read.

The solution to use Java for the numerical computations and a JVM-based language for the rest is quite tempting. You can use the Java library (almost) transparently from your favorite language, whether it is Python, Ruby, Scala, Groovy or Closure… One thing that is missing though, is interoperability between JVM-based programming languages. Say you have portions of code written in JRuby, in addition to Java, it’s not yet possible to use them from Jython. So truly polyglot programming is not possible yet. You have to choose one JVM-based language in addition to Java and stick to it.

I tried HMMs with discrete, multivariate gaussians and mixtures of gaussians as observation probability distribution in both JRuby and Jython. You can have a look and compare which one you prefer.

JRuby: discrete.rb, multivariate.rb, mixture.rb
Jython: discrete.py, multivariate.py, mixture.py

Personal wiki

Sunday, May 10th, 2009

A few months ago, I set up a personal wiki on my server. By personal wiki, I mean a wiki that is for you and yourself only and that is not intended for anyone else to see. The advantage of the wiki is that you can read and edit it from anywhere and the wiki syntax is very convenient. Overall, I found the idea of a personal wiki to be very useful and I think it may help you organize your work/life as well.

TODO-list

I use my wiki to keep a list of things I want to do. This can be project ideas, books or publications I want to read, movies I want to watch. The only problem with a TODO-list is that you generally add more items to it than you remove so the list can grow quite fast!

Notes

Another thing I’ve been using my wiki for is taking notes. Every time I read a publication, I now write down the ideas I found interesting in the paper. For technical books, I try to make a quick summary after each chapter I read. Of course, this takes a little more time than just reading the book but I found out that 1) it helps me memorize the content better and 2) as I write down the summary, I sometimes realize that I didn’t fully grasp a concept and thus I have to clarify my understanding in order to write the notes. I also take notes of interesting companies, conferences, links, program commands I run across…

One important thing with taking notes is not getting too far in taking your notes – otherwise it’s like you’re rewriting the book that you’re reading or you’re recreating the internet…

Diary

I’ve also been using my wiki as a personal mini-diary. At the end of the day, I write down the meaningful things I did of my day and try to remember the interesting ideas I had. It didn’t become a habit yet so I forget to do it very often. Yet, one doesn’t necessarily have something interesting to say everyday so it can become a motivation to try to do something meaningful of your day.

Writing style

Another important thing to consider is that this kind of wiki is strictly personal — it should be protected with a password. Therefore you don’t have to worry about making typos or of what will people think. As a writing style, I make extensive use of bullet lists and I use a mix of English and French, depending on what comes out. Mostly for technical stuff, it often happens that words come in English rather than French so I write directly in English.

All-female cloning ants

Friday, April 17th, 2009

Apparently researchers from the University of Arizona have discovered an amazonian ant that has developed into an all-female species which reproduces by cloning.

More details can be found here:

Sexual reproduction is a great source of diversity which allows species to evolve over time through natural selection. So, from a pure evolutionary standpoint, one may wonder why these ants have favored cloning over sexual reproduction. The “choice” of cloning suggests that these ants are already very well adapted to their environment and can give up evolving. Asexual reproduction is faster, requires less energy and produces fewer errors (deficient individuals). In the case of these ants, it thus allows for mass reproduction of already well-suited individuals. However this comes at the great risk of extinction of the species on the long term because the species will be “trapped” into its current state (mutations are more rare in asexual reproduction) and won’t be able to adapt in case of environmental change (temperature change, new predator). So this strategy can be seen as favoring individuals over the species. Anyway, this is quite a fascinating discovery.

Numbers game solver

Monday, April 6th, 2009

Genetic programming is a Machine Learning technique that allows to “create programs that create programs” by simulating evolution. The purpose is to slowly converge to a program that performs a defined task as well as possible.

I wanted to try out Genetic Programming and I figured writing a solver for the “numbers game” would be a good example. If you don’t know it, it’s part of the popular game shows “Countdown” in the UK and “Des chiffres et des lettres” in France. For example, given the numbers 3, 75, 2, 4, 1 and 1, the goal is to arrive at 888 using the four basic arithmetic operations (+, −, × and ÷) like so:

74 = 75 – 1
12 = 4 * 3
888 = 74 * 12

The above calculations can be represented as a tree like this:

        *
     /      \
    -        *
  /   \    /   \
75    1    4    3

So it becomes clear that solving the numbers game is equivalent to writing little programs. Although I used the tree representation, a possibly simpler representation that I now think of is like a string of stack operations, like in Forth:

75 1 – 4 3 * *

You start with an initial population of random programs and every generation, you create a new population by selecting the best-performing programs from the current generation, mutating programs (changing a constant with another constant, changing an operator with another operator…) and breeding programs (creating programs which are the combination of two programs).

The generated programs must respect some constraints. For example, division must not have a remainder so 7 / 2 is not allowed. Another example is that the constants used must be among the 6 input numbers and can only be used once.

In Genetic Programming, to determine how good the generated programs perform, we need a scoring function. Here it’s very clear: the closer we are from the target number, the better. To decide between two solutions that perform as well as the other, we can select the one with the fewer calculations.

You can find my program here.

Another idea for applying Genetic Programming that I had is for creating hash functions. What would be the scoring function in that case? One could use the avalanche effect (the greater the avalanche effect, the better) or the average number of collisions (the smaller number of collisions, the better). Although I had this idea by myself, it seems that researchers are already investing this!

Source code classifier

Wednesday, March 18th, 2009

Of the numerous Machine Learning algorithms, naive Bayes classifiers are probably one of the simplest forms. Yet, they are known to be able to yield very good performance in some circumstances. The most famous example of their use is probably the spam filters, in which spam and non-spam emails are marked beforehand and throughout the filter’s life in order to train the filter to recognize spam from non-spam.

I’ve decided to try out this technique for source code classification: given a source file for which we don’t know the programming language, the purpose is to output the n most probable programming languages in which the source is written. Although I did it mainly to see what kind of results I could get, one use I can think of is in a text-editor, to activate or deactivate syntax highlighting when the file extension is missing.

Difficulties

I can think of several difficulties in trying to classify code source files:
- Unlike many natural languages in which spaces delimit words, in programming languages, spaces can often be omitted. For example print(“hello”) and print (“hello”) could well both be correct in a given language. The lack for spaces is detrimental to the performance because print(“hello”) will be treated as a single word and as a result have a very low probability.
- Programs are sometimes heavily commented by the programmer, which can cause the classifier to confuse source files and plain-text files.
- Since we don’t know the language in advance, we cannot rely on a grammar. Intuitively, we need to use the occurrence probability of words for each programming language.

Collect data

Because this technique involves training the program to recognize programming languages, it was first necessary to collect source files for different programming languages. I decided to limit myself to 24 languages from Programming Language Popularity for which I had easy access to open-source programs. Of course it is possible to add more by training the program with more source files.

For each programming language, I grabbed a few projects written in the language and for interpreted languages, I also grabbed the virtual machine source code because their standard library is often written in the language directly. I downloaded all the source files with the Debian command “apt-get source”, which grabs project sources from Debian mirrors, making the task a little bit more convenient but overall collecting source files was the most boring and time-consuming part of this project.

The thing is, to account for different programming styles and to find a good approximation of the word distributions, especially given the difficulties mentioned above, it’s very important to collect as much source files as one can. Once this was done I was able to start the program itself.

Ruby and C implementations

I made the program (trainer and classifier) in Ruby. The training part is quite slow (Ruby is quite slow for text processing) but the classifier is well responsive, which is the most important. I also made a C version of the classifier: the Ruby program is used to generate a header file (.h) which contains a hash table with the data for the statistical model and when the classifier is compiled, the hash table is included directly in the executable for blazing fast processing. The executable remains small (~200K for 24 languages).

An advantage of this approach is that most of the work is done offline (i.e. during the training) and when I tweak some parameters in the Ruby program, I just have to retrain, regenerate the .h file and recompile. The core C module doesn’t change.

Performance evaluation

I used a validation set to estimate how good the classifier performs: the percentage of correctly classified source files and the percentage of incorrectly classified source files and I got an overall accuracy of 85%. The validation set was composed of 40 sources files for each of the 24 programming languages. Here are a few interesting points I noticed:

- A few languages such ada, css, forth, ocaml are recognized correctly 100% of the time. Others such as tcl, erlang, scheme, haskell, smalltalk get an average accuracy over 95%. This shows that these languages stand out in their word distributions: the most common words in these languages are quite different from the most common words in other languages making it easier for the classifier to distinguish them.

- On the contrary, other languages get poor accuracy because they easily get recognized as another language. An example is C++ which, according to my tests, got recognized as C or Javascript in 30% of the source files. This is not very surprising knowing that these languages have a similar syntax.

- It’s also interesting to note what languages does the classifier recognizes when it makes mistakes. This shows us how close some languages can be. For example, not surpringly, the classifiers sometimes confuses common lisp with scheme and vice-versa.

Source / Plain text

If this classifier had to be integrated into a text editor, it would first be necessary to tell if the file is even source code or just plain text (like a README file). To do that, I would personally build a second model whose role is solely to distinguish source code from plain text. It’s possible to do that by training the classifier appropriately. If the file is recognized as plain text, syntax highlighting can be disabled, otherwise the programming language model can be used to determine the programming language and set the syntax highlighting accordingly. The only disadvantage to this approach is that the first model becomes a bottleneck to the overall process. If it does a poor job at distinguishing plain text from source code, then it is the whole system that will do a poor job.

Download

You can get the source code (LGPL license) with this command:
$ git clone http://www.mblondel.org/code/source-classifier.git

or by browsing the web interface.

And so Forth

Wednesday, March 11th, 2009

Lately I’ve been looking into the Forth programming language. It’s a fascinating language in many ways! It combines low-level language features such as direct access to memory, absence of garbage collection and, at the same time, high-level language features such as high degree of expressiveness, reflection, ability to modify the compiler at runtime, meta-programming.
(more…)

WPA with Intersil Prism chipset-based wireless card

Saturday, March 7th, 2009

In order to install a wireless connection in my mother’s laptop, I used an old Netgear PCMCIA wireless card of mine. However, I faced the problem that when clicking on the home network in the list of available networks in NetworkManager, WPA encryption wouldn’t come out as a possible encryption method, although the network IS using WPA. The solution was to flash the card’s firmware. You can find an excellent how-to here but here’s a quick summary of what I did (assuming the interface is wlan0).

# apt-get install hostap-utils
$ wget http://www.red-bean.com/~proski/firmware/Latest-prism.tar.bz2
$ tar xvfj  Latest-prism.tar.bz2
$ cd Latest-prism/
# prism2_srec -f wlan0 -O /proc/net/hostap/wlan0/pda primary-FLASH/pk010101.hex secondary-FLASH/sf010804.hex

You can run the following command before and after the firmware upgrade to check that the firmware version was upgraded correctly:

# hostap_diag wlan0

Installing Ubuntu from an existing Linux

Saturday, March 7th, 2009

I’m giving my old laptop to my mother so that she can browse the web, write emails and chat over instant messaging and I wanted to replace Debian with Ubuntu. However, the cdrom no longer works and the mother board doesn’t support booting from an USB drive. One solution is to add an entry to your bootloader pointing to a netinstall vmlinuz (here called linux) and initrd.gz and choose that entry at boot in order to perform the netinstall. You need an Internet connection over a LAN network with a DHCP server. All the details are available here.

First release of Tegaki

Wednesday, February 11th, 2009

Today I’m releasing Tegaki 0.1. Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, that is suitable for both the desktop and mobile devices, and that is designed from the ground up to work well with Chinese and Japanese.

Screencast video: ogg or youtube.

This release features desktop and SCIM integration. However, the main “innovation” brought to you by this release is the user interface. It uses two drawing areas for continuous writing. The user can eventually fix recognition errors by choosing alternative candidates or editing characters. Since a video is worth a thousand words, see the screencast above. This interface is largely inspired from the Nintendo DS game “Kanji Sono Mama Rakubiki Jiten” (漢字そのまま楽引辞典).

Tegaki is designed to be able to use several recognition engines. However so far it only supports Zinnia, which is the only recognition engine that I know with acceptable recognition accuracy and good performance on mobile devices. One challenge of the project in the future will be to create a new recognition engine that can yield better results than Zinnia.

A take that I have on this project is to use Python whenever this is possible and only use C or C++ when performance is critical, like in recognition engines. Compared to Tomoe, which implements everything in C and provides bindings for several languages, this means less reusability of the components but I hope this will make the project go forward faster.

There are still a lot of things that can be done in various areas but I really wanted to release the code I’ve put together so far because I think it can already be useful to end-users. By the way, Maemo supports both pygtk and SCIM through third-party projects, thus Tegaki is just a few Debian packages away from being available on Maemo.

For further details:
http://tegaki.sourceforge.net/

Dictzip reader in Ruby

Monday, January 5th, 2009

Both Ruby and Python have classes in their standard library to read transparently gzip-compressed files. This is very convenient because you can read compressed files just like you would do with normal files. However, random file access (i.e. moving the file position indicator to an arbitrary offset, using fseek) is not possible without performing serial access to the whole file. Because the file is compressed, there’s no way to know where a given portion of the uncompressed file is in the compressed file. Decompressing the whole file is unacceptable for large files and would be damn slow.

(more…)