Tweeter button

Source code classifier

March 18th, 2009

Of the numerous Machine Learning algorithms, naive Bayes classifiers are probably one of the simplest forms. Yet, they are known to be able to yield very good performance in some circumstances. The most famous example of their use is probably the spam filters, in which spam and non-spam emails are marked beforehand and throughout the filter’s life in order to train the filter to recognize spam from non-spam.

I’ve decided to try out this technique for source code classification: given a source file for which we don’t know the programming language, the purpose is to output the n most probable programming languages in which the source is written. Although I did it mainly to see what kind of results I could get, one use I can think of is in a text-editor, to activate or deactivate syntax highlighting when the file extension is missing.

Difficulties

I can think of several difficulties in trying to classify code source files:
- Unlike many natural languages in which spaces delimit words, in programming languages, spaces can often be omitted. For example print(”hello”) and print (”hello”) could well both be correct in a given language. The lack for spaces is detrimental to the performance because print(”hello”) will be treated as a single word and as a result have a very low probability.
- Programs are sometimes heavily commented by the programmer, which can cause the classifier to confuse source files and plain-text files.
- Since we don’t know the language in advance, we cannot rely on a grammar. Intuitively, we need to use the occurrence probability of words for each programming language.

Collect data

Because this technique involves training the program to recognize programming languages, it was first necessary to collect source files for different programming languages. I decided to limit myself to 24 languages from Programming Language Popularity for which I had easy access to open-source programs. Of course it is possible to add more by training the program with more source files.

For each programming language, I grabbed a few projects written in the language and for interpreted languages, I also grabbed the virtual machine source code because their standard library is often written in the language directly. I downloaded all the source files with the Debian command “apt-get source”, which grabs project sources from Debian mirrors, making the task a little bit more convenient but overall collecting source files was the most boring and time-consuming part of this project.

The thing is, to account for different programming styles and to find a good approximation of the word distributions, especially given the difficulties mentioned above, it’s very important to collect as much source files as one can. Once this was done I was able to start the program itself.

Ruby and C implementations

I made the program (trainer and classifier) in Ruby. The training part is quite slow (Ruby is quite slow for text processing) but the classifier is well responsive, which is the most important. I also made a C version of the classifier: the Ruby program is used to generate a header file (.h) which contains a hash table with the data for the statistical model and when the classifier is compiled, the hash table is included directly in the executable for blazing fast processing. The executable remains small (~200K for 24 languages).

An advantage of this approach is that most of the work is done offline (i.e. during the training) and when I tweak some parameters in the Ruby program, I just have to retrain, regenerate the .h file and recompile. The core C module doesn’t change.

Performance evaluation

I used a validation set to estimate how good the classifier performs: the percentage of correctly classified source files and the percentage of incorrectly classified source files and I got an overall accuracy of 85%. The validation set was composed of 40 sources files for each of the 24 programming languages. Here are a few interesting points I noticed:

- A few languages such ada, css, forth, ocaml are recognized correctly 100% of the time. Others such as tcl, erlang, scheme, haskell, smalltalk get an average accuracy over 95%. This shows that these languages stand out in their word distributions: the most common words in these languages are quite different from the most common words in other languages making it easier for the classifier to distinguish them.

- On the contrary, other languages get poor accuracy because they easily get recognized as another language. An example is C++ which, according to my tests, got recognized as C or Javascript in 30% of the source files. This is not very surprising knowing that these languages have a similar syntax.

- It’s also interesting to note what languages does the classifier recognizes when it makes mistakes. This shows us how close some languages can be. For example, not surpringly, the classifiers sometimes confuses common lisp with scheme and vice-versa.

Source / Plain text

If this classifier had to be integrated into a text editor, it would first be necessary to tell if the file is even source code or just plain text (like a README file). To do that, I would personally build a second model whose role is solely to distinguish source code from plain text. It’s possible to do that by training the classifier appropriately. If the file is recognized as plain text, syntax highlighting can be disabled, otherwise the programming language model can be used to determine the programming language and set the syntax highlighting accordingly. The only disadvantage to this approach is that the first model becomes a bottleneck to the overall process. If it does a poor job at distinguishing plain text from source code, then it is the whole system that will do a poor job.

Download

You can get the source code (LGPL license) with this command:
$ git clone http://www.mblondel.org/code/source-classifier.git

or by browsing the web interface.

And so Forth

March 11th, 2009

Lately I’ve been looking into the Forth programming language. It’s a fascinating language in many ways! It combines low-level language features such as direct access to memory, absence of garbage collection and, at the same time, high-level language features such as high degree of expressiveness, reflection, ability to modify the compiler at runtime, meta-programming.
Read the rest of this entry »

WPA with Intersil Prism chipset-based wireless card

March 7th, 2009

In order to install a wireless connection in my mother’s laptop, I used an old Netgear PCMCIA wireless card of mine. However, I faced the problem that when clicking on the home network in the list of available networks in NetworkManager, WPA encryption wouldn’t come out as a possible encryption method, although the network IS using WPA. The solution was to flash the card’s firmware. You can find an excellent how-to here but here’s a quick summary of what I did (assuming the interface is wlan0).

# apt-get install hostap-utils
$ wget http://www.red-bean.com/~proski/firmware/Latest-prism.tar.bz2
$ tar xvfj  Latest-prism.tar.bz2
$ cd Latest-prism/
# prism2_srec -f wlan0 -O /proc/net/hostap/wlan0/pda primary-FLASH/pk010101.hex secondary-FLASH/sf010804.hex

You can run the following command before and after the firmware upgrade to check that the firmware version was upgraded correctly:

# hostap_diag wlan0

Installing Ubuntu from an existing Linux

March 7th, 2009

I’m giving my old laptop to my mother so that she can browse the web, write emails and chat over instant messaging and I wanted to replace Debian with Ubuntu. However, the cdrom no longer works and the mother board doesn’t support booting from an USB drive. One solution is to add an entry to your bootloader pointing to a netinstall vmlinuz (here called linux) and initrd.gz and choose that entry at boot in order to perform the netinstall. You need an Internet connection over a LAN network with a DHCP server. All the details are available here.

First release of Tegaki

February 11th, 2009

Today I’m releasing Tegaki 0.1. Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, that is suitable for both the desktop and mobile devices, and that is designed from the ground up to work well with Chinese and Japanese.

Screencast video: ogg or youtube.

This release features desktop and SCIM integration. However, the main “innovation” brought to you by this release is the user interface. It uses two drawing areas for continuous writing. The user can eventually fix recognition errors by choosing alternative candidates or editing characters. Since a video is worth a thousand words, see the screencast above. This interface is largely inspired from the Nintendo DS game “Kanji Sono Mama Rakubiki Jiten” (漢字そのまま楽引辞典).

Tegaki is designed to be able to use several recognition engines. However so far it only supports Zinnia, which is the only recognition engine that I know with acceptable recognition accuracy and good performance on mobile devices. One challenge of the project in the future will be to create a new recognition engine that can yield better results than Zinnia.

A take that I have on this project is to use Python whenever this is possible and only use C or C++ when performance is critical, like in recognition engines. Compared to Tomoe, which implements everything in C and provides bindings for several languages, this means less reusability of the components but I hope this will make the project go forward faster.

There are still a lot of things that can be done in various areas but I really wanted to release the code I’ve put together so far because I think it can already be useful to end-users. By the way, Maemo supports both pygtk and SCIM through third-party projects, thus Tegaki is just a few Debian packages away from being available on Maemo.

For further details:
http://tegaki.sourceforge.net/

Dictzip reader in Ruby

January 5th, 2009

Both Ruby and Python have classes in their standard library to read transparently gzip-compressed files. This is very convenient because you can read compressed files just like you would do with normal files. However, random file access (i.e. moving the file position indicator to an arbitrary offset, using fseek) is not possible without performing serial access to the whole file. Because the file is compressed, there’s no way to know where a given portion of the uncompressed file is in the compressed file. Decompressing the whole file is unacceptable for large files and would be damn slow.

Read the rest of this entry »

Linux in a Virtual Machine

December 26th, 2008

I own a Macbook on which I’ve been running Linux 99% of the time for over a year now. Although a Macbook is not necessarily the best choice to run Linux, I made that decision because installing Linux on a Macbook is very well documented. However, as far as you can get, it’s always difficult to get a configuration you are 100% happy with (no subwoofer support, flaky suspend…). With recent advances in virtualization technologies, both in software and hardware, I’ve been willing to test running Linux and Windows (the guest OSes) inside Mac OS X (the host OS).

Read the rest of this entry »

Difficultés du français pour les Japonais

November 3rd, 2008

(English version below)

Voici une liste non-exhaustive et en vrac de difficultés que, d’après mon expérience, les Japonais apprenant le français rencontrent.

  • Distinction entre les sons r et l, v et b, eu et ou.
    • Le français comporte 36 phonèmes (16 voyelles et 20 consonnes) tandis que le japonais n’en comporte que 19 (5 voyelles et 14 consonnes).
  • Distinction entre les articles définis (le, la, les) et les articles indéfinis (un, une, des).
  • Différences sémantiques entre les temps du passé: passés simple et composé, imparfait.
  • Verbes pronominaux comme se souvenir.
  • Distinction de son entre les pronoms il et elle.
  • Le genre des noms - masculin ou féminin.
  • La concordance des temps.
  • Une tendance à thématiser.
    • En caricaturant, « la pomme, je l’ai mangée » au lieu de « j’ai mangé la pomme ».
  • Les mots qui ne s’écrivent pas tels qu’il se prononcent comme monsieur ou oignon.

Cette liste n’inclue bien sûr pas toutes les difficultés que les Français eux-mêmes rencontrent… :-)

————————-

Below a non-exhaustive list of difficulties without particular order that, according to my experience, Japanese people learning French may encounter.

  • Distinction between r and l, v and b, eu and ou.
    • French is composed of 36 phonemes (16 vowels and 20 consonants) while Japanese has only 19 (5 vowels and 14 consonants).
  • Distinction between definite articles (le, la, les) and indefinites articles (un, une, des) .
  • Semantical differences between past tenses.
  • Reflexive verbs such as se souvenir.
  • Sound distinction between pronouns il and elle.
  • Noun genders - male and female.
  • Tense harmony (agreement of tenses).
  • Tendency to topicalization.
    • « the apple, I ate it » instead of « I ate the apple ». (exaggerated example)
  • Words which are not pronounced the way they are written such as monsieur or oignon.

Everybody is the friend of a friend

September 24th, 2008

Something is really fascinating me on social networking websites like Facebook and since this is already the third time that it happens to me in only a few months, I had to write about it in this journal!

- Primary school friends and middle school friends know each other although I didn’t go to the middle school most of my primary school friends went to and my middle school is quite far away from my primary school.

- The friend of a friend I met in France is the friend of a friend I met in Japan (happened twice).

I’ve once been told that within a distance a 6 intermediate friends, everyone on Facebook is the friend of a friend. This is fascinating yet this sounds very plausible to me.

See also Six degrees of separation @ Wikipedia

ibus

September 20th, 2008

I was checking the web to see whether it’s possible to write an input method for SCIM using Python (the answer was yes), and I found ibus. ibus is a new input framework, by the author of SCIM himself. A particularity is that most of the core seems to be written in Python, which is interesting for such a low level thing as an input method.

Input methods like ibus-pinyin, ibus-anthy are thus written directly in Python. It should be pretty easy to integrate Zinnia into ibus then! (using the canvas widget provided by libtegaki-gtk)

- ibus @ Google Code
- Ibus @ Github