Tweeter button

Source code classifier

Of the numerous Machine Learning algorithms, naive Bayes classifiers are probably one of the simplest forms. Yet, they are known to be able to yield very good performance in some circumstances. The most famous example of their use is probably the spam filters, in which spam and non-spam emails are marked beforehand and throughout the filter’s life in order to train the filter to recognize spam from non-spam.

I’ve decided to try out this technique for source code classification: given a source file for which we don’t know the programming language, the purpose is to output the n most probable programming languages in which the source is written. Although I did it mainly to see what kind of results I could get, one use I can think of is in a text-editor, to activate or deactivate syntax highlighting when the file extension is missing.

Difficulties

I can think of several difficulties in trying to classify code source files:
- Unlike many natural languages in which spaces delimit words, in programming languages, spaces can often be omitted. For example print(”hello”) and print (”hello”) could well both be correct in a given language. The lack for spaces is detrimental to the performance because print(”hello”) will be treated as a single word and as a result have a very low probability.
- Programs are sometimes heavily commented by the programmer, which can cause the classifier to confuse source files and plain-text files.
- Since we don’t know the language in advance, we cannot rely on a grammar. Intuitively, we need to use the occurrence probability of words for each programming language.

Collect data

Because this technique involves training the program to recognize programming languages, it was first necessary to collect source files for different programming languages. I decided to limit myself to 24 languages from Programming Language Popularity for which I had easy access to open-source programs. Of course it is possible to add more by training the program with more source files.

For each programming language, I grabbed a few projects written in the language and for interpreted languages, I also grabbed the virtual machine source code because their standard library is often written in the language directly. I downloaded all the source files with the Debian command “apt-get source”, which grabs project sources from Debian mirrors, making the task a little bit more convenient but overall collecting source files was the most boring and time-consuming part of this project.

The thing is, to account for different programming styles and to find a good approximation of the word distributions, especially given the difficulties mentioned above, it’s very important to collect as much source files as one can. Once this was done I was able to start the program itself.

Ruby and C implementations

I made the program (trainer and classifier) in Ruby. The training part is quite slow (Ruby is quite slow for text processing) but the classifier is well responsive, which is the most important. I also made a C version of the classifier: the Ruby program is used to generate a header file (.h) which contains a hash table with the data for the statistical model and when the classifier is compiled, the hash table is included directly in the executable for blazing fast processing. The executable remains small (~200K for 24 languages).

An advantage of this approach is that most of the work is done offline (i.e. during the training) and when I tweak some parameters in the Ruby program, I just have to retrain, regenerate the .h file and recompile. The core C module doesn’t change.

Performance evaluation

I used a validation set to estimate how good the classifier performs: the percentage of correctly classified source files and the percentage of incorrectly classified source files and I got an overall accuracy of 85%. The validation set was composed of 40 sources files for each of the 24 programming languages. Here are a few interesting points I noticed:

- A few languages such ada, css, forth, ocaml are recognized correctly 100% of the time. Others such as tcl, erlang, scheme, haskell, smalltalk get an average accuracy over 95%. This shows that these languages stand out in their word distributions: the most common words in these languages are quite different from the most common words in other languages making it easier for the classifier to distinguish them.

- On the contrary, other languages get poor accuracy because they easily get recognized as another language. An example is C++ which, according to my tests, got recognized as C or Javascript in 30% of the source files. This is not very surprising knowing that these languages have a similar syntax.

- It’s also interesting to note what languages does the classifier recognizes when it makes mistakes. This shows us how close some languages can be. For example, not surpringly, the classifiers sometimes confuses common lisp with scheme and vice-versa.

Source / Plain text

If this classifier had to be integrated into a text editor, it would first be necessary to tell if the file is even source code or just plain text (like a README file). To do that, I would personally build a second model whose role is solely to distinguish source code from plain text. It’s possible to do that by training the classifier appropriately. If the file is recognized as plain text, syntax highlighting can be disabled, otherwise the programming language model can be used to determine the programming language and set the syntax highlighting accordingly. The only disadvantage to this approach is that the first model becomes a bottleneck to the overall process. If it does a poor job at distinguishing plain text from source code, then it is the whole system that will do a poor job.

Download

You can get the source code (LGPL license) with this command:
$ git clone http://www.mblondel.org/code/source-classifier.git

or by browsing the web interface.

One Response to “Source code classifier”

  1. Will Dwinnell Says:

    This is an interesting project. Thanks!

Leave a Reply