
Archive for the ‘Ruby’ Category

Two Fantasdic plugins

Monday, August 3rd, 2009

Someone wrote two Fantasdic plugins to query http://open-tran.eu/ and www.mancomun.org. This is the first person to write a plugin for Fantasdic that I’m aware of. It shows that Fantasdic can easily be used as a client for online dictionary or translation services. More details here. By the way, Fantasdic is now available in most Linux distributions. In Debian/Ubuntu, you can install it with “apt-get install fantasdic”.

Java + JRuby or Jython for scientific computing: a test-case with Hidden Markov Models

Tuesday, May 19th, 2009

When it comes to programming languages for scientific computing, researchers are usually faced with a trade-off between ease of programming and runtime performance. On the one hand, you have languages/toolboxes like Matlab or R which are easy to use but slow. On the other hand, you have C and C++ which take more time to develop but usually perform the best. There exist several alternative solutions that share a little bit of both worlds. Among others:

- Implement the computation-intensive parts in C or C++ and use a scripting language like Python or Ruby for all the rest. A wrapper is necessary in order to be able to use the library from the scripting language. SWIG can be used for that. The number of existing scientific libraries written in C or C++ is big.

- Implement the computation-heavy parts in Java (Java is not bad at number crunching!) and use another JVM-based language for the rest. Popular choices recently are JRuby, Jython, Scala, Groovy and Clojure but there exist many others. These languages are usually designed from the ground-up to integrate well with Java so a wrapper is not necessary in order to interact with the library. The number of existing Java packages for scientific computing is huge.

- Use Python with Numpy. Numpy is a package for fast manipulation of arrays and matrices. Since most of the time in computation-heavy algorithms is spent in calculations on arrays or matrices, there is a huge performance gain compared to plain Python. Together with matplotlib and scipy, the Python / Numpy duo is becoming more and more popular as a replacement for Matlab and thus the number of available scientific libraries is growing fast.

- Use OCaml. OCaml is a functional programming language which is said to come close to both the performance of C and the ease of use of scripting languages. It’s pretty popular in the scientific community in France since this is where OCaml originates from. However, the number of existing libraries is smaller than in C or Java.

I gave the Java solution a shot today. I tried to use Jahmm (a Hidden Markov Models package written in Java) from both JRuby and Jython. It’s very nice to be able to use a Java library in Ruby or Python syntax! You can edit your source and try it right away without having to recompile.

One thing that I didn’t like in JRuby is that Ruby arrays must be explicitly converted to Java arrays before they can be passed to a Java method. Say you have a Java method “foo(double[] n)”. If you want to use the Ruby array [1.0, 2.0, ...] as a parameter for foo, you need to convert it to a Java array with [1.0, 2.0, ...].to_java(:double). Otherwise, you get an error telling you that the parameters don’t match the signature of foo. The explicit conversion is presumably needed because Java supports method overloading: the same method name can be defined several times with a different number or different types of parameters, so the target array type cannot always be guessed. Jython has some heuristics to do the conversion transparently for you most of the time. This makes the Jython scripts feel more natural and easier to read.

The solution of using Java for the numerical computations and a JVM-based language for the rest is quite tempting. You can use the Java library (almost) transparently from your favorite language, whether it is Python, Ruby, Scala, Groovy or Clojure… One thing that is missing, though, is interoperability between JVM-based programming languages. Say you have portions of code written in JRuby in addition to Java: it’s not yet possible to use them from Jython. So truly polyglot programming is not possible yet. You have to choose one JVM-based language in addition to Java and stick to it.

I tried HMMs with discrete distributions, multivariate Gaussians and mixtures of Gaussians as observation probability distributions in both JRuby and Jython. You can have a look and see which one you prefer.

JRuby: discrete.rb, multivariate.rb, mixture.rb
Jython: discrete.py, multivariate.py, mixture.py

Source code classifier

Wednesday, March 18th, 2009

Of the numerous Machine Learning algorithms, naive Bayes classifiers are probably among the simplest. Yet, they are known to yield very good performance in some circumstances. The most famous example of their use is probably spam filters, in which spam and non-spam emails are marked beforehand and throughout the filter’s life in order to train the filter to tell spam from non-spam.

I’ve decided to try out this technique for source code classification: given a source file for which we don’t know the programming language, the purpose is to output the n most probable programming languages in which the source is written. Although I did it mainly to see what kind of results I could get, one use I can think of is in a text-editor, to activate or deactivate syntax highlighting when the file extension is missing.
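To make the idea concrete, here is a minimal sketch of a word-count naive Bayes classifier in Ruby. It is not the code of the actual program: the class name, the whitespace tokenization and the add-one smoothing are assumptions chosen to keep the example short.

```ruby
# Minimal multinomial naive Bayes over word counts (illustrative sketch,
# not the classifier described in this post).
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, lang| h[lang] = Hash.new(0) }
    @total_words = Hash.new(0)
    @doc_counts  = Hash.new(0)
    @vocabulary  = {}
  end

  # Record the words of one source file known to be written in +lang+.
  def train(lang, text)
    @doc_counts[lang] += 1
    text.split.each do |word|
      @word_counts[lang][word] += 1
      @total_words[lang] += 1
      @vocabulary[word] = true
    end
  end

  # Return the +n+ most probable languages for +text+, best first.
  def classify(text, n = 3)
    words = text.split
    total_docs = @doc_counts.values.sum.to_f
    scores = @doc_counts.keys.map do |lang|
      # Sum log-probabilities to avoid floating-point underflow.
      score = Math.log(@doc_counts[lang] / total_docs)
      denominator = @total_words[lang] + @vocabulary.size
      words.each do |word|
        # Add-one smoothing (an assumption) so unseen words do not zero the score.
        score += Math.log((@word_counts[lang][word] + 1.0) / denominator)
      end
      [lang, score]
    end
    scores.sort_by { |_, score| -score }.first(n).map(&:first)
  end
end

nb = NaiveBayes.new
nb.train("ruby", "def hello puts 'hi' end")
nb.train("c", "int main ( void ) { return 0 ; }")
best = nb.classify("def foo puts 'x' end", 1)
puts best.inspect
```

With only two tiny training samples the result means little; the point is the shape of the computation, one score per language built from per-word probabilities.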

Difficulties

I can think of several difficulties in trying to classify source code files:
- Unlike many natural languages in which spaces delimit words, in programming languages, spaces can often be omitted. For example print("hello") and print ("hello") could well both be correct in a given language. The lack of spaces is detrimental to the performance because print("hello") will be treated as a single word and as a result have a very low probability.
- Programs are sometimes heavily commented by the programmer, which can cause the classifier to confuse source files and plain-text files.
- Since we don’t know the language in advance, we cannot rely on a grammar. Intuitively, we need to use the occurrence probability of words for each programming language.
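One possible way around the missing-spaces problem is to split identifiers from punctuation before counting words. The regular expression below is an assumption made for illustration, not necessarily how the actual program tokenizes its input.

```ruby
# Split source text into identifiers, numbers and single punctuation marks,
# so that optional spacing does not change the resulting words.
# (Hypothetical tokenizer, for illustration only.)
TOKEN = /[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]/

def tokenize(source)
  source.scan(TOKEN)
end

a = tokenize('print("hello")')
b = tokenize('print ("hello")')
puts a.inspect
puts a == b
```

Both spellings now yield the same sequence of words, so print is counted as print whether or not a space follows it.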

Collect data

Because this technique involves training the program to recognize programming languages, it was first necessary to collect source files for different programming languages. I decided to limit myself to 24 languages from Programming Language Popularity for which I had easy access to open-source programs. Of course it is possible to add more by training the program with more source files.

For each programming language, I grabbed a few projects written in the language. For interpreted languages, I also grabbed the virtual machine source code, because their standard library is often written in the language itself. I downloaded all the source files with the Debian command “apt-get source”, which grabs project sources from Debian mirrors. This made the task a little bit more convenient, but overall collecting source files was the most boring and time-consuming part of this project.

The thing is, to account for different programming styles and to find a good approximation of the word distributions, especially given the difficulties mentioned above, it’s very important to collect as many source files as one can. Once this was done I was able to start the program itself.

Ruby and C implementations

I wrote the program (trainer and classifier) in Ruby. The training part is quite slow (Ruby is quite slow for text processing) but the classifier is quite responsive, which is what matters most. I also made a C version of the classifier: the Ruby program is used to generate a header file (.h) which contains a hash table with the data for the statistical model and, when the classifier is compiled, the hash table is included directly in the executable for blazing fast processing. The executable remains small (~200K for 24 languages).

An advantage of this approach is that most of the work is done offline (i.e. during the training) and when I tweak some parameters in the Ruby program, I just have to retrain, regenerate the .h file and recompile. The core C module doesn’t change.
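The header generation step could look roughly like the sketch below. The struct layout, the names and the use of a flat array instead of a real hash table are all assumptions made to keep the example short; the actual generated file may look quite different.

```ruby
# Turn a {language => {word => log_probability}} model into C source text.
# Hypothetical layout: a flat array of entries rather than a hash table.
def generate_header(model)
  lines = ["/* Generated file, do not edit. */",
           "struct entry { const char *lang; const char *word; double logp; };",
           "static const struct entry MODEL[] = {"]
  model.each do |lang, words|
    words.each do |word, logp|
      # String#inspect gives a double-quoted literal that is fine for simple
      # alphanumeric words; a real generator would need proper C escaping.
      lines << format("  { %s, %s, %.6f },", lang.inspect, word.inspect, logp)
    end
  end
  lines << "};"
  lines << "static const int MODEL_SIZE = sizeof(MODEL) / sizeof(MODEL[0]);"
  lines.join("\n") + "\n"
end

header = generate_header({ "ruby" => { "def" => -1.2 }, "c" => { "int" => -0.9 } })
puts header
```

Retraining then only changes the generated text, which is why the C module itself can stay untouched.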

Performance evaluation

I used a validation set to estimate how well the classifier performs, measuring the percentage of correctly and incorrectly classified source files, and I got an overall accuracy of 85%. The validation set was composed of 40 source files for each of the 24 programming languages. Here are a few interesting points I noticed:

- A few languages such as ada, css, forth and ocaml are recognized correctly 100% of the time. Others such as tcl, erlang, scheme, haskell and smalltalk get an average accuracy over 95%. This shows that these languages stand out in their word distributions: the most common words in these languages are quite different from the most common words in other languages, making it easier for the classifier to distinguish them.

- On the contrary, other languages get poor accuracy because they easily get recognized as another language. An example is C++ which, according to my tests, got recognized as C or Javascript in 30% of the source files. This is not very surprising knowing that these languages have a similar syntax.

- It’s also interesting to note which languages the classifier picks when it makes mistakes. This shows us how close some languages can be. For example, not surprisingly, the classifier sometimes confuses common lisp with scheme and vice-versa.
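Figures like these can be derived from a list of (actual, predicted) pairs. The sketch below shows one way to compute overall accuracy, per-language accuracy and a confusion table; the sample data is invented for the example and is not the validation set described above.

```ruby
# results: list of [actual_language, predicted_language] pairs.
def evaluate(results)
  # confusion[actual][predicted] = number of files
  confusion = Hash.new { |h, actual| h[actual] = Hash.new(0) }
  results.each { |actual, predicted| confusion[actual][predicted] += 1 }

  per_language = confusion.map do |actual, predictions|
    [actual, predictions[actual].to_f / predictions.values.sum]
  end.to_h
  correct = results.count { |actual, predicted| actual == predicted }
  [correct.to_f / results.size, per_language, confusion]
end

# Made-up predictions, for illustration only.
results = [%w[c c], %w[c c], %w[c++ c], %w[c++ c++],
           %w[scheme scheme], %w[scheme lisp]]
overall, per_language, confusion = evaluate(results)
puts overall
puts per_language.inspect
puts confusion["c++"].inspect
```

Reading a row of the confusion table tells you which languages a given language gets mistaken for.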

Source / Plain text

If this classifier had to be integrated into a text editor, it would first be necessary to tell if the file is even source code or just plain text (like a README file). To do that, I would personally build a second model whose role is solely to distinguish source code from plain text. It’s possible to do that by training the classifier appropriately. If the file is recognized as plain text, syntax highlighting can be disabled, otherwise the programming language model can be used to determine the programming language and set the syntax highlighting accordingly. The only disadvantage to this approach is that the first model becomes a bottleneck to the overall process. If it does a poor job at distinguishing plain text from source code, then it is the whole system that will do a poor job.
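The two-stage idea can be sketched as follows. The two lambdas are crude stand-ins invented for this example; in practice each would be a trained model.

```ruby
# Stage 1 decides source vs. plain text; stage 2 runs only for source code.
def highlight_mode(content, is_source, detect_language)
  return :none unless is_source.call(content)
  detect_language.call(content)
end

# Hypothetical stand-ins for the two trained models.
is_source = ->(text) { text.count("{};()=") > text.length / 50.0 }
detect_language = ->(text) { text.include?("def ") ? :ruby : :c }

readme = "This is a README file with plain words only."
script = "def f(x)\n  x = x + 1\nend\n"
puts highlight_mode(readme, is_source, detect_language)
puts highlight_mode(script, is_source, detect_language)
```

The early return makes the bottleneck visible: if the first stage answers wrongly, the second stage never gets a chance to help.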

Download

You can get the source code (LGPL license) with this command:
$ git clone http://www.mblondel.org/code/source-classifier.git

or by browsing the web interface.

And so Forth

Wednesday, March 11th, 2009

Lately I’ve been looking into the Forth programming language. It’s a fascinating language in many ways! It combines low-level language features such as direct access to memory, absence of garbage collection and, at the same time, high-level language features such as high degree of expressiveness, reflection, ability to modify the compiler at runtime, meta-programming.
(more…)

Dictzip reader in Ruby

Monday, January 5th, 2009

Both Ruby and Python have classes in their standard library to transparently read gzip-compressed files. This is very convenient because you can read compressed files just like you would normal files. However, random file access (i.e. moving the file position indicator to an arbitrary offset, using fseek) is not possible without reading the file serially from the beginning. Because the file is compressed, there’s no way to know where a given portion of the uncompressed file is in the compressed file. Decompressing the whole file is unacceptable for large files and would be damn slow.
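The limitation is easy to see with Ruby’s Zlib::GzipReader. In the sketch below (the read_at helper is a name made up for the example), reaching a given offset means decompressing and discarding everything before it.

```ruby
require "zlib"
require "stringio"

# Build a small gzip stream in memory so the example needs no files.
plain = (0...1000).map { |i| format("line %04d\n", i) }.join
compressed = Zlib.gzip(plain)

# Return +length+ bytes starting at uncompressed +offset+.
def read_at(gzip_data, offset, length)
  reader = Zlib::GzipReader.new(StringIO.new(gzip_data))
  reader.read(offset)   # no shortcut: decompress and throw away the prefix
  reader.read(length)
ensure
  reader.close if reader
end

# Each line is 10 bytes long, so line 500 starts at offset 5000.
chunk = read_at(compressed, 10 * 500, 10)
puts chunk
```

The cost of read_at grows with the offset, which is exactly what makes plain gzip a poor fit for dictionary lookups in large files.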

(more…)

Ruby.new

Thursday, November 22nd, 2007

I gave a talk on Ruby to TELECOM Lille 1’s LUG last Thursday. It was entitled “Ruby.new” and lasted 30 minutes. It was an introduction to Ruby without being a tutorial. I emphasized the cool features that set Ruby apart. Apparently, it was pretty well received and about 15 people attended.

You can download my slides in ODP and PDF (French).

Multiple dictionary sources in Fantasdic

Thursday, November 1st, 2007

Over the past few weeks, I have slowly but surely been adding support for multiple dictionary sources to Fantasdic. Until recently, Fantasdic had been a DICT client only, that is, Fantasdic connected to DICT servers (as configured by the user in the settings) in order to retrieve definitions. I thought it would always be like that and I had even objected to changing that in gnome-dictionary but I’ve finally changed my mind. As I said some time ago, a great deal of Fantasdic’s source code is only user interface source code. If making a dictionary application means spending so much time on user interface, it’s best to make it general-purpose…

Currently, Fantasdic includes two new kinds of source, in addition to DICT servers:

- Google Translate
- EDICT files

Basically, it works like a plugin system. Source plugins can either be distributed and installed with Fantasdic or installed manually in $HOME/.fantasdic/sources/ for third-party plugins. Writing a new source plugin is merely a matter of extending a base class and implementing a few required methods. Plugins are written in Ruby.
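To give an idea of what “extending a base class” means, here is a self-contained sketch of the pattern. The class names (SourceBase, HashSource) and the method names (title, define) are hypothetical and are not Fantasdic’s actual API; see the bundled sources for the real interface.

```ruby
# Hypothetical base class: the real Fantasdic class and method names may differ.
class SourceBase
  REGISTRY = []

  # Every subclass registers itself, which is what makes it a "plugin".
  def self.inherited(subclass)
    super
    REGISTRY << subclass
  end

  def title
    raise NotImplementedError, "#{self.class} must implement title"
  end

  def define(word)
    raise NotImplementedError, "#{self.class} must implement define"
  end
end

# A toy source backed by an in-memory hash.
class HashSource < SourceBase
  WORDS = { "ruby" => "a red gemstone" }.freeze

  def title
    "In-memory dictionary"
  end

  def define(word)
    WORDS.fetch(word) { "no definition found" }
  end
end

definitions = SourceBase::REGISTRY.map { |klass| klass.new.define("ruby") }
puts definitions.inspect
```

The application only ever talks to the base class interface, so a new source can be dropped in without touching the user interface code.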

Hopefully, the user interface remained as simple as it was.

Fantasdic screenshot
Fantasdic searching in an EDICT file. EDICT is a famous dictionary format for anyone learning Japanese.

Some sources may require additional fields to be configured by the user. For example, the DICT server source requires a server host and port. The EDICT file source requires a file path to be specified. The user interface for those additional fields is defined directly in the source plugins.

Fantasdic screenshot

Fantasdic screenshot
For this source, a file must be selected…

Fantasdic screenshot
With the Google Translate source, you need to select your languages for the translations.

Fantasdic screenshot
Fantasdic, using Google Translate.

I hope more and more sources can be added :) Ideally all source plugins should be multi platform. Here are a few suggestions (of course, I’m counting on you to implement them ;-)):

- dictd file: search directly in files intended for the dictd server. See “man dictd” for a description of the format and tools/ in Fantasdic’s source code for a starting point.

- Stardict file. There’s a file describing the format in Stardict’s source code. Likewise, tools/ has a script to convert stardict files, which may be a good starting point.

- Stardict server. Stardict authors have created their own protocol and they’re running a server with quite a few dictionaries. See Stardict’s source code directly or use a packet sniffer.

- Epwing dictionaries. You’ll need to use rubyeb, the Ruby bindings to the excellent libeb.

- Wikipedia/Wiktionary. This source plugin would simply perform an HTTP request to the appropriate site. Greg Hewgill kindly agreed to share his code to clean mediawiki syntax and make it more readable. I’m quoting an email he sent to me:

The current state of my code can be found at:
http://hewgill.com/viewvc/wiktiondict/trunk/

Feel free to use any of my code (or the algorithms therein) to format
mediwiki data. I imagine you already know this, but you can fetch the
raw output for individual pages using a url like this:
http://en.wiktionary.org/w/index.php?title=test&action=raw

In fact, you can also add &templates=expand to that url and mediawiki
does all the hard template work! I found the docs at:
http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php
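As a starting point, the URL from the email can be built in Ruby like this. The helper name raw_page_url and its keyword arguments are made up for the example; only the URL pattern comes from the email.

```ruby
require "uri"

# Build the raw-wikitext URL for +title+, following the pattern quoted above.
def raw_page_url(title, lang: "en", expand_templates: false)
  params = { "title" => title, "action" => "raw" }
  params["templates"] = "expand" if expand_templates
  URI::HTTP.build(host: "#{lang}.wiktionary.org",
                  path: "/w/index.php",
                  query: URI.encode_www_form(params))
end

url = raw_page_url("test")
puts url
puts raw_page_url("chat", lang: "fr", expand_templates: true)

# Fetching the page would then be something like:
#   require "net/http"
#   wikitext = Net::HTTP.get(url)
```

The actual request is left as a comment so that the example runs without network access.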

Waiting for your comments and your source plugins!

Translating Wikipedia articles more easily

Wednesday, April 11th, 2007

It is not easy to explain the following with plain sentences so let’s take an example. Say I want to translate the following paragraph from English to French:

Tokyo is known for its many museums. Located in [[Ueno Park]] are the [[Tokyo National Museum]], the country’s largest museum and specializing in traditional [[Japanese art]]; the National Museum of Western Art; and the Tokyo Metropolitan Art Museum, which contains collections of Japanese [[modern art]] as well as over 10,000 Japanese and foreign films.

In order to complete the translation, I will need the French article name for [[Ueno Park]], [[Tokyo National Museum]], [[Japanese art]] etc. Seeking all those names is quite boring and time-consuming, isn’t it? So I have written a little tool in Ruby that does that for us. In this very example, the program would output:


----------
Ueno_Park: interwiki link to fr found (Parc de Ueno)
Tokyo_National_Museum: interwiki link to fr found (Musée national de Tōkyō)
Japanese_art: interwiki link to fr found (Art japonais)
modern_art: interwiki link to fr found (Art moderne)
----------
Tokyo is known for its many museums. Located in [[Parc de Ueno]] are the [[Musée national de Tōkyō]], the country’s largest museum and specializing in traditional [[Art japonais]]; the National Museum of Western Art; and the Tokyo Metropolitan Art Museum, which contains collections of Japanese [[Art moderne]] as well as over 10,000 Japanese and foreign films.
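The substitution step can be sketched in a few lines. Here a plain hash stands in for the interwiki lookup, which the real tool performs against Wikipedia; the function name and the sample sentence are made up for the example.

```ruby
# Replace every [[Title]] with its translation when one is known.
# +translations+ is a stand-in for the interwiki lookup.
def translate_links(text, translations)
  text.gsub(/\[\[([^\]|]+)\]\]/) do
    title = Regexp.last_match(1)
    # Links without a known translation are left unchanged.
    "[[#{translations.fetch(title, title)}]]"
  end
end

translations = {
  "Ueno Park" => "Parc de Ueno",
  "Japanese art" => "Art japonais"
}
text = "Located in [[Ueno Park]], near [[Japanese art]] and [[Unknown]]."
result = translate_links(text, translations)
puts result
```

Piped links such as [[target|label]] are not matched by this pattern and pass through untouched.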

More explanations and download here.

nihongobenkyo.org launched

Saturday, March 3rd, 2007

I spent quite some time this week working on a public dictionary server focusing on the Japanese language. The idea is that you just have to install a dictionary client, such as Fantasdic, in order to get a full-featured Japanese dictionary. Add “nihongobenkyo.org” in the settings and voilà! I have decided to abandon the former Nihongo Benkyo project because I think that the Fantasdic + nihongobenkyo.org server approach replaces it advantageously. See nihongobenkyo.org for more details.

今週一般向け日本語辞書サーバーを作るために、時間をけっこう費した。目的はFantasdicっていった辞書クライアントをインストールして、セッティングで ”nihongobenkyo.org”を書けば、ほら!日本語辞書が手に入ること。この方が便利だと思うから、昔の Nihongo Benkyoプロジェクトはもう止めると決めた。詳細はnihongobenkyo.orgを見てください。日本語のバージョンもあるよ。

If you are interested in creating your own dictionaries for the dictd server, I have written a class in Ruby that does just that. Compared to the dictfmt utility distributed with dictd, its main advantage is that it is possible to associate more than one index entry with one definition. In the process, I also learned a useful trick. The following code writes a file and sorts it “on the fly”. Easy but useful.

自分自身のdictdの辞書を作ることに興味があったら、ぼくはそのためにルビーのクラスを書いた。dictdのdictfmtツールよりいいのは、いくつかのインデックスのエントリーを一つの説明にまとめることが出来る。またそれを調べながら、いい勉強が出来た。下のコードはファイルに書いて、同時にそのファイルをソートするためのもの。簡単で便利。(^^ )


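# Everything written to the pipe goes through sort(1),
# whose sorted output is redirected to file_name.
f = IO.popen("sort > file_name", "w")
["d", "b", "a", "c"].each do |line|
  f.write(line + "\n")
end
# Closing the pipe lets sort see end-of-input and write the result.
f.close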

will write the following to file_name:

a
b
c
d

TTF/Ruby, first release!

Thursday, November 9th, 2006

I am pleased to announce the first release (version 0.1) of TTF/Ruby, under the terms of the GNU GPL.

TTF/Ruby is a pure Ruby library to read and write TrueType fonts.

Supported tables are:

- Cmap *
- Cvt *
- Fpgm *
- Gasp
- Glyf
- Head
- Hhea
- Hmtx
- Kern *
- OS/2
- Post *
- Prep *
- Vhea
- Vmtx

(Tables marked with an * are only partially supported)
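For readers who wonder how these tables are located inside a font, here is a small standalone sketch that reads the table directory found at the start of a TrueType file. It does not use TTF/Ruby’s API; the function name and the hand-made sample data are assumptions for the example.

```ruby
# Parse the table directory at the start of a TrueType file.
# Standalone sketch, not TTF/Ruby's API.
def table_directory(data)
  # Header: sfnt version (uint32), then the number of tables (uint16).
  _version, num_tables = data.unpack("Nn")
  (0...num_tables).map do |i|
    # The 12-byte header is followed by one 16-byte record per table:
    # 4-byte tag, then checksum, offset and length as big-endian uint32.
    tag, checksum, offset, length = data[12 + 16 * i, 16].unpack("a4NNN")
    { tag: tag, checksum: checksum, offset: offset, length: length }
  end
end

# A hand-made header with two empty table records, for illustration only.
header  = [0x00010000, 2, 32, 1, 0].pack("Nnnnn")
records = ["cmap", 0, 44, 0].pack("a4NNN") + ["glyf", 0, 44, 0].pack("a4NNN")
tables  = table_directory(header + records)
puts tables.map { |t| t[:tag] }.inspect
```

With a real font, the input would come from File.binread on the font file, and each record’s offset and length tell where the table’s bytes live.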

API documentation is written directly in the source code and may be generated with the following command:

$ rdoc --main "Font::TTF::File"

As you may have noticed, this release is marked 0.1, so do not expect API compatibility in the next releases.

The tarball also ships some useful tools (and proofs of concept) based on TTF/Ruby.

ttfdump: a command-line tool to extract information about a font.

ttfsubset: a tool that takes a font and an input file and generates a subset of the font containing only the characters found in the input file. This may be useful to embed a lighter version of a font in a document or in an embedded system.

ttfcairoglyphviewer: renders a selected glyph using Ruby/GTK, Rcairo and TTF/Ruby. It also displays markers for corner points, curve control points, and implicit points.

ttfglyph2svg: prints to stdout a selected glyph in SVG format.

Comments are of course welcome. And there is a large TODO list for the brave ;-)

I would like to thank Evermore Software, China, where I am currently an intern, for giving me the permission to release this project (which started as a prototype for a Java program).

Download: http://www.mblondel.org/files/ttf-ruby/ttf-ruby-0.1.tar.gz


The 愛 Japanese kanji rendered with Ruby/GTK, RCairo and TTF/Ruby (i.e. without FreeType).