Tweeter button

Twitter

August 29th, 2009

There’s a quote I like about Twitter and other social media.

With social media, we no longer search for the news, the news find us.

In Twitter, you tend to follow people who have the same interests as you, so the news that are relayed to you through these people are more likely to be appropriate and relevant to you. Of course there’s inevitably always some noise but I think it’s a nice property of Twitter. Like many people I have started to take part to that process. I tweet mostly about computer stuff and life in Japan. If that’s relevant to you, feel free to follow me.

My account is mathieuen.

Git for personal projects

August 15th, 2009

I’ve been using git for Tegaki for over a year now and I’m very happy about it. With GNOME moving to git, Fantasdic’s source code is now also managed with git. Git does have quite a steep learning curve but it is a very powerful tool once you master it.

A feature I like is branches. In SVN, branches are handled in separate directories. In git, when you switch to another branch, git swaps files for you so your file locations don’t change. This makes testing multiple versions of your code very straightforward.

Another feature I like is the stash. This is used to temporarily undo some code. For example, the other day, I was in the middle of something when my supervisor came and asked me to show him the progress of my demo program. I used “git stash” to come back to my latest commit and later I was able to re-apply my changes with “git stash apply”.

Because all commits in git are made offline (git doesn’t need a central server), git is a very good candidate for strictly personal projects as well. Lately I have been using it systematically, not only for code but for documents (reports, presentations) too. This is especially useful if your documents’ source is in text format, like LaTeX, because git can show you the diffs. Since I backup my git repositories on my server, I find git to also be a nice way to keep all my source code and documents in sync across all my computers (at the lab, at home).

Two Fantasdic plugins

August 3rd, 2009

Someone wrote two Fantasdic plugins to query http://open-tran.eu/ and www.mancomun.org. This is the first person to write a plugin for Fantasdic that I’m aware of. It shows that Fantasdic can easily be used as a client for online dictionary or translation services. More details here. By the way, Fantasdic is now available in most Linux distributions. In Debian/Ubuntu, you can install it with “apt-get install fantasdic”.

Tegaki 0.2 released

July 20th, 2009

I released Tegaki 0.2. From the Tegaki website:

Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, specifically designed for Chinese (simplified and traditional) and Japanese, and that is suitable for both the desktop and mobile devices.

This release brings a lot of improvements and bug fixes so it’s recommended to upgrade.

You can read the announcement on the Tegaki discussion group here.

There’s also preliminary Maemo support. You can download the Debian package here. You need to install python2.5-gtk2 from the extras repository. Models (.model and .meta files) can be installed in /media/mmc1/tegaki/models/zinnia/ or /media/mmc2/tegaki/models/zinnia/.

Tegaki Project discussion group

June 25th, 2009

I’ve created a Google Group for the Tegaki Project. The advantage of Google Groups is that they can be used both like a mailing-list or like a message-board, depending on everyone’s preferences. The group is for both user and developer discussions. Feel free to join!

http://groups.google.com/group/tegaki-hwr

Using internet users to do something useful

June 13th, 2009

I found an insightful and entertaining talk that I recommend you to watch. It’s about CAPTCHA in general and more specifically about ReCAPTCHA.

CAPTCHA and ReCAPTCHA

The idea of CAPTCHA is to ask users to answer a question that only humans can answer in order to check whether they are a human or a computer program. For example, in order to prevent massive creation of email accounts by computer programs, most email services ask new subscribers to write the text present in a image. The image is distorted so that it is not possible for a computer program to recognize it but it is still readable by a human. CAPTCHA is now commonly used by all kinds of websites and so millions of users are reading CAPTCHA everyday…

The basic idea of ReCAPTCHA is to use the manpower of people writing CAPTCHA in order to help digitalize old books. Although OCR (Optical Character Recognition) software is pretty efficient for printed documents, the error rate can jump quite a lot for old books due to their poor quality. So by definition, words extracted from old books are in themselves good candidates for CAPTCHA. However, extracted words are further distorted in order to make them look like real CAPTCHA.

In ReCAPTCHA, users are presented with two words to identify. For one word, we know the answer already so it is used to check whether the user is a human or not. For the other word, we want to know the answer (in order to digitalize the book). In other words, users are asked to write two words even though only one would have suffice to fulfill the purpose of the CAPTCHA. However, measurements show that on average it takes equally long to write two plain English words and one word composed of 6-8 random characters. So ReCAPTCHA is an interesting way to fulfill the CAPTCHA’s goal while doing something useful at the same time, which is nice knowing that millions of users are reading CAPTCHAs everyday.

I though ReCAPTCHA is a pretty clever idea. It’s actually an example of human-based computation — that is, the combined use of human and computer for a task that wouldn’t have been possible for one or the other alone.

Genetic art

An idea I had sometime ago (but this has already been done), which is an example of human-based computation, is to use genetic programming in order to automatically generate art. Because it is difficult for a computer alone to tell whether a certain piece of art is beautiful or not, the idea is to let a genetic algorithm create art and use humans in order to evaluate the beauty of art at each iteration of the genetic algorithm (i.e. as a fitness function).

Handwriting sample database

Another related idea is the idea of games with a purpose. This is basically what I want to do with the Chinese handwriting sample database that I mentioned several times already in this journal. The basic idea is to make it attractive for people to write handwriting samples of Chinese characters by letting them play educational games.

Java + JRuby or Jython for scientific computing: a test-case with Hidden Markov Models

May 19th, 2009

When it comes to programming languages for scientific computing, researchers are usually faced with a trade-off between ease of programming and runtime performance. On the one hand, you have languages/toolboxes like Matlab or R which are easy to use but slow. On the other hand, you have C and C++ which take more time to develop but usually perform the best. There exist several alternative solutions that share a little bit of both worlds. Among others:

- Implement the computation-intensive parts in C or C++ and use a scripting language like Python or Ruby for all the rest. A wrapper is necessary in order to be able to use the library from the scripting language. SWIG can be used for that. The number of existing scientific libraries written in C or C++ is big.

- Implement the computation-heavy parts in Java (Java is not bad at number crunching!) and use another JVM-based language for the rest. Popular choices recently are JRuby, Jython, Scala, Groovy and Clojure but there exist many others. These languages are usually designed from the ground-up to integrate well with Java so a wrapper is not necessary in order to interact with the library. The number of existing Java packages for scientific computing is huge.

- Use Python with Numpy. Numpy is a package for fast manipulation of arrays and matrices. Since most of the time in computation-heavy algorithms is spent in calculations on arrays or matrices, there is a huge performance gain compared to plain Python. Together with matplotlib and scipy, the Python / Numpy duo is becoming more and more popular as a replacement for Matlab and thus the number of available scientific libraries is growing fast.

- Use OCaml. OCaml is a functional programming language which is said to have near the performance of C and near the ease of use of scripting languages. It’s pretty popular in the scientific community in France since this is where OCaml is originating from. However, the number of existing libraries is smaller than in C or Java.

I gave a shot to the Java solution today. I tried to use Jahmm (a Hidden Markov Models package written in Java) from both JRuby and Jython. It’s very nice to be able to use a Java library in Ruby or Python syntax! You can edit your source and try it right away without having to recompile.

One thing that I didn’t like in JRuby is that Ruby arrays must be explicitly converted to Java arrays in order to be used in a Java method. Say you have a Java method “foo(double[] n)”. If you want to use the Ruby array [1.0, 2.0, ...] as a parameter for foo, you need to convert it to a Java array with [1.0, 2.0, ...].to_java(:double). Otherwise, you get an error telling you that the parameters don’t match the signature of foo. Java supports method overloading. That is, the same method can be redefined with different number of parameters or different types of parameters. Jython has some heuristics to do the conversion transparently for you most of the time. This makes the Jython script feels more natural and easier to read.

The solution to use Java for the numerical computations and a JVM-based language for the rest is quite tempting. You can use the Java library (almost) transparently from your favorite language, whether it is Python, Ruby, Scala, Groovy or Closure… One thing that is missing though, is interoperability between JVM-based programming languages. Say you have portions of code written in JRuby, in addition to Java, it’s not yet possible to use them from Jython. So truly polyglot programming is not possible yet. You have to choose one JVM-based language in addition to Java and stick to it.

I tried HMMs with discrete, multivariate gaussians and mixtures of gaussians as observation probability distribution in both JRuby and Jython. You can have a look and compare which one you prefer.

JRuby: discrete.rb, multivariate.rb, mixture.rb
Jython: discrete.py, multivariate.py, mixture.py

Personal wiki

May 10th, 2009

A few months ago, I set up a personal wiki on my server. By personal wiki, I mean a wiki that is for you and yourself only and that is not intended for anyone else to see. The advantage of the wiki is that you can read and edit it from anywhere and the wiki syntax is very convenient. Overall, I found the idea of a personal wiki to be very useful and I think it may help you organize your work/life as well.

TODO-list

I use my wiki to keep a list of things I want to do. This can be project ideas, books or publications I want to read, movies I want to watch. The only problem with a TODO-list is that you generally add more items to it than you remove so the list can grow quite fast!

Notes

Another thing I’ve been using my wiki for is taking notes. Every time I read a publication, I now write down the ideas I found interesting in the paper. For technical books, I try to make a quick summary after each chapter I read. Of course, this takes a little more time than just reading the book but I found out that 1) it helps me memorize the content better and 2) as I write down the summary, I sometimes realize that I didn’t fully grasp a concept and thus I have to clarify my understanding in order to write the notes. I also take notes of interesting companies, conferences, links, program commands I run across…

One important thing with taking notes is not getting too far in taking your notes - otherwise it’s like you’re rewriting the book that you’re reading or you’re recreating the internet…

Diary

I’ve also been using my wiki as a personal mini-diary. At the end of the day, I write down the meaningful things I did of my day and try to remember the interesting ideas I had. It didn’t become a habit yet so I forget to do it very often. Yet, one doesn’t necessarily have something interesting to say everyday so it can become a motivation to try to do something meaningful of your day.

Writing style

Another important thing to consider is that this kind of wiki is strictly personal — it should be protected with a password. Therefore you don’t have to worry about making typos or of what will people think. As a writing style, I make extensive use of bullet lists and I use a mix of English and French, depending on what comes out. Mostly for technical stuff, it often happens that words come in English rather than French so I write directly in English.

All-female cloning ants

April 17th, 2009

Apparently researchers from the University of Arizona have discovered an amazonian ant that has developed into an all-female species which reproduces by cloning.

More details can be found here:

Sexual reproduction is a great source of diversity which allows species to evolve over time through natural selection. So, from a pure evolutionary standpoint, one may wonder why these ants have favored cloning over sexual reproduction. The “choice” of cloning suggests that these ants are already very well adapted to their environment and can give up evolving. Asexual reproduction is faster, requires less energy and produces fewer errors (deficient individuals). In the case of these ants, it thus allows for mass reproduction of already well-suited individuals. However this comes at the great risk of extinction of the species on the long term because the species will be “trapped” into its current state (mutations are more rare in asexual reproduction) and won’t be able to adapt in case of environmental change (temperature change, new predator). So this strategy can be seen as favoring individuals over the species. Anyway, this is quite a fascinating discovery.

Numbers game solver

April 6th, 2009

Genetic programming is a Machine Learning technique that allows to “create programs that create programs” by simulating evolution. The purpose is to slowly converge to a program that performs a defined task as well as possible.

I wanted to try out Genetic Programming and I figured writing a solver for the “numbers game” would be a good example. If you don’t know it, it’s part of the popular game shows “Countdown” in the UK and “Des chiffres et des lettres” in France. For example, given the numbers 3, 75, 2, 4, 1 and 1, the goal is to arrive at 888 using the four basic arithmetic operations (+, −, × and ÷) like so:

74 = 75 - 1
12 = 4 * 3
888 = 74 * 12

The above calculations can be represented as a tree like this:

        *
     /      \
    -        *
  /   \    /   \
75    1    4    3

So it becomes clear that solving the numbers game is equivalent to writing little programs. Although I used the tree representation, a possibly simpler representation that I now think of is like a string of stack operations, like in Forth:

75 1 - 4 3 * *

You start with an initial population of random programs and every generation, you create a new population by selecting the best-performing programs from the current generation, mutating programs (changing a constant with another constant, changing an operator with another operator…) and breeding programs (creating programs which are the combination of two programs).

The generated programs must respect some constraints. For example, division must not have a remainder so 7 / 2 is not allowed. Another example is that the constants used must be among the 6 input numbers and can only be used once.

In Genetic Programming, to determine how good the generated programs perform, we need a scoring function. Here it’s very clear: the closer we are from the target number, the better. To decide between two solutions that perform as well as the other, we can select the one with the fewer calculations.

You can find my program here.

Another idea for applying Genetic Programming that I had is for creating hash functions. What would be the scoring function in that case? One could use the avalanche effect (the greater the avalanche effect, the better) or the average number of collisions (the smaller number of collisions, the better). Although I had this idea by myself, it seems that researchers are already investing this!