Archive for the ‘In English’ Category

Easy parallelization with data decomposition

Friday, November 27th, 2009

Recently I came across this blog post which introduced me to the new multiprocessing module in Python 2.6, a module to execute multiple concurrent processes. It makes parallelizing your programs very easy. The author also provided a smart code snippet that makes using multiprocessing even easier. I studied how the snippet works and I came up with an alternative solution which is in my opinion very elegant and easy to read. I’m so excited about the new possibilities provided by this module that I had to spread the word. But first, off to some background.

(more…)

First look at Cython

Friday, November 27th, 2009

The Python and C/C++ duo

Lately, Python and C/C++ are becoming my language combination of choice for my research. It’s a pragmatical choice.

Regarding Python:

- It has interesting packages for scientific computing such as NumPy (fast multi-dimensional arrays and vectorized code), SciPy (reusable scientific packages), Matplotlib (plotting), IPython (Matlab-like interactive environment).
- It has many libraries and many bindings/wrappers for C/C++ libraries, including in my fields of interest such as Machine Learning, Natural Language Processing and Image Processing.
- It has many users, meaning that more people can contribute to your projects.
- It’s a full-fledge language, with powerful features and a large standard library.

Regarding C/C++:

- They are the most commonly used languages to write native extensions for Python. Even though it’s possible to get huge speedups by vectorizing your code with NumPy (avoid for loops like the plague!), you can never get anywhere close to native programs speed.
- They are pretty much the fastest languages out there, although Fortran can be faster.

In a nutshell, I try to use Python and NumPy as much as possible and when necessary, I rewrite selected portions in C or C++.

(more…)

Fantasdic on Mac OS X install how-to

Sunday, September 13th, 2009

This is how you can install Fantasdic, my (self-proclaimed ;-)) versatile dictionary application in Mac OS X. Windows users can download an application bundle from the official website and Linux users can probably install it from their distro’s package manager (at least on Debian, Ubuntu and Fedora).

1. Macports

Install Macports.

2. X11

Install X11 for Mac OS X.

3. Fantasdic

Install dependencies:
$ sudo port install rb-gtk2 rb-libglade2 git-core

Retrieve latest source code:
$ git clone git://git.gnome.org/fantasdic

Install fantasdic:
$ cd fantasdic/
$ ruby setup.rb config
$ ruby setup.rb setup
$ sudo ruby setup.rb install

You can now launch fantasdic by running the “fantasdic” command.

You can use Platypus to make it a dock application. In that case, you need to input the full path to the ruby interpreter and fantasdic: /opt/local/bin/ruby and /opt/local/bin/fantasdic, respectively.

4. Kinput2 and canna

You can safely skip this if you don’t need to input Japanese.

Install kinput2 and canna (kana-kanji conversion server):
$ sudo port install kinput2 canna

Activate canna on startup:
$ sudo launchctl load -w /opt/local/etc/LaunchDaemons/org.macports.canna/org.macports.canna.plist

Activate kinput2 on X’s startup:
$ cp /usr/X11/lib/X11/xinit/xinitrc ~/.xinitrc
$ vi ~/.xinitrc

And add the following line below “# start some nice programs”:
test -x /opt/local/bin/kinput2 && /opt/local/bin/kinput2 &

The command to launch fantasdic is now:
XMODIFIERS=”@im=kinput2″ GTK_IM_MODULE=”xim” LANG=”ja_JP.UTF-8″ fantasdic

And the obligatory screenshot ;-)

Dynamic Time Warping : theory

Monday, August 31st, 2009

Recently, I’ve been working on a new handwriting recognition engine for Tegaki based on Dynamic Time Warping and I figured it would be interesting to make a short, informal introduction to it.

Dynamic Time Warping (DTW) is a well-known algorithm which aims at comparing and aligning two sequences of data points (a.k.a time series). Although it was originally developed for speech recognition (see [1]), it has also been applied to many other fields like bioinformatics, econometrics and, of course, handwriting recognition.

Consider two sequences A and B, composed respectively of n and m feature vectors.

Each feature vector is d-dimensional and can thus be represented as a point in a d-dimensional space. For example, in handwriting recognition, we could directly use the raw (x,y) coordinates of the pen movement and that would make us sequences of 2-dimensional vectors. In practice however, one would extract more useful features from (x,y) and create vectors of dimension possibly greater than 2. It’s also worth noting that the sequences A and B can be of different length.

Time warping

DTW works by warping (hence the name) the time axis iteratively until an optimal match between the two sequences is found.

In the figure above, which is an example of two sequences of data points with only 1 dimension, the time axis is warped so that each data point in the green sequence is optimally aligned to a point in the blue sequence.

Best path

We can construct a n x m distance matrix. In this matrix, each cell (i,j) represents the distance between the i-th element of sequence A and the j-th element of sequence B. The distance metric used depends on the application but a common metric is the euclidean distance.

Finding the best alignment between two sequences can be seen as finding the shortest path to go from the bottom-left cell to the top-right cell of that matrix. The length of a path is simply the sum of all the cells that were visited along that path. The further away the optimal path wanders from the diagonal, the more the two sequences need to be warped to match together.

The brute force approach to finding the shortest path would be to try each path one by one and finally select the shortest one. However it’s apparent that it would result in an explosion of paths to explore, especially if the two sequences are long. To solve this problem, DTW uses two things: constraints and dynamic programming.

Constraints

DTW can impose several kinds of reasonable constraints, to limit the number of paths to explore.

  • Monotonicity: The alignment path doesn’t go back in time index. This guarantees that features are not repeated in the alignment.
  • Continuity: The alignment doesn’t jump in time index. This guarantees that important features are not omitted.
  • Boundary: The alignment starts at the bottom-left and ends at the top-right. This guarantees that the sequences are not considered only partially.
  • Warping window: A good alignment path is unlikely to wander too far from the diagonal. This guarantees that the alignment doesn’t try to skip different features or get stuck at similar features.
  • Shape: Aligned paths shouldn’t be too steep or too shallow. This prevents short sequences to be aligned with long ones.

These constraints are best visualized in [3].

Dynamic Programming

Taking advantage of such constraints, DTW uses dynamic programming to find the best alignment in a recursive way. Previously, the cell (i,j) of the distance matrix was defined as “the distance between the i-th element of sequence A and the j-th element of sequence B”. In the dynamic programming way of thinking, this definition is changed, and instead, the cell (i,j) is defined as the length of the shortest path up to that cell. Assuming local constraints like below,

it allows us to define the cell (i,j) recursively:

cell(i,j) = local_distance(i,j) + MIN(cell(i-1,j), cell(i-1,j-1), cell(i, j-1))

Here, recursively means that the shortest path up to the cell (i,j) is defined in terms of the shortest path up to the adjacent cells. A lot of different local constraints can be defined (see this table) and thus there are many variations in the way DTW can be implemented.

DTW as a distance metric

Once the algorithm has reached the top-right cell, we can use backtracking in order to retrieve the best alignment. If we’re just interested in comparing the two sequences however, then the top-right cell of the matrix just happens to be the length of the shortest path. We can therefore use the value stored in this cell as the distance between the two sequences. DTW has the nice property to be symmetric so DTW(a,b) = DTW(b,a). Also, DTW doesn’t fulfill the triangle inequality but it isn’t a problem in practice.

Related algorithms

DTW looks almost identical to the Levenshtein algorithm, an algorithm to compare strings, and is very similar to the Smith-Waterman algorithm, an algorithm for sequence alignment.

References

[1] Sakoe, H. and Chiba, S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1) pp. 43- 49, 1978

[2] DTW algorithm @ GenTχWarper

[3] PowerPoint presentation by Elena Tsiporkova

Twitter

Saturday, August 29th, 2009

There’s a quote I like about Twitter and other social media.

With social media, we no longer search for the news, the news find us.

In Twitter, you tend to follow people who have the same interests as you, so the news that are relayed to you through these people are more likely to be appropriate and relevant to you. Of course there’s inevitably always some noise but I think it’s a nice property of Twitter. Like many people I have started to take part to that process. I tweet mostly about computer stuff and life in Japan. If that’s relevant to you, feel free to follow me.

My account is mathieuen.

Git for personal projects

Saturday, August 15th, 2009

I’ve been using git for Tegaki for over a year now and I’m very happy about it. With GNOME moving to git, Fantasdic’s source code is now also managed with git. Git does have quite a steep learning curve but it is a very powerful tool once you master it.

A feature I like is branches. In SVN, branches are handled in separate directories. In git, when you switch to another branch, git swaps files for you so your file locations don’t change. This makes testing multiple versions of your code very straightforward.

Another feature I like is the stash. This is used to temporarily undo some code. For example, the other day, I was in the middle of something when my supervisor came and asked me to show him the progress of my demo program. I used “git stash” to come back to my latest commit and later I was able to re-apply my changes with “git stash apply”.

Because all commits in git are made offline (git doesn’t need a central server), git is a very good candidate for strictly personal projects as well. Lately I have been using it systematically, not only for code but for documents (reports, presentations) too. This is especially useful if your documents’ source is in text format, like LaTeX, because git can show you the diffs. Since I backup my git repositories on my server, I find git to also be a nice way to keep all my source code and documents in sync across all my computers (at the lab, at home).

Two Fantasdic plugins

Monday, August 3rd, 2009

Someone wrote two Fantasdic plugins to query http://open-tran.eu/ and www.mancomun.org. This is the first person to write a plugin for Fantasdic that I’m aware of. It shows that Fantasdic can easily be used as a client for online dictionary or translation services. More details here. By the way, Fantasdic is now available in most Linux distributions. In Debian/Ubuntu, you can install it with “apt-get install fantasdic”.

Tegaki 0.2 released

Monday, July 20th, 2009

I released Tegaki 0.2. From the Tegaki website:

Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, specifically designed for Chinese (simplified and traditional) and Japanese, and that is suitable for both the desktop and mobile devices.

This release brings a lot of improvements and bug fixes so it’s recommended to upgrade.

You can read the announcement on the Tegaki discussion group here.

There’s also preliminary Maemo support. You can download the Debian package here. You need to install python2.5-gtk2 from the extras repository. Models (.model and .meta files) can be installed in /media/mmc1/tegaki/models/zinnia/ or /media/mmc2/tegaki/models/zinnia/.

Tegaki Project discussion group

Thursday, June 25th, 2009

I’ve created a Google Group for the Tegaki Project. The advantage of Google Groups is that they can be used both like a mailing-list or like a message-board, depending on everyone’s preferences. The group is for both user and developer discussions. Feel free to join!

http://groups.google.com/group/tegaki-hwr

Using internet users to do something useful

Saturday, June 13th, 2009

I found an insightful and entertaining talk that I recommend you to watch. It’s about CAPTCHA in general and more specifically about ReCAPTCHA.

CAPTCHA and ReCAPTCHA

The idea of CAPTCHA is to ask users to answer a question that only humans can answer in order to check whether they are a human or a computer program. For example, in order to prevent massive creation of email accounts by computer programs, most email services ask new subscribers to write the text present in a image. The image is distorted so that it is not possible for a computer program to recognize it but it is still readable by a human. CAPTCHA is now commonly used by all kinds of websites and so millions of users are reading CAPTCHA everyday…

The basic idea of ReCAPTCHA is to use the manpower of people writing CAPTCHA in order to help digitalize old books. Although OCR (Optical Character Recognition) software is pretty efficient for printed documents, the error rate can jump quite a lot for old books due to their poor quality. So by definition, words extracted from old books are in themselves good candidates for CAPTCHA. However, extracted words are further distorted in order to make them look like real CAPTCHA.

In ReCAPTCHA, users are presented with two words to identify. For one word, we know the answer already so it is used to check whether the user is a human or not. For the other word, we want to know the answer (in order to digitalize the book). In other words, users are asked to write two words even though only one would have suffice to fulfill the purpose of the CAPTCHA. However, measurements show that on average it takes equally long to write two plain English words and one word composed of 6-8 random characters. So ReCAPTCHA is an interesting way to fulfill the CAPTCHA’s goal while doing something useful at the same time, which is nice knowing that millions of users are reading CAPTCHAs everyday.

I though ReCAPTCHA is a pretty clever idea. It’s actually an example of human-based computation — that is, the combined use of human and computer for a task that wouldn’t have been possible for one or the other alone.

Genetic art

An idea I had sometime ago (but this has already been done), which is an example of human-based computation, is to use genetic programming in order to automatically generate art. Because it is difficult for a computer alone to tell whether a certain piece of art is beautiful or not, the idea is to let a genetic algorithm create art and use humans in order to evaluate the beauty of art at each iteration of the genetic algorithm (i.e. as a fitness function).

Handwriting sample database

Another related idea is the idea of games with a purpose. This is basically what I want to do with the Chinese handwriting sample database that I mentioned several times already in this journal. The basic idea is to make it attractive for people to write handwriting samples of Chinese characters by letting them play educational games.