Linux in a Virtual Machine

December 26th, 2008

I own a MacBook on which I’ve been running Linux 99% of the time for over a year now. Although a MacBook is not necessarily the best choice for running Linux, I made that decision because installing Linux on a MacBook is very well documented. However, no matter how far you push it, it’s always difficult to get a configuration you are 100% happy with (no subwoofer support, flaky suspend…). With recent advances in virtualization technologies, both in software and hardware, I’ve been wanting to try running Linux and Windows (the guest OSes) inside Mac OS X (the host OS).


Difficulties of French for Japanese Speakers

November 3rd, 2008

Below is a non-exhaustive list, in no particular order, of difficulties that, in my experience, Japanese people learning French encounter.

  • Distinguishing the sounds r and l, v and b, eu and ou.
    • French has 36 phonemes (16 vowels and 20 consonants) while Japanese has only 19 (5 vowels and 14 consonants).
  • Distinguishing definite articles (le, la, les) from indefinite articles (un, une, des).
  • Semantic differences between the past tenses: passé simple, passé composé and imparfait.
  • Pronominal (reflexive) verbs such as se souvenir.
  • Telling the pronouns il and elle apart by sound.
  • Noun gender: masculine or feminine.
  • Sequence of tenses (agreement of tenses).
  • A tendency to topicalize.
    • To exaggerate, « the apple, I ate it » instead of « I ate the apple ».
  • Words that are not pronounced the way they are written, such as monsieur or oignon.

Of course, this list does not include all the difficulties that the French themselves encounter… :-)

ibus

September 20th, 2008

I was checking the web to see whether it’s possible to write an input method for SCIM using Python (the answer was yes), and I found ibus. ibus is a new input framework, by the author of SCIM himself. One notable thing is that most of the core seems to be written in Python, which is interesting for something as low-level as an input method.

Input methods such as ibus-pinyin and ibus-anthy are thus written directly in Python. It should then be pretty easy to integrate Zinnia into ibus (using the canvas widget provided by libtegaki-gtk)!

- ibus @ Google Code
- Ibus @ Github

Zinnia

September 20th, 2008

In my last post, I wrote about an impressive Chinese character recognition demo that uses AJAX on the client side and Support Vector Machines (SVM) on the server side for the recognition process. Well, I don’t know if it’s just a coincidence (the demo is two years old), but last week Taku Kudo released the backend he’s using as free software. Needless to say, this was awesome news for me! I know the basic principle of SVM, but I guess it’s time to learn more about it…

His project, called Zinnia, has been rewritten from scratch to be more flexible and reusable. Models for Japanese and Chinese are included but models for other languages can be built easily provided that you have training data. I’m pretty sure that this package could also be useful for Gesture Recognition because it’s so close to Handwriting Recognition…

For the sake of comparison, I wanted to evaluate how Zinnia performs compared to both Tomoe and my own HMM experiment. I used the same evaluation corpus that I wrote about in earlier posts, that is, two sets of 50 kanji written by a Japanese friend of mine and me. The characters have the correct stroke order and were drawn carefully. Therefore, the results below indicate how the different recognizers perform in ideal conditions and don’t indicate how robust they would be in more difficult conditions.

Tomoe – Zinnia

                       Tomoe                 Zinnia
1st match accuracy     61%                   77%
5 matches accuracy     74%                   92%
10 matches accuracy    74%                   93%
Recognition time       21 / 100 = 0.21 s     3 / 100 = 0.03 s
Total number of kanji  3000                  3000

1st match accuracy is the percentage of characters that were recognized as first match.
5 matches accuracy is the percentage of characters that were recognized in the first 5 matches.
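
As a rough illustration of these two metrics, here is a minimal Python sketch. It is not the evaluation script mentioned in this post: the function name top_n_accuracy and the sample recognizer output are made up for the example, and it assumes each recognizer returns its candidates ordered from best to worst match.

```python
def top_n_accuracy(results, n):
    """Fraction of samples whose expected character is among the first n candidates.

    results is a list of (expected, candidates) pairs, where candidates is
    the recognizer output ordered from best to worst match.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for expected, candidates in results if expected in candidates[:n])
    return hits / len(results)


# Hypothetical recognizer output for three handwritten samples.
results = [
    ("日", ["日", "目", "白"]),
    ("本", ["木", "本", "末"]),
    ("語", ["話", "詰", "諾"]),
]

print(top_n_accuracy(results, 1))  # 1 of the 3 samples is the first match
print(top_n_accuracy(results, 5))  # 2 of the 3 samples are within the first 5
```

With n = 1 this gives the "1st match accuracy", and with n = 5 or n = 10 the "5 matches" and "10 matches" accuracies.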

You can download my evaluation script for Zinnia here. Tomoe’s evaluation script is sitting in Tomoe’s SVN repository, in the benchmark/ folder.

A few remarks:
- Zinnia is notably better than Tomoe in terms of accuracy
- Zinnia is about 7 times faster than Tomoe, making it a good candidate for an embedded platform
- In both cases, 5 matches and 10 matches accuracy are about the same, meaning that it would be enough for the user interface to display the first 5 matches only.

Project Tegaki – Zinnia

Due to lack of training data, my personal HMM experiment (project Tegaki) was only conducted over a set of 50 characters. However, Zinnia supports over 3000 characters. For fair comparison, I thus created new models for Zinnia using the same training data as I used for my experiment.

Zinnia was trained with only one sample per character, using the same data as Tomoe, which is template-based. While SVM seems to be able to cope with only one sample per character, it’s a little bit more complicated to do that with HMM because of the need to find the parameters of the Observation Probability Density Function (e.g. mean and variance for a Gaussian).

                       Project Tegaki        Zinnia
1st match accuracy     92%                   100%
5 matches accuracy     100%                  100%
10 matches accuracy    100%                  100%
Recognition time       14 / 100 = 0.14 s     1.50 / 100 = 0.015 s
Total number of kanji  50                    50

A few remarks:
- My experiment is slow, which is probably due to the fact that I’m using Character-level models. Stroke-level models are known to scale much better.
- My experiment has slightly worse accuracy, which is probably because I’m only using two features per observation.

Handwriting database

If you follow my adventures in the world of handwritten Chinese character recognition, you probably know that I’m planning to create a handwriting database website. This database will aim to 1) make it easy and attractive for people to contribute their handwriting samples and 2) make it easy for the database staff to manage and organize what is supposed to become a large collection of handwriting samples.

The database will use a client/server architecture. So far I’m thinking of four important clients:
- A client that people will be able to use directly in their web browser, using my web canvas
- A client for the Maemo platform
- A client for the iPhone
- A multi-platform client for the Desktop

A client of slightly lesser priority would be a Facebook application.

The handwriting samples collected will be distributed under a free software license. For projects like Zinnia or Project Tegaki, this will mean more training data and more means to evaluate the performance. I consider this database one of my priorities among my free software projects but it’s going to be quite hard for me to find time for that before December…

Contribute

As always, more people are welcome to contribute.

To download the source code of my work,

$ git clone http://www.mblondel.org/code/hwr.git

web interface

HCCR using AJAX and SVM

September 9th, 2008

I came across this impressive page (in Japanese), which shows a demonstration of Handwritten Chinese Character Recognition using AJAX for the user interface and Support Vector Machines for the training algorithm.

Looking at the Javascript code, I was surprised to see that, unlike my web canvas, it doesn’t use the <canvas> tag! It simply uses a combination of Javascript and CSS. Even though it has a few quirks, the interface is quite responsive.

The recognition process itself happens on the server side but thanks to the use of AJAX, the results are displayed very smoothly, without the need to refresh the page.

Taku Kudo, the author, explains in the page that he’s using the handwriting data from Tomoe. However, since Tomoe uses a template-based algorithm, it only has one handwriting sample per character. I’m impressed that Taku Kudo can train his system with only one sample per character. Overall, the accuracy is not very impressive but I think it could improve a lot with more training samples. That’s why my handwriting database project is going to be very useful. I’ve been wanting to try SVM in addition to HMMs, so the fact that this project uses SVM confirmed my interest in it.

Taku Kudo’s page has neat stuff regarding Natural Language Processing and Machine Learning. He has published a lot of libraries as free software, including MeCab and TinySVM. If you like fancy stuff in AJAX, you’ll also like his Japanese Input Method.

Interactive whiteboard with a Wiimote

September 9th, 2008

If you haven’t seen all the cool stuff that Johnny Chung Lee is doing with a Wiimote, you should definitely take a look at his projects. You’ll see that Johnny is not only a good engineer but he’s also very good at making his projects attractive.

He has a video on how to make a cheap interactive whiteboard or how to turn your laptop into a tablet pc, with a Wiimote. This could be pretty handy for using Chinese character recognition!

Character encoding detection

August 17th, 2008

Two years ago, I wrote about a port to Ruby of Universal Encoding Detector, which is itself a port to Python of Mozilla’s character encoding detection algorithm.

Having recently become interested in Machine Learning, I read about naive Bayes classifiers. I then remembered the encoding detector program and thought that naive Bayes classifiers would be a good candidate for this kind of problem. Going back to the Universal Encoding Detector’s home page, I found a link to:

A composite approach to language/encoding detection

This is an interesting read. Here’s a summary. The algorithm used is a composite approach of 3 different methods:

Coding scheme method

A state machine is used to identify illegal code points (in which case the encoding can be removed from the search space) or code points that exist in only one encoding (in which case the encoding is identified immediately). This method works well for some multi-byte encodings, though not all of them, and it does not work for single-byte encodings because code points that are unused in one encoding are very seldom used in other encodings.
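
The elimination step can be approximated in a few lines of Python. This is only a sketch of the idea, not the paper's state machine: instead of hand-written state tables, it relies on Python's built-in codecs, which raise UnicodeDecodeError on an illegal byte sequence. The function name plausible_encodings and the candidate list are chosen for the example.

```python
def plausible_encodings(data, candidates):
    """Return the candidate encodings under which data is a legal byte sequence."""
    remaining = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on illegal sequences
        except UnicodeDecodeError:
            continue  # illegal sequence: drop this encoding from the search space
        remaining.append(name)
    return remaining


candidates = ["utf-8", "shift_jis", "euc_jp", "latin-1"]
sample = "手書き文字認識".encode("utf-8")
print(plausible_encodings(sample, candidates))
```

Note that latin-1 can never be eliminated this way, since every byte value is legal in it. This is the weakness for single-byte encodings described above.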

Character distribution method

For languages with many different characters like Chinese, Japanese and Korean, use the Character distribution. The paper has interesting figures about this.

Number of Most Frequent Characters    Accumulated Percentage
10                                    0.11723
64                                    0.31983
128                                   0.45298
256                                   0.61872
512                                   0.79135
1024                                  0.92260
2048                                  0.98505
4096                                  0.99929
6763                                  1.00000

Character distribution in Simplified Chinese

Number of Most Frequent Characters    Accumulated Percentage
10                                    0.27098
64                                    0.66722
128                                   0.77094
256                                   0.85710
512                                   0.92635
1024                                  0.97130
2048                                  0.99431
4096                                  0.99981
                                      1.00000

Character distribution in Japanese

In other words, even though Chinese and Japanese include literally thousands of characters, the 512 most frequent characters represent 79% and 93% of the whole character distribution, respectively. This is extremely interesting! It means that we can use the frequency of occurrence of code points to detect encodings.

Although a naive Bayes classifier could have been used for this, a much simpler approach was taken. Characters were divided into two categories: frequent and non-frequent. A character is considered frequent if it is among the 512 most frequent characters. For a given file whose encoding is unknown, we define a distribution ratio r_file:

n_freq = number of frequent characters in the file
n_total = total number of characters in the file
r_file = n_freq / (n_total - n_freq)

We compare r_file with the average distribution ratio for a given encoding, r_encoding. This comparison, in the form of the ratio of the two distribution ratios, gives us our confidence level.

confidence_level = r_file / r_encoding

We calculate the confidence level for each encoding by replacing r_encoding with the corresponding value and the highest confidence level gives us the most likely encoding.

This technique requires a training set, that is, files for which we know the encoding. First of all this allows us to find the character frequency for each character. Second of all, it allows us to determine the average distribution ratio r_encoding for each encoding.

n_freq = number of frequent characters in the files from the training set with a given encoding
n_total = total number of characters in the same files
r_encoding = n_freq / (n_total - n_freq)
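
The formulas above can be sketched in Python as follows. This is a toy illustration rather than the detector's actual code: it works on text that has already been decoded under a candidate encoding, the function names are made up, and the example uses the 2 most frequent characters instead of 512 so that the numbers can be checked by hand.

```python
from collections import Counter


def most_frequent(texts, size=512):
    """Set of the `size` most frequent characters in the training texts."""
    counts = Counter(ch for text in texts for ch in text)
    return {ch for ch, _ in counts.most_common(size)}


def distribution_ratio(text, frequent):
    """n_freq / (n_total - n_freq), as defined above."""
    n_total = len(text)
    n_freq = sum(1 for ch in text if ch in frequent)
    rest = n_total - n_freq
    if rest == 0:
        return float("inf")  # the text contains frequent characters only
    return n_freq / rest


def confidence_level(text, frequent, r_encoding):
    """r_file / r_encoding for one candidate encoding."""
    return distribution_ratio(text, frequent) / r_encoding


# Toy training set standing in for files of known encoding.
training = ["aaabbbc", "aabbcd"]
frequent = most_frequent(training, size=2)  # 'a' and 'b'
r_encoding = distribution_ratio("".join(training), frequent)  # 10 / 3

# Text decoded from the file under test: 3 frequent characters, 1 other.
print(confidence_level("aabc", frequent, r_encoding))
```

In a real detector, this would be repeated for each candidate encoding, each with its own frequent-character set and r_encoding, and the encoding with the highest confidence level would be kept.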

2-character sequence distribution method

For character sets with few different characters such as in western languages, a simple character distribution is not enough to discriminate among encodings so a distribution of sequences of 2 characters is used instead.
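
To give an idea of what a 2-character sequence distribution looks like, here is a small Python sketch. The summary above does not say how the paper scores these sequences, so the "share of frequent pairs" measure used here is an assumption made for illustration, and the function names are hypothetical.

```python
from collections import Counter


def pair_counts(text):
    """Count every sequence of 2 adjacent characters in text."""
    return Counter(zip(text, text[1:]))


def frequent_pair_ratio(text, frequent_pairs):
    """Share of the 2-character sequences of text that are in frequent_pairs."""
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in frequent_pairs) / len(pairs)


# Toy training text; a real model would be built from a large corpus.
training = "the theme then"
frequent_pairs = {pair for pair, _ in pair_counts(training).most_common(2)}

print(frequent_pair_ratio("the", frequent_pairs))
print(frequent_pair_ratio("xyz", frequent_pairs))
```

Text decoded with the right encoding should contain many pairs that are frequent in the language, while text decoded with the wrong encoding should contain few.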

Accuracy

100% accuracy was achieved when the detector was applied to the home pages of 100 popular international web sites, according to the paper.

Web Canvas

August 1st, 2008

In my last post, I was calling for contributors to write a web canvas using the <canvas> tag. In case you don’t know it, <canvas> is a new tag specified in HTML5 that lets you draw using a Javascript API. It is already supported in Firefox, Opera and Safari, and it is supported in Internet Explorer through a third-party Javascript library.

Since nobody responded to my call (sic), I decided to tackle it myself. It turned out to be a nice little project. The canvas Javascript API is very similar to the cairo API, so it was easy to use. My Javascript also improved a lot along the way. So far the web canvas supports drawing, import (JSON), export (XML), saving as an image and replay (stroke by stroke animation).

You can try it by using the online DEMO.

What can it be useful for?

- I’m planning to use it for the handwriting database website that I wrote about some time ago. While it will be possible to contribute your handwriting using a pygtk client (Desktop or Maemo), you will also be able to contribute your handwriting using your browser directly. Not having to install any program should help increase the number of people contributing their handwriting.

- A second way of using it would be to do handwriting recognition directly in the browser. For example, one could install Tomoe (or my recognizer when it’s ready ;-)) on the server side and the web canvas on the client side. Since Tomoe has Python and Ruby bindings, this is fairly easy!

You can reuse the web canvas for your own projects if you like, but I would appreciate it if you sent me any improvements you make. In particular, the web canvas has a bug under Internet Explorer that I couldn’t figure out…

Source code (GPL) : gitweb

Handwriting renderers

July 13th, 2008

Canvas

If you didn’t read my previous post, in short, project Tegaki is a framework for handwritten Chinese character recognition (HCCR) written in Python. It includes reusable components and serves as a place for experimentation. The goal is to create the next-generation open-source HCCR software but it may be useful for academic researchers as well.

One reusable component is the Canvas. This is the user interface component that lets users draw characters. In addition, the Canvas supports “replaying” the character (stroke by stroke animation) and setting a background model (to help users draw an unknown character). It is multi-platform.

[Image: example of a character drawn using the Canvas provided by libtegaki-gtk]

The Canvas has a get_writing() method. It returns the Writing object for the handwriting currently displayed in the Canvas.

XML representation

The Writing object supports reading from and writing to an XML file. The XML file can optionally be compressed using gzip or bz2. On my hard drive, I have a small set of handwriting samples, and 500 characters take about 10 MB. That’s why compression is very useful.

The XML representation of a handwriting sample looks like this:

<character>
  <utf8>無</utf8>
  <strokes>
    <stroke>
      <point x="306" y="163" timestamp="0" />
      <point x="303" y="163" timestamp="21" />
      <point x="303" y="166" timestamp="29" />
      [...]
    </stroke>
    <stroke>
      <point x="266" y="240" timestamp="912" />
      <point x="270" y="240" timestamp="917" />
      <point x="273" y="240" timestamp="925" />
      [...]
    </stroke>
    [...]
  </strokes>
</character>
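
To show how little is needed to read this format, here is a minimal sketch that parses it with Python's standard xml.etree.ElementTree module. It does not use the Writing class from libtegaki. The function name parse_character is hypothetical, and the sample is an abridged copy of the XML above with the elided points left out.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample above (elided points omitted).
SAMPLE = """<character>
  <utf8>無</utf8>
  <strokes>
    <stroke>
      <point x="306" y="163" timestamp="0" />
      <point x="303" y="163" timestamp="21" />
      <point x="303" y="166" timestamp="29" />
    </stroke>
    <stroke>
      <point x="266" y="240" timestamp="912" />
      <point x="270" y="240" timestamp="917" />
      <point x="273" y="240" timestamp="925" />
    </stroke>
  </strokes>
</character>"""


def parse_character(xml_text):
    """Return (label, strokes); each stroke is a list of (x, y, timestamp) tuples."""
    root = ET.fromstring(xml_text)
    label = root.findtext("utf8")
    strokes = []
    for stroke in root.iter("stroke"):
        points = [
            (int(p.get("x")), int(p.get("y")), int(p.get("timestamp")))
            for p in stroke.findall("point")
        ]
        strokes.append(points)
    return label, strokes


label, strokes = parse_character(SAMPLE)
print(label, len(strokes), strokes[0][0])
```

A gzip-compressed file could be handled the same way by reading its text with gzip.open(path, "rt", encoding="utf-8") from the standard library before parsing.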

Renderers

I’ve recently added support for what I named “renderers”. They take a Writing object as parameter and generate a visual representation of it. Since I used the cairo graphics library as drawing backend, the representation can be saved to PNG, SVG and PDF! Those renderers will be very useful for the handwriting database website that I wrote about in my previous post!

Complete character renderer

[Image: complete rendering of a kanji]

Stroke order renderer

[Image: stroke order with each single stroke]

[Image: stroke order with stroke groups]

Strokes can be grouped together when the stroke order is obvious. However, this requires knowing which strokes to combine. A dictionary must be created for that. An example entry would be:

駅 1,1,3,1,4,2,2
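
Here is a sketch of how such an entry could be used. It assumes that each number is the size of a group of consecutive strokes, which is consistent with the 14 strokes of 駅 (1+1+3+1+4+2+2 = 14). The function names are hypothetical, and plain stroke numbers stand in for real stroke objects.

```python
def parse_entry(line):
    """Split a dictionary entry into the character and its list of group sizes."""
    char, sizes = line.split()
    return char, [int(size) for size in sizes.split(",")]


def group_strokes(strokes, sizes):
    """Split strokes into consecutive groups of the given sizes."""
    if sum(sizes) != len(strokes):
        raise ValueError("group sizes do not match the number of strokes")
    groups, start = [], 0
    for size in sizes:
        groups.append(strokes[start:start + size])
        start += size
    return groups


char, sizes = parse_entry("駅 1,1,3,1,4,2,2")
strokes = list(range(1, 15))  # stroke numbers 1 to 14 stand in for stroke data
groups = group_strokes(strokes, sizes)
print(char, groups)
```

Each group would then be drawn as one step of the stroke order rendering.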

<canvas> HTML tag

The canvas I was writing about above is written in pygtk and is intended to be used for the Desktop or for Maemo. However, in the case of the handwriting database website, since we want as many people to contribute their handwriting as possible, it would be nice to not require any particular installation. For that, a canvas directly in the browser would be the ideal solution.

One solution would be to use Flash but I would prefer to use the <canvas> tag. It can be used in combination with Javascript to do drawing in the browser. It is supported natively by Firefox, Opera and Safari. It is supported in Internet Explorer through a third-party Javascript called ExplorerCanvas.

I am looking for a contributor to create a new canvas using this technology. The canvas should support drawing, displaying existing handwriting and replay (stroke by stroke animation).

For more information:

GIF stroke animation

Even though GIF uses a patented compression, GIF is still the only format with support for animations and wide support in the browsers. Therefore it would be very cool to be able to generate GIF stroke animations from a writing object.

I had a look at python-imagemagick and Python Imaging Library (PIL) but they both seem to have very limited support for GIF animations. So I’m thinking of writing my own library for GIF generation in Python. Byzanz, a tool for creating screencasts as GIF animations, can be used as inspiration because it includes a GIF encoder. It also supports color quantization (using octrees) and dithering. From what I see, it should take less than 1000 lines of Python code.

I read a little bit about color quantization and found it very interesting. Here’s a short explanation for those who don’t know about it. Basically, each pixel in an image may have three components: red, green and blue. For a 400×400 picture with one byte per component, this is 400 * 400 * 3 = 480,000 bytes (480 KB). To save space, the idea is to store colors in a palette (a table index => color). Each pixel then only needs to refer to an index in the palette instead of defining the three components. For a 256-color palette, this saves two bytes per pixel. However, since we now use only 256 colors instead of 256 * 256 * 256 = 16,777,216 colors, there’s a loss of color precision. The challenge is thus to choose the palette colors that give the smallest possible precision loss. For example, we may want to put in the palette colors that are the closest to the most frequently used colors. This is a 3-dimensional clustering problem, so it reminded me of Machine Learning, a topic in which I’ve been very interested recently.
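
As a rough sketch of the palette idea, the code below uses the simplest strategy mentioned above: the palette is filled with the most frequently used colors and every pixel is mapped to the closest palette entry. This is sometimes called the popularity method. It is not the octree approach used by Byzanz, and the function names are chosen for this example.

```python
from collections import Counter


def build_palette(pixels, size=256):
    """Palette made of the `size` most frequently used colors."""
    return [color for color, _ in Counter(pixels).most_common(size)]


def nearest_index(color, palette):
    """Index of the palette entry closest to color (squared RGB distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(color, palette[i])),
    )


def quantize(pixels, size=256):
    """Return (palette, indices): one palette index per pixel."""
    palette = build_palette(pixels, size)
    cache = {}  # avoids recomputing the nearest entry for repeated colors
    indices = []
    for pixel in pixels:
        if pixel not in cache:
            cache[pixel] = nearest_index(pixel, palette)
        indices.append(cache[pixel])
    return palette, indices


# Tiny "image": mostly red and blue, plus one slightly different red.
pixels = [(255, 0, 0)] * 5 + [(0, 0, 255)] * 3 + [(250, 5, 5)]
palette, indices = quantize(pixels, size=2)
print(palette)
print(indices)
```

The off-red pixel is stored as the index of pure red, which is exactly the precision loss discussed above.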

For more information, I recommend reading these Wikipedia articles:

A roadmap for project Tegaki

July 4th, 2008

Codename Project Tegaki

I wrote in a previous post about my first experiment with applying a modern technique, namely Hidden Markov Models, to handwritten Chinese character recognition. I’m quite motivated to make this more than a single isolated experiment, so I decided to give the project a name. I named it Project Tegaki. This is going to be the codename for the effort from now on. Tegaki means Handwriting in Japanese.

Project statement

The aim of Project Tegaki is to push forward the creation of the next-generation open-source handwritten Chinese character recognition (HCCR) software.

Currently, the only open-source package for HCCR is Tomoe. This is a project that I have been contributing to and that I used for my Google Summer of Code project, “Japanese/Chinese handwriting recognition on maemo”. Maemo is the open-source platform used by Nokia PDAs. I have decided to start Project Tegaki as an external effort because I considered that Tomoe would not be a good environment to welcome the effort. However, if the Tomoe community is ready to help me in this effort, I will be happy to merge Project Tegaki back into Tomoe once Project Tegaki becomes ready for prime-time.


Handwritten Chinese character recognition in a PDA…

Here are some goals for the project:

- Free and open-source. The goal is to produce the next-generation free and open-source HCCR software.

- Modern. The software should use modern approaches to Handwriting recognition and be in tight connection with research.

- Embedded. The project must be designed to work with devices with restricted resources such as cell phones or PDAs.

- Online, as opposed to offline. In online recognition, characters are drawn using a device, typically a mouse, a tablet or a PDA stylus. In this setting, characters can be represented as sequences of points. In offline recognition, characters are scanned a posteriori. In this setting, characters are represented as images (width * height pixels).

- Isolated Chinese character recognition. Here “Chinese character” is not restricted to the Chinese language, since Japanese kanji are also Chinese characters! Even though the package should theoretically be generalizable to any kind of character, Chinese characters pose some specific challenges, and some approaches that give good results for Chinese characters may not give good results with other kinds of characters, due to the unique properties of Chinese characters. “Isolated character recognition” means that the user will have to draw one character at a time in a separate box, as opposed to continuous handwriting recognition. This makes things much easier and in the case of Chinese characters, this is a reasonable limitation.

- Stroke order dependent and independent. Both situations have useful applications so Project Tegaki should ideally support both.

Python?

Usually I’m more of a Ruby fan but the project was started in Python due to dependencies on third-party libraries that only exist in Python. Even though I’m slowly getting away from those dependencies, I don’t want to re-implement everything just for the sake of using Ruby. So I’m sticking with Python.

As emphasized above, this project is highly experimental. Moreover, a collaborative website will be created (see below) and it will reuse a number of existing components. It thus makes sense to use a high-level language to focus on the experiments and to create the website.

Subprojects

Project Tegaki is now split into several subprojects.

libtegaki

This Python library contains functionality that will be useful to other subprojects. This includes array manipulation, character input/output, a Viterbi decoder…

libtegaki-gtk

This Python library contains user interface elements that will be useful to other subprojects. So far it only includes a Canvas, which can be used to draw characters. It is a replacement for TomoeCanvas with some additional benefits:

- Truly reusable. TomoeCanvas assumes that a recognizer is connected to the canvas. However, there are situations when a recognizer is not needed.

- Resizable. TomoeCanvas cannot be resized at will.

- Animation. A stroke animation of a character can be displayed.

- Background character. A background character can be set as a model and animations will be displayed to help draw the same character stroke by stroke.

- Features other than (x,y) coordinates are supported such as pen pressure and pen inclination when available, stroke duration, point timestamp.

libtegaki-gtk is written in pygtk and depends on libtegaki.

tegaki-db

The most successful handwriting recognition systems nowadays use a “learn by example” philosophy. For each character supported, several samples of the handwritten character must be provided to the system in order to learn from them. Because those samples are used to train the system, they are called “training samples”. The challenge for the final recognizer is to be able to recognize unseen handwritten instances of the same characters. This is the ability of the recognizer to “generalize” the acquired knowledge.

A “training corpus” is a set of training samples. A good corpus should contain dozens of handwritten samples for each character. The corpus should be representative enough of all handwriting styles. Collecting all the handwriting samples and designing a good corpus is a huge task for Chinese characters because there exist thousands of them!

Such handwritten Chinese character databases do exist but they require a fee and are usually restricted to academic research. They are by no means suitable for free software. The goal of the tegaki-db subproject is to create a collaborative web platform to collect handwriting samples. Native speakers and learners alike will be able to log in and contribute their own handwriting. The collected data will be published under a free license so that it can benefit academic research as well. The tegaki-db will use a client / server architecture.

tegaki-db-client

tegaki-db-client is a client for people to input their handwriting. It will be written in Python and use the canvas provided by libtegaki-gtk. The client will communicate with the server through web services. The client should be distributed for several platforms such as Linux, Windows and Maemo to increase the number of potential contributors. A detailed specification of tegaki-db and tegaki-db-client will be provided later in a separate post.

tegaki-models

tegaki-models is by no means an end-user package and will only be used by developers. It is the placeholder for experimentation. Thanks to this package, model ideas will be tested and evaluated.

I continued to work on new model ideas… However, because my current training corpus is so small, it’s kind of pointless to spend too much time on models. The top priority now is to create tegaki-db.

tegaki-decoder

tegaki-decoder is going to be a high-performance decoder (recognizer). It should be a fast implementation of the Viterbi decoder. It will be written in C and designed to work with embedded systems. This is going to be the end-product that people will use. Once sufficient data have been collected, good models have been generated and the tegaki decoder is ready, then Project Tegaki will be ready for real use! Currently, implementing tegaki-decoder is not the top priority.
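
For readers who don’t know the Viterbi algorithm, here is a compact sketch in Python (the real tegaki-decoder is planned in C). It finds the most likely sequence of hidden states of an HMM for a sequence of discrete observations, working with log probabilities. The two-state model and its probabilities are made up for the example and have nothing to do with actual handwriting models.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_path, log_probability) for a discrete-observation HMM."""
    # best[t][s]: log probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {key: math.log(value) for key, value in table.items()}


# Toy model with two hidden states and three observation symbols.
states = ["A", "B"]
log_start = logs({"A": 0.6, "B": 0.4})
log_trans = {"A": logs({"A": 0.7, "B": 0.3}), "B": logs({"A": 0.4, "B": 0.6})}
log_emit = {
    "A": logs({"x": 0.5, "y": 0.4, "z": 0.1}),
    "B": logs({"x": 0.1, "y": 0.3, "z": 0.6}),
}

path, score = viterbi(["x", "y", "z"], states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

The cost grows with the number of observations times the square of the number of states, which is one reason why a fast implementation matters on embedded devices.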

Roadmap

- Launch tegaki-db and tegaki-db-client.
- Hope that the collaborative effort is successful and collect lots of handwriting samples from many different people.
- Create new models, especially stroke-based models.
- Implement tegaki-decoder.

If I continue to be the only one interested in this project, at this rate it will take from several months to a couple of years to achieve everything. That’s why I hope I can attract a few contributors.

Download

The work completed so far is still very experimental and thus targets potential contributors. If you want to test it with your own handwriting anyway, please see my previous post.

To download the source code, you can use

$ git clone http://www.mblondel.org/code/hwr.git

or

$ git pull

from the repository folder if you already have the repository on your computer.

The code can be browsed online using gitweb. By clicking the “snapshot” links you can get a complete copy of the source code at a given revision.

See my memo on git if you don’t know it yet.

I published my work under the GPL.