Entrée

June 11th, 2008

Japanese TV has many programs for learning English. Today, I was flipping through channels when I came across one of them. It was about ordering food in a restaurant, and I noticed that “entrée” means “main dish” in American English. This is an example of a misused French word.

In French, and I think in British English too, entrée means first course. The noun comes from the verb “entrer”, meaning “to enter”, so it should be quite obvious that the entrée is the course with which you “enter” the dinner. Japanese borrows the French word too: アントレ (an-to-re).

Speaking of which… Japanese is noteworthy for having many loanwords too. During Golden Week (GW), I went paragliding. In Japanese, the sport is called パラグライダー, which comes from “paraglider”. But in English, a paraglider is the wing, or the person flying it, not the sport itself.

So it’s funny that languages sometimes borrow words from other languages but get the concept wrong or make blatant mistakes…

First HMM experiment

May 25th, 2008

Today I’m publishing the initial results of my experiments on online handwriting recognition of Chinese characters, using Hidden Markov Models (HMM). You can see my post on Tomoe Evaluation for some background.

Download

$ git clone http://www.mblondel.org/code/hwr.git

The code can be browsed online using gitweb.

See my memo on git if you don’t know it yet. I published my work under the GPL.

Requirements

- Python (2.4)
- GHMM (SVN)
- Tomoe (SVN)
- Tomoe-GTK (SVN)

The Python bindings for the last three are also needed.

Folder structure

- data/ contains the raw training and evaluation data
- lib/ contains reusable components
- models/ contains model experiments
- tests/ contains test cases
- character-editor is the graphical interface to edit character data
- model-manager controls the training workflow, evaluation and the test pad

Each model must have an intelligible name and must define a file called model.py containing a class called Model. This class defines the behavior of the model. A model can inherit from other models in order to reuse common components. My first model is called “basic” so its file is models/basic/model.py.

First model

Here is some information regarding my first model.

- HMM unit: whole character
- Feature vectors: (deltax, deltay) with deltax = abs(x_t - x_{t-1}) and deltay = abs(y_t - y_{t-1})
- Number of states: 3 * number of strokes
- Initial state transitions: 0.5 to stay in the same state, 0.5 to jump to the next state
- Initial state alignment: feature vectors are segmented uniformly and segments are associated with their corresponding state
- Training: Baum-Welch
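
The initialization steps listed above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the repository: the function names are made up for the example, and letting the last state loop on itself with probability 1.0 is my assumption, since the list does not say what happens there.

```python
def delta_features(points):
    """Turn a list of (x, y) pen positions into (deltax, deltay) vectors."""
    return [(abs(x1 - x0), abs(y1 - y0))
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def left_to_right_transitions(n_states, stay=0.5):
    """Initial transitions: `stay` to remain in a state, the rest to move on."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            matrix[i][i] = stay
            matrix[i][i + 1] = 1.0 - stay
        else:
            # Assumption: the final state has nowhere to go, so it loops.
            matrix[i][i] = 1.0
    return matrix


def uniform_alignment(vectors, n_states):
    """Cut the vectors into n_states contiguous segments of near-equal size."""
    n = len(vectors)
    bounds = [i * n // n_states for i in range(n_states + 1)]
    return [vectors[bounds[i]:bounds[i + 1]] for i in range(n_states)]


# Hypothetical one-stroke character, hence 3 * 1 = 3 states.
points = [(0, 0), (3, 0), (3, 4), (1, 4), (1, 1), (5, 1), (5, 6)]
features = delta_features(points)
transitions = left_to_right_transitions(3)
segments = uniform_alignment(features, 3)
```

Each segment would then be used to estimate the initial emission parameters of its state, before Baum-Welch refines everything.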

If you don’t understand any of the above, you should read more about HMMs ;) I may write an introduction on this journal if I have some time.

Training workflow

model-manager’s usage is as follows:
./model-manager model-name command

My first model is named “basic” so you may replace model-name by “basic”. Possible commands include:

- fextract, for the feature vectors extraction
- init, for the model initialization
- train, for the training
- eval, for the evaluation
- pad, to test the model with your own handwriting

“all” is a command equivalent to fextract, init, train and eval.
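
To make the relationship between the commands concrete, here is a rough sketch of how a command could be dispatched to a model. It is hypothetical: the method names simply mirror the command names, and the real model-manager and Model class may be organised quite differently.

```python
# Hypothetical sketch, not the actual model-manager code.
PIPELINE = ["fextract", "init", "train", "eval"]
COMMANDS = PIPELINE + ["pad"]


def expand_command(command):
    """Return the list of steps to run for a command-line command."""
    if command == "all":
        return list(PIPELINE)
    if command in COMMANDS:
        return [command]
    raise ValueError("unknown command: %s" % command)


class Model(object):
    """Stand-in for the Model class that models/<name>/model.py defines."""

    def __init__(self):
        self.log = []

    def fextract(self):
        self.log.append("fextract")

    def init(self):
        self.log.append("init")

    def train(self):
        self.log.append("train")

    def eval(self):
        self.log.append("eval")

    def pad(self):
        self.log.append("pad")


def run(model, command):
    """Call the model method matching each step, in order."""
    for step in expand_command(command):
        getattr(model, step)()


model = Model()
run(model, "all")
```

A model that inherits from another one would only override the steps it wants to change.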

Testing with your own handwriting

First of all you should generate the HMMs with the following command:
./model-manager basic all

The process takes less than one minute on my computer. You may see a few warnings because of some issues in ghmm and tomoe. If all goes well, you should see the accuracy of the model.

From this point, normally, you could test the HMMs with your own handwriting with the following command:

./model-manager basic pad

However, for some strange reason, ghmm behaves incorrectly when the pygtk module is loaded. So the above command runs, but the recognition results will be incorrect. I need to contact the pygtk or ghmm mailing list about this obscure issue. For now, you can use the following command:

./model-manager pad | ./model-manager basic eval -s

The results are displayed on the console. The system supports the following 50 kanji only.

一 二 三 泣 漢 温 使 便 旅 族
水 氷 撃 女 安 北 化 忘 妄 近
集 育 坊 訪 防 妨 駅 福 副 神
版 坂 板 金 全 錬 練 業 習 央
決 代 反 想 歯 象 始 初 発 感

Pick a few of them and try them with your own handwriting ;-)! By the way, all training and evaluation data were written by mouse.

Evaluation

match1: 80.0%
match5: 96.0%
match10: 98.0%

始	1	始, 福, 駅, 錬, 漢
旅	1	旅, 族, 駅, 練, 副
妨	1	妨, 練, 錬, 板, 発
防	1	防, 訪, 旅, 族, 板
泣	1	泣, 温, 福, 練, 駅
副	1	副, 訪, 福, 撃, 初
福	1	福, 練, 錬, 副, 駅
坂	3	板, 駅, 坂, 族, 錬
代	1	代, 板, 漢, 使, 駅
反	1	反, 福, 副, 忘, 妄
撃	3	駅, 錬, 撃, 漢, 副
業	1	業, 練, 錬, 集, 駅
氷	2	駅, 氷, 水, 妨, 版
温	1	温, 福, 駅, 錬, 想
育	1	育, 練, 副, 駅, 福
神	2	練, 神, 福, 錬, 撃
近	1	近, 駅, 練, 漢, 福
化	1	化, 練, 駅, 便, 習
一	X
央	1	央, 決, 業, 駅, 発
族	1	族, 練, 旅, 錬, 副
安	4	妄, 駅, 福, 安, 族
象	1	象, 駅, 錬, 練, 集
歯	1	歯, 練, 錬, 駅, 副
錬	1	錬, 練, 集, 駅, 福
習	1	習, 錬, 福, 駅, 漢
使	1	使, 便, 漢, 錬, 練
訪	1	訪, 駅, 錬, 副, 板
漢	1	漢, 錬, 駅, 練, 業
全	1	全, 金, 集, 錬, 福
集	1	集, 練, 業, 錬, 福
版	1	版, 板, 錬, 駅, 集
水	2	氷, 水, 旅, 駅, 便
板	1	板, 族, 坂, 福, 駅
妄	1	妄, 駅, 福, 忘, 練
初	1	初, 駅, 旅, 練, 坂
想	1	想, 駅, 副, 錬, 集
発	1	発, 練, 福, 駅, 漢
練	1	練, 錬, 福, 駅, 板
北	1	北, 坂, 副, 駅, 板
決	1	決, 漢, 便, 練, 坂
坊	X
駅	1	駅, 錬, 練, 族, 福
金	1	金, 発, 練, 錬, 駅
女	5	駅, 妨, 妄, 板, 女
忘	1	忘, 族, 副, 福, 駅
二	1	二, 三, 忘, 歯, 習
感	1	感, 福, 駅, 族, 練
便	3	練, 駅, 便, 錬, 福
三	1	三, 忘, 副, 版, 訪
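
For reference, the match1 and match5 figures can be recomputed from the rank column of the table above. The sketch below assumes that “X” means the correct kanji was not among the listed candidates; match10 cannot be rederived this way because the table only shows five candidates per character.

```python
def match_rate(ranks, n):
    """Percentage of characters whose correct answer is in the first n candidates."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= n)
    return 100.0 * hits / len(ranks)


# Rank tallies read off the table above; None stands for "X".
ranks = [1] * 40 + [2] * 3 + [3] * 3 + [4] + [5] + [None] * 2

print("match1: %.1f%%" % match_rate(ranks, 1))
print("match5: %.1f%%" % match_rate(ranks, 5))
```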

The results are very promising and outperform Tomoe’s current recognizer. Incidentally, I used the same evaluation corpus for Tomoe and for my experiment. However, a few things must be emphasized:

- My experiment only supports 50 kanji while Tomoe supports thousands of them.
- The evaluation of my experiment is performed using kanji from the same people who wrote the kanji used for training. However, the kanji instances for training and evaluation are not the same.
- Using one whole-character HMM per kanji will almost certainly not perform well in terms of computation time with thousands of kanji. Usually, stroke or sub-stroke models are preferred.

Interestingly, my recognizer doesn’t do a good job at recognizing the simplest characters: 一 二 三.

Both Tomoe and my recognizer are sensitive to stroke order. However, my recognizer does not seem to be very sensitive to stroke count. For example, く in 女 is one stroke but it’s acceptable to write it in two strokes. However, if you write く after 一 and ノ, it doesn’t work.

Call for an online handwriting database

If you’re a researcher in handwriting recognition and read this, I’m looking for a handwriting database of Chinese characters (kanji or hanzi). Please contact me if you can help me.

What’s next?

- Try more sophisticated feature vectors
- Try more sophisticated initial state alignment
- Try stroke and sub-stroke HMMs
- Collect more data
- Try techniques other than HMMs

Git memo

May 25th, 2008

I’m planning to use git, a popular new version control system, for all the developments that I don’t want to publish on a forge (like Sourceforge). Because it’s distributed, it’s possible to perform commits offline. This solves my nightmare of making modifications that break the project and spending hours to track down the problem. So far I like git very much. Here’s a memo of commands and useful information.

Tomoe Evaluation

May 25th, 2008

Last December, I started a one-year internship at Asahi Kasei, in their Atsugi-based speech recognition group. Although I have been doing a good deal of software development, I have also been able to study Hidden Markov Models (HMM) and statistics. It turns out that, hehe, I like it!

One year ago, I started to contribute to Tomoe, as part of my participation in the Google Summer of Code. This experience raised my interest in handwriting recognition, especially of Chinese characters. When I studied Hidden Markov Models, I always kept handwriting recognition in mind. “How would I do this? How would I do that?”. This helped me raise more questions and gain a better understanding.

One thing that I learned during this internship is the notion of corpus (plural: corpora), more precisely training and evaluation corpus. Three months ago I started my experiment project with Chinese character handwriting recognition. The first thing I had to do was to create corpora. I reused the canvas provided in Tomoe to create a character editor. The user draws the character and it is saved to a file in XML format.

Together with a Japanese friend, we selected 50 kanji. Some simple, some complex. Some completely different, some very similar to others. We each wrote 5 instances of each kanji, giving 10 instances per kanji. 8 of them were intended for the training corpus. These data are used to teach the system how to recognize kanji. The other 2 were intended for evaluation. These data are used to estimate how well the system performs. The performance is described in terms of accuracy or error rate. The evaluation makes it possible to measure improvements when one recognizer is tuned, or to compare how well two recognizers perform, provided that the evaluation corpus is well designed (large and representative enough).

Well, Tomoe doesn’t use statistical learning yet so I didn’t use the training corpus for it. However, the next thing I did after collecting data was to use the evaluation corpus in order to evaluate Tomoe’s performance. At the time of the Google Summer of Code, I didn’t have this idea, although it now seems obvious to me. Verdict:

1st match: 61.0%
5 firsts: 74.0%
10 firsts: 74.0%

This means that 61% of characters are recognized as the first match and 74% are recognized within the first 5 or 10 results. Even considering the first 10 matches, which is acceptable, the error rate is still 26%, which is pretty high. Here’s a more detailed view of the results. Interestingly, we can see that kanji with the same radical or shape are often in the candidate list.

駅      X
妨      1       妨, 姨, 姙, 枋, 枕
坊      1       坊, 垓, 坑, 拡, 択
発      X       癸, 廢
歯      X
全      1       全, 舎, 舍, 早, 果
金      X       昂, 氤, 釘, 覇
板      X       被
忘      1       忘, 忌, 志, 芯, 忠
女      1       女, 冊, 木, 仄, 攵
族      X       楾
始      1       始, 姶, 恰, 娯, 娃
錬      1       錬, 顕, 鍜, 鰊
集      1       集, 賃, 寔, 夐, 募
旅      X
坂      4       扼, 拔, 城, 坂, 披
訪      X       詫, 誇, 就, 駱
水      3       氷, 丞, 水, 妃, 羽
三      1       三, 工, 弖, 王, 玉
想      X       慧, 慂
神      1       神, 裡, 祝, 殉, 術
副      1       副, 歇, 飮, 尠, 飩
安      1       安, 宋, 宏, 免, 案
泣      1       泣, 注, 浜, 淳, 泡
二      1       二, 井, 云, 元, こ
感      1       感
代      1       代, 伐, 陀, 弛, 池
撃      1       撃, 磬
温      1       温, 溜, 塩, 溘, 溝
漢      1       漢, 灘, 嘱
一      1       一, 廾, 弋, 十, 七
象      X       豫
育      1       育, 昌, 匿, 高, 香
氷      2       妁, 氷, 承, 冰, 灰
反      1       反, 皮, 尻, 阪, 伎
業      1       業, 箕, 篇, 賓, 霄
防      1       防, 枋, 枕, 偽, 隧
妄      X       気
初      X
決      X       泥, 沫, 泯, 沸, 泱
央      X       史, 决, 吏, 向, 岔
習      1       習, 跫, 笥, 筥, 筍
練      X       踝, 踴, 閥, 諌, 錬
近      X
化      1       化, 价, 仙, 他, 伊
福      1       福, 熕, 褌, 複, 磆
北      1       北, 把, 地, 托, 叱
便      1       便, 峺, 悗, 栲, 僊
版      X       放, 施, 倣, 昨, 站
使      1       使, 俚, 便, 候, 俾

Three months ago, I started my experiment project by collecting kanji data. I then worked on the project for an hour or two from time to time. I obtained my first results earlier this week. I was extremely happy to see results at last. It was difficult to stay on track because sometimes I didn’t work on the project for days or weeks. My initial results outperform the current Tomoe recognizer, with some limitations that I will discuss later. I will publish my work and give more details about it in another post.

Some news and pictures of Japan

May 25th, 2008

I’ve been living in Japan for 6 months already. I must say, wow, time flies very very fast. Since I’m abroad, there’s always something new to do or to see. I don’t have much time for myself or personal projects. But I have (very) slowly been working on some experiments on handwritten Chinese character recognition. Stay tuned.

I will graduate from my engineering school in June (masters) so lately I have been thinking about what I want to do in the future. As I said, time goes very fast. It’s thus very important to find fulfillment during the 8 hours that I spend at work, whatever it may be. So far, I’m pretty sure that I don’t want to work as a software developer for a private company. I want to do something more challenging and in connection with research. Most importantly, I want to LEARN new things every day. Options I’m considering are creating my own company, becoming a research engineer in a private lab or starting a PhD. I’m glad because I think I have narrowed down the fields I would like to work in: Machine Learning and Pattern Recognition, especially handwriting recognition, natural language processing (NLP) and speech recognition.

I uploaded some of my pictures of Japan. I still have many others that I need to sort out but if you are interested, take a look.

Moms

March 3rd, 2008

Read on twitter.com

Twitter is a service for friends, family, and co–workers to communicate and stay connected through the exchange of quick, frequent answers to one simple question: What are you doing?

Why? Because even basic updates are meaningful to family members, friends, or colleagues—especially when they’re timely.

* Eating soup? Research shows that moms want to know.
[…]

Haha. So true…

Fantasdic 1.0-beta5

January 6th, 2008

Just a short note to say that I finally took the time to release Fantasdic 1.0-beta5. The release has been ready since the middle of November! If you were using Fantasdic 1.0-beta4, I strongly recommend that you upgrade. Go see the Fantasdic website!

Already time to set off

December 2nd, 2007

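I haven’t found the time to write about this lately, but I’m already leaving for Japan tomorrow. Wish me luck! For now, let me tell you a little about what I’m going to do there. My classes finally ended this week, I have taken all my exams, and my internship starts next week. An internship is always a good opportunity to go abroad, and I wanted to go to Japan. I didn’t know the company I’m going to, but apparently it is quite large and well known in Japan. It’s Asahi Kasei. Of course, Asahi Kasei mainly does chemistry-related things, but I will be joining the speech recognition and speech synthesis team in Atsugi (Kanagawa). It’s for one year. I’m glad that it is related to what I studied this year (signal processing) and to my hobby (programming).

I went to Japan for a month two years ago, but I didn’t travel much, so this time, if I have the time, I’d like to visit all sorts of places. Another of my plans is to study Japanese as hard as I can. The pronunciation is beautiful, and both the kanji and the grammar are very interesting… Going to Japan to improve my Japanese is definitely one of the reasons I’m happy.

Well, anyway, it looks like it’s going to be a really fun experience.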

Ruby.new

November 22nd, 2007

I gave a talk on Ruby to TELECOM Lille 1’s LUG last Thursday. It was entitled “Ruby.new” and lasted 30 minutes. It was an introduction to Ruby without being a tutorial. I emphasized the cool features that set Ruby apart. Apparently, it was pretty well received and there were about 15 people attending it.

You can download my slides in ODP and PDF (French).

Multiple dictionary sources in Fantasdic

November 1st, 2007

Over the past few weeks, I have slowly but surely been adding support for multiple dictionary sources to Fantasdic. Until recently, Fantasdic had been a DICT client only, that is, Fantasdic connected to DICT servers (as configured by the user in the settings) in order to retrieve definitions. I thought it would always be like that, and I had even objected to changing that in gnome-dictionary, but I’ve finally changed my mind. As I said some time ago, a great deal of Fantasdic’s source code is only user interface source code. If making a dictionary application means spending so much time on user interface, it’s best to make it general-purpose…

Currently, Fantasdic includes two new kinds of source, in addition to DICT servers:

- Google Translate
- EDICT files

Basically, it works like a plugin system. Source plugins can either be distributed and installed with Fantasdic or installed manually in $HOME/.fantasdic/sources/ for third-party plugins. Writing a new source plugin is merely a matter of extending a base class and implementing a few required methods. Plugins are written in Ruby.

Hopefully, the user interface has remained as simple as it was.

Fantasdic screenshot
Fantasdic searching in an EDICT file. EDICT is a famous dictionary format for anyone learning Japanese.

Some sources may require additional fields to be configured by the user. For example, the DICT server source requires a server host and port. The EDICT file source requires a file path to be specified. The user interface for those additional fields is defined directly in the source plugins.

Fantasdic screenshot

Fantasdic screenshot
For this source, a file must be selected…

Fantasdic screenshot
With the Google Translate source, you need to select your languages for the translations.

Fantasdic screenshot
Fantasdic, using Google Translate.

I hope more and more sources can be added :) Ideally all source plugins should be cross-platform. Here are a few suggestions (of course, I’m counting on you to implement them ;-)):

- dictd file: search directly in files intended for the dictd server. See “man dictd” for a description of the format and tools/ in Fantasdic’s source code for some starting points.

- Stardict file. There’s a file describing the format in Stardict’s source code. Likewise, tools/ has a script to convert stardict files, which may be a good starting point.

- Stardict server. Stardict authors have created their own protocol and they’re running a server with quite a few dictionaries. See Stardict’s source code directly or use a packet sniffer.

- Epwing dictionaries. You’ll need to use rubyeb, the Ruby bindings to the excellent libeb.

- Wikipedia/Wiktionary. This source plugin would simply perform an HTTP request to the appropriate site. Greg Hewgill kindly agreed to share his code to clean up mediawiki syntax and make it more readable. I’m quoting an email he sent to me:

The current state of my code can be found at:
http://hewgill.com/viewvc/wiktiondict/trunk/

Feel free to use any of my code (or the algorithms therein) to format
mediwiki data. I imagine you already know this, but you can fetch the
raw output for individual pages using a url like this:
http://en.wiktionary.org/w/index.php?title=test&action=raw

In fact, you can also add &templates=expand to that url and mediawiki
does all the hard template work! I found the docs at:
http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php
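
As a small illustration of the URLs mentioned in the email, here is how the request address could be built. It is written in Python only to keep the example short (an actual Fantasdic source plugin would be in Ruby), and the function name and its parameters are made up for the example.

```python
from urllib.parse import urlencode


def raw_page_url(title, host="en.wiktionary.org", expand_templates=False):
    """Build the index.php URL that returns the raw wiki text of a page."""
    params = [("title", title), ("action", "raw")]
    if expand_templates:
        # Ask mediawiki to expand templates on the server side.
        params.append(("templates", "expand"))
    return "http://%s/w/index.php?%s" % (host, urlencode(params))


print(raw_page_url("test"))
print(raw_page_url("test", expand_templates=True))
```

The plugin would then fetch that URL and run the returned text through a mediawiki-syntax cleaner before displaying it.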

Waiting for your comments and your source plugins!