Archive for May, 2008

First HMM experiment

Sunday, May 25th, 2008

Today I’m publishing the initial results of my experiments on online handwriting recognition of Chinese characters, using Hidden Markov Models (HMM). You can see my post on Tomoe Evaluation for some background.

Download

$ git clone http://www.mblondel.org/code/hwr.git

The code can be browsed online using gitweb.

See my memo on git if you don’t know it yet. I published my work under the GPL license.

Requirements

- Python (2.4)
- GHMM (SVN)
- Tomoe (SVN)
- Tomoe-GTK (SVN)

The Python bindings for the last three are also needed.

Folder structure

- data/ contains the raw training and evaluation data
- lib/ contains reusable components
- models/ contains model experiments
- tests/ contains test cases
- character-editor is the graphical interface to edit character data
- model-manager controls the training workflow, evaluation and the test pad

Each model must have an intelligible name. Each model must define a file called model.py containing a class called Model. This class defines the behavior of the model. A model can inherit from other models in order to reuse common components. My first model is called “basic” so its file is models/basic/model.py.
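
As a rough illustration of this layout, a model file might look like the sketch below. Only the file name model.py and the class name Model come from the description above; the attribute n_states_per_stroke, the method n_states and the “finer” variant are hypothetical names chosen for the example, not the actual code from the repository.

```python
# Hypothetical sketch of models/basic/model.py. Only the file name and the
# class name "Model" come from the post; everything else is illustrative.


class Model(object):
    """Behaviour shared by models that inherit from this one."""

    name = "basic"
    # Assumption: the number of states per stroke is a class attribute,
    # so that a derived model can override it.
    n_states_per_stroke = 3

    def n_states(self, n_strokes):
        """Number of HMM states for a character with n_strokes strokes."""
        return self.n_states_per_stroke * n_strokes


class FinerModel(Model):
    """Hypothetical variant (e.g. models/finer/model.py) reusing the base."""

    name = "finer"
    n_states_per_stroke = 5
```

With this kind of structure, a new experiment only overrides what differs from the model it inherits from.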

First model

Here is some information regarding my first model.

- HMM unit: whole character
- Feature vectors: (deltax, deltay) with deltax = abs(x[t] - x[t-1]) and deltay = abs(y[t] - y[t-1])
- Number of states: 3 * number of strokes
- Initial state transitions: 0.5 to stay in the same state, 0.5 to jump to the next state
- Initial state alignment: feature vectors are segmented uniformly and segments are associated with their corresponding state
- Training: Baum-Welch

If you don’t understand any of the above, you should read more about HMMs ;) I may write an introduction on this journal if I have some time.
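
To make the settings listed above more concrete, here is a minimal sketch in plain Python. It is not the code from the repository: the function names are made up for the example, the pen trajectory is assumed to be a single list of (x, y) points, and the last state is assumed to keep all of its probability on its self-loop since it has no next state.

```python
def delta_features(points):
    """Turn a pen trajectory [(x, y), ...] into (deltax, deltay) vectors."""
    return [(abs(x1 - x0), abs(y1 - y0))
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def initial_transitions(n_states):
    """Left-to-right matrix: 0.5 to stay, 0.5 to jump to the next state."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            matrix[i][i] = 0.5
            matrix[i][i + 1] = 0.5
        else:
            # Assumption: the last state has no successor, so it keeps
            # all of its probability mass on the self-loop.
            matrix[i][i] = 1.0
    return matrix


def uniform_alignment(features, n_states):
    """Cut the feature sequence into n_states contiguous segments of
    near-equal length; segment i is associated with state i."""
    n = len(features)
    bounds = [i * n // n_states for i in range(n_states + 1)]
    return [features[bounds[i]:bounds[i + 1]] for i in range(n_states)]


# A one-stroke character, hence 3 * 1 = 3 states.
points = [(0, 0), (3, 4), (3, 1), (5, 1), (6, 3), (6, 6), (8, 6)]
features = delta_features(points)
segments = uniform_alignment(features, 3 * 1)
```

Each segment can then be used to compute the initial emission parameters of its state before Baum-Welch training refines them.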

Training workflow

model-manager’s usage is as follows:
./model-manager model-name command

My first model is named “basic” so you may replace model-name by “basic”. Possible commands include:

- fextract, for the feature vectors extraction
- init, for the model initialization
- train, for the training
- eval, for the evaluation
- pad, to test the model with your own handwriting

“all” is a command equivalent to fextract, init, train and eval.
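
The relation between “all” and the individual commands can be pictured with the following sketch. The names PIPELINE and expand_command are hypothetical and only illustrate the idea; they are not taken from model-manager itself.

```python
# Hypothetical illustration of how "all" could expand into the four steps.
PIPELINE = ["fextract", "init", "train", "eval"]


def expand_command(command):
    """Return the list of steps to run, in order, for a given command."""
    if command == "all":
        return list(PIPELINE)
    return [command]
```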

Testing with your own handwriting

First of all you should generate the HMMs with the following command:
./model-manager basic all

The process takes less than one minute on my computer. You may see a few warnings because of some issues in ghmm and tomoe. If all goes well, you should see the accuracy of the model.

From this point, normally, you could test the HMMs with your own handwriting with the following command:

./model-manager basic pad

However, for reasons I don’t understand yet, ghmm behaves incorrectly when the pygtk module is loaded. The above command runs, but the recognition results will be incorrect. I need to contact the pygtk or ghmm mailing list about this obscure issue. For now, you can use the following command:

./model-manager pad | ./model-manager basic eval -s

The results are displayed on the console. The system supports the following 50 kanji only.

一 二 三 泣 漢 温 使 便 旅 族
水 氷 撃 女 安 北 化 忘 妄 近
集 育 坊 訪 防 妨 駅 福 副 神
版 坂 板 金 全 錬 練 業 習 央
決 代 反 想 歯 象 始 初 発 感

Pick a few of them and try them with your own handwriting ;-)! By the way, all training and evaluation data were written with a mouse.

Evaluation

match1: 80.0%
match5: 96.0%
match10: 98.0%

始	1	始, 福, 駅, 錬, 漢
旅	1	旅, 族, 駅, 練, 副
妨	1	妨, 練, 錬, 板, 発
防	1	防, 訪, 旅, 族, 板
泣	1	泣, 温, 福, 練, 駅
副	1	副, 訪, 福, 撃, 初
福	1	福, 練, 錬, 副, 駅
坂	3	板, 駅, 坂, 族, 錬
代	1	代, 板, 漢, 使, 駅
反	1	反, 福, 副, 忘, 妄
撃	3	駅, 錬, 撃, 漢, 副
業	1	業, 練, 錬, 集, 駅
氷	2	駅, 氷, 水, 妨, 版
温	1	温, 福, 駅, 錬, 想
育	1	育, 練, 副, 駅, 福
神	2	練, 神, 福, 錬, 撃
近	1	近, 駅, 練, 漢, 福
化	1	化, 練, 駅, 便, 習
一	X
央	1	央, 決, 業, 駅, 発
族	1	族, 練, 旅, 錬, 副
安	4	妄, 駅, 福, 安, 族
象	1	象, 駅, 錬, 練, 集
歯	1	歯, 練, 錬, 駅, 副
錬	1	錬, 練, 集, 駅, 福
習	1	習, 錬, 福, 駅, 漢
使	1	使, 便, 漢, 錬, 練
訪	1	訪, 駅, 錬, 副, 板
漢	1	漢, 錬, 駅, 練, 業
全	1	全, 金, 集, 錬, 福
集	1	集, 練, 業, 錬, 福
版	1	版, 板, 錬, 駅, 集
水	2	氷, 水, 旅, 駅, 便
板	1	板, 族, 坂, 福, 駅
妄	1	妄, 駅, 福, 忘, 練
初	1	初, 駅, 旅, 練, 坂
想	1	想, 駅, 副, 錬, 集
発	1	発, 練, 福, 駅, 漢
練	1	練, 錬, 福, 駅, 板
北	1	北, 坂, 副, 駅, 板
決	1	決, 漢, 便, 練, 坂
坊	X
駅	1	駅, 錬, 練, 族, 福
金	1	金, 発, 練, 錬, 駅
女	5	駅, 妨, 妄, 板, 女
忘	1	忘, 族, 副, 福, 駅
二	1	二, 三, 忘, 歯, 習
感	1	感, 福, 駅, 族, 練
便	3	練, 駅, 便, 錬, 福
三	1	三, 忘, 副, 版, 訪
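
For reference, the match1 and match5 figures can be recomputed from the rank column of the table above with a few lines of Python. This is a sketch, not the evaluation code of the project, and match_rate is a name made up for the example. The table only lists the five best candidates, so the rows marked X are treated as misses here and match10 cannot be recomputed from it.

```python
def match_rate(ranks, n):
    """Fraction of samples whose correct character is within the top n.

    ranks holds 1-based ranks, or None when the correct character is not
    among the listed candidates.
    """
    hits = sum(1 for r in ranks if r is not None and r <= n)
    return hits / float(len(ranks))


# Ranks read off the table above: 40 first matches, 8 lower ranks, 2 "X".
ranks = [1] * 40 + [3, 3, 2, 2, 4, 2, 5, 3] + [None, None]
print(match_rate(ranks, 1))  # 0.8
print(match_rate(ranks, 5))  # 0.96
```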

The results are very promising and outperform Tomoe’s current recognizer. Incidentally, I used the same evaluation corpus for Tomoe and for my experiment. However, a few things must be emphasized:

- My experiment only supports 50 kanji while Tomoe supports thousands of them.
- The evaluation of my experiment is performed using kanji from the same people who wrote the kanji used for training. However, the kanji instances for training and evaluation are not the same.
- It’s almost certain that using whole-character HMMs will not scale well in terms of computation time with thousands of kanji, because every character model has to be scored for each input. Usually, stroke or sub-stroke models are preferred.

Interestingly, my recognizer doesn’t do a good job at recognizing the simplest characters: 一 二 三.

Both Tomoe and my recognizer are sensitive to stroke order. However, it seems that my recognizer is less sensitive to the number of strokes. For example, く in 女 is one stroke but it’s acceptable to write it in two strokes. However, if you write く after 一 and ノ, it doesn’t work.

Call for an online handwriting database

If you’re a researcher in handwriting recognition and read this, I’m looking for a handwriting database of Chinese characters (kanji or hanzi). Please contact me if you can help me.

What’s next?

- Try more sophisticated feature vectors
- Try more sophisticated initial state alignment
- Try stroke and sub-stroke HMMs
- Collect more data
- Try techniques other than HMMs

Git memo

Sunday, May 25th, 2008

Tell git who you are

$ git config --global user.name "FirstName LastName"
$ git config --global user.email "user@example.com"

You can run those commands without --global inside a repository.

Fancy colors

$ git config --global color.ui "auto"

Initialize a local repository

$ cd /path/to/project
$ git init

The repository is initially empty; you need to add the files and directories that you want to track.

$ git add file
$ git commit

The convention for commit messages is:

Short summary of changes.

Long description of changes… Blablablabla.

$ git rm file
$ git mv oldfile newfile

Cloning repositories

Local:

$ git clone /path/to/repo name

Git protocol:

$ git clone git://git.rubini.us/code name

SSH:

$ git clone ssh://myserver.com/var/git/myapp.git name

HTTP:

$ git clone http://yourserver.com/~you/proj.git name

To import new changes from the original repository:

$ git pull

To import changes from another copy of the repository:

$ git pull /path/to/repo

To apply your commits to the original repository (provided you have permissions):

$ git push

To create an alias:

$ git remote add aliasname /path/to/repo
$ git pull aliasname

pull is equivalent to fetch + merge

$ git fetch aliasname
$ git merge aliasname/master

Repository status and history

$ git status
$ git log [file|tag|range]

Use -p to show the log as patches.

$ git diff [file|tag|range]
$ git show [commit-id]

To show an old file:

$ git show v2.5:fs/locks.c

Revert changes

To revert committed changes:

$ git revert commit-id

To revert uncommitted changes:

$ git checkout -f

Creating patches

$ git diff [commit-id-before] [commit-id-after] > my.patch

format-patch is a command to create ready-to-send patches. Each commit is written to its own file.

To extract the three topmost commits from the current branch:

$ git format-patch -3

To extract patches from a range of commits:

$ git format-patch commit-id-start..commit-id-end

To extract patches since a commit:

$ git format-patch commit-id

Note: in the latter case, the given commit-id itself is excluded; only the commits after it are extracted.

Tags

List of tags:

$ git tag -l

Tag current work:

$ git tag v1.0

Tag a commit:

$ git tag name commit-id

And to get the commit-id from the name:

$ git rev-parse name

Switch to a tag:

$ git checkout 0.9

HEAD refers to the head of the current branch.

Working with branches

To see the local branches, including the current one:

$ git branch

To see the list of all branches, local and remote:

$ git branch -a

* denotes the current working branch.

For remote branches:

$ git branch -r

To create a new branch and switch to it:

$ git checkout -b branch-name

You may create a branch from a starting reference:

$ git checkout -b branch-name from-branch|commit-id|tag

To switch to another branch:

$ git checkout branch-name

It’s also possible to switch to a revision:

$ git checkout commit-id

Note: if some files differ between the current branch and the branch you check out and are currently open in your text editor, it will warn you that the files have changed. In that case, just reload them.

Working with remote branches

You cannot switch to a remote branch directly; you need to create a local branch which is a copy of the remote one:

$ git checkout -b my-branch origin/my-branch

To add another remote repository and fetch its branches:

$ git remote add linux-nfs git://linux-nfs.org/pub/nfs-2.6.git
$ git fetch linux-nfs

Finding regressions

To find a regression that happened between v2.6.18 and master:

$ git bisect start
$ git bisect good v2.6.18
$ git bisect bad master

git will try several revisions until it identifies the revision that caused the regression. You need to tell git if the regression occurs or not with:

$ git bisect bad

and

$ git bisect good

Once the revision is identified, use “git show” to examine it.

To return to the branch you were on:

$ git bisect reset

Compression and self-consistency check

$ git gc
$ git fsck

Git workflow

This is a typical workflow to contribute to a project using git.

Retrieve latest code from the remote repository:

$ git pull

Create a new branch to work safely on a new feature:

$ git checkout -b new_feature

Commit your changes:

$ git commit -a

Check if new code was added in the interim:

$ git checkout master
$ git pull

If anything was changed in the master branch:

$ git checkout new_feature
$ git rebase master

rebase sets your commits aside, updates the branch to the latest master, then reapplies your commits on top of it.

Warning! If you are sharing the branch with other people, do not rebase it, because rebase rewrites history. Use a merge instead:

$ git merge master

If there are conflicts applying your changes during the git rebase command, fix them and use the following to finish applying them:

$ git rebase --continue

Merge the changes in the master branch:

$ git checkout master
$ git merge new_feature

Push your changes to the remote repository if you have permission:

$ git push

Alternatively, submit patches or publish your own copy of the repository so that other people can pull changes from you.

Finally, you may want to delete the branch:

$ git branch -d new_feature

Sharing repositories with the world

SSH is used to push commits and HTTP to allow people to pull from the repository.

$ ssh you@yourserver.com
$ mkdir /home/you/public_html/proj.git
$ cd /home/you/public_html/proj.git
$ git --bare init
$ git --bare update-server-info
$ chmod a+x hooks/post-update
$ exit

Note: init is still called init-db in Debian Etch.

$ cd /path/to/local/repo
$ git remote add origin ssh://yourserver.com/home/you/public_html/proj.git
$ git push origin master

Next time you can omit the arguments:

$ git push

Now you can use:

$ git remote
$ git remote show origin

People can now clone your repository, either by SSH or HTTP.

$ git clone ssh://yourserver.com/home/you/public_html/proj.git
$ git clone http://yourserver.com/proj.git

gitweb

In Debian, gitweb installs to /usr/lib/cgi-bin/gitweb.cgi.

I added the line below to my VirtualHost:
ScriptAlias /gitweb /usr/lib/cgi-bin/gitweb.cgi

I also edited /etc/gitweb.conf to configure the git root as well as the git logo and CSS stylesheet.

Cloning a Subversion repository

$ git-svn clone http://svn.site.org/svnroot/projet -T trunk -b branches -t tags

To import new modifications from Subversion:

$ git-svn rebase

To apply commits to Subversion:

$ git-svn dcommit

References

http://rubinius.lighthouseapp.com/projects/5089/using-git

http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#public-repositories

http://www.kernel.org/pub/software/scm/git/docs/everyday.html

http://www.eleves.ens.fr/home/oudomphe/comp/ware/git.xhtml.fr

http://toolmantim.com/article/2007/12/5/setting_up_a_new_remote_git_repository

Tomoe Evaluation

Sunday, May 25th, 2008

Last December, I started a one-year internship at Asahi Kasei, in their Atsugi-based speech recognition group. Although I have been doing a good deal of software development, I have also been able to study Hidden Markov Models (HMM) and statistics. It turns out that, hehe, I like it!

One year ago, I started to contribute to Tomoe, as part of my participation in the Google Summer of Code. This experience raised my interest in handwriting recognition, especially of Chinese characters. While I studied Hidden Markov Models, I always kept handwriting recognition in mind: “How would I do this? How would I do that?” This helped me raise more questions and gain a better understanding.

One thing that I learned during this internship is the notion of corpus (plural: corpora), more precisely training and evaluation corpora. Three months ago I started my experimental project on Chinese character handwriting recognition. The first thing I had to do was to create corpora. I reused the canvas provided in Tomoe to create a character editor. The user draws the character and it is saved to a file in XML format.

Together with a Japanese friend, I selected 50 kanji: some simple, some complex, some completely different, some very similar to others. We each wrote 5 instances of each kanji, which gives 10 instances per kanji. 8 of them were intended for the training corpus; these data are used to teach the system how to recognize kanji. The remaining 2 were intended for the evaluation corpus; these data are used to estimate how well the system performs. Performance is described in terms of accuracy or error rate. Evaluation makes it possible to measure improvements when a recognizer is tuned, or to compare how well two recognizers perform, provided that the evaluation corpus is well designed (large and representative enough).
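
As an illustration, the split for one kanji could be done as in the sketch below. How the 8 training and 2 evaluation instances were actually chosen is not described here, so the sketch assumes that the last instance of each writer is held out for evaluation; split_corpus and the writer labels are hypothetical names.

```python
def split_corpus(instances, n_eval_per_writer=1):
    """Split the samples of ONE character into training and evaluation sets.

    instances maps a writer name to that writer's list of samples.
    Assumption: the last n_eval_per_writer samples of each writer are
    held out for evaluation.
    """
    training, evaluation = [], []
    for writer in sorted(instances):
        samples = instances[writer]
        cut = len(samples) - n_eval_per_writer
        training.extend(samples[:cut])
        evaluation.extend(samples[cut:])
    return training, evaluation


# Two writers, 5 instances each (sample names are placeholders).
instances = {
    "writer1": ["a1", "a2", "a3", "a4", "a5"],
    "writer2": ["b1", "b2", "b3", "b4", "b5"],
}
training, evaluation = split_corpus(instances)
```

The important property is that no sample appears in both sets, so the evaluation measures how the system handles characters it has not been trained on.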

Well, Tomoe doesn’t use statistical learning yet so I didn’t use the training corpus for it. However, the next thing I did after collecting data was to use the evaluation corpus in order to evaluate Tomoe’s performance. At the time of the Google Summer of Code, I didn’t have this idea, although it now seems obvious to me. Verdict:

1st match: 61.0%
5 firsts: 74.0%
10 firsts: 74.0%

This means that 61% of characters are recognized as the first match and 74% are recognized within the first 5 or 10 results. Even considering the first 10 matches, which is an acceptable list length, the error rate is still 26%, which is pretty high. Here’s a more detailed view of the results. Interestingly, we can see that kanji with the same radical or shape are often in the candidate list.

駅      X
妨      1       妨, 姨, 姙, 枋, 枕
坊      1       坊, 垓, 坑, 拡, 択
発      X       癸, 廢
歯      X
全      1       全, 舎, 舍, 早, 果
金      X       昂, 氤, 釘, 覇
板      X       被
忘      1       忘, 忌, 志, 芯, 忠
女      1       女, 冊, 木, 仄, 攵
族      X       楾
始      1       始, 姶, 恰, 娯, 娃
錬      1       錬, 顕, 鍜, 鰊
集      1       集, 賃, 寔, 夐, 募
旅      X
坂      4       扼, 拔, 城, 坂, 披
訪      X       詫, 誇, 就, 駱
水      3       氷, 丞, 水, 妃, 羽
三      1       三, 工, 弖, 王, 玉
想      X       慧, 慂
神      1       神, 裡, 祝, 殉, 術
副      1       副, 歇, 飮, 尠, 飩
安      1       安, 宋, 宏, 免, 案
泣      1       泣, 注, 浜, 淳, 泡
二      1       二, 井, 云, 元, こ
感      1       感
代      1       代, 伐, 陀, 弛, 池
撃      1       撃, 磬
温      1       温, 溜, 塩, 溘, 溝
漢      1       漢, 灘, 嘱
一      1       一, 廾, 弋, 十, 七
象      X       豫
育      1       育, 昌, 匿, 高, 香
氷      2       妁, 氷, 承, 冰, 灰
反      1       反, 皮, 尻, 阪, 伎
業      1       業, 箕, 篇, 賓, 霄
防      1       防, 枋, 枕, 偽, 隧
妄      X       気
初      X
決      X       泥, 沫, 泯, 沸, 泱
央      X       史, 决, 吏, 向, 岔
習      1       習, 跫, 笥, 筥, 筍
練      X       踝, 踴, 閥, 諌, 錬
近      X
化      1       化, 价, 仙, 他, 伊
福      1       福, 熕, 褌, 複, 磆
北      1       北, 把, 地, 托, 叱
便      1       便, 峺, 悗, 栲, 僊
版      X       放, 施, 倣, 昨, 站
使      1       使, 俚, 便, 候, 俾

Three months ago, I started my experimental project by collecting kanji data. I then worked on the project for an hour or two from time to time. I obtained my first results earlier this week, and I was extremely happy to see results at last. It was difficult to stay on track because sometimes I didn’t work on the project for days or weeks. My initial results outperform the current Tomoe recognizer, with some limitations that I will discuss later. I will publish my work and give more details about it in another post.