Archive for the 'Machine Learning' Category

HCCR using AJAX and SVM

Tuesday, September 9th, 2008

I came across this impressive page (in Japanese), which shows a demonstration of Handwritten Chinese Character Recognition using AJAX for the user interface and Support Vector Machines for the training algorithm.

Looking at the Javascript code, I was surprised to see that, unlike my web canvas, it doesn’t use the <canvas> tag! It simply uses a combination of Javascript and CSS. Even though it has a few quirks, the interface is quite responsive.

The recognition process itself happens on the server side but thanks to the use of AJAX, the results are displayed very smoothly, without the need to refresh the page.

Taku Kudo, the author, explains in the page that he’s using the handwriting data from Tomoe. However since Tomoe uses a template-based algorithm, it only has one handwriting sample per character. I’m impressed that Taku Kudo can train his system only with one sample per character. Overall, the accuracy is not very impressive but I think it could improve a lot with more training samples. That’s why my handwriting database project is going to be very useful. I’ve been willing to try SVM in addition to HMMs so the fact that this project uses SVM confirmed my interest for it.

Taku Kudo’s page has neat stuff regarding Natural Language Processing and Machine Learning. He published a lot of libraries as free software, including Mecab and TinySVM. If you like fancy stuff in AJAX, you’ll also like his Japanese Input Method.

Character encoding detection

Sunday, August 17th, 2008

Two years ago, I wrote about a port to Ruby of Universal Encoding Detector, which is itself a port to Python of Mozilla’s character encoding detection algorithm.

Recently being interested in Machine Learning, I read about naive Bayes classifiers. I then remembered the encoding detector program and thought that naive Bayes classifiers would be a good candidate for this kind of problem. Going back to the Universal Encoding Detector’s home page, I found a link to:

A composite approach to language/encoding detection

This is an interesting read. Here’s a summary. The algorithm used is a composite approach of 3 different methods:

Coding scheme method

A state machine is used to identify illegal code points (in which case we can remove the encoding from the search space) or code points that only exist in one encoding (in which case we can immediately identify the encoding). This method is good for multi-byte encodings but not for all and is not good for single-byte encodings because unused code points in one encoding are very seldom used in other encodings.

Character distribution method

For languages with many different characters like Chinese, Japanese and Korean, use the Character distribution. The paper has interesting figures about this.

Number of Most
Frequent Characters
Accumulated Percentage
10 0.11723
64

0.31983
128 0.45298

256 0.61872
512 0.79135
1024

0.92260
2048 0.98505

4096 0.99929
6763 1.00000

Character distribution in Simplified Chinese

Number of Most
Frequent Characters
Accumulated Percentage
10 0.27098
64 0.66722
128

0.77094
256 0.85710

512 0.92635
1024 0.97130
2048

0.99431
4096 0.99981

1.00000

Character distribution in Japanese

In other words, even though Chinese and Japanese include literally thousands of characters, the 512 most frequent characters represent respectively 79% and 93% of the whole character distribution. This is extremely interesting! This means that we can use the frequency of occurence of code points to detect encodings.

Although a naive Bayes classifier could have been used for this, a much simpler approach was taken. Characters were divided into two categories: frequent and non-frequent characters. Are considered frequent the characters that fall into the first 512 most frequent characters. For a given file the encoding of which is unknown, we define a distribution ratio r_file:

n_freq = number of frequent characters in the file
n_total = total number of characters in the file
r_file = n_freq / (n_total - n_freq)

We compare r_file with the average distribution ratio for a given encoding, r_encoding. This comparison, in the form of the ratio of the two distribution ratios, gives us our confidence level.

confidence_level = r_file / r_encoding

We calculate the confidence level for each encoding by replacing r_encoding with the corresponding value and the highest confidence level gives us the most likely encoding.

This technique requires a training set, that is, files for which we know the encoding. First of all this allows us to find the character frequency for each character. Second of all, it allows us to determine the average distribution ratio r_encoding for each encoding.

n_freq = number of frequent characters in the files from the training set with a given encoding
n_total = total number of characters in the same files
r_encoding = n_freq / (n_total - n_freq)

2-character sequence distribution method

For character sets with few different characters such as in western languages, a simple character distribution is not enough to discriminate among encodings so a distribution of sequences of 2 characters is used instead.

Accuracy

100% accuracy was achieved when the detector was applied to the home pages of 100 popular international web sites, according to the paper.

Handwriting renderers

Sunday, July 13th, 2008

Canvas

If you didn’t read my previous post, for short, project Tegaki is a framework for handwritten Chinese character recognition (HCCR) written in Python. It includes reusable components and is a placeholder for experimentation. The goal is to create the next-generation open-source HCCR software but it may be useful for academic researchers as well.

One reusable component is the Canvas. This is the user interface component that allows to draw characters. In addition, the Canvas supports “replaying” the character (stroke by stroke animation) and setting a background model (to help users draw an unknown character). It is multi-platform.

The Canvas
Example of a character drawn using the Canvas provided by libtegaki-gtk

The Canvas has a get_writing() method. It allows to retrieve the Writing object for the handwriting currently displayed in the Canvas.

XML representation

The Writing object supports reading from and writing to an XML file. The XML file can optionally be compressed using gzip or bz2. On my hard drive, I have a small set of handwriting samples. 500 characters take about 10 MB. That’s why compression is very useful.

The XML representation of a handwriting sample looks like that.

<character>
  <utf8>無</utf8>
  <strokes>
    <stroke>
      <point x="306" y="163" timestamp="0" />
      <point x="303" y="163" timestamp="21" />
      <point x="303" y="166" timestamp="29" />
      [...]
    </stroke>
    <stroke>
      <point x="266" y="240" timestamp="912" />
      <point x="270" y="240" timestamp="917" />
      <point x="273" y="240" timestamp="925" />
      [...]
    </stroke>
    [...]
  </strokes>
</character>

Renderers

I’ve recently added support for what I named “renderers”. They take a Writing object as parameter and generate a visual representation of it. Since I used the cairo graphics library as drawing backend, the representation can be saved to PNG, SVG and PDF! Those renderers will be very useful for the handwriting database website that I wrote about in my previous post!

Complete character renderer

Kanji

Stroke order renderer

Kanji
Stroke order with each single stroke

Kanji
Stroke order with stroke groups

Strokes can be grouped together when the stroke order is obvious. However, this requires to know which strokes to combine together. A dictionary must be created for that. A entry example would be:

駅 1,1,3,1,4,2,2

<canvas> HTML tag

The canvas I was writing about above is written in pygtk and is intended to be used for the Desktop or for Maemo. However, in the case of the handwriting database website, since we want as many people to contribute their handwriting as possible, it would be nice to not require any particular installation. For that, a canvas directly in the browser would be the ideal solution.

One solution would be to use Flash but I would prefer to use the <canvas> tag. It can be used in combination with Javascript to do drawing in the browser. It is supported natively by Firefox, Opera and Safari. It is supported in Internet Explorer through a third-party Javascript called ExplorerCanvas.

I am looking for a contributor to create a new canvas using this technology. The canvas should support drawing, displaying existing handwriting and replay (stroke by stroke animation).

For more information:

GIF stroke animation

Even though GIF uses a patented compression, GIF is still the only format with support for animations and wide support in the browsers. Therefore it would be very cool to be able to generate GIF stroke animations from a writing object.

I had a look at python-imagemagick and Python Imaging Library (PIL) but they both seem to have very limited support for GIF animations. So I’m thinking of writing my own library for GIF generation in Python. Byzanz, a software to create screencasts as GIF animations, can be used as inspiration because it includes a GIF encoder. It also supports color quantization (using octrees) and dithering. From what I see, it should take less than 1000 lines of Python code.

I read a little bit about color quantization. I found it very interesting. Here’s a short explanation about color quantization for those who don’t know about it. Basically, each pixel in an image may have three components Red Blue Green. For a 400×400 picture, this is about 400*400*3=480KB. To gain space, an idea is to store colors in a palette (a table index => color). Then each pixel only needs to refer to the index in the palette instead of having to define the three components. For a 256-color palette, this saves two bytes for each pixel. However, since we now use 256 colors only instead of 256 * 256 * 256 = 16,777,216 colors, there’s a color precision loss. The challenge is thus to find what colors to put in the palette to have the smallest precision loss possible. For example, we may want to put in the palette colors that are the closest to the most frequently used colors. This is a 3-dimensional clustering problem, thus it reminded me of Machine Learning, a topic in which I’ve been very interested recently.

For more information, I recommend the reading of those Wikipedia articles:

First HMM experiment

Sunday, May 25th, 2008

Today I’m publishing the initial results of my experiments on online handwriting recognition of Chinese characters, using Hidden Markov Models (HMM). You can see my post on Tomoe Evaluation for some background.

Download

$ git clone http://www.mblondel.org/code/hwr.git

The code can be browsed online using gitweb.

See my memo on git if you don’t know it yet. I published my work under GPL license.

Requirements

- Python (2.4)
- GHMM (SVN)
- Tomoe (SVN)
- Tomoe-GTK (SVN)

The Python bindings for the last three are also needed.

Folder structure

- data/ contains the raw training and evaluation data
- lib/ contains reusable components
- models/ contains model experiments
- tests/ contains test cases
- character-editor is the graphical interface to edit character data
- model-manager controls the training workflow, evaluation and the test pad

Each model must have an intelligible name. Each model must defined a file called model.py containing a class called Model. This class defines the behavior of the model. A model can inherit from other models in order to reuse common components. My first model is called “basic” so its file is models/basic/model.py.

First model

Here is some information regarding my first model.

- HMM unit: whole character
- Feature vectors: (deltax, deltay) with deltax = abs(xt - xt-1) and deltay = abs(yt - yt-1)
- Number of states: 3 * number of strokes
- Initial state transitions: 0.5 to stay in the same state, 0.5 to jump to the next state
- Initial state alignment: feature vectors are segmented uniformly and segments are associated with their corresponding state
- Training: Baum-Welch

If you don’t understand anything of the above, you should read more about HMMs ;) I may write an introduction on this journal if I have some time.

Training workflow

model-manger’s usage is as follows:
./model-manager model-name command

My first model is named “basic” so you may replace model-name by “basic”. Possible commands include:

- fextract, for the feature vectors extraction
- init, for the model initialization
- train, for the training
- eval, for the evaluation
- pad, to test the model with your own handwriting

“all” is a command equivalent to fextract, init, train and eval.

Testing with your own handwriting

First of all you should generate the HMMs with the following command:
./model-manager basic all

The process takes less than one minute on my computer. You may see a few warnings because of some issues in ghmm and tomoe. If all goes well, you should see the accuracy of the model.

From this point, normally, you could test the HMMs with your own handwriting with the following command:

./model-manager basic pad

However, for strange reasons, ghmm behaves incorrectly when the pygtk module is loaded. So the above command works but the character results will be incorrect. I need to contact the pygtk or ghmm mailing-list about this obscure issue. For now, you can use the following command:

./model-manager pad | ./model-manager basic eval -s

The results are displayed on the console. The system supports the following 50 kanji only.

一 二 三 泣 漢 温 使 便 旅 族
水 氷 撃 女 安 北 化 忘 妄 近
集 育 坊 訪 防 妨 駅 福 副 神
版 坂 板 金 全 錬 練 業 習 央
決 代 反 想 歯 象 始 初 発 感

Pick a few of them and try them with your own handwriting ;-)! By the way, all training and evaluation data were written by mouse.

Evaluation

match1: 80.0%
match5: 96.0%
match10: 98.0%

始	1	始, 福, 駅, 錬, 漢
旅	1	旅, 族, 駅, 練, 副
妨	1	妨, 練, 錬, 板, 発
防	1	防, 訪, 旅, 族, 板
泣	1	泣, 温, 福, 練, 駅
副	1	副, 訪, 福, 撃, 初
福	1	福, 練, 錬, 副, 駅
坂	3	板, 駅, 坂, 族, 錬
代	1	代, 板, 漢, 使, 駅
反	1	反, 福, 副, 忘, 妄
撃	3	駅, 錬, 撃, 漢, 副
業	1	業, 練, 錬, 集, 駅
氷	2	駅, 氷, 水, 妨, 版
温	1	温, 福, 駅, 錬, 想
育	1	育, 練, 副, 駅, 福
神	2	練, 神, 福, 錬, 撃
近	1	近, 駅, 練, 漢, 福
化	1	化, 練, 駅, 便, 習
一	X
央	1	央, 決, 業, 駅, 発
族	1	族, 練, 旅, 錬, 副
安	4	妄, 駅, 福, 安, 族
象	1	象, 駅, 錬, 練, 集
歯	1	歯, 練, 錬, 駅, 副
錬	1	錬, 練, 集, 駅, 福
習	1	習, 錬, 福, 駅, 漢
使	1	使, 便, 漢, 錬, 練
訪	1	訪, 駅, 錬, 副, 板
漢	1	漢, 錬, 駅, 練, 業
全	1	全, 金, 集, 錬, 福
集	1	集, 練, 業, 錬, 福
版	1	版, 板, 錬, 駅, 集
水	2	氷, 水, 旅, 駅, 便
板	1	板, 族, 坂, 福, 駅
妄	1	妄, 駅, 福, 忘, 練
初	1	初, 駅, 旅, 練, 坂
想	1	想, 駅, 副, 錬, 集
発	1	発, 練, 福, 駅, 漢
練	1	練, 錬, 福, 駅, 板
北	1	北, 坂, 副, 駅, 板
決	1	決, 漢, 便, 練, 坂
坊	X
駅	1	駅, 錬, 練, 族, 福
金	1	金, 発, 練, 錬, 駅
女	5	駅, 妨, 妄, 板, 女
忘	1	忘, 族, 副, 福, 駅
二	1	二, 三, 忘, 歯, 習
感	1	感, 福, 駅, 族, 練
便	3	練, 駅, 便, 錬, 福
三	1	三, 忘, 副, 版, 訪

The results are very promising and outperform Tomoe’s current recognizer. Incidentally, I used the same evaluation corpus for Tomoe and for my experiment. However, a few things must be emphasized:

- My experiment only supports 50 kanji while Tomoe supports thousands of them.
- The evaluation of my experiment is performed using kanji from the same people who wrote the kanji used for training. However, the kanji instances for training and evaluation are not the same.
- It’s pretty sure that using the whole character HMM symbol will not perform well in terms of computation time with thousands of kanji. Usually, stroke or sub-stroke models are preferred.

Interestingly, my recognizer doesn’t do a good job at recognizing the simplest characters: 一 二 三.

Both Tomoe and my recognizer are sensitive to stroke order. However, as it seems, my recognizer is not so sensitive to stroke number. For example, く in 女 is one stroke but it’s acceptable to write it in two strokes. However, if you write く after 一 and ノ, it doesn’t work.

Call to online handwriting database

If you’re a researcher in handwriting recognition and read this, I’m looking for a handwriting database of Chinese characters (kanji or hanzi). Please contact me if you can help me.

What’s next?

- Try more sophisticated feature vectors
- Try more sophisticated initial state alignment
- Try stroke and sub-stroke HMMs
- Collect more data
- Try techniques other than HMMs

Tomoe Evaluation

Sunday, May 25th, 2008

In last December, I started a one-year internship at Asahi Kasei, in their Atsugi-based speech recognition group. Even if I have been doing quite a deal of software development, I have been able to study Hidden Markov Models (HMM) and statistics. It turns out that, hehe, I like it!

One year ago, I started to contribute to Tomoe, as part of my participation to the Google Summer of Code. This experience raised my interest in handwriting recognition, especially of Chinese characters. When I studied the Hidden Markov Models, I always kept in mind Handwriting Recognition. “How would I do this? How would I do that?”. This helped me raise more questions and have a better understanding.

One thing that I learned during this internship is the notion of corpus (plural: corpora), more precisely training and evaluation corpus. Three months ago I started my experiment project with Chinese character handwriting recognition. The first thing I had to do was to create corpora. I reused the canvas provided in Tomoe to create a character editor. The user draws the character and it is saved to a file in XML format.

Together with a Japanese friend, we selected 50 kanji. Some simple, some complex. Some completely different, some very similar to others. We each wrote 5 instances of each kanji. 8 instances were intended for training corpus. The data are used to train the system how to recognize kanji. 2 instances were intended for evaluation. The data are used to estimate how good the system performs. The performance is describe in terms of accuracy or error rate. The evaluation allows to measure improvements when one recognizer is tuned or to compare how well two recognizers perform, provided that the evaluation corpus is well designed (large and representative enough).

Well, Tomoe doesn’t use statistical learning yet so I didn’t use the training corpus for it. However, the next thing I did after collecting data was to use the evaluation corpus in order to evaluate Tomoe’s performance. At the time of the Google Summer of Code, I didn’t have this idea, although it now seems obvious to me. Verdict:

1st match: 61.0%
5 firsts: 74.0%
10 firsts: 74.0%

This means that 61% of characters are recognized as fist match and 74% are recognized in the first 5 or 10 results. Considering the first 10 matches, which is acceptable, the error rate is still 26%, which is pretty high. Here’s a more detailed view of the results. Interestingly, we can see that kanji with the same radical or shape are often in the candidate list.

駅      X
妨      1       妨, 姨, 姙, 枋, 枕
坊      1       坊, 垓, 坑, 拡, 択
発      X       癸, 廢
歯      X
全      1       全, 舎, 舍, 早, 果
金      X       昂, 氤, 釘, 覇
板      X       被
忘      1       忘, 忌, 志, 芯, 忠
女      1       女, 冊, 木, 仄, 攵
族      X       楾
始      1       始, 姶, 恰, 娯, 娃
錬      1       錬, 顕, 鍜, 鰊
集      1       集, 賃, 寔, 夐, 募
旅      X
坂      4       扼, 拔, 城, 坂, 披
訪      X       詫, 誇, 就, 駱
水      3       氷, 丞, 水, 妃, 羽
三      1       三, 工, 弖, 王, 玉
想      X       慧, 慂
神      1       神, 裡, 祝, 殉, 術
副      1       副, 歇, 飮, 尠, 飩
安      1       安, 宋, 宏, 免, 案
泣      1       泣, 注, 浜, 淳, 泡
二      1       二, 井, 云, 元, こ
感      1       感
代      1       代, 伐, 陀, 弛, 池
撃      1       撃, 磬
温      1       温, 溜, 塩, 溘, 溝
漢      1       漢, 灘, 嘱
一      1       一, 廾, 弋, 十, 七
象      X       豫
育      1       育, 昌, 匿, 高, 香
氷      2       妁, 氷, 承, 冰, 灰
反      1       反, 皮, 尻, 阪, 伎
業      1       業, 箕, 篇, 賓, 霄
防      1       防, 枋, 枕, 偽, 隧
妄      X       気
初      X
決      X       泥, 沫, 泯, 沸, 泱
央      X       史, 决, 吏, 向, 岔
習      1       習, 跫, 笥, 筥, 筍
練      X       踝, 踴, 閥, 諌, 錬
近      X
化      1       化, 价, 仙, 他, 伊
福      1       福, 熕, 褌, 複, 磆
北      1       北, 把, 地, 托, 叱
便      1       便, 峺, 悗, 栲, 僊
版      X       放, 施, 倣, 昨, 站
使      1       使, 俚, 便, 候, 俾

Three months ago, I started my experiment project when I collected kanji data. I then worked on the project an hour or two from time to time. I obtained my first results earlier this week. I was extremely happy of seeing results at last. It was difficult to keep on track because sometimes, I didn’t work on the project for days or weeks. My initial results outperform the current Tomoe recognizer, with some limitations, that I will develop later. I will publish my work and give more details about it in another post.

Some news and pictures of Japan

Sunday, May 25th, 2008

I’ve been living in Japan for 6 months already. I must say, wow, time flies very very fast. This is abroad for me so there’s always something new to do or to see. I don’t have much time for myself or personal projects. But I have (very) slowly been working on some experiments on handwritten Chinese character recognition. Stay tuned.

I will graduate from my engineering school in June (masters) so lately I have been thinking to what I want to do in the future. As I said, times goes very fast. It’s thus very important to fulfill myself during the 8 hours that I spend at work, whatever it may be. So far, I’m pretty sure that I don’t want to work as a software developer for a private company. I want to do something more challenging and in connection with research. Most importantly, I want to LEARN new things every day. Options I’m considering are creating my own company, becoming a research engineer in a private lab or starting a PhD. I’m glad because I think I have narrowed down the fields I would like to work in, in the future. I would like to work in the fields of Machine Learning and Pattern Recognition, especially handwriting recognition, natural language processing (NLP) and speech recognition.

I uploaded some of my pictures of Japan. I still have many others that I need to sort out but if you are interested, take a look.