Tweeter button

Archive for the ‘Machine Learning’ Category

The Little Machine Learner

Thursday, February 18th, 2010

The idea

I’ve been having this idea on my mind for quite some time: wouldn’t it be nice to write a book about Machine Learning where each chapter is a literate program?

From Wikipedia:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

From the PyLit homepage:

The idea is that you do not document programs (after the fact), but write documents that contain the programs.

There are plenty of great textbooks about Machine Learning out there, so the point would not be to write yet another one, but write something different. Here’s what I had been thinking.

  • Each chapter written as a literate program, organized so as to maximize understanding
  • Code in Python (+Numpy + Scipy but without any additional dependencies)
  • Readability over Performance
  • Intuitions, nice figures, useful tips or tricks
  • Real-world applications at the end of each chapter
  • Don’t shy away from the maths, especially if at high-school or undergraduate level…

I bet that quite a few algorithms can be written this way, yet remain very concise!

Except for the maths part, the closest book to this idea that I know of is probably “Programming Collective Intelligence: Building Smart Web 2.0 Applications”, by Toby Segaran.

An example with logistic regression

So, in order to experiment with what such a book could look like, I’ve decided to write a chapter about Logistic Regression. Topics I cover include Maximum Likelihood Estimation, Regularization and Cross-validation. At the end, I use heart disease prediction as an example of real-world application. Probably many things could be improved or added but the point for now is mainly to show what it could look like.

Tools

For the documentation tool, I’ve decided to go for Sphinx, which seems to be emerging as the de-facto documentation tool in the Python community. It has nice features like syntax highlighting, latex support and matplotlib plots support and can output to HTML and PDF.

Normally, in literate programming, there’s the literate source, which uses some kind of markup-language and tools are used to generate either code or documentation from it. I took a different approach. In my case, the source file is the code and the documentation is extracted from the comments in the code. Technically, it’s therefore closer to extensively documented code than actual literate programming. It has some limitations but the main advantages are that the program is runnable directly (since Python is interpreted) and the programmer can benefit from syntax highlighting. I wrote a simple program that converts Python source code to reStructuredText, as necessary for integration in Sphinx.

Interested?

It took quite some time to collect the information and do the actual writing but I feel like I improved my own understanding in the process, so I’m thinking of writing a chapter from time to time. If I do so, at the end of my PhD, I may have gathered enough material to make it a real book! The book could affectionately be entitled “The Little Machine Learner”, hence the title of this post.

Since Machine Learning is a very large field and to write a better book than I could possibly write alone, I’m also thinking that it could actually be a collaborative effort (by researchers, students and practitioners). If you’re interested, please leave a comment. I will create a discussion group if there’s enough interest.

As usual, the source code is available in my git repo:

$ git clone http://www.mblondel.org/code/tlml.git

web interface

Numbers game solver

Monday, April 6th, 2009

Genetic programming is a Machine Learning technique that allows to “create programs that create programs” by simulating evolution. The purpose is to slowly converge to a program that performs a defined task as well as possible.

I wanted to try out Genetic Programming and I figured writing a solver for the “numbers game” would be a good example. If you don’t know it, it’s part of the popular game shows “Countdown” in the UK and “Des chiffres et des lettres” in France. For example, given the numbers 3, 75, 2, 4, 1 and 1, the goal is to arrive at 888 using the four basic arithmetic operations (+, −, × and ÷) like so:

74 = 75 - 1
12 = 4 * 3
888 = 74 * 12

The above calculations can be represented as a tree like this:

        *
     /      \
    -        *
  /   \    /   \
75    1    4    3

So it becomes clear that solving the numbers game is equivalent to writing little programs. Although I used the tree representation, a possibly simpler representation that I now think of is like a string of stack operations, like in Forth:

75 1 - 4 3 * *

You start with an initial population of random programs and every generation, you create a new population by selecting the best-performing programs from the current generation, mutating programs (changing a constant with another constant, changing an operator with another operator…) and breeding programs (creating programs which are the combination of two programs).

The generated programs must respect some constraints. For example, division must not have a remainder so 7 / 2 is not allowed. Another example is that the constants used must be among the 6 input numbers and can only be used once.

In Genetic Programming, to determine how good the generated programs perform, we need a scoring function. Here it’s very clear: the closer we are from the target number, the better. To decide between two solutions that perform as well as the other, we can select the one with the fewer calculations.

You can find my program here.

Another idea for applying Genetic Programming that I had is for creating hash functions. What would be the scoring function in that case? One could use the avalanche effect (the greater the avalanche effect, the better) or the average number of collisions (the smaller number of collisions, the better). Although I had this idea by myself, it seems that researchers are already investing this!

Source code classifier

Wednesday, March 18th, 2009

Of the numerous Machine Learning algorithms, naive Bayes classifiers are probably one of the simplest forms. Yet, they are known to be able to yield very good performance in some circumstances. The most famous example of their use is probably the spam filters, in which spam and non-spam emails are marked beforehand and throughout the filter’s life in order to train the filter to recognize spam from non-spam.

I’ve decided to try out this technique for source code classification: given a source file for which we don’t know the programming language, the purpose is to output the n most probable programming languages in which the source is written. Although I did it mainly to see what kind of results I could get, one use I can think of is in a text-editor, to activate or deactivate syntax highlighting when the file extension is missing.

Difficulties

I can think of several difficulties in trying to classify code source files:
- Unlike many natural languages in which spaces delimit words, in programming languages, spaces can often be omitted. For example print(”hello”) and print (”hello”) could well both be correct in a given language. The lack for spaces is detrimental to the performance because print(”hello”) will be treated as a single word and as a result have a very low probability.
- Programs are sometimes heavily commented by the programmer, which can cause the classifier to confuse source files and plain-text files.
- Since we don’t know the language in advance, we cannot rely on a grammar. Intuitively, we need to use the occurrence probability of words for each programming language.

Collect data

Because this technique involves training the program to recognize programming languages, it was first necessary to collect source files for different programming languages. I decided to limit myself to 24 languages from Programming Language Popularity for which I had easy access to open-source programs. Of course it is possible to add more by training the program with more source files.

For each programming language, I grabbed a few projects written in the language and for interpreted languages, I also grabbed the virtual machine source code because their standard library is often written in the language directly. I downloaded all the source files with the Debian command “apt-get source”, which grabs project sources from Debian mirrors, making the task a little bit more convenient but overall collecting source files was the most boring and time-consuming part of this project.

The thing is, to account for different programming styles and to find a good approximation of the word distributions, especially given the difficulties mentioned above, it’s very important to collect as much source files as one can. Once this was done I was able to start the program itself.

Ruby and C implementations

I made the program (trainer and classifier) in Ruby. The training part is quite slow (Ruby is quite slow for text processing) but the classifier is well responsive, which is the most important. I also made a C version of the classifier: the Ruby program is used to generate a header file (.h) which contains a hash table with the data for the statistical model and when the classifier is compiled, the hash table is included directly in the executable for blazing fast processing. The executable remains small (~200K for 24 languages).

An advantage of this approach is that most of the work is done offline (i.e. during the training) and when I tweak some parameters in the Ruby program, I just have to retrain, regenerate the .h file and recompile. The core C module doesn’t change.

Performance evaluation

I used a validation set to estimate how good the classifier performs: the percentage of correctly classified source files and the percentage of incorrectly classified source files and I got an overall accuracy of 85%. The validation set was composed of 40 sources files for each of the 24 programming languages. Here are a few interesting points I noticed:

- A few languages such ada, css, forth, ocaml are recognized correctly 100% of the time. Others such as tcl, erlang, scheme, haskell, smalltalk get an average accuracy over 95%. This shows that these languages stand out in their word distributions: the most common words in these languages are quite different from the most common words in other languages making it easier for the classifier to distinguish them.

- On the contrary, other languages get poor accuracy because they easily get recognized as another language. An example is C++ which, according to my tests, got recognized as C or Javascript in 30% of the source files. This is not very surprising knowing that these languages have a similar syntax.

- It’s also interesting to note what languages does the classifier recognizes when it makes mistakes. This shows us how close some languages can be. For example, not surpringly, the classifiers sometimes confuses common lisp with scheme and vice-versa.

Source / Plain text

If this classifier had to be integrated into a text editor, it would first be necessary to tell if the file is even source code or just plain text (like a README file). To do that, I would personally build a second model whose role is solely to distinguish source code from plain text. It’s possible to do that by training the classifier appropriately. If the file is recognized as plain text, syntax highlighting can be disabled, otherwise the programming language model can be used to determine the programming language and set the syntax highlighting accordingly. The only disadvantage to this approach is that the first model becomes a bottleneck to the overall process. If it does a poor job at distinguishing plain text from source code, then it is the whole system that will do a poor job.

Download

You can get the source code (LGPL license) with this command:
$ git clone http://www.mblondel.org/code/source-classifier.git

or by browsing the web interface.

HCCR using AJAX and SVM

Tuesday, September 9th, 2008

I came across this impressive page (in Japanese), which shows a demonstration of Handwritten Chinese Character Recognition using AJAX for the user interface and Support Vector Machines for the training algorithm.

Looking at the Javascript code, I was surprised to see that, unlike my web canvas, it doesn’t use the <canvas> tag! It simply uses a combination of Javascript and CSS. Even though it has a few quirks, the interface is quite responsive.

The recognition process itself happens on the server side but thanks to the use of AJAX, the results are displayed very smoothly, without the need to refresh the page.

Taku Kudo, the author, explains in the page that he’s using the handwriting data from Tomoe. However since Tomoe uses a template-based algorithm, it only has one handwriting sample per character. I’m impressed that Taku Kudo can train his system only with one sample per character. Overall, the accuracy is not very impressive but I think it could improve a lot with more training samples. That’s why my handwriting database project is going to be very useful. I’ve been willing to try SVM in addition to HMMs so the fact that this project uses SVM confirmed my interest for it.

Taku Kudo’s page has neat stuff regarding Natural Language Processing and Machine Learning. He published a lot of libraries as free software, including Mecab and TinySVM. If you like fancy stuff in AJAX, you’ll also like his Japanese Input Method.

Character encoding detection

Sunday, August 17th, 2008

Two years ago, I wrote about a port to Ruby of Universal Encoding Detector, which is itself a port to Python of Mozilla’s character encoding detection algorithm.

Recently being interested in Machine Learning, I read about naive Bayes classifiers. I then remembered the encoding detector program and thought that naive Bayes classifiers would be a good candidate for this kind of problem. Going back to the Universal Encoding Detector’s home page, I found a link to:

A composite approach to language/encoding detection

This is an interesting read. Here’s a summary. The algorithm used is a composite approach of 3 different methods:

Coding scheme method

A state machine is used to identify illegal code points (in which case we can remove the encoding from the search space) or code points that only exist in one encoding (in which case we can immediately identify the encoding). This method is good for multi-byte encodings but not for all and is not good for single-byte encodings because unused code points in one encoding are very seldom used in other encodings.

Character distribution method

For languages with many different characters like Chinese, Japanese and Korean, use the Character distribution. The paper has interesting figures about this.

Number of Most
Frequent Characters
Accumulated Percentage
10 0.11723
64

0.31983
128 0.45298

256 0.61872
512 0.79135
1024

0.92260
2048 0.98505

4096 0.99929
6763 1.00000

Character distribution in Simplified Chinese

Number of Most
Frequent Characters
Accumulated Percentage
10 0.27098
64 0.66722
128

0.77094
256 0.85710

512 0.92635
1024 0.97130
2048

0.99431
4096 0.99981

1.00000

Character distribution in Japanese

In other words, even though Chinese and Japanese include literally thousands of characters, the 512 most frequent characters represent respectively 79% and 93% of the whole character distribution. This is extremely interesting! This means that we can use the frequency of occurence of code points to detect encodings.

Although a naive Bayes classifier could have been used for this, a much simpler approach was taken. Characters were divided into two categories: frequent and non-frequent characters. Are considered frequent the characters that fall into the first 512 most frequent characters. For a given file the encoding of which is unknown, we define a distribution ratio r_file:

n_freq = number of frequent characters in the file
n_total = total number of characters in the file
r_file = n_freq / (n_total - n_freq)

We compare r_file with the average distribution ratio for a given encoding, r_encoding. This comparison, in the form of the ratio of the two distribution ratios, gives us our confidence level.

confidence_level = r_file / r_encoding

We calculate the confidence level for each encoding by replacing r_encoding with the corresponding value and the highest confidence level gives us the most likely encoding.

This technique requires a training set, that is, files for which we know the encoding. First of all this allows us to find the character frequency for each character. Second of all, it allows us to determine the average distribution ratio r_encoding for each encoding.

n_freq = number of frequent characters in the files from the training set with a given encoding
n_total = total number of characters in the same files
r_encoding = n_freq / (n_total - n_freq)

2-character sequence distribution method

For character sets with few different characters such as in western languages, a simple character distribution is not enough to discriminate among encodings so a distribution of sequences of 2 characters is used instead.

Accuracy

100% accuracy was achieved when the detector was applied to the home pages of 100 popular international web sites, according to the paper.

Handwriting renderers

Sunday, July 13th, 2008

Canvas

If you didn’t read my previous post, for short, project Tegaki is a framework for handwritten Chinese character recognition (HCCR) written in Python. It includes reusable components and is a placeholder for experimentation. The goal is to create the next-generation open-source HCCR software but it may be useful for academic researchers as well.

One reusable component is the Canvas. This is the user interface component that allows to draw characters. In addition, the Canvas supports “replaying” the character (stroke by stroke animation) and setting a background model (to help users draw an unknown character). It is multi-platform.

The Canvas
Example of a character drawn using the Canvas provided by libtegaki-gtk

The Canvas has a get_writing() method. It allows to retrieve the Writing object for the handwriting currently displayed in the Canvas.

XML representation

The Writing object supports reading from and writing to an XML file. The XML file can optionally be compressed using gzip or bz2. On my hard drive, I have a small set of handwriting samples. 500 characters take about 10 MB. That’s why compression is very useful.

The XML representation of a handwriting sample looks like that.

<character>
  <utf8>無</utf8>
  <strokes>
    <stroke>
      <point x="306" y="163" timestamp="0" />
      <point x="303" y="163" timestamp="21" />
      <point x="303" y="166" timestamp="29" />
      [...]
    </stroke>
    <stroke>
      <point x=”266″ y=”240″ timestamp=”912″ />
      <point x=”270″ y=”240″ timestamp=”917″ />
      <point x=”273″ y=”240″ timestamp=”925″ />
      [...]
    </stroke>
    [...]
  </strokes>
</character>

Renderers

I’ve recently added support for what I named “renderers”. They take a Writing object as parameter and generate a visual representation of it. Since I used the cairo graphics library as drawing backend, the representation can be saved to PNG, SVG and PDF! Those renderers will be very useful for the handwriting database website that I wrote about in my previous post!

Complete character renderer

Kanji

Stroke order renderer

Kanji
Stroke order with each single stroke

Kanji
Stroke order with stroke groups

Strokes can be grouped together when the stroke order is obvious. However, this requires to know which strokes to combine together. A dictionary must be created for that. A entry example would be:

駅 1,1,3,1,4,2,2

<canvas> HTML tag

The canvas I was writing about above is written in pygtk and is intended to be used for the Desktop or for Maemo. However, in the case of the handwriting database website, since we want as many people to contribute their handwriting as possible, it would be nice to not require any particular installation. For that, a canvas directly in the browser would be the ideal solution.

One solution would be to use Flash but I would prefer to use the <canvas> tag. It can be used in combination with Javascript to do drawing in the browser. It is supported natively by Firefox, Opera and Safari. It is supported in Internet Explorer through a third-party Javascript called ExplorerCanvas.

I am looking for a contributor to create a new canvas using this technology. The canvas should support drawing, displaying existing handwriting and replay (stroke by stroke animation).

For more information:

GIF stroke animation

Even though GIF uses a patented compression, GIF is still the only format with support for animations and wide support in the browsers. Therefore it would be very cool to be able to generate GIF stroke animations from a writing object.

I had a look at python-imagemagick and Python Imaging Library (PIL) but they both seem to have very limited support for GIF animations. So I’m thinking of writing my own library for GIF generation in Python. Byzanz, a software to create screencasts as GIF animations, can be used as inspiration because it includes a GIF encoder. It also supports color quantization (using octrees) and dithering. From what I see, it should take less than 1000 lines of Python code.

I read a little bit about color quantization. I found it very interesting. Here’s a short explanation about color quantization for those who don’t know about it. Basically, each pixel in an image may have three components Red Blue Green. For a 400×400 picture, this is about 400*400*3=480KB. To gain space, an idea is to store colors in a palette (a table index => color). Then each pixel only needs to refer to the index in the palette instead of having to define the three components. For a 256-color palette, this saves two bytes for each pixel. However, since we now use 256 colors only instead of 256 * 256 * 256 = 16,777,216 colors, there’s a color precision loss. The challenge is thus to find what colors to put in the palette to have the smallest precision loss possible. For example, we may want to put in the palette colors that are the closest to the most frequently used colors. This is a 3-dimensional clustering problem, thus it reminded me of Machine Learning, a topic in which I’ve been very interested recently.

For more information, I recommend the reading of those Wikipedia articles:

First HMM experiment

Sunday, May 25th, 2008

Today I’m publishing the initial results of my experiments on online handwriting recognition of Chinese characters, using Hidden Markov Models (HMM). You can see my post on Tomoe Evaluation for some background.

Download

$ git clone http://www.mblondel.org/code/hwr.git

The code can be browsed online using gitweb.

See my memo on git if you don’t know it yet. I published my work under GPL license.

Requirements

- Python (2.4)
- GHMM (SVN)
- Tomoe (SVN)
- Tomoe-GTK (SVN)

The Python bindings for the last three are also needed.

Folder structure

- data/ contains the raw training and evaluation data
- lib/ contains reusable components
- models/ contains model experiments
- tests/ contains test cases
- character-editor is the graphical interface to edit character data
- model-manager controls the training workflow, evaluation and the test pad

Each model must have an intelligible name. Each model must defined a file called model.py containing a class called Model. This class defines the behavior of the model. A model can inherit from other models in order to reuse common components. My first model is called “basic” so its file is models/basic/model.py.

First model

Here is some information regarding my first model.

- HMM unit: whole character
- Feature vectors: (deltax, deltay) with deltax = abs(xt - xt-1) and deltay = abs(yt - yt-1)
- Number of states: 3 * number of strokes
- Initial state transitions: 0.5 to stay in the same state, 0.5 to jump to the next state
- Initial state alignment: feature vectors are segmented uniformly and segments are associated with their corresponding state
- Training: Baum-Welch

If you don’t understand anything of the above, you should read more about HMMs ;) I may write an introduction on this journal if I have some time.

Training workflow

model-manger’s usage is as follows:
./model-manager model-name command

My first model is named “basic” so you may replace model-name by “basic”. Possible commands include:

- fextract, for the feature vectors extraction
- init, for the model initialization
- train, for the training
- eval, for the evaluation
- pad, to test the model with your own handwriting

“all” is a command equivalent to fextract, init, train and eval.

Testing with your own handwriting

First of all you should generate the HMMs with the following command:
./model-manager basic all

The process takes less than one minute on my computer. You may see a few warnings because of some issues in ghmm and tomoe. If all goes well, you should see the accuracy of the model.

From this point, normally, you could test the HMMs with your own handwriting with the following command:

./model-manager basic pad

However, for strange reasons, ghmm behaves incorrectly when the pygtk module is loaded. So the above command works but the character results will be incorrect. I need to contact the pygtk or ghmm mailing-list about this obscure issue. For now, you can use the following command:

./model-manager pad | ./model-manager basic eval -s

The results are displayed on the console. The system supports the following 50 kanji only.

一 二 三 泣 漢 温 使 便 旅 族
水 氷 撃 女 安 北 化 忘 妄 近
集 育 坊 訪 防 妨 駅 福 副 神
版 坂 板 金 全 錬 練 業 習 央
決 代 反 想 歯 象 始 初 発 感

Pick a few of them and try them with your own handwriting ;-)! By the way, all training and evaluation data were written by mouse.

Evaluation

match1: 80.0%
match5: 96.0%
match10: 98.0%

始	1	始, 福, 駅, 錬, 漢
旅	1	旅, 族, 駅, 練, 副
妨	1	妨, 練, 錬, 板, 発
防	1	防, 訪, 旅, 族, 板
泣	1	泣, 温, 福, 練, 駅
副	1	副, 訪, 福, 撃, 初
福	1	福, 練, 錬, 副, 駅
坂	3	板, 駅, 坂, 族, 錬
代	1	代, 板, 漢, 使, 駅
反	1	反, 福, 副, 忘, 妄
撃	3	駅, 錬, 撃, 漢, 副
業	1	業, 練, 錬, 集, 駅
氷	2	駅, 氷, 水, 妨, 版
温	1	温, 福, 駅, 錬, 想
育	1	育, 練, 副, 駅, 福
神	2	練, 神, 福, 錬, 撃
近	1	近, 駅, 練, 漢, 福
化	1	化, 練, 駅, 便, 習
一	X
央	1	央, 決, 業, 駅, 発
族	1	族, 練, 旅, 錬, 副
安	4	妄, 駅, 福, 安, 族
象	1	象, 駅, 錬, 練, 集
歯	1	歯, 練, 錬, 駅, 副
錬	1	錬, 練, 集, 駅, 福
習	1	習, 錬, 福, 駅, 漢
使	1	使, 便, 漢, 錬, 練
訪	1	訪, 駅, 錬, 副, 板
漢	1	漢, 錬, 駅, 練, 業
全	1	全, 金, 集, 錬, 福
集	1	集, 練, 業, 錬, 福
版	1	版, 板, 錬, 駅, 集
水	2	氷, 水, 旅, 駅, 便
板	1	板, 族, 坂, 福, 駅
妄	1	妄, 駅, 福, 忘, 練
初	1	初, 駅, 旅, 練, 坂
想	1	想, 駅, 副, 錬, 集
発	1	発, 練, 福, 駅, 漢
練	1	練, 錬, 福, 駅, 板
北	1	北, 坂, 副, 駅, 板
決	1	決, 漢, 便, 練, 坂
坊	X
駅	1	駅, 錬, 練, 族, 福
金	1	金, 発, 練, 錬, 駅
女	5	駅, 妨, 妄, 板, 女
忘	1	忘, 族, 副, 福, 駅
二	1	二, 三, 忘, 歯, 習
感	1	感, 福, 駅, 族, 練
便	3	練, 駅, 便, 錬, 福
三	1	三, 忘, 副, 版, 訪

The results are very promising and outperform Tomoe’s current recognizer. Incidentally, I used the same evaluation corpus for Tomoe and for my experiment. However, a few things must be emphasized:

- My experiment only supports 50 kanji while Tomoe supports thousands of them.
- The evaluation of my experiment is performed using kanji from the same people who wrote the kanji used for training. However, the kanji instances for training and evaluation are not the same.
- It’s pretty sure that using the whole character HMM symbol will not perform well in terms of computation time with thousands of kanji. Usually, stroke or sub-stroke models are preferred.

Interestingly, my recognizer doesn’t do a good job at recognizing the simplest characters: 一 二 三.

Both Tomoe and my recognizer are sensitive to stroke order. However, as it seems, my recognizer is not so sensitive to stroke number. For example, く in 女 is one stroke but it’s acceptable to write it in two strokes. However, if you write く after 一 and ノ, it doesn’t work.

Call to online handwriting database

If you’re a researcher in handwriting recognition and read this, I’m looking for a handwriting database of Chinese characters (kanji or hanzi). Please contact me if you can help me.

What’s next?

- Try more sophisticated feature vectors
- Try more sophisticated initial state alignment
- Try stroke and sub-stroke HMMs
- Collect more data
- Try techniques other than HMMs

Tomoe Evaluation

Sunday, May 25th, 2008

In last December, I started a one-year internship at Asahi Kasei, in their Atsugi-based speech recognition group. Even if I have been doing quite a deal of software development, I have been able to study Hidden Markov Models (HMM) and statistics. It turns out that, hehe, I like it!

One year ago, I started to contribute to Tomoe, as part of my participation to the Google Summer of Code. This experience raised my interest in handwriting recognition, especially of Chinese characters. When I studied the Hidden Markov Models, I always kept in mind Handwriting Recognition. “How would I do this? How would I do that?”. This helped me raise more questions and have a better understanding.

One thing that I learned during this internship is the notion of corpus (plural: corpora), more precisely training and evaluation corpus. Three months ago I started my experiment project with Chinese character handwriting recognition. The first thing I had to do was to create corpora. I reused the canvas provided in Tomoe to create a character editor. The user draws the character and it is saved to a file in XML format.

Together with a Japanese friend, we selected 50 kanji. Some simple, some complex. Some completely different, some very similar to others. We each wrote 5 instances of each kanji. 8 instances were intended for training corpus. The data are used to train the system how to recognize kanji. 2 instances were intended for evaluation. The data are used to estimate how good the system performs. The performance is describe in terms of accuracy or error rate. The evaluation allows to measure improvements when one recognizer is tuned or to compare how well two recognizers perform, provided that the evaluation corpus is well designed (large and representative enough).

Well, Tomoe doesn’t use statistical learning yet so I didn’t use the training corpus for it. However, the next thing I did after collecting data was to use the evaluation corpus in order to evaluate Tomoe’s performance. At the time of the Google Summer of Code, I didn’t have this idea, although it now seems obvious to me. Verdict:

1st match: 61.0%
5 firsts: 74.0%
10 firsts: 74.0%

This means that 61% of characters are recognized as fist match and 74% are recognized in the first 5 or 10 results. Considering the first 10 matches, which is acceptable, the error rate is still 26%, which is pretty high. Here’s a more detailed view of the results. Interestingly, we can see that kanji with the same radical or shape are often in the candidate list.

駅      X
妨      1       妨, 姨, 姙, 枋, 枕
坊      1       坊, 垓, 坑, 拡, 択
発      X       癸, 廢
歯      X
全      1       全, 舎, 舍, 早, 果
金      X       昂, 氤, 釘, 覇
板      X       被
忘      1       忘, 忌, 志, 芯, 忠
女      1       女, 冊, 木, 仄, 攵
族      X       楾
始      1       始, 姶, 恰, 娯, 娃
錬      1       錬, 顕, 鍜, 鰊
集      1       集, 賃, 寔, 夐, 募
旅      X
坂      4       扼, 拔, 城, 坂, 披
訪      X       詫, 誇, 就, 駱
水      3       氷, 丞, 水, 妃, 羽
三      1       三, 工, 弖, 王, 玉
想      X       慧, 慂
神      1       神, 裡, 祝, 殉, 術
副      1       副, 歇, 飮, 尠, 飩
安      1       安, 宋, 宏, 免, 案
泣      1       泣, 注, 浜, 淳, 泡
二      1       二, 井, 云, 元, こ
感      1       感
代      1       代, 伐, 陀, 弛, 池
撃      1       撃, 磬
温      1       温, 溜, 塩, 溘, 溝
漢      1       漢, 灘, 嘱
一      1       一, 廾, 弋, 十, 七
象      X       豫
育      1       育, 昌, 匿, 高, 香
氷      2       妁, 氷, 承, 冰, 灰
反      1       反, 皮, 尻, 阪, 伎
業      1       業, 箕, 篇, 賓, 霄
防      1       防, 枋, 枕, 偽, 隧
妄      X       気
初      X
決      X       泥, 沫, 泯, 沸, 泱
央      X       史, 决, 吏, 向, 岔
習      1       習, 跫, 笥, 筥, 筍
練      X       踝, 踴, 閥, 諌, 錬
近      X
化      1       化, 价, 仙, 他, 伊
福      1       福, 熕, 褌, 複, 磆
北      1       北, 把, 地, 托, 叱
便      1       便, 峺, 悗, 栲, 僊
版      X       放, 施, 倣, 昨, 站
使      1       使, 俚, 便, 候, 俾

Three months ago, I started my experiment project when I collected kanji data. I then worked on the project an hour or two from time to time. I obtained my first results earlier this week. I was extremely happy of seeing results at last. It was difficult to keep on track because sometimes, I didn’t work on the project for days or weeks. My initial results outperform the current Tomoe recognizer, with some limitations, that I will develop later. I will publish my work and give more details about it in another post.