A roadmap for project Tegaki
Codename Project Tegaki
I wrote in a previous post about my first experiment with applying a modern technique, namely Hidden Markov Models, to handwritten Chinese character recognition. I’m quite motivated to make this more than a single isolated experiment, so I decided to give the project a name: Project Tegaki. This will be the codename for the effort from now on. Tegaki means “handwriting” in Japanese.
Project statement
The aim of Project Tegaki is to push forward the creation of the next-generation open-source handwritten Chinese character recognition (HCCR) software.
Currently, the only open-source package for HCCR is Tomoe. This is a project that I have been contributing to and that I used for my Google Summer of Code project, “Japanese/Chinese handwriting recognition on maemo”. Maemo is the open-source platform used by Nokia PDAs. I have decided to start Project Tegaki as an external effort because I considered that Tomoe would not be a good environment to welcome the effort. However, if the Tomoe community is ready to help me in this effort, I will be happy to merge Project Tegaki back into Tomoe once Project Tegaki becomes ready for prime-time.

Handwritten Chinese character recognition in a PDA…
Here are some goals for the project:
- Free and open-source. The goal is to produce the next-generation free and open-source HCCR software.
- Modern. The software should use modern approaches to handwriting recognition and stay closely connected with research.
- Embedded. The project must be designed to work with devices with restricted resources such as cell phones or PDAs.
- Online, as opposed to offline. In online recognition, characters are drawn using a device, typically a mouse, a tablet or a PDA stylus. In this setting, characters can be represented as sequences of points. In offline recognition, characters are scanned a posteriori. In this setting, characters are represented as images (width * height pixels).
- Isolated Chinese character recognition. Here “Chinese character” is not restricted to the Chinese language, since Japanese kanji are also Chinese characters! Even though the package should theoretically generalize to any kind of character, Chinese characters pose specific challenges, and approaches that give good results for them may not give good results with other kinds of characters. “Isolated character recognition” means that the user will have to draw one character at a time in a separate box, as opposed to continuous handwriting recognition. This makes things much easier, and in the case of Chinese characters it is a reasonable limitation.
- Stroke order dependent and independent. Both situations have useful applications so Project Tegaki should ideally support both.
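To make the online/offline distinction above concrete, here is a minimal sketch in Python. The names `Stroke`, `Character` and `rasterize` are hypothetical and are not taken from the Tegaki code; coordinates are assumed to be normalized to the range [0, 1].

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), assumed normalized to [0, 1]


@dataclass
class Stroke:
    """One pen-down to pen-up movement (hypothetical structure)."""
    points: List[Point] = field(default_factory=list)


@dataclass
class Character:
    """Online representation: an ordered list of strokes."""
    strokes: List[Stroke] = field(default_factory=list)

    def as_point_sequence(self) -> List[Point]:
        # Flatten the strokes into one sequence of points, in drawing order.
        return [p for stroke in self.strokes for p in stroke.points]


def rasterize(character: Character, width: int, height: int) -> List[List[int]]:
    """Offline representation: a width * height bitmap.

    Only the sampled points are marked; the drawing order is lost.
    """
    bitmap = [[0] * width for _ in range(height)]
    for x, y in character.as_point_sequence():
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        bitmap[row][col] = 1
    return bitmap


# The character 十 drawn as a horizontal stroke followed by a vertical one.
ten = Character([
    Stroke([(0.1, 0.5), (0.5, 0.5), (0.9, 0.5)]),
    Stroke([(0.5, 0.1), (0.5, 0.5), (0.5, 0.9)]),
])
points = ten.as_point_sequence()
bitmap = rasterize(ten, 5, 5)
```

The point sequence keeps the stroke order, which a stroke-order-dependent recognizer can exploit. The bitmap discards it, which is the situation an offline recognizer starts from.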
Python?
Usually I’m more of a Ruby fan, but the project was started in Python because it depended on third-party libraries that only exist in Python. Even though I’m slowly moving away from those dependencies, I don’t want to re-implement everything just for the sake of using Ruby. So I’m sticking with Python.
As emphasized above, this project is highly experimental. Moreover, a collaborative website will be created (see below) and it will reuse a number of existing components. It thus makes sense to use a high-level language to focus on the experiments and to create the website.
Subprojects
Project Tegaki is now split into several subprojects.
libtegaki
This Python library contains functionality that will be useful to other subprojects. This includes array manipulation, character input/output, a Viterbi decoder…
libtegaki-gtk
This Python library contains user interface elements that will be useful to other subprojects. So far it only includes a Canvas, which can be used to draw characters. It is a replacement for TomoeCanvas with some additional benefits:
- Truly reusable. TomoeCanvas assumes that a recognizer is connected to the canvas. However, there are situations when a recognizer is not needed.
- Resizable. TomoeCanvas cannot be resized at will.
- Animation. A stroke animation of a character can be displayed.
- Background character. A background character can be set as a model and animations will be displayed to help draw the same character stroke by stroke.
- Extra features. Besides (x,y) coordinates, other features are supported: pen pressure and pen inclination when available, stroke duration, point timestamp.
libtegaki-gtk is written in pygtk and depends on libtegaki.
tegaki-db
The most successful handwriting recognition systems nowadays use a “learn by example” philosophy. For each character supported, several samples of the handwritten character must be provided to the system in order to learn from them. Because those samples are used to train the system, they are called “training samples”. The challenge for the final recognizer is to be able to recognize unseen handwritten instances of the same characters. This is the ability of the recognizer to “generalize” the acquired knowledge.
A “training corpus” is a set of training samples. A good corpus should contain dozens of handwritten samples for each character. The corpus should be representative enough of all handwriting styles. Collecting all the handwriting samples and designing a good corpus is a huge task for Chinese characters because there exist thousands of them!
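As a rough illustration of how a corpus can be used to measure generalization, the sketch below holds out a fraction of the samples of each character, so that a recognizer can be trained on one part and evaluated on samples it has never seen. The function name `split_corpus`, the `(label, sample)` pair format and the 20% hold-out fraction are assumptions made for this example, not part of Tegaki.

```python
import random
from collections import defaultdict


def split_corpus(samples, test_fraction=0.2, seed=0):
    """Split (label, sample) pairs into training and held-out test sets.

    The split is done per character so that every character with at
    least two samples appears in both sets.
    """
    by_char = defaultdict(list)
    for label, sample in samples:
        by_char[label].append(sample)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for label in sorted(by_char):
        items = list(by_char[label])
        rng.shuffle(items)
        # Hold out at least one sample, unless the character has only one.
        n_test = max(1, int(len(items) * test_fraction)) if len(items) > 1 else 0
        test.extend((label, s) for s in items[:n_test])
        train.extend((label, s) for s in items[n_test:])
    return train, test


# Toy corpus: 10 placeholder samples for each of two characters.
corpus = [(char, "sample-%d" % i) for char in ("一", "二") for i in range(10)]
train, test = split_corpus(corpus)
```

The recognition rate measured on `test` then gives an estimate of how well the system handles unseen handwriting.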
Such handwritten Chinese character databases do exist, but they require a fee and are usually restricted to academic research. They are by no means suitable for free software. The goal of the tegaki-db subproject is to create a collaborative web platform to collect handwriting samples. Native speakers and learners alike will be able to log in and contribute their own handwriting. The collected data will be published under a free license so that it can benefit academic research as well. tegaki-db will use a client / server architecture.
tegaki-db-client
tegaki-db-client is a client for people to input their handwriting. It will be written in Python and use the canvas provided by libtegaki-gtk. The client will communicate with the server through web services. The client should be distributed for several platforms such as Linux, Windows and Maemo to increase the number of potential contributors. A detailed specification of tegaki-db and tegaki-db-client will be provided later in a separate post.
tegaki-models
tegaki-models is by no means an end-user package and will only be used by developers. It is the placeholder for experimentation. Thanks to this package, model ideas will be tested and evaluated.
I continued to work on new model ideas… However, because my current training corpus is so small, it’s kind of irrelevant to spend too much time on models. The top priority now is to create tegaki-db.
tegaki-decoder
tegaki-decoder is going to be a high-performance decoder (recognizer). It should be a fast implementation of the Viterbi decoder. It will be written in C and designed to work with embedded systems. This is going to be the end-product that people will use. Once sufficient data have been collected, good models have been generated and the tegaki decoder is ready, then Project Tegaki will be ready for real use! Currently, implementing tegaki-decoder is not the top priority.
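For readers unfamiliar with Viterbi decoding, here is a minimal textbook sketch in Python (the real tegaki-decoder is planned in C). It assumes a discrete HMM given as plain dictionaries of log-probabilities; the state names, the direction symbols "E" and "S" and all the probabilities are made up for illustration and do not come from the Tegaki models.

```python
import math


def log(p):
    # Log-probability, with log(0) mapped to minus infinity.
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[t][s]: best log-probability of any path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy left-to-right model: a horizontal stroke followed by a vertical one.
states = ["horizontal", "vertical"]
log_start = {"horizontal": log(1.0), "vertical": log(0.0)}
log_trans = {
    "horizontal": {"horizontal": log(0.5), "vertical": log(0.5)},
    "vertical": {"horizontal": log(0.0), "vertical": log(1.0)},
}
log_emit = {
    "horizontal": {"E": log(0.9), "S": log(0.1)},  # mostly moves east
    "vertical": {"E": log(0.1), "S": log(0.9)},    # mostly moves south
}
path, score = viterbi(["E", "E", "S", "S"], states, log_start, log_trans, log_emit)
```

One common way to turn this into a recognizer is to keep one model per character, score the input against each of them, and return the characters whose models give the highest scores. Working in log space avoids numerical underflow on long point sequences.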
Roadmap
- Launch tegaki-db and tegaki-db-client.
- Hope that the collaborative effort is successful and collect lots of handwriting samples from many different people.
- Create new models, especially stroke-based models.
- Implement tegaki-decoder.
If I continue to be the only one interested in this project, at this rate it will take from several months to a couple of years to achieve everything. That’s why I hope I can attract a few contributors.
Download
The work completed so far is still very experimental and thus targets potential contributors. If you want to test it with your own handwriting anyway, please see my previous post.
To download the source code, you can use
$ git clone http://www.mblondel.org/code/hwr.git
or
$ git pull
from the repository folder if you already have the repository on your computer.
The code can be browsed online using gitweb. By clicking the “snapshot” links you can get a complete copy of the source code at a given revision.
See my memo on git if you don’t know it yet.
I published my work under the GPL license.
July 5th, 2008 at 10:42 am
Sorry, but it’s a real asshat move, and insulting on top of that, to give a Chinese character recognition project a Japanese name. It’s almost as though you are implying that Chinese derives from Japanese, which would be backwards.
Even if you try to solve the problem you’ve gotten yourself into by claiming that you are building a recognizer for Japanese, that’s still wrong. The core problem you are dealing with is Chinese character recognition, not Japanese. Deal with it as such, without a prissy Japanese project name.
July 6th, 2008 at 2:21 am
I’m really happy that work is still being done on Japanese handwriting recognition for the Maemo platform. Keep up the great work.
July 6th, 2008 at 12:06 pm
To Mark:
Thank you for your delightful comment. I wish I didn’t have to waste my time reading and replying to this kind of comment. However, I think that the only person being insulting here is you. Your message is extremist and rude.
First of all a name is just a name. It’s useful to refer to the project. And the plan is to eventually merge the results of this effort to Tomoe.
By the way, Tomoe itself is an acronym for Tegaki Online MOji recognition Engine. That is no less than three English words and two Japanese words. On top of that, “A tomoe or tomoye (archaic) (巴) is a Japanese abstract shape (i.e. a swirl) that resembles a comma or the usual form of magatama.” (Wikipedia) Hmm, I hope you don’t feel too insulted by that! At least the two Chinese people who contributed the Chinese dictionary of Tomoe didn’t!
Second of all, tegaki means “handwriting”, it doesn’t mean “Chinese character”. As you probably know, Chinese characters are referred to as “kanji” in Japanese and as “hanzi” in Chinese. This is why I didn’t pick “kanji” or “hanzi” for the name of the project.
Third of all, you claim that I’m implying that Chinese derives from Japanese but I explicitly wrote in my message that Japanese kanji come from Chinese!
Fourth of all, tegaki has a nice sound. Even people who don’t know any Japanese can like it and remember it.
I feel insulted by your message because I love both the Chinese and Japanese languages. I’m studying both and I have spent quite some time in both countries. Absolutely nobody has the right to tell me that I mean to insult either of those languages.
July 6th, 2008 at 12:07 pm
To Truls:
Thank you for your kind words. I really appreciate your support.
July 9th, 2008 at 2:07 am
I’m very interested in your project, because I wanted to create almost the same project.
My interest is in creating a digital note-taking platform (including the OS) as a useful replacement for pen and paper.
Pens (pencils) and paper are important tools for creating, processing, summarizing and reusing one’s thoughts.
But pen and paper have some weak points in searching and re-ordering, and they have physical restrictions (the paper size).
So I think it would be very useful if a digital device could be paper-like.
Your interest is in creating an online handwriting recognition application (library) on the Maemo platform, so your interest and mine are very similar when it comes to “online handwriting recognition”.
I know some research papers, so if you don’t know them, the ones below may help:
A study of online handwriting recognition based on hidden Markov models (in Japanese)
www.jaist.ac.jp/library/thesis/is-master-2002/paper/sudotaka/paper.pdf
Online Handwritten Kanji String Recognition Based on Grammar Description of Character Structures (in Japanese)
hil.t.u-tokyo.ac.jp/publications/download.php?bib=Ota2008PRMU02.pdf
The research is being carried out at Sagayama Laboratory, University of Tokyo.
http://hil.t.u-tokyo.ac.jp/index-j.html (in Japanese)
By the way, I’m an office worker in Tokyo.
You are now living near Tokyo (Yokohama?).
How about talking face to face?
July 9th, 2008 at 12:18 pm
Actually my interest is not limited to the Maemo platform. But I think it is important to keep in mind that embedded systems are an important target in order to write algorithms with good performance.
I’m indeed living near Tokyo. You can contact me by private email. My email is mathieu at this domain.