Tweeter button

Archive for the ‘Handwriting Recognition’ Category

Dynamic Time Warping : theory

Monday, August 31st, 2009

Recently, I’ve been working on a new handwriting recognition engine for Tegaki based on Dynamic Time Warping and I figured it would be interesting to make a short, informal introduction to it.

Dynamic Time Warping (DTW) is a well-known algorithm which aims at comparing and aligning two sequences of data points (a.k.a time series). Although it was originally developed for speech recognition (see [1]), it has also been applied to many other fields like bioinformatics, econometrics and, of course, handwriting recognition.

Consider two sequences A and B, composed respectively of n and m feature vectors.

Each feature vector is d-dimensional and can thus be represented as a point in a d-dimensional space. For example, in handwriting recognition, we could directly use the raw (x,y) coordinates of the pen movement and that would make us sequences of 2-dimensional vectors. In practice however, one would extract more useful features from (x,y) and create vectors of dimension possibly greater than 2. It’s also worth noting that the sequences A and B can be of different length.

Time warping

DTW works by warping (hence the name) the time axis iteratively until an optimal match between the two sequences is found.

In the figure above, which is an example of two sequences of data points with only 1 dimension, the time axis is warped so that each data point in the green sequence is optimally aligned to a point in the blue sequence.

Best path

We can construct a n x m distance matrix. In this matrix, each cell (i,j) represents the distance between the i-th element of sequence A and the j-th element of sequence B. The distance metric used depends on the application but a common metric is the euclidean distance.

Finding the best alignment between two sequences can be seen as finding the shortest path to go from the bottom-left cell to the top-right cell of that matrix. The length of a path is simply the sum of all the cells that were visited along that path. The further away the optimal path wanders from the diagonal, the more the two sequences need to be warped to match together.

The brute force approach to finding the shortest path would be to try each path one by one and finally select the shortest one. However it’s apparent that it would result in an explosion of paths to explore, especially if the two sequences are long. To solve this problem, DTW uses two things: constraints and dynamic programming.

Constraints

DTW can impose several kinds of reasonable constraints, to limit the number of paths to explore.

  • Monotonicity: The alignment path doesn’t go back in time index. This guarantees that features are not repeated in the alignment.
  • Continuity: The alignment doesn’t jump in time index. This guarantees that important features are not omitted.
  • Boundary: The alignment starts at the bottom-left and ends at the top-right. This guarantees that the sequences are not considered only partially.
  • Warping window: A good alignment path is unlikely to wander too far from the diagonal. This guarantees that the alignment doesn’t try to skip different features or get stuck at similar features.
  • Shape: Aligned paths shouldn’t be too steep or too shallow. This prevents short sequences to be aligned with long ones.

These constraints are best visualized in [3].

Dynamic Programming

Taking advantage of such constraints, DTW uses dynamic programming to find the best alignment in a recursive way. Previously, the cell (i,j) of the distance matrix was defined as “the distance between the i-th element of sequence A and the j-th element of sequence B”. In the dynamic programming way of thinking, this definition is changed, and instead, the cell (i,j) is defined as the length of the shortest path up to that cell. Assuming local constraints like below,

it allows us to define the cell (i,j) recursively:

cell(i,j) = local_distance(i,j) + MIN(cell(i-1,j), cell(i-1,j-1), cell(i, j-1))

Here, recursively means that the shortest path up to the cell (i,j) is defined in terms of the shortest path up to the adjacent cells. A lot of different local constraints can be defined (see this table) and thus there are many variations in the way DTW can be implemented.

DTW as a distance metric

Once the algorithm has reached the top-right cell, we can use backtracking in order to retrieve the best alignment. If we’re just interested in comparing the two sequences however, then the top-right cell of the matrix just happens to be the length of the shortest path. We can therefore use the value stored in this cell as the distance between the two sequences. DTW has the nice property to be symmetric so DTW(a,b) = DTW(b,a). Also, DTW doesn’t fulfill the triangle inequality but it isn’t a problem in practice.

Related algorithms

DTW looks almost identical to the Levenshtein algorithm, an algorithm to compare strings, and is very similar to the Smith-Waterman algorithm, an algorithm for sequence alignment.

References

[1] Sakoe, H. and Chiba, S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1) pp. 43- 49, 1978

[2] DTW algorithm @ GenTχWarper

[3] PowerPoint presentation by Elena Tsiporkova

Tegaki 0.2 released

Monday, July 20th, 2009

I released Tegaki 0.2. From the Tegaki website:

Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, specifically designed for Chinese (simplified and traditional) and Japanese, and that is suitable for both the desktop and mobile devices.

This release brings a lot of improvements and bug fixes so it’s recommended to upgrade.

You can read the announcement on the Tegaki discussion group here.

There’s also preliminary Maemo support. You can download the Debian package here. You need to install python2.5-gtk2 from the extras repository. Models (.model and .meta files) can be installed in /media/mmc1/tegaki/models/zinnia/ or /media/mmc2/tegaki/models/zinnia/.

Tegaki Project discussion group

Thursday, June 25th, 2009

I’ve created a Google Group for the Tegaki Project. The advantage of Google Groups is that they can be used both like a mailing-list or like a message-board, depending on everyone’s preferences. The group is for both user and developer discussions. Feel free to join!

http://groups.google.com/group/tegaki-hwr

Using internet users to do something useful

Saturday, June 13th, 2009

I found an insightful and entertaining talk that I recommend you to watch. It’s about CAPTCHA in general and more specifically about ReCAPTCHA.

CAPTCHA and ReCAPTCHA

The idea of CAPTCHA is to ask users to answer a question that only humans can answer in order to check whether they are a human or a computer program. For example, in order to prevent massive creation of email accounts by computer programs, most email services ask new subscribers to write the text present in a image. The image is distorted so that it is not possible for a computer program to recognize it but it is still readable by a human. CAPTCHA is now commonly used by all kinds of websites and so millions of users are reading CAPTCHA everyday…

The basic idea of ReCAPTCHA is to use the manpower of people writing CAPTCHA in order to help digitalize old books. Although OCR (Optical Character Recognition) software is pretty efficient for printed documents, the error rate can jump quite a lot for old books due to their poor quality. So by definition, words extracted from old books are in themselves good candidates for CAPTCHA. However, extracted words are further distorted in order to make them look like real CAPTCHA.

In ReCAPTCHA, users are presented with two words to identify. For one word, we know the answer already so it is used to check whether the user is a human or not. For the other word, we want to know the answer (in order to digitalize the book). In other words, users are asked to write two words even though only one would have suffice to fulfill the purpose of the CAPTCHA. However, measurements show that on average it takes equally long to write two plain English words and one word composed of 6-8 random characters. So ReCAPTCHA is an interesting way to fulfill the CAPTCHA’s goal while doing something useful at the same time, which is nice knowing that millions of users are reading CAPTCHAs everyday.

I though ReCAPTCHA is a pretty clever idea. It’s actually an example of human-based computation — that is, the combined use of human and computer for a task that wouldn’t have been possible for one or the other alone.

Genetic art

An idea I had sometime ago (but this has already been done), which is an example of human-based computation, is to use genetic programming in order to automatically generate art. Because it is difficult for a computer alone to tell whether a certain piece of art is beautiful or not, the idea is to let a genetic algorithm create art and use humans in order to evaluate the beauty of art at each iteration of the genetic algorithm (i.e. as a fitness function).

Handwriting sample database

Another related idea is the idea of games with a purpose. This is basically what I want to do with the Chinese handwriting sample database that I mentioned several times already in this journal. The basic idea is to make it attractive for people to write handwriting samples of Chinese characters by letting them play educational games.

First release of Tegaki

Wednesday, February 11th, 2009

Today I’m releasing Tegaki 0.1. Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, that is suitable for both the desktop and mobile devices, and that is designed from the ground up to work well with Chinese and Japanese.

Screencast video: ogg or youtube.

This release features desktop and SCIM integration. However, the main “innovation” brought to you by this release is the user interface. It uses two drawing areas for continuous writing. The user can eventually fix recognition errors by choosing alternative candidates or editing characters. Since a video is worth a thousand words, see the screencast above. This interface is largely inspired from the Nintendo DS game “Kanji Sono Mama Rakubiki Jiten” (漢字そのまま楽引辞典).

Tegaki is designed to be able to use several recognition engines. However so far it only supports Zinnia, which is the only recognition engine that I know with acceptable recognition accuracy and good performance on mobile devices. One challenge of the project in the future will be to create a new recognition engine that can yield better results than Zinnia.

A take that I have on this project is to use Python whenever this is possible and only use C or C++ when performance is critical, like in recognition engines. Compared to Tomoe, which implements everything in C and provides bindings for several languages, this means less reusability of the components but I hope this will make the project go forward faster.

There are still a lot of things that can be done in various areas but I really wanted to release the code I’ve put together so far because I think it can already be useful to end-users. By the way, Maemo supports both pygtk and SCIM through third-party projects, thus Tegaki is just a few Debian packages away from being available on Maemo.

For further details:
http://tegaki.sourceforge.net/

Zinnia

Saturday, September 20th, 2008

In my last post, I was writing about this impressive Chinese character recognition demo using AJAX on the client side and Support Vector Machines (SVM) on the server side, for the recognition process. Well, I don’t know if it’s just a coincidence (this demo was from 2 years ago) but Taku Kudo released last week the backend he’s using as free software. Needless to say that this was awesome news for me! I know the basic principle of SVM but time to learn more about it I guess…

His project, called Zinnia, has been rewritten from scratch to be more flexible and reusable. Models for Japanese and Chinese are included but models for other languages can be built easily provided that you have training data. I’m pretty sure that this package could also be useful for Gesture Recognition because it’s so close to Handwriting Recognition…

For the sake of comparison, I wanted to evaluate how Zinnia performs compared to both Tomoe and my own HMM experiment. I used the same evaluation corpus as I wrote about in earlier posts, that is two sets of 50 kanjis written by a Japanese friend of mine and me. The characters have the correct stroke order and were drawn carefully. Therefore, the results below indicate how the different recognizers perform in ideal conditions and don’t indicate how robust they would be in more difficult conditions.

Tomoe - Zinnia

  Tomoe Zinnia
1st match accuracy 61% 77%
5 matches accuracy 74% 92%
10 matches accuracy 74% 93%
Recognition time 21 / 100 = 0.21 s 3 / 100 = 0.03 s
Total number of kanji 3000 3000

1st match accuracy is the percentage of characters that were recognized as first match.
5 matches accuracy is the percentage of characters that were recognized in the first 5 matches.

You can download my evaluation script for Zinnia here. Tomoe’s evaluation script is sitting in Tomoe’ SVN, in the benchmark/ folder.

A few remarks:
- Zinnia is notably better than Tomoe in terms of accuracy
- Zinnia is about 7 times faster than Tomoe, making it a good candidate for an embedded platform
- In both cases, 5 matches and 10 matches accuracy are about the same, meaning that it would be enough for the user interface to display the first 5 matches only.

Project Tegaki - Zinnia

Due to lack of training data, my personal HMM experiment (project Tegaki) was only conducted over a set of 50 characters. However, Zinnia supports over 3000 characters. For fair comparison, I thus created new models for Zinnia using the same training data as I used for my experiment.

Zinnia was trained with only one sample per character, using the same data as Tomoe, which is template-based. While SVM seems to be able to cope with only one sample per character, it’s a little bit more complicated to do that with HMM because of the need to find the parameters of the Observation Probability Density Function (e.g. mean and variance for a Gaussian).

  Project Tegaki Zinnia
1st match accuracy 92% 100%
5 matches accuracy 100% 100%
10 matches accuracy 100% 100%
Recognition time 14 / 100 = 0.14 s 1.50 / 100 = 0.015 s
Total number of kanji 50 50

A few remarks:
- My experiment is slow, which is probably due to the fact that I’m using Character-level models. Stroke-level models are known to scale much better.
- My experiment has slightly worse accuracy, which is probably because I’m only using two features per observation.

Handwriting database

If you follow my adventures in the world of handwritten Chinese character recognition, you probably know that I’m planning to create a handwriting database website. This database will aim to 1) make it easy and attractive for people to contribute their handwriting samples and 2) make it easy for the database staff to manage and organize what is supposed to become a large collection of handwriting samples.

The database will use a client/server architecture. So far I’m thinking of four important clients:
- A client that people will be able to use directly in their web browser, using my web canvas
- A client for the Maemo platform
- A client for the Iphone
- A multi-platform client for the Destkop

A client of slightly lesser priority would be a Facebook application.

The handwriting samples collected will be distributed in free software license. For projects like Zinnia or Project Tegaki, this will mean more training data and more means to evaluate the performance. I consider this database one of my priorities among my free software projects but it’s going to be quite hard for me to find time for that before December…

Contribute

As always, more people are welcome to contribute.

To download the source code of my work,

$ git clone http://www.mblondel.org/code/hwr.git

web interface

HCCR using AJAX and SVM

Tuesday, September 9th, 2008

I came across this impressive page (in Japanese), which shows a demonstration of Handwritten Chinese Character Recognition using AJAX for the user interface and Support Vector Machines for the training algorithm.

Looking at the Javascript code, I was surprised to see that, unlike my web canvas, it doesn’t use the <canvas> tag! It simply uses a combination of Javascript and CSS. Even though it has a few quirks, the interface is quite responsive.

The recognition process itself happens on the server side but thanks to the use of AJAX, the results are displayed very smoothly, without the need to refresh the page.

Taku Kudo, the author, explains in the page that he’s using the handwriting data from Tomoe. However since Tomoe uses a template-based algorithm, it only has one handwriting sample per character. I’m impressed that Taku Kudo can train his system only with one sample per character. Overall, the accuracy is not very impressive but I think it could improve a lot with more training samples. That’s why my handwriting database project is going to be very useful. I’ve been willing to try SVM in addition to HMMs so the fact that this project uses SVM confirmed my interest for it.

Taku Kudo’s page has neat stuff regarding Natural Language Processing and Machine Learning. He published a lot of libraries as free software, including Mecab and TinySVM. If you like fancy stuff in AJAX, you’ll also like his Japanese Input Method.

Interactive whiteboard with a Wiimote

Tuesday, September 9th, 2008

If you haven’t seen all the cool stuff that Johnny Chung Lee is doing with a Wiimote, you should definitely take a look at his projects. You’ll see that Johnny is not only a good engineer but he’s also very good at making his projects attractive.

He has a video on how to make a cheap interactive whiteboard or how to turn your laptop into a tablet pc, with a Wiimote. This could be pretty handy for using Chinese character recognition!

Web Canvas

Friday, August 1st, 2008

In my last post, I was calling for contributors to write a web canvas using the <canvas> tag. If you don’t know it, <canvas> is a new tag specified in HTML5 which allows you to draw using a Javascript API. It is already supported in Firefox, Opera, Safari and is supported in Internet Explorer through a third-party Javascript.

Since nobody responded to my call (sic), I decided to tackle it by myself. It turns out that it was a nice little project. The canvas Javascript API is very similar to the cairo API so it was easy to use. I also improved my level in Javascript a lot. So far the web canvas supports draw, import (JSON), export (XML), save as an image and replay (stroke by stroke animation).

You can try it by using the online DEMO.

What can it be useful for?

- I’m planning to use it for the handwriting database website that I wrote about some time ago. While it will be possible to contribute your handwriting using a pygtk client (Desktop or Maemo), you will also be able to contribute your handwriting using your browser directly. Not having to install any program should help increase the number of people contributing their handwriting.

- A second way of using it would be to do handwriting recognition directly in the browser. For example, one could install Tomoe (or my recognizer when it’s ready ;-)) on the server side and the web canvas on the client side. Since Tomoe has Python and Ruby bindings, this is fairly easy!

You can reuse the web canvas for your own projects if you like but I would appreciate if you could send me any feature improvement. In particular, the web canvas has a bug under Internet Explorer that I couldn’t figure out…

Source code (GPL) : gitweb

Handwriting renderers

Sunday, July 13th, 2008

Canvas

If you didn’t read my previous post, for short, project Tegaki is a framework for handwritten Chinese character recognition (HCCR) written in Python. It includes reusable components and is a placeholder for experimentation. The goal is to create the next-generation open-source HCCR software but it may be useful for academic researchers as well.

One reusable component is the Canvas. This is the user interface component that allows to draw characters. In addition, the Canvas supports “replaying” the character (stroke by stroke animation) and setting a background model (to help users draw an unknown character). It is multi-platform.

The Canvas
Example of a character drawn using the Canvas provided by libtegaki-gtk

The Canvas has a get_writing() method. It allows to retrieve the Writing object for the handwriting currently displayed in the Canvas.

XML representation

The Writing object supports reading from and writing to an XML file. The XML file can optionally be compressed using gzip or bz2. On my hard drive, I have a small set of handwriting samples. 500 characters take about 10 MB. That’s why compression is very useful.

The XML representation of a handwriting sample looks like that.

<character>
  <utf8>無</utf8>
  <strokes>
    <stroke>
      <point x="306" y="163" timestamp="0" />
      <point x="303" y="163" timestamp="21" />
      <point x="303" y="166" timestamp="29" />
      [...]
    </stroke>
    <stroke>
      <point x=”266″ y=”240″ timestamp=”912″ />
      <point x=”270″ y=”240″ timestamp=”917″ />
      <point x=”273″ y=”240″ timestamp=”925″ />
      [...]
    </stroke>
    [...]
  </strokes>
</character>

Renderers

I’ve recently added support for what I named “renderers”. They take a Writing object as parameter and generate a visual representation of it. Since I used the cairo graphics library as drawing backend, the representation can be saved to PNG, SVG and PDF! Those renderers will be very useful for the handwriting database website that I wrote about in my previous post!

Complete character renderer

Kanji

Stroke order renderer

Kanji
Stroke order with each single stroke

Kanji
Stroke order with stroke groups

Strokes can be grouped together when the stroke order is obvious. However, this requires to know which strokes to combine together. A dictionary must be created for that. A entry example would be:

駅 1,1,3,1,4,2,2

<canvas> HTML tag

The canvas I was writing about above is written in pygtk and is intended to be used for the Desktop or for Maemo. However, in the case of the handwriting database website, since we want as many people to contribute their handwriting as possible, it would be nice to not require any particular installation. For that, a canvas directly in the browser would be the ideal solution.

One solution would be to use Flash but I would prefer to use the <canvas> tag. It can be used in combination with Javascript to do drawing in the browser. It is supported natively by Firefox, Opera and Safari. It is supported in Internet Explorer through a third-party Javascript called ExplorerCanvas.

I am looking for a contributor to create a new canvas using this technology. The canvas should support drawing, displaying existing handwriting and replay (stroke by stroke animation).

For more information:

GIF stroke animation

Even though GIF uses a patented compression, GIF is still the only format with support for animations and wide support in the browsers. Therefore it would be very cool to be able to generate GIF stroke animations from a writing object.

I had a look at python-imagemagick and Python Imaging Library (PIL) but they both seem to have very limited support for GIF animations. So I’m thinking of writing my own library for GIF generation in Python. Byzanz, a software to create screencasts as GIF animations, can be used as inspiration because it includes a GIF encoder. It also supports color quantization (using octrees) and dithering. From what I see, it should take less than 1000 lines of Python code.

I read a little bit about color quantization. I found it very interesting. Here’s a short explanation about color quantization for those who don’t know about it. Basically, each pixel in an image may have three components Red Blue Green. For a 400×400 picture, this is about 400*400*3=480KB. To gain space, an idea is to store colors in a palette (a table index => color). Then each pixel only needs to refer to the index in the palette instead of having to define the three components. For a 256-color palette, this saves two bytes for each pixel. However, since we now use 256 colors only instead of 256 * 256 * 256 = 16,777,216 colors, there’s a color precision loss. The challenge is thus to find what colors to put in the palette to have the smallest precision loss possible. For example, we may want to put in the palette colors that are the closest to the most frequently used colors. This is a 3-dimensional clustering problem, thus it reminded me of Machine Learning, a topic in which I’ve been very interested recently.

For more information, I recommend the reading of those Wikipedia articles: