Archive for the 'Natural Language Processing' Category

HCCR using AJAX and SVM

Tuesday, September 9th, 2008

I came across this impressive page (in Japanese), which shows a demonstration of Handwritten Chinese Character Recognition using AJAX for the user interface and Support Vector Machines for the training algorithm.

Looking at the JavaScript code, I was surprised to see that, unlike my web canvas, it doesn’t use the <canvas> tag! It simply uses a combination of JavaScript and CSS. Even though it has a few quirks, the interface is quite responsive.

The recognition process itself happens on the server side but thanks to the use of AJAX, the results are displayed very smoothly, without the need to refresh the page.

Taku Kudo, the author, explains on the page that he’s using the handwriting data from Tomoe. However, since Tomoe uses a template-based algorithm, it only has one handwriting sample per character. I’m impressed that Taku Kudo can train his system with only one sample per character. Overall, the accuracy is not very impressive but I think it could improve a lot with more training samples. That’s why my handwriting database project is going to be very useful. I’ve been wanting to try SVMs in addition to HMMs, so the fact that this project uses SVMs confirmed my interest in them.

Taku Kudo’s page has neat stuff regarding Natural Language Processing and Machine Learning. He has published a lot of libraries as free software, including MeCab and TinySVM. If you like fancy stuff in AJAX, you’ll also like his Japanese Input Method.

Character encoding detection

Sunday, August 17th, 2008

Two years ago, I wrote about a port to Ruby of Universal Encoding Detector, which is itself a port to Python of Mozilla’s character encoding detection algorithm.

Having recently become interested in Machine Learning, I read about naive Bayes classifiers. I then remembered the encoding detector program and thought that naive Bayes classifiers would be a good candidate for this kind of problem. Going back to the Universal Encoding Detector’s home page, I found a link to:

A composite approach to language/encoding detection

This is an interesting read. Here’s a summary. The algorithm used is a composite approach of 3 different methods:

Coding scheme method

A state machine is used to identify illegal code points (in which case we can remove the encoding from the search space) or code points that only exist in one encoding (in which case we can immediately identify the encoding). This method works well for many multi-byte encodings, though not for all of them. It is not suited to single-byte encodings, because unused code points in one encoding are very seldom used in other encodings.
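To make the elimination idea concrete, here is a minimal sketch in Python. It is not the detector's actual code: the function names utf8_is_legal and remaining_candidates are made up for this illustration, and the UTF-8 check is deliberately simplified (it does not reject overlong forms or surrogates).

```python
def utf8_is_legal(data: bytes) -> bool:
    """Tiny state machine: False as soon as a byte is illegal in UTF-8.

    Simplified on purpose: overlong forms and surrogates are not rejected.
    """
    pending = 0  # number of continuation bytes still expected
    for byte in data:
        if pending:
            if byte & 0xC0 != 0x80:  # a continuation byte must look like 10xxxxxx
                return False
            pending -= 1
        elif byte < 0x80:            # ASCII, single-byte character
            pass
        elif byte & 0xE0 == 0xC0:    # 110xxxxx starts a 2-byte sequence
            pending = 1
        elif byte & 0xF0 == 0xE0:    # 1110xxxx starts a 3-byte sequence
            pending = 2
        elif byte & 0xF8 == 0xF0:    # 11110xxx starts a 4-byte sequence
            pending = 3
        else:                        # stray continuation byte or invalid lead byte
            return False
    return pending == 0              # the input must not end mid-character


def remaining_candidates(data: bytes, checkers: dict) -> list:
    """Keep only the encodings whose checker found no illegal sequence."""
    return [name for name, is_legal in checkers.items() if is_legal(data)]
```

A real detector would have one such machine per candidate encoding and would feed all of them the same bytes, dropping a candidate as soon as its machine reports an error.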

Character distribution method

For languages with many different characters, like Chinese, Japanese and Korean, the character distribution is used. The paper has interesting figures about this.

Number of Most Frequent Characters   Accumulated Percentage
10                                   0.11723
64                                   0.31983
128                                  0.45298
256                                  0.61872
512                                  0.79135
1024                                 0.92260
2048                                 0.98505
4096                                 0.99929
6763                                 1.00000

Character distribution in Simplified Chinese

Number of Most Frequent Characters   Accumulated Percentage
10                                   0.27098
64                                   0.66722
128                                  0.77094
256                                  0.85710
512                                  0.92635
1024                                 0.97130
2048                                 0.99431
4096                                 0.99981
                                     1.00000

Character distribution in Japanese

In other words, even though Chinese and Japanese include literally thousands of characters, the 512 most frequent characters account for 79% and 93% of all character occurrences, respectively. This is extremely interesting! This means that we can use the frequency of occurrence of code points to detect encodings.
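For readers who want to reproduce this kind of figure on their own corpus, the accumulated percentage can be computed along these lines. This is only a sketch: the function name accumulated_percentage is mine, and the paper does not say how its numbers were produced.

```python
from collections import Counter


def accumulated_percentage(text: str, top_n: int) -> float:
    """Share of all character occurrences covered by the top_n most frequent characters."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty corpus: nothing to cover
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total
```

Calling it with top_n set to 10, 64, 128 and so on over a large corpus should give a curve of the same shape as the tables above.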

Although a naive Bayes classifier could have been used for this, a much simpler approach was taken. Characters were divided into two categories: frequent and non-frequent. A character is considered frequent if it is among the 512 most frequent characters. For a given file whose encoding is unknown, we define a distribution ratio r_file:

n_freq = number of frequent characters in the file
n_total = total number of characters in the file
r_file = n_freq / (n_total - n_freq)

We compare r_file with the average distribution ratio for a given encoding, r_encoding. This comparison, in the form of the ratio of the two distribution ratios, gives us our confidence level.

confidence_level = r_file / r_encoding

We calculate the confidence level for each encoding by replacing r_encoding with the corresponding value and the highest confidence level gives us the most likely encoding.

This technique requires a training set, that is, files for which we know the encoding. First, this allows us to find the frequency of each character. Second, it allows us to determine the average distribution ratio r_encoding for each encoding.

n_freq = number of frequent characters in the files from the training set with a given encoding
n_total = total number of characters in the same files
r_encoding = n_freq / (n_total - n_freq)
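The formulas above translate almost directly into Python. The sketch below is my own reading of them, not code from the paper or from the Universal Encoding Detector. The names train, distribution_ratio and most_likely_encoding are hypothetical, and decoding the file under each candidate encoding while ignoring errors is an assumption made to keep the example short.

```python
from collections import Counter


def distribution_ratio(text: str, frequent: set) -> float:
    """n_freq / (n_total - n_freq), as defined in the formulas above."""
    n_total = len(text)
    n_freq = sum(1 for ch in text if ch in frequent)
    if n_total == 0:
        return 0.0            # nothing to count
    if n_total == n_freq:
        return float("inf")   # every character is a frequent one
    return n_freq / (n_total - n_freq)


def train(training_text: str, top_n: int = 512):
    """Return (frequent characters, r_encoding) for one encoding's training text.

    The corpus needs more than top_n distinct characters for a finite ratio.
    """
    frequent = {ch for ch, _ in Counter(training_text).most_common(top_n)}
    return frequent, distribution_ratio(training_text, frequent)


def most_likely_encoding(data: bytes, models: dict) -> str:
    """models maps an encoding name to its (frequent set, r_encoding) pair."""
    scores = {}
    for name, (frequent, r_encoding) in models.items():
        # Assumption: decode under the candidate encoding, skipping illegal bytes.
        text = data.decode(name, errors="ignore")
        scores[name] = distribution_ratio(text, frequent) / r_encoding
    return max(scores, key=scores.get)
```

In practice each encoding would get its own model built from a large training set; the tiny ratios used when trying this out by hand are only there to check the arithmetic.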

2-character sequence distribution method

For character sets with few different characters, such as those of Western languages, a simple character distribution is not enough to discriminate among encodings, so a distribution of sequences of 2 characters is used instead.
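The post does not go into how the 2-character sequences are scored, so the following is only an illustration of the counting step. The function names and the idea of measuring the share of frequent pairs are assumptions on my part.

```python
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count every pair of adjacent characters in the text."""
    return Counter(zip(text, text[1:]))


def frequent_bigram_share(text: str, frequent_bigrams: set) -> float:
    """Fraction of adjacent pairs in the text that belong to a set of frequent pairs."""
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0  # fewer than 2 characters, no pair to score
    hits = sum(1 for pair in pairs if pair in frequent_bigrams)
    return hits / len(pairs)
```

The set of frequent pairs would come from training text decoded with a known encoding, in the same spirit as the frequent characters of the previous method.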

Accuracy

100% accuracy was achieved when the detector was applied to the home pages of 100 popular international web sites, according to the paper.

Some news and pictures of Japan

Sunday, May 25th, 2008

I’ve been living in Japan for 6 months already. I must say, wow, time flies very very fast. I’m living abroad, so there’s always something new to do or to see. I don’t have much time for myself or personal projects. But I have (very) slowly been working on some experiments on handwritten Chinese character recognition. Stay tuned.

I will graduate from my engineering school in June (masters), so lately I have been thinking about what I want to do in the future. As I said, time goes very fast. It’s thus very important to find fulfillment during the 8 hours that I spend at work, whatever the work may be. So far, I’m pretty sure that I don’t want to work as a software developer for a private company. I want to do something more challenging and connected with research. Most importantly, I want to LEARN new things every day. Options I’m considering are creating my own company, becoming a research engineer in a private lab or starting a PhD. I’m glad because I think I have narrowed down the fields I would like to work in. I would like to work in the fields of Machine Learning and Pattern Recognition, especially handwriting recognition, natural language processing (NLP) and speech recognition.

I uploaded some of my pictures of Japan. I still have many others that I need to sort out but if you are interested, take a look.