Archive for August, 2008

Character encoding detection

Sunday, August 17th, 2008

Two years ago, I wrote about a port to Ruby of Universal Encoding Detector, which is itself a port to Python of Mozilla’s character encoding detection algorithm.

Recently being interested in Machine Learning, I read about naive Bayes classifiers. I then remembered the encoding detector program and thought that naive Bayes classifiers would be a good candidate for this kind of problem. Going back to the Universal Encoding Detector’s home page, I found a link to:

A composite approach to language/encoding detection

This is an interesting read. Here’s a summary. The algorithm used is a composite approach of 3 different methods:

Coding scheme method

A state machine is used to identify illegal code points (in which case we can remove the encoding from the search space) or code points that only exist in one encoding (in which case we can immediately identify the encoding). This method is good for multi-byte encodings but not for all and is not good for single-byte encodings because unused code points in one encoding are very seldom used in other encodings.

Character distribution method

For languages with many different characters like Chinese, Japanese and Korean, use the Character distribution. The paper has interesting figures about this.

Number of Most
Frequent Characters
Accumulated Percentage
10 0.11723
64

0.31983
128 0.45298

256 0.61872
512 0.79135
1024

0.92260
2048 0.98505

4096 0.99929
6763 1.00000

Character distribution in Simplified Chinese

Number of Most
Frequent Characters
Accumulated Percentage
10 0.27098
64 0.66722
128

0.77094
256 0.85710

512 0.92635
1024 0.97130
2048

0.99431
4096 0.99981

1.00000

Character distribution in Japanese

In other words, even though Chinese and Japanese include literally thousands of characters, the 512 most frequent characters represent respectively 79% and 93% of the whole character distribution. This is extremely interesting! This means that we can use the frequency of occurence of code points to detect encodings.

Although a naive Bayes classifier could have been used for this, a much simpler approach was taken. Characters were divided into two categories: frequent and non-frequent characters. Are considered frequent the characters that fall into the first 512 most frequent characters. For a given file the encoding of which is unknown, we define a distribution ratio r_file:

n_freq = number of frequent characters in the file
n_total = total number of characters in the file
r_file = n_freq / (n_total - n_freq)

We compare r_file with the average distribution ratio for a given encoding, r_encoding. This comparison, in the form of the ratio of the two distribution ratios, gives us our confidence level.

confidence_level = r_file / r_encoding

We calculate the confidence level for each encoding by replacing r_encoding with the corresponding value and the highest confidence level gives us the most likely encoding.

This technique requires a training set, that is, files for which we know the encoding. First of all this allows us to find the character frequency for each character. Second of all, it allows us to determine the average distribution ratio r_encoding for each encoding.

n_freq = number of frequent characters in the files from the training set with a given encoding
n_total = total number of characters in the same files
r_encoding = n_freq / (n_total - n_freq)

2-character sequence distribution method

For character sets with few different characters such as in western languages, a simple character distribution is not enough to discriminate among encodings so a distribution of sequences of 2 characters is used instead.

Accuracy

100% accuracy was achieved when the detector was applied to the home pages of 100 popular international web sites, according to the paper.

Web Canvas

Friday, August 1st, 2008

In my last post, I was calling for contributors to write a web canvas using the <canvas> tag. If you don’t know it, <canvas> is a new tag specified in HTML5 which allows you to draw using a Javascript API. It is already supported in Firefox, Opera, Safari and is supported in Internet Explorer through a third-party Javascript.

Since nobody responded to my call (sic), I decided to tackle it by myself. It turns out that it was a nice little project. The canvas Javascript API is very similar to the cairo API so it was easy to use. I also improved my level in Javascript a lot. So far the web canvas supports draw, import (JSON), export (XML), save as an image and replay (stroke by stroke animation).

You can try it by using the online DEMO.

What can it be useful for?

- I’m planning to use it for the handwriting database website that I wrote about some time ago. While it will be possible to contribute your handwriting using a pygtk client (Desktop or Maemo), you will also be able to contribute your handwriting using your browser directly. Not having to install any program should help increase the number of people contributing their handwriting.

- A second way of using it would be to do handwriting recognition directly in the browser. For example, one could install Tomoe (or my recognizer when it’s ready ;-)) on the server side and the web canvas on the client side. Since Tomoe has Python and Ruby bindings, this is fairly easy!

You can reuse the web canvas for your own projects if you like but I would appreciate if you could send me any feature improvement. In particular, the web canvas has a bug under Internet Explorer that I couldn’t figure out…

Source code (GPL) : gitweb