Character encoding detection
Two years ago, I wrote about a port to Ruby of Universal Encoding Detector, which is itself a port to Python of Mozilla’s character encoding detection algorithm.
Having recently become interested in Machine Learning, I read about naive Bayes classifiers. I then remembered the encoding detector program and thought that a naive Bayes classifier would be a good fit for this kind of problem. Going back to the Universal Encoding Detector’s home page, I found a link to:
A composite approach to language/encoding detection
This is an interesting read. Here’s a summary. The algorithm is a composite approach combining three different methods:
Coding scheme method
A state machine is used to identify illegal code points (in which case we can remove the encoding from the search space) or code points that only exist in one encoding (in which case we can immediately identify that encoding). This method works well for many, though not all, multi-byte encodings. It is of little use for single-byte encodings, because a code point that is unused in one single-byte encoding is very seldom used in the others either, so illegal code points almost never show up to rule encodings out.
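Here is a minimal Python sketch of the idea for one encoding. Mozilla’s detector ships a table-driven state machine per encoding; this hand-rolled UTF-8 verifier is a simplification of mine for illustration, and it deliberately ignores a few corner cases (overlong 3- and 4-byte forms, surrogates, truncated trailing sequences).

```python
ERROR, LEGAL = "error", "legal-so-far"

def utf8_verifier(data: bytes) -> str:
    """Feed bytes through a tiny state machine; return ERROR as soon as
    an illegal UTF-8 sequence is seen, LEGAL otherwise."""
    pending = 0  # continuation bytes still expected
    for byte in data:
        if pending:
            if 0x80 <= byte <= 0xBF:    # valid continuation byte
                pending -= 1
            else:
                return ERROR
        elif byte <= 0x7F:              # ASCII, always legal in UTF-8
            continue
        elif 0xC2 <= byte <= 0xDF:      # lead byte of a 2-byte sequence
            pending = 1
        elif 0xE0 <= byte <= 0xEF:      # lead byte of a 3-byte sequence
            pending = 2
        elif 0xF0 <= byte <= 0xF4:      # lead byte of a 4-byte sequence
            pending = 3
        else:                           # 0x80-0xC1 and 0xF5-0xFF never start a sequence
            return ERROR
    return LEGAL

# Pruning the search space: any encoding whose verifier reports an
# illegal sequence is eliminated from the candidates.
candidates = {"UTF-8": utf8_verifier}  # one verifier per candidate encoding
data = "naïve".encode("latin-1")       # 0xEF looks like a UTF-8 lead byte, but "ve" aren't continuations
candidates = {name: v for name, v in candidates.items() if v(data) != ERROR}
```

Running this leaves `candidates` empty: the Latin-1 bytes contain a sequence that is illegal in UTF-8, so UTF-8 is ruled out.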
Character distribution method
For languages with many different characters, like Chinese, Japanese and Korean, the character distribution is used. The paper has interesting figures about this:
[Figure: character distribution in Simplified Chinese]
[Figure: character distribution in Japanese]
In other words, even though Chinese and Japanese include literally thousands of characters, the 512 most frequent characters represent respectively 79% and 93% of the whole character distribution. This is extremely interesting! It means that we can use the frequency of occurrence of code points to detect encodings.
Although a naive Bayes classifier could have been used for this, a much simpler approach was taken. Characters are divided into two categories: frequent and non-frequent. A character is considered frequent if it is among the 512 most frequent characters. For a file whose encoding is unknown, we define a distribution ratio r_file:
n_freq = number of frequent characters in the file
n_total = total number of characters in the file
r_file = n_freq / (n_total – n_freq)
We compare r_file with the average distribution ratio for a given encoding, r_encoding. This comparison, in the form of the ratio of the two distribution ratios, gives us our confidence level.
confidence_level = r_file / r_encoding
We calculate the confidence level for each encoding by plugging in the corresponding value of r_encoding; the highest confidence level gives us the most likely encoding.
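A small Python sketch of this scoring step. Here `frequent_set` (the 512 most frequent characters) and `r_encoding` come from the training step described next; all the names are mine, not the paper’s.

```python
from collections import Counter

def distribution_ratio(text, frequent_set):
    """r = n_freq / (n_total - n_freq) over the characters of `text`."""
    counts = Counter(text)
    n_total = sum(counts.values())
    n_freq = sum(n for ch, n in counts.items() if ch in frequent_set)
    # Guard of mine: avoid dividing by zero when every character is frequent.
    return n_freq / max(n_total - n_freq, 1)

def confidence_level(text, frequent_set, r_encoding):
    """confidence_level = r_file / r_encoding; I cap it at 1.0, since a
    ratio above the training average is already a perfect match."""
    return min(distribution_ratio(text, frequent_set) / r_encoding, 1.0)
```

Decoding the raw bytes with each candidate encoding and keeping the candidate with the highest confidence level reproduces the decision rule above.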
This technique requires a training set, that is, files for which we know the encoding. First, this allows us to find the frequency of each character, and thus which characters count as frequent. Second, it allows us to determine the average distribution ratio r_encoding for each encoding:
n_freq = number of frequent characters in the files from the training set with a given encoding
n_total = total number of characters in the same files
r_encoding = n_freq / (n_total – n_freq)
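Continuing the sketch, training boils down to counting characters over the labelled files (already decoded to text; the names are again hypothetical):

```python
from collections import Counter

def train(texts, top_k=512):
    """From decoded texts known to share one language/encoding, derive
    the frequent-character set and the average distribution ratio."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    frequent_set = {ch for ch, _ in counts.most_common(top_k)}
    n_total = sum(counts.values())
    n_freq = sum(counts[ch] for ch in frequent_set)
    r_encoding = n_freq / (n_total - n_freq)
    return frequent_set, r_encoding

# Usage, given some corpus of known Shift_JIS files:
# frequent_set, r_sjis = train(shift_jis_training_texts)
# confidence_level(unknown_text, frequent_set, r_sjis)
```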
2-character sequence distribution method
For languages with few distinct characters, such as Western languages, a simple character distribution is not enough to discriminate among encodings, so a distribution of 2-character sequences is used instead.
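The machinery is the same as before, just computed over overlapping character pairs. A sketch, with hypothetical names:

```python
from collections import Counter

def bigrams(text):
    """Overlapping 2-character sequences of `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def bigram_ratio(text, frequent_bigrams):
    """The same n_freq / (n_total - n_freq) ratio, over bigrams."""
    counts = bigrams(text)
    n_total = sum(counts.values())
    n_freq = sum(n for bg, n in counts.items() if bg in frequent_bigrams)
    return n_freq / max(n_total - n_freq, 1)
```

Training and confidence scoring then work exactly as in the single-character case, with `train` counting bigrams instead of characters.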
According to the paper, the detector achieved 100% accuracy when applied to the home pages of 100 popular international web sites.