Tweeter button

Archive for the ‘Machine Learning’ Category

Latent Dirichlet Allocation in Python

Saturday, August 21st, 2010

Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) - see my previous post “LSA and pLSA in Python“, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. LDA can be seen as a Bayesian extension of pLSA.

As Blei, the author of LDA, points out, the topic proportions in pLSA are tied with the training documents. This is problematic: 1) the number of parameters grows linearly with the number of training documents, which can cause serious overfitting 2) it is difficult to generalize to new documents and requires so-called “folding-in”. LDA fixes those issues by being a fully generative model: where pLSA uses a matrix of P(topic|document) probabilities, LDA uses a distribution over topics.

To date, there exists several parameter estimation schemes for LDA: variational Bayes, expectation propagation and Gibbs sampling. I’ve chosen to implement the latter. It has first been described in a paper entitled “Finding scientific topics”, by Griffiths and Steyvers.

Artificial data

As with all model-based algorithms, during the early development phase, it is useful to work with artificial data, generated by following the model assumptions. In the case of LDA (and pLSA), the core assumption is that words (w) in documents are generated by mixture of topics (z). In other words, the probability of a word is:

P(w) = \sum_{z} P(w|z) P(z)

The generative process can be summarized as follows: 1) set the topic proportions once for all when the collection is instantiated and 2) for each document and for as many words as needed, sample a topic from the topic distribution and sample a word from the word distribution of the selected topic. Obviously, this is only an approximation of how documents are created in reality.

To generate an artificial dataset, we can fix the word distribution of each topic and then generate documents as explained above. Since we generated documents by sticking to the generative assumption of the model, if the algorithm is correctly implemented, it should be able to recover the word distribution of each topic, from the generated documents.

Graphical example

To gain insight and intuition, we can reuse the graphical example from Griffiths and Steyvers’ paper.

In the bag-of-words model, documents are represented by vectors of dimension V, where V is the vocabulary size. Moreover, an image of size \sqrt{V} \times \sqrt{V} has V pixels: it can thus be stored as a string/vector of length/size V. This means that a document in the bag-of-words model can be represented as an image, where pixels correspond to words and pixel intensities correspond to word counts!

As put previously, we first need to fix the word distribution of each topic. Let’s arbitrarily create 10 topics.

5 with “vertical” bars:

and another 5 with “horizontal” bars:

Each topic distribution is represented by a 5×5 image, so the vocabulary is of size 25. Black pixels correspond to words that the topic will never possibly generate. White pixels correspond to words that the topic can generate with probability 1/5.

Now let’s generate 500 documents using the generative process previously described. Here are 3 examples of such generated documents.

We clearly see bars emerging from the documents and can thus confirm that documents are mixtures of topics.

We can now use the generated documents as training data. If the Gibbs sampler is correctly implemented, we should be able to recover the original topics. Here are the results for the 1st, 6th and 26th iterations. The number between brackets is the log-likelihood.

1st iteration (-278541.7835):

5th iteration (-165139.56193):

[...]

26th iteration (-129272.328181):

After a few iterations, we see that the algorithm recovered the topics correctly. Also, the log-likelihood increases: as the number of iterations increases, it becomes more and more likely that the model generated the data. The fact that it works pretty well is not surprising: the data used were generated by sticking to the model assumptions.

Gibbs sampling

The Gibbs sampler used is said to be collapsed: the parameters of interest are not sampled directly. Instead we sample the topic assignments and the parameters can be computed in terms of those.

It is not necessarily obvious from the equation of the full conditional distribution (from which the topic assignments are sampled) but the sampler is naturally sparse: it doesn’t need to iterate over words with zero-count. This is a nice property, given that sampling algorithms are often considered slow.

Source code

http://gist.github.com/542786

Fairly readable and compact code but to be considered a toy implementation.

Useful Resources

MCMC

- “MCMC lecture at MLSS09” (Iain Murray). Nice for a first general overview and the insights.

- “Gibbs sampling for the uninitiated” (Resnik and Hardisty). Nice for a first general overview and the insights.

- “Pattern Recognition and Machine Learning” (Bishop), Chapters 8 and 11 on graphical models and sampling methods. Excellent chapters.

- “Review Course: Markov Chains and Monte Carlo Methods” (Cosma and Evers). Very nice free online course and solutions to exercises in Python and R!

LDA

- “Latent Dirichlet Allocation” (Blei et al, 2003). By Blei himself.

- “Finding scientific topics” (Griffiths and Steyvers). Insightful comments and nice intuitive graphical example.

- “Parameter Estimation for text analysis” (Heinrich). Very nice introduction to Bayesian thinking. Pseudo-code for the LDA Gibbs sampler.

- “On an equivalence between PLSI and LDA” (Girolami and Kaban). Connections between pLSA and LDA.

- “Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naive Bayes for Collapsed Gibbs Sampling” (Carpenter). Very detailed, step-by-step derivation of the collapsed Gibbs samplers for LDA and NB.

- “Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details” (Wang). Insightful comments and pseudo-code of the LDA Gibbs sampler.

Other Python implementations

- nrolland’s pyLDA. Works fine but mixes Python-style and Numpy-style.

- alextp’s pylda. Numpy-style but not tested.

Semi-supervised Naive Bayes in Python

Monday, June 21st, 2010

Expectation-Maximization

The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can’t directly count. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use our current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.

Naive Bayes trained with EM

In their paper “Semi-supervised Text Classification Using EM”, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a semi-supervised fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:

  • Train a classifier with your labeled data
  • While the model likelihood increases:
    • E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought as probabilistic/fractional labels.
    • M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.

The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.

Code

I made a simple implementation of it in Python + Numpy. The code is fairly optimized.

$ git clone http://www.mblondel.org/code/seminb.git

web interface

Implementation details

Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.

Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:

P(x_i|c_j) = \prod_{t}^V P(w_t|c_j)^{x_{it}}

However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly exceed the precision of the computer and become zero. The solution is to perform the computations in the log domain:

\log P(x_i|c_j) = \sum_{t}^V x_{it} \log P(w_t|c_j)

To turn around P(x|c), we use Bayes’rule:

P(y_i=c_j|x_i) = \frac{P(c_j)P(x_i|c_j)}{P(x_i)} = \frac{P(c_j)P(x_i|c_j)}{\sum_k P(c_k)P(x_i|c_k)}

By posing z_j = \log P(c_j) + \log P(x_i|c_j), we get:

P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}

This is the softmax function. However, we are back to our initial problem because, since the z_j are likely to tend to -inf, the exponentials are likely to in turn underflow. The trick is to multiply the numerator and denominator by the same constant e^{-m}:

P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}

Setting m to max_j~z_j, the z_j-m values will get closer to zero. The rationale for this is that computation of the exponential overflows earlier and is less precise for big values (positive or negative) than for small values.

This trick will improve the situation quite a lot but in case this is not enough:

P(y_i=c_j|x_i) = \begin{cases} 0, & \mbox{if }  z_j - m  \le t  \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m > t\}} e^{z_k-m}},  & \mbox{otherwise} \end{cases}

This sets the exponentials to zero when e^{z_j-m} \le e^t. For t=-10, this is 0.000045. Equivalently this corresponds to setting the exponentials to zero when e^{z_j} \le e^{t+m}. Since both t and and m are negative, this shows that subtracting the maximum m, as explained before, does help improving the precision.

Reference

Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.

LSA and pLSA in Python

Sunday, June 13th, 2010

Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually greatly inferior to the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are thus unsupervised dimensionality reduction techniques. Applications include information retrieval, document classification and collaborative filtering.

Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing.

Comparison

  LSA pLSA
1. Theoretical background Linear Algebra Probabilities and Statistics
2. Objective function Frobenius norm Likelihood function
3. Polysemy No Yes
4. Folding-in Straightforward Complicated

1. LSA stems from Linear Algebra as it is nothing more than a Singular Value Decomposition. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).

2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the Frobenius norm of the difference with the original matrix). Moreover, as it is well known in Machine Learning, the least squares solution corresponds to the Maximum Likelihood solution when experimental errors are gaussian. Therefore, LSA makes an implicit assumption of gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling.

The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.

Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).

3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.

4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected in the lower-dimensional space too. This is called “folding-in”. Clearly, new documents folded-in don’t contribute to learning the factored representation so it is necessary to rebuild the model using all the documents from time to time.

In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm.

Implementation in Python

LSA is straightforward to implement as it is nothing more than a SVD and Numpy’s Linear Algebra module has a function “svd” already. This function has an argument full_matrices which when set to False greatly reduces the time required. This argument doesn’t mean that the SVD is not full, just that the returned matrices don’t contain vectors corresponding to zero singular values. Scipy’s Linear Algebra package unfortunately doesn’t seem to have a sparse SVD. Likewise, there’s no truncated SVD (there exists fast algorithms to directly compute a truncated SVD rather than computing the full SVD then taking the top K singular values).

pLSA’s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.

The implementation is sparse. It works with both Numpy’s ndarrays and Scipy’s sparce matrices.

$ git clone http://www.mblondel.org/code/plsa.git

web interface

Next, I would like to explore Fisher Kernels as there seems to have nice interactions with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it’s more challenging. LDA is a Bayesian extension of pLSA : pLSA is equivalent to LDA under a uniform Dirichlet prior distribution.

The Little Machine Learner

Thursday, February 18th, 2010

The idea

I’ve been having this idea on my mind for quite some time: wouldn’t it be nice to write a book about Machine Learning where each chapter is a literate program?

From Wikipedia:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

From the PyLit homepage:

The idea is that you do not document programs (after the fact), but write documents that contain the programs.

There are plenty of great textbooks about Machine Learning out there, so the point would not be to write yet another one, but write something different. Here’s what I had been thinking.

  • Each chapter written as a literate program, organized so as to maximize understanding
  • Code in Python (+Numpy + Scipy but without any additional dependencies)
  • Readability over Performance
  • Intuitions, nice figures, useful tips or tricks
  • Real-world applications at the end of each chapter
  • Don’t shy away from the maths, especially if at high-school or undergraduate level…

I bet that quite a few algorithms can be written this way, yet remain very concise!

Except for the maths part, the closest book to this idea that I know of is probably “Programming Collective Intelligence: Building Smart Web 2.0 Applications”, by Toby Segaran.

An example with logistic regression

So, in order to experiment with what such a book could look like, I’ve decided to write a chapter about Logistic Regression. Topics I cover include Maximum Likelihood Estimation, Regularization and Cross-validation. At the end, I use heart disease prediction as an example of real-world application. Probably many things could be improved or added but the point for now is mainly to show what it could look like.

Tools

For the documentation tool, I’ve decided to go for Sphinx, which seems to be emerging as the de-facto documentation tool in the Python community. It has nice features like syntax highlighting, latex support and matplotlib plots support and can output to HTML and PDF.

Normally, in literate programming, there’s the literate source, which uses some kind of markup-language and tools are used to generate either code or documentation from it. I took a different approach. In my case, the source file is the code and the documentation is extracted from the comments in the code. Technically, it’s therefore closer to extensively documented code than actual literate programming. It has some limitations but the main advantages are that the program is runnable directly (since Python is interpreted) and the programmer can benefit from syntax highlighting. I wrote a simple program that converts Python source code to reStructuredText, as necessary for integration in Sphinx.

Interested?

It took quite some time to collect the information and do the actual writing but I feel like I improved my own understanding in the process, so I’m thinking of writing a chapter from time to time. If I do so, at the end of my PhD, I may have gathered enough material to make it a real book! The book could affectionately be entitled “The Little Machine Learner”, hence the title of this post.

Since Machine Learning is a very large field and to write a better book than I could possibly write alone, I’m also thinking that it could actually be a collaborative effort (by researchers, students and practitioners). If you’re interested, please leave a comment. I will create a discussion group if there’s enough interest.

As usual, the source code is available in my git repo:

$ git clone http://www.mblondel.org/code/tlml.git

web interface

Numbers game solver

Monday, April 6th, 2009

Genetic programming is a Machine Learning technique that allows to “create programs that create programs” by simulating evolution. The purpose is to slowly converge to a program that performs a defined task as well as possible.

I wanted to try out Genetic Programming and I figured writing a solver for the “numbers game” would be a good example. If you don’t know it, it’s part of the popular game shows “Countdown” in the UK and “Des chiffres et des lettres” in France. For example, given the numbers 3, 75, 2, 4, 1 and 1, the goal is to arrive at 888 using the four basic arithmetic operations (+, −, × and ÷) like so:

74 = 75 - 1
12 = 4 * 3
888 = 74 * 12

The above calculations can be represented as a tree like this:

        *
     /      \
    -        *
  /   \    /   \
75    1    4    3

So it becomes clear that solving the numbers game is equivalent to writing little programs. Although I used the tree representation, a possibly simpler representation that I now think of is like a string of stack operations, like in Forth:

75 1 - 4 3 * *

You start with an initial population of random programs and every generation, you create a new population by selecting the best-performing programs from the current generation, mutating programs (changing a constant with another constant, changing an operator with another operator…) and breeding programs (creating programs which are the combination of two programs).

The generated programs must respect some constraints. For example, division must not have a remainder so 7 / 2 is not allowed. Another example is that the constants used must be among the 6 input numbers and can only be used once.

In Genetic Programming, to determine how good the generated programs perform, we need a scoring function. Here it’s very clear: the closer we are from the target number, the better. To decide between two solutions that perform as well as the other, we can select the one with the fewer calculations.

You can find my program here.

Another idea for applying Genetic Programming that I had is for creating hash functions. What would be the scoring function in that case? One could use the avalanche effect (the greater the avalanche effect, the better) or the average number of collisions (the smaller number of collisions, the better). Although I had this idea by myself, it seems that researchers are already investing this!

Source code classifier

Wednesday, March 18th, 2009

Of the numerous Machine Learning algorithms, naive Bayes classifiers are probably one of the simplest forms. Yet, they are known to be able to yield very good performance in some circumstances. The most famous example of their use is probably the spam filters, in which spam and non-spam emails are marked beforehand and throughout the filter’s life in order to train the filter to recognize spam from non-spam.

I’ve decided to try out this technique for source code classification: given a source file for which we don’t know the programming language, the purpose is to output the n most probable programming languages in which the source is written. Although I did it mainly to see what kind of results I could get, one use I can think of is in a text-editor, to activate or deactivate syntax highlighting when the file extension is missing.

Difficulties

I can think of several difficulties in trying to classify code source files:
- Unlike many natural languages in which spaces delimit words, in programming languages, spaces can often be omitted. For example print(”hello”) and print (”hello”) could well both be correct in a given language. The lack for spaces is detrimental to the performance because print(”hello”) will be treated as a single word and as a result have a very low probability.
- Programs are sometimes heavily commented by the programmer, which can cause the classifier to confuse source files and plain-text files.
- Since we don’t know the language in advance, we cannot rely on a grammar. Intuitively, we need to use the occurrence probability of words for each programming language.

Collect data

Because this technique involves training the program to recognize programming languages, it was first necessary to collect source files for different programming languages. I decided to limit myself to 24 languages from Programming Language Popularity for which I had easy access to open-source programs. Of course it is possible to add more by training the program with more source files.

For each programming language, I grabbed a few projects written in the language and for interpreted languages, I also grabbed the virtual machine source code because their standard library is often written in the language directly. I downloaded all the source files with the Debian command “apt-get source”, which grabs project sources from Debian mirrors, making the task a little bit more convenient but overall collecting source files was the most boring and time-consuming part of this project.

The thing is, to account for different programming styles and to find a good approximation of the word distributions, especially given the difficulties mentioned above, it’s very important to collect as much source files as one can. Once this was done I was able to start the program itself.

Ruby and C implementations

I made the program (trainer and classifier) in Ruby. The training part is quite slow (Ruby is quite slow for text processing) but the classifier is well responsive, which is the most important. I also made a C version of the classifier: the Ruby program is used to generate a header file (.h) which contains a hash table with the data for the statistical model and when the classifier is compiled, the hash table is included directly in the executable for blazing fast processing. The executable remains small (~200K for 24 languages).

An advantage of this approach is that most of the work is done offline (i.e. during the training) and when I tweak some parameters in the Ruby program, I just have to retrain, regenerate the .h file and recompile. The core C module doesn’t change.

Performance evaluation

I used a validation set to estimate how good the classifier performs: the percentage of correctly classified source files and the percentage of incorrectly classified source files and I got an overall accuracy of 85%. The validation set was composed of 40 sources files for each of the 24 programming languages. Here are a few interesting points I noticed:

- A few languages such ada, css, forth, ocaml are recognized correctly 100% of the time. Others such as tcl, erlang, scheme, haskell, smalltalk get an average accuracy over 95%. This shows that these languages stand out in their word distributions: the most common words in these languages are quite different from the most common words in other languages making it easier for the classifier to distinguish them.

- On the contrary, other languages get poor accuracy because they easily get recognized as another language. An example is C++ which, according to my tests, got recognized as C or Javascript in 30% of the source files. This is not very surprising knowing that these languages have a similar syntax.

- It’s also interesting to note what languages does the classifier recognizes when it makes mistakes. This shows us how close some languages can be. For example, not surpringly, the classifiers sometimes confuses common lisp with scheme and vice-versa.

Source / Plain text

If this classifier had to be integrated into a text editor, it would first be necessary to tell if the file is even source code or just plain text (like a README file). To do that, I would personally build a second model whose role is solely to distinguish source code from plain text. It’s possible to do that by training the classifier appropriately. If the file is recognized as plain text, syntax highlighting can be disabled, otherwise the programming language model can be used to determine the programming language and set the syntax highlighting accordingly. The only disadvantage to this approach is that the first model becomes a bottleneck to the overall process. If it does a poor job at distinguishing plain text from source code, then it is the whole system that will do a poor job.

Download

You can get the source code (LGPL license) with this command:
$ git clone http://www.mblondel.org/code/source-classifier.git

or by browsing the web interface.

HCCR using AJAX and SVM

Tuesday, September 9th, 2008

I came across this impressive page (in Japanese), which shows a demonstration of Handwritten Chinese Character Recognition using AJAX for the user interface and Support Vector Machines for the training algorithm.

Looking at the Javascript code, I was surprised to see that, unlike my web canvas, it doesn’t use the <canvas> tag! It simply uses a combination of Javascript and CSS. Even though it has a few quirks, the interface is quite responsive.

The recognition process itself happens on the server side but thanks to the use of AJAX, the results are displayed very smoothly, without the need to refresh the page.

Taku Kudo, the author, explains in the page that he’s using the handwriting data from Tomoe. However since Tomoe uses a template-based algorithm, it only has one handwriting sample per character. I’m impressed that Taku Kudo can train his system only with one sample per character. Overall, the accuracy is not very impressive but I think it could improve a lot with more training samples. That’s why my handwriting database project is going to be very useful. I’ve been willing to try SVM in addition to HMMs so the fact that this project uses SVM confirmed my interest for it.

Taku Kudo’s page has neat stuff regarding Natural Language Processing and Machine Learning. He published a lot of libraries as free software, including Mecab and TinySVM. If you like fancy stuff in AJAX, you’ll also like his Japanese Input Method.

Character encoding detection

Sunday, August 17th, 2008

Two years ago, I wrote about a port to Ruby of Universal Encoding Detector, which is itself a port to Python of Mozilla’s character encoding detection algorithm.

Recently being interested in Machine Learning, I read about naive Bayes classifiers. I then remembered the encoding detector program and thought that naive Bayes classifiers would be a good candidate for this kind of problem. Going back to the Universal Encoding Detector’s home page, I found a link to:

A composite approach to language/encoding detection

This is an interesting read. Here’s a summary. The algorithm used is a composite approach of 3 different methods:

Coding scheme method

A state machine is used to identify illegal code points (in which case we can remove the encoding from the search space) or code points that only exist in one encoding (in which case we can immediately identify the encoding). This method is good for multi-byte encodings but not for all and is not good for single-byte encodings because unused code points in one encoding are very seldom used in other encodings.

Character distribution method

For languages with many different characters like Chinese, Japanese and Korean, use the Character distribution. The paper has interesting figures about this.

Number of Most
Frequent Characters
Accumulated Percentage
10 0.11723
64

0.31983
128 0.45298

256 0.61872
512 0.79135
1024

0.92260
2048 0.98505

4096 0.99929
6763 1.00000

Character distribution in Simplified Chinese

Number of Most
Frequent Characters
Accumulated Percentage
10 0.27098
64 0.66722
128

0.77094
256 0.85710

512 0.92635
1024 0.97130
2048

0.99431
4096 0.99981

1.00000

Character distribution in Japanese

In other words, even though Chinese and Japanese include literally thousands of characters, the 512 most frequent characters represent respectively 79% and 93% of the whole character distribution. This is extremely interesting! This means that we can use the frequency of occurence of code points to detect encodings.

Although a naive Bayes classifier could have been used for this, a much simpler approach was taken. Characters were divided into two categories: frequent and non-frequent characters. Are considered frequent the characters that fall into the first 512 most frequent characters. For a given file the encoding of which is unknown, we define a distribution ratio r_file:

n_freq = number of frequent characters in the file
n_total = total number of characters in the file
r_file = n_freq / (n_total - n_freq)

We compare r_file with the average distribution ratio for a given encoding, r_encoding. This comparison, in the form of the ratio of the two distribution ratios, gives us our confidence level.

confidence_level = r_file / r_encoding

We calculate the confidence level for each encoding by replacing r_encoding with the corresponding value and the highest confidence level gives us the most likely encoding.

This technique requires a training set, that is, files for which we know the encoding. First of all this allows us to find the character frequency for each character. Second of all, it allows us to determine the average distribution ratio r_encoding for each encoding.

n_freq = number of frequent characters in the files from the training set with a given encoding
n_total = total number of characters in the same files
r_encoding = n_freq / (n_total - n_freq)

2-character sequence distribution method

For character sets with few different characters such as in western languages, a simple character distribution is not enough to discriminate among encodings so a distribution of sequences of 2 characters is used instead.

Accuracy

100% accuracy was achieved when the detector was applied to the home pages of 100 popular international web sites, according to the paper.

Handwriting renderers

Sunday, July 13th, 2008

Canvas

If you didn’t read my previous post, for short, project Tegaki is a framework for handwritten Chinese character recognition (HCCR) written in Python. It includes reusable components and is a placeholder for experimentation. The goal is to create the next-generation open-source HCCR software but it may be useful for academic researchers as well.

One reusable component is the Canvas. This is the user interface component that allows to draw characters. In addition, the Canvas supports “replaying” the character (stroke by stroke animation) and setting a background model (to help users draw an unknown character). It is multi-platform.

The Canvas
Example of a character drawn using the Canvas provided by libtegaki-gtk

The Canvas has a get_writing() method. It allows to retrieve the Writing object for the handwriting currently displayed in the Canvas.

XML representation

The Writing object supports reading from and writing to an XML file. The XML file can optionally be compressed using gzip or bz2. On my hard drive, I have a small set of handwriting samples. 500 characters take about 10 MB. That’s why compression is very useful.

The XML representation of a handwriting sample looks like that.

<character>
  <utf8>無</utf8>
  <strokes>
    <stroke>
      <point x="306" y="163" timestamp="0" />
      <point x="303" y="163" timestamp="21" />
      <point x="303" y="166" timestamp="29" />
      [...]
    </stroke>
    <stroke>
      <point x=”266″ y=”240″ timestamp=”912″ />
      <point x=”270″ y=”240″ timestamp=”917″ />
      <point x=”273″ y=”240″ timestamp=”925″ />
      [...]
    </stroke>
    [...]
  </strokes>
</character>

Renderers

I’ve recently added support for what I named “renderers”. They take a Writing object as parameter and generate a visual representation of it. Since I used the cairo graphics library as drawing backend, the representation can be saved to PNG, SVG and PDF! Those renderers will be very useful for the handwriting database website that I wrote about in my previous post!

Complete character renderer

Kanji

Stroke order renderer

Kanji
Stroke order with each single stroke

Kanji
Stroke order with stroke groups

Strokes can be grouped together when the stroke order is obvious. However, this requires to know which strokes to combine together. A dictionary must be created for that. A entry example would be:

駅 1,1,3,1,4,2,2

<canvas> HTML tag

The canvas I was writing about above is written in pygtk and is intended to be used for the Desktop or for Maemo. However, in the case of the handwriting database website, since we want as many people to contribute their handwriting as possible, it would be nice to not require any particular installation. For that, a canvas directly in the browser would be the ideal solution.

One solution would be to use Flash but I would prefer to use the <canvas> tag. It can be used in combination with Javascript to do drawing in the browser. It is supported natively by Firefox, Opera and Safari. It is supported in Internet Explorer through a third-party Javascript called ExplorerCanvas.

I am looking for a contributor to create a new canvas using this technology. The canvas should support drawing, displaying existing handwriting and replay (stroke by stroke animation).

For more information:

GIF stroke animation

Even though GIF uses a patented compression, GIF is still the only format with support for animations and wide support in the browsers. Therefore it would be very cool to be able to generate GIF stroke animations from a writing object.

I had a look at python-imagemagick and Python Imaging Library (PIL) but they both seem to have very limited support for GIF animations. So I’m thinking of writing my own library for GIF generation in Python. Byzanz, a software to create screencasts as GIF animations, can be used as inspiration because it includes a GIF encoder. It also supports color quantization (using octrees) and dithering. From what I see, it should take less than 1000 lines of Python code.

I read a little bit about color quantization. I found it very interesting. Here’s a short explanation about color quantization for those who don’t know about it. Basically, each pixel in an image may have three components Red Blue Green. For a 400×400 picture, this is about 400*400*3=480KB. To gain space, an idea is to store colors in a palette (a table index => color). Then each pixel only needs to refer to the index in the palette instead of having to define the three components. For a 256-color palette, this saves two bytes for each pixel. However, since we now use 256 colors only instead of 256 * 256 * 256 = 16,777,216 colors, there’s a color precision loss. The challenge is thus to find what colors to put in the palette to have the smallest precision loss possible. For example, we may want to put in the palette colors that are the closest to the most frequently used colors. This is a 3-dimensional clustering problem, thus it reminded me of Machine Learning, a topic in which I’ve been very interested recently.

For more information, I recommend the reading of those Wikipedia articles:

First HMM experiment

Sunday, May 25th, 2008

Today I’m publishing the initial results of my experiments on online handwriting recognition of Chinese characters, using Hidden Markov Models (HMM). You can see my post on Tomoe Evaluation for some background.

Download

$ git clone http://www.mblondel.org/code/hwr.git

The code can be browsed online using gitweb.

See my memo on git if you don’t know it yet. I published my work under GPL license.

Requirements

- Python (2.4)
- GHMM (SVN)
- Tomoe (SVN)
- Tomoe-GTK (SVN)

The Python bindings for the last three are also needed.

Folder structure

- data/ contains the raw training and evaluation data
- lib/ contains reusable components
- models/ contains model experiments
- tests/ contains test cases
- character-editor is the graphical interface to edit character data
- model-manager controls the training workflow, evaluation and the test pad

Each model must have an intelligible name. Each model must defined a file called model.py containing a class called Model. This class defines the behavior of the model. A model can inherit from other models in order to reuse common components. My first model is called “basic” so its file is models/basic/model.py.

First model

Here is some information regarding my first model.

- HMM unit: whole character
- Feature vectors: (deltax, deltay) with deltax = abs(xt - xt-1) and deltay = abs(yt - yt-1)
- Number of states: 3 * number of strokes
- Initial state transitions: 0.5 to stay in the same state, 0.5 to jump to the next state
- Initial state alignment: feature vectors are segmented uniformly and segments are associated with their corresponding state
- Training: Baum-Welch

If you don’t understand anything of the above, you should read more about HMMs ;) I may write an introduction on this journal if I have some time.

Training workflow

model-manger’s usage is as follows:
./model-manager model-name command

My first model is named “basic” so you may replace model-name by “basic”. Possible commands include:

- fextract, for the feature vectors extraction
- init, for the model initialization
- train, for the training
- eval, for the evaluation
- pad, to test the model with your own handwriting

“all” is a command equivalent to fextract, init, train and eval.

Testing with your own handwriting

First of all you should generate the HMMs with the following command:
./model-manager basic all

The process takes less than one minute on my computer. You may see a few warnings because of some issues in ghmm and tomoe. If all goes well, you should see the accuracy of the model.

From this point, normally, you could test the HMMs with your own handwriting with the following command:

./model-manager basic pad

However, for strange reasons, ghmm behaves incorrectly when the pygtk module is loaded. So the above command works but the character results will be incorrect. I need to contact the pygtk or ghmm mailing-list about this obscure issue. For now, you can use the following command:

./model-manager pad | ./model-manager basic eval -s

The results are displayed on the console. The system supports the following 50 kanji only.

一 二 三 泣 漢 温 使 便 旅 族
水 氷 撃 女 安 北 化 忘 妄 近
集 育 坊 訪 防 妨 駅 福 副 神
版 坂 板 金 全 錬 練 業 習 央
決 代 反 想 歯 象 始 初 発 感

Pick a few of them and try them with your own handwriting ;-)! By the way, all training and evaluation data were written by mouse.

Evaluation

match1: 80.0%
match5: 96.0%
match10: 98.0%

始	1	始, 福, 駅, 錬, 漢
旅	1	旅, 族, 駅, 練, 副
妨	1	妨, 練, 錬, 板, 発
防	1	防, 訪, 旅, 族, 板
泣	1	泣, 温, 福, 練, 駅
副	1	副, 訪, 福, 撃, 初
福	1	福, 練, 錬, 副, 駅
坂	3	板, 駅, 坂, 族, 錬
代	1	代, 板, 漢, 使, 駅
反	1	反, 福, 副, 忘, 妄
撃	3	駅, 錬, 撃, 漢, 副
業	1	業, 練, 錬, 集, 駅
氷	2	駅, 氷, 水, 妨, 版
温	1	温, 福, 駅, 錬, 想
育	1	育, 練, 副, 駅, 福
神	2	練, 神, 福, 錬, 撃
近	1	近, 駅, 練, 漢, 福
化	1	化, 練, 駅, 便, 習
一	X
央	1	央, 決, 業, 駅, 発
族	1	族, 練, 旅, 錬, 副
安	4	妄, 駅, 福, 安, 族
象	1	象, 駅, 錬, 練, 集
歯	1	歯, 練, 錬, 駅, 副
錬	1	錬, 練, 集, 駅, 福
習	1	習, 錬, 福, 駅, 漢
使	1	使, 便, 漢, 錬, 練
訪	1	訪, 駅, 錬, 副, 板
漢	1	漢, 錬, 駅, 練, 業
全	1	全, 金, 集, 錬, 福
集	1	集, 練, 業, 錬, 福
版	1	版, 板, 錬, 駅, 集
水	2	氷, 水, 旅, 駅, 便
板	1	板, 族, 坂, 福, 駅
妄	1	妄, 駅, 福, 忘, 練
初	1	初, 駅, 旅, 練, 坂
想	1	想, 駅, 副, 錬, 集
発	1	発, 練, 福, 駅, 漢
練	1	練, 錬, 福, 駅, 板
北	1	北, 坂, 副, 駅, 板
決	1	決, 漢, 便, 練, 坂
坊	X
駅	1	駅, 錬, 練, 族, 福
金	1	金, 発, 練, 錬, 駅
女	5	駅, 妨, 妄, 板, 女
忘	1	忘, 族, 副, 福, 駅
二	1	二, 三, 忘, 歯, 習
感	1	感, 福, 駅, 族, 練
便	3	練, 駅, 便, 錬, 福
三	1	三, 忘, 副, 版, 訪

The results are very promising and outperform Tomoe’s current recognizer. Incidentally, I used the same evaluation corpus for Tomoe and for my experiment. However, a few things must be emphasized:

- My experiment only supports 50 kanji while Tomoe supports thousands of them.
- The evaluation of my experiment is performed using kanji from the same people who wrote the kanji used for training. However, the kanji instances for training and evaluation are not the same.
- It’s pretty sure that using the whole character HMM symbol will not perform well in terms of computation time with thousands of kanji. Usually, stroke or sub-stroke models are preferred.

Interestingly, my recognizer doesn’t do a good job at recognizing the simplest characters: 一 二 三.

Both Tomoe and my recognizer are sensitive to stroke order. However, as it seems, my recognizer is not so sensitive to stroke number. For example, く in 女 is one stroke but it’s acceptable to write it in two strokes. However, if you write く after 一 and ノ, it doesn’t work.

Call to online handwriting database

If you’re a researcher in handwriting recognition and read this, I’m looking for a handwriting database of Chinese characters (kanji or hanzi). Please contact me if you can help me.

What’s next?

- Try more sophisticated feature vectors
- Try more sophisticated initial state alignment
- Try stroke and sub-stroke HMMs
- Collect more data
- Try techniques other than HMMs