Tweeter button

Archive for the ‘Genetic Programming’ Category

Using internet users to do something useful

Saturday, June 13th, 2009

I found an insightful and entertaining talk that I recommend you to watch. It’s about CAPTCHA in general and more specifically about ReCAPTCHA.

CAPTCHA and ReCAPTCHA

The idea of CAPTCHA is to ask users to answer a question that only humans can answer in order to check whether they are a human or a computer program. For example, in order to prevent massive creation of email accounts by computer programs, most email services ask new subscribers to write the text present in a image. The image is distorted so that it is not possible for a computer program to recognize it but it is still readable by a human. CAPTCHA is now commonly used by all kinds of websites and so millions of users are reading CAPTCHA everyday…

The basic idea of ReCAPTCHA is to use the manpower of people writing CAPTCHA in order to help digitalize old books. Although OCR (Optical Character Recognition) software is pretty efficient for printed documents, the error rate can jump quite a lot for old books due to their poor quality. So by definition, words extracted from old books are in themselves good candidates for CAPTCHA. However, extracted words are further distorted in order to make them look like real CAPTCHA.

In ReCAPTCHA, users are presented with two words to identify. For one word, we know the answer already so it is used to check whether the user is a human or not. For the other word, we want to know the answer (in order to digitalize the book). In other words, users are asked to write two words even though only one would have suffice to fulfill the purpose of the CAPTCHA. However, measurements show that on average it takes equally long to write two plain English words and one word composed of 6-8 random characters. So ReCAPTCHA is an interesting way to fulfill the CAPTCHA’s goal while doing something useful at the same time, which is nice knowing that millions of users are reading CAPTCHAs everyday.

I though ReCAPTCHA is a pretty clever idea. It’s actually an example of human-based computation — that is, the combined use of human and computer for a task that wouldn’t have been possible for one or the other alone.

Genetic art

An idea I had sometime ago (but this has already been done), which is an example of human-based computation, is to use genetic programming in order to automatically generate art. Because it is difficult for a computer alone to tell whether a certain piece of art is beautiful or not, the idea is to let a genetic algorithm create art and use humans in order to evaluate the beauty of art at each iteration of the genetic algorithm (i.e. as a fitness function).

Handwriting sample database

Another related idea is the idea of games with a purpose. This is basically what I want to do with the Chinese handwriting sample database that I mentioned several times already in this journal. The basic idea is to make it attractive for people to write handwriting samples of Chinese characters by letting them play educational games.

Numbers game solver

Monday, April 6th, 2009

Genetic programming is a Machine Learning technique that allows to “create programs that create programs” by simulating evolution. The purpose is to slowly converge to a program that performs a defined task as well as possible.

I wanted to try out Genetic Programming and I figured writing a solver for the “numbers game” would be a good example. If you don’t know it, it’s part of the popular game shows “Countdown” in the UK and “Des chiffres et des lettres” in France. For example, given the numbers 3, 75, 2, 4, 1 and 1, the goal is to arrive at 888 using the four basic arithmetic operations (+, −, × and ÷) like so:

74 = 75 - 1
12 = 4 * 3
888 = 74 * 12

The above calculations can be represented as a tree like this:

        *
     /      \
    -        *
  /   \    /   \
75    1    4    3

So it becomes clear that solving the numbers game is equivalent to writing little programs. Although I used the tree representation, a possibly simpler representation that I now think of is like a string of stack operations, like in Forth:

75 1 - 4 3 * *

You start with an initial population of random programs and every generation, you create a new population by selecting the best-performing programs from the current generation, mutating programs (changing a constant with another constant, changing an operator with another operator…) and breeding programs (creating programs which are the combination of two programs).

The generated programs must respect some constraints. For example, division must not have a remainder so 7 / 2 is not allowed. Another example is that the constants used must be among the 6 input numbers and can only be used once.

In Genetic Programming, to determine how good the generated programs perform, we need a scoring function. Here it’s very clear: the closer we are from the target number, the better. To decide between two solutions that perform as well as the other, we can select the one with the fewer calculations.

You can find my program here.

Another idea for applying Genetic Programming that I had is for creating hash functions. What would be the scoring function in that case? One could use the avalanche effect (the greater the avalanche effect, the better) or the average number of collisions (the smaller number of collisions, the better). Although I had this idea by myself, it seems that researchers are already investing this!