Caching computation tasks

Wednesday, January 27th, 2010

When I work on computationally expensive projects (e.g., Machine Learning), I always find myself in the same situation: my programs can be broken down into a chain of tasks, where tasks may depend on the results of other tasks. A typical such chain would be:

preprocessing -> feature-extraction -> training -> evaluation

If I make a modification in my training algorithm and want to re-evaluate it, I do need to re-run the “training” and “evaluation” tasks, but I don’t need and don’t want to re-run the “preprocessing” and “feature-extraction” tasks, especially if they take time to compute.

At first, I tried to save and load task results manually. This quickly proved unmanageable, so I started to think of ways to automate it. Since I had quite a precise idea of what I wanted, I’ve decided to write my own tool, at the risk of reinventing the wheel. (I suspect it’s quite hard to come up with a universal tool, though.) To keep things simple, I’ve decided to limit the tool’s scope to projects that can be run on a single computer, typically with multiple cores. In particular, it won’t support any kind of distributed computing.
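To make the idea concrete, here is a minimal sketch of what automated saving and loading of task results could look like in Python. It is not the tool described in this post: the `cached_task` decorator, the `task_cache` directory and the two toy tasks are hypothetical names chosen for illustration. The sketch assumes that arguments and results can be pickled, that tasks take positional arguments only, and that a task should be recomputed when its own bytecode, names or constants change (changes in helper functions it calls are not detected).

```python
import functools
import hashlib
import os
import pickle


def cached_task(cache_dir):
    """Hypothetical decorator: store a task's result on disk and reuse it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            code = func.__code__
            # Skip nested code objects: their repr contains a memory address.
            plain_consts = [c for c in code.co_consts if not hasattr(c, "co_code")]
            # The key covers the task's name, its code and its inputs, so
            # editing a task or changing its inputs leads to a recomputation.
            key_material = pickle.dumps(
                (func.__name__, code.co_code, code.co_names,
                 repr(plain_consts), args)
            )
            key = hashlib.sha256(key_material).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args)
            os.makedirs(cache_dir, exist_ok=True)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


# Toy stand-ins for the first two tasks of the chain (assumed for illustration).
@cached_task("task_cache")
def preprocessing(raw):
    return [text.strip().lower() for text in raw]


@cached_task("task_cache")
def feature_extraction(docs):
    return [len(doc) for doc in docs]


if __name__ == "__main__":
    clean = preprocessing([" Spam ", "EGGS"])
    features = feature_extraction(clean)
    print(features)
```

Because each task’s key includes its inputs, editing one task only invalidates that task and whatever downstream tasks receive a different result from it; the tasks upstream are loaded from disk.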