
Caching computation tasks

When I work on computationally expensive projects (e.g., Machine Learning), I always find myself in the same situation: my programs can be broken down into a chain of tasks, where tasks may depend on the results of other tasks. A typical such chain would be:

preprocessing -> feature-extraction -> training -> evaluation

If I modify my training algorithm and want to re-evaluate it, I need to re-run the “training” and “evaluation” tasks, but I neither need nor want to re-run the “preprocessing” and “feature-extraction” tasks, especially if they take a long time to compute.
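To make the idea concrete, here is a minimal sketch (not the tool's actual code) of how the set of tasks to re-run could be derived from a dependency table. The names `DEPENDS_ON` and `tasks_to_rerun` are made up for this illustration.

```python
# Hypothetical dependency table mirroring the chain above:
# each task maps to the tasks whose results it consumes.
DEPENDS_ON = {
    "preprocessing": [],
    "feature-extraction": ["preprocessing"],
    "training": ["feature-extraction"],
    "evaluation": ["training"],
}


def tasks_to_rerun(changed, depends_on=DEPENDS_ON):
    """Return the changed task plus every task downstream of it."""
    stale = {changed}
    grew = True
    while grew:  # keep sweeping until no new stale task is found
        grew = False
        for task, deps in depends_on.items():
            if task not in stale and any(d in stale for d in deps):
                stale.add(task)
                grew = True
    # Report in the table's declaration order for readability.
    return [t for t in depends_on if t in stale]


rerun_after_training_change = tasks_to_rerun("training")
```

Changing “training” marks only “training” and “evaluation” as stale, so the cached results of the two earlier tasks can be reused.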

At first, I tried to save and load task results manually. This quickly proved unmanageable, so I started to think of ways to automate it. Since I had quite a precise idea of what I wanted, I decided to write my own tool, at the risk of reinventing the wheel (I suspect it’s quite hard to come up with a universal tool, though). To keep things simple, I decided to limit the tool’s scope to projects that can run on a single computer, typically a multi-core one. In particular, it won’t support any kind of distributed computing.

Dependency resolution & Object persistence

Basically, the tool boils down to dependency resolution and object persistence. make is an obvious candidate for dependency resolution, but it can only use file modification times to decide whether to recompute a task. In my tool, I check whether a cached result is available for a task based on the task’s inputs (these can be outputs from previous tasks, files, algorithm parameters…) as well as its source code. The source code is taken into consideration because a task’s result is likely to change if the task’s source code has changed.
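As an illustration of that cache check, the following sketch builds a cache key from a function's source code and its inputs. It is an assumption about how such a key could be computed, not the tool's real implementation; `source_fingerprint`, `cache_key`, `train` and `train_v2` are hypothetical names.

```python
import hashlib
import inspect
import pickle


def source_fingerprint(func):
    """Return bytes that change when the function's code changes."""
    try:
        return inspect.getsource(func).encode("utf-8")
    except (OSError, TypeError):
        # Source is unavailable (e.g. interactive session): fall back to bytecode.
        return func.__code__.co_code


def cache_key(func, *args, **kwargs):
    """Hash the task's code together with its inputs."""
    h = hashlib.sha1()
    h.update(source_fingerprint(func))
    h.update(pickle.dumps((args, sorted(kwargs.items())), protocol=2))
    return h.hexdigest()


def train(features, maxiter=10):
    return sum(features) * maxiter


def train_v2(features, maxiter=10):  # same inputs, different body
    return sum(features) + maxiter


key_a = cache_key(train, [1, 2], maxiter=5)
key_b = cache_key(train, [1, 2], maxiter=6)   # different parameter
key_c = cache_key(train_v2, [1, 2], maxiter=5)  # different code
```

The three keys differ, so a cached result would only be reused when both the code and the inputs are unchanged. Note that this fingerprint covers only the task function itself, not the helpers it calls.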

To store objects on the filesystem or in a database, you need a way to serialize and deserialize them. In the Python world, the obvious choice is pickle, which is also used by the shelve module to store objects in a dbm database behind a dict-like interface. Pickle is quite slow at loading and saving big lists of objects, though, so I created two sqlite-based stores, KeyListStore and ListStore, to address this issue. One difficulty is how to efficiently compute a hash that identifies objects uniquely. To keep things simple, I just took the hash of the pickled objects. Strictly speaking, this is wrong, since pickle doesn’t guarantee that two equal objects serialize to the same string. However, while this can lead to a cache being invalidated incorrectly, forcing the task to be recomputed, it is very unlikely that a cache gets mistaken for the cache of another object. In practice, I haven’t had any problems with the cache so far.
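The caveat about hashing pickled objects can be seen in a few lines. This is a small sketch assuming Python 3.7 or later, where dicts keep their insertion order; `object_hash` is a hypothetical helper, not part of the tool.

```python
import hashlib
import pickle


def object_hash(obj):
    """Hash an object's pickled bytes (protocol pinned for stable output)."""
    return hashlib.sha1(pickle.dumps(obj, protocol=2)).hexdigest()


a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # equal to `a`, but built in a different order

same_object_twice = object_hash(a) == object_hash(a)
equal_objects_match = object_hash(a) == object_hash(b)
```

Here `a == b` holds, yet the two hashes differ because pickle writes the dict entries in insertion order. The effect is a spurious cache miss (an unnecessary recomputation), not a wrong result being served.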

One Python feature that was particularly useful for this tool is decorators. They can be used to change the behavior of a function in a declarative style.
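To show how a decorator can add caching without touching the function body, here is a minimal, self-contained sketch. It is not the tool's decorator: `cached`, `CACHE_DIR` and `square_all` are hypothetical, and the key ignores source code to keep the example short.

```python
import functools
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp(prefix="taskcache-")  # hypothetical cache location


def cached(func):
    """Persist func's result on disk, keyed by its name and arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        payload = pickle.dumps(
            (func.__name__, args, sorted(kwargs.items())), protocol=2)
        path = os.path.join(
            CACHE_DIR, hashlib.sha1(payload).hexdigest() + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as fh:  # cache hit: skip the computation
                return pickle.load(fh)
        result = func(*args, **kwargs)
        with open(path, "wb") as fh:  # cache miss: compute, then store
            pickle.dump(result, fh, protocol=2)
        return result
    return wrapper


calls = []


@cached
def square_all(values):
    calls.append(values)  # records how often the body really runs
    return [v * v for v in values]


first = square_all([1, 2, 3])
second = square_all([1, 2, 3])  # served from the cache file
```

The second call returns the same list, but the function body runs only once.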

Example

Here’s a concrete example of a program written with my tool:

import sys
import numpy as np
import taskmanager as tm
 
DFLT_TRAIN = "/path/to/..."
DFLT_EVAL = "/path/to/..."
 
# Preprocessing
 
def preprocess(img_folder, normalize):
    # preprocess_img is assumed to be defined elsewhere (not shown here)
    return [preprocess_img(img, normalize)
            for img in img_folder.get_files()]
 
@tm.task(tm.directory("*.jpg"), bool)
def preproc_train(img_folder=DFLT_TRAIN, normalize=False):
    return preprocess(img_folder, normalize)
 
@tm.task(tm.directory("*.jpg"), bool)
def preproc_eval(img_folder=DFLT_EVAL, normalize=False):
    return preprocess(img_folder, normalize)
 
# Feature extraction
 
@tm.task(preproc_train)
def fextract_train(images):
    return [...]
 
@tm.task(preproc_eval)
def fextract_eval(images):
    return [...]
 
# Training
 
@tm.task(fextract_train, int, float)
def train(features, maxiter=10, eps=0.0001):
    return [...]
 
# Evaluation results
 
@tm.task(fextract_eval, train)
def evaluate(features, models):
    return [...]
 
@tm.task(evaluate)
@tm.nocache  # assuming nocache is provided by the taskmanager module
def results(eval_res):
    print([...])
 
def main():
    try:
        tm.TaskManager.OUTPUT_FOLDER = "./tmp"
        tm.run_command(sys.argv[1:])
    except tm.TaskManagerError as m:
        sys.stderr.write("%s\n" % m)
 
if __name__ == "__main__":
    main()

A few things to note:

  • The tool is quite unobtrusive.
  • There’s no need to deal with file names or file versions; this is all done transparently for you.
  • You get a command-line interface for free. Here, since every task parameter has a default value, you could just run “./mypgm.py results” and it would work. If you wanted to try out different parameters for, e.g., “train”, you could run “./mypgm.py train:5 results”.
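For illustration only, here is one way arguments such as “train:5” could be split into a task name and typed parameters. How the tool really parses its command line is not shown in this post, so the colon-separated format for several parameters and the helper names below are assumptions.

```python
def parse_task_spec(spec):
    """Split 'name:arg1:arg2' into the task name and raw string arguments."""
    name, _, rest = spec.partition(":")
    raw_args = rest.split(":") if rest else []
    return name, raw_args


def convert_args(raw_args, converters):
    """Apply the declared types (e.g. int, float) to the raw strings."""
    return [conv(raw) for conv, raw in zip(converters, raw_args)]


# Hypothetical command line: ./mypgm.py train:5 results
specs = [parse_task_spec(s) for s in ["train:5", "results"]]

# The train task above declares (int, float) for maxiter and its tolerance.
train_args = convert_args(specs[0][1], [int, float])
```

With this input, “train” receives a single converted argument (the integer 5) and “results” receives none, so its defaults apply.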

Code

Code available here.

Everything is kept in one file to make it easy to copy the tool to another project. The tool is already quite usable, but of course it’s still a work in progress.
