Dictzip reader in Ruby

Both Ruby and Python have classes in their standard libraries to transparently read gzip-compressed files. This is very convenient because you can read compressed files just as you would normal files. However, random file access (i.e. moving the file position indicator to an arbitrary offset, as fseek does) is not possible without reading the file serially from the beginning up to that offset. Because the file is compressed, there’s no way to know where a given portion of the uncompressed file lies in the compressed file. Decompressing everything that comes before the wanted portion is unacceptable for large files and would be damn slow.
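To make the cost concrete, here is a minimal sketch of "seeking" in a plain gzip stream with Ruby's standard Zlib::GzipReader. The helper name gzip_read_at is made up for this illustration; the point is only that every byte before the wanted offset has to be inflated and thrown away.

```ruby
require "zlib"

# Hypothetical helper (name invented for this example): returns +length+
# uncompressed bytes starting at uncompressed +offset+ of a plain gzip stream.
def gzip_read_at(io, offset, length)
  # wrap closes the reader (and the underlying io) when the block returns.
  Zlib::GzipReader.wrap(io) do |gz|
    gz.read(offset)   # serial access: inflate and discard everything before offset
    gz.read(length)   # only now can the wanted bytes be read
  end
end
```

The work done grows with the offset, so reading a definition near the end of a large dictionary means decompressing nearly the whole file.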

This is the problem dictzip solves. dictzip files are perfectly legal gzip files, so they can be uncompressed with gunzip and zcat. However, as permitted by the gzip specification, they carry additional information in their header that libraries can take advantage of to provide pseudo-random access to the compressed data.
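As a rough sketch of what reading that additional information can look like, the code below parses the gzip header and looks for the dictzip subfield. It assumes the layout described in the dictzip man page: an extra-field subfield tagged "RA" holding a version, the uncompressed chunk length, the chunk count, then one 16-bit little-endian compressed size per chunk. The names DictzipHeader and read_dictzip_header are invented for this example and are not Fantasdic's API.

```ruby
# Illustrative only: struct and method names are hypothetical.
DictzipHeader = Struct.new(:chunk_len, :comp_sizes, :data_start)

# gzip FLG bits (RFC 1952)
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def read_dictzip_header(io)
  id1, id2, _cm, flg = io.read(4).unpack("C4")
  raise ArgumentError, "not a gzip file" unless id1 == 0x1f && id2 == 0x8b
  raise ArgumentError, "no gzip extra field" if (flg & FEXTRA).zero?
  io.read(6)                          # skip MTIME (4 bytes), XFL, OS
  xlen  = io.read(2).unpack1("v")     # length of the whole extra field
  extra = io.read(xlen)

  # The extra field is a sequence of subfields: 2-byte id, 2-byte length, data.
  header = nil
  i = 0
  while i + 4 <= extra.bytesize
    id, len = extra.byteslice(i, 4).unpack("a2v")
    if id == "RA"
      _ver, chunk_len, count = extra.byteslice(i + 4, 6).unpack("v3")
      sizes = extra.byteslice(i + 10, 2 * count).unpack("v*")
      header = DictzipHeader.new(chunk_len, sizes, nil)
    end
    i += 4 + len
  end
  raise ArgumentError, "no dictzip (RA) subfield" unless header

  # Skip the optional zero-terminated file name and comment, then the header CRC.
  io.each_byte { |b| break if b.zero? } unless (flg & FNAME).zero?
  io.each_byte { |b| break if b.zero? } unless (flg & FCOMMENT).zero?
  io.read(2) unless (flg & FHCRC).zero?
  header.data_start = io.pos          # first compressed chunk starts here
  header
end
```

Tools that do not know about the "RA" subfield simply skip the extra field, which is why gunzip and zcat still work on these files.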

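How is it achieved? Basically, the original uncompressed file is divided into chunks of equal size. Chunks must be below 64kB. Each chunk is compressed separately, then the number of chunks and the size of each chunk (in its compressed form) are kept in a table in the gzip file header, along with the uncompressed chunk size. Because all chunks have the same uncompressed size, dividing the wanted offset by that size gives the chunk number, and summing the compressed sizes of the preceding chunks gives that chunk’s position in the compressed file. Thanks to this table, when we want to read only an arbitrary portion of the original uncompressed file, we can work out which chunks the portion falls in, uncompress only those chunks and return the wanted portion of the file.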
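The lookup described above boils down to a little arithmetic. The following sketch shows one way to do it; the method name chunks_for and its parameters are hypothetical, chosen only for this illustration.

```ruby
# Hypothetical helper, not taken from Fantasdic.
#   pos, size  - wanted range in the uncompressed file
#   chunk_len  - uncompressed size shared by all chunks (the last may be shorter)
#   comp_sizes - compressed size of each chunk, in file order
#   data_start - offset in the compressed file where the first chunk begins
# Returns, for each chunk overlapping the range, where its compressed bytes live.
def chunks_for(pos, size, chunk_len, comp_sizes, data_start)
  return [] if size <= 0
  first = pos / chunk_len
  last  = [(pos + size - 1) / chunk_len, comp_sizes.size - 1].min
  return [] if first > last           # range starts past the end of the file

  # Compressed offset of the first chunk: skip all chunks before it.
  offset = data_start + comp_sizes.first(first).sum
  (first..last).map do |i|
    entry = { index: i, offset: offset, length: comp_sizes[i] }
    offset += comp_sizes[i]
    entry
  end
end
```

For example, with chunk_len = 64 and compressed sizes [10, 20, 30, 40] starting at offset 1000, the range pos = 100, size = 50 covers chunks 1 and 2, found at compressed offsets 1010 and 1030. Note that the table must hold compressed sizes for this to work: the uncompressed size is the same for every chunk and tells us nothing about where a chunk sits in the file.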

As the dictzip man page states, there’s a tradeoff. True random file access is not realized: any access, even for a single byte, requires that a 64kB chunk be read and decompressed. This is slower than accessing a flat text file, but is much, much faster than performing serial access on a fully compressed file. Space-wise, a dictzip file is only about 4% larger than the same file compressed all at once.

dictzip is currently used by the dictd server and the Stardict dictionary application to store dictionaries. In both cases, dictionaries have an associated index file, allowing fast lookups. Thanks to dictzip, the body of the definitions found can be quickly accessed, directly in the compressed dictionary file.

In order to be able to use dictionaries made for dictd or Stardict in Fantasdic, I wrote a pure Ruby class to read dictzip files. Currently it supports only the “read” and “pos=” methods, but it would be interesting to add more IO methods. That class is only a reader. dictzip files can be created with the “dictzip” utility. In Debian/Ubuntu, it is available in the eponymous package. If you’re interested, writing a Dictzip writer in Ruby should not be too hard and should be a good exercise.
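To give an idea of what such a class involves, here is a much reduced sketch offering only "read" and "pos=". It is not the Fantasdic code: the class name ChunkedReader is made up, and the chunk length, the table of compressed sizes and the start offset of the data are assumed to have been extracted from the header already. It also assumes each chunk can be inflated on its own as a raw deflate stream, which holds when the writer flushes the compressor fully after every chunk.

```ruby
require "zlib"

# Illustrative mini reader; names are hypothetical, not Fantasdic's implementation.
class ChunkedReader
  attr_accessor :pos   # current position in the UNcompressed data

  def initialize(io, chunk_len, comp_sizes, data_start)
    @io, @chunk_len, @comp_sizes = io, chunk_len, comp_sizes
    # Precompute the compressed-file offset at which each chunk begins.
    @offsets = []
    comp_sizes.inject(data_start) { |off, sz| @offsets << off; off + sz }
    @pos = 0
  end

  # Returns up to +size+ uncompressed bytes from the current position.
  def read(size)
    return "".b if size <= 0
    first = @pos / @chunk_len
    last  = [(@pos + size - 1) / @chunk_len, @comp_sizes.size - 1].min
    data  = (first..last).map { |i| inflate_chunk(i) }.join
    out   = data.byteslice(@pos - first * @chunk_len, size) || "".b
    @pos += out.bytesize
    out
  end

  private

  # Inflates chunk +i+ alone; negative window bits select raw deflate
  # (no zlib or gzip wrapper around the data).
  def inflate_chunk(i)
    @io.pos = @offsets[i]
    raw = @io.read(@comp_sizes[i])
    z = Zlib::Inflate.new(-Zlib::MAX_WBITS)
    z.inflate(raw)
  ensure
    z&.close
  end
end
```

A real reader would probably cache the most recently inflated chunk, since dictionary lookups tend to hit neighbouring offsets, and would add further IO methods on top of these two.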

Code available in Fantasdic’s SVN (250 lines)

4 Responses to “Dictzip reader in Ruby”

  1. Daniel Says:

    Are you using dictzip on mac OSX? I’ve been using it in linux, but just switched to mac. dictzip doesn’t seem to be available thru macports, and I’d rather not use it in a linux vm if I can avoid it…

    Also, is development on Fantasdic continuing?

  2. Mathieu Says:

    You can compile it from source (get the source from http://sourceforge.net/projects/dict/files/dictd/).

    Unfortunately, I don’t have time for Fantasdic anymore…

  3. coder Says:

    There is a significant typo in the 3rd paragraph
    (in its compressed form)
    should read
    (in its UNcompressed form)

  4. Mathieu Says:

    No, what I wrote is correct. How would you read the compressed chunk if the size was the one of the uncompressed form?…