Dictzip reader in Ruby
Both Ruby and Python have classes in their standard library to transparently read gzip-compressed files. This is very convenient because you can read compressed files just as you would normal files. However, random file access (i.e. moving the file position indicator to an arbitrary offset, as fseek does) is not possible without reading the file serially up to that offset. Because the file is compressed, there’s no way to know where a given portion of the uncompressed file is in the compressed file. Decompressing the whole file is unacceptable for large files and would be damn slow.
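To make the problem concrete, here is a minimal sketch using Ruby's Zlib::GzipReader. The helper name read_at is made up for this example. The only way it can reach a given offset is to decompress and discard everything before it.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: return `length` bytes found at `offset` in the
# uncompressed data. Zlib::GzipReader cannot jump into the middle of the
# stream, so everything before `offset` is decompressed and thrown away.
def read_at(gzipped, offset, length)
  gz = Zlib::GzipReader.new(StringIO.new(gzipped))
  gz.read(offset) # serial access: the cost grows with the offset
  gz.read(length)
ensure
  gz&.close
end

data = "abcdefghij" * 1000
read_at(Zlib.gzip(data), 5003, 3) # => "def"
```

This works, but reading three bytes near the end of a large dictionary means decompressing almost all of it.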
This is the problem dictzip addresses. dictzip files are perfectly legal gzip files. As a result, they can be uncompressed with gunzip and zcat. However, as permitted by the gzip specification, they contain additional information in their header that libraries can take advantage of to provide pseudo-random access to the compressed files.
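The "additional information" lives in the optional extra field of the gzip header (the FEXTRA flag in RFC 1952). As a rough sketch, and not the code used in Fantasdic, this is how the subfields of that extra field could be pulled out. The method name gzip_extra_subfields and the "XY" subfield in the sample header are made up for illustration.

```ruby
require "stringio"

FEXTRA = 0x04 # FLG bit meaning "an extra field follows" (RFC 1952)

# Returns a hash mapping each two-character subfield ID to its raw data.
def gzip_extra_subfields(io)
  id1, id2, _method, flags = io.read(4).unpack("C4")
  raise ArgumentError, "not a gzip stream" unless id1 == 0x1f && id2 == 0x8b
  io.read(6) # skip MTIME (4 bytes), XFL and OS
  return {} if (flags & FEXTRA).zero?

  xlen = io.read(2).unpack1("v") # little-endian 16-bit total length
  extra = StringIO.new(io.read(xlen))
  subfields = {}
  until extra.eof?
    id, len = extra.read(4).unpack("a2v") # SI1+SI2, then the data length
    subfields[id] = extra.read(len)
  end
  subfields
end

# Hand-built header with one made-up subfield "XY" holding three 16-bit values.
payload = [10, 20, 30].pack("v*")
extra   = "XY".b + [payload.bytesize].pack("v") + payload
header  = [0x1f, 0x8b, 8, FEXTRA].pack("C4") + [0].pack("V") + [0, 3].pack("C2") +
          [extra.bytesize].pack("v") + extra

fields = gzip_extra_subfields(StringIO.new(header))
fields["XY"].unpack("v*") # => [10, 20, 30]
```

If I read the dictzip man page right, dictzip stores its data in a subfield identified by "RA". Tools that don't know about it, like gunzip, simply skip the extra field.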
How is it achieved? Basically, the original uncompressed file is divided into chunks of equal size (the last one may be shorter). Chunks must be below 64kB. Each chunk is compressed separately, then the number of chunks and the size of each chunk (in its compressed form) are kept in a table in the gzip file header. Thanks to this table, when we want to read only an arbitrary portion of the original uncompressed file, we can work out which chunks the portion is in, uncompress only those chunks and return the wanted portion of the file.
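The idea can be sketched in a few lines of Ruby. This is a simplified illustration, not the real dictzip format: each chunk is compressed here as a standalone raw deflate stream, the table of compressed sizes is kept in a plain Ruby array instead of the gzip header, and the chunk length of 1024 bytes and the method names are arbitrary choices for the example.

```ruby
require "zlib"

CHUNK_LENGTH = 1024 # arbitrary uncompressed chunk size for this sketch

# Compress each chunk on its own and record its compressed size.
def compress_in_chunks(data, chunk_length = CHUNK_LENGTH)
  body = "".b
  sizes = []
  (0...data.bytesize).step(chunk_length) do |start|
    deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
    compressed = deflater.deflate(data.byteslice(start, chunk_length), Zlib::FINISH)
    deflater.close
    body << compressed
    sizes << compressed.bytesize
  end
  [body, sizes]
end

# Return `length` bytes at `offset` of the original data, inflating only
# the chunks that overlap the requested range.
def read_range(body, sizes, offset, length, chunk_length = CHUNK_LENGTH)
  first = offset / chunk_length
  return "".b if length <= 0 || first >= sizes.size
  last = [(offset + length - 1) / chunk_length, sizes.size - 1].min

  # The first needed chunk starts after all the earlier compressed chunks.
  pos = sizes.first(first).sum
  out = "".b
  (first..last).each do |i|
    inflater = Zlib::Inflate.new(-Zlib::MAX_WBITS)
    out << inflater.inflate(body.byteslice(pos, sizes[i]))
    inflater.close
    pos += sizes[i]
  end
  out.byteslice(offset - first * chunk_length, length)
end

data = (0...5000).map { |i| (97 + i % 26).chr }.join
body, sizes = compress_in_chunks(data)
read_range(body, sizes, 1000, 100) == data.byteslice(1000, 100) # => true
```

The request above straddles the boundary between the first two chunks, so only those two are inflated; the other three are never touched.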
As the dictzip man page states, there’s a tradeoff. True random file access is not realized: any access, even for a single byte, requires that a 64kB chunk be read and decompressed. This is slower than accessing a flat text file, but is much, much faster than performing serial access on a fully compressed file. Space-wise, a dictzip file is only about 4% larger than the same file compressed all at once.
dictzip is currently used by the dictd server and the Stardict dictionary application to store dictionaries. In both cases, dictionaries have an associated index file, allowing fast lookups. Thanks to dictzip, the body of the definitions found can be quickly accessed, directly in the compressed dictionary file.
In order to be able to use dictionaries made for dictd or Stardict in Fantasdic, I wrote a pure Ruby class to read dictzip files. Currently it supports only the “read” and “pos=” methods, but it would be interesting to add more IO methods. That class is only a reader. dictzip files can be created with the “dictzip” utility. In Debian/Ubuntu, it is available in the eponymous package. If you’re interested, writing a Dictzip writer in Ruby should not be too hard and should be a good exercise.
Code available in Fantasdic’s SVN (250 lines)