Data compression for AIRR format data

laserson · August 24, 2016, 4:33pm

@schristley rightly pointed out on our call that the annotated rearrangement files grow considerably in size compared with the sequencing data, especially when using text formats like CSV/XML. As data sets grow larger, it will be important to compress them on disk. We should adopt a recommended (required) compression scheme for the AIRR data.

One crucial aspect of compression is ensuring that it’s “splittable”, meaning that you can point at some random place in the file and begin decompressing the data from there. Plain GZIP does NOT allow this behavior, so reading just the last chunk of a file requires you to decompress the entire thing. I would suggest that any tools/pipelines that compress these data files use a splittable format like BGZF. I believe Bzip2 and LZO are also splittable (LZO only with a separate index, as far as I know).
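To make the splittable idea concrete, here is a minimal Python 3 sketch of block-wise compression. It is not the actual BGZF implementation: real BGZF records each block's size in a gzip extra field and caps blocks at 64 KiB, whereas this sketch keeps a separate in-memory list of offsets. The function names `compress_blocks` and `read_block`, the block size, and the tab-separated records are all made up for illustration.

```python
import gzip
import io


def compress_blocks(records, block_size=1000):
    """Compress text records as independent gzip members.

    Returns (data, index), where index[i] is the byte offset
    at which compressed block i starts in data.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(records), block_size):
        chunk = "".join(records[start:start + block_size]).encode("utf-8")
        index.append(out.tell())
        # Each chunk becomes a complete, self-contained gzip member.
        out.write(gzip.compress(chunk))
    return out.getvalue(), index


def read_block(data, index, i):
    """Decompress only block i, without touching the blocks before it."""
    end = index[i + 1] if i + 1 < len(index) else len(data)
    return gzip.decompress(data[index[i]:end]).decode("utf-8")


# Hypothetical annotation rows, used only as placeholder data.
records = ["seq%d\tIGHV1-2\tIGHJ4\n" % n for n in range(5000)]
data, index = compress_blocks(records)

# Jump straight to the final block instead of decompressing everything.
last_block = read_block(data, index, len(index) - 1)
```

The cost of this design is a slightly worse compression ratio, since each block starts with an empty dictionary, in exchange for being able to start reading at any block boundary.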

javh · August 26, 2016, 3:44pm

I think xz (LZMA) is splittable. It also provides a high compression ratio, but at significant CPU cost. Compression ratio vs CPU demand is another important thing to consider. This balance may differ between local analysis and servers, like VDJServer, depending upon how limited storage space is.
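A rough way to measure that tradeoff on your own data is sketched below in Python 3, using only the standard library codecs at their default settings. The `benchmark` helper and the synthetic payload are assumptions for illustration; real annotation files will compress differently, so treat any numbers it prints as indicative only.

```python
import bz2
import gzip
import lzma
import time


def benchmark(payload):
    """Return {codec_name: (compressed_size_bytes, seconds)}."""
    codecs = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "xz": lzma.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(payload))
        results[name] = (size, time.perf_counter() - start)
    return results


# Synthetic placeholder rows; substitute a real annotation file here.
payload = "".join(
    "seq%d\tIGHV1-2\tIGHJ4\tCARDYW\n" % n for n in range(20000)
).encode("utf-8")

results = benchmark(payload)
for name, (size, seconds) in results.items():
    print("%-5s ratio=%.1fx time=%.3fs" % (name, len(payload) / size, seconds))
```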

Also, ease of implementation is a concern. For example, xz is not currently supported in the standard library of Python 2, but is supported in Python 3. Base R (i.e., read.table) supports gzip, bzip2 and xz.
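For reference, this is roughly what transparent reading looks like in Python 3, where all three codecs ship with the standard library. The `open_text` helper and its suffix table are hypothetical, and the sketch picks the codec from the file extension rather than inspecting the file contents.

```python
import bz2
import gzip
import lzma

# Map file suffixes to the stdlib opener for that codec.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path, mode="rt"):
    """Open a possibly compressed text file, choosing the codec by suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, mode, encoding="utf-8")
    # No known suffix: fall back to an uncompressed file.
    return open(path, mode, encoding="utf-8")
```

A caller would then iterate over lines the same way regardless of compression, e.g. `with open_text("rearrangements.tsv.gz") as fh:` (the file name is a placeholder).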

Without thinking about it too hard, I suspect bzip2 is going to be the ideal choice.

dooley · August 27, 2016, 4:01pm

It might be worth giving a nod to algorithms that have solid JavaScript libraries, so that compressed sections can be streamed and queried for realtime viz using standard HTTP semantics.

schristley · August 29, 2016, 4:46pm

The SAM/BAM specification uses BGZF. They also have a separate CRAM specification for a specialized compression scheme. They also seem to support a zlib-based format, which they call Random Access compressed(Z) Format (RAZF). We can use their code as a starting point for our implementation.

laserson · September 1, 2016, 8:34pm

Do you have a sense for which algos those are? I would guess that most of the popular compression algos have good impls in javascript.

laserson · September 1, 2016, 8:38pm

Facebook recently published a post on Zstandard, which is generally interesting, though I wouldn’t advocate we use it.

But it does reinforce that gzip/zlib probably offers some of the best tradeoffs. IIRC, the BGZF version should be compatible with gzip.
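The compatibility comes from BGZF being a series of concatenated gzip members, which standard gzip readers already handle. A minimal Python 3 sketch of that property, using ordinary gzip members rather than real BGZF blocks and made-up rows as data:

```python
import gzip

# Two independently compressed blocks (placeholder rows).
block_a = gzip.compress(b"seq1\tIGHV1-2\n")
block_b = gzip.compress(b"seq2\tIGHV3-23\n")

# Concatenating gzip members yields a valid multi-member gzip stream.
combined = block_a + block_b

# A standard gzip reader decompresses every member in order.
restored = gzip.decompress(combined)
```

So a tool that knows nothing about the block structure can still read the whole file, while a block-aware tool can seek to individual blocks.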