@schristley rightly pointed out on our call that the annotated rearrangement files are considerably larger than the underlying sequencing data, especially when using text formats like CSV/XML. As data sets grow larger, it will be important to compress them on disk. We should adopt a recommended (or possibly required) compression scheme for the AIRR data.
One crucial aspect of compression is ensuring that it’s “splittable”, meaning that you can jump to a block boundary somewhere in the middle of the file and begin decompressing the data from there. Plain GZIP does NOT allow this: a standard gzip file is a single compressed stream, so reading just the last chunk of a file requires you to decompress everything before it. I would suggest that any tools/pipelines that compress these data files use a splittable format like BGZF, which has the added benefit that existing gzip tools can still read it. I believe Bzip2 is also splittable at its block boundaries, and that LZO can be made splittable if an index is built alongside the file.
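
To make the idea concrete, here is a minimal sketch in Python (standard library only) of block-wise compression. It is not BGZF itself: real BGZF additionally records each block's size in the gzip extra field and caps blocks at 64 KiB. The sketch only illustrates the underlying trick of writing many small, independent gzip members back to back. The function names, the block size of 1000 records, and the two-column records are all made up for illustration.

```python
import gzip
import zlib


def compress_blocks(records, records_per_block=1000):
    """Compress records as concatenated gzip members.

    Returns the compressed bytes and the byte offset at which each
    member (block) starts. The offsets act as a simple index.
    """
    out = bytearray()
    offsets = []
    for start in range(0, len(records), records_per_block):
        block = "".join(records[start:start + records_per_block])
        offsets.append(len(out))
        out += gzip.compress(block.encode("utf-8"))
    return bytes(out), offsets


def read_block(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    # The decompressor stops at the end of the first member it sees.
    decomp = zlib.decompressobj(wbits=31)
    return decomp.decompress(data[offset:]).decode("utf-8")


# Hypothetical tab-separated rearrangement records.
records = [f"seq{i}\tIGHV1-2\n" for i in range(2500)]
data, offsets = compress_blocks(records)

# Read the final block without touching the earlier ones.
last_block = read_block(data, offsets[-1])

# The concatenation is still a valid gzip stream for ordinary readers.
everything = gzip.decompress(data).decode("utf-8")
```

Here `read_block` only decompresses the block it is pointed at, while `gzip.decompress` (or any ordinary gzip reader) can still read the whole file from the start. That combination is what makes a format like BGZF attractive for large files that get processed in parallel chunks.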