I would also strongly recommend adopting a modern serialization framework for this, i.e., one of Apache Thrift, Apache Avro, or Google Protocol Buffers. Much has been written about their advantages; there is a good summary here:
The frameworks have very efficient binary encodings, interoperate with JSON, offer RPC IDLs, support schema evolution (illustrated below), and generate efficient in-memory models in many programming languages, all of which read and write the same binary format. They are also all compatible with the Parquet file format, which is extremely efficient for analytical queries and compression and well suited to cloud object storage. All of these tools are industry standard at big-data companies and are actively maintained by the open-source community. Most importantly, they unburden scientists from the occasional horror show of file-format and encoding issues.
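To make the schema-evolution point concrete, here is a minimal sketch using Apache Avro via the fastavro Python library; the `Measurement` record and its field names are hypothetical, chosen only for illustration:

```python
from io import BytesIO
from fastavro import parse_schema, reader, writer

# Version 1 of a (hypothetical) record schema.
schema_v1 = parse_schema({
    "name": "Measurement",
    "type": "record",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "value", "type": "double"},
    ],
})

# Version 2 adds a field with a default, so old data stays readable.
schema_v2 = parse_schema({
    "name": "Measurement",
    "type": "record",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "value", "type": "double"},
        {"name": "units", "type": "string", "default": "unknown"},
    ],
})

# Write records under the old schema...
buf = BytesIO()
writer(buf, schema_v1, [{"sensor_id": "a1", "value": 0.5}])

# ...and read them back under the new one; the missing field is
# filled in from its default, with no manual migration step.
buf.seek(0)
for record in reader(buf, reader_schema=schema_v2):
    print(record)  # {'sensor_id': 'a1', 'value': 0.5, 'units': 'unknown'}
```

The same schema-first discipline carries over to analytics: once records live in a schema'd container like this, standard tooling (e.g., pyarrow) can rewrite them as Parquet for columnar queries and compression.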