Great comments and excellent ideas came out of the AIRR community meeting. The VDJML team is very interested in helping define a new community standard. One thing that came out of the meeting is the different preferences people had towards file formats: TSV, JSON, XML, etc. Here @laserson comments are spot on in that we should use a serialization framework that allows any of these file formats to be generated from some base data representation, taking that issue out of the equation and let’s us focus on the data model. I’d like to start with a prototype implementation but my first question is which of these frameworks should we use?
While our general goal is to define a common data model for VDJ annotation, I think we need to specifically allow for extensions(maybe not right word) to that data model. Each tool will not provide all annotation, they may in fact only provide some piece of the total annotation based upon their specific algorithm. For example, a tool with a specialized algorithm to define clones would only stick in annotation information about clones. We want a data model that
- Allows a tool to check what annotation data is in the file, and inform the user when the file is missing annotations needed by the algorithm, e.g. “tool X needs VJ gene segments calls to perform its analysis, but this annotation data is missing from the input file”
- Allows a tool to add annotation data without overwriting the existing annotations. That is, I might want to run multiple clonal analysis tools that each write the same annotation information, but I want those results to be kept for each tool. Maybe I have a tool that takes those and does some ensemble analysis.
Apache Thrift looks interesting but it seems to focused more on defining interfaces versus a data definition language.
Apache Avro has the nice feature that the data schema is embedded with the data, which is great for allowing extensions and different versions of the data model that tools can utilize without recompiling, etc. It also seems to support many languages. It is unclear whether there are automatic conversions tools, i.e. transform the Avro binary file into a TSV for example?
Protocol Buffers sounds good as it is meant to be smaller and faster XML, which makes sense as I don’t think we need any of the complicated data model features in XML. It doesn’t provide the schema with the data, so reflection is harder to do (I think), we would need to carefully design the data model so tools can check whether specific annotation data is available or not.
Of course XML has a data schema language (XSD) for defining the data model, which is what we have for VDJML V1. In theory this should allow the same things that Avro and Protocol Buffers allow with code generation, automatic parsing and etc. However, it seems to me that the XML community didn’t really go this route, and many codes essentially hard-code the interpretation of the data model. I do see some stuff for Java but not much for other languages, anybody know? I suppose another point is that XSD allows for such a sophisticated data model, that this ends up being a complicated task.
Are there any other frameworks we should consider? I tend to like Avro because the schema is embedded with the data, is there any advantage to use Protocol Buffers over Avro?