Standardizing the format of a germline set

w.lees · September 23, 2016, 1:54pm

Thanks to @laserson @werner.muller @caschramm and @ematsen for the points raised this week.

Following discussion on the backend design thread, it looks as though we will only need a small number of naming schemes and numbering schemes within each species. For naming, for example, we may wish to have an ‘AIRR name’ and a ‘Legacy name’ against each germline, but probably no other names will crop up. Similarly, for numbering, we may wish to list IMGT, Kabat and Chothia schemes, but again probably no others, or at most a very small number.

That being the case, the structure I proposed, with independent numbering and naming sections that can be authored by different teams and split out at will, is looking a bit over-complicated for the task. Over the weekend I propose to de-normalise it, moving instead to a single table with fields for alternate names and numbering. At the same time I will add the additional fields that @werner.muller has proposed and the naming convention that @laserson has proposed, and do another comb-through for any further fields. If anyone has concerns about this approach, please let me know and we can hold off and discuss, but it seems to reflect the way we are all thinking.
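As a rough illustration of the single-table idea, one row might carry its alternate names and its numbering directly, along the lines of the Python sketch below. Every field name here (`airr_name`, `legacy_name`, `numbering` and so on) is a placeholder rather than an agreed part of the schema, and the sequence and position labels are dummy values, not real germline data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GermlineRecord:
    """One row of a de-normalised germline table (all field names hypothetical)."""

    airr_name: str    # the proposed 'AIRR name'
    legacy_name: str  # the 'Legacy name' kept alongside it
    species: str
    sequence: str     # nucleotide sequence (dummy value in the example below)
    # One entry per numbering scheme; each list holds one position label per codon.
    numbering: Dict[str, List[str]] = field(default_factory=dict)


# Dummy record: names, sequence and position labels are placeholders only.
record = GermlineRecord(
    airr_name="EXAMPLE-V1*01",
    legacy_name="EXAMPLE_LEGACY_V1",
    species="Homo sapiens",
    sequence="CAGGTGCAG",
    numbering={
        "imgt": ["1", "2", "3"],
        "kabat": ["1", "2", "3"],
        "chothia": ["1", "2", "3"],
    },
)
```

Because only a handful of naming and numbering schemes are expected, a fixed pair of name fields and a small map keyed by scheme seems sufficient, with no need for separate tables.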

We also seem to be moving towards consensus on JSON as a file format. It looks as though implementation has already started on that basis, which is fine by me: it’s always struck me as a good choice, even though I’ve really been focussed on the logical level. If anyone has issues with JSON as a file format for the ‘full’ germline set, please post something here so that we can resolve them; otherwise I think it becomes our choice.
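For anyone who wants to see what a JSON ‘full’ germline set could look like in practice, here is a minimal sketch using Python's standard `json` module. The layout (a `species` key plus a `germlines` list) and all field names are assumptions made for illustration, not the agreed schema, and the values are dummy data.

```python
import json

# Hypothetical layout for a 'full' germline set; keys and values are placeholders.
germline_set = {
    "species": "Homo sapiens",
    "germlines": [
        {
            "airr_name": "EXAMPLE-V1*01",
            "legacy_name": "EXAMPLE_LEGACY_V1",
            "sequence": "CAGGTGCAG",
            "numbering": {
                "imgt": ["1", "2", "3"],
                "kabat": ["1", "2", "3"],
                "chothia": ["1", "2", "3"],
            },
        }
    ],
}

# Serialise to text, then parse it back to check that nothing is lost.
text = json.dumps(germline_set, indent=2, sort_keys=True)
restored = json.loads(text)
```

The nested names and per-scheme numbering map onto JSON objects and arrays without any extra encoding, which is part of what makes JSON a comfortable fit here.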

There is also a desire for a simple cut-down germline set for parsers, as @caschramm highlighted. It feels to me as though the priority right now is to get the full schema defined. It should be easy enough to pick out the fields for the cut-down version, and to provide it in a number of formats if that proves necessary. Again, if anyone has concerns about putting the priority on the full schema right now, please post here and we can discuss; otherwise I suggest we come back to the cut-down version and its format(s) once the full schema is ok with everyone.
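To show why the cut-down version should be straightforward once the full schema exists, the sketch below picks a subset of fields from full records and writes them out in two formats. The choice of fields, the field names, and the use of FASTA and tab-separated text as output formats are all assumptions for illustration; nothing here has been agreed.

```python
import csv
import io

# Hypothetical choice of fields for the cut-down set.
CUT_DOWN_FIELDS = ("airr_name", "sequence")

# Dummy 'full' records; names and sequences are placeholders.
full_records = [
    {"airr_name": "EXAMPLE-V1*01", "legacy_name": "EXAMPLE_LEGACY_V1",
     "sequence": "CAGGTGCAG", "numbering": {"imgt": ["1", "2", "3"]}},
    {"airr_name": "EXAMPLE-V2*01", "legacy_name": "EXAMPLE_LEGACY_V2",
     "sequence": "GAGGTGCAG", "numbering": {"imgt": ["1", "2", "3"]}},
]


def cut_down(records, fields=CUT_DOWN_FIELDS):
    """Keep only the selected fields from each full record."""
    return [{name: rec[name] for name in fields} for rec in records]


def to_fasta(records):
    """Render cut-down records as FASTA text, one entry per germline."""
    return "".join(f">{rec['airr_name']}\n{rec['sequence']}\n" for rec in records)


def to_tsv(records, fields=CUT_DOWN_FIELDS):
    """Render cut-down records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


simple = cut_down(full_records)
fasta_text = to_fasta(simple)
tsv_text = to_tsv(simple)
```

Since the cut-down set is derived mechanically from the full one, adding another output format later would only mean adding another small rendering function.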

Thanks

William