The Germline DB WG has asked for a summary of the position on this at its next meeting, which will be at the end of the month. With that in mind I have trawled through all the relevant threads in the forum and have put an initial draft together in this post. Could people please help develop this - what issues should be added? Which of the points on the list can we discuss and resolve now? And does the overall development approach seem sensible?
Purpose of the Germline Set Format
The format is intended to publish germline sets that can be imported by analysis tools, specifically:
- Germline parsers:
  - Facilitate use with IgBLAST (community effort?)
  - Encourage authors of other parsers (see "List of V(D)J annotation software") to import our format
- Tools that infer novel germline genes or alleles
- Tools that infer haplotype?
- Repertoire browsers (e.g. "VDJviz: a versatile browser for immunogenomics data")?
- Any other tools we should add to this list?
Proposed Development Approach
- Agree the purpose - hopefully this month, in this thread
- Define a first-version schema and align it iteratively with the database definition - hopefully completing the first iteration this month
- Define an initial file format (there could be others, over time)
- Deploy the format on the Germline DB website
- Publish a reference implementation that imports to IgBLAST
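To give a feel for the last step: IgBLAST germline databases are normally built from FASTA files, so the core of a reference importer could be little more than a FASTA writer. The sketch below is only illustrative. The record layout (dicts with `name` and `sequence` keys), the function name `to_fasta` and the example allele names are my assumptions, not an agreed schema, and the sequences are placeholders.

```python
def to_fasta(records, width=60):
    """Render germline records as FASTA text.

    Each record is assumed to be a dict with 'name' and 'sequence' keys.
    These field names are hypothetical, not part of any agreed schema.
    """
    lines = []
    for rec in records:
        lines.append(">" + rec["name"])
        seq = rec["sequence"].upper()
        # Wrap the sequence into fixed-width lines
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder names and sequences, for illustration only
    example = [
        {"name": "EXAMPLE-V1*01", "sequence": "acgt" * 20},
        {"name": "EXAMPLE-V1*02", "sequence": "acgt" * 10},
    ]
    print(to_fasta(example), end="")
```

The resulting file would then be passed to the usual IgBLAST database preparation step. Anything beyond name and sequence would need a home elsewhere, which is part of the argument for a richer format.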
Progress so far
Key outstanding points
- Fields and their naming convention (@laserson) need to be aligned with the DB schema definition
- How should we handle multiple names for an identical sequence? This needs further consideration in terms of the biological process (@a.collins, @cwatson), and of the overall approach in our DB and website.
- Ensure the overall format is neutral to gene naming convention (@mats.ohlin) - we need to keep this in mind as we learn what conventions might arise
- Needs to be extensible to leader and constant regions (@javh). Should there be associated metadata and naming, for example for isotypes and subtypes?
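To make the multiple-names question a bit more concrete, here is one possible way a tool could detect the situation: index records by sequence and collect every name that maps to it. This is a sketch only. The record layout and the function name `names_sharing_sequence` are hypothetical, and it deliberately says nothing about which name should be treated as primary, since that is the open question.

```python
from collections import defaultdict


def names_sharing_sequence(records):
    """Return {sequence: [names]} for sequences carried by more than one name.

    Records are assumed to be dicts with 'name' and 'sequence' keys
    (hypothetical field names). Comparison is case-insensitive.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["sequence"].upper()].append(rec["name"])
    # Keep only sequences that appear under two or more names
    return {seq: names for seq, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Placeholder names and sequences, for illustration only
    example = [
        {"name": "EXAMPLE-V1*01", "sequence": "ACGTACGT"},
        {"name": "EXAMPLE-V2*01", "sequence": "acgtacgt"},
        {"name": "EXAMPLE-V3*01", "sequence": "TTTTCCCC"},
    ]
    print(names_sharing_sequence(example))
```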
Additional fields to consider for incorporation
(The overall framework is extensible. I think the main question at the moment should be whether these fields would be useful for the tools we aim to support.)
- Functionality (@werner.muller)
- GO terms for species etc (@werner.muller)
- Inference class (@werner.muller). Should we include other supporting evidence?
- gene family, gene, allele (@schristley)
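As a talking point, the candidate fields above could sit alongside name and sequence as optional attributes of a record, so that tools which do not need them can ignore them. The class below is a rough sketch under that assumption. The class name `GermlineRecord`, every field name and the use of plain strings for values are mine, not anything we have agreed.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class GermlineRecord:
    """Hypothetical record layout; all field names are assumptions."""
    name: str
    sequence: str
    # Optional so the format stays neutral to naming convention:
    # a set that does not use family/gene/allele can leave these empty.
    gene_family: Optional[str] = None
    gene: Optional[str] = None
    allele: Optional[str] = None
    functionality: Optional[str] = None
    inference_class: Optional[str] = None


if __name__ == "__main__":
    # Placeholder values, for illustration only
    rec = GermlineRecord(
        name="EXAMPLE-V1*01",
        sequence="ACGTACGT",
        gene_family="EXAMPLE-V",
        gene="EXAMPLE-V1",
        allele="01",
    )
    print(asdict(rec))
```

Keeping the extra fields optional is one way to let the schema grow without breaking tools that only want the sequences.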
File Format or Formats
- Data compression does not seem a pressing priority for us, given that the datasets will be relatively small
- We need an intuitive layout that is easy for people to work with and hard to misinterpret (@psathyrella)
- We should consider using a well-known standard framework, e.g. JSON, that is well supported and allows for later extension (@laserson)
- GenBank format is another possibility: it is well understood in the field (@werner.muller)
- We need to consider metadata format, for which there are also precedents (@laserson)
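To illustrate the JSON option, here is a minimal sketch of a file that carries set-level metadata next to the records, with a round trip through Python's standard `json` module. The top-level keys `metadata` and `records`, the metadata fields and the function names are all assumptions for discussion, not a proposed standard.

```python
import json


def dump_germline_set(metadata, records):
    """Serialise a germline set to JSON text (hypothetical layout)."""
    doc = {"metadata": metadata, "records": records}
    return json.dumps(doc, indent=2, sort_keys=True)


def load_germline_set(text):
    """Parse JSON text and check that the assumed top-level keys exist."""
    doc = json.loads(text)
    missing = {"metadata", "records"} - set(doc)
    if missing:
        raise ValueError("missing top-level keys: " + ", ".join(sorted(missing)))
    return doc["metadata"], doc["records"]


if __name__ == "__main__":
    # Placeholder values, for illustration only
    metadata = {"species": "example species", "format_version": "0.1"}
    records = [{"name": "EXAMPLE-V1*01", "sequence": "ACGTACGT"}]
    text = dump_germline_set(metadata, records)
    print(text)
    assert load_germline_set(text) == (metadata, records)
```

Because unknown keys are simply carried along, a layout like this can be extended later without invalidating older files, which seems to be the main attraction of JSON here.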