Here’s a summary of current thinking.
We see a need to store three different kinds of information:
- observed sequences
- curated germline sets
- the detailed information leading to the inclusion or exclusion of a gene from a germline set
At the moment I think we are leaning towards storing this information in three separate kinds of file (but see the notes below on observed sequences, which may be stored elsewhere). The use of separate files reflects the different usage and authorship of the three kinds of information. More detailed notes, questions, and current status follow.
1 - Observed Sequences
There is currently no suitable public database that will accept inferred sequences, such as germline genes inferred from repertoire analysis. Hence we need to create a repository, at least for inferred sequences. I propose that we accept only inferred sequences into this repository, and require gene sequences to be deposited in GenBank.
Status - the schema for this file has not been defined as yet.
Questions - Is this approach acceptable? Is it over-simplistic: are there other classes of sequence that we need to consider?
2 - Curated Germline Sets
This is intended to be suitable for use by analysis tools such as IgBLAST and IMGT. It should contain sufficient information for such tools, and should err on the side of being rich, but without excessive information (such as references to all observed sequences contributing evidence) that is unlikely to be needed by the large majority of such tools.
Status - the schema for this file is reasonably mature. Detailed thinking on the current draft is summarised here. More recent work has focussed on refining annotations to be included. I propose that, where it covers annotations that we wish to incorporate, we adopt the IMGT Ontology unless that turns out to give us copyright issues, which I think is unlikely.
Questions - Are we comfortable with the use of the IMGT Ontology? Are there further comments on the schema, or are we ready to declare it complete, at least as a first draft?
3 - Detailed Curation Information on a Gene
Having discussed the size and type of this information, I think we are leaning towards creating a file per gene that will reference all observed sequences contributing to its inference and contain such scoring as we agree to incorporate (such as the level of confidence in the gene). There seems to be sufficient depth of information to split it out rather than try to hold it in the germline set.
Status - the schema for this file has not been defined as yet.
Questions - Are we comfortable with the approach of creating one file per gene to hold this information?
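To make the three-file split easier to discuss, here is a minimal Python sketch of how the three kinds of record might relate. Two of the three schemas are not yet defined, so every class name, field name, and identifier below (ObservedSequence, GermlineSetEntry, GeneCurationRecord, confidence_level, "GENE-A", "obs-1", and so on) is a hypothetical placeholder, not a proposal for the actual schema. The only points it tries to illustrate are taken from the notes above: the germline set entry carries no references to observed sequences, while the per-gene curation record does.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ObservedSequence:
    """One record in the (hypothetical) observed-sequence repository."""
    sequence_id: str
    sequence: str
    inferred: bool  # True for sequences inferred from repertoire analysis


@dataclass
class GermlineSetEntry:
    """One gene in a curated germline set; deliberately has no evidence links."""
    gene_name: str
    sequence: str


@dataclass
class GeneCurationRecord:
    """Per-gene curation file: evidence references plus agreed scoring."""
    gene_name: str
    included: bool
    confidence_level: int  # placeholder for whatever scoring is agreed
    observed_sequence_ids: List[str] = field(default_factory=list)


def missing_references(record: GeneCurationRecord,
                       observed: List[ObservedSequence]) -> List[str]:
    """Return the IDs cited by the curation record that are absent from the repository."""
    known = {o.sequence_id for o in observed}
    return [i for i in record.observed_sequence_ids if i not in known]


# Placeholder data, for illustration only.
repository = [
    ObservedSequence("obs-1", "ACGT", inferred=True),
    ObservedSequence("obs-2", "ACGA", inferred=True),
]
entry = GermlineSetEntry("GENE-A", "ACGT")
curation = GeneCurationRecord("GENE-A", included=True, confidence_level=2,
                              observed_sequence_ids=["obs-1", "obs-3"])

unresolved = missing_references(curation, repository)
entry_fields = sorted(asdict(entry).keys())
```

With the placeholder data above, the curation record cites one ID ("obs-3") that the repository does not hold, which is the kind of cross-file consistency check that separate files would make necessary.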