Sponsored by the AIRR Community

Standardizing the format of a germline set

Thanks for the clarification @w.lees. Some of the group working on evaluation of the existing repertoire and working on ways to incorporate inferred sequences favor a red-orange-green system based upon evidence. I have just sent around a proposal involving 14 rules. I would like to let the group consider my email for a day or two before posting it to B-T.cr. These rules would probably answer some of your questions. I have suggested that inferred sequences would enter at the red level, while all existing IMGT sequences would begin at orange. Many of the IMGT sequences would then be demoted to red, as the rules were applied. The inferred sequences would need to be seen in an accepted proportion of all sequences, in association with different D and J. There would therefore be a set of VDJ sequences that would need to be documented somewhere, to support the decision. As more reports of inferences came in, a sequence could climb from red to orange, and then green. So ultimately three sets of VDJ sequences could be associated with a germline sequence.

In response to one of your points, I have suggested that if a submitted sequence is rejected, it would not be recorded. Hopefully the rules would be sufficiently clear, and the process sufficiently transparent, that virtually all submitted sequences would be accepted.

Yes, please make it useful for the curators :slight_smile: I would expect that the criteria of the germline evaluation group will be able to operationalize the curation process to a large extent. Decisions of a curator that are not covered by / not in line with these rules, will require free text documentation. Right now I do not see a scenario in which this information could not be directly attached to a source or segment (aggregation) record. Therefore the release notes (i.e. free text not in the DB) could be kept to a minimum. Please let me know if I am missing something.

I think I would prefer this.

We also have to think about how these additional features (e.g., RSS, Spacers, etc) factor into our allele calls. At present, there is so little data on these within IMGT, that they tend to have little to no effect on allele names. But there are several examples for which an “allele” in IMGT, say *01 of geneA is associated with more than one RSS. Should these be split into two alleles? Or remain as one allele with an annotation/metadata entry that includes two RSS variants?

I was a bit hesitant whether a requirement for formal publication would be too restrictive, but I agree with your point about creating incentives for publication. So if we assume that publication will be a requirement, what would we count as a publication? Should a manuscript have undergone peer-review and have a PMID or would we also accept stuff that has not undergone formal review but is permanently accessible via a DOI (e.g. bioRxiv or other pre-print servers)?

As a practical matter - this seems to imply that any study annotating a repertoire should cite perhaps 100s of original papers that document the initial observation of each allele used in the database…?

@caschramm I suggest that we create a bibliography as part of an automated report that can be used as Supplementary Information for a paper. This would save quite a bit of time and also perhaps help by defining a standardised format for the detail that should be provided.

Re @caschramm’s query, I would also add that there are times when it would be difficult and not really helpful to cite references to gene discovery. A report on a repertoire, with so many genes being used, would be one example. But if your repertoire paper had a major focus on IGHV2-5 alleles in the anti-measles response, it might be appropriate to acknowledge the discoverers of those sequences.

Is there a document somewhere summarizing the latest thinking here? Perhaps it can be pinned in this thread?

Here’s a summary of current thinking.

We see a need to store three different kinds of information:

  1. observed sequences
  2. curated germline sets
  3. the detailed information leading to the inclusion or exclusion of a gene from a germline set

At the moment I think we are leaning towards storing this information in three separate kinds of file (and see notes below on observed sequences, which may be stored elsewhere). The use of separate files reflects the different usage and authorship of the three kinds of information. More detailed notes, questions, and current status, below.

1 - Observed Sequences
There is, currently, no suitable public database which will accept inferred sequences, such as germline genes inferred from repertoire analysis. Hence we need to create a repository, at least for inferred sequences. I propose that we only accept inferred sequences into this repository, and require gene sequences to be deposited in GenBank.

Status - the schema for this file has not been defined as yet.

Questions - Is this approach acceptable? Is it over-simplistic: are there other classes of sequence that we need to consider?

2 - Curated Germline Sets
This is intended to be suitable for use by analysis tools such as IgBLAST, IMGT. It should contain sufficient information for such tools, and should err on the side of being rich without containing excessive information (such as references to all observed sequences contributing evidence) that is unlikely to be needed by the large majority of such tools.

Status - the schema for this file is reasonably mature. Detailed thinking on the current draft is summarised here. More recent work has focussed on refining annotations to be included. I propose that, where it covers annotations that we wish to incorporate, we adopt the IMGT Ontology unless that turns out to give us copyright issues, which I think is unlikely.

Questions - Are we comfortable with the use of the IMGT Ontology? Are there further comments on the schema, or are we ready to declare it complete, at least as a first draft?

3 - Detailed curation information on a gene
Having discussed the size and type of this information, I think we are leaning towards creating a file per gene that will reference all observed sequences contributing to its inference, contain such scoring as we agree to incorporate (such as level of confidence in the gene) and so on. There seems to be sufficient depth of information to split it out rather than try to hold this information in the germline set.

Status - the schema for this file has not been defined as yet.

Questions - Are we comfortable with the approach of creating one file per gene to hold this information?
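
To make the three-way split a little more concrete, here is a minimal sketch of how the three kinds of record might reference one another. None of this is an agreed schema: every class name, field name and the `supporting_sequences` helper are hypothetical, and the `confidence` field is just a placeholder for whatever scoring is eventually adopted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ObservedSequence:
    """One entry in the observed-sequence repository (hypothetical fields)."""
    sequence_id: str
    sequence: str
    inferred: bool  # True if inferred from repertoire data


@dataclass
class GermlineSetEntry:
    """One gene in a curated germline set, kept lean for analysis tools."""
    gene_name: str
    sequence: str


@dataclass
class GeneCurationRecord:
    """Per-gene curation file: evidence and scoring live here, not in the set."""
    gene_name: str
    observed_ids: List[str] = field(default_factory=list)
    confidence: str = "unscored"  # placeholder; the scoring scheme is undecided


def supporting_sequences(
    record: GeneCurationRecord, repository: Dict[str, ObservedSequence]
) -> List[ObservedSequence]:
    """Resolve the observed sequences that a curation record points to."""
    return [repository[seq_id] for seq_id in record.observed_ids]
```

The point of the sketch is only that the germline set entry stays small, while the per-gene record carries the references back to the observed sequences.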


Here’s a further question. Are people happy with the level of interaction and discussion on this thread, or do we need a call to discuss some of these points in more depth?

I recently found that there could be a possibility to provide and annotate such sequences within the DDBJ/EMBL/GenBank database framework, namely as inferential third party annotation (see the last but one point in the “Examples of TPA:inferential” section). It requires that at least one of the sequences has some kind of experimental support, but this should not be a major problem as long as your dataset also contains sequences without SHM.

With “gene sequences” you are referring to sequences for which there is direct experimental evidence?

In general, yes. I do not see any other sequence classes besides “inferred” and “observed”. However, I think we should only create our own primary repository for inferred sequences after all attempts to find a stable solution within the existing database frameworks have failed. Note that this is different from a secondary database that collects and annotates germline segments from different sources as discussed above.

Agree entirely that we shouldn’t create a new primary repository unless we have to. I read through the material you mentioned and the [TPA section](https://www.ncbi.nlm.nih.gov/genbank/tpa/) of GenBank, and it does look promising. The fundamental limitations, as far as I can see, are that:

  1. all new annotations will be experimentally determined to exist, directly or indirectly (this is a quote from the Genbank manual: I suspect that it applies both to inferred sequences and annotations)
  2. the primary sequences on which the inference is based must be referenced
  3. post submission, but before the sequence is made publicly available in Genbank, it must be described in a peer-reviewed journal.

Inference of sequences from NGS data is not explicitly mentioned, and I suspect may not have been considered when the guidelines were drawn up.

Given the quite strong opinions expressed at the AIRR meeting that Genbank would not accept our inferred germline sequences, I wonder if anyone on the thread has direct experience of trying to submit such sequences?

(Christian, to the question you asked re. “gene sequences”, yes that’s exactly what I meant. Thanks.)

Just to keep the link easily available from this thread, here is a link to the most recent version of @w.lees’ schema.

@martin_corcoran a followup to your suggestion - are you able to provide, or point me to, definitions of the 5’UTR and the leader (L-REGION in IMGT terms) that precisely define their boundaries? I have looked at the IMGT definitions and ontology, but I don’t see any detail.

Many thanks for your help

William

We use a utilitarian approach towards leader and 5’ UTR identification in IgDiscover. We are generally dealing with 5’ RACE data when identifying upstream sequences and we define the leader region as the segment upstream of the V gene up until the furthest upstream ATG between 51 and 69 nucleotides that is in-frame with the V segment.
That seems to work well in a variety of different species.
The 5’ UTR is a little harder to define since it is not easy to get consistent amplification from an exact starting nucleotide using RACE so you end up with a series of sequence lengths of the 5’ UTR.
I’m not sure how we can word this as a concise definition for your table other than to say something along the lines of the (consensus?) expressed V gene sequence upstream of the leader sequence.
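
As an illustration of the leader rule described above, here is a minimal sketch. It is not the IgDiscover code. It assumes the input is the spliced (cDNA, e.g. 5’ RACE) sequence immediately upstream of the V region, and that “between 51 and 69 nucleotides” means the distance from the A of the ATG to the first base of the V region; the function name `find_leader` and its parameters are hypothetical.

```python
from typing import Optional


def find_leader(upstream: str, min_len: int = 51, max_len: int = 69) -> Optional[str]:
    """Return the leader, from the furthest-upstream in-frame ATG to the V start.

    `upstream` is the expressed sequence 5' of the V region, so its last base
    sits directly next to the first base of the V region. Returns None if no
    in-frame ATG lies within the allowed distance window.
    """
    seq = upstream.upper()
    # Try the longest allowed leader first, so the first hit is the
    # furthest-upstream ATG.
    for length in range(max_len, min_len - 1, -1):
        if length % 3 != 0:
            continue  # not in frame with the V segment
        if length > len(seq):
            continue  # read does not extend this far upstream
        start = len(seq) - length
        if seq[start:start + 3] == "ATG":
            return seq[start:]
    return None
```

Whatever lies upstream of the returned leader would then be the (variable-length) 5’ UTR for that read.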

@martin_corcoran So then wouldn’t the inclusion of 5’UTR potentially present issues, if not everyone is describing them in their repertoire sequencing? I certainly get that there could be SNPs within the 5’UTR that are allele determinants (and I would typically be in favor of including these), but if the data coming into the database are regularly going to have varying lengths of 5’UTRs captured, then doesn’t this make this a challenging feature to include in allele designation? I wonder if we shouldn’t simply stick to coding sequence for now??? I think this deserves some discussion…

It does present issues, I agree, but if we consider one use of the database is as a resource for groups who may have an interest in cloning monoclonal antibodies then I think it is worthwhile including this region - even if we do not have a perfect solution at present. I am not advocating that people switch to sequencing the 5’UTR and leader as part of their standard analysis, just that having this information in a germline database is going to be helpful for some purposes.

So, in a scenario where you describe a new allele, and this new allele includes 10 bp of the 5’UTR, and then @bussec describes a second sequence that matches this allele but includes 16 bp of the 5’UTR, does his sequence then supersede yours?

@bussec, how would your db scheme handle such a situation re aggregation and source layers?

First, a new source record would be generated for the second sequence. Then, as the sequences are identical in the overlapping parts, the id of the new source record would be included in the aggregation record. The overall confidence in the existence of the segment (represented by the aggregation record) would increase, assuming that the new sequence originates from an independent study.
We have not yet decided which sequence would be reported back to a user. This could either be one of the associated sequences without further editing (e.g. the longest one or the one with the highest overall quality) or a trimmed version focusing on a predefined region (e.g. ATG to RSS for V segments).
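
A rough sketch of the aggregation step described above, to show the idea rather than the actual db scheme. All names are hypothetical. It assumes each source sequence carries an `anchor` index marking a landmark shared by all sources (for example the A of the start codon), so that sequences with different 5’UTR lengths can be compared position by position, and it uses the number of distinct studies as a stand-in for confidence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceRecord:
    record_id: str
    study: str
    sequence: str
    anchor: int  # index in `sequence` of a landmark shared by all sources


def agree_in_overlap(a: SourceRecord, b: SourceRecord) -> bool:
    """True if the two sequences are identical wherever they overlap."""
    left = min(a.anchor, b.anchor)  # shared bases upstream of the anchor
    right = min(len(a.sequence) - a.anchor, len(b.sequence) - b.anchor)
    return (a.sequence[a.anchor - left:a.anchor + right]
            == b.sequence[b.anchor - left:b.anchor + right])


@dataclass
class AggregationRecord:
    sources: List[SourceRecord] = field(default_factory=list)

    def add(self, new: SourceRecord) -> bool:
        """Attach a source record if it matches every existing source."""
        if not all(agree_in_overlap(s, new) for s in self.sources):
            return False
        self.sources.append(new)
        return True

    @property
    def independent_studies(self) -> int:
        """Crude confidence proxy: number of distinct contributing studies."""
        return len({s.study for s in self.sources})

    def longest(self) -> str:
        """One reporting option: the longest associated sequence."""
        return max(self.sources, key=lambda s: len(s.sequence)).sequence
```

In the 10 bp versus 16 bp example, both source records would attach to the same aggregation record, the study count would rise from one to two, and `longest()` would return the sequence with the 16 bp 5’UTR. Neither sequence supersedes the other.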