Sponsored by the AIRR Community

Standardizing the format of a germline set

The Human Germline Genes subgroup has proposed guidelines which have some bearing on the schema:

  1. Sequences must be reported in a peer-reviewed journal.
  2. Sequences must not include ambiguities.
  3. A single cDNA-derived sequence cannot be the sole evidence for a germline sequence.
  4. The database must include full-length sequences.
  5. We exclude all sequences generated by six studies that have high sequencing error rates.

For (1) I have included a list of citations in the gene record. I’ve also added a list of citations for the gene set itself.

For (2) and (4), the schema is more permissive at the moment and we can consider whether it should be tightened up to impose these restrictions. My inclination, at the moment, is not to do so. The same guidelines might not be applicable to other species, where the germline is not so well described. In any case, I think the discrimination should be made by the curators of the gene set: for example, as discussed in that post, there is sometimes uncertainty over the final nucleotides of the V-gene. Also, the schema might be used in other contexts where the restrictions are not imposed. For the same reason, I have made the citation fields optional rather than mandatory.
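
To illustrate leaving the check with the curators rather than the schema, a curator-side filter for guideline (2) could look something like the sketch below. The function name is hypothetical, and treating anything other than A, C, G or T as an ambiguity is an assumption that only holds for ungapped nucleotide sequences.

```python
# Hypothetical curator-side check for guideline (2); not part of the schema.
UNAMBIGUOUS = set("ACGT")

def ambiguous_positions(sequence: str) -> list[int]:
    """Return 0-based positions whose base is not A, C, G or T."""
    return [i for i, base in enumerate(sequence.upper()) if base not in UNAMBIGUOUS]

# A curator could reject, flag, or simply annotate any sequence
# for which this list is non-empty.
print(ambiguous_positions("CAGGTGCAGCTG"))
print(ambiguous_positions("CAGGTNCAGCTR"))  # N and R are IUPAC ambiguity codes
```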

As always, please let me know whether you support this approach or have a different view.

Thanks

William

Hello there @w.lees-- this is great, and we’re taking a look at it now. However, the link you put in doesn’t allow comments. Could you paste a link that does? Thanks.

Sorry, I have updated the message with a link that should allow comments.

This looks great @w.lees, I’ve compared your schema against what VDJServer uses for its germline database and I think you’ve covered all of the fields that we have.

One thought I had was about the alternative/synonym names: instead of having two fields, would it be better to have a list, in case there are more?

Thanks @schristley. I thought the two columns for alternative names might work better, because the curators could choose to use one column, say for IMGT names, the other column for some other set of names. So the column refers to a specific set. We could achieve the same effect with name/value pairs in a list, but it would not be very human readable. If necessary we can add more columns at a later date, but the number of different sets of names will hopefully be small.
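
To make the comparison concrete, here is a rough sketch of the two layouts. All field names and example names are made up for illustration and are not taken from the schema document.

```python
# Option A: two fixed columns, each reserved by the curators for one set of names.
record_columns = {
    "name": "geneA*01",
    "alt_name_1": "A1-01",  # e.g. the column the curators reserve for IMGT names
    "alt_name_2": "VA.1",   # a second, curator-chosen set of names
}

# Option B: an open-ended list of (name set, name) pairs.
record_pairs = {
    "name": "geneA*01",
    "alt_names": [("IMGT", "A1-01"), ("other", "VA.1")],
}

def names_from_columns(rec: dict) -> list[str]:
    """Collect the alternative names from the fixed columns, skipping empty ones."""
    return [rec[key] for key in ("alt_name_1", "alt_name_2") if rec.get(key)]

def names_from_pairs(rec: dict) -> list[str]:
    """Collect the alternative names from the list of pairs."""
    return [name for _name_set, name in rec["alt_names"]]

print(names_from_columns(record_columns))
print(names_from_pairs(record_pairs))
```

Both hold the same information. Option A maps directly onto spreadsheet columns, while option B grows without a schema change but needs unpacking before a person can read it.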

Does this make sense?

William

We had some AIRR folks in Seattle for the Immune Profiling meeting, and had a lot of productive discussions about germline database structure, in particular with @bussec, @cswarth, Charles Linkem (Adaptive), and @wsdewitt. @cswarth put together some notes that I’m pulling apart, editing, and posting here.

The discussion quickly led to a central question: what defines a “record” in the database? This doesn’t seem to be quite clear in the schema.

We considered two principles around which to organize the information:

One record per unique sequence

Having one record per unique sequence seems like the obvious thing for bioinformaticians. However, there is a pitfall: identical sequences can occur at different loci, and that is important information.

One could aggregate all of the locations into a list for that sequence record, but note that each location can have its own references and its own levels of confidence. This leads to nesting complications that might argue for the second option…

One record per original source reference

An alternative is to have one record per original source reference. Each record would be annotated with all the attributes for that particular germline sequence, including:

  • citation to published reference, or information leading to the dataset from which this allele was inferred
  • genome position, if known
  • segment name, with synonyms
  • confidence level in that observation of the sequence
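
As a minimal sketch of what such a record might hold, assuming the attributes listed above map onto simple fields (all field names and values below are hypothetical placeholders):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SourceRecord:
    """One record per original source reference (field names are hypothetical)."""
    sequence: str
    citation: str                           # published reference, or pointer to the dataset
    segment_name: str
    synonyms: list[str] = field(default_factory=list)
    genome_position: Optional[str] = None   # None when the position is not known
    confidence: Optional[str] = None        # confidence in this particular observation

# The same sequence reported by two sources at different loci stays as two
# records, each with its own citation, position and confidence.
records = [
    SourceRecord("CAGGTGCAGCTG", "reference A", "geneA*01", genome_position="locus 1"),
    SourceRecord("CAGGTGCAGCTG", "reference B", "geneB*01", genome_position="locus 2"),
]
print(len(records), records[0].sequence == records[1].sequence)
```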

[One record per unique position on the genome]

This doesn’t really get traction: one can’t express inferred alleles, one loses the uniqueness of sequences, and it still has all of the nesting problems of one record per unique sequence.

So?

Having one record per original source reference seemed to be the consensus at the end of the meeting. However, this raises another question, which is what actually sits in a germline database list of sequences.

@bussec’s answer to this was having a second layer of records which aggregate the per-reference layer. We can talk about this after seeing if @w.lees or anyone else has an alternate solution to the issues raised above.
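
One way to picture that second layer is sketched below. Grouping by identical sequence is an assumption made here for illustration (the grouping key was not settled at the meeting), and the dictionary keys and values are placeholders.

```python
from collections import defaultdict

# Per-reference layer: plain dicts with hypothetical keys and placeholder values.
source_records = [
    {"sequence": "CAGGTGCAGCTG", "citation": "reference A", "position": "locus 1"},
    {"sequence": "CAGGTGCAGCTG", "citation": "reference B", "position": "locus 2"},
    {"sequence": "GAGGTGCAGCTG", "citation": "reference C", "position": None},
]

def aggregate_by_sequence(records):
    """Build the second layer: one aggregate entry per unique sequence."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["sequence"]].append(rec)
    # Each aggregate keeps links to every source record that supports it.
    return [{"sequence": seq, "evidence": evidence} for seq, evidence in grouped.items()]

aggregates = aggregate_by_sequence(source_records)
for agg in aggregates:
    print(agg["sequence"], len(agg["evidence"]), "source record(s)")
```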

Apologies if I am re-visiting points that have been made already, here or on other threads. Focusing on @ematsen’s “One record per original source reference”. The point regarding publications seems to be implying that publications may not be necessary. I need to be convinced on that. It is true that a bad publication serves little purpose, but seeing detail of how sequences have been generated has been critical to previous evaluations of ‘confidence levels’. I also think we need to encourage publications that identify BCR/TCR genes. A problem that we had for many years was that people came to see such reports as unimportant. By insisting on publications, we counter that view.

On the point of ‘confidence levels’, is this an evaluation of the confidence of the observation in the original source reference, or is this a measure of overall confidence, taking in additional evidence? If it is the former, this is an evaluation that I am not aware is being worked on at present. If the latter, the confidence level will change over time. Or have I missed something? I thought that confidence levels, as well as ‘functionality’ or ‘rearrangeability’ would be a separate layer of the system.

I would also add a comment on the ‘one record per unique sequence’ problem. I know leader sequences, RSS, etc. have been mentioned occasionally in various threads. Although most reported sequences only include the coding region, we should be encouraging the reporting of these elements. And one day, this will lead to a new kind of diversity in the germline repertoire. Not only are identical sequences possible at different loci, but we have the possibility of identical coding regions being associated with varying RSS or other regulatory elements. I can imagine attention returning to these elements as we struggle to understand the highly variable utilization frequencies of different genes. Should the database be designed so that it is ready for that?

Thanks for your thoughts, @a.collins!

First, I should make it clear that “one record per original source reference” means that a given sequence may have many records in the database. We can think of each of these records as being bits of evidence that contribute to the overall judgement on including that sequence into a high-confidence germline set. Do you think that we should exclude any non-published evidence, even if sequences are publicly available?

For this consideration it was the former. Do you not think that a particular paper could be assigned a confidence based on methods used?

You are right that the per sequence rather than per source confidence in a sequence would be a separate layer, only hinted at here with the aggregate records.

And I love your comment about additional information on top of the coding sequence! By having a more flexible configuration afforded by “one record per original source reference” we can accommodate that.

Thanks for the clarifications @ematsen! On the issue of whether or not a study needs to be published, I think we should do whatever we can to encourage this. It obviously is critical if a thorough evaluation of evidence is to be made. But there is a second reason why I think it would be useful. For most of the last 20 years, there has been little incentive to publish germline sequence studies, and we should adopt policies that turn this situation around. Demanding that a study be published if a sequence is to be taken into a database is not an incentive. On the other hand, the incentive will be there if we encourage people to cite publications that report sequences, instead of or in addition to just citing a paper describing the database. At present, the IMGT database is cited hundreds of times, but the papers that originally reported the sequences are long forgotten. Not only is this a little unfair, but if more people were aware of the original publications, we could have a very lively debate about the likelihood that the sequences were accurately reported.

I think this is a good discussion, because it really focuses on what the germline set is for, and who is going to use it. For my part, I see the germline set (as defined currently in the schema) as the product of the curator’s work: in other words, the ‘second layer of records which aggregate the per-reference layer’. To that end, it has just a single reference to an original source sequence, which you can think of as the primary evidence for the gene described in the record. It doesn’t attempt to include details of everything the curators took into account when deciding to include this gene. I’d assumed that curators would vary in their approach, and that such details would best be described in a publication, or in release notes.

We probably do need to provide a place where inferred sequences can be deposited, as we know they can’t be deposited in Genbank. But I think we should keep this separate from the germline set. If we try to put all the source sequences into the germline set, it will accumulate many duplicates which don’t have biological significance or interest, and I think this would make it more difficult for downstream tools to consume it.

Thanks, @w.lees-- you are right that this is about who the database is for.

Namely, is the database meant to be useful for the curators in addition to the consumers? I could imagine it being nice for the curators to be able to use it to keep track of the evidence for a given gene, and also for people who want to understand the basis for a given “level of evidence” call for a given germline gene. Maintaining release notes is great, but might also be a burden.

I’d love to hear from some potential curators on this point, such as @a.collins, @ctwatson, and @bussec.

I have made some small updates to the schema definition in response to comments posted in the document:

  • ORCID and PubMed ID added to author name and citation respectively, where these items exist.
  • field labels 5_UTR and LEADIN changed to 5’UTR and L-REGION to match the labels used in the IMGT Ontology. As I noted in the document, I suggest that it is worth using the ontology as a starting point, as the terms are defined and quite widely used. Over time we can adapt and extend the definitions as necessary. The full set of labels is at http://www.imgt.org/ligmdb/label and background/citations are at http://www.imgt.org/IMGTindex/ontology.php .

Does anyone have views on the adoption of the IMGT labels as a starting point?

@a.collins, With respect to the comment in your last mail, the schema can accommodate any other labels we need in addition to the ones we already have (5’UTR, L-REGION, FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4), provided that they are annotations which occur at most once per sequence. If we wanted to add annotations, for example to label hotspots, that can occur multiple times in a sequence, we’d need to construct something else. That’s certainly do-able, but is not going to be easy to read or manage in a spreadsheet-type form.

Are there other labels in addition to the ones listed that we should add at this stage?

Actually, on reflection, I think we could handle annotations that could occur multiple times in a sequence fairly easily if we needed to.
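
For example, something along these lines would work: labels that occur at most once keep their own columns, and repeatable labels go into a separate list. The coordinates and the "hotspot" label below are placeholders, not values from the schema.

```python
# Labels that occur at most once per sequence fit one column each
# (shown here as a dict keyed by label; coordinates are made-up placeholders).
single_annotations = {"FR1": (1, 75), "CDR1": (76, 99)}

# Labels that can repeat, such as hotspots, go in a list of (label, start, end).
repeated_annotations = [
    ("hotspot", 10, 13),
    ("hotspot", 52, 55),
]

def spans_for(label, single, repeated):
    """Return every (start, end) span recorded for a label, from either store."""
    spans = [single[label]] if label in single else []
    spans += [(start, end) for name, start, end in repeated if name == label]
    return spans

print(spans_for("FR1", single_annotations, repeated_annotations))
print(spans_for("hotspot", single_annotations, repeated_annotations))
```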

Let me first respond to @w.lees, and then I will try to take up the challenge from @ematsen!

I do think it worth building in the capability to record the recombination signal sequences. And thinking of RSS brings to mind another ‘expansion’ that should probably at least be considered briefly at this stage of the game. All discussion to date has been around IGHV, but the database will eventually have to expand to IGHD and IGHJ. I have no expertise on databases and their design, so my question might be really naive. Do we separately standardize the database format for each gene type? So to articulate how my mind wandered from RSS to here, we need 3’ RSS for IGHV, 3’ and 5’ RSS for IGHD and 5’ RSS for IGHJ.

To return to @ematsen’s challenge to potential curators… I don’t think it will be possible to fully present the evidence in favor of a sequence in a database record. As we review the IMGT repertoire, the fact that some evidence is provided in the IMGT database in support of the existence of their named sequences is very helpful, and such evidence is similarly to be found in the VBASE2 database. But the quality of each piece of evidence is only evident after a forensic examination. It will be important that people be able to reasonably easily access information that explains or documents why a sequence is accepted in the way it is. (Such acceptance will probably include a category of sequences that are ‘accepted’ as sequences to treat with suspicion.) The rules that guide decision making will have to be very clear, and prominently displayed, but somewhere, submitted sequences will have to be available. @w.lees wrote of publications emerging from the work of the curators, and that is probably also a topic that needs discussing. When new alleles are accepted by the curators, should they report this in a peer-reviewed publication? This might be done once each year, but let’s not forget that this would be a pretty boring task to do, year after year after year. These would not be publications to build a reputation on. So will curators step forward and write such papers, or is another system needed? And of course here we are considering the challenges with a focus that is very much on human sequences, with occasional mentions of mice and macaques. Curating all species will be a mind-blowing task if lots of people start cranking out sequences for tens, then hundreds, then thousands of species.

Thanks @a.collins. Yes we are attempting to define a schema that will cover V, D and J and also B- and T-cell, so we should certainly consider them all.

For the time being I have added V-RS, 3’D-RS, 5’D-RS and J-RS, these being the fields that the IMGT ontology defines for the cases you mention. There are sub-fields as well: for example V-RS is divided into V-HEPTAMER, V-SPACER and V-NONAMER. Altogether, there are >60 descriptive labels in the IMGT ontology, and we may well wish to add to the list over time. I think the important thing, right now, is to make sure we understand the overall principles - for example, that we need a different format for each gene type - and to identify any key fields that we expect people to work with in certain circumstances - for example the FR and CDR fields are key for sequence parsing, and RSS is important to curators.
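
As a sketch of how the gene type determines which RS fields apply, the mapping below uses the four labels just mentioned. The function name and record layout are hypothetical, and the labels are written with a plain ASCII apostrophe.

```python
# Recombination signal labels per gene type, using the IMGT labels named above.
RS_LABELS = {
    "V": ["V-RS"],               # 3' RSS of a V gene
    "D": ["5'D-RS", "3'D-RS"],   # a D gene has an RSS on both sides
    "J": ["J-RS"],               # 5' RSS of a J gene
}

def missing_rs_labels(gene_type: str, record: dict) -> list[str]:
    """List the RS labels expected for this gene type that the record lacks."""
    return [label for label in RS_LABELS[gene_type] if label not in record]

# A D-gene record (placeholder coordinates) with only its 5' RS annotated.
d_record = {"5'D-RS": (1, 28)}
print(missing_rs_labels("D", d_record))
```

Since most reported sequences only include the coding region, a check like this would more naturally produce a report for curators than act as a hard requirement.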

Let’s suppose that we hold two lists in our system: a list of inferred genes, and a list of original submitted sequences. What information do we want to hold in the system (as opposed to in journal articles, release notes or whatever) regarding the evidence for an inferred gene? At its simplest, this could just be a list of the submitted sequences that are deemed to be evidence. But, from your note, Andrew, it seems to me that we might also want to record some notes explaining why, and some metrics that reflect confidence - the output of the ‘forensic examination’. Perhaps we also want to list submitted sequences which were considered, but were rejected for one reason or another. I’m asking because we need to establish whether we can comfortably fit the supporting evidence into the list of inferred genes. The alternative would be a separate file per inferred gene, holding all the information and notes on supporting sequences, examination, and so on. A condensed list of inferred genes, suitable for use by tools such as parsers, would be assembled from the set of such files.

Thanks for the clarification @w.lees. Some of the group working on evaluation of the existing repertoire and working on ways to incorporate inferred sequences favor a red-orange-green system based upon evidence. I have just sent around a proposal involving 14 rules. I would like to let the group consider my email for a day or two before posting it to B-T.cr. These rules would probably answer some of your questions. I have suggested that inferred sequences would enter at the red level, while all existing IMGT sequences would begin at orange. Many of the IMGT sequences would then be demoted to red, as the rules were applied. The inferred sequences would need to be seen in an accepted proportion of all sequences, in association with different D and J. There would therefore be a set of VDJ sequences that would need to be documented somewhere, to support the decision. As more reports of inferences came in, a sequence could climb from red to orange, and then green. So ultimately three sets of VDJ sequences could be associated with a germline sequence.
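
To make the movement between levels concrete, here is a minimal sketch. It models only the three levels and the entry points described above. The 14 rules themselves are not represented, and the function names are hypothetical.

```python
from enum import IntEnum

class Evidence(IntEnum):
    RED = 0
    ORANGE = 1
    GREEN = 2

def initial_level(origin: str) -> Evidence:
    """Inferred sequences enter at red; existing IMGT sequences begin at orange."""
    return Evidence.ORANGE if origin == "IMGT" else Evidence.RED

def promote(level: Evidence) -> Evidence:
    """Move up one level, stopping at green."""
    return Evidence(min(level + 1, Evidence.GREEN))

def demote(level: Evidence) -> Evidence:
    """Move down one level, stopping at red."""
    return Evidence(max(level - 1, Evidence.RED))

# An inferred sequence climbing as further supporting reports come in.
level = initial_level("inferred")
level = promote(level)
level = promote(level)
print(level.name)
```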

In response to one of your points, I have suggested that if a submitted sequence is rejected, it would not be recorded. Hopefully the rules would be sufficiently clear, and the process sufficiently transparent, that virtually all submitted sequences would be accepted.

Yes, please make it useful for the curators :) I would expect that the criteria of the germline evaluation group will be able to operationalize the curation process to a large extent. Decisions of a curator that are not covered by, or not in line with, these rules will require free text documentation. Right now I do not see a scenario in which this information could not be directly attached to a source or segment (aggregation) record. Therefore the release notes (i.e. free text not in the DB) could be kept to a minimum. Please let me know if I am missing something.

I think I would prefer this.

We also have to think about how these additional features (e.g., RSS, spacers, etc.) factor into our allele calls. At present, there is so little data on these within IMGT that they tend to have little to no effect on allele names. But there are several examples in which an “allele” in IMGT, say *01 of geneA, is associated with more than one RSS. Should these be split into two alleles? Or remain as one allele with an annotation/metadata entry that includes two RSS variants?

I was a bit hesitant about whether a requirement for formal publication would be too restrictive, but I agree with your point about creating incentives for publication. So if we assume that publication will be a requirement, what would we count as a publication? Should a manuscript have undergone peer review and have a PMID, or would we also accept stuff that has not undergone formal review but is permanently accessible via a DOI (e.g. bioRxiv or other pre-print servers)?