Draft criteria for inclusion of inferred alleles into a germline database

The following set of criteria for including an inferred gene into a database came out of a late-night wine-infused discussion at the AIRR meeting. Thus, it’s not a fully formed proposal, but it could be a starting point for discussion:

  • The sequence must be full length and included with a peer-reviewed publication
  • The number of unique, productive, unmutated sequences containing the novel segment must be significantly above background (≥10 sequences)
  • The novel segment must constitute a significant % of the total unique sequences in the repertoire (≥0.01% of the total unique unmutated repertoire)
  • It must constitute a significant % of the total unique rearrangements to the closest known segment in the repertoire (≥10% of the total unique reads)
  • It must have a significant association with multiple known J genes that do not contain mutations (≥2 J genes)
  • Matching 3’ and 5’ UMIs, or an alternative test, must be used to exclude chimeric amplicons
  • There must be ≤3 other alleles of the closest gene in the (inferred) individual genotype
  • The novel segment must be found in ≥2 independent biological replicates or sequencing runs (including public data sets)
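To make the thresholds concrete, here is a minimal sketch of how the numeric criteria above could be checked mechanically for a candidate allele. The `Candidate` structure, its field names, and the `passes_draft_criteria` function are all invented for illustration; they are not part of any existing inference tool.

```python
# Hypothetical sketch: apply the draft numeric thresholds to one candidate.
# All names here are invented for illustration.
from dataclasses import dataclass


@dataclass
class Candidate:
    n_unmutated: int            # unique, productive, unmutated sequences containing the segment
    pct_of_repertoire: float    # % of the total unique unmutated repertoire
    pct_of_closest_gene: float  # % of unique rearrangements to the closest known segment
    n_j_genes: int              # distinct unmutated J genes seen with the segment
    n_other_alleles: int        # other alleles of the closest gene in the inferred genotype
    n_replicates: int           # independent biological replicates or sequencing runs


def passes_draft_criteria(c: Candidate) -> bool:
    """Check only the numeric thresholds from the draft criteria list;
    the full-length and peer-review requirements are not modeled here."""
    return (c.n_unmutated >= 10
            and c.pct_of_repertoire >= 0.01
            and c.pct_of_closest_gene >= 10.0
            and c.n_j_genes >= 2
            and c.n_other_alleles <= 3
            and c.n_replicates >= 2)


example = Candidate(n_unmutated=25, pct_of_repertoire=0.05,
                    pct_of_closest_gene=15.0, n_j_genes=3,
                    n_other_alleles=1, n_replicates=2)
print(passes_draft_criteria(example))  # → True
```

One design point worth noting: every threshold is a hard cutoff here, whereas a statistical framework (discussed below in the thread) would replace these with confidence levels.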

There’s plenty to not like here, but perhaps @ctwatson or @tbkepler would like to express their reservations?

I don’t think it’s helpful to simply list a set of criteria without any explanation. How about starting with a statement of the goal, the errors that one is trying to avoid and how each criterion solves a particular sub-problem?

How should we weigh false positives vs false negatives?

At the moment we have neither statistical models nor systematic validation studies. I think we need to return to this topic when one or the other is in hand.

I agree in part. Yes, we should define goals and the errors we wish to avoid. We must be alert to the separate issues of false positives and false negatives.

But do we set the germline database challenge aside until statistical models or systematic validation studies are published? One issue we are trying to address is that abundant polymorphisms are being inferred, and these are an important resource. They may or may not ultimately be confirmed and placed in the top-tier repertoire database, but they could sit in a second tier: inferred sequences that have gained a level of acceptance. Along the lines we see in VBASE2, for example, or our own 5-level evaluation of the IMGT IGHV repertoire.

Let us not forget that the repertoire most people are using - the IMGT repertoire - includes any sequence that IMGT could lay their hands on from the early years of antibody sequencing. There is no modelling or validation behind it. I don’t know how long it may take to develop the kinds of models and validation studies that Tom has in mind, but I suspect they won’t happen quickly. We need an interim solution.


I like @tbkepler’s idea of a statistical framework. Presumably such a framework would, given a collection of rearranged sequences, indicate a level of confidence in a given inferred germline gene. As much as I love the spirit of this idea, it seems to me that any such framework would have to (at least implicitly) model the long-term evolutionary process of germline diversification, in addition to all of the complexities of repertoire sequencing. This seems like a tall order.

In such situations it seems perfectly suitable to use leave-out experiments as a way of calibrating our sensitivity and specificity. Although there are statistical issues with having the data ahead of time, we are increasingly rich with both repertoires and new germlines sequenced from genomic DNA. Thus we will be able to test on new data sets.
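The leave-out idea can be sketched very simply: hold out known germline alleles, run inference on the repertoires, and score recovery against the held-out truth. The scoring function and the allele/sequence labels below are placeholders for illustration; the inference step itself is not modeled.

```python
# Minimal sketch of scoring a leave-out experiment. The sets of names are
# placeholders; any real study would use actual inferred and held-out alleles.
def score_inference(inferred: set, truth: set, background: set):
    """Sensitivity: fraction of held-out true alleles that were recovered.
    Specificity: fraction of known non-alleles correctly not called."""
    true_positives = len(inferred & truth)
    false_positives = len(inferred & background)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = 1 - false_positives / len(background) if background else 1.0
    return sensitivity, specificity


truth = {"allele_A", "allele_B"}                 # held-out known germline alleles
background = {"chimera_1", "chimera_2", "chimera_3"}  # should not be called
inferred = {"allele_A", "chimera_1"}             # what a hypothetical tool reported

sens, spec = score_inference(inferred, truth, background)
print(sens, spec)  # one true allele of two recovered; one false call of three avoided calls
```

Calibrating the draft thresholds would then amount to sweeping the cutoffs and watching how sensitivity trades off against specificity on data the tool has never seen.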

This is a very important concern, but it is a user-level choice. Our role is to provide a convenient tiered structure that enables this choice to be made.


Dear all, I suggest reading my publication on the VBASE2 generation process, where we describe a procedure that defines the criteria for sequences to be included in the VBASE2 database.

As the publication appeared in NAR, the text is open access, and you can extract parts of it if that helps.

One of the key features was that we used three classes; please see Generation of VBASE2 database.

My reservations: there will probably always be false positives 🙂

But my reservations may not actually have any practical implications for most of the community. I fully support the development of such a set of criteria and, as @tbkepler and others have mentioned, the development of models, etc., to assess how well these criteria perform. In fact, I think we have to test these criteria if we want them to be taken seriously.

However, there will always be at least one big caveat to any such set of criteria (in my opinion, anyway): we don’t really know where novel alleles reside in the genome. There are many examples, and I’m sure many more to be discovered, for which allelic coding sequence alone can mislead, to the point that we convince ourselves that the allele belongs to one gene or another, when in fact our only evidence for such an assignment is phylogenetic similarity to other known alleles in the database. For someone like me who happens to care about the genomics of this story, this has practical implications, but for many others, I would imagine it doesn’t matter much at all. So I think I can leave it there…

In the end, I think we simply have to be honest about what we are doing here. A system that involves “tiers” or “levels” of gene/allele designations is pretty honest…