There is general agreement that the existence of germline alleles inferred from VDJ rearrangements should be accepted, and that these sequences should therefore be named. It is also generally agreed that an inferred sequence is not the same as a sequence that has been identified by genomic sequencing. How should the certainty that a gene exists be indicated? There are two obvious ways: the sequence can be annotated, or the certainty can be indicated in the gene name.
A system of annotation can be illustrated by a paper we published some years ago. It was an evaluation of the IMGT human IGHV germline gene repertoire. At that time there were 226 IGHV sequences that were designated as functional by IMGT. We concluded that 104 of the sequences contained sequencing errors or ambiguities, were truncated, or had other problems that should lead to their removal from the repertoire. We classified the 226 sequences using a five-level system. Level 1 sequences were unquestionable. Level 5 sequences were nothing but trouble. So each sequence was annotated to indicate our confidence in it, from L1 to L5 (see: Wang Y, Jackson KJ, Sewell WA, Collins AM. Many human immunoglobulin heavy-chain IGHV gene polymorphisms have been reported in error. Immunology & Cell Biology 2008; 86: 111-5).
The five-level system was needed in part because our analysis focused on a relatively small dataset. At the time it seemed huge - over 4000 VDJs, collected from public sequence databases. But the dataset was small enough that it was hard to be absolutely certain of the existence of sequences that were highly similar. Could a one-nucleotide difference between two sequences be a consequence of sequencing error, for example? With today's large datasets, it should be possible to have a three-level system. Totally certain = Level 1; inferred but not confirmed by genomic sequencing = Level 2; very problematic sequences = Level 3. There could obviously be other reasons for a sequence being in Level 2 as well. Under such a system, a sensible VDJ repertoire analysis would utilize Level 1 and Level 2 sequences, and set aside Level 3 sequences.
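To make the annotation idea concrete, here is a minimal Python sketch of how a three-level annotation might sit alongside each sequence and drive the choice of reference set. The class and function names are my own invention for illustration, and the allele names and level assignments in the usage example are made up; they are not real annotations of any IMGT sequence.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    """Hypothetical encoding of the three levels described above."""
    CERTAIN = 1      # Level 1: totally certain
    INFERRED = 2     # Level 2: inferred, not confirmed by genomic sequencing
    PROBLEMATIC = 3  # Level 3: very problematic sequences


@dataclass(frozen=True)
class AnnotatedAllele:
    name: str          # the allele name is left untouched
    level: Confidence  # certainty is carried as a separate annotation


def reference_set(alleles):
    """Keep Level 1 and Level 2 alleles, set aside Level 3."""
    return [a for a in alleles if a.level <= Confidence.INFERRED]


# Made-up placeholder alleles, purely to show the filtering step.
example = [
    AnnotatedAllele("GENE-A*01", Confidence.CERTAIN),
    AnnotatedAllele("GENE-A*02", Confidence.INFERRED),
    AnnotatedAllele("GENE-B*01", Confidence.PROBLEMATIC),
]
kept_names = [a.name for a in reference_set(example)]
print(kept_names)
```

The point of the sketch is only that, with annotation, the name stays stable and the level can be revised later without renaming anything.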
The same outcome could be achieved through the use of a different nomenclature, and this is in part what you see if you visit the IgPdb database of inferred polymorphisms. We have called inferred polymorphisms 'putative polymorphisms', and have given them unofficial IMGT-like names such as IGHV2-5*p11.
This is a little like the CD nomenclature (e.g. CD4, CD8), though that follows rigorous rules that were developed over thirty years ago. Any fully accepted cell surface molecule with a CD name has been identified through the use of two separate monoclonal antibodies. If a molecule has been identified using only a single mAb, it is given a 'workshop' designation, e.g. CDw129.
If nomenclature were to carry information regarding certainty, I think three levels would be needed: certain, unconfirmed and very uncertain. 'p' remains a pretty good indicator of unconfirmed sequences. 'r' could be used to highlight rejected rubbish, giving names such as IGHV3-30*r05. If this approach were adopted, the most certain sequences would also have to have an indicator, so that readers would know they were dealing with a modified nomenclature, for example IGHV3-30*c01.
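For comparison, here is a rough Python sketch of what software would have to do if the certainty lived in the name itself. The regular expression, the status labels and the function name are assumptions of mine, not part of any IMGT or IgPdb specification, and the pattern only handles simple IGHV-style names like the examples in this post.

```python
import re

# Hypothetical mapping for the letters floated above.
STATUS_BY_LETTER = {"c": "certain", "p": "unconfirmed", "r": "rejected"}

# Gene name, an asterisk, an optional certainty letter, then the allele number.
ALLELE_NAME = re.compile(
    r"^(?P<gene>IGHV[0-9A-Za-z/-]+)\*(?P<letter>[cpr])?(?P<number>\d+)$"
)


def parse_allele_name(name):
    """Split a name such as 'IGHV3-30*c01' into (gene, status, allele number)."""
    match = ALLELE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised allele name: {name!r}")
    # Names with no letter are reported as 'unmarked', i.e. the existing style.
    status = STATUS_BY_LETTER.get(match.group("letter"), "unmarked")
    return match.group("gene"), status, match.group("number")


for example_name in ("IGHV3-30*c01", "IGHV3-30*r05", "IGHV3-30*01"):
    print(example_name, parse_allele_name(example_name))
```

Note the trade-off this exposes: every tool that reads allele names would need to understand the extra letter, and a sequence would change its name whenever its status changed.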
I am not arguing for p, r and c, or even arguing for the nomenclature solution rather than the annotation solution. I just think we need to begin this discussion.
So what do you think?
Perhaps you are in favor of the status quo?