Should antibody gene names indicate the certainty that the genes exist?

There is general agreement that there should be acceptance of the existence of germline alleles that have been inferred from VDJ rearrangements, and that these sequences should therefore be named. It is also generally agreed that an inferred sequence is not the same as a sequence that has been identified by genomic sequencing. How should the certainty that a gene exists be indicated? There are two obvious ways. The sequence can be annotated, or the certainty can be indicated in the gene name.

A system of annotation can be illustrated by a paper we published some years ago. It was an evaluation of the IMGT human IGHV germline gene repertoire. At that time there were 226 IGHV sequences designated as functional by IMGT. We concluded that 104 of these sequences included sequencing errors or ambiguities, were truncated, or had other problems that should lead to their removal from the repertoire. We classified the 226 sequences using a 5 level system. Level 1 sequences were unquestionable. Level 5 sequences were nothing but trouble. So each sequence was annotated to indicate our confidence in it, from L1 to L5 (see: Wang Y, Jackson KJ, Sewell WA, Collins AM. Many human immunoglobulin heavy-chain IGHV gene polymorphisms have been reported in error. Immunology & Cell Biology 2008; 86: 111-5).

The 5 level system was needed in part because our analysis focused on a relatively small dataset. At the time it seemed huge - over 4000 VDJs, collected from public sequence databases. But the dataset was still small enough that it was hard to be absolutely certain of the existence of sequences that were highly similar. Could a one nucleotide difference between two sequences be a consequence of sequencing error, for example? With today’s large datasets, it should be possible to have a 3 level system: totally certain = Level 1; inferred but not confirmed by genomic sequencing = Level 2; very problematic sequences = Level 3. There could obviously be other reasons for a sequence being in Level 2 as well. Under such a system, a sensible VDJ repertoire analysis would utilize Level 1 and 2 sequences, and set aside Level 3 sequences.
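To make the proposal concrete, the filtering step for such a repertoire analysis could be as simple as the sketch below. The sequence names and level assignments here are invented examples, not real classifications:

```python
# Sketch of a three-level confidence filter for a germline reference set.
# Level 1: confirmed by genomic sequencing; Level 2: inferred but not yet
# confirmed; Level 3: very problematic. Assignments below are hypothetical.
reference_levels = {
    "IGHV1-2*02": 1,
    "IGHV3-23*01": 1,
    "IGHV2-5*p11": 2,   # inferred polymorphism, unconfirmed
    "IGHV3-30*05": 3,   # suspected sequencing error
}

def usable_reference(levels, max_level=2):
    """Keep Level 1 and 2 sequences for repertoire analysis;
    set Level 3 sequences aside."""
    return {name for name, level in levels.items() if level <= max_level}

print(sorted(usable_reference(reference_levels)))
```

The Level 3 sequences remain in the database and stay accessible; they are simply excluded from the working reference set.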

The same outcome could be achieved through the use of a different nomenclature, and this is in part what you see if you visit the IgPdb database of inferred polymorphisms. We have called inferred polymorphisms ‘putative polymorphisms’, and have given them unofficial IMGT-like names such as IGHV2-5*p11.

This is a little like the CD nomenclature (e.g. CD4, CD8), though that follows rigorous rules that were developed over thirty years ago. Any fully accepted cell surface molecule with a CD name has been identified through the use of two separate monoclonal antibodies. If a molecule has only been identified using a single mAb, it is given a ‘workshop’ designation, e.g. CDw129.

If nomenclature were to carry information regarding certainty, I think 3 levels would be needed: the certain, the unconfirmed and the very uncertain. ‘p’ remains a pretty good indicator of unconfirmed sequences. ‘r’ could be used to highlight rejected rubbish such as IGHV3-30r05. If this approach were adopted, the most certain sequences would also have to carry an indicator, so that readers would know they were dealing with a modified nomenclature - IGHV3-30c01.
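For what it is worth, such a modified nomenclature would still be mechanically parseable. A minimal sketch, assuming the c/p/r letter simply takes the place of the usual ‘*’ allele separator (the function name and regex are mine):

```python
import re

# Parse names of the hypothetical form IGHV3-30c01 / IGHV3-30p02 / IGHV3-30r05,
# where c = certain, p = putative (unconfirmed), r = rejected.
NAME_RE = re.compile(r"^(?P<gene>[A-Z0-9-]+?)(?P<status>[cpr])(?P<allele>\d+)$")

def parse(name):
    """Split a modified-nomenclature name into (gene, status, allele)."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"not a modified-nomenclature name: {name}")
    return m.group("gene"), m.group("status"), m.group("allele")

print(parse("IGHV3-30c01"))  # ('IGHV3-30', 'c', '01')
```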

I am not arguing for p, r and c, or even arguing for the nomenclature solution rather than the annotation solution. I just think we need to begin this discussion.

So what do you think?
Perhaps you are in favor of the status quo?

Thanks for the discussion and the background to this question, @a.collins.

I feel that names, once assigned, should stay the same. This means that future results will be comparable to past ones without translating the names. The level of evidence should be described in a metadata file.

In addition, this will allow us to use a richer and/or variable scheme for annotating sequences with levels of evidence.
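One hypothetical shape for such a metadata record, kept alongside a stable name that never changes (all field names and accession numbers below are invented for illustration):

```python
import json

# Hypothetical evidence metadata kept beside a stable gene name.
# The name stays fixed; only this record is updated as evidence accumulates.
record = {
    "name": "IGHV2-5*p11",
    "evidence_level": 2,   # e.g. 1 = genomic, 2 = inferred, 3 = problematic
    "evidence": [
        {"type": "rearranged", "accessions": ["X00001", "X00002"]},
    ],
    "last_reviewed": "2016-01-01",
}

print(json.dumps(record, indent=2))
```

Because the scheme lives outside the name, the `evidence` list can later grow richer fields (coverage, study counts, genomic confirmation) without renaming anything.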


I mostly agree with @ematsen, but I can imagine a system that has it both ways. We might do something like IGHV3-30*01-c/p/r, where the quality designation can change but not the allele number. So *08 might be ‘r’ while *09 is ‘c’ and *10 is still ‘p’. In this way, the quality designator functions mostly like an annotations layer while still being included in the name…

Having evidence levels is a great idea, though I’m not sure that we want to code it within the name. Wouldn’t we expect that, over time, sequences in Level 2 would move up to Level 1 as they become verified? This could cause problems as the name changes, as @ematsen mentions. It could cause computer programs to fail unexpectedly, since programs are often written to check whether names are the same. To work properly, all VDJ programs would need to know how to interpret/parse the name and strip the quality designation before comparing whether two genes are the same.
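To illustrate the parsing burden this would place on every tool, here is a minimal sketch, assuming the suffixed form IGHV3-30*01-c mentioned above (the helper names are mine):

```python
import re

# Strip a trailing quality designator (-c/-p/-r) so that two names
# referring to the same allele compare equal regardless of evidence level.
QUALITY_SUFFIX = re.compile(r"-[cpr]$")

def canonical(name):
    """Return the allele name with any quality designator removed."""
    return QUALITY_SUFFIX.sub("", name)

def same_allele(a, b):
    return canonical(a) == canonical(b)

print(same_allele("IGHV3-30*01-p", "IGHV3-30*01-c"))  # True
```

Simple enough in isolation, but every comparison in every pipeline would have to go through a step like this, which is exactly the fragility being described.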

At the same time, I can understand how the authors of an article might want to communicate that certainty level. This issue makes me think of the mouse nomenclature topic discussion. Maybe we need to think about two forms of nomenclature: (1) nomenclature used in databases, files, etc. for computer processing, where we have stricter guidelines, and (2) nomenclature used in publications, where the name is also used to convey various pieces of annotation?

In VBASE2 all sequences received a simple number, with a short letter code in front to make the name unique. The details did not matter (although you can find a lot of non-randomness in the numbers; for example, the numbers sometimes indicate where on the chromosome the V gene is located, if the location was known at the time).
Importantly, this unique number then points to all other names and sequences linked to this unique name.


In VBASE2 we labelled the entries by classes one, two and three. The class definition then says whether a gene exists (class 1 and class 2). The reason this is separate from the name is that more evidence can be collected with time, and genes may be found at a later time point. I remember one case (v186.1 and v186.2) in which the evidence that one gene existed (v186.1) vanished, and only v186.2 remained…

There are many sequences that are present in the IMGT repertoire, but are absent from VBASE2. I am thinking here of dubious sequences like the many IGHV3-30 alleles. I am puzzled about how manual curation by IMGT led to the inclusion of such dubious sequences, whereas automatic searching by VBASE2 did not. Can you explain how you filtered out such sequences? Did you require more than one genomic sequence for inclusion as a class 2 sequence?

IMGT wanted to capture all sequences that are in the GenBank/EMBL database. If smaller sequences did not yet match a germline sequence, those sequences were named as new sequences. Even a small number of mutations led to a new V gene name. In VBASE2, a germline sequence without evidence from a rearranged sequence has to be covered to a minimum length in order to be worth an entry. Rearranged sequences required more than one entry, and a minimum coverage as well. They also must come from independent experiments (or papers).
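As I understand the rule described here, inclusion based on rearranged sequences could be sketched like this. A toy illustration only; the field names and thresholds are mine, not VBASE2’s actual implementation:

```python
# Toy check of the stated VBASE2-style rule: a candidate germline supported
# only by rearranged sequences needs more than one entry, with minimum
# coverage, from independent studies. Thresholds here are invented.
def meets_rearranged_evidence(entries, min_entries=2, min_coverage=0.9):
    covered = [e for e in entries if e["coverage"] >= min_coverage]
    independent_sources = {e["study"] for e in covered}
    return len(covered) >= min_entries and len(independent_sources) >= 2

candidate = [
    {"study": "paperA", "coverage": 0.95},
    {"study": "paperB", "coverage": 0.92},
]
print(meets_rearranged_evidence(candidate))  # True
```

A single well-covered sequence, or two sequences from the same study, would fail this check and stay in the holding bin discussed below.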

Hi Werner,

The VBASE system as you describe it could easily be managed as you looked back at reported sequences. So, if you saw a sequence, you would look for a confirming sequence generated from an independent study. How was this system supposed to work subsequent to this establishment phase of the database? If a new sequence was reported, did you have a ‘holding’ bin in which the sequence would be placed until confirmation was found? Did this bin hold all the previously reported singletons?

This returns me to a point I have made elsewhere in the Germlines discussions, on criteria for inclusion in a new database. It seems to me that there needs to be a set of sequences that remain unnamed, but easily accessible. These would be sequences for which there is insufficient evidence to include them as germline sequences. In the early VBASE days, such a dataset might not have been necessary. The full set of reported sequences was so small at that time that I imagine each new reported sequence could easily be aligned against all previously reported sequences.

I am thinking of starting a topic called history of VBASE2 with the rules we set up to construct the database. Thinking about the rules at the time, for me it was very important to have a lean database that only contains the sequences with enough evidence.
One of the rules is that the sequences used for the construction of VBASE2 must be present in a sequence database with an accession number (not necessarily with a proper annotation).

During the construction you of course end up with a lot of single sequences that do not fit to others (yet), or that are too short and thereby too ambiguous to be linked to one entry.
In the new world, I would not change these rules. Of course I would extract all the single sequences and keep them somewhere, so they can be looked at when a new version is computed, but they would not be used within the database.

One of the questions I would have in the new world: how many sequences are still submitted to the public databases at the moment, and would there be a need for a new database subset to which new NGS sequence data can be submitted for use in germline database construction? If a new database is needed, maybe it would be worth discussing the minimal standard of information to be added to each sequence entry.

A new database that utilizes NGS antibody data - for example in demonstrating allelic functionality or frequency of usage - would have an evidential basis distinct from IMGT’s, in particular if such evidence was a necessary condition for inclusion.
I am very much in favor of making functional/expression evidence a necessary criterion; however, in so doing we will encounter a number of issues that affect those still focusing on the traditional genome-based method of mapping Ig segments.

For example, if someone generates a genomic IGH segment from a series of individuals and extracts V gene segments from these sequences how do we weight such candidate V alleles compared to, for example, V alleles identified from expressed antibody libraries?

At present IMGT weights genomic sequence far more than NGS-derived sequences, but I’m not sure I see the justification for this. I find genomic data hugely beneficial, but relying on genomic data alone is both time-consuming and does not provide functional/expression evidence.

I’m really thinking in terms of how one would go about creating a usable database in the space of a few months - which was the timescale mentioned previously - and this is not even taking into account the question of defining the criteria for inclusion of novel alleles.
I mean getting a ‘stage 1’ database that reaches the functionality of IMGT.
Adding novel alleles to that database is really stage 2, but we have to reach stage 1 before we move on to that level.

For example, I think it is generally agreed that there are many sequences present in IMGT that are never encountered during NGS analysis. A new database that strips these sequences from the reference set and provides evidence for those remaining is, I think, an achievable goal within the timescale we have previously discussed (this would be the ‘stage 1’ database).

BTW, I agree with Werner that individual accession numbers are important.

If unconfirmed germline genes (or even the likely bad ones) are included in a database, then such information becomes very important in interpreting search results. Being able to view that information right in the name itself would be very convenient for users. In reality, however, since the current IMGT naming is already a standard (at least for human genes), it is probably not a good idea to use any style different from the IMGT naming system, which as I understand it does not encode the level of germline gene certainty. It looks to me that this information needs to be described elsewhere; it would then be the responsibility of the search software to report it, which does not seem to be a big deal.

“I am very much in favor of making functional/expression evidence a necessary criterion; however, in so doing we will encounter a number of issues that affect those still focusing on the traditional genome-based method of mapping Ig segments.”

Re this statement by @martin_corcoran, I would argue that this could place limitations on a database. While I agree with the idea that “functional/expression evidence” is extremely important (maybe even most important), to me it seems more useful (and less restrictive) as a metadata element in schemes like those suggested by others in this thread (e.g., by @ematsen). We know there are at least a few examples for which genes/alleles are expressed in one individual, and not another, but present at the genomic level in both. If we imagined a scenario in which an allele/gene was observed at the germline level, because it was amplified from genomic DNA, but not by rep-seq, and we decided to “exclude” this allele because it did not have “expression-level” information, future analyses could easily be impacted by its exclusion.

I believe we want to create a database that is ultimately as inclusive as possible, but that utilizes robust criteria for assessing the inclusion of novel genes/alleles from different sources of data. While I think most of us feel that this should not require IMGT-level germline/gDNA confirmation, requiring “expression evidence”, in my opinion, swings us to the other side of the spectrum. A complete database, and one that is most useful for the broader community (again, in my opinion), is one that considers information/data from various sources that can contribute to richer metadata, and ultimately a better understanding of the genes we are all interested in studying. I think we need to be careful in our effort that we do not stray too far from standards and norms that are used by other communities in the genetics field. Terms like “gene” and “allele” do, in the end, have definitions that are understood universally in the context of a genome.

With respect to weighting “novel” genes/alleles, I’m not sure we necessarily have to give more weight at face value to either genomic or expression evidence. Weighting, in my opinion, should have more to do with the amount of support for a given gene/allele.

Name changes could also cause the human brain to fail. If anyone has spent much time reading literature from the 1970s-90s, it is often extremely painful to figure out which genes articles from this era are referring to. While I like the idea of gene/allele names containing information on their “level”, we should probably be careful about how we do this.

I think at the moment, I lean more toward the opinion of @ematsen – let the metadata do the talking.