Sponsored by the AIRR Community

Refining one criterion for inclusion of inferred alleles: full length

Although there are larger issues to address, I would like to drill down on one detail in the discussion of criteria for inclusion of inferred sequences into a germline database. But first I would widen that topic to include discussion of previously reported sequences, and what is needed for those sequences to reach the highest category of acceptance within the AIRR community.

There are many truncated IGHV within the recognized human IGHV dataset. Rules for these sequences need to be debated, as well as rules for inferred sequences.

I think it is pretty obvious that, to reach the highest level of acceptance, a sequence must be full length. But we need to decide what that means. We would like leader sequences and RSS, but I don’t think that should be required. Full length sequences must start at the first codon - and if we focus here on human IGHV, we are pretty certain we know all or nearly all of the genes, so we know the first codons. But where does the sequence end?

IGHV2-5*03 is only complete to codon 87 (and also lacks the first 9 codons). This should be assigned to the lowest bin in a new database.

IGHV1-69*03 is complete to codon 100. It still lacks codons from FR3, and so I think it should also move to the reject bin.

IGHV2-70*04 was complete up to codon 103. We reported an inferred extension as IGHV2-70p14, and through Corey and Felix’s study, that sequence is now accepted as IGHV2-70D*04.

We also extended IGHV2-5*10 as IGHV2-5p11. The reported *10 allele encoded codon 105 and the first nt of 106. We chose to give the longer sequence a new p-type name, as we could not be certain whether we were extending *10 or had found a new allele. With the reporting of the African sequences by Catherine Scheepers et al., we see many other putative polymorphisms that differ at the very ends of the sequences. This is unsurprising given the role of CDR3 aa in antigen binding.

So, from all this, can we arrive at more detailed recommendations?

I would suggest that a sequence must certainly be complete to the end of FR3, to be considered full length. Should it be complete to codon 106? I think that would probably be appropriate. If a sequence could later be extended beyond 106, perhaps it would not require a new name? Any thoughts?
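To make the proposal concrete, here is a minimal sketch of how such a rule could be checked automatically. It assumes the input is an IMGT-gapped nucleotide string in which codon n occupies positions 3n-2 to 3n and unreported positions at either end are written as dots. The function names and the default cut-off of codon 106 are illustrative only (the cut-off is the tentative suggestion above, not an agreed rule).

```python
def three_prime_extent(gapped_seq):
    """Return (last complete codon, leftover nt) for an IMGT-gapped V sequence.

    Assumes codon n occupies gapped positions 3n-2..3n and that any
    trailing '.' characters mark unreported positions.
    """
    trimmed = gapped_seq.rstrip(".")
    return divmod(len(trimmed), 3)


def starts_at_codon_1(gapped_seq):
    """True if the first codon is fully reported (no '.' in positions 1-3)."""
    return len(gapped_seq) >= 3 and "." not in gapped_seq[:3]


def is_full_length(gapped_seq, required_codon=106):
    """Apply the tentative rule: starts at codon 1, complete to required_codon."""
    last_codon, _ = three_prime_extent(gapped_seq)
    return starts_at_codon_1(gapped_seq) and last_codon >= required_codon


# Placeholder strings, not real germline sequences.
complete = "CAG" * 106
missing_both_ends = "..." * 9 + "CAG" * 78   # lacks codons 1-9, stops at codon 87
one_nt_into_106 = "CAG" * 105 + "A"          # codon 105 plus first nt of 106

print(is_full_length(complete))              # True
print(three_prime_extent(missing_both_ends)) # (87, 0)
print(three_prime_extent(one_nt_into_106))   # (105, 1)
```

Keeping the cut-off as a parameter means the same check can be rerun if the group settles on the end of FR3 or some other codon instead.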

And a final issue here…when we consider how we should approach the validation of inferred alleles, we will also need to consider how we arrive at an agreed end to the sequences, given variable levels of exonuclease activity.

Before we decide on allele naming schemes, don’t we have to know more about how the germline genes are organized in their un-recombined form? How many are partial, how many are full length, and to what extent are ‘functional’ (i.e. can participate in the creation of an expressed and binding receptor) and ‘full’ sequences equivalent?

Let me put my point more clearly. Some accepted alleles are clearly partial, but it is not always so clear. When is a sequence partial, and when do you accept it as full length? Many sequences that once appeared full length have had one, two or three additional nucleotides added to them over the last few years. As the repertoire has not been ‘versioned’, this has not even been noticed by most people.

Second point: if we accept inferred alleles, we will have to develop clear guidelines about deciding the final nucleotide. The number of sequences that are needed to infer (most of) a sequence with confidence could be fewer than the number needed to infer the 3’ end of the sequence.
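A rough sketch of why the 3’ end needs more data: because of exonuclease trimming, fewer and fewer rearranged reads retain each additional germline nucleotide. The function below assumes we already have, for each read assigned to a candidate allele, the number of germline V nucleotides it retains (for example as reported by an alignment tool). The function name and the `min_reads` threshold are hypothetical, and the sketch ignores N nucleotides that match the germline by chance, which might inflate the apparent retained length.

```python
def supported_length(retained_lengths, min_reads=10):
    """Longest V length (in nt) retained by at least min_reads reads.

    retained_lengths: one integer per read, the number of germline V
    nucleotides that read still carries after exonuclease trimming.
    Returns 0 if no length reaches the required support.
    """
    best = 0
    for length in sorted(set(retained_lengths)):
        support = sum(1 for r in retained_lengths if r >= length)
        if support >= min_reads:
            best = length
    return best


# Toy counts: most reads are trimmed back, few reach the last nucleotides.
lengths = [318] * 3 + [316] * 5 + [313] * 20

print(supported_length(lengths, min_reads=5))   # 316
print(supported_length(lengths, min_reads=10))  # 313
```

With these toy numbers, 28 reads support the sequence up to nt 313 but only 3 support the final two nucleotides, so the same data set can be ample for the body of the sequence and inadequate for its end.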

An aspect to consider is that incomplete germline V-genes could potentially contribute to immune responses via gene conversion. One solution might be to annotate partial V-genes in the databases along with a flag or indicator that they are incomplete, similar to what IMGT does for pseudogenes now.

I am thinking more of sequences that are complete, but are not fully reported. There are dozens of sequences of this kind in the IMGT repertoire. Most are obvious, because they may be missing 30 or more nucleotides. Others are harder to spot, because there may only be a couple of nucleotides missing. This is hardly the most pressing issue, but if we are trying to reflect upon how we put together the best possible germline gene datasets, we have to think about this.

To briefly illustrate, according to IMGT, IGHV4-59*01 ends GCGAGAGA. So does *02.

*03, *04, *05 and *06 end with GCG.

*07 ends with GCGAGA

All this might be correct, but I doubt it, and in fact over the years, extensions have been made to many sequences.
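Cases like this are easy to screen for. The sketch below uses only the IGHV4-59 endings quoted above and assumes they are aligned at the same starting position; it flags any allele whose ending is a proper prefix of the longest ending reported for the gene. The function name is made up for illustration, and a flag only means "possibly under-reported", not "wrong".

```python
# 3' endings as quoted above for IGHV4-59 (assumed to start at the same position).
tails = {
    "IGHV4-59*01": "GCGAGAGA",
    "IGHV4-59*02": "GCGAGAGA",
    "IGHV4-59*03": "GCG",
    "IGHV4-59*04": "GCG",
    "IGHV4-59*05": "GCG",
    "IGHV4-59*06": "GCG",
    "IGHV4-59*07": "GCGAGA",
}


def possibly_truncated(tails_by_allele):
    """Map allele name -> number of nt it falls short of the longest ending.

    Only alleles whose ending is a proper prefix of the longest ending
    are reported; endings that differ in sequence are left alone.
    """
    longest = max(tails_by_allele.values(), key=len)
    return {
        name: len(longest) - len(tail)
        for name, tail in tails_by_allele.items()
        if len(tail) < len(longest) and longest.startswith(tail)
    }


for allele, missing in sorted(possibly_truncated(tails).items()):
    print(allele, "may lack", missing, "nt at the 3' end")
```

Here *03 to *06 are flagged as five nucleotides short and *07 as two short, which is the pattern that makes me doubt the reported ends.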

I still don’t see how we can answer this question without knowing to what extent partial copies exist among actual germline genes…

I agree that it is possible that one allele ends sooner than an otherwise identical sequence. I would say the two sequences are two separate alleles that need separate names, and evidence is needed for the existence of both of them. That evidence might include an early report of the shorter sequence, leading it to be present in the IMGT repertoire, but I don’t think that would be sufficient evidence to accept the certainty of the shorter sequence.

It is certainly true that there are many short sequences in the present IMGT human IGHV repertoire. Some of them are probably real short sequences, but I am sure that many of them are not.

We now have access to so many large data sets that if we set our collective minds to it, we could easily find evidence to clarify these issues. Some short sequences might wrongly slip into a category of less certain alleles, but they would not disappear, and over time the mis-classification would be addressed.

I decided to see if I could find examples of IMGT extending allele sequences, and I didn’t have to look far. Using the Immunoglobulin Factbook (published 2001), we can see the IMGT reference sequences from that time, and compare them with today’s reference sequences. (There were few if any changes between 2001 and 2010, though I can’t document this.)

I started with the first V1 gene (IGHV1-2), and the *04 allele has been extended by two nucleotides. The sequence was originally defined from the sequence with accession no. Z12310. Corey Watson’s work allowed this sequence to grow by two nucleotides.

So is this change documented? I don’t think so.
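If two snapshots of the reference set are available as simple name-to-sequence mappings (ungapped nucleotides), silent extensions of this kind can be listed mechanically. This is only a sketch: the function name is hypothetical, and the sequences in the usage example are toy placeholders, not real IMGT entries.

```python
def find_extensions(old_release, new_release):
    """Map allele name -> (nt added at the 5' end, nt added at the 3' end).

    An allele is reported only if its old sequence is contained intact
    within a longer new sequence of the same name. Alleles whose
    sequence changed internally are not reported here.
    """
    extended = {}
    for name, old_seq in old_release.items():
        new_seq = new_release.get(name)
        if new_seq is None or new_seq == old_seq:
            continue
        start = new_seq.find(old_seq)
        if start >= 0:
            extended[name] = (new_seq[:start], new_seq[start + len(old_seq):])
    return extended


# Toy placeholder data, not real reference sequences.
old = {"GENE*01": "ACGTAC", "GENE*02": "ACGT", "GENE*03": "GGG"}
new = {"GENE*01": "ACGTACGA", "GENE*02": "ACGT", "GENE*03": "GCG"}

print(find_extensions(old, new))  # {'GENE*01': ('', 'GA')}
```

Run over dated snapshots, a listing like this would give the kind of change log that a versioned repertoire would provide.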

Changes to the IMGT germline database (and other updates) can be seen at http://www.imgt.org/IMGTinformation/creations/

The site says that ‘major changes’ are documented. Some major changes are documented. Others are not. Certainly the inclusion of alleles arising from Corey Watson’s work is described, but the inclusion of 17 new human IGHV sequences, resulting from our work with Papua New Guinean samples, is not mentioned. These sequences were assigned names by IMGT in 2011.

In searching for that documentation, I also found a page (http://www.imgt.org/IMGTindex/IMGT-NC.php) which reports the current rules for consideration of new alleles. NGS data will not generally be considered, but…

“Sequences from NGS are accepted only for known alleles if they complete
the germline genomic sequence in 5’ or in 3’ (a few alleles
may have incomplete sequences in 5’ or 3’ if they were retrieved from
the literature before IMGT/GENE-DB was established).”

I guess this is the justification for extensions, though I am only aware of this being done using Corey’s sequence. Perhaps we should submit NGS data, and make a host of extensions, but to return to Uri’s point, is it possible that the shorter sequences are alleles in their own right?

It is important to have strict criteria for a “germline” V gene, and truncated sequences on their own would not qualify. I suggest that only complete V gene sequences are included and that the truncated sequences are only used to build up evidence for a “true” germline sequence. Over time, with more and more sequence information available, the missing germline genes will eventually turn up as full length sequences.