Although there are larger issues to address, I would like to drill down on one detail in the discussion of criteria for inclusion of inferred sequences in a germline database. But first I would widen that topic to include discussion of previously reported sequences, and what is needed for those sequences to reach the highest category of acceptance within the AIRR community.
There are many truncated IGHV within the recognized human IGHV dataset. Rules for these sequences need to be debated, as well as rules for inferred sequences.
I think it is pretty obvious that, to reach the highest level of acceptance, a sequence must be full length. But we need to decide what that means. We would like leader sequences and RSS, but I don’t think that should be required. Full length sequences must start at the first codon, and if we focus here on human IGHV, we are pretty certain we know all or nearly all of the genes, so we know the first codons. But where does the sequence end?
IGHV2-5*03 is only complete to codon 87 (and also lacks the first 9 codons). This should be assigned to the lowest bin in a new database.
IGHV1-69*03 is complete to codon 100. It still lacks codons from FR3, and so I think it should also move to the reject bin.
IGHV2-70*04 was complete up to codon 103. We reported an inferred extension as IGHV2-70p14, and through Corey and Felix’s study, that sequence is now accepted as IGHV2-70D*04.
We also extended IGHV2-5*10 as IGHV2-5p11. The reported *10 allele encoded codon 105 and the first nt of 106. We chose to give the longer sequence a new p-type name, as we could not be certain whether we were extending *10 or had found a new allele. With the reporting of the African sequences by Catherine Scheepers et al., we see many other putative polymorphisms that differ at the very ends of the sequences. This is unsurprising given the role of CDR3 amino acids in antigen binding.
So, from all this, can we arrive at more detailed recommendations?
I would suggest that a sequence must certainly be complete to the end of FR3, to be considered full length. Should it be complete to codon 106? I think that would probably be appropriate. If a sequence could later be extended beyond 106, perhaps it would not require a new name? Any thoughts?
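To make the suggestion concrete, here is a minimal sketch of how such a rule might be applied. It is only an illustration, not an agreed procedure. It assumes IMGT unique numbering, in which FR3 ends at codon 104, and the class, function and bin names are all hypothetical. The coverage values come from the examples above, except that the start codon of IGHV2-5*10 is assumed to be 1 and the last entry is invented.

```python
from dataclasses import dataclass

# Assumption: IMGT unique numbering, in which FR3 ends at codon 104.
FR3_END_CODON = 104
# The end point floated in this post; still open for debate.
PROPOSED_END_CODON = 106


@dataclass(frozen=True)
class AlleleCoverage:
    """Hypothetical record of how much of the V region a sequence covers."""
    name: str
    first_codon: int          # first codon present in the sequence
    last_complete_codon: int  # last codon with all three nucleotides present


def classify(allele: AlleleCoverage, required_end: int = PROPOSED_END_CODON) -> str:
    """Assign a sequence to an illustrative completeness bin."""
    # Missing 5' codons, or an incomplete FR3, puts the sequence in the lowest bin.
    if allele.first_codon > 1 or allele.last_complete_codon < FR3_END_CODON:
        return "truncated"
    # FR3 is complete, but the sequence stops short of the proposed end point.
    if allele.last_complete_codon < required_end:
        return "complete to FR3"
    return "full length"


examples = [
    AlleleCoverage("IGHV2-5*03", first_codon=10, last_complete_codon=87),
    # Start codon assumed to be 1; only one nt of codon 106 was reported.
    AlleleCoverage("IGHV2-5*10", first_codon=1, last_complete_codon=105),
    # Invented entry, for illustration only.
    AlleleCoverage("hypothetical allele", first_codon=1, last_complete_codon=106),
]

for allele in examples:
    print(f"{allele.name}: {classify(allele)}")
```

Passing a different `required_end` (104, say) shows how the choice of end point moves sequences such as IGHV2-5*10 between bins, which is really the question I am asking.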
And a final issue here…when we consider how we should approach the validation of inferred alleles, we will also need to consider how we arrive at an agreed end to the sequences, given variable levels of exonuclease activity.