Really great discussion! I will speak mostly to the points regarding nomenclature. As a genomicist, and one who has historically approached IG from a genomics perspective, my viewpoint is that we should ultimately be striving to move toward genomic assignments for “germline” segments. In my opinion, this is truly the only time-tested way to define a gene/locus as an entity for which a known position exists in a genome; inference can never achieve this goal and, as I think @bussec alluded to, will always leave the assignments of, for example, “100% identical gene segment duplications” up in the air. But I wholeheartedly agree with @a.collins that genomics studies in IG across taxa are likely to be few and far between for the foreseeable future, and will continue to lag behind repertoire-based analyses. So, as I think everyone has argued, we have to have a system that allows inferential methods like TIgGER to contribute to (and maybe even dominate, in the short term) publicly used IG gene/allele sets. But in saying this, I would certainly argue that any naming scheme has to have flexibility, and any new gene/allele discovered in a repertoire and catalogued as “novel” should be able to be assigned later to a position in the genome (I guess as the Layer 2 annotations proposed by @bussec).
We have already seen firsthand in human that position-based numbering is really not a long-term solution, particularly as “novel” genes are discovered. There are now several examples where genuine gene duplicates are in fact treated as just that, noted by IMGT with a “D” instead of a position-based number (e.g., IGHV1-69D, IGHV3-23D, IGHV3-43D, and IGHV3-64D). But from a phylogenetic standpoint, these are hardly different from, as an example, IGHV3-30, IGHV3-30-3, and IGHV3-30-5, other than the fact that the latter set of genes received their names in the infancy (or maybe teenage years) of IG nomenclature. Then you have IGHV1-69-2, which was previously referred to as IGHV1-f. Unlike IGHV3-30, IGHV3-30-3, and IGHV3-30-5, which we know are very close paralogues, IGHV1-69 and IGHV1-69-2 are not the result of a recent duplication event, even though the name might imply this. So, I guess I’m arguing that the current scheme is perhaps already neither extendable nor stable. Now, if we start to also think about alleles, the waters get muddier in a hurry. How do I tell if an allele resides at IGHV1-69 or IGHV1-69D? I can’t, unless I see this allele in genomic DNA sitting at a given position in the genome. So one concern I would have moving forward, when we are thinking about inferring novel segments/alleles from a repertoire dataset and how we might begin naming them, is this: when do we say we have a new “gene” vs. a new “allele”? Does either a positional or a sequential numbering scheme have a solution for this problem? I would be curious to know what others think.
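To make the naming point concrete, here is a rough sketch of how the names mentioned above decompose. The regular expression is my own guess at the pattern, covering only the kinds of names in this post; it is not an official IMGT grammar, and `parse_name` and `same_name_slot` are hypothetical helpers made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern (not an official grammar): locus + segment type,
# family, position, optional sub-index, optional "D" duplicate flag,
# optional allele suffix such as "*01".
NAME_RE = re.compile(
    r"^(?P<locus>IG[HKL])(?P<segment>[VDJ])"
    r"(?P<family>\d+)-(?P<position>\d+)"
    r"(?:-(?P<sub>\d+))?"
    r"(?P<dup>D)?"
    r"(?:\*(?P<allele>\d+))?$"
)


class ParsedName(NamedTuple):
    locus: str
    segment: str
    family: int
    position: int
    sub: Optional[int]
    is_d_duplicate: bool
    allele: Optional[str]


def parse_name(name: str) -> ParsedName:
    """Split a gene/allele name into its components (hypothetical helper)."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"name not covered by this sketch: {name!r}")
    return ParsedName(
        locus=m["locus"],
        segment=m["segment"],
        family=int(m["family"]),
        position=int(m["position"]),
        sub=int(m["sub"]) if m["sub"] else None,
        is_d_duplicate=m["dup"] is not None,
        allele=m["allele"],
    )


def same_name_slot(a: str, b: str) -> bool:
    """True if two names share locus, segment type, family, and position.

    This reflects only what the *name* implies, not actual paralogy.
    """
    pa, pb = parse_name(a), parse_name(b)
    return pa[:4] == pb[:4]
```

With this, `same_name_slot("IGHV3-30", "IGHV3-30-3")` and `same_name_slot("IGHV1-69", "IGHV1-69-2")` both return `True`, even though only the first pair are close paralogues; the name alone cannot tell the two situations apart. Likewise, nothing in an allele's sequence tells you whether to file it under IGHV1-69 or IGHV1-69D.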
Also, a few other short, random thoughts. First, re @bussec:
“Gene segment names must not contain information on assumed functionality (e.g. the “pg” (pseudogene) tag used in some mouse Igh-V segments). The initial assumptions might be wrong and different alleles of a segment can differ in functionality.”
So true, and also, a given “allele” may have different functionality depending on where it resides in the genome, again thinking in the context of gene duplicates.
Also, re “vgenerepertoire.org”, I again agree with @a.collins: the use of WGS/shotgun datasets for V gene inference is very shaky and should be treated with EXTREME caution. It is important to bear in mind that in many instances, whole genome sequence data for a given species may be derived from the DNA of multiple samples. Regions like IG loci are very challenging to assemble and make sense of even using shotgun reads derived from a single haploid sample, let alone several pooled diploid genomes; annotating genes vs. alleles in such a scenario could not possibly be anything but a nightmare. I would steer clear of such an approach, and I don’t believe it offers much beyond repertoire analysis at this stage in the game.