I agree with Mats on the importance of facilitating human readability and biological interpretation. Despite my criticisms, this is why the IMGT nomenclature has appealed to me, as well as to so many others. It makes me wonder whether the style of IMGT’s mouse nomenclature has a future. So for this iteration of my thoughts, I want to describe a system that moves towards such names.
Going back to Christian’s post about UIDs and HR-GSDs: we don’t want a system that is too complicated, but for the moment, can I propose a system that would be more complicated than Christian’s?
If we consider the existing IMGT nomenclature, it really has two tiers, and this may be a feature that needs to be retained. This would lead to an additional tier in Christian’s scheme.
Although the application of the two tiers in the IMGT nomenclature is not consistent or obvious to the casual observer, it basically corresponds to mapped and unmapped sequences. In the mouse, these have names like IGHV1-2*01 for mapped sequences, and IGHV1S53*01 for unmapped sequences.
If all mouse genes were placed in a database and given UIDs, the sequences would be associated with metadata describing the strain from which the sequence was obtained. If a sequence was reported from another strain, this would be added to the UID database.
From this database, species and strain-specific data sets could be extracted. At first the sequences would be given provisional (unmapped) names, perhaps simple consecutive numbers. Once the IGH locus had been mapped, new positional names could be assigned. (I would also favor the medium information density & usability HR-GSD kind.)
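To make the idea concrete, here is a minimal Python sketch of how such a UID database might behave. Everything in it is an assumption for illustration only: the `UID000001` format, the `IGHV-P` prefix for provisional names, and the choice to treat identical nucleotide strings as the same entry are mine, not part of any existing scheme.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    uid: str                                   # hypothetical UID format
    sequence: str
    strains: set = field(default_factory=set)  # strains the sequence was reported from


class UidDatabase:
    """Toy UID store: one UID per distinct sequence, strain kept as metadata."""

    def __init__(self):
        self._by_sequence = {}
        self._next = 1

    def report(self, sequence, strain):
        """Register a sequence seen in a strain; reuse the UID if it is already known."""
        seq = sequence.upper()
        record = self._by_sequence.get(seq)
        if record is None:
            record = GeneRecord(uid=f"UID{self._next:06d}", sequence=seq)
            self._next += 1
            self._by_sequence[seq] = record
        record.strains.add(strain)
        return record

    def strain_set(self, strain):
        """Extract the strain-specific data set, ordered by UID."""
        hits = [r for r in self._by_sequence.values() if strain in r.strains]
        return sorted(hits, key=lambda r: r.uid)


def provisional_names(records, prefix):
    """Assign lower-tier names of the consecutive number kind (assumed format)."""
    return {r.uid: f"{prefix}{i}" for i, r in enumerate(records, start=1)}


db = UidDatabase()
first = db.report("acgtacgt", "C57BL/6")
again = db.report("ACGTACGT", "129/Sv")   # same sequence reported from another strain
other = db.report("TTGACCAA", "BALB/c")

b6_names = provisional_names(db.strain_set("C57BL/6"), "IGHV-P")
balbc_names = provisional_names(db.strain_set("BALB/c"), "IGHV-P")
```

In this sketch the provisional names are generated per strain, so two strains could share a UID while carrying different interim names. Whether that is desirable is exactly the question I raise further down.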
Given the difficulty of identifying allelic variants of mouse genes, I would propose that, for now, only the B6 locus be given the final HR-GSD names. I suspect 129/Sv genes correspond to B6 genes, but it is too soon to be sure.
Soon, the BALB/c locus will have been explored, and a second set of HR-GSD names will emerge. For the moment however, the BALB/c genes would have a simple non-positional interim nomenclature. Other strains would also be assigned the same kind of lower-tier names. Perhaps these names would be distinct for each strain, but I’m not sure what I think about that now, despite my previous answer to Scott. Could they share UIDs, but have distinct interim names?
In time, the IGH locus of another strain might be completely sequenced, revealing a third distinct kind of IGH locus.
Over time, it might be agreed that the locus of a strain could be overlaid on either the B6 or BALB/c (or the third, fourth…) locus. If a new strain could be linked, say, to the B6 locus, the genes of that strain would acquire final names, either as B6 gene names or as allelic variants of the B6 genes. We know from the differences between the Matsuda and the Watson human sequences that we would also have to be prepared for insertions and deletions.
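The overlay step could be sketched roughly as follows. This is only an illustration under stated assumptions: the function `overlay_on_reference`, the `counterpart` mapping (which strain gene has been judged allelic to which reference gene), and all names such as `B6-G1` and `S-P1` are hypothetical placeholders, and deciding what counts as an allelic variant is the hard part that the code simply takes as input.

```python
def overlay_on_reference(strain_genes, reference_names, counterpart):
    """Promote a strain's interim names to final names from a reference locus.

    strain_genes:    UID -> interim (lower-tier) name in the new strain
    reference_names: UID -> final name in the reference locus (e.g. B6)
    counterpart:     strain UID -> reference UID it is judged allelic to
    """
    final = {}
    insertions = []          # strain genes with no place in the reference locus
    matched = set()          # reference UIDs accounted for by the strain
    for uid, interim in strain_genes.items():
        if uid in reference_names:
            # Identical sequence: the gene takes the reference name as is.
            final[uid] = reference_names[uid]
            matched.add(uid)
        elif counterpart.get(uid) in reference_names:
            # Judged allelic: named as a variant of the reference gene.
            ref_uid = counterpart[uid]
            final[uid] = reference_names[ref_uid] + " (allelic variant)"
            matched.add(ref_uid)
        else:
            # No counterpart: keep the interim name and flag as an insertion.
            final[uid] = interim
            insertions.append(uid)
    # Reference genes never matched are candidate deletions in the strain.
    deletions = sorted(uid for uid in reference_names if uid not in matched)
    return final, insertions, deletions


reference = {"UID000001": "B6-G1", "UID000002": "B6-G2", "UID000003": "B6-G3"}
strain = {"UID000001": "S-P1", "UID000010": "S-P2", "UID000011": "S-P3"}
judged_allelic = {"UID000010": "UID000002"}

final, insertions, deletions = overlay_on_reference(strain, reference, judged_allelic)
```

Genes flagged as insertions keep their lower-tier names in this sketch. How they would eventually be named within the final scheme is left open.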
The BALB/c locus now includes inferred sequences. Inferred sequences would have a place in the UID database, and such sequences could be assigned lower-tier names. But inferred sequences would ultimately be replaced with genomic sequences at the time that the final HR-GSD names were assigned.
We would also have to decide how to deal with anomalies like musIGHV211 and musIGHV269. These were seen in Johnston’s assembly of the B6 IGH locus, but not in the assembly that was the basis of the IMGT nomenclature.