Mouse IGHV gene nomenclature

a.collins · August 4, 2016, 2:08am

If you broadly accept the findings of our 2015 BALB/c mouse study, you will agree that there is a major problem with the existing nomenclature for inbred mouse genes.

The details of the paper are: Collins AM, Wang Y, Roskin KM, Marquis CP, Jackson KJ. The mouse antibody heavy chain repertoire is germline-focused and highly variable between inbred strains. Philosophical transactions of the Royal Society of London Series B, Biological sciences 2015; 370:20140236.

The mouse nomenclature comes from the C57 genome reference sequence. It is a quasi-positional nomenclature. Unlike the human nomenclature, names do not point to the position of the genes within the genome; rather, they refer to the order of the genes within each IGHV family along the genome. So IGHV2-1 is the first gene of the V2 family within the genome, and IGHV3-2 is the second V3 family gene within the genome. Sequences from other strains have been named, I think, on the basis of similarity to C57 genes. Within the IMGT database, many BALB/c sequences are shown as mapped, but it is unclear what this means, and what data supports the mapped position. The BALB/c genome has just been released, so we may soon find out more. (It may be that mapping relates to the Retter 2007 study of the 3’ half of the 129S1 IGH locus, but that should be confirmed.)
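To make the quasi-positional scheme concrete, here is a minimal Python sketch that splits a name such as IGHV2-1 into its family and its ordinal within that family. The regular expression and the function name `parse_quasi_positional` are illustrative assumptions, not part of any IMGT tooling, and allele suffixes and unmapped names are deliberately not handled.

```python
import re

# Hypothetical pattern for names of the form IGHV<family>-<ordinal>,
# e.g. IGHV2-1. Allele suffixes and unmapped names are not handled.
_NAME = re.compile(r"^IGHV(\d+)-(\d+)$")


def parse_quasi_positional(name):
    """Return (family, ordinal_within_family) for a name like 'IGHV3-2'."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a quasi-positional IGHV name: {name!r}")
    family, ordinal = match.groups()
    return int(family), int(ordinal)


print(parse_quasi_positional("IGHV2-1"))  # (2, 1)
print(parse_quasi_positional("IGHV3-2"))  # (3, 2)
```

Note that the second number only orders genes within one family, so it says nothing about where IGHV2-1 sits relative to IGHV3-2.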

If there are something like 100 C57 IGHV genes and 160 BALB/c genes, it is almost impossible to imagine that correspondence can be found between sequences, so that BALB sequences can be given names according to the C57-based nomenclature. If that is the case, what is to be done?

Should a second quasi-positional nomenclature be defined for BALB/c genes? This would require tags being added to the existing names, giving something like c57IGHV1-1*01 and balbIGHV1-1*01. It might be easier to head down that road if we knew what other treats are in store for us as other inbred and wild strains of mice are investigated.

Should a non-positional nomenclature, like that of VBASE2, be adopted? If this was used to name BALB/c genes, would C57 names be left in place?

Should BALB/c sequences that presently have C57-based names be renamed?

schristley · August 4, 2016, 6:20pm

From an IT perspective, I tend to advise against putting too much semantic information into identifying names; unforeseen exceptions always seem to pop up over time and complicate what started as a simple design. Seeing that we have to maintain a certain amount of annotation data with each sequence, it is reasonable to have a “breed” or “strain” field. Computational tools can easily use that field. I note that NCBI has not done a very good job of handling this issue; researchers seem to have taken it upon themselves to tag the breed in the title, e.g. “Mus musculus breed: BALB/c”.

Based upon your analysis, Andrew, would you consider the two breeds to have so little overlap that we might essentially organize the genes as separate organisms/species in the germline database? For example, with a gene that is identical, should it be stored twice in the germline database, once for BALB/c and once for C57, or should it be stored once with a reference to both breeds? I ask this from a biological perspective, not a computational one. If they are stored twice, that lets each have its own history and annotation separate from the other.
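The two storage options in this question can be sketched side by side. This is only an illustration in Python: the sequences are made-up placeholders, and the function names and dictionary layout are assumptions rather than any existing database schema.

```python
from collections import defaultdict

# Made-up observations as (strain, sequence) pairs; sequences are placeholders.
observations = [
    ("C57", "ACGTACGT"),
    ("BALB/c", "ACGTACGT"),  # identical sequence seen in both strains
    ("BALB/c", "TTGACCAA"),
]


def store_per_strain(obs):
    """Option 1: a shared sequence gets a separate entry in each strain."""
    db = defaultdict(list)
    for strain, seq in obs:
        db[strain].append({"sequence": seq, "annotation": []})
    return dict(db)


def store_once(obs):
    """Option 2: each distinct sequence is stored once, listing its strains."""
    db = {}
    for strain, seq in obs:
        entry = db.setdefault(seq, {"strains": set(), "annotation": []})
        entry["strains"].add(strain)
    return db


per_strain = store_per_strain(observations)
shared = store_once(observations)
print(sum(len(entries) for entries in per_strain.values()))  # 3
print(len(shared))  # 2
```

In the first layout each copy carries its own annotation list, which is what allows a separate history per strain; in the second, annotation is necessarily shared.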

a.collins · August 5, 2016, 12:00am

Hi Scott,

I tend to agree with you, but want to stimulate discussion on this! If you think about the VBASE2 nomenclature, as useful as it looks at first glance - musIGHV554 etc, this locks in the mus prefix for Mus musculus. What happens with other Mus species? No simple prefix could account for a multitude of species. And I think that we will see many people start to produce simple descriptive studies of germline genes of all sorts of species in the very near future. These would be neat short research projects that would be perfect for the likes of honours students, and we have thousands of species to play with! So yes I agree that annotation will probably be necessary.

The lack of overlap between C57 and BALB/c is really extraordinary, with only 5 shared IGHV that we can identify. It may have something to do with the bizarre early breeding histories of these mice, and their possible derivation from different sub-species. I think that a database should provide strain-specific sets for inbred laboratory animals, and so the five shared sequences should have separate entries in different sets. There is no sense in including additional germline genes in an analysis of an inbred strain. It can only lead to incorrect alignments of mutated sequences.

A mouse database can move immediately in this direction, and it should be no problem to develop sequence sets for all major strains, as it is so easy to infer the repertoire of an inbred strain. If we can agree guidelines or rules on the way germline genes should be inferred, we will be in business!

Andrew

schristley · August 8, 2016, 4:13pm

Hi Andrew,

You make a good point, and as I gave this more thought, I realized that a technical solution that satisfies the bioinformaticians isn’t sufficient; we also need some simple rules to make it easier for biologists. While I was initially against semantic information in the name, I now think it is okay and likely even required. The scenario that popped into my head was the thought of writing a paper. If gene names were simply IGHV5, IGHV4, etc., then every paper would have to write long sentences such as “We compared the IGHV5 gene in BALB/c breed of Mus musculus against the IGHV4 gene in the C57 breed”. Essentially, every time a gene is mentioned in a paper, it would need a full descriptor: “IGHV5 gene in BALB/c” or “IGHV4 gene in C57” (totally made up example but you get the idea). Researchers are gonna automatically adjust the gene names just so they can be more succinct in their writing, which will lead to non-standard nomenclature and a future mess. I’m on board with simple name rules, and if we can have them contain all of this information then they would be very useful as descriptors.

Clearly defines the species
Clearly defines the strain/breed of the species
Clearly defines the gene, gene type, gene family, allele, etc.

Anything else you can think of?

There are some standard databases that could be utilized, e.g.

There is the NCBI taxonomy number
There are the 3 letter species designations in KEGG (though this may not be good enough for all species).

Using the NCBI taxonomy number, a fully descriptive name might be:

10090.balb.IGHV5

Though this seems a bit unwieldy.

As for the bioinformaticians, this doesn’t need to be harder for them because we can satisfy both at the same time. We can allow semantic names but also require that the information be stored in metadata, so programs don’t attempt to “parse” the name; they process the metadata fields instead.
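A minimal Python sketch of that idea: the metadata fields are the source of truth, and the descriptive name is only derived from them for display. The class name `GermlineGene`, its field names, and the helper `genes_for_strain` are hypothetical; the name format follows the 10090.balb.IGHV5 example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GermlineGene:
    taxonomy_id: int  # NCBI taxonomy number, e.g. 10090 for Mus musculus
    strain: str       # strain/breed tag, e.g. "balb"
    gene: str         # gene designation, e.g. "IGHV5"

    @property
    def display_name(self):
        """Human-facing name, built from the metadata and never parsed back."""
        return f"{self.taxonomy_id}.{self.strain}.{self.gene}"


def genes_for_strain(genes, strain):
    """Filter on the metadata field rather than on the name string."""
    return [g for g in genes if g.strain == strain]


genes = [
    GermlineGene(10090, "balb", "IGHV5"),
    GermlineGene(10090, "c57", "IGHV4"),
]
print(genes[0].display_name)  # 10090.balb.IGHV5
print([g.gene for g in genes_for_strain(genes, "c57")])  # ['IGHV4']
```

If the naming format changes later, only `display_name` needs to change; code that filters on `strain` or `taxonomy_id` is unaffected.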

Scott

bussec · August 8, 2016, 9:49pm

Hi Scott, hi Andrew

I have also tried to come up with reasonable approaches to the nomenclature issue for some time, so I will just share my current iteration of thoughts with you below. In general, this is about a universal nomenclature, but I think mouse is an excellent test-bed. A good part of this is inspired by Ensembl, meaning that I consider a distinction between a database identifier (UID, see below) and the biological designation (HR-GSD, also below) to be extremely helpful and that it is reasonable to compartmentalize designations on the species level. The other parts come mainly from my observations on how non-computational biologists tend to work. I furthermore assume that people performing cross-species comparisons are advanced users who work directly with the UIDs.

Each germline segment must have a single, permanent and unique ID (UID), which by itself does not encode any additional information (similar to a Genbank gid).
All metainformation (species, strain, locus, physical position, orientation, “functionality” information, clan, family, sequential number, level of certainty) is primarily linked to this UID.
In addition, each germline segment must have a human-readable germline segment designation (HR-GSD). Optimally this designation would have three features:

High information content [bit]
Medium information density [bit/character] (i.e. manual decodeability for non-specialists. There are probably systematic studies on this, but I usually get screamed at when we go above ~2.5 bit/character. People also dislike positional encoding, especially if there is only one separator character)
Usability [character] (10 characters would be optimal, 12 is a hard limit)

As is already obvious from the units, you can only optimize two of them at the cost of the third (you can also imagine them as three corners of a triangle):
High information content & Usability (not manually decodeable, e.g. a UID or a hash → “G8HJ6K34LK”)
High information content & Medium information density (too cumbersome to use, e.g. “MusmusB6Igh-V-345-b-func-A”)
Medium information density & Usability (potential collisions, e.g. “normal” gene designation “Ighv-345b”)

Since the HR-GSD should primarily be used by non-specialists, I consider the third option (medium information density & usability) to be the logical choice.
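A rough back-of-envelope check of the numbers in this post, written as a Python sketch. It takes the ~2.5 bit/character ceiling at face value and uses the species and segment counts quoted further down in the same post; the function names are illustrative only.

```python
import math

BITS_PER_CHAR = 2.5  # informal readability ceiling mentioned in the post


def capacity_bits(n_chars, bits_per_char=BITS_PER_CHAR):
    """Information a designation of n_chars can carry at the given density."""
    return n_chars * bits_per_char


def bits_needed(n_items):
    """Bits required to distinguish n_items equally likely alternatives."""
    return math.log2(n_items)


print(capacity_bits(10))  # 25.0
print(capacity_bits(12))  # 30.0

# Figures quoted later in the same post.
print(round(bits_needed(50_000), 1))             # 15.6
print(round(bits_needed(100_000 * 10 * 1000), 1))  # 29.9
```

On these figures, encoding the species alone would use roughly 16 of the 25 to 30 bits available in a 10 to 12 character name, which is in line with the argument for keeping the species out of the HR-GSD.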

Within one species HR-GSD must be collision-stable, i.e. the same HR-GSD must never be assigned to two different UIDs. However, a HR-GSD can be deprecated.
The HR-GSD should be consistent with the general nomenclature guidelines of the host species.
While the HR-GSD for different species should encode the same information types, it can use diverging (species-specific) formats: e.g. IGHV-345*01 in human and Ighv-345^b in mice.
The species itself should not be part of the HR-GSD:

It contains a lot of information (> 50k vertebrate species → ~16 bit)
If necessary, the binomial name can still be used to prefix the HR-GSD in text or in a table
many nomenclature guidelines are species-specific, they do not expect the species to be included in the name
as already mentioned, full binomial names would be necessary to make a proper distinction between the different species of the Mus genus, again adding a lot to the length

The designation system should not have any known limitations, assuming sane boundaries rounded up (100k vertebrate species * 10 loci/species * 1000 segments/locus)
As discussed before, the HR-GSD must not encode any relative segment position on the locus as this will result in trouble sooner or later. I am still undecided between either a pure consecutive numbering system (like used in VBASE) and a “clan-family-sequential” system. I have a slight preference for the first one, since it is simple and differs substantially from the current scheme, so that there would be little risk for misinterpretation. On the other hand I know that many people like similarity-based groups and if someone would come up with a good cross-species grouping, I would be in for this, too.
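The UID and HR-GSD rules above can be sketched as a toy registry in Python. Everything here is an assumption made for illustration: the class and method names are hypothetical, and random hex strings from `uuid` merely stand in for whatever opaque identifier a real database would issue.

```python
import uuid


class SegmentRegistry:
    """Toy registry: opaque UIDs, and HR-GSDs that are never reassigned."""

    def __init__(self):
        self._metadata = {}       # uid -> metadata dict
        self._designation = {}    # (species, hr_gsd) -> uid, kept forever
        self._deprecated = set()  # (species, hr_gsd) pairs no longer in use

    def add_segment(self, species, hr_gsd, **metadata):
        key = (species, hr_gsd)
        if key in self._designation:
            # Collision-stable: a name is never reused, even once deprecated.
            raise ValueError(f"{hr_gsd} was already assigned in {species}")
        uid = uuid.uuid4().hex  # carries no biological information
        self._metadata[uid] = {"species": species, **metadata}
        self._designation[key] = uid
        return uid

    def deprecate(self, species, hr_gsd):
        key = (species, hr_gsd)
        if key not in self._designation:
            raise KeyError(f"{hr_gsd} is not known in {species}")
        self._deprecated.add(key)

    def lookup(self, species, hr_gsd):
        """Return (uid, is_deprecated) for a designation within one species."""
        key = (species, hr_gsd)
        return self._designation[key], key in self._deprecated


registry = SegmentRegistry()
uid = registry.add_segment("Mus musculus", "Ighv-345^b", strain="BALB/c")
registry.deprecate("Mus musculus", "Ighv-345^b")
print(registry.lookup("Mus musculus", "Ighv-345^b")[1])  # True
```

Because designations are keyed by species, the same HR-GSD may exist in two species without conflict, while all other metadata hangs off the UID.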

a.collins · August 9, 2016, 11:35pm

Scott and Christian’s posts have given us much to think about, and they led me to return to Scott’s earlier question. If B6 and BALB/c mice share a sequence, should the sequence be separately recorded (and named) for each strain?

I wrote that I thought future databases should facilitate access to strain-specific data sets, and that the shared sequences should be separately recorded. That would involve five genes that I know of in the B6 and BALB/c strains. But such a strategy could not sensibly be extended to all inbred strains. We would not want 100 different named sets of genes!

If we agree that the B6 and BALB/c IGH loci are fundamentally different, we might agree that they need distinct nomenclatures. But 129/Sv has similarities to BALB/c. It is conceivable that these three strains represent two distinct loci, and as I said earlier, these may relate to two subspecies of Mus musculus that were used to derive the strains. Perhaps a third distinct set will be found, and maybe later a fourth.

Could we approach nomenclature with this kind of expectation? In other words, we now accept we have two patterns, and so develop two sets of names. As more inbred strains are characterised, we check against these two. If they depart substantially (whatever we decide that means), a third set of names is drawn up. And so on. In such a scenario, a name like Ighv-345^b would indicate IGH locus type b, rather than BALB/c.

Downloading from such a database would facilitate strain-specific analysis, and type-specific analysis. That is, if you are working with a strain of known immunogenotype, you could download the set of known rearrangeable genes for that strain. If you were working with a partially known strain, like 129/Sv that has had half its IGHV genes documented, you might download the BALB-type dataset. And if you were working with an uncharacterised strain, you would download the full set of mouse sequences, as a sensible starting point.
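The three download cases described above could look something like the following Python sketch. The record layout (keys `name`, `strains`, `locus_type`), the gene names, and the function `select_reference_set` are all made up for illustration.

```python
def select_reference_set(genes, strain=None, locus_type=None):
    """Pick a germline set for analysis.

    genes: list of dicts with hypothetical keys 'name', 'strains', 'locus_type'.
    - fully characterised strain: pass strain, get that strain's genes
    - partially characterised strain: pass locus_type, get the type-level set
    - uncharacterised strain: pass neither, get every known mouse sequence
    """
    if strain is not None:
        return [g for g in genes if strain in g["strains"]]
    if locus_type is not None:
        return [g for g in genes if g["locus_type"] == locus_type]
    return list(genes)


# Made-up records; names and assignments are placeholders.
genes = [
    {"name": "gene1", "strains": {"B6"}, "locus_type": "B6-type"},
    {"name": "gene2", "strains": {"BALB/c"}, "locus_type": "BALB-type"},
    {"name": "gene3", "strains": {"BALB/c", "129/Sv"}, "locus_type": "BALB-type"},
]

print([g["name"] for g in select_reference_set(genes, strain="B6")])
# ['gene1']
print([g["name"] for g in select_reference_set(genes, locus_type="BALB-type")])
# ['gene2', 'gene3']
print(len(select_reference_set(genes)))  # 3
```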

I imagine there would be no attempt to associate allelic variants between types (B6 vs BALB). There might not be any attempt within types either (BALB vs 129), though time will tell. Just as IMGT has inferred sequences to be allelic variants, if there was sufficient correspondence between the loci of, say, BALB and 129, we might be tempted. Or we could go down the VBASE way, with consecutive numbering as sequences are identified. If that was the path we followed, linkage of sequences as allelic variants of one another would only happen after careful mapping, and would still not be indicated in the sequence names.

mats.ohlin · August 10, 2016, 11:20am

Hi Christian and others,

This discussion is really of extraordinary importance. What is the common origin of the germ line-encoded repertoire? Probably the field will evolve so extensively (in line with Andrew’s comments) in the future that a standardizing committee will have to be prepared to modify the nomenclature from time to time. That is probably better than sticking to a fixed outline independently of new information (compare the problem with impossible germline alleles, given their origin in a single individual’s germline, found in the IMGT database). What remains important though, I believe, is that human readability (and ease of human interpretation) remains a priority in the naming convention. The question is what’s most important for human readability. In my view (others may certainly disagree depending on their major research interest) it is more important to try to name allele variants so as to define their common physical position in the locus and their subgroup. This might be achievable (based on our current understanding) in many instances in the human locus but seems (as outlined elsewhere in this discussion) an impossible task if considering mice as a single entity. Again a nomenclature committee would need to revise current states from time to time and recommend subgrouping of every species’ Ig loci (with its own naming system) based on evolving knowledge. To maintain a linear order of the different genes in their names seems to be of lesser importance given the difficulties in maintaining such a system over time (in agreement with Christian’s opinion). However, a purely consecutive numbering system does not facilitate human readability and biological interpretation at all, and I would currently not prefer such a system as the information content of such names is too limited.
Importantly, the use of “personalized” gene sets (like those created by TIgGER or similar approaches) should likely be recommended in many cases (except in the case of inbred strains), as comparison with global databases can create a large number of errors in the analysis.

a.collins · August 11, 2016, 9:15am

I agree with Mats on the importance of facilitating human readability and biological interpretation. So despite my criticisms, this is why the IMGT nomenclature has appealed to me, as well as to so many others. This gets me wondering whether the style of IMGT’s mouse nomenclature has a future. So for this iteration of my thoughts, I want to describe a system that marches towards such names.

Going back to Christian’s post about UIDs and HR-GSDs…. We don’t want a system that is too complicated, but for the moment, can I propose a system that would be more complicated than Christian’s?

If we consider the existing IMGT nomenclature, it really has two tiers, and this may be a feature that needs to be retained. This would lead to an additional tier in Christian’s scheme.

Although the application of the two tiers in the IMGT nomenclature is not consistent or obvious to the casual observer, it basically corresponds to mapped and unmapped sequences. In the mouse, these have names like IGHV1-2*01 for mapped sequences, and IGHV1S53*01 for unmapped sequences.

If all mouse genes were placed in a database and given UIDs, the sequences would be associated with metadata describing the strain from which the sequence was obtained. If a sequence was reported from another strain, this would be added to the UID database.

From this database, species and strain-specific data sets could be extracted. At first they would be given provisional (unmapped) names, perhaps of the consecutive number kind. Once the IGH locus had been mapped, a new positional nomenclature could be given. (I would also favor the medium information density & usability HR-GSD kind.)
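One way to picture the two tiers is a record that always has a provisional name and only later gains a final one. This is a hypothetical Python sketch: the class `SegmentNames`, the `prov` prefix, the UIDs, and the final name used in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentNames:
    uid: str
    provisional: str             # lower tier, e.g. a consecutive number
    final: Optional[str] = None  # upper tier, set once the locus is mapped

    @property
    def current(self):
        """Name to show users: the final name if one exists."""
        return self.final if self.final is not None else self.provisional


def assign_provisional(uids, prefix="prov"):
    """Give consecutive provisional names to UIDs in order of registration."""
    return {
        uid: SegmentNames(uid, f"{prefix}{i}")
        for i, uid in enumerate(uids, start=1)
    }


names = assign_provisional(["uidA", "uidB"])
print(names["uidB"].current)  # prov2

# Later, after mapping, a made-up positional name replaces the interim one.
names["uidB"].final = "mapped-name-7"
print(names["uidB"].current)  # mapped-name-7
```

The UID stays fixed throughout, so data annotated under the provisional name can still be traced after the final name is assigned.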

Given the difficulty identifying allelic variants of mouse genes, I would propose that only the B6 locus would presently be given the final HR-GSD names. I suspect 129/Sv genes correspond to B6 genes, but it is too soon to be sure.

Soon, the BALB/c locus will have been explored, and a second set of HR-GSD names will emerge. For the moment however, the BALB/c genes would have a simple non-positional interim nomenclature. Other strains would also be assigned the same kind of lower-tier names. Perhaps these names would be distinct for each strain, but I’m not sure what I think about that now, despite my previous answer to Scott. Could they share UIDs, but have distinct interim names?

In time, the IGH locus of another strain might be completely sequenced, revealing a third distinct kind of IGH locus.

Over time, it might be agreed that the locus of a strain could be overlaid on either the B6 or BALB/c (or the third, fourth…) locus. If a new strain could be linked, say, to the B6 locus, the genes of that strain would acquire the final set of names with B6 gene names or allelic variants of the B6 genes. We know from the differences between the Matsuda and the Watson human sequences that we would also have to be prepared for insertions and deletions.

The BALB/c locus now includes inferred sequences. Inferred sequences would have a place in the UID database, and such sequences could be assigned lower-tier names. But inferred sequences would ultimately be replaced with genomic sequences at the time that the final HR-GSD names were assigned.

We would also have to decide how to deal with anomalies like musIGHV211 and musIGHV269. These were seen in Johnston’s assembly of the B6 IGH locus, but not in the assembly that was the basis of the IMGT nomenclature.

ctwatson · August 24, 2016, 11:56am

Great conversation! I am trying to let all of this sink in…

@bussec, out of curiosity, because you mention getting your inspiration from Ensembl, would you envision UIDs for “transcript” and “gene”?

ctwatson · August 24, 2016, 12:09pm

So, I hate to be “that guy”…but I guess I am “that guy”.

ctwatson · August 24, 2016, 12:37pm

Maybe this is where genomics is important, or at least still provides a useful framework for thinking about our problem…To me, I believe we have to decide how we want to think about “strains”. This is an oversimplification, but as an example, we might think of them as being akin to “species”, or “individuals in a population”. I, of course, understand that they are neither, but this distinction seems important to me. If we look at all other genes or loci in the mouse genome, it is my understanding the mouse genetics community treats strains more like “individuals in a population” – when I say this, I’m thinking about SNP genotyping across the genome, in which all genotyped strains are in the context of the reference genome (i.e., C57BL/6). In other words, all newly discovered variants are treated just as that, variants or alleles, not distinctly named entities based on the strain they were observed in (I’m riffing here, so feel free to tell me I’m wrong, if I am). So, if we followed this scheme, we would not be naming genes/alleles based on which strain they came from, but instead, we would be naming them in the context of a reference strain (in this case, it happens to be B6). I realize, I’m basically putting Andrew’s points and others in different words…but I am doing this, because I think we need to consider what the future has in store for us as a community.

At present, we’re more or less all here, because as a community we have recognized that we are rich in expression data, and poor in genomics data. It is my feeling, that if the inverse were true, we would be having a much different conversation right now, or maybe not much of a discussion at all. We are here because we recognize the need to come up with solutions for how we incorporate data from expression studies, without having complementary genomic data in hand. Andrew brought up human haplotype comparisons, which (again in my mind) would be akin to mouse strain haplotype comparisons. In humans, in the case of a copy number variant that includes an insertion of a “novel” gene on one haplotype, but not another, this gene receives a new name/allele designation, and has a place in the context of the reference assembly, allowing for alleles of this gene to be observed again in another individual in a future study. This I think should be how we ultimately approach the problem we are discussing for mouse IG. But I recognize, that with only repertoire data at our disposal, we might not have this luxury for the time being. However, this is unlikely to be the case for long. It is only a matter of time before high quality assemblies exist for all of these strains…so, without rambling on further, I would stress that we have to proceed with caution here to not lock ourselves into a system that makes sense today, next month, next year, but not in 5 years from now.

a.collins · August 25, 2016, 12:52pm

I think that the problem of strains may be made more difficult by the complex and largely unknown breeding histories of inbred mouse strains. I am unable to explain the dramatic differences we have seen between B6 and BALB/c strains, other than by suggesting that their genes are derived from different sub-species of the house mouse. I know I may be wrong about this, but if I am right, there may be no hope of ever matching the B6 and BALB/c genes. This makes me think we should approach inbred mouse strains differently. They may not correspond to individuals within a population, but rather individuals within 2 or more populations. I suspect the different strains will ultimately cluster around 2, 3 or maybe more ancestral sub-species. This would present very different challenges to what we face with human variation.

ctwatson · August 25, 2016, 1:03pm

I see. So we should approach naming conventions with this in mind. If we assume there are three sub-species, do you believe we will be able to ultimately nail these down?

a.collins · August 28, 2016, 1:52am

That is what I am thinking, though the only evidence for this is the extreme divergence of C57 and BALB/c mice in the IGHV locus. Not only are the functional gene numbers hugely different, but if you look at the differences between the two strains, just focusing on the genes that IMGT recognises, they are much more divergent than random human haplotypes.

Of course, you might see the similarities between 129/Sv and BALB/c as equally suggestive of the sub-species hypothesis. Certainly it shows that strains can be very similar as well as very different.

bussec · September 5, 2016, 9:00pm

I initially thought about this, but since a “transcript” would typically refer to a rearranged locus, it would require additional annotation (mainly the CDR3), which I consider to be beyond the scope of segment nomenclature. This does not, of course, mean that we should not have a standardized way to annotate such transcripts, but I guess that this is already being worked on by the people of the repository group (like here or here). Or were you thinking about “sterile” (i.e. non-rearranged) transcripts?