Hi Andrew,
You make a good point. As I gave this more thought, I realized that a technical solution that satisfies the bioinformaticians isn’t sufficient; we also need some simple rules to make it easier for biologists. While I was initially against semantic information in the name, I now think it is okay and likely even required. The scenario that popped into my head was writing a paper. If gene names were simply IGHV5, IGHV4, etc., then every paper would have to write long sentences such as “We compared the IGHV5 gene in the BALB/c breed of Mus musculus against the IGHV4 gene in the C57 breed”. Essentially every time a gene is mentioned in a paper, it would need a full descriptor, “IGHV5 gene in BALB/c” or “IGHV4 gene in C57” (totally made up example but you get the idea). Researchers are going to automatically adjust the gene names just so they can be more succinct in their writing, which will lead to non-standard nomenclature and a future mess. I’m on board with simple name rules, and if we can have the names contain all of the following information, then they would be very useful as descriptors:
- Clearly defines the species
- Clearly defines the strain/breed of the species
- Clearly defines the gene, gene type, gene family, allele, etc.
Anything else you can think of?
There are some standard databases that could be utilized, e.g.
- There is the NCBI taxonomy number
- There are the three-letter species designations in KEGG (though these may not be good enough for all species).
Using the NCBI taxonomy number, a fully descriptive name might be:
10090.balb.IGHV5
Though this seems a bit unwieldy.
As for the bioinformaticians, this doesn’t need to be harder for them because we can satisfy both groups at the same time. We can allow semantic names, but we also require that the same information be stored in metadata, so programs don’t attempt to “parse” the name; they process the metadata fields instead.
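To make that concrete, here is a rough Python sketch of what I have in mind. The class and field names (GeneRecord, taxonomy_id, strain, gene) are made up for illustration, not a proposed schema, and the dot-separated name is just the format from my example above. The point is that the name is generated from the metadata for people to read, while programs compare the fields directly and never split the name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical record: the metadata fields are the source of truth."""
    taxonomy_id: int  # NCBI taxonomy number, e.g. 10090 for Mus musculus
    strain: str       # strain/breed label (assumed short form, e.g. "balb")
    gene: str         # gene designation, e.g. "IGHV5"

    @property
    def display_name(self) -> str:
        # Human-readable name derived from the metadata, for use in papers.
        return f"{self.taxonomy_id}.{self.strain}.{self.gene}"


def same_gene_different_strain(a: GeneRecord, b: GeneRecord) -> bool:
    # Works on the metadata fields only; the name string is never parsed.
    return (
        a.taxonomy_id == b.taxonomy_id
        and a.gene == b.gene
        and a.strain != b.strain
    )


balb = GeneRecord(taxonomy_id=10090, strain="balb", gene="IGHV5")
c57 = GeneRecord(taxonomy_id=10090, strain="c57", gene="IGHV5")

print(balb.display_name)
print(same_gene_different_strain(balb, c57))
```

If the naming rules change later, only display_name would need to change, and any tool working from the fields would be unaffected.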
Scott