Sponsored by the AIRR Community

How best to collect and share germline gene repertoires?

I see your point, but I strongly believe this has to be done in collaboration with the established nomenclature system; otherwise it will likely be messy and will also require an extensive organization to keep track of entries. I am not convinced that two separate systems should exist in parallel. There might be a provisional system for alleles that have not yet been entered into the standard (human) nomenclature, but eventually all confirmed alleles should end up there and those names should be used. Compare the CD nomenclature for cellular antigens: as antigens get CD recognition, old names are often phased out.

This is a timely subject, as the VDJServer team has been thinking about how to appropriately version and provide provenance for the germline database that we use. Currently, our database is stored and maintained in-house, but it really should be made public in some way. We start with the IMGT sequences and then run a series of post-processing steps to add annotation to those sequences. I definitely like the idea of having a shared germline database with a standard set of annotations.

4 posts were split to a new topic: Draft criteria for inclusion of inferred alleles into a germline database

For humans, the IMGT nomenclature is so widely known and used that I think it should be retained as the foundation for the future. Whether or not that should include the IMGT fix for dealing with Corey’s sequences is another matter. I suspect that CNVs and structural variation may be so great that the D designation for duplicated genes will not help. The IMGT positional nomenclature, if it worked perfectly, would allow heterozygosity at a gene locus to be instantly recognizable, but it already fails on that score. It should be seen as quasi-positional, with the understanding that the determination of heterozygosity may sometimes be problematic, or even impossible.

The mouse IMGT nomenclature is surely dead in the water. I would retain the nomenclature for B6 mice, but until we have much more knowledge of the genes in other strains, it will be impossible to know whether or not that nomenclature can be developed for other strains. I would certainly say that it will be quite unsuited to the BALB/c and NOD repertoires, and if a new repertoire is to be developed for these and other strains, it should be independent of reported gene positions.

We need input from people working with other species if we are to resolve issues with those species, but still, here are a couple of my thoughts…

As a human/mouse researcher, I have not looked closely at other species, but the situation is confusing. The rat and platypus seem to have positional nomenclatures, but most other species just have a relatively short list of unmapped sequences. The ‘mapping’ of the rat and platypus sequences comes from analysis of genome reference sequences, but since the assemblies of the genome sequences for these species did not address the special challenges of BCR/TCR genes, they cannot be accepted as accurate. Most if not all rat sequences were separately identified, before being mapped using the rat genome reference sequence, so the mapping might be queried, but the sequences could be real. In the case of the platypus, not only could the mapping be queried, but since there is no corroboration for the sequences themselves, they too could be queried.

IGHV genes for other species have been reported, but are not yet found in the IMGT database. I would prefer to see such sequences named using a new non-position based nomenclature. But where do the responsibilities of the AIRR community lie? Should we and can we be concerned about germline genes in the elephant? Are there other organizations who we might partner with to advance AIRR studies in these species?

Perhaps it’s time to initiate discussion on related topics.

Is it possible to develop guidelines for the acceptance of germline genes identified in genomic reference sequences? How can we disseminate the idea that assemblies that don’t address the special challenges of BCR/TCR genes will not be sufficiently accurate? Sequences that come from such dubious assemblies may still have some value, as long as a caveat is attached to each sequence. So is there any place for such sequences, or should they be kept out of any future germline database?

Are guidelines necessary for the identification of germline genes by HTS of unrearranged IGHV? Our solution to this problem, when analysing 454 data of unrearranged sequences from Papua New Guineans, was to require 3 forward and 3 reverse complement reads. That was a means of addressing 454 errors. Will different guidelines be required as each new platform is rolled out?
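To make the 3 forward plus 3 reverse-complement rule concrete, here is a minimal sketch of such a filter. The input shape (pairs of sequence and strand), the function names, and the use of exact full-length matching are all assumptions made for illustration; this is not the code used in the original 454 analysis.

```python
from collections import Counter

# Translation table for complementing nucleotides (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def supported_sequences(reads, min_forward=3, min_reverse=3):
    """Keep sequences seen in enough forward AND reverse-complement reads.

    `reads` is an iterable of (sequence, strand) pairs, where strand is
    "+" for a forward read and "-" for a reverse-complement read
    (hypothetical input format). Reverse reads are flipped back to the
    forward orientation before counting. Matching is exact, which is a
    simplification: real data would need trimming and alignment first.
    """
    forward = Counter()
    reverse = Counter()
    for seq, strand in reads:
        if strand == "+":
            forward[seq.upper()] += 1
        elif strand == "-":
            reverse[reverse_complement(seq)] += 1
    return sorted(
        s for s in forward
        if forward[s] >= min_forward and reverse[s] >= min_reverse
    )
```

The two thresholds are parameters precisely because of the question above: a platform with a different error profile might call for different values, or for a different kind of evidence altogether.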

wrt other species:

They didn’t annotate any BCR genes, but it might be worth reaching out to see what/who they know who might be interested in helping us…

Note that the discussion of criteria for inclusion of inferred alleles has been moved to its own topic:

Let me offer a 5. Something else. What about having a set of programs that transforms the IMGT database into an AIRR germline database? By this, I mean we sidestep the license issue of redistributing IMGT’s database, but we don’t have to create a separate alternative. Also, those programs could incorporate additional alleles, genes, etc. and put all the data into an AIRR-compliant format. So instead of distributing a germline database, we distribute a reproducible process to create an AIRR-compliant germline database. The steps might be:

  1. User downloads database files from IMGT.
  2. Run program to parse and annotate IMGT data into a standard AIRR germline format.
  3. Run program which incorporates AIRR approved alleles, genes, etc.
  4. Run program to include user’s local alleles, genes, etc.

To make this reproducible, these steps could be put into a Dockerfile so that all the programs are compiled and installed properly. The generated Docker image could then be used by computational tools.
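As a rough illustration of steps 2 to 4, the sketch below parses IMGT-style FASTA text and layers approved and local alleles on top, recording where each entry came from. It assumes the allele name is the second pipe-delimited field of the FASTA header and that `.` marks alignment gaps; the record layout (`sequence`, `source`), the function names, and the policy of refusing conflicting sequences under one name are hypothetical choices, not an agreed AIRR format.

```python
def parse_imgt_fasta(text):
    """Step 2: return {allele_name: ungapped sequence} from FASTA text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split("|")
            # Assumption: the allele name is the second header field.
            name = fields[1] if len(fields) > 1 else fields[0]
            records[name] = ""
        elif name is not None:
            # Drop gap characters so every source is stored ungapped.
            records[name] += line.replace(".", "").upper()
    return records


def build_germline_set(imgt_text, approved=None, local=None):
    """Steps 2-4: merge IMGT, approved and local alleles with provenance.

    `approved` and `local` are {allele_name: sequence} dicts. A name that
    reappears with a different sequence raises ValueError rather than
    silently overwriting the earlier entry.
    """
    layers = [
        ("IMGT", parse_imgt_fasta(imgt_text)),
        ("AIRR-approved", approved or {}),
        ("local", local or {}),
    ]
    merged = {}
    for source, alleles in layers:
        for name, seq in alleles.items():
            seq = seq.upper()
            if name in merged and merged[name]["sequence"] != seq:
                raise ValueError(
                    f"{name}: sequence from {source} conflicts with "
                    f"{merged[name]['source']}"
                )
            merged.setdefault(name, {"sequence": seq, "source": source})
    return merged
```

Keeping the source label on every record means a downstream tool can always tell an IMGT entry from an approved or local addition, which speaks to the versioning and provenance concern raised earlier in the thread.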

Happy to see this here.

The VBASE2 database is constructed automatically by a computer program.
Maybe within this group, the process can be revisited and then run at regular intervals to update the new database.
One has to look into the resources required.

P.S.
I must confess that the DAS server is not working anymore. There were too many changes in the DAS Server at the EBI, which made it impractical to always implement the latest version. I have to look into this again, and maybe the procedures are now much more straightforward.

Hi Werner,

It’s great to have someone with your experience join the discussion.

I have been interested for many years in the differences between VBASE2 and IMGT. VBASE2 has always seemed much more comprehensive, and I can see now that this may have resulted from the automatic searching for sequences.

Looking back, I am interested in whether you might wish for any changes in the rules that governed acceptance of sequences into VBASE2, and looking forward, I would be really interested in your thoughts on how to manage sequences that have been inferred from VDJ rearrangements.

We can look at the rules we used one by one and discuss if changes to the rules are beneficial.

For sequences that have been inferred from VDJ rearrangements only, first of all we put them in class III (and some sequences moved later to class I, once the germline genes were found). We required independent rearrangements from independent experiments, ideally from independent publications.
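A rule of this kind could be written down roughly as follows. This is only a sketch of the rule as described in this post, not the VBASE2 implementation: the `Rearrangement` record, the function names, and the threshold of two independent experiments are assumptions, and cases this post does not cover (for example, a germline sequence with no rearrangement support) are deliberately left undecided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rearrangement:
    """One observed VDJ rearrangement supporting a candidate sequence."""
    experiment: str   # hypothetical identifier of the experiment
    publication: str  # hypothetical identifier of the publication


def evidence_class(germline_found, rearrangements, min_independent=2):
    """Return "I", "III", or None for a candidate sequence.

    "III": supported only by rearrangements from enough independent
           experiments (min_independent is an assumed threshold).
    "I":   the same rearrangement support, plus a germline gene found.
    None:  not enough independent support; not decided by this sketch.
    """
    experiments = {r.experiment for r in rearrangements}
    if len(experiments) < min_independent:
        return None
    return "I" if germline_found else "III"


def independent_publications(rearrangements):
    """Count distinct publications, the 'ideally' part of the rule."""
    return len({r.publication for r in rearrangements})
```

Because the class is recomputed from the evidence each time, a class III sequence moves to class I simply by calling `evidence_class` again once the germline gene has been found.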

One point to remember is that we generated VBASE2 at a time when Sanger sequencing was used. It is important to assess the influence of sequencing technology on the process.

I am pleased to see that now many more people are interested in this topic. Many years ago we were only a few…