Sponsored by the AIRR Community

How best to collect and share germline gene repertoires?

@ctwatson-- a fantastic post. Thank you!

As @bussec said before, positional numbering is inherently problematic.

My feeling is that there should be a one-to-one correspondence between unique gene sequences and gene names, and all other information should be contained as metadata about those sequences. Like you say, even the distinction between genes and alleles is a fine point in some situations.

Here are some points that came up during a discussion, I think mostly from @bussec and @w.lees.

  • Inferred germline genes, such as those reported by TIgGER, are not sequences that have come off of a sequencer. They are inferences, and as such cannot be deposited in databases such as GenBank or EMBL. Thus we need a means of storing those (discussed above) that won’t disappear.
  • Novel germline genes are typically minor variants of ones that have been seen before. Does it make sense to use a format that is explicitly designed to store differences, such as Variant Call Format? Personally I feel that the sequences are short and that it would be more trouble than it’s worth, but it’s a reasonable question and merits discussion.
  • What annotations would people like to see for germline genes? Apparently there are some appropriate tags in EBI (mentioned by @bussec).
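To make the second point above concrete, here is a minimal sketch of what storing a novel allele as differences from a known one might look like. It is only an illustration and not the Variant Call Format itself: the function names are hypothetical, positions are 1-based, and it assumes the two sequences have the same length (substitutions only, no indels).

```python
def substitutions(reference, novel):
    """Return (position, ref_base, alt_base) tuples, with 1-based positions.

    Assumption: both sequences have equal length, so only substitutions occur.
    """
    if len(reference) != len(novel):
        raise ValueError("this sketch handles substitutions only, not indels")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, novel), start=1)
        if ref_base != alt_base
    ]


def apply_substitutions(reference, subs):
    """Rebuild the novel sequence from the reference and its substitutions."""
    bases = list(reference)
    for pos, ref_base, alt_base in subs:
        # Guard against applying differences to the wrong reference.
        if bases[pos - 1] != ref_base:
            raise ValueError("reference base mismatch at position %d" % pos)
        bases[pos - 1] = alt_base
    return "".join(bases)
```

Even in this toy form, every stored difference is only meaningful alongside the exact reference it was computed against, which is part of why storing the full (short) sequences may be less trouble.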

To follow up on point 3:

Here is the current version of the INSDC feature table. The standard allows annotation of V, D and J segments, V and C regions, germline and rearranged states, Igh switch regions and N-nucleotides. RSS sites could be annotated using the “misc_recomb” key. Since the format is only for nucleic acids, we would have to find a second format to describe the protein features (like FWR/CDR).

IMGT seems to use the EMBL format, with extensions to the INSDC table: here’s an example. The extensions seem to work quite naturally, although most of them are at codon level.

The following paper can serve as a model for one direction for a community-curated germline database:

A little background: people are continually running phylogenetic tree software to infer parts of the tree of life. The idea of this paper is to enable all of those little inferences to be combined into one master tree, hosted at http://www.opentreeoflife.org/.

I emphasize that this is a much harder problem than sharing germline genes. Trees disagree with one another in complicated ways, and resolving those differences is a challenging inferential problem in itself. However, I think that we can learn a lot from their workflow and their goals.

In the repertoire case, each study would correspond to an inferred germline gene repertoire from a single individual. Upon merging that study to master, a process would be spawned that would compare that germline gene repertoire to previous ones and register any new germline genes observed. It would also increment the number of observations of each already known germline gene.
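A minimal sketch of that merge step, assuming a repertoire is simply a set of nucleotide strings and that two germline genes are "the same" only when their sequences match exactly. The function name and the in-memory `Counter` registry are hypothetical; a real process would presumably run against files in the repository.

```python
from collections import Counter


def merge_repertoire(registry, repertoire):
    """Merge one individual's inferred germline repertoire into the registry.

    registry maps sequence -> number of individuals it has been observed in.
    Returns the sequences that had not been registered before, sorted.
    """
    # Normalise case so identical sequences compare equal; using a set means
    # each sequence is counted at most once per individual.
    normalised = {seq.upper() for seq in repertoire}
    new_sequences = sorted(seq for seq in normalised if seq not in registry)
    for seq in normalised:
        registry[seq] += 1
    return new_sequences


registry = Counter()
first = merge_repertoire(registry, {"ACGT", "ACGA"})
second = merge_repertoire(registry, {"acgt", "TTGA"})
```

After the two merges, `second` contains only the sequence that the first individual did not have, and the shared sequence has an observation count of 2.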

They have a “curator application”, which is cool and would be great, but to begin with one could just interact with GitHub directly.


A couple of points here. I think it is important to remember that many species other than human and mouse lack comprehensive databases, so the germlines identified for these will not be minor variants of known genes. We will therefore require a more robust naming convention that takes account of these cases.
Also, identifying germlines from NGS data does not necessarily mean the sequences are purely inferred from consensus sequences. NGS data also contain real sequences, which can therefore be submitted to GenBank.

Re the use of VCF files. There is a move in the genomics field toward the use of reference graphs for representing variation at hypervariable loci, particularly for the assembly of genomic data and discovery of novelty. Here is a nice example of what I am referring to: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4449272/figure/F2/

(from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4449272/)

Great post. One thing though about nomenclature. We probably need to consider the fact that there is an “Immunoglobulins (IG), T cell Receptors (TR) and Major Histocompatibility (MH) Nomenclature Sub-Committee” within IUIS (closely linked to IMGT (sic!)) that deals with germline gene assignments. Jamie knows more about it as she’s a member. I suppose that we will need to involve this group if any proposals with respect to new germline genes are to get wider acceptance. It is not to anyone’s advantage if there are competing nomenclatures. In my view this would indeed be a waste of resources and it will only add to the confusion (compare the many different amino acid numbering schemes that are/have been in use). By working together with an established infrastructure we will furthermore not be forced to establish another one (or rely on voluntary forces that for a diversity of reasons may not be available in the long run).

This is a great point, and we will always want to link closely with established nomenclature. But any dataset with more sequences than IMGT is going to need to name them. Say we tried to use IMGT-like nomenclature for these additional sequences, and IMGT later extended its own collection of sequences. The new IMGT names might then collide with ours in an unfortunate way.

Thus any expanded set of sequences is going to require some new naming convention. It just depends where it starts.

I see your point, but I strongly believe this has to be done in collaboration with the established nomenclature system; otherwise it will likely be messy and also require an extensive organization to keep track of entries. I am not convinced that two separate systems should exist in parallel. There might be a provisional system for alleles that have not yet been entered into the standard (human) nomenclature, but eventually all confirmed alleles should end up there and those names be used. Compare the CD nomenclature for cellular antigens: as antigens get CD recognition, old names are often phased out.

This is a timely subject, as the VDJServer team has been thinking about how to appropriately version and provide provenance for the germline database that we use. Currently, our database is stored and maintained in-house, but it really should be made public in some way. We start with the IMGT sequences, and then run a series of post-processing steps to add additional annotation to those sequences. I definitely like the idea of having a shared germline database with a standard set of annotations.

4 posts were split to a new topic: Draft criteria for inclusion of inferred alleles into a germline database

For humans, the IMGT nomenclature is so widely known and used that I think it should be retained as the foundation for the future. Whether or not that should include the IMGT fix for dealing with Corey’s sequences is another matter. I suspect that CNVs and structural variation may be so great that the D designation for duplicated genes will not help. The IMGT positional nomenclature, if it worked perfectly, would allow heterozygosity at a gene locus to be instantly recognizable, but it already fails on that score. It should be seen as quasi-positional, with the understanding that the determination of heterozygosity may sometimes be problematic, or even impossible.

The mouse IMGT nomenclature is surely dead in the water. I would retain the nomenclature for B6 mice, but until we have much more knowledge of the genes in other strains, it will be impossible to know whether or not that nomenclature can be developed for other strains. I would certainly say that it will be quite unsuited to the BALB/c and NOD repertoires, and if a new repertoire is to be developed for these and other strains, it should be independent of reported gene positions.

We need input from people working with other species if we are to resolve issues with those species, but still, here are a couple of my thoughts…

As a human/mouse researcher, I have not looked closely at other species, but the situation is confusing. The rat and platypus seem to have positional nomenclatures, but most other species just have a relatively short list of unmapped sequences. The ‘mapping’ of the rat and platypus sequences comes from analysis of genome reference sequences, but since the assemblies of the genome sequences for these species did not address the special challenges of BCR/TCR genes, they cannot be accepted as accurate. Most if not all rat sequences were separately identified, before being mapped using the rat genome reference sequence, so the mapping might be queried, but the sequences could be real. In the case of the platypus, not only could the mapping be queried, but since there is no corroboration for the sequences themselves, they too could be queried.

IGHV genes for other species have been reported, but are not yet found in the IMGT database. I would prefer to see such sequences named using a new non-position based nomenclature. But where do the responsibilities of the AIRR community lie? Should we and can we be concerned about germline genes in the elephant? Are there other organizations who we might partner with to advance AIRR studies in these species?

Perhaps it’s time to initiate discussion on related topics.

Is it possible to develop guidelines for the acceptance of germline genes identified in genomic reference sequences? How can we disseminate the idea that assemblies that don’t address the special challenges of BCR/TCR genes will not be sufficiently accurate? Sequences that come from such dubious assemblies may still have some value, as long as a caveat is attached to each sequence. So is there any place for such sequences, or should they be kept out of any future germline database?

Are guidelines necessary for the identification of germline genes by HTS of unrearranged IGHV? Our solution to this problem, when analysing 454 data of unrearranged sequences from Papua New Guineans, was to require 3 forward and 3 reverse complement reads. That was a means of addressing 454 errors. Will different guidelines be required as each new platform is rolled out?
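A rule like "3 forward and 3 reverse complement reads" can be sketched as below. This is a simplification under stated assumptions: each read comes with a known strand, reads are compared by exact match over the same span, and no clustering or error correction is done. The function names and the `(sequence, strand)` input layout are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def supported_sequences(reads, min_forward=3, min_reverse=3):
    """Return sequences backed by enough reads on both strands.

    reads is an iterable of (sequence, strand) pairs, with strand "+" or "-"
    and the sequence given exactly as it was read.
    """
    forward, reverse = Counter(), Counter()
    for seq, strand in reads:
        seq = seq.upper()
        if strand == "+":
            forward[seq] += 1
        else:
            # Bring reverse-strand reads into forward orientation before counting.
            reverse[reverse_complement(seq)] += 1
    return sorted(
        seq
        for seq in forward
        if forward[seq] >= min_forward and reverse[seq] >= min_reverse
    )
```

Keeping the two thresholds as parameters makes it easy to ask the question above for another platform: the rule stays the same and only the numbers change.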

wrt other species:

They didn’t annotate any BCR genes, but it might be worth reaching out to see what/who they know who might be interested in helping us…

Note that the discussion of criteria for inclusion of inferred alleles has been moved to its own topic:

Let me offer a 5. Something else. What about having a set of programs which transforms the IMGT database into an AIRR germline database? By this, I mean we sidestep the license issue of redistributing IMGT’s database, but we don’t have to create a separate alternative. Also, those programs could incorporate additional alleles, genes, etc and put all the data into an AIRR compliant format. So instead of distributing a germline database, we distribute a reproducible process to create an AIRR compliant germline database. The steps might be:

  1. User downloads database files from IMGT.
  2. Run program to parse and annotate IMGT data into a standard AIRR germline format.
  3. Run program which incorporates AIRR approved alleles, genes, etc.
  4. Run program to include user’s local alleles, genes, etc.

To make this reproducible, these steps could be put into a Dockerfile so that all the programs are compiled and installed properly. The generated docker image could then be used by computational tools.
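Steps 2 to 4 could look roughly like the sketch below. Several things here are assumptions rather than anything agreed: the FASTA header is taken to be IMGT-style with the allele name in the second `|`-separated field, `.` is treated as an alignment gap to be removed, the output is a plain dictionary standing in for whatever the AIRR germline format turns out to be, and the function names are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {allele_name: sequence}.

    Assumption: the allele name is the second '|'-separated header field;
    headers without '|' are used whole.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split("|")
            name = fields[1] if len(fields) > 1 else fields[0]
            records[name] = ""
        elif name is not None:
            # Drop alignment gap characters and normalise case.
            records[name] += line.replace(".", "").upper()
    return records


def build_germline_set(imgt_fasta, approved, local):
    """Combine the user's IMGT download with approved and local alleles.

    approved and local are {name: sequence} dictionaries.
    """
    germline = {}
    sources = (
        ("IMGT", parse_fasta(imgt_fasta)),
        ("AIRR-approved", approved),
        ("local", local),
    )
    for source, records in sources:
        for name, seq in records.items():
            # Earlier sources win on a name clash, so a local entry cannot
            # silently overwrite a published one.
            germline.setdefault(name, {"sequence": seq.upper(), "source": source})
    return germline
```

Recording the source of each entry keeps provenance visible in the generated database, and the precedence order (IMGT, then approved, then local) is just one possible policy.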

Happy to see this here.

The VBASE2 database is constructed automatically by a computer program.
Maybe within this group the process can be revisited and then run at regular intervals to update the new database.
One has to look into the resources required.

P.S.
I must confess that the DAS server is not working anymore. There were too many changes in the DAS server at the EBI, which made it impractical to always implement the latest version. I have to look into this again, and maybe the procedures are now much more straightforward.

Hi Werner,

It’s great to have someone with your experience join the discussion.

I have been interested for many years in the differences between VBASE2 and IMGT. VBASE2 has always seemed much more comprehensive, and I can see now that this may have resulted from the automatic searching for sequences.

Looking back, I am interested in whether you might wish for any changes in the rules that governed acceptance of sequences into VBASE2, and looking forward, I would be really interested in your thoughts on how to manage sequences that have been inferred from VDJ rearrangements.

We can look at the rules we used one by one and discuss if changes to the rules are beneficial.

For sequences that have been inferred from VDJ rearrangements only, first of all we put them in class III (and some sequences moved later to class I, once the germline genes were found). We required independent rearrangements from independent experiments, ideally from independent publications.
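As a starting point for discussing the rule, here is a minimal sketch of it. It is not the VBASE2 code: the function name is hypothetical, the threshold of two independent experiments is an assumption (the minimum number is not stated here), and cases not described above simply return `None`.

```python
def classify(rearrangement_experiments, germline_found, min_independent=2):
    """Assign a class to a sequence supported by rearrangements.

    rearrangement_experiments lists an identifier (for example a publication)
    for each rearrangement in which the sequence was observed.
    """
    # Rearrangements from the same experiment count only once.
    independent = len(set(rearrangement_experiments))
    if independent < min_independent:
        # Not enough independent support; not covered by the rule sketched here.
        return None
    # Rearrangement evidence only -> class III; class I once the germline
    # gene itself has also been found.
    return "I" if germline_found else "III"
```

With HTS data, one question is what should count as an "independent experiment" in the first argument: a library, an individual, or a publication.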

One point to remember is that we generated VBASE2 at a time when Sanger sequencing was used. It is important to assess the influence of sequencing technology on the process.

I am pleased to see that now many more people are interested in this topic. Many years ago we were only a few…