How best to collect and share germline gene repertoires?

ematsen · February 24, 2016, 3:12pm

How best to assemble and share germline gene repertoires? I’m actively soliciting ideas as part of the “Tools and Resources” category of the AIRR work to organize and coordinate immune profiling research, so there’s some momentum here to take action.

The need: IMGT is a great resource for germline genes, though it’s clear that there are more germline alleles than are present in that resource, such as from Papua New Guinea and South Africa, and there may be alleles present in IMGT that are in error. It’s good for such a widely-used resource to move carefully and have high standards for acceptance, though perhaps there is something left to be desired here in terms of responsiveness. I note that downstream analyses of repertoires do depend crucially on good germline gene sets.

The options seem to be:

Try to coax IMGT to version their database and update it more frequently.
Build a human-curated IMGT alternative, which is the approach taken by Andrew Collins with his UNSWIg database (which appears to be down at the moment).
Infer alleles directly from data as with the TIgGER program (though this does depend on having a germline database to start with).
Use data from genome (not repertoire) sequencing projects, as with http://vgenerepertoire.org/
Something else.

Thoughts?

ematsen · February 24, 2016, 3:16pm

My proposal for “something else” is a database requiring minimal human curation, but lots of automatic curation by software. The key feature would be an API that could be used by annotation software to report back what germline genes were observed. It might work as follows:

Cast a wide net for a starting set of alleles, including IMGT, UNSWIg, vgenerepertoire, etc.
Users run annotation software, such as those listed here, which reports back which alleles were found in annotations.
Users also run software that infers alleles directly from data, such as TIgGER, which also reports back that certain new alleles were found.
This information is automatically aggregated by the database, which maintains a running list of known alleles and the support for them, as well as periodic versioned releases of a simple flat file of germline genes that can then be used by germline gene software.

There are a variety of issues here, including privacy concerns and the need for each data set to just be reported once. Or perhaps this is just too complicated of a solution for a problem that will just go away as we get better at sequencing genomes.
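
As a rough illustration of the aggregation step, here is a minimal sketch of a registry that tracks which data sets support each allele. The class name, the method names, and the idea of keying support by a data set ID are all assumptions made for the example, not an existing API. Storing data set IDs in a set means a data set that reports twice is still counted once; the privacy question is not addressed here.

```python
from collections import defaultdict


class AlleleRegistry:
    """Running list of germline alleles and the data sets that support them.

    Hypothetical sketch: names and structure are illustrative only.
    """

    def __init__(self, starting_alleles):
        # Maps sequence -> set of data set IDs that reported it.
        self.support = defaultdict(set)
        # Seed with the "wide net" of starting sequences (no support yet).
        for seq in starting_alleles:
            self.support[seq.upper()]

    def report(self, dataset_id, observed_sequences):
        """Record one report; re-submitting the same data set adds nothing."""
        for seq in observed_sequences:
            self.support[seq.upper()].add(dataset_id)

    def release(self, min_support=1):
        """Return FASTA-style text of alleles seen in >= min_support data sets."""
        lines = []
        for seq in sorted(self.support):
            n = len(self.support[seq])
            if n >= min_support:
                lines.append(f">support={n}\n{seq}")
        return "\n".join(lines)


# Toy sequences, far shorter than real germline genes.
registry = AlleleRegistry(["ACGT"])
registry.report("study1", ["ACGT", "ACGA"])
registry.report("study1", ["ACGT"])   # duplicate report, ignored
registry.report("study2", ["acgt"])
print(registry.release(min_support=2))
```

A periodic versioned release would then just be the output of `release` at some support threshold, committed or tagged somewhere stable.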

ematsen · March 1, 2016, 2:11pm

Another database (H/T TIgGER authors):

a.collins · March 2, 2016, 3:19am


Erick, already this forum has proven its worth to me, by drawing my attention to a resource I did not know about - http://vgenerepertoire.org/. And this reassures me that the AIRR initiative will focus on all Ig sequences, not just the human. I think that is very important, as there is likely to be a mountain of data coming from non-human species in the next few years. This could be problematic, but AIRR, and perhaps this forum, can hopefully help ensure that the data generated are of good quality.

Antibody genes have been reported for many species in recent years, based on analysis of genome reference sequences, and I think that vgenerepertoire.org will be filled with such sequences. I haven’t yet been able to do a personal evaluation of the vgenerepertoire data, but I think we need to be very cautious of most reported germline gene sequences. I do not believe that a genome reference sequence is an appropriate source of data regarding antibody genes, unless the antibody gene loci were sequenced and assembled independently of the rest of the genome. Assemblies of short reads will not be able to accurately represent these loci.

a.collins · March 2, 2016, 3:39am

A feature of VBASE2 that I like is its nomenclature, and nomenclature may need to be an issue addressed by AIRR. The IMGT nomenclature has the approval of the Human Genome Organisation (HUGO), and IMGT has found fairly simple solutions to the challenges to the underlying logic of the nomenclature that arose from recent studies. The situation with the mouse is more complicated.

Our recent mouse study identified such divergence between the IGHV genes of the BALB/c mouse and the C57BL/6 strain that I think it will be difficult for the IMGT mouse nomenclature to survive. This is because it is no longer possible to be certain of the relationship between C57 and BALB/c genes that are presently considered allelic variants of one another. If there are 50% more IGHV genes in the BALB/c strain than in the C57 strain, their sequences may never be paired, and the apparent logic of the IMGT nomenclature breaks down.

VBASE2 has a nomenclature that does not attempt to describe relationships between sequences. Instead, each sequence is given a unique number, along with identifiers showing the species and gene type, e.g. musIGHV233. The three classes of sequences are also a useful idea. VBASE2 explains it this way: “Class 1 sequences are supported by a genomic sequence and a rearrangement. Class 2 contains sequences with genomic evidence only and class 3 holds sequences which have been found in rearrangements only.” A strategy of this kind would allow the incorporation of sequences that are inferred from VDJ rearrangements. I consider it to be essential for such sequences to be included for two reasons. Firstly, the evidence for inferences from hundreds and even thousands of VDJ rearrangements can be very convincing. And secondly, such inferences are likely to be most if not all we have to consider, until technologies change again, or research interests evolve. Genomic studies of antibody genes will for the time being probably be few and far between.
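
To make the three-class idea concrete, here is a small sketch that assigns a class from the two kinds of evidence and builds a name in the style of the musIGHV233 example. The function names are made up for illustration and this is not VBASE2's own code.

```python
def evidence_class(has_genomic, has_rearrangement):
    """Return 1, 2 or 3 following the class definitions quoted above."""
    if has_genomic and has_rearrangement:
        return 1  # genomic sequence and a rearrangement
    if has_genomic:
        return 2  # genomic evidence only
    if has_rearrangement:
        return 3  # found in rearrangements only (e.g. inferred alleles)
    raise ValueError("a sequence needs at least one kind of evidence")


def numbered_name(species, gene_type, number):
    """Build a relationship-free name: species prefix, gene type, number."""
    return f"{species}{gene_type}{number}"


print(numbered_name("mus", "IGHV", 233), evidence_class(False, True))
```

Under such a scheme an allele inferred from repertoire data would enter as class 3 and could be promoted to class 1 once genomic evidence turns up, without its name changing.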

ematsen · May 23, 2016, 9:43pm

I’m reanimating this thread because of an interesting discussion that we recently had on a conference call with a number of contributors to the forum, with a number of ideas from @dooley @rarnaout @caschramm @bussec, among others.

There was general consensus that nobody is going to try to start a curated resource like IMGT. Thus, what can we do that is somewhat decentralized?

The most basic thing is simply to put the data somewhere that people can see it. GitHub has advantages for this because it is naturally versioned and people can see diffs. @bussec has done this beautifully over at https://github.com/b-cell-immunology/sciReptor_library. @dooley pointed out that one can upload data to iPlant and mint a DOI for it.

The second tier would be that plus a directory of what resources are where, as well as some standards for sharing. For example, one could imagine that we all agree on a file format, and then there is some machine-readable file that describes where resources are in such a way that a computer program could use that directory to automatically go out and grab things. This file could be updated by pull requests on GitHub.
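
A sketch of what such a machine-readable directory might look like, assuming JSON as the format. The field names and the example.org URLs are placeholders invented for the example; nothing here has been agreed on.

```python
import json

# Hypothetical directory file; every URL and field name is a placeholder.
DIRECTORY_JSON = """
{
  "format_version": 1,
  "resources": [
    {"name": "example_human_igh", "species": "human", "locus": "IGH",
     "format": "fasta", "url": "https://example.org/human_igh.fasta"},
    {"name": "example_mouse_igh", "species": "mouse", "locus": "IGH",
     "format": "fasta", "url": "https://example.org/mouse_igh.fasta"},
    {"name": "example_human_igk", "species": "human", "locus": "IGK",
     "format": "fasta", "url": "https://example.org/human_igk.fasta"}
  ]
}
"""

REQUIRED_FIELDS = {"name", "species", "locus", "format", "url"}


def load_directory(text):
    """Parse the directory and check each entry has the required fields."""
    directory = json.loads(text)
    for entry in directory["resources"]:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"{entry.get('name', '?')}: missing {sorted(missing)}")
    return directory["resources"]


def find_resources(resources, species, locus):
    """Return the URLs a program would fetch for one species and locus."""
    return [r["url"] for r in resources
            if r["species"] == species and r["locus"] == locus]


resources = load_directory(DIRECTORY_JSON)
print(find_resources(resources, "mouse", "IGH"))
```

A pull request that adds a resource would then be a change of a few lines to this file, which is easy to review in a diff.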

The third tier is similar but all of the data is in one GitHub repository. One could imagine that each update to the database is a pull request, but this pull request would trigger a job that would do some basic consistency checks, for example that people aren’t adding sequences that are already in the database, and that everything is properly formatted.
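
The kind of check such a job might run can be sketched as follows, assuming the database is a FASTA flat file. The function names and the specific rules (ACGTN only, no repeated names or sequences) are assumptions for illustration.

```python
import re

VALID_SEQUENCE = re.compile(r"^[ACGTN]+$")


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks).upper()))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data before first header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks).upper()))
    return records


def check_submission(existing_fasta, new_fasta):
    """Return a list of problems; an empty list means the checks pass."""
    existing = parse_fasta(existing_fasta)
    known_seqs = {seq: name for name, seq in existing}
    known_names = {name for name, _ in existing}
    problems = []
    for name, seq in parse_fasta(new_fasta):
        if not name:
            problems.append("record with empty name")
        if not VALID_SEQUENCE.match(seq):
            problems.append(f"{name}: empty or non-ACGTN sequence")
        if name in known_names:
            problems.append(f"{name}: name already in database")
        if seq in known_seqs:
            problems.append(f"{name}: sequence duplicates {known_seqs[seq]}")
        # Also catch duplicates within the submission itself.
        known_names.add(name)
        known_seqs.setdefault(seq, name)
    return problems


print(check_submission(">A\nACGT\n", ">B\nACGT\n>C\nACGX\n"))
```

A continuous-integration job on the pull request could fail whenever the returned list is non-empty and post the messages as a comment.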

None of this would have human curation, and thus one would invariably have incorrect sequences pulled into this web of information. But, at least there could be a consistent means of fetching those sequences and having names for them.

Thoughts?

dooley · May 23, 2016, 11:24pm

You could actually just leave the data in github and use the CyVerse science APIs to expose that github account as a storage repository and then generate a DOI for it via that interface. That gives a storage-agnostic way to publish data.

Of course, that’s by no means the only way to get a DOI for a github repo. Zenodo has a pretty straightforward mechanism as well; see “Referencing and citing content” in the GitHub Docs.

w.lees · May 24, 2016, 10:52am

There’s metadata that we need to collect in a germline library as well as the gene sequence itself. For example, the paper describing IgBLAST (Ye et al) says that the IMGT and NCBI sequences in its libraries have been pre-annotated to identify FR/CDR boundaries (I am not sure exactly how this works for custom libraries, but I think it uses the data from its libraries for the specified organism). The same purpose is satisfied in IMGT, I think, by the IMGT-aligned germline sets.

Identifying required metadata and agreeing how it should be coded will be important if we want to work towards general purpose analysis tools that aren’t restricted to built-in libraries or organisms.

bussec · May 26, 2016, 5:45pm

I had a discussion with some members of our lab to go through this group’s ideas for a germline segment reference sequence infrastructure. Below is a summary of the critical points that came up. In addition, I tried to put our goals (as I understood them) in more general terms. Please feel free to criticize:

####Goals####

The proposed repository standard should make AIRR germline segment sequence data available in a way that ensures

reproducibility (of analysis)
extensibility (of the underlying datasets)
flat hierarchy (of sites publishing data sets)

####Infrastructure####

In general, distributed version control systems (DVCS) are expected to provide a reasonable technical basis to build repositories adhering to the proposed standard. Although we are of course not tied to a given DVCS software (e.g. Git), let alone a service provider, I will refer to this setup as the “Github” solution to remain consistent with our previous discussions.

####Data layers####

(please note that these layers are different from the repository tiers that @ematsen proposed)

Layer 1 / plain sequence data: Plain sequence data should not be hosted by individual sites but instead remain (or be deposited) in central databases (e.g. Genbank, ENA). The primary reason for this is the good and long-term funded infrastructure of these databases, which is unlikely to be matched by an individual site. A second point is that plain sequence data is not very amenable to ‘diff’-ing, which usually works line based, thereby precluding one of the central advantages of a DVCS.
Layer 2 / lists of segments: These lists would assign a segment name [also see below] to a sequence in a database. This could be e.g. a gid or a physical position on a genome assembly. AIRR specific segment annotation (e.g. CDR/FWR boundaries) would be located on this layer.
Layer 3 / curated annotations: Layers 1 and 2 deliberately do not include any centralized curation. Thus the responsibility to use this data to generate a complete reference data set for analysis is shifted to the user. While most of us will likely consider this an important feature, I would also expect that many researchers who are less deeply involved in AIRR will often be interested in “just running a quick analysis”, without too much hassle over which reference set to use. I consider it therefore crucial to have a third annotation layer, which would allow sites to aggregate layer 2 data and add a further annotation layer.
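
To illustrate how Layer 2 could point into Layer 1, here is a minimal sketch of a list entry and how it resolves to a sequence. The record fields, the accession "ACC0001", and the use of a plain dict as a stand-in for a central database lookup are all assumptions for the example.

```python
from dataclasses import dataclass, field

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


@dataclass
class SegmentEntry:
    """One hypothetical Layer 2 record: a name pointing at Layer 1 data."""
    name: str
    accession: str          # identifier of the record in a central database
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    strand: str = "+"
    annotations: dict = field(default_factory=dict)  # e.g. CDR/FWR boundaries


def resolve(entry, layer1):
    """Extract the segment sequence; layer1 maps accession -> sequence."""
    seq = layer1[entry.accession][entry.start - 1:entry.end].upper()
    if entry.strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return seq


# Toy stand-in for a GenBank/ENA record.
layer1 = {"ACC0001": "TTACGTGG"}
entry = SegmentEntry("exampleV1", "ACC0001", start=2, end=5)
print(resolve(entry, layer1))
```

Because each entry is a short line of text rather than raw sequence, a list of such entries diffs cleanly in a DVCS, which is the advantage Layer 2 is meant to keep.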

####Segment nomenclature####

We are facing a lot of nomenclature issues for germline segments, especially given the segment vs. allele issue (as discussed above).

Gene segment names must honor the established nomenclature rules for a species, otherwise it is unlikely that they will ever be officially adopted.
Gene segment names must not contain information on assumed functionality (e.g. the “pg” (pseudogene) tag used in some mouse Igh-V segments). The initial assumptions might be wrong and different alleles of a segment can differ in functionality.
Numbering gene segments according to their position in the locus does not constitute an extendable and stable mechanism. It will probably always run into trouble when new locus alleles are added.
As @a.collins mentioned, sequential numbering of newly described segments is both extendable and stable. However, in the absence of a centralized “numbers authority”, we would have to find mechanisms to avoid collisions/double-assignments. Hashing the sequence might be a solution, but would not be able by itself to distinguish 100% identical gene segment duplications.
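
A minimal sketch of the hashing idea, assuming SHA-256 over the normalized nucleotide sequence and a truncated digest to keep names short. The name layout and the truncation length are illustrative choices; a shorter digest raises the chance of accidental collisions.

```python
import hashlib


def hash_name(species, segment_type, sequence, length=8):
    """Derive a name from the sequence itself, with no numbers authority."""
    # Normalize so that case and line breaks do not change the name.
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
    return f"{species}{segment_type}-{digest[:length]}"


a = hash_name("mus", "IGHV", "caggtgcagctg")
b = hash_name("mus", "IGHV", "CAGGTG\nCAGCTG")
c = hash_name("mus", "IGHV", "CAGGTGCAGCTA")
print(a == b, a == c)
```

Two sites that independently describe the same sequence arrive at the same name, which avoids double assignment. As noted above, two 100% identical duplicated segments also get the same name, so telling them apart would need extra metadata such as genomic position.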

ematsen · May 26, 2016, 11:03pm

This is great! I agree with everything.

Some further thoughts:

I note that the Layer 1 requirement excludes alleles that are inferred by software such as TIgGER.
If I understand correctly, you would support flat files in Github for Layers 2 and 3?
I like hashing for sure, though I’m not sure if people will stand for such long and nonsensical sequence names. I would propose having a naming scheme as part of Layer 2, along with a table of synonyms.

ctwatson · June 1, 2016, 4:53pm

Really great discussion! I will speak mostly to points regarding nomenclature. As a genomicist, and one who has historically approached IG from a genomics perspective, my viewpoint is that we should ultimately be striving to move toward genomic assignments for “germline” segments. In my opinion, this is truly the only time-tested way to define a gene/locus as an entity for which a known position exists in a genome – inference can never achieve this goal, and as I think @bussec alluded to, for example, will always leave the assignments of “100% identical gene segment duplications” up in the air. But I whole-heartedly agree with @a.collins that genomics studies in IG across taxa are likely to be few and far between for the foreseeable future, and will continue to lag behind repertoire-based analyses. So, as I think everyone has argued, we have to have a system that allows for inferential methods like TIgGER to make a contribution (and maybe even dominate in the short term) to publicly used IG gene/allele sets. But in saying this, I would certainly argue that any naming scheme has to have flexibility and any new gene/allele discovered in a repertoire and catalogued as “novel” should have the ability to later be assigned back to a position in the genome (I guess as Layer 2 annotations proposed by @bussec).

We have seen first hand already in human that position-based numbering is really not a long term solution, particularly as “novel” genes are discovered. There are now several examples where genuine gene duplicates are in fact treated as just that, noted by a “D” by IMGT instead of a position-based number (e.g., IGHV1-69D, IGHV3-23D, IGHV3-43D, and IGHV3-64D). But from a phylogenetic standpoint, these are hardly different from, as an example, IGHV3-30, IGHV3-30-3 and IGHV3-30-5, other than the fact that the latter set of genes received their names in the infancy (or maybe teenage years) of IG nomenclature. Then you have IGHV1-69-2, which was previously referred to as IGHV1-f. Unlike IGHV3-30, IGHV3-30-3, and IGHV3-30-5, which we know are very close paralogues, IGHV1-69 and IGHV1-69-2 are not the result of a recent duplication event, even though the name might imply this. So, I guess I’m arguing that the current scheme is perhaps already neither extendable nor stable. Now, if we start to also think about alleles, the waters get muddier in a hurry. How do I tell if an allele resides at IGHV1-69 or IGHV1-69D? I can’t, unless I see this allele in genomic DNA sitting at a given position in the genome. So one concern I would have moving forward, when we are thinking about inference of novel segments/alleles from a repertoire dataset and how we might begin naming these, is: when do we say we have a new “gene” vs. “allele”? Does either the positional or the sequential numbering scheme have a solution for this problem? I would be curious to know what others think.

Also a few other short random thoughts. First, re @bussec:

“Gene segment names must not contain information on assumed functionality (e.g. the “pg” (pseudogene) tag used in some mouse Igh-V segments). The initial assumptions might be wrong and different alleles of a segment can differ in functionality.”

So true, and also, a given “allele” may have different functionality depending on where it resides in the genome, again thinking in the context of gene duplicates.

Also, re “vgenerepertoire.org”, I again agree with @a.collins, the use of WGS/shotgun datasets for V gene inference is very shaky and should be treated with EXTREME caution. It is important to bear in mind that in many instances, whole genome sequence data for a given species could be using DNA from multiple samples. Regions like IG loci are very challenging to assemble and make sense of even using shotgun reads derived from a single haploid sample, let alone several pooled diploid genomes – annotating genes vs. alleles in such a scenario could not possibly be anything but a nightmare. I would steer clear of such an approach, and don’t believe it offers much beyond repertoire analysis at this stage in the game.

ematsen · June 1, 2016, 7:32pm

@ctwatson-- a fantastic post. Thank you!

As @bussec said before, positional numbering is inherently problematic.

My feeling is that there should be a one-to-one correspondence between unique gene sequences and gene names, and all other information should be contained as metadata about those sequences. Like you say, even the distinction between genes and alleles is a fine point in some situations.

ematsen · June 1, 2016, 7:51pm

Here are some points that came up during a discussion, I think mostly from @bussec and @w.lees.

Inferred germline genes, such as those reported by TIgGER, are not sequences that have come off of a sequencer. They are inferences, and as such cannot be deposited in databases such as GenBank or EMBL. Thus we need a means of storing those (discussed above) that won’t disappear.
Novel germline genes are typically minor variants of ones that have been seen before. Does it make sense to use a format that is explicitly designed to store differences, such as Variant Call Format? Personally I feel that the sequences are short and that it would be more trouble than it’s worth, but it’s a reasonable question and merits discussion.
What annotations would people like to see for germline genes? Apparently there are some appropriate tags in EBI (mentioned by @bussec).
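
For the second point, storing a novel allele as differences from a known one could look like the sketch below. It handles substitutions only and uses plain tuples, so it is in the spirit of Variant Call Format rather than an implementation of it; the function names are made up for the example.

```python
def diff_to_reference(reference, novel):
    """List (position, ref_base, alt_base) substitutions, 1-based positions."""
    if len(reference) != len(novel):
        raise ValueError("this sketch handles substitutions only, not indels")
    return [(i + 1, r, n)
            for i, (r, n) in enumerate(zip(reference, novel))
            if r != n]


def apply_variants(reference, variants):
    """Rebuild the novel sequence from the reference and its variants."""
    seq = list(reference)
    for pos, ref_base, alt_base in variants:
        if seq[pos - 1] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt_base
    return "".join(seq)


# Toy sequences; real germline genes are a few hundred bases long.
variants = diff_to_reference("ACGTAC", "ACGAAC")
print(variants, apply_variants("ACGTAC", variants))
```

The trade-off is that every stored allele then depends on a fixed reference sequence staying available and unchanged, whereas a flat file of full sequences stands on its own.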

bussec · June 2, 2016, 9:23pm

To follow up on point 3:

Here is the current version of the INSDC feature table. The standard allows annotation of V, D and J segments, V and C regions, germline and rearranged states, Igh switch regions and N-nucleotides. RSS sites could be annotated using the “misc_recomb” key. Since the format is only for nucleic acids, we would have to find a second format to describe the protein features (like FWR/CDR).

w.lees · June 6, 2016, 5:04pm

IMGT seems to use the EMBL format, with extensions to the INSDC table: here’s an example. The extensions seem to work quite naturally, although most of them are at codon level.

ematsen · June 6, 2016, 5:32pm

The following paper can serve as a model for one direction for a community-curated germline database:

A little background: people are continually running phylogenetic tree software to infer parts of the tree of life. The idea of this paper is to enable all of those little inferences to be combined into one master tree, hosted at http://www.opentreeoflife.org/.

I emphasize that this is a much harder problem than sharing germline genes. Trees disagree with one another in complicated ways, and resolving those differences is a challenging inferential problem in itself. However, I think that we can learn a lot from their workflow and their goals.

In the repertoire case, each study would correspond to an inferred germline gene repertoire from a single individual. Upon merging that study to master, a process would be spawned that would compare that germline gene repertoire to previous ones and register any new germline genes observed. It would also increment the number of observations of each already known germline gene.

They have a “curator application”, which is cool and would be great, but to begin with one could just interact with GitHub directly.

martin_corcoran · June 8, 2016, 12:15pm

A couple of points here, I think it is important to remember that many species other than human and mouse lack comprehensive databases and so the germlines identified for these will not be minor variants of known genes. We will therefore require a more robust naming convention that takes account of these cases.
Also, identifying germlines from NGS data does not necessarily mean the sequences are purely inferred from consensus sequences. NGS data also contain real sequences that can therefore be submitted to Genbank.

ctwatson · June 10, 2016, 12:41pm

Re the use of VCF files. There is a move in the genomics field toward the use of reference graphs for representing variation at hypervariable loci, particularly for the assembly of genomic data and discovery of novelty. Here is a nice example of what I am referring to: Figure - PMC

(from Improved genome inference in the MHC using a population reference graph - PMC)

mats.ohlin · June 21, 2016, 6:08pm

Great post. One thing though about nomenclature. We probably need to consider the fact that there is an “Immunoglobulins (IG), T cell Receptors (TR) and Major Histocompatibility (MH) Nomenclature Sub-Committee” within IUIS (closely linked to IMGT (sic!)) that deals with germline gene assignments. Jamie knows more about it as she’s a member. I suppose that we will need to involve this group if any proposals with respect to new germline genes are to get wider acceptance. It is not to anyone’s advantage if there are competing nomenclatures. In my view this would indeed be a waste of resources and it will only add to the confusion (compare the many different amino acid numbering schemes that are/have been in use). By working together with an established infrastructure we will furthermore not be forced to establish another one (or rely on voluntary forces that for a diversity of reasons may not be available in the long run).

ematsen · June 22, 2016, 8:58pm

This is a great point, and we will always want to link closely with established nomenclature. But any data with more sequences than IMGT is going to need to name them. Say we tried to use IMGT-like nomenclature for these extended sequences, but later they extend their collection of sequences. This might collide with our names in an unfortunate way.

Thus any expanded set of sequences is going to require some new naming convention. It just depends where it starts.