Germline Set format - The Way Forward

w.lees · September 4, 2016, 4:03pm

The Germline DB WG has asked for a summary of the position on this at its next meeting, which will be at the end of the month. With that in mind I have trawled through all the relevant threads in the forum and have put an initial draft together in this post. Could people please help develop this - what issues should be added? Which of the points on the list can we discuss and resolve now? And does the overall development approach seem sensible?

Thanks

William

Purpose of the Germline Set Format
The format is intended to publish germline sets that can be imported by analysis tools, specifically:

Germline parsers:

- Facilitate use with IgBlast (community effort?)
- Encourage authors of other parsers (List of V(D)J annotation software) to import our format

Tools that infer novel germline genes or alleles
Tools that infer haplotype?
Repertoire browsers (VDJviz: a versatile browser for immunogenomics data)?
Any other tools we should add to this list?

Proposed Development Approach

Agree the purpose - hopefully this month, in this thread
Define a first version schema and align iteratively with database definition - hopefully do the first iteration this month
Define an initial file format (there could be others, over time)
Deploy the format on the Germline DB website
Publish a reference implementation that imports to IgBLAST

Schema-level Content
Progress so far

Key outstanding points

Fields and their naming convention (@laserson) need to be aligned with the DB schema definition
How should we handle multiple names for an identical sequence? Needs further consideration in terms of the biological process (@a.collins, @cwatson), and overall approach in our DB and website.
Ensure the overall format is neutral to gene naming convention (@mats.ohlin) - need to keep in mind as we understand what conventions might arise
Needs to be extensible to leader and constant regions (@javh). Should there be associated metadata and naming, for example for isotypes and subtypes?

Additional fields to consider for incorporation
(the overall framework is extensible. I think the main question at the moment should be whether these fields would be useful for the tools we aim to support)

Functionality (@werner.muller)
GO terms for species etc (@werner.muller)
Inference class (@werner.muller). Should we include other supporting evidence?
gene family, gene, allele (@schristley)

File Format or Formats

Data compression does not seem a pressing priority for us, given that the datasets will be relatively small
We need an intuitive layout that is easy for people to work with and hard to misinterpret (@psathyrella)
We should consider using a well-known standard framework, e.g. JSON, that is well supported and allows for later extension (@laserson)
Genbank format another possibility: it is well understood in the field (@werner.muller)
We need to consider metadata format, for which there are also precedents (@laserson)
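To make the JSON option a little more concrete, here is a minimal sketch of what one germline set might look like, and how a parser could turn it into FASTA. Every field name, the layout, and the `to_fasta` helper are assumptions made up for illustration (loosely following the fields listed above), and the sequence is a placeholder rather than a real allele.

```python
import json

# Hypothetical layout: all field names are assumptions for illustration only.
germline_set = {
    "species": "Homo sapiens",
    "version": "0.1",
    "sequences": [
        {
            "name": "IGHV1-69*01",
            "gene_family": "IGHV1",
            "gene": "IGHV1-69",
            "allele": "01",
            "gene_type": "V",
            "functionality": "F",
            "inference_class": None,  # left unset: no values are agreed yet
            "sequence": "CAGGTGCAGCTGGTGCAG",  # placeholder, not a real allele
        }
    ],
}


def to_fasta(gset):
    """Flatten a germline set into FASTA text, one record per sequence."""
    lines = []
    for rec in gset["sequences"]:
        lines.append(">" + rec["name"])
        lines.append(rec["sequence"])
    return "\n".join(lines) + "\n"


# Serialise to JSON text and read it back, as a tool importing the file would.
text = json.dumps(germline_set, indent=2)
round_tripped = json.loads(text)
fasta = to_fasta(round_tripped)
```

The point of the sketch is only that a nested, self-describing format can carry the extra fields while still reducing to a plain sequence list for tools that want nothing more.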

bussec · September 14, 2016, 8:54pm

We should be able to distinguish between an alias (i.e. different name for the same segment) and a homolog (i.e. different segment with the same sequence). This distinction might occasionally be difficult, especially when there is no genomic map. Nevertheless, a homolog always should have an own DB record. The level of confidence for the homolog status could be encoded in a class, similar to the inference class.

Constant regions and their allelic variants should be included, since there is a reasonable chance that they might be biologically relevant.

Functionality and inference class would clearly be useful from my point of view. Gene family, gene and allele might be redundant since they often will be a component of the segment name, but if included they would save the parsing of the segment names and allow for easier annotation of “legacy” segment designations.

caschramm · September 14, 2016, 9:26pm

This raises its own problems, though. A very common application will be to annotate the germline origins of recombined sequences, and no annotator can distinguish between homologs with identical sequences. I suppose a well-built tool could automatically scan the input GermDB and merge duplicates, but there are many existing tools that could get tripped up by this type of ambiguity.
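A minimal sketch of the kind of duplicate scan I have in mind, assuming the germline set has already been read into (name, sequence) pairs. The function names and the gene names are made up for illustration.

```python
from collections import defaultdict


def group_identical(records):
    """Map each distinct sequence to the sorted list of names that carry it."""
    by_seq = defaultdict(list)
    for name, seq in records:
        by_seq[seq.upper()].append(name)  # compare case-insensitively
    return {seq: sorted(names) for seq, names in by_seq.items()}


def ambiguous_groups(records):
    """Return only the groups of names that share one identical sequence."""
    groups = group_identical(records).values()
    return sorted(names for names in groups if len(names) > 1)


# Hypothetical input: two differently named entries with the same bases.
records = [
    ("GENE-A*01", "ACGTACGT"),
    ("GENE-B*01", "acgtacgt"),
    ("GENE-C*01", "ACGTTCGT"),
]
groups = ambiguous_groups(records)
```

A tool could then report any hit to one member of a group as a hit to the whole group, rather than picking one name arbitrarily.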

I do understand why it would be important to distinguish homologs. It seems this is a bit of a catch-22…

w.lees · September 15, 2016, 12:20pm

@bussec, @caschramm Thanks very much for your comments. I have:

Included constant region as a gene type in the Sequences section, allowing constant region sequences to be included.
Added Functionality and inference class fields to the Sequences section. I have assumed that we will use @werner.muller's definition of inference class, and the IMGT definition of Functionality, but please let me know if there are other views.
Included gene family, gene and allele as fields in the sequence class.

w.lees · September 15, 2016, 12:24pm

@caschramm we could add a homolog field, which would state explicitly which other gene this gene is a homolog of. The intention would be that a highly functional parser would recognize that they are identical, and indicate this in the report. Less functional parsers would simply treat them as independent genes. Would this be worthwhile?

caschramm · September 16, 2016, 6:28pm

Yeah, this seems fine. Like I said, there’s no perfect way to handle it without relying on the analysis software to recognize and parse the homologs in some way. Having them flagged explicitly is definitely better than having to check identity manually!

ctwatson · November 21, 2016, 4:22pm

@w.lees, @bussec, @caschramm, @ematsen…

Sorry, I am really late to this party. I have made a few comments to William within the schema table, and thought that I would add them here, particularly re the discussion of homologs/paralogs and duplicate sequences in the DB.

I’m not sure I agree or see the need to make a distinction between “identical sequences” and paralogs. And in many cases, I would suspect that these have the same underlying explanation? It would be good to investigate in the human database how often we see “identical sequences”, and what the causes might be. I suspect another reason might be due to sequence lengths that lead to a shorter and longer sequence with identical bases in the overlapping portions to be assigned different names. Perhaps figuring this out would help us make the most useful metadata field to capture this?

ctwatson · November 21, 2016, 4:39pm

@bussec, what exactly do you mean by “a homolog always should have an own DB record”? You mean the “gene” itself presumably? e.g., IGHV1-69 and IGHV1-69D are two distinct entities, but an allele encoded by one or both of these two genes, if described in expressed repertoire data, would not be, in that it could not be assigned with certainty to either locus/gene based on the repertoire sequence alone.

bussec · November 22, 2016, 10:25pm

Sorry for the “homolog”, as I was talking about duplications within a species this should of course have been “paralog”. And yes, every segment/gene that has been mapped to a distinct location on the genome should have an own record.

@ematsen, @cswarth, @wsdewitt and I had an extended discussion about this and the analogous situation in which all three sequences (2x genomic, 1x expressed) are identical. The solution that we came up with has two layers: an aggregation layer, representing segments that are assumed to exist, and a source layer, representing the metainformation for the individual sequence in the general database (i.e. INSDC). Each aggregation record refers to one or more source records (i.e. evidence that the segment exists). Importantly, a single source record can be referred to by multiple aggregation records if it has no genomic mapping associated with it. An aggregation record cannot refer to multiple genomically mapped source records if those are non-overlapping. Finally, aggregation records are aware of paralogs. I am in the process of coming up with a visualized scheme for this, as it might be easier to understand.

In this scheme, new sequences that would fully match existing sequences (i.e. gaps at the ends are ok, but no internal replacements or indels), would be added as additional evidence for an existing segment. A new aggregation record would be created for non-matching sequences and matching sequences with non-overlapping genomic mappings. So in the situation you describe (2x genomic, 1x expressed, all non-identical to each other), you would end up with 3 aggregation records. We have not yet discussed thresholds for automated linkage below 100 %ID.
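Until the diagram is ready, the rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and field names are made up, a genomic mapping is reduced to a simple locus label (two different labels stand in for "non-overlapping"), and "fully match" is simplified to "the shorter sequence occurs unchanged inside the longer one".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceRecord:
    accession: str                       # hypothetical identifier
    sequence: str
    genomic_locus: Optional[str] = None  # None means no genomic mapping


@dataclass
class AggregationRecord:
    label: str
    sequence: str
    sources: List[SourceRecord] = field(default_factory=list)


def fully_matches(a, b):
    """True if the shorter sequence occurs unchanged inside the longer one."""
    short, long_ = sorted((a.upper(), b.upper()), key=len)
    return short in long_


def add_source(aggregations, src):
    """Attach src as evidence to matching records, or create a new record."""
    attached = []
    for agg in aggregations:
        if not fully_matches(agg.sequence, src.sequence):
            continue
        mapped = {s.genomic_locus for s in agg.sources if s.genomic_locus}
        if src.genomic_locus and mapped and src.genomic_locus not in mapped:
            continue  # mapped to a different locus: cannot share this record
        agg.sources.append(src)
        attached.append(agg)
        if src.genomic_locus:
            break  # a mapped source supports a single record in this sketch
    if not attached:
        agg = AggregationRecord(f"segment_{len(aggregations) + 1}", src.sequence, [src])
        aggregations.append(agg)
        attached.append(agg)
    return attached


# Two identical genomic sequences at different loci, then an unmapped one.
aggs = []
add_source(aggs, SourceRecord("SRC1", "ACGTACGT", "locus_A"))
add_source(aggs, SourceRecord("SRC2", "ACGTACGT", "locus_B"))
hits = add_source(aggs, SourceRecord("SRC3", "CGTACG"))
```

In this run the two mapped sources give two aggregation records, and the unmapped source ends up as evidence for both of them, which is the paralog case.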

ctwatson · November 23, 2016, 12:30pm

Thanks @bussec, I look forward to studying your overview diagram.

Could I ask a favor of you? Perhaps it would be good to actually use a real example as a use case for setting up your diagram. I keep bringing up IGHV1-69 as an example for this paralog issue. In the case of IGHV1-69, there are at least 3 alleles that are known to reside at either one of the duplicates. I attach a figure to demonstrate a little about what we know for the locus architecture and variation. In this example, what would be represented by an aggregation vs. source record? IGHV1-69.example.pdf (174.8 KB)

Thanks for walking me through this! I eventually will understand.

w.lees · November 23, 2016, 2:48pm

Corey, that’s a really interesting diagram.

Is it always the case that duplication haplotypes will have different names, like 1-69 and 1-69D?

ctwatson · November 23, 2016, 3:10pm

@w.lees, great question. No, it is not.

The use of “D” is a relatively new thing that came about when IMGT began dealing with our sequence data, for better or worse. It was originally applied to human IGK.

It applies only to IGHV3-23, IGHV1-69, IGHV2-70, IGHV3-64, and IGHV3-43, but these are hardly the only examples of close paralogs, in my opinion.

Take IGHV3-30, IGHV3-30-3, and IGHV3-30-5 (and really even IGHV3-33), for example. These genes have been given different names (for various reasons that require a longer explanation), and don’t use the “D” designation, but in fact are also close paralogs (see: http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=locus&species=human&group=IGH/haplotypes), occurring in a complex region comprised of large 25 KB tandem segmental duplications.

I would argue strongly that we don’t really have a clear picture about how alleles fall out on these 4 loci, but there are examples where catalogued alleles of IGHV3-30 and IGHV3-30-3 are the same. In many cases, it is likely that alleles here are incorrect (as per Collins’ work), but in addition, I would say it is a non-trivial task, even with genomic data to say you know which is IGHV3-30-3 vs. 3-30, 3-30-5, etc. The same would be true for IGHV4-31, IGHV4-30-2, and IGHV4-30-4, I would presume.

Other examples would be the genes IGHV4-39 and IGHV4-b (now called IGHV4-38-2), IGHV4-61 and IGHV4-59…

If you really want to think about a challenging genomics problem check out IGK – http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=locus&species=human&group=IGK – nearly the entire V gene set is an inverted tandem duplication. And we have shown that there is most certainly allele sharing and gene conversion happening between V genes within the two duplicated blocks.

bussec · November 23, 2016, 6:21pm

The diagram can be found here, please feel free to comment in the document:

Sorry @ctwatson, this is still a hypothetical example, but I would be happy to pitch this using your IGHV1-69 example.

ctwatson · November 23, 2016, 6:39pm

No worries, this gets the job done. I just poured some coffee, so looking forward to going over this! Thanks!!!

w.lees · November 24, 2016, 2:19pm

To me, the advantage of the dual approach is that it lets curators put annotations against observed sequences in a single place. The example shows the value of this in being able to assign a confidence value to an observed sequence, and being able to highlight a particular region of the observed sequence as being of interest. Given the complexity that you have outlined, @ctwatson, we are going to end up with multiple references to many observed sequences, and the alternative to the dual approach would require those multiple references to be kept in synch.

I think it is going to take some creativity to build a scheme that allows the gene and the source records to be edited conveniently in the git-based CSV-type format we have been talking about, but in terms of the data, this gets my vote as the best approach, and I think that’s the primary consideration right now.

a.collins · December 1, 2016, 4:14am

Hi @w.lees, what are your thoughts on versioning? Reflecting upon changes that IMGT made to the mouse IGHD gene repertoire makes me realise that we should not only know which version of a germline set, for example, is being used in an analysis; we also need to be able to track changes between versions, and to understand the reasons why changes have been made. A simple solution would be to include a text field in the Germline Set record, which itemises changes between the version of the record and the previous version. But I wonder whether this would always lead someone to the information they were looking for.

To focus on an IMGT example, on 11/2/16 murine IGHD6-101 became IGHD6-102, and IGHD6-102 became IGHD6-101. It might be a typo, but the IMGT website states that IGHD6-202 also became IGHD6-101. The functionality of the genes was also changed - from pseudogenes to functional genes. To take another example, human IGHV2-507 and IGHV2-510 were removed from the IMGT repertoire in December 2013. A final one, on 28/7/15 two nts were added to the 3’ end of IGHV3-30-3*01. How could such changes be tracked?

w.lees · December 1, 2016, 10:31am

@a.collins,

This is a very good question and not one that we have as yet discussed in detail as a group. We’re proposing to store the information in GitHub, which is a strongly versioned repository (@ematsen and @cswarth have worked on a prototype and I’m sure will want to add to this thread). I’ve assumed that we will manage every data item - in other words every germline description, every reference to source information, and every germline set - under version control. Changes to any item will be tracked by the system, any change will require the person making the change to give a reason, and it will be possible to reconstruct the history, showing what was changed and by whom, along with the reason and date of each change.

We’ll be able to control who is allowed to make changes. I think we probably need to do this at the species level, so that one set of people can make changes to all records associated with human, and another set of people with mouse, for example. We could also set up an approval process, which would require any changes to go through some kind of review and acceptance before being finalised. If there was a group of people associated with the mouse germline, they could be provided with a list of changes to approve at a monthly meeting, and the comments of that group could be added to the change record. This is just a suggestion, but I hope it gives an idea of the sort of thing we could implement.

I assume that changes would accumulate over time as work in progress, and that every now and then there would be a release of a germline set that reflected a whole bunch of changes. At any point in time, we should be able to show people visiting the site both the current release, and all the changes that have been made since then - so, depending on their interests, they can download our current stable release germline set and look at its underpinnings, or a ‘bleeding edge’ version that reflects the very latest thinking. I also think it’s important that people can download prior releases, in order to reproduce prior results, and to enable people to work on a stable set of definitions during the course of an analysis.
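As a rough illustration of how the changes between two releases could be itemised automatically, here is a minimal sketch that compares two releases held as name-to-sequence mappings. The function name, the change labels, and the gene names and sequences are all made up for the example; this is not how the prototype works.

```python
def diff_releases(old, new):
    """Compare two releases given as {name: sequence} dicts."""
    changes = []
    for name in sorted(old.keys() - new.keys()):
        changes.append(("removed", name))
    for name in sorted(new.keys() - old.keys()):
        changes.append(("added", name))
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            # Bases appended to the end are reported separately from edits.
            if new[name].startswith(old[name]):
                kind = "extended_3prime"
            else:
                kind = "sequence_changed"
            changes.append((kind, name))
    return changes


# Hypothetical releases: one entry extended, one removed, one added.
release_1 = {"GENE-A*01": "ACGTACGT", "GENE-B*01": "TTGACC", "GENE-C*01": "GGCATG"}
release_2 = {"GENE-A*01": "ACGTACGTGA", "GENE-B*01": "TTGACC", "GENE-D*01": "CCATGG"}
changes = diff_releases(release_1, release_2)
```

A comparison like this shows what changed but not why. A swap of two names, for instance, would simply appear as two sequence changes, so the reason recorded with each change is still needed alongside it.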

Erick, Chris, I hope that this is in line with your thinking and that I am not taking things too far.

William

w.lees · December 1, 2016, 10:48am

Germline Set Schema v2.pdf (617.3 KB)

To keep people updated, here is a document that was distributed (but not discussed) at the Germline DB WG yesterday. It incorporates @bussec’s ‘dual layer’ approach and extends his examples to cover the full set of data types. It also draws a clear distinction between the underlying data structure (shown on slide 1) and the files I envisage that we would publish (shown on slide 4). The overview is intended to reflect our current consensus thinking. I hope it does, but please post any comments here, or ask questions if there are things that are not clearly explained. Apologies for not posting as a Google doc, but the layout was a little bit beyond Google’s abilities.

If, at this stage, we are reasonably happy with this approach as a group, I will publish a new schema definition that sets out the fields in each data item (the detail behind slide 1) and the contents of each file (the detail behind slide 4). The fields themselves are understood fairly well at this stage through our group review of the earlier schema definitions - so this is largely a matter of arranging things to reflect the new structure set out in the overview document.

caschramm · December 1, 2016, 8:36pm

One thing I’m not following: If the Gene Description layer is going to be denormalized (as per your last slide), then doesn’t that defeat the purpose (as per @bussec) of splitting out the Source Records into their own layer?

w.lees · December 2, 2016, 11:30am

@caschramm

Here’s how I see this working. Apologies for spelling this out at some length.

Within OGRDB we will hold the data as set out on slide 1. In other words there will be a table for each of the 5 record types on slide 2, and the data will be normalised. In particular, per Christian’s point, the source records will be in separate tables to the gene descriptions, which means we only hold one reference to a source item, even if it is referred to from several gene descriptions (for example as a possible paralog).

People will generally engage with the data by downloading files from OGRDB. If they want to make changes to the data they will modify the files and upload them. Consider, for example, a curator working on a gene description. It seems sensible to provide them with a gene description file which contains the gene description record itself, plus the source records that it refers to. It will be much easier for people to work with that, than to give them a list of every source record in the system, and expect them to cross reference between that list and the gene description.

How would this work for edit and upload? Subject to the checking and review discussed earlier on this thread, people would be able to make changes to that gene description file and push it back to OGRDB. OGRDB would reflect those changes in the underlying tables. Suppose, for example, that they change the confidence rating associated with a source. OGRDB will modify the rating in the source record table, and at that point, any gene description referencing that source will ‘see’ the updated rating, because all references to the source are linked through that single record. The data, as held by OGRDB, is always held in normal form. The point I was trying to make with the note on the last slide was that you might not realise that, if you look at the file formats. It’s an issue I haven’t addressed sufficiently clearly in the past - this exercise started off as an attempt to define a file format for parsers, and morphed into something that was much more focussed on underlying data structure.
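The round trip described above can be sketched in a few lines. This is a toy illustration only: the table contents, field names, and the `export_file` and `import_file` helpers are all hypothetical, and plain dictionaries stand in for the database tables.

```python
import copy

# Normalised storage: each source record is held once, keyed by its id.
sources = {
    "src1": {"accession": "SRC0001", "confidence": 2},
    "src2": {"accession": "SRC0002", "confidence": 3},
}
gene_descriptions = {
    "GENE-A*01": {"source_ids": ["src1", "src2"]},
    "GENE-B*01": {"source_ids": ["src1"]},  # shares src1, e.g. a possible paralog
}


def export_file(gene_name):
    """Build the denormalised file: the description plus copies of its sources."""
    desc = gene_descriptions[gene_name]
    return {
        "gene": gene_name,
        "sources": [
            dict(id=sid, **copy.deepcopy(sources[sid])) for sid in desc["source_ids"]
        ],
    }


def import_file(doc):
    """Write an edited file back into the normalised tables."""
    for src in doc["sources"]:
        sources[src["id"]] = {k: v for k, v in src.items() if k != "id"}
    gene_descriptions[doc["gene"]] = {"source_ids": [s["id"] for s in doc["sources"]]}


# A curator edits the confidence of src1 in the file for one gene and uploads it.
edited = export_file("GENE-A*01")
edited["sources"][0]["confidence"] = 5
import_file(edited)

# The other gene description sees the new rating through the shared record.
seen_by_b = export_file("GENE-B*01")["sources"][0]["confidence"]
```

The file is a convenience view; the single stored copy of each source record is what keeps the references consistent.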

I hope this is clear now, and if it would help the explanation to modify the slides please let me know.

By the way I am wondering whether I should create a more formal entity-relationship diagram to go alongside the diagram in slide 1. If people would like to see that, please let me know and I will put it on the list.

thanks

William