Standardizing the format of a germline set

jianye · August 20, 2016, 12:04am

I looked at the spreadsheet but I only see some descriptions, not any actual content (such as sequences or other descriptive information… I’m not sure I am looking in the right place). Anyway, I think the amino acid translation frame needs to be added so that one can determine whether a rearrangement is in-frame or out-of-frame, since not all V and J sequence translations start at position 1 (for example, when the germline gene is partial at the 5’ end). The J gene also needs a field indicating the CDR3 boundaries.

What is the purpose for having the alignment information?

Regarding the name of the sequence, I think it’s important to have different names for two different genes when their sequences are identical, since they are truly different genes, and being able to differentiate which is which is relevant for studying genetic locations. The sequence redundancy issue can be addressed by having two different names share one sequence. If you look at the sequence files (in FASTA format) at NCBI, there are cases exactly like this: one entry might have multiple names in the sequence definition line but only one actual sequence.

w.lees · August 22, 2016, 6:19pm

Erick,

Thanks for your very helpful comments on the document, which I think I’ve addressed now.

For the identical duplicates - we could simply allow multiple sequence records, if necessary referring to the same deposited sequence. As written, the specification doesn’t prevent that, but we could explicitly mention the possibility. Another possibility would be to have an ‘alternate names’ field, which would indicate that there was a preferred name to use, but that one or more others should be listed by the parser as comments or in a separate field. I am not sure which would be more appropriate.

I put the preambles separately because I thought there was a possibility that they could be contributed by different authors. For example one might wish to publish a limited germline set, perhaps relating to a particular ethnic group or subspecies, while retaining alignments and annotations that have already been published. Or provide Chothia numbering for an existing germline set. Or perhaps alignments might be generated automatically by the database. Perhaps it’s a point that would benefit from discussion when we meet next?

This is supposed to be a logical schema, so I think the right thing to do is remove the example alignment and simply state that the alignment will be defined in some way. Initially I felt that was a bit weak, but we shouldn’t get drawn into debating the format, which I feel is likely otherwise! I’ve made that change.

w.lees · August 22, 2016, 6:47pm

Jianye,

Thanks very much for your comments.

With respect to ‘actual content’ - I’m hoping (with your help!) to build this into a full list of the information that the parser needs to parse the repertoire - but I’m trying to focus on the information so that we can agree on that, before getting into the definition of formats and so on - which makes it tricky to provide examples right now. I hope that’s ok.

The translation frame is covered by ‘Codon Frame’ in the Sequences section (I had previously called this ‘Register’ which Erick pointed out is not a well-used term, so hopefully it is clearer now that it has a better name). I have added the CDR3 boundaries - meaning that the Regions section now needs to cover J genes as well as V genes.
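To illustrate how a codon frame field might be used, here is a minimal Python sketch. It assumes the field holds the 1-based position of the first base of the first complete codon; that convention, and the names `translate` and `is_in_frame`, are illustrative assumptions and not part of the schema.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, codon_frame=1):
    """Translate seq, starting at the 1-based position given by codon_frame.

    codon_frame is assumed to point at the first base of the first
    complete codon, so a gene truncated at the 5' end can still be
    translated in the correct frame. Unknown codons become 'X'.
    """
    seq = seq.upper()
    start = codon_frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def is_in_frame(v_codon_pos, j_codon_pos):
    """Return True if V and J share a reading frame in a rearranged read.

    Both arguments are positions in the read of a base that starts a
    codon in the V and J germline reading frames respectively.
    """
    return (j_codon_pos - v_codon_pos) % 3 == 0


full_length = translate("CAGGTGCAGCTG", codon_frame=1)
five_prime_partial = translate("GGTGCAGCTG", codon_frame=2)
```

With the frame recorded per sequence, a parser does not need to guess the frame of a 5’ partial gene, and the same information on the J side allows the in-frame check above.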

The purpose of the alignment information is to allow the sequences to be aligned in accordance with a numbering scheme. This could be IMGT, Kabat or Chothia for example.

ematsen · August 29, 2016, 11:32am

Thanks, William. Regarding multiple names for identical sequences, can we invoke some higher powers, say @ctwatson and @a.collins? Should I assume that you all are going to want to name the same sequence two different things because of where it sits in the genomic sequence?

As I’ve said it would be much more logical to have one name per sequence from a bioinformatics point of view. Perhaps one could encode information about the RSS and genomic position into an expanded definition of a haplotype that would allow multiple copies of the same (named) sequence to appear.

ctwatson · August 29, 2016, 1:04pm

I think this definitely deserves some time for discussion on our next call, once we get an update from Andrew. I don’t think we have talked explicitly about where our definition of “allele” will start and stop. Given that we are making the argument that from this day forward it will be permissible to call/infer novel germline alleles from expressed rep data, these will not include information beyond the coding sequence. But we know that identical coding sequences can occur at multiple places in the genome, or even reside in the same location but adjacent to different RSS or promoter variants, etc.

Given the former point (i.e., a large portion of data will be coding only), and @ematsen’s preference to have one name per sequence because this is more logical from a bioinformatics point of view, perhaps the best way forward is to reduce identical coding sequences to one ID, but attach associated information about them that would include possible genomic positions/loci, and regulatory variant information.
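A minimal Python sketch of that idea follows: identical coding sequences collapse to one ID, and every name and locus seen for them is kept alongside. The record layout, the `SEQ0001`-style IDs, and the gene names and loci are all hypothetical placeholders, not a proposed nomenclature.

```python
from dataclasses import dataclass, field


@dataclass
class GermlineRecord:
    seq_id: str                                 # hypothetical ID, one per coding sequence
    sequence: str
    names: list = field(default_factory=list)   # all names given to this sequence
    loci: list = field(default_factory=list)    # known genomic positions, possibly empty


def collapse(entries):
    """Collapse (name, sequence, locus) tuples to one record per unique sequence.

    locus may be None when the genomic position is unknown, as for
    alleles inferred from expressed repertoire data.
    """
    by_sequence = {}
    for name, sequence, locus in entries:
        key = sequence.upper()
        record = by_sequence.get(key)
        if record is None:
            record = GermlineRecord(seq_id=f"SEQ{len(by_sequence) + 1:04d}", sequence=key)
            by_sequence[key] = record
        if name not in record.names:
            record.names.append(name)
        if locus is not None and locus not in record.loci:
            record.loci.append(locus)
    return list(by_sequence.values())


records = collapse([
    ("GENE-A*01", "ACGTACGT", "locus-1"),
    ("GENE-B*01", "acgtacgt", "locus-2"),   # same coding sequence, different locus
    ("GENE-C*01", "TTGACCAA", None),        # inferred allele, position unknown
])
```

Regulatory variant information (RSS, promoter) could be attached to each locus entry in the same way, so that positional detail is retained without duplicating the sequence itself.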

I think we should discuss this on our call this week (or next month, depending on time), as I think it has a big impact on how we think about the data in the germline DB.

a.collins · August 31, 2016, 1:19am

The only thing I would add to Corey’s post is that this will also depend on our relationship with IMGT and whether or not we build our database to facilitate a return to partnership with IMGT in the future. Here I am talking about the relationship between any new nomenclature and the IMGT nomenclature. The issue probably needs to be considered species by species. For the human, there must be around 150 sequences at present that are named, yet their place in the genome is unknown. And as I keep saying, in the case of 100 sequences, their place can’t be known because they are not real IGHV sequences. So do we ask for more of new sequences than was asked in the past?

If inferred sequences have an indication of their status, such as *p12 (discussed elsewhere), and if there was an understanding that the names of putative polymorphisms were provisional, many of the problems would be resolved.

When I think of the mouse, I think of retaining the IMGT nomenclature for B6 mice and mice that seem related to that strain, but perhaps adopting an interim VBASE2-type nomenclature for other strains, until the complete locus for, say, BALB/c-like strains was determined. Then, a positional nomenclature might also be adopted. In the interim nomenclature, there would be one name for identical sequences, but in the ultimate nomenclature, the sequence would acquire two or multiple names.

I know there could be opposition to the idea of an interim nomenclature, but I can’t really see how else we can proceed with the BALB/c mouse.

javh · September 1, 2016, 8:07pm

Just one minor point: Whatever standard is used should also accommodate the constant and leader region sequences, with some way to link the leader and V segment sequences. It looks like what @w.lees put together will handle that.

ematsen · September 7, 2016, 1:19pm

I don’t think that this got addressed, did it? Is this in scope for one of the sub-working-groups?

ctwatson · September 7, 2016, 1:36pm

You are correct. I decided to not bring it up because we had used up our hour. We should discuss next time for sure when each of the subgroups reports back to the larger group. I would think it is something that probably 3-4 of the groups keep in mind, but I think it is probably most pertinent to the sub-groups taking on the “Evaluation of Existing Alleles” (@martin_corcoran) and “Germline Set Format Standardization” (@w.lees).

Would you agree?

ematsen · September 7, 2016, 2:25pm

@ctwatson: sounds great.

Here is a new question for people who will be curating these germline sets: what is a file format, complying with the schema, that you can imagine maintaining as new alleles are added and removed?

Is it self-evident that a spreadsheet format is the only one that will make everyone happy?

Important clarification: I am asking here about the comprehensive format that will implement the schema and contain all of the information we are going to want. This information will get parsed to build the website and downloads in formats more suitable for use by computer programs (which will contain a subset of the information, such as those suggested above).

[Hoping to hear from the regulars on this thread, as well as perhaps @martin_corcoran and @laserson?]

laserson · September 7, 2016, 4:04pm

This schema looks great! One question I have is about the need for multiple files. You’ve basically proposed a relational schema with 3 tables, where there is a 1-1 relationship between the tables using the same unique primary key. What’s the advantage of multiple files then? I’m afraid it would only increase the complexity of working with the files, distributing them, configuring tools, etc. Couldn’t all the data be put into a single file? It could also be managed on something like GitHub so that every single change could have a justification tied to it.

ctwatson · September 8, 2016, 6:39pm

A spreadsheet seems most likely to make everyone happy.

w.lees · September 13, 2016, 8:53am

@laserson

The schema I proposed is a logical schema: I’m not at this point suggesting an organization into files because I thought it best to agree the logical level first. But I don’t see any issue with encoding the information as a single file and I agree there are good reasons for doing so.

The division into three tables is a key one and it would be good to understand how people see this. The thought that went through my mind was this: there are really three spheres of inference that are made in defining a germline set: in principle, they can be made independently, and it would be useful to identify them separately so that one can see exactly where the inferences originated. Each could be supported by an academic paper, for example. The three spheres are: the selection of the genes in the gene set; the naming of those genes; and their standardized numbering. I thought that the separation would help people who wished to work in just one of these areas, and also help to avoid standoffs over the introduction of new naming schemes or new genes in the future.

If it’s easier to understand and fits in better with the way we think of the data, it could be condensed into a single table, with three records in the preamble listing the origin/authorship of these three areas. The only slight complication with that is establishing exactly which fields in a record ‘belong’ to each area, but that’s manageable. Really that’s just a normalization of the data, so the important question is which description is easiest to understand and most helpful to people.
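To make the single-table option concrete, here is a rough Python sketch. It assumes the three tables share the same primary keys, merges them into one record per key, and keeps a preamble stating who authored each area and which area each field belongs to. The field names, gene names, and provenance strings are placeholders, not part of the schema.

```python
def condense(tables, provenance):
    """Merge per-area tables into one table plus a preamble.

    tables: {area: {gene_id: {field: value}}}, all areas sharing the same gene_ids.
    provenance: {area: origin/authorship description}.
    """
    areas = list(tables)
    ids = set(tables[areas[0]])
    for area in areas[1:]:
        if set(tables[area]) != ids:
            raise ValueError(f"table '{area}' does not share the same keys")

    field_owner = {}   # records which area each field 'belongs' to
    records = {}
    for area in areas:
        for gene_id, fields in tables[area].items():
            record = records.setdefault(gene_id, {})
            for name, value in fields.items():
                if field_owner.setdefault(name, area) != area:
                    raise ValueError(f"field '{name}' is claimed by two areas")
                record[name] = value

    preamble = {"provenance": provenance, "field_owner": field_owner}
    return {"preamble": preamble, "records": records}


tables = {
    "selection": {"G1": {"sequence": "ACGT"}, "G2": {"sequence": "TTGA"}},
    "naming": {"G1": {"name": "GENE-A*01"}, "G2": {"name": "GENE-B*01"}},
    "numbering": {"G1": {"alignment": "AC-GT"}, "G2": {"alignment": "TT-GA"}},
}
provenance = {
    "selection": "placeholder citation for gene selection",
    "naming": "placeholder citation for naming",
    "numbering": "placeholder citation for numbering scheme",
}
single_table = condense(tables, provenance)
```

The `field_owner` map is one way to handle the complication mentioned above: it states explicitly which fields fall under selection, naming, or numbering once the three tables have been flattened.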

Alternatively, it might be that I am making too much of the distinction between the three areas, and that in the end the germline set will always be put together by a single group or algorithm, so that a distinction is unnecessary. In some respects, it depends what we can expect from the germline database. On the other hand, if we want a parser to be able to include in its report specific details of the selection of genes, the naming convention, and the numbering scheme, the information had better be carried in the germline set.

Thoughts welcome!

martin_corcoran · September 13, 2016, 9:29am

I would like to suggest that we include leader sequence and 5’UTR regions in our germline set schema - for two reasons.
First, the availability of this upstream sequence can be vital for optimal primer design for groups seeking to create new primer sets for library production and who want to preserve an unaltered V segment (hence framework 1 and CDR1 region primers are inadequate).
Second, the upstream sequences of V alleles are an additional source of information available for those seeking to design novel assignment tools.

w.lees · September 15, 2016, 11:35am

@martin_corcoran Thanks for this. I have added a note against the sequence field saying that the sequence should cover these regions wherever possible, and I have added fields 5UTR and LEADIN to the alignment section.

w.lees · September 15, 2016, 11:38am

@bussec and Chris Warth - many thanks for your comments on the Google Sheet, which I hope I have addressed now.

werner.muller · September 17, 2016, 7:49am

I would like to make general comments (again)

I suggest that one entry in the germline set should contain the germline sequence of a variable V gene segment on top

together with

cross references to entries in other databases

evidence

based on data present in public databases

(maybe add the possibility of including one’s own data not submitted to public databases)

It should link to the germline region if the genome location is known

(maybe include the germline sequence of a window around the gene, including the promoter region; when I did sequence analyses, a region of about 800bp per V gene segment was sufficient most of the time).

Should point to rearranged examples of the germline sequence if it is a functional germline sequence

What information should be added to rearranged examples? Here is now the chance to also add information on antigen specificity, combinations of both V gene segments when known, etc. Such data would allow us in the future to create maps of which V gene segments are used in which antigen-specific responses. In the case of TCR segments, information on MHC restriction and peptide specificities could also be added.

Big question: should somatically mutated variants also be added?

Metadata should be added (like type of variable gene segment, species etc as discussed)

Important is that
the entries are crosschecked to avoid duplications

a condensed-format germline V gene database is available, containing only the unique name and the sequence, with a link to the complete entry.

Ideally, subsets specific to particular types of variable gene segments and/or to particular species should be generated as well.
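A condensed view and its subsets could be derived mechanically from the complete entries. The Python sketch below assumes each full entry is a dictionary with `entry_id`, `name`, `sequence`, `species`, and `segment_type` keys; these field names and the example values are hypothetical.

```python
def condensed_view(entries, species=None, segment_type=None):
    """Return name, sequence, and a link back to the full entry.

    Optional filters give subsets for one species and/or one type of
    variable gene segment.
    """
    rows = []
    for entry in entries:
        if species is not None and entry["species"] != species:
            continue
        if segment_type is not None and entry["segment_type"] != segment_type:
            continue
        rows.append({
            "name": entry["name"],
            "sequence": entry["sequence"],
            "entry_id": entry["entry_id"],   # link to the complete entry
        })
    return rows


full_entries = [
    {"entry_id": "E1", "name": "GENE-A*01", "sequence": "ACGT",
     "species": "human", "segment_type": "V", "evidence": "placeholder"},
    {"entry_id": "E2", "name": "GENE-B*01", "sequence": "TTGA",
     "species": "mouse", "segment_type": "V", "evidence": "placeholder"},
    {"entry_id": "E3", "name": "GENE-C*01", "sequence": "GGCA",
     "species": "human", "segment_type": "J", "evidence": "placeholder"},
]
human_v = condensed_view(full_entries, species="human", segment_type="V")
```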

one more comment:
At IMGT we discussed early in the project whether we should start a specialised public database which would accept antibody and T cell receptor sequences on behalf of the big databases (GenBank, EMBL), moving rearranged sequences from the general databases to the specialised database and creating special entries with germline sequences in the big databases. This idea did not get anywhere; maybe it could be discussed again.

For HLA sequences the scientific community agreed that all HLA sequences are submitted to a specialised HLA sequence database where also the sequences are checked and named and only then publicly available. For TCR and antibody genes, such a system was never installed.

w.lees · September 17, 2016, 2:24pm

Werner, thanks very much for this detailed and thoughtful input. I think we should include these data elements in the full schema, but not in the reduced information that we send to parsers. Does that make sense to you?

William

werner.muller · September 17, 2016, 3:07pm

I can imagine that there will be several parsers with “different flavours”, at least one should capture the complete information.

caschramm · September 19, 2016, 8:37pm

All of this strikes me as important and useful; but other than simply documenting that a gene is, in fact, used in functional rearrangements, it seems to move quite rapidly away from the current goal of a germline database…