Standardizing the format of a germline set

w.lees · September 20, 2016, 11:25am

I can see a couple of ways that we can go from here. We know from @ematsen’s objectives for the website that it will be possible to download the entire content of the database for a species. We could just say that’s what we offer, or we could additionally offer a condensed format that will be sufficient for at least the majority of parsers, which could be easier to work with.

I’m inclined to think that it is worth providing the condensed format, because if we make the data simple and accessible, it’s more likely that implementers will support it. Is this a good approach, or would it be better just to offer a single download containing all the information?

laserson · September 20, 2016, 11:06pm

Just a few comments:

I would propose that for “Field” we choose the computer-friendly versions up front. So for “Gene Family”, for instance, we could use gene_family. I also propose that to keep consistency with the Formats group, the field names are lowercase in snake_case.
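A minimal sketch of the naming convention proposed above, assuming a label only needs its whitespace and punctuation collapsed into underscores. The helper `to_snake_case` is hypothetical and not part of any AIRR tooling:

```python
import re


def to_snake_case(label: str) -> str:
    """Convert a human-readable field label to a lowercase snake_case name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", label.strip())
    # Drop any leading/trailing underscore left by punctuation, then lowercase.
    return name.strip("_").lower()


print(to_snake_case("Gene Family"))  # gene_family
```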

I think that @werner.muller’s suggestions are great (with @caschramm’s caveats). Even with those additions, it still doesn’t seem to me that the full database will be very large. So I would propose that we hold off from defining a “condensed” version of the database until we know that we need one. It should hopefully be easy enough to parse the full version. And related to this…

I can imagine that there will be several parsers with “different flavours”, at least one should capture the complete information.

I strongly think we should use a file format that already has standard parsers in most common programming languages, such as JSON or CSV. This would alleviate the need for implementing custom parsers, which would eliminate a potential source of bugs or misbehavior.
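To illustrate the point about standard parsers, here is a sketch that reads the same records from JSON and from CSV using only the Python standard library. The records are invented for illustration, and `gene_name` and `sequence` are assumed field names (only `gene_family` is mentioned in this thread):

```python
import csv
import io
import json

# Hypothetical records: names and sequences are made up for illustration.
JSON_TEXT = """
[
  {"gene_name": "EXAMPLE-V1*01", "gene_family": "EXAMPLE-V1", "sequence": "CAGGTGCAGCTG"},
  {"gene_name": "EXAMPLE-V2*01", "gene_family": "EXAMPLE-V2", "sequence": "GAGGTGCAGCTG"}
]
"""

CSV_TEXT = (
    "gene_name,gene_family,sequence\n"
    "EXAMPLE-V1*01,EXAMPLE-V1,CAGGTGCAGCTG\n"
    "EXAMPLE-V2*01,EXAMPLE-V2,GAGGTGCAGCTG\n"
)

# No custom parser is needed for either format.
json_records = json.loads(JSON_TEXT)
csv_records = [dict(row) for row in csv.DictReader(io.StringIO(CSV_TEXT))]

# Both routes yield the same list of dictionaries.
print(json_records == csv_records)
```

Note that CSV carries every value as a string, so the two only compare equal here because the example contains no numeric or nested fields.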

caschramm · September 21, 2016, 3:50pm

I think there has to be an option to download a simple fasta file, which presumably discards most of the (important!) metadata we are talking about here. Perhaps that’s what @w.lees meant by a condensed format?

laserson · September 21, 2016, 3:58pm

In that case, I’m not happy about it but totally understand its necessity.

ematsen · September 21, 2016, 9:19pm

This is definitely part of our thinking for the website. The plan is for the whole database to be available as a big JSON file which formalizes @w.lees’ schema, and then to offer a dead-simple FASTA/CSV download with the sequences and some simple information about them.
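As a rough sketch of how a simple FASTA download could be derived from the full JSON records, assuming each record carries a `gene_name` and a `sequence` field (both field names, and the function `to_fasta`, are hypothetical):

```python
def to_fasta(records, width=60):
    """Render records as FASTA text, keeping only the name and the sequence."""
    lines = []
    for record in records:
        lines.append(">" + record["gene_name"])
        sequence = record["sequence"]
        # Wrap the sequence at a fixed line width, as is conventional in FASTA.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up record for illustration; all other metadata is discarded.
example = [{"gene_name": "EXAMPLE-V1*01", "gene_family": "EXAMPLE-V1", "sequence": "CAGGTGCAGCTG"}]
print(to_fasta(example), end="")
```

Everything other than the name and sequence is lost in this conversion, which is the trade-off raised earlier in the thread.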

w.lees · September 23, 2016, 1:54pm

Thanks to @laserson @werner.muller @caschramm and @ematsen for the points raised this week.

Following discussion on the backend design thread, it looks as though there will only be a need for a small number of naming schemes and numbering schemes within each species. For example, for naming, we may wish to have an ‘AIRR name’ and a ‘Legacy name’ against each germline, but probably no other names will crop up. For numbering, similarly, we may wish to list an IMGT numbering scheme, Kabat and Chothia, but again, probably no others, or at most a very small number.

That being the case, the structure I proposed, with independent numbering and naming sections that can be authored by different teams and split out at will, is looking a bit over-complicated for the task. Over the weekend I propose to de-normalise it, moving instead to a single table with fields for alternate names and numbering. At the same time I will add the additional fields that @werner.muller has proposed, and the naming convention that @laserson has proposed. I will do another comb-through for additional fields at the same time. If anyone has concerns about this approach please let me know and we can hold off and discuss, but it seems to reflect the way we are all thinking.

We also seem to be moving towards consensus in JSON as a file format, by the way. It looks as though implementation has already started on that basis (this is fine by me, by the way, it’s always struck me as a good choice, even though I’ve really been focussed on the logical level). If anyone has issues with JSON as a file format for the ‘full’ germline set, please post something here so that we can resolve - otherwise I think it becomes our choice.

There is also a desire for a simple cut-down germline set for parsers, as @caschramm highlighted. It feels to me as though the priority right now is to get the full schema defined. It should be easy enough to pick out the fields for the cut-down version, and to provide it in a number of formats if that proves necessary. Again, if anyone has concerns about putting the priority on the full schema right now, please post here and we can discuss; otherwise I suggest we come back to the cut-down version and its format(s) once the full schema is ok with everyone.

Thanks

William

laserson · September 23, 2016, 3:41pm

We also seem to be moving towards consensus in JSON as a file format, by the way. It looks as though implementation has already started on that basis

Btw, where is this starting? I would propose that we work on implementations at the AIRR community GitHub org: https://github.com/airr-community

Let me know if you need access.

w.lees · September 27, 2016, 6:09pm

A little later than planned, I have now published a revised schema with simplified structure. After some thought, I am proposing that we have one record per germline sequence, but make it a compound record that includes one or more field delineations - where a ‘field delineation’ is the set of information necessary to identify where the different fields are according to the IMGT scheme, Kabat scheme or whatever. ‘Field delineation’ is the term used by NCBI on the IgBLAST website, and it strikes me as a better term than ‘alignment’ which I have used to date. I hope the term is ok with the community.

The compound record format is no problem for JSON and I think can easily be flattened into multiple sets of columns in a CSV format, or multiple sets of name/value pairs in a FASTA format if we wish to include all the information in that format. I hope you will agree that it provides an acceptable compromise between flexibility and simplicity of format, but please let me know your thoughts.
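The flattening described above might look something like the following sketch. The record layout (`field_delineations`, `scheme`, `fields`), the column naming pattern, and the coordinates are all assumptions made for illustration and are not taken from the published schema:

```python
def flatten_record(record):
    """Flatten a compound record into one flat dict, suitable for a CSV row.

    Each field delineation contributes its own set of columns, prefixed
    with the (lowercased) name of its numbering scheme.
    """
    flat = {key: value for key, value in record.items() if key != "field_delineations"}
    for delineation in record["field_delineations"]:
        scheme = delineation["scheme"].lower()
        for label, (start, end) in delineation["fields"].items():
            flat[f"{scheme}_{label.lower()}_start"] = start
            flat[f"{scheme}_{label.lower()}_end"] = end
    return flat


# Made-up record: the coordinates are illustrative, not real delineations.
compound = {
    "gene_name": "EXAMPLE-V1*01",
    "sequence": "CAGGTGCAGCTG",
    "field_delineations": [
        {"scheme": "IMGT", "fields": {"FR1": [1, 6], "CDR1": [7, 12]}},
        {"scheme": "Kabat", "fields": {"FR1": [1, 9]}},
    ],
}
print(flatten_record(compound))
```

A CSV produced this way would have one group of start/end columns per scheme, with empty cells wherever a scheme does not define a given field.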

@werner.muller and @mikhail.shugay I have included the items from your posts - specifically genome co-ordinates and evidence. @werner.muller I have not as yet included cross references to other databases. I do at the moment have fields for a single ‘database of record’ where the gene sequence is deposited. Which other databases should we include? I suppose we could make this a list of name-value pairs if necessary, if we wish to keep it flexible, but it would be useful to have examples so that we can check what kind of references would be necessary.

To address these comments I made in a parallel thread I have included sequence_status and deprecation_reason to give us a little more flexibility in categorizing and subsetting gene sets. I have provided for two alternate gene names, in addition to the canonical name, so that we could, for example, list the current human gene name as well as a novel one should we introduce a new naming scheme.

To address potential issues with update conflicts I have proposed that the record for every gene should have a version number. That way, if a spreadsheet or other file with edits is uploaded, it will be easy to tell whether other authors have updated the record in parallel.
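One way the version number could be used to catch parallel edits is sketched below. This is only an illustration of the idea: the function `apply_edit`, the in-memory `database` dict, the `version` and `gene_name` field names, and the status values are assumptions, not part of the schema:

```python
class VersionConflict(Exception):
    """Raised when an uploaded edit was based on an out-of-date record."""


def apply_edit(database, edit):
    """Apply an edit only if it was made against the current record version."""
    current = database[edit["gene_name"]]
    if edit["version"] != current["version"]:
        raise VersionConflict(
            f"{edit['gene_name']}: edit is based on version {edit['version']}, "
            f"but the record is now at version {current['version']}"
        )
    # Accept the edit and bump the version so later stale edits are detected.
    updated = {**current, **edit, "version": current["version"] + 1}
    database[edit["gene_name"]] = updated
    return updated


database = {"EXAMPLE-V1*01": {"gene_name": "EXAMPLE-V1*01", "version": 1, "sequence_status": "active"}}

# The first author's edit, based on version 1, is accepted.
apply_edit(database, {"gene_name": "EXAMPLE-V1*01", "version": 1, "sequence_status": "deprecated"})

# A second author who also started from version 1 is told about the conflict.
try:
    apply_edit(database, {"gene_name": "EXAMPLE-V1*01", "version": 1, "sequence_status": "active"})
except VersionConflict as error:
    print(error)
```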

I have renamed all fields in snake_case. I would love to display them in courier when I refer to them in the file, but it doesn’t seem possible to have mixed fonts in a cell.

Please post comments in this thread, or questions if anything is unclear, or comment in the document if you prefer. I am particularly keen to know if I have captured everything required for deposition, evidence and so on.

p.s. I have made some small changes this morning. In particular I have attempted to clarify what is meant by a ‘database of record’ - in most cases, I assume that this would be the AIRR DB, but having the field makes it possible for us to extend to others if we wish to. I have also added an explicit field for a Genbank ID, should the sequence have been deposited in Genbank. With reference to the question I asked in this post about references to other databases, this means we are now referring explicitly to Genbank and Ensembl.

Thanks!

William

w.lees · September 29, 2016, 8:39am

The Human Germline Genes subgroup has proposed guidelines which have some bearing on the schema:

1. Sequences must be reported in a peer reviewed journal.
2. Sequences must not include ambiguities.
3. A single cDNA derived sequence cannot be the sole evidence for a germline sequence.
4. The database must include full-length sequences.
5. We exclude all sequences generated by six studies that have high sequencing error rates.

For (1) I have included a list of citations in the gene record. I’ve also added a list of citations for the gene set itself.

For (2) and (4), the schema is more permissive at the moment and we can consider whether it should be tightened up to impose these restrictions. My inclination, at the moment, is not to do so. The same guidelines might not be applicable to other species, where the germline is not so well described. In any case, I think the discrimination should be made by the curators of the gene set: for example, as discussed in that post, there is sometimes uncertainty over the final nucleotides of the V-gene. Also, the schema might be used in other contexts where the restrictions are not imposed. For the same reason, I have made the citations fields optional rather than mandatory.
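To illustrate the idea of leaving such restrictions to the curators rather than to the schema, here is a sketch of an optional curator-side check for guideline (2). It assumes ungapped nucleotide sequences in a `sequence` field, and the function names are hypothetical:

```python
UNAMBIGUOUS_BASES = set("ACGT")


def has_ambiguities(sequence):
    """Return True if the sequence contains anything other than A, C, G or T."""
    return any(base not in UNAMBIGUOUS_BASES for base in sequence.upper())


def curator_filter(records, allow_ambiguous=False):
    """Apply the 'no ambiguities' guideline only when the curator opts in."""
    return [
        record for record in records
        if allow_ambiguous or not has_ambiguities(record["sequence"])
    ]


# Made-up records: the second one ends in an ambiguous base.
records = [
    {"gene_name": "EXAMPLE-V1*01", "sequence": "CAGGTGCAGCTG"},
    {"gene_name": "EXAMPLE-V2*01", "sequence": "GAGGTGCAGCTN"},
]
print([r["gene_name"] for r in curator_filter(records)])
print([r["gene_name"] for r in curator_filter(records, allow_ambiguous=True)])
```

The schema accepts both records; a curator for a well-characterised species could apply the strict filter, while one for a less well-described species could relax it.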

As always, please let me know whether you support this approach or have a different view.

Thanks

William

ematsen · October 2, 2016, 5:24pm

Hello there @w.lees-- this is great, and we’re taking a look at it now. However, the link you put in doesn’t allow comments. Could you paste a link that does? Thanks.

w.lees · October 3, 2016, 8:20am

Sorry, I have updated the message with a link that should allow comments.

schristley · October 4, 2016, 5:07pm

This looks great @w.lees, I’ve compared your schema against what VDJServer uses for its germline database and I think you’ve covered all of the fields that we have.

One thought I had was about the alternative/synonym names: instead of having two fields, would it be better to have a list in case there are more?

w.lees · October 6, 2016, 4:53pm

Thanks @schristley. I thought the two columns for alternative names might work better, because the curators could choose to use one column, say for IMGT names, the other column for some other set of names. So the column refers to a specific set. We could achieve the same effect with name/value pairs in a list, but it would not be very human readable. If necessary we can add more columns at a later date, but the number of different sets of names will hopefully be small.
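The two representations can be compared with a small sketch. The column names `alt_name_1` and `alt_name_2`, the `scheme`/`name` keys, and the example names are hypothetical:

```python
def names_to_columns(alt_names, schemes):
    """Map a list of scheme/name pairs onto fixed, per-scheme columns.

    `schemes` fixes which naming set each column refers to, so that
    column 1 always means the same set of names for every gene.
    """
    by_scheme = {entry["scheme"]: entry["name"] for entry in alt_names}
    return {
        f"alt_name_{index}": by_scheme.get(scheme, "")
        for index, scheme in enumerate(schemes, start=1)
    }


# List form: open-ended, but each entry has to carry its own scheme label.
alt_names = [
    {"scheme": "IMGT", "name": "EXAMPLE-V1*01"},
    {"scheme": "Legacy", "name": "V1-example"},
]

# Column form: fixed width, one column per naming set.
print(names_to_columns(alt_names, schemes=["IMGT", "Legacy"]))
```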

Does this make sense?

William

ematsen · October 11, 2016, 8:35pm

We had some AIRR folks in Seattle for the Immune Profiling meeting, and had a lot of productive discussions about germline database structure, in particular with @bussec, @cswarth, Charles Linkem (Adaptive), and @wsdewitt. @cswarth put together some notes that I’m pulling apart, editing, and posting here.

The discussion quickly led to a central question: what defines a “record” in the database? This doesn’t seem to be quite clear in the schema.

We considered two principles around which to organize the information:

One record per unique sequence

Having one record per unique sequence seems like the obvious thing for bioinformaticians. However, there is a pitfall: identical sequences at different loci. This is important information.

One could aggregate all of the locations into a list for that sequence record, but note that each location can have its own references and its own levels of confidence. This leads to nesting complications that might argue for the second option…

One record per original source reference

An alternative is to have one record per original source reference. Each record would be annotated with all the attributes for that particular germline sequence, including:

citation to published reference, or information leading to the dataset from which this allele was inferred
genome position, if known
segment name, with synonyms
confidence level in that observation of the sequence

[One record per unique position on the genome]

This doesn’t really get traction: one can’t express inferred alleles and one loses the uniqueness of sequences, yet it still has all of the nesting problems of one record per unique sequence.

So?

Having one record per original source reference seemed to be the consensus at the end of the meeting. However, this raises another question, which is what actually sits in a germline database list of sequences.

@bussec’s answer to this was having a second layer of records which aggregate the per-reference layer. We can talk about this after seeing if @w.lees or anyone else has an alternate solution to the issues raised above.
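A second layer that aggregates per-reference records could be sketched roughly as follows. This is not @bussec's actual proposal, only an illustration of the idea; the field names (`segment_name`, `genome_position`, `citation`) and the example data are assumptions:

```python
from collections import defaultdict


def aggregate_by_sequence(evidence_records):
    """Build one aggregate record per unique sequence from per-reference records."""
    grouped = defaultdict(list)
    for record in evidence_records:
        grouped[record["sequence"]].append(record)

    aggregated = []
    for sequence, records in grouped.items():
        aggregated.append({
            "sequence": sequence,
            "names": sorted({r["segment_name"] for r in records}),
            # Identical sequences at different loci stay visible here.
            "loci": sorted({r["genome_position"] for r in records if r.get("genome_position")}),
            "citations": sorted({r["citation"] for r in records}),
        })
    return aggregated


# Made-up evidence: the same sequence reported at two loci, plus an
# inferred allele with no known genome position.
evidence = [
    {"sequence": "CAGGTGCAGCTG", "segment_name": "EXAMPLE-V1*01",
     "genome_position": "locus-A", "citation": "Reference 1"},
    {"sequence": "CAGGTGCAGCTG", "segment_name": "EXAMPLE-V1D*01",
     "genome_position": "locus-B", "citation": "Reference 2"},
    {"sequence": "GAGGTGCAGCTG", "segment_name": "EXAMPLE-V2*01",
     "genome_position": None, "citation": "Reference 3"},
]
for entry in aggregate_by_sequence(evidence):
    print(entry)
```

The per-reference layer keeps each observation with its own source and position, while the aggregate layer is what a downstream tool would read as the germline set.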

a.collins · October 11, 2016, 10:53pm

Apologies if I am re-visiting points that have been made already, here or on other threads. Focusing on @ematsen’s “One record per original source reference”. The point regarding publications seems to be implying that publications may not be necessary. I need to be convinced on that. It is true that a bad publication serves little purpose, but seeing detail of how sequences have been generated has been critical to previous evaluations of ‘confidence levels’. I also think we need to encourage publications that identify BCR/TCR genes. A problem that we had for many years was that people came to see such reports as unimportant. By insisting on publications, we counter that view.

On the point of ‘confidence levels’, is this an evaluation of the confidence of the observation in the original source reference, or is this a measure of overall confidence, taking in additional evidence? If it is the former, this is an evaluation that I am not aware is being worked on at present. If the latter, the confidence level will change over time. Or have I missed something? I thought that confidence levels, as well as ‘functionality’ or ‘rearrangeability’ would be a separate layer of the system.

I would also add a comment on the ‘one record per unique sequence’ problem. I know leader sequences, RSS etc have been mentioned occasionally in various threads. Although most reported sequences only include the coding region, we should be encouraging the reporting of these elements. And one day, this will lead to a new kind of diversity in the germline repertoire. Not only are identical sequences possible at different loci, but we have the possibility of identical coding regions being associated with varying RSS or other regulatory elements. I can imagine attention returning to these elements as we struggle to understand the highly variable utilization frequencies of different genes. Should the database be designed so that it is ready for that?

ematsen · October 12, 2016, 12:43am

Thanks for your thoughts, @a.collins!

First, I should make it clear that “one record per original source reference” means that a given sequence may have many records in the database. We can think of each of these records as being bits of evidence that contribute to the overall judgement on including that sequence into a high-confidence germline set. Do you think that we should exclude any non-published evidence, even if sequences are publicly available?

For this consideration it was the former. Do you not think that a particular paper could be assigned a confidence based on methods used?

You are right that the per sequence rather than per source confidence in a sequence would be a separate layer, only hinted at here with the aggregate records.

And I love your comment about additional information on top of the coding sequence! By having a more flexible configuration afforded by “one record per original source reference” we can accommodate that.

a.collins · October 12, 2016, 3:04am

Thanks for the clarifications @ematsen! On the issue of whether or not a study needs to be published, I think we should do whatever we can to encourage this. It obviously is critical if a thorough evaluation of evidence is to be made. But there is a second reason why I think it would be useful. For most of the last 20 years, there has been little incentive to publish germline sequence studies, and we should adopt policies that turn this situation around. Demanding that a study be published if a sequence is to be taken into a database is not an incentive. On the other hand, the incentive will be there if we encourage people to cite publications that report sequences, instead of or in addition to just citing a paper describing the database. At present, the IMGT database is cited hundreds of times, but the papers that originally reported the sequences are long forgotten. Not only is this a little unfair, but if more people were aware of the original publications, we could have a very lively debate about the likelihood that the sequences were accurately reported.

w.lees · October 12, 2016, 10:47am

I think this is a good discussion, because it really focuses on what the germline set is for, and who is going to use it. For my part, I see the germline set (as defined currently in the schema) as the product of the curator’s work - in other words, the ‘second layer of records which aggregate the per-reference layer’. To that end, it has just a single reference to an original source sequence, which you can think of as the primary evidence for the gene described in the record. It doesn’t attempt to include details of everything the curators took into account when deciding to include this gene. I’d assumed that curators would vary in their approach, and that such details would best be described in a publication, or in release notes.

We probably do need to provide a place where inferred sequences can be deposited, as we know they can’t be deposited in Genbank. But I think we should keep this separate from the germline set. If we try to put all the source sequences into the germline set, it will accumulate many duplicates which don’t have biological significance or interest, and I think this would make it more difficult for downstream tools to consume it.

ematsen · October 14, 2016, 8:31pm

Thanks, @w.lees-- you are right that this is about who the database is for.

Namely, is the database meant to be useful for the curators in addition to the consumers? I could imagine it being nice for the curators to be able to use it to keep track of the evidence for a given gene, and also for people who want to understand the basis for a given “level of evidence” call for a given germline gene. Maintaining release notes is great, but might also be a burden.

I’d love to hear from some potential curators on this point, such as @a.collins, @ctwatson, and @bussec.

w.lees · October 15, 2016, 10:12am

I have made some small updates to the schema definition in response to comments posted in the document:

ORCID and PubMed ID added to author name and citation respectively, where these items exist.
Field labels 5_UTR and LEADIN changed to 5’UTR and L-REGION to match the labels used in the IMGT Ontology. As I noted in the document, I suggest that it is worth using the ontology as a starting point, as the terms are defined and quite widely used. Over time we can adapt and extend the definitions as necessary. The full set of labels is at http://www.imgt.org/ligmdb/label and background/citations are at http://www.imgt.org/IMGTindex/ontology.php .

Does anyone have views on the adoption of the IMGT labels as a starting point?

@a.collins, With respect to the comment in your last mail, the schema can accommodate any other labels we need in addition to the ones we already have (5’UTR, L-REGION, FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4), provided that they are annotations which occur at most once per sequence. If we wanted to add annotations, for example to label hotspots, that can occur multiple times in a sequence, we’d need to construct something else. That’s certainly do-able, but is not going to be easy to read or manage in a spreadsheet-type form.
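The distinction between the two kinds of annotation can be shown with a small sketch. The tuple layout `(label, start, end)`, the `hotspot` label, and the coordinates are invented for illustration:

```python
from collections import Counter


def split_annotations(annotations):
    """Separate labels that occur once per sequence from those that repeat.

    Once-only labels fit a flat, spreadsheet-like form (one start/end pair
    per label); repeated labels need a list, since a dict keyed by label
    could hold only one of them.
    """
    counts = Counter(label for label, _start, _end in annotations)
    once = {label: (start, end) for label, start, end in annotations if counts[label] == 1}
    repeated = [(label, start, end) for label, start, end in annotations if counts[label] > 1]
    return once, repeated


# Made-up coordinates; "hotspot" is a hypothetical repeated annotation.
annotations = [
    ("FR1", 1, 6),
    ("CDR1", 7, 12),
    ("hotspot", 3, 4),
    ("hotspot", 9, 10),
]
once, repeated = split_annotations(annotations)
print(once)
print(repeated)
```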

Are there other labels in addition to the ones listed that we should add at this stage?