Standardizing the format of a germline set

ematsen · October 11, 2016, 8:35pm

We had some AIRR folks in Seattle for the Immune Profiling meeting, and had a lot of productive discussions about germline database structure, in particular with @bussec, @cswarth, Charles Linkem (Adaptive), and @wsdewitt. @cswarth put together some notes that I’m pulling apart, editing, and posting here.

The discussion quickly led to a central question: what defines a “record” in the database? This doesn’t seem to be quite clear in the schema.

We considered two principles around which to organize the information:

One record per unique sequence

Having one record per unique sequence seems like the obvious choice for bioinformaticians. However, there is a pitfall: identical sequences can occur at different loci, and which locus a sequence sits at is important information.

One could aggregate all of the locations into a list for that sequence record, but note that each location can have its own references and its own levels of confidence. This leads to nesting complications that might argue for the second option…
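To make the nesting concrete, here is a minimal Python sketch of what a sequence-keyed record might look like. All class names, field names, and values are hypothetical placeholders for illustration; none of them come from the schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LocusObservation:
    """One place where the sequence has been reported (hypothetical)."""
    locus: str                                  # placeholder position label
    references: List[str] = field(default_factory=list)
    confidence: Optional[str] = None            # confidence is per location


@dataclass
class SequenceRecord:
    """One record per unique sequence, with locations nested inside."""
    sequence: str
    locations: List[LocusObservation] = field(default_factory=list)


def confidence_at(record: SequenceRecord, locus: str) -> Optional[str]:
    """Look up the confidence for one locus; requires walking the nested list."""
    for obs in record.locations:
        if obs.locus == locus:
            return obs.confidence
    return None


# Placeholder data: the same sequence reported at two loci, each with its
# own references and its own confidence level.
record = SequenceRecord(
    sequence="ACGT",
    locations=[
        LocusObservation("locus_A", ["ref_1", "ref_2"], "high"),
        LocusObservation("locus_B", ["ref_3"], "low"),
    ],
)
```

Even in this toy form, references and confidence cannot live on the top-level record; they have to be attached to each location, which is where the nesting comes from.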

One record per original source reference

An alternative is to have one record per original source reference. Each record would be annotated with all the attributes for that particular germline sequence, including:

citation to published reference, or information leading to the dataset from which this allele was inferred
genome position, if known
segment name, with synonyms
confidence level in that observation of the sequence
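As a rough illustration, a per-reference record could be a flat structure along these lines. This is only a sketch: the class name, field names, types, and example values are assumptions made for the example, not part of any agreed schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReferenceRecord:
    """One record per original source reference (hypothetical layout)."""
    sequence: str
    citation: str                           # publication, or pointer to the dataset
    segment_name: str
    synonyms: Tuple[str, ...] = ()
    genome_position: Optional[str] = None   # None when the position is not known
    confidence: Optional[str] = None        # confidence in this one observation


# Placeholder data: the same sequence reported by two different sources.
# Each report is its own flat record, so nothing needs to be nested.
rec_a = ReferenceRecord(
    sequence="ACGT",
    citation="ref_1",
    segment_name="segment_X",
    synonyms=("segment_X_alt",),
    genome_position="locus_A",
    confidence="high",
)
rec_b = ReferenceRecord(
    sequence="ACGT",
    citation="ref_2",
    segment_name="segment_Y",
    confidence="low",                       # e.g. inferred, position unknown
)
```

The trade-off is that sequences are no longer unique across records: `rec_a` and `rec_b` carry the same sequence but remain distinct entries.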

[One record per unique position on the genome]

This doesn’t really get traction: one can’t express inferred alleles, one loses the uniqueness of sequences, and it still has all of the nesting problems of one record per unique sequence.

So?

Having one record per original source reference seemed to be the consensus at the end of the meeting. However, this raises another question: what actually sits in a germline database’s list of sequences?

@bussec’s answer to this was to have a second layer of records that aggregates the per-reference layer. We can talk about this after seeing if @w.lees or anyone else has an alternate solution to the issues raised above.
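One way to picture such a second layer is as a grouping step over the per-reference records. The sketch below groups by sequence, which is an assumption on my part; the discussion did not settle what the aggregation key should be, and all names and values here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class ReferenceRecord(NamedTuple):
    """Minimal stand-in for a per-reference record (hypothetical fields)."""
    sequence: str
    citation: str
    segment_name: str
    genome_position: Optional[str] = None


def aggregate_by_sequence(
    records: Iterable[ReferenceRecord],
) -> Dict[str, List[ReferenceRecord]]:
    """Build the second layer: one entry per unique sequence, each pointing
    back to every per-reference record that reported it."""
    grouped: Dict[str, List[ReferenceRecord]] = defaultdict(list)
    for rec in records:
        grouped[rec.sequence].append(rec)
    return dict(grouped)


# Placeholder per-reference layer.
per_reference = [
    ReferenceRecord("ACGT", "ref_1", "segment_X", "locus_A"),
    ReferenceRecord("ACGT", "ref_2", "segment_Y"),
    ReferenceRecord("TTGA", "ref_3", "segment_Z", "locus_C"),
]

aggregated = aggregate_by_sequence(per_reference)
```

Here the per-reference layer stays flat and authoritative, and the aggregated layer is derived from it, so it could be regenerated whenever a new reference record is added.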