Draft criteria for inclusion of inferred alleles into a germline database

ematsen · July 6, 2016, 5:55pm

I like @tbkepler’s idea of a statistical framework. Presumably, given a collection of rearranged sequences, this framework would indicate a level of confidence in a given inferred germline gene. As much as I love the spirit of this idea, it seems to me that any such framework would have to (at least implicitly) model the long-term evolutionary process of germline diversification, in addition to all of the complexities of repertoire sequencing. This seems like a tall order.

In such situations it seems perfectly suitable to use leave-out experiments to calibrate our sensitivity and specificity: remove a known germline allele from the database, run the inference, and see whether it is recovered. Although there are statistical issues with having the data ahead of time, we are increasingly rich in both repertoires and new germlines sequenced from genomic DNA, so we will be able to test on new data sets.
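
To make the leave-out idea concrete, here is a minimal sketch of how such an experiment might be scored. Everything in it is an assumption for illustration: `leave_out_calibration`, `infer_novel`, and `toy_infer` are hypothetical names, the inference tool is treated as a black box that takes a reduced database and returns the alleles it calls as novel, and precision is reported as a practical stand-in for specificity, since true negatives are hard to enumerate here.

```python
from typing import Callable, Dict, Iterable, Set


def leave_out_calibration(
    known_alleles: Iterable[str],
    infer_novel: Callable[[Set[str]], Set[str]],
) -> Dict[str, float]:
    """Hold out each known allele in turn and check whether it is re-inferred.

    infer_novel is a hypothetical black box: given the reduced database,
    it returns the set of allele names it would report as novel.
    """
    known = set(known_alleles)
    recovered = 0    # held-out alleles that were re-inferred
    false_calls = 0  # inferred alleles absent from the full known set

    for held_out in sorted(known):
        reduced_db = known - {held_out}
        inferred = set(infer_novel(reduced_db))
        if held_out in inferred:
            recovered += 1
        # Assumption: anything outside the full known set counts as a false
        # call. A genuinely new allele would be miscounted by this rule.
        false_calls += len(inferred - known)

    total_calls = recovered + false_calls
    return {
        "sensitivity": recovered / len(known) if known else float("nan"),
        "precision": recovered / total_calls if total_calls else float("nan"),
    }


def toy_infer(reduced_db: Set[str]) -> Set[str]:
    """Toy stand-in for a real inference tool (illustration only).

    Pretends that only two alleles are detectable in the repertoire, and
    reports whichever of them is missing from the database.
    """
    detectable = {"IGHV1-2*02", "IGHV3-23*01"}
    return detectable - reduced_db


known = {"IGHV1-2*02", "IGHV3-23*01", "IGHV4-34*01"}
result = leave_out_calibration(known, toy_infer)
print(result)
```

In this toy run, two of the three held-out alleles are recovered and no spurious alleles are called. With a real tool, `infer_novel` would wrap a run of the inference method on an actual repertoire, and the same scoring could be repeated on newly sequenced genomic germlines to avoid the issue of having seen the data ahead of time.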

This is a very important concern, but it is a user-level choice. Our role is to provide a convenient tiered structure that enables users to make this choice.