As part of the Software Working Group, we are moving towards developing a collaborative validation exercise. We’ve talked a little about various aspects of how we might do this on this thread, but given some recent conversations I wanted to back up and see if we could establish consensus on the large-scale ideas of what this might accomplish.
In my mind, the overall goal of validation should be to estimate the expected accuracy of a tool on an actual biological data set sampled from a real subject (human or otherwise). Here, the definition of accuracy depends on what exactly is being tested; more on that below.
The principle underlying the tests is that for any given observed set of immune receptor sequences, biological or simulated, there is a true underlying process that has led to those observed sequences. These processes are V(D)J rearrangement, clonal expansion and somatic hypermutation (for B cells), then sequence prep and sequencing. Again, our goal is to estimate the deviation of the inferred processes from the underlying processes.
This goal and principle, then, guide us on data set choice for validation experiments. For biological validation data sets for which we know the history (perhaps only partially), such as mice with a reduced collection of germline genes, the rest of the process should replicate the process happening in wild-type animals as closely as possible. Similarly, when simulating sequences on the computer, we should devise simulation procedures that mimic our understanding of the underlying processes as well as possible.
For example, I would imagine that a high-quality in silico BCR simulation process would consist of the following steps:
- A subject’s allelic genotype is simulated by drawing germline alleles from the complete set of germline alleles
- Naive cell BCRs are simulated from that genotype using a realistic in silico rearrangement process, including realistic per-gene trimming and insertion distributions
- Experienced cell BCRs are simulated by a process of diversification and nucleotide substitution; this should include known biases of the mutation process such as context/position sensitivity
- Sequencing errors are simulated
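To make these four steps concrete, here is a minimal Python sketch of such a pipeline. Everything in it is an assumption for illustration only: the `GERMLINE` sequences are made-up toy strings (not real germline alleles), trimming and insertion lengths are drawn uniformly rather than from realistic per-gene distributions, mutation is context-free, and each clone is a simple star of independently mutated copies rather than a proper lineage tree. The point is only that every simulated read carries its own ground truth.

```python
import random

BASES = "ACGT"

# Hypothetical toy germline set (made-up sequences, NOT real alleles).
GERMLINE = {
    "V": {"V1*01": "ACGTACGTACGTAAGG", "V1*02": "ACGTACGAACGTAAGG",
          "V2*01": "TTGCATGCCATGAACC"},
    "D": {"D1*01": "GGTACAGC", "D2*01": "CTAGGTCA"},
    "J": {"J1*01": "TTGACTACTGGGGC", "J2*01": "ACTGGTTCGACCCC"},
}


def simulate_genotype(rng, germline=GERMLINE, n_per_segment=2):
    """Step 1: draw a subject's alleles from the complete germline set."""
    return {
        seg: dict(rng.sample(sorted(alleles.items()),
                             min(n_per_segment, len(alleles))))
        for seg, alleles in germline.items()
    }


def random_bases(rng, n):
    return "".join(rng.choice(BASES) for _ in range(n))


def rearrange(rng, genotype, max_trim=3, max_insert=4):
    """Step 2: one naive rearrangement. Uniform trimming and insertion
    lengths are placeholders for realistic per-gene distributions."""
    event, pieces = {}, []
    for seg in ("V", "D", "J"):
        name = rng.choice(sorted(genotype[seg]))
        seq = genotype[seg][name]
        trim5 = 0 if seg == "V" else rng.randint(0, max_trim)
        trim3 = 0 if seg == "J" else rng.randint(0, max_trim)
        event[seg] = name
        event[seg + "_trim"] = (trim5, trim3)
        pieces.append(seq[trim5:len(seq) - trim3])
    vd = random_bases(rng, rng.randint(0, max_insert))
    dj = random_bases(rng, rng.randint(0, max_insert))
    event["insertions"] = (vd, dj)
    return pieces[0] + vd + pieces[1] + dj + pieces[2], event


def substitute(rng, seq, rate):
    """Substitute each base independently with probability `rate`.
    Context-free; a realistic model would be context/position sensitive."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate
        else base
        for base in seq
    )


def simulate_subject(seed, n_clones=5, clone_size=4,
                     shm_rate=0.05, error_rate=0.01):
    """Run all four steps and keep the truth alongside each read."""
    rng = random.Random(seed)
    genotype = simulate_genotype(rng)
    reads = []
    for clone_id in range(n_clones):
        naive, event = rearrange(rng, genotype)
        for _ in range(clone_size):
            experienced = substitute(rng, naive, shm_rate)      # step 3
            observed = substitute(rng, experienced, error_rate)  # step 4
            reads.append({"clone": clone_id, "event": event,
                          "naive": naive, "seq": observed})
    return genotype, reads
```

Calling `simulate_subject(seed)` returns the true genotype plus a list of reads, each tagged with its true clone and rearrangement event, which is exactly the information an accuracy assessment needs.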
When assessing accuracy, we measure how far inferences on the simulated sequences deviate from the true elements that went into those sequences. For example, compare inferred allelic genotype to the actual genotype, compare inferred per-sequence rearrangement event to the true rearrangement, and compare inferred clonal families to the true clonal families.
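As a rough sketch of what those three comparisons could look like in code, here is one possible set of metrics in Python. The choice of metrics is my assumption, not something we have agreed on: precision/recall over allele names for genotypes, the fraction of exactly matching gene calls for rearrangements, and pairwise precision/recall for clonal families. The function names are hypothetical.

```python
from itertools import combinations


def genotype_accuracy(true_alleles, inferred_alleles):
    """Precision and recall of the inferred allele set."""
    true_set, inferred_set = set(true_alleles), set(inferred_alleles)
    hits = len(true_set & inferred_set)
    precision = hits / len(inferred_set) if inferred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall


def call_accuracy(true_calls, inferred_calls):
    """Fraction of sequences whose inferred call equals the true call."""
    if len(true_calls) != len(inferred_calls):
        raise ValueError("call lists must have the same length")
    matches = sum(t == i for t, i in zip(true_calls, inferred_calls))
    return matches / len(true_calls)


def pairwise_clonal_accuracy(true_labels, inferred_labels):
    """Pairwise precision and recall: a pair of sequences counts as
    positive when both are placed in the same clonal family."""
    if len(true_labels) != len(inferred_labels):
        raise ValueError("label lists must have the same length")
    tp = fp = fn = 0
    for a, b in combinations(range(len(true_labels)), 2):
        same_true = true_labels[a] == true_labels[b]
        same_inferred = inferred_labels[a] == inferred_labels[b]
        if same_true and same_inferred:
            tp += 1
        elif same_inferred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Exact-match scoring of calls is deliberately strict; a softer variant might credit the correct gene with the wrong allele, and that is the kind of definition I think we would need to settle per tool category.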
By repeating this across many simulated subjects, each with a large set of sequences, we can estimate how these tools would be expected to perform on real data.
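A small sketch of that repetition step, in Python: given one accuracy score per simulated subject, report the mean together with its spread across subjects. The `summarize` helper and the example scores are hypothetical, and using the standard error of the mean as the uncertainty measure is my assumption.

```python
import math
import statistics


def summarize(per_subject_scores):
    """Mean, sample standard deviation, and standard error of the mean
    for one accuracy metric measured on several simulated subjects."""
    if len(per_subject_scores) < 2:
        raise ValueError("need at least two subjects to estimate spread")
    mean = statistics.mean(per_subject_scores)
    sd = statistics.stdev(per_subject_scores)
    sem = sd / math.sqrt(len(per_subject_scores))
    return {"mean": mean, "sd": sd, "sem": sem}


# Made-up scores for three simulated subjects, for illustration only.
summary = summarize([0.8, 0.9, 1.0])
```

The spread across subjects matters as much as the mean here, since a tool that does well on average but poorly on particular genotypes would look fine if we only reported the average.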
I know that this perspective isn’t universally shared. @jianye, would you like to present an alternate perspective?