Immune repertoire sequence data sets are large and complex, so a lot of effort has gone into finding ways of characterizing repertoires. One way to do so is with summary statistics, by which I mean ways of turning a pile of sequences into collections of numbers that can be easily compared.
An important application of summary statistics is in comparing simulations to data. When we simulate data to benchmark the performance of immune repertoire sequence analysis tools, we must ensure that the simulated data recapitulate the characteristics of real data sets. We can do this by checking whether the summary statistics of the simulated data have values similar to those of the real data. In this way we can benchmark the benchmark data sets.
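To make the comparison between simulated and real data concrete, here is a minimal sketch that scores how far apart two samples of a scalar summary statistic are, using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The choice of KS is my assumption for illustration; any distance between distributions could be swapped in. The function name and the toy junction lengths are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy data (hypothetical): junction lengths from a real and a simulated repertoire.
real_lengths = [42, 45, 45, 48, 51, 54]
simulated_lengths = [42, 45, 48, 48, 51, 57]
print(ks_statistic(real_lengths, simulated_lengths))
```

A value near 0 means the simulation matches the data closely on this statistic, and a value near 1 means the two samples barely overlap.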
Here I’ve started a list of summary statistics that might be useful. I’ve focused on BCR repertoires, because they have the additional complication of being affinity matured. One could pick a set of TCR summary statistics by just taking those that are relevant to the TCR case.
When considering summary statistics, my general sense is that more is better, at least initially. Later on, we can identify the most discriminative summary statistics, or apply principal component analysis in this space to aggregate them. We should aspire to find statistics that are robust to noise.
Of course, I’d love to hear any general feedback on this approach.
“Naked” sequences (only error-corrected & paired-end assembled)
- pairwise distance distribution
- distributions of distance to closest, second closest, etc.
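The two statistics on unaligned sequences can be sketched as follows. I am assuming Levenshtein (edit) distance here because unaligned reads can differ in length; Hamming distance or something else may well be a better choice in practice. The all-pairs loop is quadratic, so this is only for small samples or subsamples. All function names are made up.

```python
from collections import Counter
from itertools import combinations


def edit_distance(s, t):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # match or substitution
        prev = curr
    return prev[-1]


def pairwise_distance_distribution(seqs):
    """Histogram of distances over all unordered pairs of sequences."""
    return Counter(edit_distance(a, b) for a, b in combinations(seqs, 2))


def kth_nearest_distances(seqs, k=1):
    """For each sequence, the distance to its k-th closest other sequence."""
    if k < 1 or k > len(seqs) - 1:
        raise ValueError("k must be between 1 and len(seqs) - 1")
    result = []
    for i, s in enumerate(seqs):
        dists = sorted(edit_distance(s, t) for j, t in enumerate(seqs) if j != i)
        result.append(dists[k - 1])
    return result


# Toy sequences (hypothetical), far shorter than real reads.
toy_seqs = ["ACGT", "ACGA", "ACG", "TTTT"]
print(pairwise_distance_distribution(toy_seqs))
print(kth_nearest_distances(toy_seqs, k=1))
print(kth_nearest_distances(toy_seqs, k=2))
```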
VDJ-aligned sequences
- joint distribution of germline gene use
- distribution of distance from naive to mature
- distribution of distance to second closest naive sequence (perhaps with a different V gene?)
- distribution of junction lengths
- distribution of distances between CDR3s
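As a sketch of two of the alignment-based statistics, the following computes joint V/J gene usage and the junction length distribution from annotated records. I am assuming each record is a dict with `v_call`, `j_call` and `junction` keys (names in the style of AIRR rearrangement tables); your annotation tool may use different fields. The function names and the toy junction strings are made up.

```python
from collections import Counter


def joint_gene_usage(records):
    """Relative frequency of each (V gene, J gene) pair."""
    counts = Counter((r["v_call"], r["j_call"]) for r in records)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def junction_length_distribution(records):
    """Histogram of junction lengths in nucleotides."""
    return Counter(len(r["junction"]) for r in records)


# Toy records (hypothetical); the junction strings are not real sequences.
toy_records = [
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction": "TGTGCGAGATGG"},
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction": "TGTGCGAGAGATTGG"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ6", "junction": "TGTGCGAAATGG"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ4", "junction": "TGTGCGAAAGACTGG"},
]
print(joint_gene_usage(toy_records))
print(junction_length_distribution(toy_records))
```

Adding the D gene call to the tuple would extend the joint distribution to V/D/J, at the cost of sparser counts.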
Clustered sequences
- cluster size distribution
- estimated total clonal diversity (many estimators following work of Efron and Thisted)
- mutation models (e.g. S5F)
- selection estimates (e.g. nonsynonymous/synonymous rate ratio)
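For the first two clustering statistics, here is a minimal sketch that takes one cluster label per sequence. For total diversity I have swapped in the bias-corrected Chao1 estimator because it is simple to state. It is not the Efron and Thisted estimator, just one of the many nonparametric richness estimators in that general line of work. The function names are made up.

```python
from collections import Counter


def cluster_size_distribution(cluster_ids):
    """Map cluster size -> number of clusters of that size."""
    sizes = Counter(cluster_ids)    # cluster id -> number of sequences
    return Counter(sizes.values())  # size -> number of clusters


def chao1(cluster_ids):
    """Bias-corrected Chao1 estimate of the total number of clusters (clones)."""
    dist = cluster_size_distribution(cluster_ids)
    observed = sum(dist.values())
    f1 = dist.get(1, 0)  # clusters seen exactly once (singletons)
    f2 = dist.get(2, 0)  # clusters seen exactly twice (doubletons)
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy cluster assignments (hypothetical), one label per sequence.
toy_clusters = ["a", "a", "a", "b", "b", "c", "d", "e"]
print(cluster_size_distribution(toy_clusters))
print(chao1(toy_clusters))
```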
Trees
- tree balance (Mooers and Heard)
- graph theoretical features (many papers of @ddw and Ramit Mehr)
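As one concrete example of a tree balance statistic, here is a sketch of the Colless index, which sums the absolute difference in leaf counts between the two children of every internal node. I am assuming a strictly binary tree written as nested 2-tuples, which is a simplification: real B cell lineage trees can have multifurcations and sampled internal nodes, which this does not handle. The function name is made up.

```python
def colless_index(tree):
    """Colless index of a binary tree.

    A leaf is any non-tuple value; an internal node is a 2-tuple (left, right).
    """
    def walk(node):
        # Returns (number of leaves below node, imbalance summed below node).
        if not isinstance(node, tuple):
            return 1, 0
        left, right = node
        n_left, imb_left = walk(left)
        n_right, imb_right = walk(right)
        return n_left + n_right, imb_left + imb_right + abs(n_left - n_right)

    return walk(tree)[1]


# Toy trees (hypothetical) on four leaves.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(colless_index(balanced), colless_index(caterpillar))
```

A perfectly balanced tree scores 0, and for a fixed number of leaves the fully ladder-like (caterpillar) tree scores highest. To compare trees of different sizes the index would need to be normalized.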