I want to share our preliminary work on developing statistical classifiers for repertoires. The main idea in our paper is to score snippets (6-mers) of each CDR3 sequence by their biochemical features with a “detector” function, and then to aggregate the scores into a single value that can represent a diagnosis. We believe this is an important step toward using the information contained in each individual sequence instead of relying on summary statistics of repertoires (e.g., diversity scores).
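To make the idea concrete, here is a minimal sketch in Python. It is not the code from the paper. It assumes a single biochemical feature per residue (the Kyte-Doolittle hydropathy scale, used here only as a stand-in), a logistic "detector" with hypothetical weights, and a maximum over all snippet scores as the aggregation step. The function names and the example sequences are made up for illustration.

```python
import math

# Kyte-Doolittle hydropathy values: a stand-in for the biochemical features
# (assumption: the real feature set may differ).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmers(seq, k=6):
    """Return every overlapping snippet of length k in a CDR3 sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def detector(snippet, weights, bias):
    """Score one snippet: logistic function of a weighted sum of features."""
    z = bias + sum(w * KD[aa] for w, aa in zip(weights, snippet))
    return 1.0 / (1.0 + math.exp(-z))


def repertoire_score(cdr3s, weights, bias, k=6):
    """Aggregate snippet scores into one value per repertoire.

    Assumption: the aggregate is the maximum snippet score, so a single
    strongly scoring snippet is enough to flag the repertoire.
    """
    scores = [detector(s, weights, bias) for seq in cdr3s for s in kmers(seq, k)]
    if not scores:
        raise ValueError("no snippets of length k found in the repertoire")
    return max(scores)


# Hypothetical toy repertoire and hypothetical weights, for illustration only.
repertoire = ["CASSLGQAYEQYF", "CASSPDRGNTEAFF"]
weights = [0.1, -0.2, 0.3, 0.1, -0.1, 0.2]
score = repertoire_score(repertoire, weights, bias=-0.5)
print(f"repertoire score: {score:.3f}")
```

In practice the weights and bias would be fit to labeled repertoires; here they are fixed by hand just to show how a score per snippet becomes a score per repertoire.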
Neat paper!
From your Abstract:
…prior methods to date have been limited to focusing on repertoire-level summary statistics, ignoring the vast amounts of information in the millions of individual immune receptors comprising a repertoire. We have developed a novel method that addresses this limitation by using innovative approaches for accommodating the extraordinary sequence diversity of immune receptors and widely used machine learning approaches.
And from the Conclusions:
Our method is the first to apply statistical learning to immune repertoires to aid disease diagnosis, learning repertoire-level labels from the set of individual immune repertoire sequences.
You might be interested in these papers:
- Thomas et al. Bioinformatics, Volume 30, Issue 22, 15 November 2014, where Atchley vector encoding of TCR sequence k-mers is used to classify immunization exposure.
- Emerson et al. Nature Genetics 49, 659–665 (2017), where a viral infection classifier built from ~100 million TCR sequence features from ~700 subjects achieves AUROC > 0.9 both in cross-validation and in a separate validation cohort.
- Dash et al. Nature 547, 89–93 (06 July 2017) and Glanville et al. Nature 547, 94–98 (06 July 2017), each of which presents methods to classify epitope specificity of TCR repertoires.
Thanks.
Perhaps we should have phrased that sentence (and maybe another one like it) differently. The last two papers came out after we had submitted for peer review.
I think our approach is distinct in that it really highlights using sequence-level features, not features that summarize a cluster or that summarize a repertoire.