Statistical classifiers for diagnosing disease from immune repertoires


I want to share our preliminary work on developing statistical classifiers for repertoires. The main idea in our paper is to score snippets (6-mers) of each CDR3 sequence by their biochemical features with a “detector” function, then aggregate the scores into a single value that can represent a diagnosis. We believe this is an important step toward using the information contained in each individual sequence instead of relying on summary statistics of repertoires (e.g., diversity scores).
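To make the idea concrete, here is a minimal sketch of the score-then-aggregate pipeline. It is not the code from the paper, and several pieces are assumptions made for illustration: the two per-residue features (Kyte-Doolittle hydropathy and a simplified net charge), the logistic form of the detector, the use of the maximum as the aggregation step, and the helper names `kmers`, `featurize`, `detector`, and `repertoire_score`. The weights are hand-picked rather than fitted, and the CDR3 strings are made-up toy inputs.

```python
import math

K = 6  # snippet length (6-mers), as described in the post

# Assumption: two illustrative biochemical features per residue.
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified net charge at neutral pH; residues not listed count as 0.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def kmers(cdr3, k=K):
    """Return every overlapping k-mer of a CDR3 string (empty if too short)."""
    return [cdr3[i:i + k] for i in range(len(cdr3) - k + 1)]


def featurize(snippet):
    """Flatten a snippet into [hydropathy, charge] pairs, one pair per position."""
    feats = []
    for aa in snippet:
        feats.append(HYDROPATHY[aa])
        feats.append(CHARGE.get(aa, 0.0))
    return feats


def detector(snippet, weights, bias):
    """Assumed detector: logistic function of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, featurize(snippet)))
    return 1.0 / (1.0 + math.exp(-z))


def repertoire_score(cdr3_list, weights, bias):
    """Aggregate snippet scores into one value per repertoire.

    Assumption: the aggregate is the maximum snippet score, so a single
    strongly scoring snippet is enough to flag the whole repertoire.
    """
    scores = [
        detector(snippet, weights, bias)
        for cdr3 in cdr3_list
        for snippet in kmers(cdr3)
    ]
    if not scores:
        raise ValueError("no sequence is long enough to yield a 6-mer")
    return max(scores)


# Toy usage: hand-picked weights (not trained) and made-up sequences.
toy_weights = [0.1, 0.5] * K  # one (hydropathy, charge) weight pair per position
toy_repertoire = ["CASSLGQGAYEQYF", "CASSPDRGKETQYF"]
print(repertoire_score(toy_repertoire, toy_weights, bias=-1.0))
```

In a real classifier the weights and bias would be fitted so that the aggregated score separates repertoires labeled with the disease from those without it. The point of the sketch is only that every individual sequence contributes snippets to the score, rather than the repertoire being reduced to a summary statistic first.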


Neat paper!

From your Abstract:

…prior methods to date have been limited to focusing on repertoire-level summary statistics, ignoring the vast amounts of information in the millions of individual immune receptors comprising a repertoire. We have developed a novel method that addresses this limitation by using innovative approaches for accommodating the extraordinary sequence diversity of immune receptors and widely used machine learning approaches.

And from the Conclusions:

Our method is the first to apply statistical learning to immune repertoires to aid disease diagnosis, learning repertoire-level labels from the set of individual immune repertoire sequences.

You might be interested in these papers:



Perhaps we should have phrased that sentence (and maybe another one like it) differently. The last two papers came out after we had submitted for peer review.

I think our approach is distinct in that it really highlights using sequence-level features, not features that summarize a cluster or a whole repertoire.