Most predictive models assume the features are arranged into rows and columns, like a spreadsheet, but many kinds of data do not conform to this structure. Sequences are one example of a different kind of data, which is why this data is usually stored in a text document, not a spreadsheet. To build predictive models for sequences and other non-conforming features, we have developed what we call dynamic kernel matching (DKM).
We apply DKM to two datasets of T-cell receptors.
- The first dataset is what we call the antigen classification problem. From sequenced T-cell receptors, we predictively model which of six disease antigens the T-cell receptor binds.
- The second dataset is what we call the repertoire classification problem. From sequenced T-cell repertoires, we predictively model CMV serostatus. This model is built without knowledge of the CMV specific T-cell receptors.
The full results are in a manuscript we’ve submitted for peer-review, but I thought this community may want the code, which is at https://github.com/jostmey/dkm.
Cheers