Sponsored by the AIRR Community

Database on published TCR sequences with known antigen specificity

Right now I’m mining published studies to extract TCR sequences with known HLA/antigen specificity.
I’ve organized everything using a fairly straightforward architecture (see this repository):

  • All data is stored in a tab-delimited format, arguably the easiest one to edit manually
  • Papers that potentially contain annotated TCR sequences are tracked as issues, submissions are managed via pull requests (I plan to implement a simple form for generating them), and CI checks the data for consistency
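
To illustrate the kind of consistency check CI could run on a tab-delimited table, here is a minimal sketch. The column names in `REQUIRED_COLUMNS` are hypothetical placeholders, not the repository's actual schema, and the two rules shown (no empty required fields, CDR3 made of standard amino acid letters) are assumptions about what "consistency" might cover.

```python
import csv

# Hypothetical column names; the real schema lives in the repository.
REQUIRED_COLUMNS = ["cdr3", "v_segment", "j_segment", "mhc",
                    "antigen_epitope", "reference_id"]
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_table(handle):
    """Return a list of human-readable problems found in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"header: missing columns {missing}"]
    problems = []
    # Data rows start on line 2, right after the header.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            # Short rows yield None for absent fields, hence the `or ""`.
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty {column}")
        cdr3 = (row["cdr3"] or "").strip()
        if cdr3 and not set(cdr3) <= AMINO_ACIDS:
            problems.append(f"line {line_no}: non-amino-acid characters in cdr3")
    return problems
```

A CI job could call `check_table` on every table touched by a pull request and fail the build whenever the returned list is non-empty.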

Please share your thoughts:

  • Do you think such a database will be a useful resource for the community?
  • Do you know of any similar initiatives?

Any comments on the design would be most appreciated.


@mikhail.shugay I think this is a great idea, especially if it encourages people to contribute unpublished TCRs. I presume that one application of such a database would be prediction of antigen specificity? If so, it would be useful to have some ground rules about how to test performance (e.g. using 90:10 cross-validation) to make different algorithms easier to compare.

Benchmarking TCR specificity algorithms is of course the ultimate goal. Doing that via cross-validation assumes the algorithm performs TCR clustering rather than specificity prediction. For a prediction algorithm, I think the first step would be to run it against the database and against permuted TCR:antigen pairs.
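
One way to picture the permutation idea is the rough sketch below. It shuffles antigen labels across TCRs to build decoy pairs, then compares how often a predictor says "yes" on real versus decoy pairs. The names `permuted_pairs`, `hit_rates` and `predict` are made up for illustration, and `predict(tcr, antigen)` is assumed to return True or False; nothing here is an existing tool or the database's own benchmark.

```python
import random


def permuted_pairs(pairs, seed=0):
    """Shuffle antigen labels across TCRs; drop any pair that is in the real set."""
    rng = random.Random(seed)
    tcrs = [tcr for tcr, _ in pairs]
    antigens = [antigen for _, antigen in pairs]
    rng.shuffle(antigens)
    known = set(pairs)
    return [(t, a) for t, a in zip(tcrs, antigens) if (t, a) not in known]


def hit_rates(predict, pairs, seed=0):
    """Fraction of real pairs and of permuted pairs that `predict` calls positive."""
    decoys = permuted_pairs(pairs, seed)
    real = sum(predict(t, a) for t, a in pairs) / len(pairs)
    # With very few records the shuffle may leave no usable decoys.
    null = (sum(predict(t, a) for t, a in decoys) / len(decoys)
            if decoys else float("nan"))
    return real, null
```

A predictor that does no better on real pairs than on permuted ones has not learned anything about specificity, whatever its raw hit rate on the database.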

However, right now the goals are far more modest: doing some basic meta-analysis and annotating RepSeq samples. The latter is extremely exciting, as it adds a new dimension of analysis that can be applied to published RepSeq studies; just imagine if all RNA-Seq papers had been published with no GO enrichment analysis.

Contributing new sequences will be quite easy once a web interface for submissions is done; the GitHub interface is not that biologist-friendly. For now I’m focused on previously published papers, as there is a ton of information there, and even “public” clonotype studies are just the tip of the iceberg. The main problem is the great diversity in reporting styles; needless to say, some papers require running image processing software.

@mikhail.shugay I think this is a great idea, and it would certainly be of use to the community. I have had similar ideas, but for a BCR database, and have made a start collating these from the literature in a fairly crude manner. I have been thinking about the best ways to share this database, and also to encourage people to submit their own sequences (trawling the literature and manually typing in sequences is not ideal), but have not started developing anything yet. Do you think it would be possible to make a version of your database where BCRs could be submitted? I have found my database to be very useful for initial annotation of our RepSeq datasets collected following vaccination, and it increases the amount of interesting data we can extract from them.

I was thinking about it, and it seems to me that there are a number of existing antibody databases (some of them private, of course). Adding BCRs to this database would also bring some complications:

  • Extending the current database layout with a column encoding hypermutations (only the differences from the germline part) and leaving the MHC columns blank
  • Currently there are identification/verification fields used to assess the confidence score for a given record (e.g. tetramer sorting, the frequency in the sorted population, etc.). What should one add for antibody assays?
  • Developing algorithms to query the database. I’m currently working on TCR/CDR3 matching algorithms, and this seems to be a really complex task. Matching while accounting for hypermutations and really long antibody CDR3 sequences would add another level of complexity

So while an open database of antibody sequences is a great idea, lots of work is needed to make it really useful. Hopefully, once we have enough experience with the TCR database, we can move on to antibody data.
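
For a sense of what the simplest form of CDR3 matching looks like, here is a minimal sketch that allows a few substitutions between equal-length amino acid sequences (Hamming distance). It is only a baseline for illustration, not the matching algorithm under development, and the function names are hypothetical. It deliberately ignores insertions, deletions and hypermutations, which are exactly the parts that make the antibody case harder.

```python
def hamming_within(a, b, max_mismatches):
    """True if a and b have equal length and differ in at most max_mismatches positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # stop early once the budget is exceeded
    return True


def search(query, database_cdr3s, max_mismatches=1):
    """Return database CDR3s within max_mismatches substitutions of the query."""
    return [s for s in database_cdr3s if hamming_within(query, s, max_mismatches)]
```

A linear scan like this costs one comparison per record per query, so annotating a whole RepSeq sample against a large table would likely need an index of some sort (for example grouping records by length first).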

Bump with an updated version of the online database browser (still in beta)

Suggestions welcome!
