I have a next-generation sequencing dataset of paired HIV and antibody repertoire sequences, and I am wondering where the best place to deposit these sequences is. Ideally I’d like them to all be in the same repository. Specifically, the dataset consists of:
- Human variable region IgH sequences from peripheral blood, amplified from RNA.
- HIV env C2/V3 region sequences, also from peripheral blood.
Do people typically use SRA, or is there somewhere else that is better?
I am also wondering whether I need to worry about patient identifiability when uploading antibody repertoire data. Can one be identified by their antibody repertoire?
I think that a great standard was set by Bryan Briney in this paper
That is, put the raw info on SRA, and have the compressed preprocessed data somewhere easy to get. He used GitHub, which is great, though Zenodo would be just as good.
Both types of raw data can go in the SRA; you can even use the same BioProject ID to keep them easily linked together. For the antibody sequencing, it is even better if you can deposit metadata meeting the MiAIRR standard:
For more, see the docs page and check out the CAIRR pipeline.
For processed data, GitHub, Zenodo, and FigShare are all reasonable options, but even better would be VDJServer and/or iReceptor. (Though you have to ask @Brian_Corrie about the latter, as I don’t think there’s a way for users to upload directly.)
As far as identifiability goes: this is a matter of some debate. The Briney paper @ematsen linked above claims that repertoires from different individuals can be easily distinguished from one another, and we’re already able to infer a reasonably accurate personalized germline repertoire from typical AIRR-seq data. So there are people who specifically deposit in dbGaP instead of the unrestricted SRA. (The Boyd lab at Stanford is a prominent example.) But most of us are still using the unrestricted SRA. The choice can also depend on how the donors were consented, so I would recommend discussing with others at your institution and seeing if there is a standard policy.
@caschramm you are correct, we curate public data in our repository. We typically get these public data from studies that have published their “raw” data in SRA. Unfortunately, we are unable to provide a mechanism for users to curate their own data in our repository (we just don’t have the resources for that).
Instead, we provide what we call the “Turnkey Repository” (beta), an AIRR-compliant set of Docker containers that researchers can download and install locally. We hope it will make it easy for research groups to store and share their own AIRR-seq data. It consists of a database container (Mongo), a web service container that queries that database (using a Web API), and a data loading container that helps load AIRR-seq data into the repository…
Once you have such a repository, you can then let us know about it, and we can add your repository to the iReceptor Scientific Gateway (our web portal for searching such repositories). The iReceptor Gateway uses the Web API to query the network of repositories (what we call the AIRR Data Commons) that are linked to the scientific gateway and federate the resulting AIRR-seq data (and its metadata) for further analysis…
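To make the Web API part a bit more concrete, here is a minimal sketch of what a query body for a repertoire search might look like. It assumes the AIRR Data Commons API filter format (a `filters` object with `op` and `content`) and the MiAIRR field name `study.study_id`; the helper name `build_repertoire_query` and the study ID are hypothetical placeholders, and nothing here contacts a real repository.

```python
import json


def build_repertoire_query(field, value, op="="):
    """Build a JSON-serializable body for a single-filter repertoire query.

    Assumes the ADC API filter layout: {"filters": {"op", "content": {...}}}.
    """
    return {"filters": {"op": op, "content": {"field": field, "value": value}}}


# Placeholder study ID for illustration only; not a real accession.
query = build_repertoire_query("study.study_id", "PRJNA000000")
print(json.dumps(query, indent=2))
```

As I understand it, a body like this would be sent as an HTTP POST to the repository's repertoire endpoint; check the repository's own documentation for the exact base URL and the fields it supports.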
Thanks very much for the responses. I will look into our cohort’s permissions, and figure out which course of action is appropriate. I will also report back when the data is up, should anyone be interested!
TLS (GenBank) is the official AIRR-recommended location for processed/annotated AIRR data (linked to the appropriate SRA project with raw data through BioProject/BioSample records). Further information on how to go from an AIRR TSV file to GenBank/TLS submission files is here:
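Separately from the conversion procedure itself, a quick sanity check on the AIRR TSV file before starting can save some back and forth. The sketch below only reports which required Rearrangement columns are missing from the header. The column list is my reading of the AIRR Rearrangement schema and should be checked against the schema version you are targeting; the function name `missing_airr_columns` is made up for this example.

```python
import csv
import io

# Assumed required fields of the AIRR Rearrangement schema; verify against
# the schema version you are using before relying on this list.
REQUIRED_COLUMNS = [
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
    "v_cigar", "d_cigar", "j_cigar",
]


def missing_airr_columns(handle):
    """Return the required columns absent from the TSV header row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Tiny in-memory example; a real file would be opened with open(path, newline="").
example = io.StringIO(
    "sequence_id\tsequence\tv_call\tj_call\n"
    "seq1\tACGT\tIGHV1-2*02\tIGHJ4*02\n"
)
print(missing_airr_columns(example))
```

An empty list means the header has every column in the assumed list; it says nothing about whether the values themselves are valid.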