Looking for BCR datasets containing nonproductive recombinations

caschramm · June 6, 2016, 8:24pm

Continuing the discussion from Evaluation datasets etc:

Thanks to @javh for the list of publicly available HTS data. Most of these are sequenced from mRNA, so are unlikely to capture the nonproductive repertoire.
However, some used gDNA, allowing the nonproductive repertoire to be captured as well:

Boyd, S. D. et al. Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing. Sci. Transl. Med. 1, 12ra23 (2009). SRA Accession: SRP001460
Jiang,N. et al. (2013) Lineage structure of the human antibody repertoire in response to influenza vaccination. Sci. Transl. Med., 5, 171ra19. SRA Accession: SRA058972
Ohm-Laursen and Barington (2007) Analysis of 6912 unselected somatic hypermutations in human VDJ rearrangements reveals lack of strand specificity and correlation between phase II substitution rates and distance to the nearest 3’ activation-induced cytidine deaminase target. J Immunol. 178(7):4322-34. EMBL AM076988–AM083316

However, the first two use VH-internal primers (FR1 or FR2) and the last only looks at a single V gene (VH3-23). There’s also a really nice paper looking directly at the nonproductive allele in mice:

Is anyone aware of other data sets?

javh · June 6, 2016, 9:53pm

In addition to what @caschramm has listed, these are all the public BCR data sets sequenced from gDNA that I’m aware of:

Jackson,K.J.L. et al. (2014) Human Responses to Influenza Vaccination Show Seroconversion Signatures and Convergent Antibody Rearrangements. Cell Host Microbe, 105–114.

dbGaP Accession: phs000760.v1.p1
Roche 454, gDNA

Kaplinsky, J. et al. Antibody repertoire deep sequencing reveals antigen-independent selection in maturing B cells. Proc. Natl. Acad. Sci. 111, E2622–E2629 (2014).

BioProject Accession: PRJNA248676
Illumina MiSeq 2x150, gDNA

Michaeli, M. et al. Immunoglobulin gene repertoire diversification and selection in the stomach - from gastritis to gastric lymphomas. Front. Immunol. 5, 1–14 (2014).

BioProject Accession: PRJNA206548
Roche 454, gDNA

Parameswaran,P. et al. (2013) Convergent Antibody Signatures in Human Dengue. Cell Host Microbe, 13, 691–700.

BioProject Accession: PRJNA205206
Roche 454, gDNA

Roskin, K. M. et al. IgH sequences in common variable immune deficiency reveal altered B cell development and selection. Sci. Transl. Med. 7, 302ra135–302ra135 (2015).

dbGaP Accession: phs000934.v1.p1
Illumina MiSeq and Roche 454, mRNA and gDNA

Wang,C. et al. (2014) Effects of aging, cytomegalovirus infection, and EBV infection on human B cell repertoires. J. Immunol., 192, 603–11.

dbGaP Accession: phs000666.v1.p1
Roche 454, gDNA/mRNA

Wang,C. et al. (2014) B-cell repertoire responses to varicella-zoster vaccination in human identical twins. Proc. Natl. Acad. Sci. U. S. A.

dbGaP Accession: phs000817.v1.p1
Roche 454, gDNA/mRNA

I’ve been periodically updating my list. For now I’ve put it here:
https://bitbucket.org/javh/airrseq/wiki/PublicAIRR-seq

Apologies for the plain-text-ness of it. I may move it into a more searchable format at a later date.

Edit: Updated the URL to the new location.

caschramm · June 7, 2016, 1:35pm

Thanks so much, Jason!

caschramm · June 7, 2016, 1:38pm

“You do not have access to the wiki”

javh · June 7, 2016, 3:48pm

Doh! How profoundly unhelpful of me… Access should be public now. Sorry about that.

bussec · June 7, 2016, 9:29pm

Brief comment on this: In general, nonsense-mediated mRNA decay (NMD) should destabilize Ig/TCR transcripts with out-of-frame rearrangements. However, the constant regions of IgK and IgL are each encoded by a single exon, so these loci should not be affected, as the stop codon will often only arise in the last exon (not in the J segment). Looking at our single-cell data (mouse), we typically find 25-30% of the Igk or Igl transcripts to be out-of-frame, while the number for Igh is around 15-20%. I never did the statistics on whether the difference is significant, but my main point here is that NMD seems to be less effective than one would assume.

caschramm · June 8, 2016, 7:18pm

This is interesting to me… in our human bulk mRNA preps, 3-5% of the reads for which V and J can both be assigned have out-of-frame junctions, and another 3-5% have stop codons. This looks pretty consistent between heavy and light chains, though I haven’t checked systematically. I wonder if this is a function of the species or of the prep…
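
For anyone who wants to compare numbers like these across data sets, here is a minimal sketch of how the two categories could be tallied. It is a simplification and not the pipeline used for the figures above: it assumes each read has already been annotated and that `junction_nt` is the nucleotide junction (conserved Cys through Trp/Phe), and it only checks the junction itself, whereas a full annotation tool would also check the reading frame and stop codons across the whole V(D)J sequence. The function names are made up for illustration.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
CATEGORIES = ("productive", "out_of_frame", "stop_codon")


def classify_junction(junction_nt: str) -> str:
    """Label a junction as productive, out_of_frame, or stop_codon.

    Assumption: the junction starts in frame at the conserved Cys codon,
    so a length that is not a multiple of 3 means the J is out of frame.
    """
    seq = junction_nt.upper()
    if len(seq) % 3 != 0:
        return "out_of_frame"
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    if any(codon in STOP_CODONS for codon in codons):
        return "stop_codon"
    return "productive"


def summarize(junctions):
    """Return the fraction of junctions falling in each category."""
    counts = Counter(classify_junction(j) for j in junctions)
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {category: counts[category] / total for category in CATEGORIES}
```

Running `summarize` separately on heavy- and light-chain reads would give the per-chain breakdown discussed here.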

caschramm · June 8, 2016, 7:24pm

I wonder if this may be useful to break down by species… maybe the list can be reposted in a wiki somewhere so that we can help you annotate and update it?

Also, does anyone have experience with accessing data from dbGaP? We don’t otherwise do anything that requires IRB approval, so I don’t even know where to start…

javh · June 9, 2016, 1:28pm

That’s a good idea, @caschramm. It was sort of weird to attach it to the immcantation repo anyway, as that has our new lab member guide (also known as “6 pages of me being snarky”).

I’ve set up a publicly writable version here for now:
https://bitbucket.org/javh/airrseq/wiki/PublicAIRR-seq

Ideally, it should really be a database, so we can sort/search on useful information and cross-reference to publications that have reused the data. I think at a minimum:

Species
Receptor & chain/class
Template (RNA/DNA)
Primer positions (5’RACE, FWR1, CH1, etc)
UMI use and length
Sequencing platform and read length

Not sure what the best/easiest platform for this would be.

caschramm · June 9, 2016, 2:53pm

A little bit of googling leads me to suggest FosWiki as a platform that looks like it would meet our needs…

@ematsen - does this look like something you might be able to implement as a subsection of B-T.cr or should it be hosted elsewhere?

ematsen · June 10, 2016, 1:42am

First, this is all totally great, and huge thanks to @javh for pulling this together.

Hm, do we really need to go all the way to a full wiki? What about a CSV hosted on GitHub that, upon push, builds an HTML page with a sortable table?

CSVs are more machine-readable than wikis, too.
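
As a rough sketch of what the build step could look like (an illustration only, not an agreed design): the script below turns CSV text into an HTML table with a small inline script that sorts rows when a column header is clicked. The column names are assumptions based on the fields @javh proposed; the two example rows use only details listed earlier in this thread (species for the dengue study is taken from its title), with unknown fields left blank.

```python
import csv
import html
import io

# Hypothetical column layout; not an agreed schema.
EXAMPLE_CSV = """study,accession,species,receptor,template,primers,umi,platform
Parameswaran 2013,PRJNA205206,human,BCR,gDNA,,,Roche 454
Kaplinsky 2014,PRJNA248676,,BCR,gDNA,,,Illumina MiSeq 2x150
"""

# Click a header cell to sort the table body by that column (text order).
SORT_SCRIPT = """<script>
document.querySelectorAll('th').forEach((th, col) =>
  th.addEventListener('click', () => {
    const tbody = th.closest('table').tBodies[0];
    Array.from(tbody.rows)
      .sort((a, b) => a.cells[col].textContent
        .localeCompare(b.cells[col].textContent))
      .forEach(row => tbody.appendChild(row));
  }));
</script>"""


def csv_to_html(csv_text: str) -> str:
    """Render CSV text (first row = header) as an HTML table plus sort script."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV is empty")
    header, body = rows[0], rows[1:]
    head_html = "".join(f"<th>{html.escape(h)}</th>" for h in header)
    body_html = "\n".join(
        "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
        for row in body
    )
    return (
        f"<table>\n<thead><tr>{head_html}</tr></thead>\n"
        f"<tbody>\n{body_html}\n</tbody>\n</table>\n{SORT_SCRIPT}"
    )


if __name__ == "__main__":
    print(csv_to_html(EXAMPLE_CSV))
```

A hosted version would presumably run something like this from a CI job on each push and publish the output as a static page; that part is left out here.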

javh · June 13, 2016, 8:07pm

Those are good suggestions @caschramm and @ematsen. I’m inclined to think the csv plus sortable table is the better approach for now. It’s not a huge list of papers (yet).

caschramm · June 15, 2016, 3:00pm

I have no problem with CSVs, either…

bussec · June 15, 2016, 7:28pm

I checked again on our human data sets and for most of them the numbers were in the same range (15-25%). It is probably important to note that the amplification process during single-cell PCR typically runs into saturation and is not quantitative due to potential primer bias (sorry, no UMIs at this point). Thus scPCR will likely not capture the quantitative differences between productive and non-productive transcripts.
There are, however, two data sets that show substantially lower numbers (5-10%), but they also differ in the sampled cell populations. Going back to all data sets, the general trend seems to be that populations with high transcriptional activity (e.g. plasma cells) have lower percentages of non-productive transcripts.

caschramm · June 16, 2016, 6:49pm

Fascinating, thanks!

mgvignali · January 22, 2017, 3:24pm

Are you aware of this resource: A public database of memory and naive B-cell receptor sequences
PLOS ONE 08/11/2016 11(8):e0160853. The underlying data can be found in Adaptive’s immuneACCESS database.