Looking for BCR datasets containing nonproductive recombinations

Continuing the discussion from Evaluation datasets etc:

Thanks to @javh for the list of publicly available HTS data. Most of these were sequenced from mRNA, so they are unlikely to capture the nonproductive repertoire.
However, some used gDNA, allowing the nonproductive repertoire to be captured as well:

  1. Boyd, S. D. et al. Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing. Sci. Transl. Med. 1, 12ra23 (2009). SRA Accession: SRP001460
  2. Jiang, N. et al. (2013) Lineage structure of the human antibody repertoire in response to influenza vaccination. Sci. Transl. Med., 5, 171ra19. SRA Accession: SRA058972
  3. Ohm-Laursen and Barington (2007) Analysis of 6912 unselected somatic hypermutations in human VDJ rearrangements reveals lack of strand specificity and correlation between phase II substitution rates and distance to the nearest 3’ activation-induced cytidine deaminase target. J Immunol. 178(7):4322-34. EMBL AM076988–AM083316

However, the first two use VH-internal primers (FR1 or FR2) and the last only looks at a single V gene (VH3-23). There’s also a really nice paper looking directly at the nonproductive allele in mice:

Is anyone aware of other data sets?

In addition to what @caschramm has listed, these are all the public BCR data sets sequenced from gDNA that I’m aware of:

  1. Jackson, K.J.L. et al. (2014) Human Responses to Influenza Vaccination Show Seroconversion Signatures and Convergent Antibody Rearrangements. Cell Host Microbe, 105–114.
  • dbGaP Accession: phs000760.v1.p1
  • Roche 454, gDNA
  2. Kaplinsky, J. et al. Antibody repertoire deep sequencing reveals antigen-independent selection in maturing B cells. Proc. Natl. Acad. Sci. 111, E2622–E2629 (2014).
  • BioProject Accession: PRJNA248676
  • Illumina MiSeq 2x150, gDNA
  3. Michaeli, M. et al. Immunoglobulin gene repertoire diversification and selection in the stomach - from gastritis to gastric lymphomas. Front. Immunol. 5, 1–14 (2014).
  • BioProject Accession: PRJNA206548
  • Roche 454, gDNA
  4. Parameswaran, P. et al. (2013) Convergent Antibody Signatures in Human Dengue. Cell Host Microbe, 13, 691–700.
  • BioProject Accession: PRJNA205206
  • Roche 454, gDNA
  5. Roskin, K. M. et al. IgH sequences in common variable immune deficiency reveal altered B cell development and selection. Sci. Transl. Med. 7, 302ra135–302ra135 (2015).
  • dbGaP Accession: phs000934.v1.p1
  • Illumina MiSeq and Roche 454, mRNA and gDNA
  6. Wang, C. et al. (2014) Effects of aging, cytomegalovirus infection, and EBV infection on human B cell repertoires. J. Immunol., 192, 603–11.
  • dbGaP Accession: phs000666.v1.p1
  • Roche 454, gDNA/mRNA
  7. Wang, C. et al. (2014) B-cell repertoire responses to varicella-zoster vaccination in human identical twins. Proc. Natl. Acad. Sci. U. S. A.
  • dbGaP Accession: phs000817.v1.p1
  • Roche 454, gDNA/mRNA

I’ve been periodically updating my list. For now I’ve put it here:

Apologies for the plain-text-ness of it. I may move it into a more searchable format at a later date.

Edit: Updated the URL to the new location.

Thanks so much, Jason!

“You do not have access to the wiki”

Doh! How profoundly unhelpful of me… Access should be public now. Sorry about that.

A brief comment on this: in general, nonsense-mediated mRNA decay (NMD) should destabilize Ig/TCR transcripts with out-of-frame rearrangements. However, the constant regions of IgK and IgL are each encoded by a single exon, so these loci should not be affected, as the stop codon will often arise only in the last exon (not in the J segment). In our single-cell data (mouse), we typically find 25-30% of the Igk or Igl transcripts to be out-of-frame, while the number for Igh is around 15-20%. I never tested whether the difference is statistically significant, but my main point is that NMD seems to be less effective than one would assume.

This is interesting to me… in our human bulk mRNA preps, 3-5% of the reads for which V and J can both be assigned have out-of-frame junctions, and another 3-5% have stop codons. This looks pretty consistent between heavy and light chains, though I haven’t checked systematically. I wonder if this is a function of the species or of the prep…
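
For anyone who wants to compare these percentages across preps, here is a minimal sketch of how the two categories mentioned above (out-of-frame junctions and in-frame stop codons) could be tallied. It assumes each input is a junction nucleotide sequence that should be read in frame from its first codon. The names `classify_junction` and `summarize` are hypothetical and not taken from any existing tool, and real annotation pipelines typically apply further checks.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
CATEGORIES = ("productive", "out_of_frame", "stop_codon")


def classify_junction(junction_nt: str) -> str:
    """Classify one junction sequence (assumed to start in frame)."""
    seq = junction_nt.upper()
    # A length that is not a multiple of three shifts the downstream frame.
    if len(seq) % 3 != 0:
        return "out_of_frame"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if any(codon in STOP_CODONS for codon in codons):
        return "stop_codon"
    return "productive"


def summarize(junctions):
    """Return the fraction of junctions falling into each category."""
    counts = Counter(classify_junction(j) for j in junctions)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in CATEGORIES}
    return {name: counts[name] / total for name in CATEGORIES}


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    toy = ["TGTGCGAGAGATTACTGG", "TGTGCGAGAGATTACTG", "TGTGCGTGAGATTACTGG"]
    print(summarize(toy))
```

Running the same tally on bulk and single-cell data would at least make sure the two sets of numbers use the same definitions.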

I wonder if this may be useful to break down by species… maybe the list can be reposted in a wiki somewhere so that we can help you annotate and update it?

Also, does anyone have experience with accessing data from dbGaP? We don’t otherwise do anything that requires IRB approval, so I don’t even know where to start…

That’s a good idea, @caschramm. It was sort of weird to attach it to the immcantation repo anyway, as that has our new lab member guide (also known as “6 pages of me being snarky”).

I’ve set up a publicly writable version here for now:

Ideally, it should really be a database, so we can sort/search on useful information and cross-reference to publications that have reused the data. I think at a minimum:

  1. Species
  2. Receptor & chain/class
  3. Template (RNA/DNA)
  4. Primer positions (5’RACE, FWR1, CH1, etc)
  5. UMI use and length
  6. Sequencing platform and read length

Not sure what the best/easiest platform for this would be.
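
Whatever the platform ends up being, the six fields above could be sketched as a simple record type. This is only an illustration: `DatasetRecord` and `with_template` are hypothetical names, the field names are assumptions, and the example entries use only details given earlier in this thread (anything not stated there is left as `None`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    citation: str
    accession: str
    template: str                     # e.g. "gDNA", "mRNA and gDNA"
    platform: str                     # sequencing platform and read length
    species: Optional[str] = None
    receptor: Optional[str] = None    # receptor & chain/class
    primers: Optional[str] = None     # 5' RACE, FWR1, CH1, etc.
    umi_length: Optional[int] = None  # None if no UMI or unknown


def with_template(records: List[DatasetRecord], template: str) -> List[DatasetRecord]:
    """Return the records whose template field mentions `template`."""
    needle = template.lower()
    return [r for r in records if needle in r.template.lower()]


# Entries copied from the list in this thread; unknown fields stay None.
EXAMPLES = [
    DatasetRecord("Kaplinsky et al. 2014", "PRJNA248676", "gDNA",
                  "Illumina MiSeq 2x150"),
    DatasetRecord("Parameswaran et al. 2013", "PRJNA205206", "gDNA",
                  "Roche 454", species="human"),
    DatasetRecord("Roskin et al. 2015", "phs000934.v1.p1", "mRNA and gDNA",
                  "Illumina MiSeq and Roche 454", receptor="IgH"),
]

if __name__ == "__main__":
    for record in with_template(EXAMPLES, "mRNA"):
        print(record.citation, record.accession)
```

Keeping unknown values as `None` rather than guessing makes it obvious which entries still need annotation.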

A little bit of googling leads me to suggest FosWiki as a platform that looks like it would meet our needs…

@ematsen - does this look like something you might be able to implement as a subsection of B-T.cr or should it be hosted elsewhere?

First, this is all totally great, and huge thanks to @javh for pulling this together.

Hm, do we really need to go all the way to a full wiki? What about a CSV hosted on GitHub that, upon push, builds an HTML page with a sortable table?

CSVs are more machine-readable than wikis, too.
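
A rough sketch of the CSV-to-table step, using only the Python standard library. The column names, the `csv_to_html` function, and the inline click-to-sort script are all assumptions for illustration; the part that would trigger the build on push is not shown.

```python
import csv
import html
import io

# Tiny inline script: clicking a header sorts the body rows by that column.
SORT_SCRIPT = """<script>
document.querySelectorAll('th').forEach((th, i) => th.onclick = () => {
  const body = th.closest('table').tBodies[0];
  [...body.rows]
    .sort((a, b) => a.cells[i].textContent.localeCompare(b.cells[i].textContent))
    .forEach(row => body.appendChild(row));
});
</script>"""


def csv_to_html(csv_text: str) -> str:
    """Render CSV text (first row = header) as an HTML table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV input is empty")
    header, body = rows[0], rows[1:]
    # Escape every cell so stray '<' or '&' cannot break the page.
    head_html = "".join(f"<th>{html.escape(cell)}</th>" for cell in header)
    body_html = "\n".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in body
    )
    return (
        "<table>\n<thead><tr>" + head_html + "</tr></thead>\n"
        "<tbody>\n" + body_html + "\n</tbody>\n</table>\n" + SORT_SCRIPT
    )


# Two entries from the list in this thread, in a hypothetical column layout.
EXAMPLE_CSV = """citation,accession,platform,template
Kaplinsky et al. 2014,PRJNA248676,Illumina MiSeq 2x150,gDNA
Parameswaran et al. 2013,PRJNA205206,Roche 454,gDNA
"""

if __name__ == "__main__":
    print(csv_to_html(EXAMPLE_CSV))
```

The CSV stays the single source of truth, so scripts can read it directly while people browse the generated page.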

Those are good suggestions, @caschramm and @ematsen. I’m inclined to think the CSV plus sortable table is the better approach for now. It’s not a huge list of papers (yet).

I have no problem with CSVs, either…

I checked again on our human data sets, and for most of them the numbers were in the same range (15-25%). It is probably important to note that the amplification process during single-cell PCR typically runs into saturation and is not quantitative due to potential primer bias (sorry, no UMIs at this point). Thus scPCR will likely not capture the quantitative differences between productive and non-productive transcripts.
There are, however, two data sets that show substantially lower numbers (5-10%), but they also differ in the sampled cell populations. Going back to all data sets, the general trend seems to be that populations with high transcriptional activity (e.g. plasma cells) have lower percentages of non-productive transcripts.

Fascinating, thanks!

Are you aware of this resource: A public database of memory and naive B-cell receptor sequences,
PLOS ONE 08/11/2016 11(8):e0160853? The underlying data can be found in Adaptive’s immuneACCESS database.