Sponsored by the AIRR Community

D gene germlines

I wonder if anyone has a good feeling for how confident we are in the available sets of germline D genes?

We talk a lot about our (lack of) confidence in the Vs, and the Js don’t mutate much and seem to have pretty low diversity, so they’re probably ok.

But the Ds seem like a problem. We’re frequently telling our experimental colleagues “this is the naive sequence for this clonal family, go and synthesize this to see how good it is”. Within the V, there’s at most a few bases of uncertainty. But in the CDR3, our dependence on the completeness and correctness of the D germline set is a bigger problem, largely because of the lack of homology among D genes.

As I’m sure you’re aware, sometimes there’s a D that aligns quite confidently, and sometimes there isn’t. In the latter case, it’s usually just a highly mutated CDR3, but what about when it isn’t, and there’s a D we don’t know about? There’s a growing body of germline inference algorithms for V, but it’s hard to imagine these extending even in principle to D, so my impression is we’re stuck with germline sequencing. Does anyone have a good feeling for the state of this? Or any plans to do more?

thanks

Why couldn’t we just restrict analysis to sequences with 0-2 mutations in VH, or even just sort naive B cells? Is the length of the D gene a barrier?
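To make the filtering idea concrete, here is a minimal Python sketch. It assumes each record already carries the V portion of the read and the matching germline V, aligned without gaps and of equal length. The field names `v_seq` and `v_germline` and both function names are hypothetical, not taken from any existing tool.

```python
def count_mismatches(query_v, germline_v):
    """Count mismatched positions between two pre-aligned, equal-length V segments.

    Positions where either sequence has an ambiguous base (N) are ignored.
    """
    if len(query_v) != len(germline_v):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = zip(query_v.upper(), germline_v.upper())
    return sum(1 for q, g in pairs if q != g and "N" not in (q, g))


def low_mutation_subset(records, max_mutations=2):
    """Keep only records whose V segment has at most max_mutations mismatches."""
    return [
        r for r in records
        if count_mismatches(r["v_seq"], r["v_germline"]) <= max_mutations
    ]
```

The threshold of 2 simply mirrors the "0-2 mutations" suggested above; sorting naive B cells would achieve a similar result experimentally.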

OK, good point, I take that part back. Which helps me to think of a better way of saying what I’m worried about: because the V has significant structural constraints (and, yes, also because it’s longer), I have a lot of confidence that previously unknown V genes will either be separated from known alleles by a handful of SNPs (and thus picked up by the inference algorithms), or by easily-identifiable indels.

With the D, in contrast, it seems much more plausible to me that previously unknown alleles would have little or no homology to existing D genes, which would mean that, no matter how little mutation there is, they would be difficult to distinguish from N-insertions. At least, that is the case with the existing allele inference approaches, which focus on being intelligent about detecting differences in point mutations. If we changed approach to being smart about differences in N-insertion and deletion frequencies, it might be feasible.

Default IMGT D genes:

AGAATATTGTAATAGTACTACTTTCTATGCC
AGCATATTGTGGTGGTGACTGCTATTCC
AGCATATTGTGGTGGTGATTGCTATTCC
AGGATATTGTACTAATGGTGTATGCTATACC
AGGATATTGTACTGGTGGTGTATGCTATACC
AGGATATTGTAGTAGTACCAGCTGCTATACC
AGGATATTGTAGTAGTACCAGCTGCTATGCC
AGGATATTGTAGTGGTGGTAGCTGCTACTCC
CTAACTGGGGA
GAGTATAGCAGCTCGTCC
GGGTATAGCAGCAGCTGGTAC
GGGTATAGCAGCGGCTAC
GGGTATAGCAGTGGCTGGTAC
GGTACAACTGGAACGAC
GGTATAACCGGAACCAC
GGTATAACTGGAACAAC
GGTATAACTGGAACGAC
GGTATAACTGGAACTAC
GGTATAGTGGGAGCTACTAC
GTAGAGATGGCTACAATTAC
GTATTACGATATTTTGACTGGTTATTATAAC
GTATTACGATTTTTGGAGTGGTTATTATACC
GTATTACTATGATAGTAGTGGTTATTACTAC
GTATTACTATGGTTCGGGGAGTTATTATAAC
GTATTACTATGTTCGGGGAGTTATTATAAC
GTATTAGCATTTTTGGAGTGGTTATTATACC
GTATTATGATTACGTTTGGGGGAGTTATCGTTATACC
GTATTATGATTACGTTTGGGGGAGTTATGCTTATACC
GTATTATGATTTTTGGACTGGTTATTATACC
GTGGATACAGCTATGGTTAC
GTGGATATAGTGGCTACGATTAC
GTGGATATAGTGTCTACGATTAC
TGACTACAGTAACTAC
TGACTACGGTGACTAC
TGACTACGGTGGTAACTCC
TGACTATGGTGCTAACTAC
TGGATATTGTAGTAGTACCAGCTGCTATGCC


Hi Duncan,

We have previously reported a couple of new putative D alleles. An IGHD3-10 allele was inferred from Sanger sequencing data (Lee CE et al 2006. Immunogenetics 57: 917-25), and an IGHD3-16 allele was reported from 454 data in Boyd et al 2010 (J Immunol; 184: 6986-92). These sequences are:

IGHD3-10p03
tactatggttcAgggagttattataac
IGHD3-16p03
gtattatgattacAtttgggggagttatcgttatacc

I think D alleles should be on our radar, and I agree that by restricting analysis to relatively unmutated V genes it should be pretty straightforward to pick up allelic variants. I think it is also true that if an unknown D gene is too different from those we know, it could slip through unseen in our analyses. For many years I was interested in finding explanations for D-less alignments, and I never saw anything that suggested there were other genes, but my datasets were tiny by today’s standards.

One explanation for “D-less” VDJs is that many D segments are too short to detect once they have been trimmed, and a simple analysis of 5’ and 3’ removals from D genes shows that there must be many rearranged D segments that fall below the minimum length required to make a positive call. I agree, though, that I have never been sure that this can explain all the sequences for which we can’t identify a D gene.
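As a rough illustration of that kind of analysis, the sketch below computes the probability that a D segment ends up shorter than a calling threshold after exonuclease trimming. Everything numeric here is an assumption for illustration only: trimming at the two ends is treated as independent, the removal distributions are made-up uniform ones rather than measured ones, and the minimum callable length of 8 nt is hypothetical. The function name is also hypothetical.

```python
from itertools import product


def fraction_too_short(d_length, trim5_probs, trim3_probs, min_len):
    """Probability that the trimmed D remnant is shorter than min_len.

    trim5_probs[k] and trim3_probs[k] give the assumed probability of
    removing k bases from the 5' and 3' end respectively. The two ends
    are assumed to be trimmed independently.
    """
    total = 0.0
    for (k5, p5), (k3, p3) in product(enumerate(trim5_probs),
                                      enumerate(trim3_probs)):
        remaining = max(d_length - k5 - k3, 0)
        if remaining < min_len:
            total += p5 * p3
    return total


# Hypothetical removal distribution: 0 to 9 bases, all equally likely.
uniform_0_to_9 = [0.1] * 10

# Lengths taken from the list above: 11 nt (CTAACTGGGGA),
# 16 nt (TGACTACAGTAACTAC) and 31 nt (e.g. AGAATATTGTAATAGTACTACTTTCTATGCC).
for length in (11, 16, 31):
    p = fraction_too_short(length, uniform_0_to_9, uniform_0_to_9, min_len=8)
    print(f"D length {length}: P(remnant < 8 nt) = {p:.2f}")
```

With real removal distributions estimated from confidently aligned rearrangements, the same calculation would show how much of the D-less fraction trimming alone could account for.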

Since we are talking D genes, the 2006 paper mentioned above (Lee CE et al 2006) suggests that the IMGT classification of D genes may not be correct. We concluded that IGHD4-23 and IGHD5-24 are functional, but IMGT reports them as open reading frames of uncertain functionality. IGHD1-14*01 and IGHD6-25 are defined as functional by IMGT, but we concluded this was not true.

I would be interested to know if we can reach agreement on this today.

Andrew


Great Andrew, thanks for the thoughts, that’s really useful.

I think I’ll see about hacking something together soon to look for obvious D genes that are really different from existing ones, just so that we would hopefully notice if there is a big problem.
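One possible shape for such a hack, as a sketch only: collect the region between the end of the V and the start of the J from low-mutation sequences, then look for k-mers that recur across many sequences but are absent from every known D gene (and its reverse complement). Truly random N-additions should rarely produce the same long k-mer over and over. The function names, `k=8`, and `min_count=3` are all hypothetical choices, and the input is assumed to hold one sequence per clonal family so that clonal expansion does not inflate the counts.

```python
from collections import Counter


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_novel_kmers(insert_regions, known_d_genes, k=8, min_count=3):
    """Recurrent k-mers from V-J insert regions that match no known D gene.

    Each insert region contributes at most once per k-mer, so the count is
    the number of sequences in which the k-mer appears.
    """
    known = set()
    for d in known_d_genes:
        d = d.upper()
        known |= kmers(d, k) | kmers(reverse_complement(d), k)

    counts = Counter()
    for region in insert_regions:
        counts.update(kmers(region.upper(), k) - known)

    return [(kmer, n) for kmer, n in counts.most_common() if n >= min_count]
```

Overlapping k-mers that come back together could then be assembled by hand into a candidate sequence and checked against genomic data.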