Summary of the first month of the ‘Evaluation of existing reported human germline genes’ subgroup

martin_corcoran · September 28, 2016, 3:44pm

Six individuals joined the subgroup: Andrew Collins, Mats Ohlin, Duncan Ralph, Katharina Imkeller and Corey Watson, and we included Steve Kleinstein in the discussion.

The initial suggestion was that we limit discussion to the human VH repertoire at first, to set up some ground rules that could later apply to light chains and to J and D segments.
Most people in the group made a number of useful contributions to the discussion.
Duncan pointed out that limiting evaluation to functional or expressed sequences may miss interesting sequences currently designated as ORFs.
Both Mats and Duncan agreed with a suggestion that we should assign some kind of evidence level for each sequence in the current database.

Mats suggested that we should use the same criteria for validating current alleles with inference tools as we are going to use for inferring novel alleles. Corey suggested that, because there is potential overlap between this group and the inferring novel alleles group, the efforts should be coordinated.

Several members of the group pointed out a difficulty in defining the terminal nucleotides of V genes, and there was some discussion on how to address this issue: either through genomic means (for example, using a database of genomic sequences encompassing V alleles from numerous individuals) or by computational means (for example, allowing for some ambiguity at the final few nucleotides).
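
To make the computational option concrete, here is a minimal Python sketch of a comparison that sets aside the final few nucleotides. It is only an illustration: the function name, the default tolerance of 3 nucleotides and the toy sequences are assumptions, not something the subgroup agreed on, and it assumes both sequences start at the same 5' position.

```python
def match_ignoring_3prime(seq_a: str, seq_b: str, tolerance: int = 3) -> bool:
    """Return True if two V sequences agree once the 3' end is set aside.

    The last `tolerance` positions of the longer sequence are not compared,
    so differences (or a short truncation) confined to that region are ignored.
    `tolerance` is a hypothetical parameter chosen for illustration.
    """
    core_len = max(len(seq_a), len(seq_b)) - tolerance
    if core_len <= 0:
        raise ValueError("tolerance must be shorter than the sequences")
    if min(len(seq_a), len(seq_b)) < core_len:
        return False  # one sequence is truncated beyond the tolerated region
    return seq_a[:core_len].upper() == seq_b[:core_len].upper()


# Toy sequences, not real germline genes.
print(match_ignoring_3prime("ACGTACGTAG", "ACGTACGTAC"))  # differ only at the last base
print(match_ignoring_3prime("ACGAACGTAG", "ACGTACGTAG"))  # differ inside the core
```

Whether the tolerated region should be a fixed number of nucleotides, or vary by gene, is exactly the kind of question the group would still need to settle.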

Andrew proposed a set of five rules for treating the current reported human germline sequences.

1. Sequences must be reported in a peer-reviewed journal.
2. Sequences must not include ambiguities.
3. A single cDNA-derived sequence cannot be the sole evidence for a germline sequence.
4. The database must include full-length sequences.
5. We exclude all sequences generated by six studies that have high sequencing error rates:

Andris JS 1993
Campbell MJ 1992
Adderson EE 1993
Olee T 1992
Van Es JH 1992
Weng N 1992

This action removes 102 sequences from the current database.
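
As a rough illustration of how the five proposed rules could be checked per database entry, here is a minimal Python sketch. The record layout (`GermlineRecord` and all of its field names) is hypothetical, and treating any character other than A, C, G or T as an ambiguity is an assumption made for the example (alignment gaps would need to be stripped first).

```python
from dataclasses import dataclass
from typing import List

# The six studies named under the fifth rule.
EXCLUDED_STUDIES = {
    "Andris JS 1993", "Campbell MJ 1992", "Adderson EE 1993",
    "Olee T 1992", "Van Es JH 1992", "Weng N 1992",
}


@dataclass
class GermlineRecord:
    """Hypothetical record layout; every field name is an assumption."""
    name: str
    sequence: str
    source_study: str
    peer_reviewed: bool
    full_length: bool
    cdna_reports: int      # number of cDNA-derived reports of this sequence
    genomic_reports: int   # number of genomic reports of this sequence


def failed_rules(rec: GermlineRecord) -> List[int]:
    """Return the numbers (1-5) of the proposed rules that a record fails."""
    failed = []
    if not rec.peer_reviewed:
        failed.append(1)
    if set(rec.sequence.upper()) - set("ACGT"):
        failed.append(2)  # contains something other than A, C, G, T
    if rec.genomic_reports == 0 and rec.cdna_reports <= 1:
        failed.append(3)  # at most a single cDNA sequence as sole evidence
    if not rec.full_length:
        failed.append(4)
    if rec.source_study in EXCLUDED_STUDIES:
        failed.append(5)
    return failed


# Made-up entry for illustration only.
example = GermlineRecord("example*01", "ACGTNACGT", "Campbell MJ 1992",
                         peer_reviewed=True, full_length=True,
                         cdna_reports=1, genomic_reports=0)
print(failed_rules(example))
```

Returning the list of failed rules, rather than a single pass/fail flag, keeps the reason for any flag visible.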

There was general agreement with the first four of Andrew's suggested rules.

As for the final rule, I was unsure whether it might appear a little arbitrary to remove just those sequences and not others, and wondered whether assigning levels of confidence to all sequences might be a compromise.

The suggestion was to have, for example, a traffic light based confidence level scheme.
Low confidence sequences (such as those from the above six studies) are red. One additional level of evidence, either genomic identity or expression in multiple copies with independent rearrangements, moves a sequence to the amber level. Both expression and genomic evidence, or two independent confirmatory genomic studies, move a sequence to the green level.
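
The scheme can be sketched in a few lines of Python. This is only one reading of the suggestion: the function name and its two inputs are assumptions, and "expression confirmed" stands for expression in multiple copies with independent rearrangements.

```python
def confidence_level(genomic_confirmations: int, expression_confirmed: bool) -> str:
    """Map additional evidence for a reported sequence to a traffic-light level.

    genomic_confirmations: number of independent confirmatory genomic studies.
    expression_confirmed: True if the sequence was seen expressed in multiple
        copies with independent rearrangements.
    Both parameter names are hypothetical, chosen for this illustration.
    """
    if genomic_confirmations < 0:
        raise ValueError("genomic_confirmations cannot be negative")
    if genomic_confirmations >= 2 or (genomic_confirmations >= 1 and expression_confirmed):
        return "green"
    if genomic_confirmations >= 1 or expression_confirmed:
        return "amber"
    return "red"


print(confidence_level(0, False))  # no additional evidence
print(confidence_level(1, False))  # one genomic confirmation
print(confidence_level(1, True))   # genomic plus expression evidence
```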

Overall I think we covered a lot of ground this first month and can feel hopeful that over the next few weeks and months we can make good progress to define the high confidence set of human germline sequences from the currently reported germline sequences.

a.collins · September 30, 2016, 8:19am

Thanks for the excellent summary, Martin. So that people understand better why I suggested excluding the six named studies, let me explain a little about them. They reported many sequences, and these either explicitly came from a single individual, or appear to have come from a single individual. The studies were important at the time they were published, but they were exploring the nature of the repertoire at a time when almost nothing was known. I doubt any of the authors expected all of their sequences to be incorporated into a ‘semi-official’ repertoire of germline genes. In fact, a number of the publications include explicit statements that the sequences are likely to include sequencing errors. That this is true is clear when you consider the sequences they reported. At the time, the authors did not think of them as genes and allelic variants, but rather just as sequences that they had generated, some of which were likely to correctly report germline sequences. Using the IMGT names, they were:

Andris et al: IGHV2-5*04, *05, *06, *07, *08 and *09. IGHV2-70*01, *09, *10, *11, *12.

Campbell et al: IGHV2-70*02, *03, *06, *07, *08. IGHV4-4*03, *04, *05. IGHV4-28*05. IGHV4-30-4*03, *04. IGHV4-31*03, *06, *07, *08, *09. IGHV4-34*03, *06, *07. IGHV6-1*02.

Adderson et al: IGHV3-15*01, *03, *04, *05, *06, *07, *08. IGHV3-49*02.

Olee et al: IGHV3-30*01, *04, *05, *06, *07, *09, *10, *11, *12, *13, *14, *15, *16, *17, *18, *19. IGHV3-30-3*02. IGHV3-33*01, *03, *04, *05. IGHV3-64*03, *04, *05.

Weng et al: IGHV4-28*03, *04. IGHV4-30-2*03. IGHV4-30-4*02. IGHV4-31*03, *10. IGHV4-34*04, *05, *09, *10. IGHV4-39*06. IGHV4-59*07, *10. IGHV4-61*03, *05.

Van Es et al: IGHV4-30-2*02. IGHV4-31*04, *05. IGHV4-34*08. IGHV4-39*05. IGHV4-59*03, *04, *05, *06.

I should also clarify that I am very comfortable with sequences like these moving to a ‘red’ category, rather than being completely discarded. In fact I think it is very important that all sequences remain accessible in any new database, even if they are clearly not real germline sequences. If we do not retain them, it will be impossible for people in the future to make sense of historical reports.

a.collins · October 4, 2016, 11:34am

Martin, for one, has expressed concern that removal of sequences arising from the six studies listed above is arbitrary. PLEASE GIVE US SOME FEEDBACK! Do you agree?

If the quality of studies can be our focus of attention for the moment…

How do we compare the reliability of a sequence reported in a study 30 years ago with the sequences reported by Corey Watson in recent years?

I began to ponder the complexities of this question when I began to think of the next stage in the evaluation of the IMGT repertoire. If the 5 rules mentioned above lead to sequences being flagged as ‘red’ in a ‘traffic light’ system, we need to think of what makes a sequence green rather than orange.

I have been playing with the idea that a sequence needs to be confirmed by two independent reports for it to be green. This seems sensible, but it would mean that most sequences that have been reported in recent years would be ‘orange’, as they have not been independently confirmed. Is this appropriate?

If a single study confirmed reported sequences by the amplification and sequencing of identical sequences in completely independent runs, would this carry the same weight as reports from independent labs?

If two independent reports have been made of a sequence, but one sequence was truncated 3’ by 10 nucleotides, would the sequence be green?

If two independent reports have been made of a sequence, but one sequence was truncated 3’ by 20 nucleotides, would the sequence be green?

If a single report has been published of a sequence, can it be confirmed by the existence of either one or perhaps multiple perfect matches to the sequence from historical reports of VDJ sequences (Sanger sequencing)?

If a single report has been published of a sequence, can it be confirmed by inference from Rep-Seq data?

Please give your feedback on this. You do not have to be an ‘expert’. It is important for us to get a sense of how people respond to these suggestions, even if they do not consider themselves to be experts, as ultimately we hope to offer a dataset based upon one or other set of criteria, and this dataset will be judged by both ‘experts’ and ‘non-experts’. Hopefully we can find a system that seems acceptable to everyone, but this will require input from the broad community of researchers.

martin_corcoran · October 4, 2016, 1:34pm

Hopefully we will get some additional feedback here, but some quick thoughts on this.

Overall, my feeling is that we should approach evaluation of the set of sequences as something that we can constantly revise as we accumulate more data. We will soon be able to get large numbers of individualized germline databases based on expression data. This will provide evidence datapoints that we can use to strengthen our confidence in particular alleles. The idea that we may have limited data for a particular sequence now - even if we strongly believe that sequence is a real germline - should not make us hesitant about placing that sequence in the red or orange boxes based on current data. If it is a real germline we will inevitably accumulate the necessary independent data to shift the sequence towards the green box.

"How do we compare the reliability of a sequence reported in a study 30 years ago with the sequences reported by Corey Watson in recent years?"

Again, we may have trouble doing this in a way that does not appear arbitrary or subjective. Corey’s data is bound to be more reliable, but I wonder whether we would lose anything at the current stage by treating both in a similar way - as independent data points.
A sequence that is present in one of the old error prone studies is unlikely to appear in a high quality genomic sequencing study - unless that particular allelic sequence is real and reasonably common in the population.
And a high quality sequencing study will produce more real alleles - hence more alleles with a high chance of being independently verified by additional genomic sequencing studies or inferred from expressed sequences.
What this means in practice is that, on their own, neither Corey’s study nor the error prone studies would be enough to place a sequence into the green box. In fact each study may only be enough (on its own) to get a sequence into the red box.
Alleles present in the good quality study, however, will have a much better chance of being re-identified in independent studies and hence will relatively rapidly move towards the green box.
Erroneous sequences present in the error prone study will not be independently identified and will remain in the red box.

"I have been playing with the idea that a sequence needs to be confirmed by two independent reports for it to be green. This seems sensible, but it would mean that most sequences that have been reported in recent years would be ‘orange’, as they have not been independently confirmed. Is this appropriate?"

I think this is appropriate for the reasons mentioned previously. If the sequences are real germlines then they should (unless they are very rare population specific alleles) be identifiable in future in other individuals.

"If a single study confirmed reported sequences by the amplification and sequencing of identical sequences in completely independent runs, would this carry the same weight as reports from independent labs?"

Yes, with the caveat that the lab in question had taken caution to avoid false positives in both runs.
For example, in the case of expressed sequences, the largest source of false positives can be low-level sequencing errors in highly expressed alleles. So if, for example, allele IGHV1-69*01 is expressed at a high level in one individual, and we also detect a closely related allele that differs by a single nucleotide, such as IGHV1-69*06, but only at a very low level (say, 0.5% of the level of IGHV1-69*01), we need to be careful that the IGHV1-69*06 sequences are not false positives.
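
A check of this kind could look something like the following Python sketch. It flags any candidate allele that sits one nucleotide away from a much more abundant allele in the same individual. The function names, the `max_ratio` cut-off of 1% and the toy sequences are all assumptions for illustration; a real pipeline would need to choose its own threshold and handle sequences of unequal length.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> Optional[int]:
    """Number of mismatches between equal-length sequences, else None."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def suspect_alleles(seqs: Dict[str, str], counts: Dict[str, int],
                    max_ratio: float = 0.01) -> Dict[str, str]:
    """Map each suspicious allele to the abundant neighbour it may derive from.

    An allele is flagged when another allele differs from it by exactly one
    nucleotide and is so much more abundant that the count ratio falls below
    `max_ratio` (a hypothetical cut-off, not a value proposed in this thread).
    """
    flagged = {}
    for name, seq in seqs.items():
        for other, other_seq in seqs.items():
            if other == name or hamming(seq, other_seq) != 1:
                continue
            if counts[name] < max_ratio * counts[other]:
                flagged[name] = other
                break
    return flagged


# Toy data: "minor" is one base away from "major" at 0.5% of its level.
seqs = {"major": "ACGTACGT", "minor": "ACGTACGA", "other": "TTGTACGT"}
counts = {"major": 10000, "minor": 50, "other": 4000}
print(suspect_alleles(seqs, counts))
```

A flagged allele is not necessarily false; the flag only marks it as needing the extra caution described above.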

"If two independent reports have been made of a sequence, but one sequence was truncated 3’ by 10 nucleotides, would the sequence be green?"

"If two independent reports have been made of a sequence, but one sequence was truncated 3’ by 20 nucleotides, would the sequence be green?"

In my opinion truncated sequences do not comply with rule 4, so they cannot be used as evidence.

"If a single report has been published of a sequence, can it be confirmed by the existence of either one or perhaps multiple perfect matches to the sequence from historical reports of VDJ sequences (Sanger sequencing)?"

No, because we do not know whether those historical sequences were false positives (low-level sequencing/PCR errors of highly expressed alleles).
(In theory I think this is less likely and many instances of this type of data can provide some evidence for a particular germline sequence but I suggest we should try to be strict with our criteria since it is likely that real germlines will be found in additional independent studies in the future.)

"If a single report has been published of a sequence, can it be confirmed by inference from Rep-Seq data?"

I don’t see why not - so long as the Rep-Seq approach takes care, as previously described, to avoid false positives.
For example, one genomic sequence would place a sequence in the red box.
A single independent Rep-Seq germline identification (for example, within a library of 1 million sequences, 1000 exact sequences found with 500 unique CDR3s, using 7 J alleles, and no closely related co-expressed alleles that differ by 1 nucleotide) would provide an additional evidence point, which moves the sequence from the red box into the orange box.
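
To show how such a Rep-Seq evidence point might be tested, here is a minimal Python sketch. The function and parameter names are hypothetical, and the thresholds are deliberately left for the caller to supply: the numbers in the example above illustrate a convincing case and are not proposed cut-offs.

```python
from typing import Iterable, Tuple


def repseq_supports_allele(
    exact_matches: Iterable[Tuple[str, str]],
    *,
    min_exact: int,
    min_unique_cdr3: int,
    min_j_alleles: int,
    has_close_coexpressed_allele: bool,
) -> bool:
    """Decide whether Rep-Seq data counts as one evidence point for an allele.

    exact_matches: one (cdr3, j_allele) pair per read whose V region matches
        the candidate germline sequence exactly.
    has_close_coexpressed_allele: True if the same library also expresses an
        allele differing from the candidate by a single nucleotide.
    All thresholds are assumptions that the caller must choose.
    """
    matches = list(exact_matches)
    unique_cdr3 = {cdr3 for cdr3, _ in matches}
    j_alleles = {j for _, j in matches}
    return (
        len(matches) >= min_exact
        and len(unique_cdr3) >= min_unique_cdr3
        and len(j_alleles) >= min_j_alleles
        and not has_close_coexpressed_allele
    )


# Toy reads (made-up CDR3s and J labels), with small thresholds for the demo.
reads = [("CARDY", "J4"), ("CARGG", "J6"), ("CARDY", "J4"), ("CAKWW", "J3")]
print(repseq_supports_allele(reads, min_exact=4, min_unique_cdr3=3,
                             min_j_alleles=3, has_close_coexpressed_allele=False))
```

Counting unique CDR3s and J alleles, rather than reads alone, is what guards against a single expanded clone being mistaken for independent rearrangements.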