Six individuals joined the subgroup: Andrew Collins, Mats Ohlin, Duncan Ralph, Katharina Imkeller and Corey Watson, and we included Steve Kleinstein in the discussion.
The initial suggestion was that we limit discussion to the human VH repertoire at first, to set up some ground rules that could later be applied to light chains and to J and D segments.
Most people in the group made a number of useful contributions to the discussion.
Duncan pointed out that limiting evaluation to functional or expressed sequences may miss interesting sequences currently designated as ORFs.
Both Mats and Duncan agreed with a suggestion that we should assign some kind of evidence level for each sequence in the current database.
Mats suggested that, when validating current alleles with inference tools, we should use the same criteria that we are going to use for inferring novel alleles. Corey suggested that, because there is potential overlap between this group and the inferring novel alleles group, the efforts should be coordinated.
Several members of the group pointed out the difficulty of defining the terminal nucleotides of V genes, and there was some discussion of how to address this issue: either through genomic means (for example, using a database of genomic sequences encompassing V alleles from numerous individuals) or by computational means (for example, allowing for some ambiguity at the final few nucleotides).
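As a rough illustration of the computational option, the sketch below compares two V sequences while ignoring a few nucleotides at the 3' end. The function name `match_ignoring_3prime` and the default tolerance of three nucleotides are assumptions made for this example; the group did not settle on a method or a number.

```python
def match_ignoring_3prime(seq_a: str, seq_b: str, tolerance: int = 3) -> bool:
    """Return True if two sequences agree once the final `tolerance`
    nucleotides are ignored (hypothetical helper, not an agreed method)."""
    a, b = seq_a.upper(), seq_b.upper()
    # Sequences whose lengths differ by more than the tolerance are
    # treated as different, since the mismatch is not confined to the end.
    if abs(len(a) - len(b)) > tolerance:
        return False
    # Compare only the part that excludes the ambiguous 3' end of the
    # shorter sequence.
    core_len = min(len(a), len(b)) - tolerance
    if core_len <= 0:
        return False
    return a[:core_len] == b[:core_len]


# Two toy sequences that differ only at the last nucleotide.
print(match_ignoring_3prime("ACGTACGTAGA", "ACGTACGTAGG"))
```

A difference anywhere outside the final few positions still counts as a mismatch, so the tolerance only affects the region that the group identified as hard to define.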
Andrew proposed a set of five rules for treating the currently reported human germline sequences:
Sequences must be reported in a peer reviewed journal.
Sequences must not include ambiguities.
A single cDNA-derived sequence cannot be the sole evidence for a germline sequence.
The database must include full-length sequences.
We exclude all sequences generated by six studies that have high sequencing error rates:
Andris JS 1993
Campbell MJ 1992
Adderson EE 1993
Olee T 1992
Van Es JH 1992
This action removes 102 sequences from the current database.
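To make the proposed rules concrete, here is a minimal sketch of how they might be checked against a database record. The `GermlineRecord` fields and the `failed_rules` function are hypothetical names invented for this example, and the record shown is made up; the real database would need its own representation of evidence.

```python
from dataclasses import dataclass


@dataclass
class GermlineRecord:
    """Hypothetical record layout, assumed for illustration only."""
    name: str
    sequence: str
    peer_reviewed: bool
    full_length: bool
    has_genomic_evidence: bool
    cdna_observations: int
    source_study: str


def failed_rules(record: GermlineRecord, excluded_studies: set) -> list:
    """Return the numbers (1 to 5) of the proposed rules that a record fails."""
    failures = []
    # Rule 1: reported in a peer reviewed journal.
    if not record.peer_reviewed:
        failures.append(1)
    # Rule 2: no ambiguity codes, i.e. only A, C, G and T are allowed.
    if set(record.sequence.upper()) - set("ACGT"):
        failures.append(2)
    # Rule 3: a single cDNA-derived sequence cannot be the sole evidence.
    # A record with no evidence at all is also treated as failing here.
    if not record.has_genomic_evidence and record.cdna_observations <= 1:
        failures.append(3)
    # Rule 4: full-length sequences only.
    if not record.full_length:
        failures.append(4)
    # Rule 5: not generated by one of the excluded studies.
    if record.source_study in excluded_studies:
        failures.append(5)
    return failures


# Made-up record: contains an ambiguity (N), rests on one cDNA
# observation, and comes from one of the studies named above.
example = GermlineRecord(
    name="hypothetical_allele",
    sequence="ACGTNACGT",
    peer_reviewed=True,
    full_length=True,
    has_genomic_evidence=False,
    cdna_observations=1,
    source_study="Campbell MJ 1992",
)
print(failed_rules(example, {"Campbell MJ 1992"}))
```

Returning the list of failed rules, instead of a single pass or fail flag, keeps a record of why each sequence would be removed.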
There was general agreement with the first four of Andrew's suggested rules.
I was unsure about the final rule: it may appear a little arbitrary to remove just those sequences and not others, and I wondered whether assigning levels of confidence to all sequences might be a compromise.
The suggestion was to have, for example, a traffic-light-based confidence level scheme.
Low confidence sequences (such as those from the above six studies) would be red. One additional level of evidence, either genomic identity or expression in multiple copies with independent rearrangements, would move a sequence to the amber level. Both expression and genomic evidence, or two independent confirmatory genomic studies, would move it to the green level.
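The traffic light idea can be sketched as a small function. The name `confidence_level` and its two inputs are assumptions made for this example; how genomic and expression evidence would actually be recorded was not discussed.

```python
def confidence_level(genomic_studies: int, expressed_independently: bool) -> str:
    """Assign a traffic light level to a sequence (illustrative sketch).

    genomic_studies: number of independent genomic studies confirming
        the sequence.
    expressed_independently: True if the sequence is seen expressed in
        multiple copies with independent rearrangements.
    """
    # Green: expression plus genomic evidence, or two independent
    # confirmatory genomic studies.
    if genomic_studies >= 2 or (genomic_studies >= 1 and expressed_independently):
        return "green"
    # Amber: exactly one additional level of evidence.
    if genomic_studies == 1 or expressed_independently:
        return "amber"
    # Red: no supporting evidence beyond the original report.
    return "red"


for studies, expressed in [(0, False), (1, False), (0, True), (1, True), (2, False)]:
    print(studies, expressed, confidence_level(studies, expressed))
```

Under this scheme no sequence is removed outright; the low confidence sequences simply stay at red until further evidence is found.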
Overall, I think we covered a lot of ground this first month, and we can feel hopeful that over the next few weeks and months we can make good progress towards defining the high confidence set of human germline sequences from the currently reported germline sequences.