Sponsored by the AIRR Community

Standardizing the format of a germline set

@martin_corcoran So then wouldn’t the inclusion of the 5’UTR potentially present issues, if not everyone is describing it in their repertoire sequencing? I certainly get that there could be SNPs within the 5’UTR that are allele determinants (and I would typically be in favor of including these), but if the data coming into the database are regularly going to capture 5’UTRs of varying lengths, then doesn’t this make it a challenging feature to include in allele designation? I wonder if we shouldn’t simply stick to coding sequence for now? I think this deserves some discussion…

It does present issues, I agree, but if we consider that one use of the database is as a resource for groups interested in cloning monoclonal antibodies, then I think it is worthwhile to include this region, even if we do not have a perfect solution at present. I am not advocating that people switch to sequencing the 5’UTR and leader as part of their standard analysis, just that having this information in a germline database is going to be helpful for some purposes.

So, in a scenario where you describe a new allele, and this new allele includes 10 bp of the 5’UTR, and then @bussec describes a second sequence that matches this allele but includes 16 bp of the 5’UTR, does his sequence then supersede yours?

@bussec, how would your db scheme handle such a situation re aggregation and source layers?

First, a new source record would be generated for the second sequence. Then, as the sequences are identical in the overlapping parts, the id of the new source record would be included in the aggregation record. The overall confidence in the existence of the segment (represented by the aggregation record) would increase, assuming that the new sequence originates from an independent study.
We have not yet decided which sequence would be reported back to a user. It could either be one of the associated sequences without further editing (e.g. the longest one or the one with the highest overall quality) or a trimmed version focusing on a predefined region (e.g. ATG to RSS for V segments).
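
To make the source/aggregation idea concrete, here is a minimal Python sketch of how the 10 bp vs. 16 bp case could play out. It is an illustration only, not the actual database scheme: the class names (`SourceRecord`, `AggregationRecord`), the split into `utr5` and `coding` fields, the toy sequences, and the use of a simple count of independent studies as a stand-in for "confidence" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SourceRecord:
    """One submitted sequence (hypothetical layout, not the real schema)."""
    record_id: str
    study_id: str
    utr5: str    # 5'UTR bases captured upstream of the ATG (may be empty)
    coding: str  # sequence from the ATG onward (stand-in for "ATG to RSS")


@dataclass
class AggregationRecord:
    """Groups source records that agree wherever they overlap."""
    sources: list = field(default_factory=list)

    def is_consistent(self, new: SourceRecord) -> bool:
        for s in self.sources:
            if s.coding != new.coding:
                return False
            # 5'UTRs are compared right-aligned at the ATG, so only the
            # bases that both records actually captured are checked.
            n = min(len(s.utr5), len(new.utr5))
            if n and s.utr5[-n:] != new.utr5[-n:]:
                return False
        return True

    def add(self, new: SourceRecord) -> bool:
        """Attach the source record if it matches; report whether it did."""
        if not self.is_consistent(new):
            return False
        self.sources.append(new)
        return True

    def independent_studies(self) -> int:
        # Assumed proxy for confidence: number of distinct studies.
        return len({s.study_id for s in self.sources})

    def longest(self) -> SourceRecord:
        # Reporting option 1: an unedited source sequence (the longest).
        return max(self.sources, key=lambda s: len(s.utr5) + len(s.coding))

    def trimmed(self) -> str:
        # Reporting option 2: a predefined region shared by all sources.
        return self.sources[0].coding


agg = AggregationRecord()
# First description: 10 bp of 5'UTR.
agg.add(SourceRecord("src1", "study_A", "ACGTACGTAC", "ATGGAGTTT"))
# Second description: same allele, 16 bp of 5'UTR, independent study.
agg.add(SourceRecord("src2", "study_B", "TTGACAACGTACGTAC", "ATGGAGTTT"))
# A record that disagrees within the shared 5'UTR is not aggregated.
rejected = agg.add(SourceRecord("src3", "study_C", "ACGTACGTAA", "ATGGAGTTT"))
```

In this sketch neither sequence supersedes the other: both source records stay attached to the same aggregation record, and the choice between `longest()` and `trimmed()` only affects what is reported back to the user.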