Apologies if I am revisiting points that have already been made, here or on other threads. Focusing on @ematsen’s “One record per original source reference”: the point regarding publications seems to imply that publications may not be necessary. I need to be convinced of that. It is true that a bad publication serves little purpose, but seeing the detail of how sequences were generated has been critical to previous evaluations of ‘confidence levels’. I also think we need to encourage publications that identify BCR/TCR genes. A problem that we had for many years was that people came to see such reports as unimportant. By insisting on publications, we counter that view.
On the point of ‘confidence levels’, is this an evaluation of the confidence of the observation in the original source reference, or is it a measure of overall confidence, taking in additional evidence? If it is the former, I am not aware that such an evaluation is being worked on at present. If the latter, the confidence level will change over time. Or have I missed something? I thought that confidence levels, as well as ‘functionality’ or ‘rearrangeability’, would be a separate layer of the system.
I would also add a comment on the ‘one record per unique sequence’ problem. I know leader sequences, RSS, etc. have been mentioned occasionally in various threads. Although most reported sequences only include the coding region, we should be encouraging the reporting of these elements. And one day, this will lead to a new kind of diversity in the germline repertoire. Not only are identical sequences possible at different loci, but we also have the possibility of identical coding regions being associated with varying RSS or other regulatory elements. I can imagine attention returning to these elements as we struggle to understand the highly variable utilization frequencies of different genes. Should the database be designed so that it is ready for that?
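To make that last question a little more concrete, here is a minimal sketch of what "ready for that" might look like. It is only an illustration under my own assumptions, not a proposal for the actual schema: the class name `GermlineRecord`, its field names, and the helper `group_by_coding` are all hypothetical, and the sequences are toy data rather than real genes. The idea is that each record stays tied to one source reference, with leader and RSS as optional fields, so that records sharing a coding region can still be told apart by locus or flanking elements.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GermlineRecord:
    """One record per original source reference (hypothetical field names)."""
    source_ref: str                # placeholder for an accession or citation key
    locus: str                     # where the sequence was observed
    coding: str                    # coding-region nucleotides
    leader: Optional[str] = None   # None means "not reported", not "absent"
    rss: Optional[str] = None      # recombination signal sequence, if reported


def group_by_coding(records):
    """Map each coding sequence to the distinct (locus, leader, rss) contexts seen for it."""
    contexts = defaultdict(set)
    for r in records:
        contexts[r.coding.upper()].add((r.locus, r.leader, r.rss))
    return dict(contexts)


# Toy data: three reports share a coding region but differ in locus or RSS.
records = [
    GermlineRecord("ref-A", "locus-1", "ACGTACGT", rss="CACAGTG"),
    GermlineRecord("ref-B", "locus-1", "ACGTACGT", rss="CACAATG"),
    GermlineRecord("ref-C", "locus-2", "ACGTACGT"),
    GermlineRecord("ref-D", "locus-1", "TTGACCAA", rss="CACAGTG"),
]

variants = group_by_coding(records)
for coding, seen in variants.items():
    print(coding, "observed in", len(seen), "distinct context(s)")
```

In this sketch a ‘one record per unique sequence’ view would collapse the first three records into one, whereas keying on the coding region and keeping the contexts alongside it preserves the distinction. Whether unreported elements should be stored as missing values, as here, is itself a design choice that would need discussion.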