There are two potential user groups, I think. One group is experts, who will want to see every sequence deposited for a species and make up their own minds about which should be included in a particular analysis. The other is non-experts, who just want a good ‘general’ germline set to use for their analysis. Are we aiming our site at both, or are we aiming it at the experts, who will then in turn make recommendations about which subset should be used for particular purposes, and publish those subsets on their own site?
If we intend non-expert users to make use of our site, I think we will need some form of subsetting or selection in addition to filtering on the fields available. Suppose that a longer sequence for an existing gene is deposited. We don’t want to delete or replace the older sequence, because its origin, evidence and so on are important, and it will have been used in previous analyses. But we don’t want both sequences showing up in a germline set being sent to a parser. Or suppose that a previously deposited gene is found to be invalid for some reason, or incorrectly sequenced. There too, we would want to keep the record but leave it out of the set.
I suppose we could add a flag to each gene to indicate whether it should be included in a germline set for parsers, but if it’s just one flag, who makes that call? And if we accept that experts are going to make subsets for their own purposes, don’t we think it’s possible that some of those subsets might be useful for the wider community?
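To make the alternative to a single flag a bit more concrete, here is a rough Python sketch of what I have in mind: deposits are never deleted, and any number of named sets, each owned by a curator, simply list the deposits they include. All of the names here (Deposit, GermlineSet, the accession and gene values) are made up for illustration and are not a proposed schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Deposit:
    """One deposited sequence. Records are kept forever, never overwritten."""
    accession: str  # hypothetical identifier
    gene: str
    sequence: str


@dataclass
class GermlineSet:
    """A named selection of deposits, owned by whoever curates it."""
    name: str
    curator: str
    accessions: set[str] = field(default_factory=set)

    def select(self, deposits: list[Deposit]) -> list[Deposit]:
        # Keep only the deposits this set explicitly includes,
        # preserving the order in which they were deposited.
        return [d for d in deposits if d.accession in self.accessions]


# Illustrative data: D2 is a longer sequence for the same gene as D1.
deposits = [
    Deposit("D1", "GENE-A", "ACGT"),
    Deposit("D2", "GENE-A", "ACGTACGT"),
    Deposit("D3", "GENE-B", "TTGA"),
]

# A 'general' set for parsers takes the longer sequence and leaves D1 out,
# while D1 itself stays in the repository with its evidence intact.
general = GermlineSet("general", "curator-1", {"D2", "D3"})
selected = general.select(deposits)
```

With something like this, the question of who makes the call is answered per set rather than once for the whole site, and an expert's subset could be published alongside the general one.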
I think it would be very helpful to have some input on this point from users and curators.