My proposal for “something else” is a database requiring minimal human curation, but lots of automatic curation by software. The key feature would be an API that could be used by annotation software to report back what germline genes were observed. It might work as follows:
- Cast a wide net for a starting set of alleles, including IMGT, UNSWIg, vgenerepertoire, etc.
- Users run annotation software, such as the tools listed here, which reports back which alleles were found in annotations.
- Users also run software that infers alleles directly from data, such as TIgGER, which also reports back that certain new alleles were found.
- The database automatically aggregates this information, maintaining a running list of known alleles and the support for each, and issues periodic versioned releases of a simple flat file of germline genes that annotation software can then use.
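To make the workflow above concrete, here is a minimal sketch of the aggregation step in Python. Everything in it is an assumption for illustration: the `GermlineRegistry` class, its `seed`, `report`, and `release` methods, the `min_support` threshold, and the use of FASTA as the flat-file format are all hypothetical, and the sequences and the novel allele name are toy placeholders. It also assumes each report carries a dataset identifier, so that support is counted per data set rather than per submission.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class GermlineRegistry:
    """Hypothetical in-memory stand-in for the proposed database."""

    # allele name -> set of dataset IDs that reported it
    support: dict = field(default_factory=lambda: defaultdict(set))
    # allele name -> nucleotide sequence
    sequences: dict = field(default_factory=dict)

    def seed(self, alleles):
        """Load the starting set of alleles, given as {name: sequence}."""
        self.sequences.update(alleles)

    def report(self, dataset_id, observed, novel=None):
        """Record one report from annotation or allele-inference software.

        observed: names of known alleles seen in the annotations.
        novel: optional {name: sequence} of newly inferred alleles.
        """
        for name, seq in (novel or {}).items():
            self.sequences.setdefault(name, seq)
            self.support[name].add(dataset_id)
        for name in observed:
            if name in self.sequences:  # skip names the registry has never seen
                # A set means a repeated dataset ID is counted only once.
                self.support[name].add(dataset_id)

    def release(self, min_support=1):
        """Return a FASTA-formatted flat file of sufficiently supported alleles."""
        lines = []
        for name in sorted(self.sequences):
            if len(self.support[name]) >= min_support:
                lines.append(f">{name}\n{self.sequences[name]}")
        return "\n".join(lines) + ("\n" if lines else "")


# Toy usage: placeholder sequences, not real germline genes.
registry = GermlineRegistry()
registry.seed({"IGHV1-2*02": "CAGGTGCAG", "IGHV3-23*01": "GAGGTGCAG"})
registry.report("dataset-A", observed=["IGHV1-2*02"])
registry.report("dataset-A", observed=["IGHV1-2*02"])  # resubmission, no extra support
registry.report(
    "dataset-B",
    observed=["IGHV1-2*02"],
    novel={"IGHV1-2*02_T163C": "CAGGTGCAC"},  # made-up novel allele name
)
print(registry.release(min_support=2))
```

With `min_support=2`, only the allele seen in both data sets makes it into the release. A real service would also need to decide how alleles from the starting set that are never observed should be treated, which this sketch sidesteps by simply leaving them out.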
There are a variety of issues here, including privacy concerns and the need for each data set to be reported only once. Or perhaps this is just too complicated a solution for a problem that will go away as we get better at sequencing genomes.