I would suggest keeping the versioning of the underlying sequences separate from the versioning of inferences. They will likely be managed by different people, according to different rules.
Here’s a modified suggestion for the file structure:
- h, k, l directories as in your GitHub repository
- within each directory:
  - germline.fasta - a single file containing the master set of germline sequences that we create. These would have neutral names, probably just an incrementally assigned index number.
  - names_xxx.csv - one or more files defining names that are used under a particular naming scheme, mapping them to the ‘neutral’ names in the germline file. For example, names_imgt.csv might, for the human germlines, map IGHV1-69*01 to something, and so on.
  - positions_xxx.csv - one or more files defining key positions (as in your extras.csv) for a particular numbering scheme, e.g. positions_imgt.csv, positions_kabat.csv
How this would work in practice:
AIRR would sponsor or manage a central repository for germlines: the germline.fasta files for each species, and the underlying data showing who contributed them, what their origins were and so on. Each gene would be allocated an index number, and existing entries would never change. In effect, a GenBank for germlines.
UNSWIG, IMGT and others - maybe AIRR as well in some cases - would publish names_xxx.csv files, mapping these underlying genes into their favoured naming scheme, and bringing together the collections that they feel are suitable for particular purposes. Because entries in germline.fasta never change, the version dependency is quite simple: you just need a germline.fasta file that contains all the genes referenced in names_xxx.csv.
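To make the version dependency concrete, here is a rough sketch of the check a tool could run before using a names file. The column names `germline_id` and `name`, and the example sequences and mappings, are my own placeholders; nothing about the CSV layout has been agreed.

```python
import csv
import io


def read_fasta_ids(handle):
    """Return the set of record IDs (first word after '>') in a FASTA stream."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def missing_genes(fasta_handle, names_handle, id_column="germline_id"):
    """Return neutral IDs referenced in the names file but absent from the FASTA.

    An empty result means the germline file satisfies the names file.
    `id_column` is a hypothetical column name.
    """
    known = read_fasta_ids(fasta_handle)
    referenced = {row[id_column] for row in csv.DictReader(names_handle)}
    return sorted(referenced - known)


# Made-up example data: the names file refers to entry 3,
# which this (older) germline file does not contain yet.
germline_fasta = io.StringIO(">1\nCAGGTGCAG\n>2\nGAGGTGCAG\n")
names_imgt = io.StringIO("germline_id,name\n1,IGHV1-69*01\n3,IGHV3-23*01\n")
print(missing_genes(germline_fasta, names_imgt))
```

If the list comes back non-empty, the fix is simply to fetch a newer germline.fasta; because entries are never changed or removed, a newer file can only add what is missing.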
The same (or different) people would publish positions_xxx.csv files for numbering schemes of interest. These would need to be kept in step with the names_xxx.csv files.
It would be easy to convert existing IMGT-aligned germline sets into this format. IMGT definition files could be maintained in this format automatically, just being updated whenever the IMGT libraries changed. Hopefully we could persuade UNSWIG and other maintainers of germlines to adopt the format, or automate conversion to it from whatever they use at the moment.
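As an illustration of how simple the conversion could be, here is a minimal sketch. It makes several assumptions of mine: the FASTA header is just the allele name (real IMGT headers carry more fields), alignment gaps are written as `.` and are stripped from the master sequence, and two names with an identical ungapped sequence share one index. The sequences in the example are invented.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: the first word of the header is the allele name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def assign_indices(named_records, index_by_sequence):
    """Map scheme names to neutral index numbers.

    `index_by_sequence` maps ungapped sequence -> existing index and is
    updated in place. Existing entries keep their index; new sequences
    get the next free number, so the master set only ever grows.
    Returns a list of (index, scheme_name) rows for a names file.
    """
    next_index = max(index_by_sequence.values(), default=0) + 1
    rows = []
    for name, seq in named_records:
        ungapped = seq.replace(".", "")  # drop alignment gaps
        if ungapped not in index_by_sequence:
            index_by_sequence[ungapped] = next_index
            next_index += 1
        rows.append((index_by_sequence[ungapped], name))
    return rows


# Invented example: one sequence already in the master set, one new.
master = {"CAGGTG": 1}
aligned = ">IGHV1-69*01\ncag...gtg\n>IGHV1-69*02\ncagctg\n"
for index, name in assign_indices(parse_fasta(aligned), master):
    print(index, name)
```

The gap positions that get stripped here are exactly the information that would go into positions_imgt.csv, so a real converter would record them rather than discard them.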