A little later than planned, I have now published a revised schema with simplified structure. After some thought, I am proposing that we have one record per germline sequence, but make it a compound record that includes one or more field delineations - where a ‘field delineation’ is the set of information necessary to identify where the different fields are according to the IMGT scheme, Kabat scheme or whatever. ‘Field delineation’ is the term used by NCBI on the IgBLAST website, and it strikes me as a better term than ‘alignment’ which I have used to date. I hope the term is ok with the community.
The compound record format is no problem for JSON, and I think it can easily be flattened into multiple sets of columns in a CSV format, or multiple sets of name/value pairs in a FASTA format if we wish to include all the information in that format. I hope you will agree that it provides an acceptable compromise between flexibility and simplicity of format, but please let me know your thoughts.
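To illustrate the flattening idea, here is a rough Python sketch. Every field name and value in it (`gene_name`, `field_delineations`, `cdr1_start` and so on) is made up for the example and is not taken from the published schema; it only shows how one compound record could become a single CSV row with one set of prefixed columns per field delineation.

```python
import csv
import io
import json

# Hypothetical compound record: all field names and values are illustrative
# assumptions, not the actual schema.
record = {
    "gene_name": "EXAMPLE-GENE*01",
    "sequence": "ACGTACGTACGT",
    "field_delineations": [
        {"scheme": "IMGT", "cdr1_start": 4, "cdr1_end": 6},
        {"scheme": "Kabat", "cdr1_start": 5, "cdr1_end": 7},
    ],
}


def flatten(rec):
    """Return a flat dict with one set of prefixed columns per field delineation."""
    flat = {k: v for k, v in rec.items() if k != "field_delineations"}
    for delineation in rec["field_delineations"]:
        prefix = delineation["scheme"].lower()
        for key, value in delineation.items():
            if key != "scheme":
                flat[f"{prefix}_{key}"] = value
    return flat


flat = flatten(record)

# The nested form serialises directly as JSON.
as_json = json.dumps(record, indent=2)

# The flattened form becomes a header line plus a single CSV row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
as_csv = buffer.getvalue()
print(as_csv)
```

Prefixing each column with the scheme name is just one possible convention; a numbered prefix would work equally well if two delineations could share a scheme.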
@werner.muller and @mikhail.shugay I have included the items from your posts - specifically genome co-ordinates and evidence. @werner.muller I have not as yet included cross references to other databases. I do at the moment have fields for a single ‘database of record’ where the gene sequence is deposited. Which other databases should we include? I suppose we could make this a list of name-value pairs if necessary, if we wish to keep it flexible, but it would be useful to have examples so that we can check what kind of references would be necessary.
To address these comments I made in a parallel thread I have included sequence_status and deprecation_reason to give us a little more flexibility in categorizing and subsetting gene sets. I have provided for two alternate gene names, in addition to the canonical name, so that we could, for example, list the current human gene name as well as a novel one should we introduce a new naming scheme.
To address potential issues with update conflicts I have proposed that the record for every gene should have a version number. That way, if a spreadsheet or other file with edits is uploaded, it will be easy to tell whether other authors have updated the record in parallel.
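As a sketch of how such a check might work in practice, the Python below accepts an uploaded edit only if it was based on the version currently stored, and bumps the version on success. The field name `version`, the status values and the helper `apply_edit` are all assumptions made for illustration, not part of the schema or of any existing tool.

```python
class VersionConflict(Exception):
    """Raised when an edit was based on an out-of-date copy of the record."""


def apply_edit(stored, edited):
    """Apply an uploaded edit only if it was based on the current stored version.

    Both arguments are dicts carrying a hypothetical integer 'version' field.
    """
    if edited["version"] != stored["version"]:
        raise VersionConflict(
            f"edit based on version {edited['version']}, "
            f"but stored record is at version {stored['version']}"
        )
    updated = dict(edited)
    updated["version"] = stored["version"] + 1
    return updated


# Hypothetical record; the status values are illustrative only.
stored = {"gene_name": "EXAMPLE-GENE*01", "sequence_status": "active", "version": 3}

# First author's edit is based on version 3, so it is accepted.
first_edit = {**stored, "sequence_status": "deprecated"}
stored_after = apply_edit(stored, first_edit)

# Second author's edit was also based on version 3, so it is now stale.
second_edit = {**stored, "gene_name": "EXAMPLE-GENE*02"}
try:
    apply_edit(stored_after, second_edit)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

The second author would then need to re-download the record and redo the edit against the newer version.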
I have renamed all fields in snake_case. I would love to display them in Courier when I refer to them in the file, but it doesn’t seem possible to have mixed fonts in a cell.
Please post comments in this thread, or questions if anything is unclear, or comment in the document if you prefer. I am particularly keen to know if I have captured everything required for deposition, evidence and so on.
p.s. I have made some small changes this morning. In particular I have attempted to clarify what is meant by a ‘database of record’ - in most cases, I assume that this would be the AIRR DB, but having the field makes it possible for us to extend to others if we wish to. I have also added an explicit field for a GenBank ID, should the sequence have been deposited in GenBank. With reference to the question I asked in this post about references to other databases, this means we are now referring explicitly to GenBank and Ensembl.
Thanks!
William