To help us come to agreement, the leaders of the germline efforts have asked for a summary of the threads of discussion on the nitty-gritty of storing and serving the germline information. This is my attempt to do so. Please post suggestions in reply to this post or in the other relevant topics linked below.
Overall goals for the website + backend
We would like to
- store complete information as described in the community-designed schema (see discussion)
- make it simple for people to download and interact with the database
We have a small, simple set of data, and we would like to keep all aspects of this as simple as possible.
Storing the complete information
How best to store information in the above-mentioned schema?
Design goals
- the entire database should be downloadable in a way that enables easy interaction with lots of languages and platforms
- we would like to be able to view differences between versions of the database
Proposal
- For backend storage, encode @wlees’ schema as a JSON object, one per species; store this information in GitHub.
- Curators submit new sequences as a spreadsheet, which is validated by our code to make sure that it fits the schema, and then gets incorporated into the DB. It may be useful to use a spreadsheet plugin tool to validate right off the bat. Until further notice, even inferred alleles will come through this route.
- Curators will also be able to download information on a given sequence as a spreadsheet, edit it, and resubmit it as a revision. Upon uploading, the site will check to make sure that some basic information, such as the sequence itself, does not change.
- There will be metadata fields that will be required for each such submission, which will ask who is making the change, and why. These fields will get turned into commit messages for the backend GitHub repository storing the database.
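To make the submission steps above concrete, here is a minimal sketch of what the validation code might look like, assuming the spreadsheet is exported as CSV. The column names (`name`, `species`, `sequence`, `submitter`, `reason`), the allowed sequence characters, and all function names are hypothetical placeholders; the real ones would come from the schema.

```python
import csv
import io

# Hypothetical required columns; the real list would come from the community schema.
REQUIRED_FIELDS = ("name", "species", "sequence", "submitter", "reason")
# Assumed alphabet for sequences (bases, N, and '.' for gaps).
VALID_BASES = set("ACGTN.")


def read_submission(csv_text):
    """Parse a spreadsheet exported as CSV into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def validate_row(row):
    """Return a list of problems found in one row (empty if the row is fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"missing required field: {field}")
    sequence = (row.get("sequence") or "").strip().upper()
    bad = sorted(set(sequence) - VALID_BASES)
    if bad:
        problems.append(f"unexpected characters in sequence: {''.join(bad)}")
    return problems


def check_revision(existing, revised, locked=("name", "species", "sequence")):
    """Return the locked fields that a resubmitted revision tries to change."""
    return [f for f in locked if existing.get(f) != revised.get(f)]


def commit_message(row):
    """Turn the who/why metadata of a submission into a commit message."""
    return (
        f"{row['species']}: {row['name']}\n\n"
        f"Submitter: {row['submitter']}\n"
        f"Reason: {row['reason']}"
    )
```

A submission would be accepted only if `validate_row` returns an empty list for every row and, for revisions, `check_revision` returns an empty list as well.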
Interacting with the database
Website
Design goals (see discussion)
- It should be very easy to download a high confidence set of germline sequences for your species of interest in a simple format (probably FASTA + CSV)
- It should be clear what is being downloaded
- It should be easy to switch between species
- It should be easy to change to other levels of evidence for the sequences (e.g. direct germline sequencing only or a more inclusive set)
- It should be easy to access previous releases of the database
- There should be a standardized way for computers (not humans clicking) to access the latest version of the database. This could be as simple as fixing a URL like http://new-database.org/human/high-confidence/latest
- From an excellent post by Christian: it should be clear why the database exists, how it’s used, how people can contribute, and who we are
- Each gene sequence should get its own page, with all the associated information about it
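As an illustration of the download goals above, here is a rough sketch of how a FASTA + CSV pair could be produced for a chosen level of evidence. It assumes each sequence is a record with `name`, `sequence`, and `evidence` fields, and that evidence takes the values `direct` or `inferred`; these names and labels are made up for the example and are not taken from the schema.

```python
import csv
import io

# Hypothetical evidence labels, ordered from strictest to most inclusive.
EVIDENCE_LEVELS = ("direct", "inferred")


def select(records, level):
    """Keep records whose evidence is at least as strict as `level`."""
    allowed = EVIDENCE_LEVELS[: EVIDENCE_LEVELS.index(level) + 1]
    return [r for r in records if r["evidence"] in allowed]


def to_fasta(records, width=60):
    """Format records as FASTA, wrapping sequence lines at `width` characters."""
    lines = []
    for r in records:
        lines.append(f">{r['name']}")
        seq = r["sequence"]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


def to_csv(records, fields=("name", "evidence")):
    """Write the chosen metadata columns as CSV, ignoring any other fields."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

Switching between the high-confidence set and a more inclusive one is then just a matter of calling `select` with a different level before exporting.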
Proposal
- Static site built with Python, Bootstrap 4, and Jinja2
- The workflow is: when a new spreadsheet is handed over with a new or modified germline sequence (see above), we run code to validate the spreadsheet and update the database; when this is done we commit the new version of the DB to GitHub. That commit triggers a job (e.g. via Wercker) that processes this new DB into the website, including downloads.
- Successive differences between database versions will be available as a readable “diff”.
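One possible way to produce the readable diff, sketched here with only the Python standard library: serialize both versions of a species' JSON object with sorted keys and fixed indentation, then compare them line by line. The function names and the labels `previous` and `latest` are placeholders, and Git's own diff of the stored JSON files may turn out to be enough on its own.

```python
import difflib
import json


def canonical_lines(db):
    """Serialize with sorted keys and fixed indentation so diffs are stable."""
    return json.dumps(db, indent=2, sort_keys=True).splitlines()


def readable_diff(old, new, old_label="previous", new_label="latest"):
    """Return a unified diff between two versions of the database as text."""
    return "\n".join(
        difflib.unified_diff(
            canonical_lines(old),
            canonical_lines(new),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )
```

Sorting the keys matters here: without it, two versions with identical content could still differ in key order and produce a noisy diff.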