Coming to consensus on germline database website + backend design

So that we can come to agreement, the leaders of the germline efforts have asked for a summary of the threads of discussion on the nitty-gritty of storing and serving the germline information. This is my attempt to provide one. Please make suggestions on this post or on the other relevant topics linked below.

Overall goals for the website + backend

We would like to

  1. store complete information as described in the community-designed schema (see discussion)
  2. make simplified means for people to download / interact with the database

We have a small, simple set of data, and we would like to keep all aspects of this project as simple as possible.

Storing the complete information

How best to store information in the above-mentioned schema?

Design goals

  • the entire database should be downloadable in a way that enables easy interaction with lots of languages and platforms
  • we would like to be able to view differences between versions of the database

Proposal

  • For backend storage, encode @wlees’ schema as a JSON object, one per species; store this information in GitHub.
  • Curators submit new sequences as a spreadsheet, which is validated by our code to make sure that it fits the schema, and then gets incorporated into the DB. It may be useful to use a spreadsheet plugin tool to validate right off the bat. Until further notice, even inferred alleles will come through this route.
  • Curators will also be able to download information on a given sequence as a spreadsheet, edit it, and resubmit it as a revision. Upon uploading, the site will check to make sure that some basic information, such as the sequence itself, does not change.
  • There will be metadata fields that will be required for each such submission, which will ask who is making the change, and why. These fields will get turned into commit messages for the backend GitHub repository storing the database.
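As a sketch of how the validation and commit-message steps above might fit together (the column names and metadata fields here are illustrative placeholders, not the agreed schema):

```python
# Sketch of the submission pipeline: validate a curator's spreadsheet, then
# turn the submission metadata into a commit message. Field names are assumed.
import csv
import io

REQUIRED_COLUMNS = {"sequence_id", "sequence", "species"}   # assumed, not final
REQUIRED_METADATA = {"submitted_by", "reason_for_change"}   # assumed, not final

def validate_submission(csv_text):
    """Check a submitted spreadsheet (as CSV text) against the schema.

    Returns a list of error strings; an empty list means the file passed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    errors = []
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        errors.append("missing columns: " + ", ".join(sorted(missing)))
        return errors
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        if not row["sequence"].strip():
            errors.append(f"row {i}: empty sequence")
    return errors

def commit_message(metadata):
    """Turn required submission metadata (who, why) into a git commit message."""
    missing = REQUIRED_METADATA - set(metadata)
    if missing:
        raise ValueError("missing metadata: " + ", ".join(sorted(missing)))
    return f"{metadata['reason_for_change']}\n\nSubmitted-by: {metadata['submitted_by']}"
```

The same checks could run client-side in a spreadsheet plugin and server-side before the commit is made.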

Interacting with the database

Website

Design goals (see discussion)

  • It should be very easy to download a high confidence set of germline sequences for your species of interest in a simple format (probably FASTA + CSV)
  • It should be clear what is being downloaded
  • It should be easy to switch between species
  • It should be easy to change to other levels of evidence for the sequences (e.g. direct germline sequencing only or a more inclusive set)
  • It should be easy to access previous releases of the database
  • There should be some sort of standardized means by which computers (not humans clicking) can access the latest version of the database. This could be as simple as a fixed URL like http://new-database.org/human/high-confidence/latest
  • From an excellent post by Christian: it should be clear why the database exists, how it’s used, how people can contribute, and who we are
  • Each gene sequence should get its own page, with all the associated information about it
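To illustrate the fixed-URL idea for machine access, the stable address could simply be composed from a few path segments (the host name and segment names below are placeholders, not a decided scheme):

```python
# Minimal sketch of the fixed-URL scheme for programmatic access.
# The base host and path segments are illustrative placeholders.
BASE = "http://new-database.org"

def release_url(species, confidence="high-confidence", version="latest"):
    """Build the stable download URL for a species and evidence level."""
    return f"{BASE}/{species}/{confidence}/{version}"
```

A script could then always fetch `release_url("human")` for the current high-confidence human set, or substitute a release tag for `version` to pin an analysis to a specific database version.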

Proposal

  • Static site built with Python, Bootstrap 4, and Jinja2
  • The workflow: when a spreadsheet with a new or modified germline sequence is handed over (see above), we run code to validate the spreadsheet and update the database; once that succeeds, we commit the new version of the DB to GitHub. That commit triggers a job (e.g. via Wercker) that rebuilds the website, including the downloads, from the new DB.
  • Successive differences between database versions will be available as a readable “diff”.

I think it’s important that changes are tracked, in such a way that any user who wishes can see how a particular record changed over time: who changed it, when, and why. There may be some information, particularly information relating to deposited sequences, including the sequence itself, which should never be changed. The approach of downloading and uploading entire gene sets will make it difficult to track and supervise changes. For example, a curator might make several changes in the spreadsheet for different reasons. How do they record the correct reason for each change? How do they ensure that no unintended changes were made by accident?

The approach seems to enforce a single representation of the germline set for a species. As far as I can see, it wouldn’t support, for example, the publication by different groups of a human germline set with differing naming schemes, numbering schemes, or selections of ‘high confidence’ sequences; an upload from one curator would be all too likely to overwrite changes made by another. Is this what we want? By the same token, this would make it difficult for multiple curators to work on the same scheme: unless we built a mechanism for merging changes, they would need to agree which of them ‘had the baton’ at any particular time.

William

@w.lees, thanks for your very on-target comments. I have edited the proposal above accordingly.

I completely agree, and this is one of the reasons we chose to have the full information in GitHub in a text-based format. That said, a text diff of JSON is a little hard to look at, so we will implement ways to see the differences easily (e.g. analogous to http://jsondiff.com/).
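One simple way to get a jsondiff-style readable view with the standard library: serialize both versions with sorted keys and indentation, then run an ordinary line diff over the result. This is only a sketch of the idea, not the final tooling:

```python
# Readable diff of two JSON documents: pretty-print both with stable key
# order, then compare line by line with a standard unified diff.
import difflib
import json

def json_diff(old, new):
    a = json.dumps(old, indent=2, sort_keys=True).splitlines()
    b = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(difflib.unified_diff(a, b, "old", "new", lineterm=""))
```

Serializing with `sort_keys=True` and a fixed indent also keeps GitHub’s own line diffs small, since unchanged records serialize identically between versions.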

Commit messages can describe who and how, and will be built from metadata in the submission files.

We can enforce this with the code.
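A minimal sketch of such a check, run when a revised record is uploaded (the set of locked fields below is an assumption; the real list would come from the schema):

```python
# Sketch of the revision check: reject an edit that touches an immutable field.
IMMUTABLE_FIELDS = ("sequence", "sequence_id")  # illustrative, not the final list

def check_revision(old_record, new_record):
    """Raise if any field that must never change differs between versions."""
    changed = [f for f in IMMUTABLE_FIELDS if old_record.get(f) != new_record.get(f)]
    if changed:
        raise ValueError("immutable fields changed: " + ", ".join(changed))
```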

We have revised things such that there is one spreadsheet per sequence, enforcing that level of granularity. Re accidents, well, there’s only so much we can do to catch that, but I hope people look at the diffs.

Don’t we want one complete germline set per species? I can’t imagine even what the website would look like with multiple naming schemes. Regarding confidence, we could certainly offer a filtering based on that.

I’ve been using RethinkDB as a database backend, and it’s quite nice; changes to the database can trigger events.

If you’re going the Jinja2 route, why not use Flask; it makes routing nice and easy…

Thanks, Simon!

Our thinking has been to build a static site. In this case, we would be processing all of the data in one go and building all the pages, rather than querying a database. Would you still advocate RethinkDB in that case?

The same applies to Flask, which IIUC is for dynamic sites, though Brian has taken a look at Frozen Flask. My feeling is that if we are already going through the data to build simplified CSV/FASTA downloads, the same code path could be used to generate HTML directly.
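The “one pass over the data” idea might look like this: the same in-memory records drive both the FASTA download and the static HTML page (record fields here are illustrative):

```python
# Sketch: a single list of records feeds both output formats,
# so the downloads and the website can never drift apart.
def to_fasta(records):
    """Render records as a FASTA download."""
    return "\n".join(f">{r['name']}\n{r['sequence']}" for r in records)

def to_html(records):
    """Render the same records as a static HTML table."""
    rows = "\n".join(f"<tr><td>{r['name']}</td><td>{r['sequence']}</td></tr>"
                     for r in records)
    return f"<table>\n{rows}\n</table>"
```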

But I’m all ears!

I agree with Erick: isn’t the whole point to have a master list in one place and then let more advanced users subset it as they see fit? They can then describe in their methods which version they pulled from us and what other processing they applied. Perhaps we could have a way for people to upload custom lists of rules so that they can easily be reused by others and/or applied to different versions, species, etc.?
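One way the “lists of rules” idea could work: each rule is plain data (field, operator, value), so a rule list can be saved, shared, and re-applied to any database version or species. The field names and operators below are illustrative assumptions:

```python
# Sketch of shareable subsetting rules as data, not code.
OPS = {
    "eq": lambda a, b: a == b,
    "ne": lambda a, b: a != b,
}

def apply_rules(records, rules):
    """Keep only records satisfying every (field, op, value) rule."""
    return [r for r in records
            if all(OPS[op](r.get(field), value) for field, op, value in rules)]
```

Because a rule list is just data, it could itself be versioned and published alongside the database for others to reuse.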

Dear @ematsen,

I wasn’t quite sure what you were thinking when it comes to generating the static files. If one were going to do this a lot, it would make sense to have a database that triggered an event to process all the files. As for Flask, it makes the routing easier (it doesn’t have to serve dynamic content, as in Frozen Flask). I used DocPad in the past, which seemed to work well… GitHub uses Jekyll… there are lots of options.

Hi Simon,

I’ve been working with Erick to prototype some of the ideas being discussed here. So far I’ve just been using Jinja to fill in some HTML templates to produce a static site. I agree that Flask makes it easier to structure the code in a way that reflects the hierarchy of the website, and Frozen Flask means we can get the benefits of routing decorators and still have a static site. I haven’t used Frozen Flask before but I’m interested to test it.

  • Chris

There are two potential user groups, I think. One group is experts, who will want to see every sequence deposited for a species and make up their own minds about which should be included in a particular analysis. The other is non-experts, who just want a good ‘general’ germline set to use for their analysis. Are we aiming our site at both, or are we aiming it at the experts, who will then in turn make recommendations about which subset should be used for particular purposes and publish those subsets on their own site?

If we intend non-expert users to make use of our site, I think we will need some form of subsetting or selection in addition to filtering on the available fields. Suppose that a longer sequence for an existing gene is deposited. We don’t want to delete or replace the older sequence, because its origin, evidence, and so on are important, and it will have been used in previous analyses. But we don’t want both sequences showing up in a germline set being sent to a parser. Or suppose that a previously deposited gene is found to be invalid for some reason, or incorrectly sequenced.

I suppose we could add a flag to each gene to indicate whether it should be included in a germline set for parsers, but if it’s just one flag, who makes that call? And if we accept that experts are going to make subsets for their own purposes, don’t we think it’s possible that some of those subsets might be useful for the wider community?

I think it would be very helpful to have some input on this point from users and curators.

Thanks Erick.

For naming and numbering:

  • I think we would probably want to allow for at least two names: the name as it stands today, and a new name, assuming that the community decides to move away from the current naming, at least for some species.
  • Likewise for numbering (really, field delineation and numbering): I think we would probably want to support at least the existing systems. I am aware of four, as detailed on bioinf.org.

If we take the view that this number is unlikely to grow very quickly and that there is general community consensus around them, I could collapse the multiple tables in the schema, and just add a few more columns. If the community subsequently decided to support a different numbering scheme or even an additional name, it would be easy enough to add a new column.

Again I think this is an area where it would be very useful to hear from potential users and curators.

Definitely.

Agreed: I think this is included in the Design Goals that @ematsen mentioned at the top of the post. At the very least, there should be a (default) “high confidence” set and a “kitchen sink” set. Updated and deprecated genes will move into and out of the high-confidence set (respectively) as necessary, which is why we need transparent version control. And the more inclusive levels will still retain this information explicitly for advanced users. It does involve a certain amount of us making a call, but that’s part of the mandate, and it’s not a single opaque flag that constrains advanced users in any way. This is what I meant by people using (and sharing) lists of rules.

This is an interesting scenario. My instinct is that we should have “superseded by” and “replaces” fields as part of the schema, and then retain both as separate entries in the “all inclusive” database; only the most recent/best supported version would be included in the “high confidence” database for general users, though. (Of course, this also implicates previous and on-going discussions about naming conventions…)
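Resolving those links when building the high-confidence set could be as simple as following the chain from each record to its newest version. The field names below are assumptions, not the agreed schema, and the sketch assumes the chain has no cycles:

```python
# Sketch: follow "superseded_by" links from a record to its most recent
# version. Field names are illustrative; assumes supersession chains
# terminate (no cycles).
def newest_version(records, name):
    by_name = {r["name"]: r for r in records}
    rec = by_name[name]
    while rec.get("superseded_by"):
        rec = by_name[rec["superseded_by"]]
    return rec
```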

I really like the GitHub idea. In line with providing easy-to-use curated sets, we could use git’s tagging mechanism to make “releases” for default versions people could use. We could also use git hashes to reference non-release versions that are also published to the database. (So if someone really needs to use the master version of the database, it should at least be reconstructable). We could also think of ways of using GitHub’s forking functionality to let people build their own custom germlines and make them available to others as well in a consistent manner.

At the moment, I am strongly in favor of a static site. Even if we have to rebuild the site once every few days, I think it’ll basically be a trivial task. That will also make routing simpler in my mind (because it’s static), and we could use GitHub pages to host the website reliably and for free.

Any thoughts on using Travis instead of Wercker? I get the feeling that it’s much more widely used, and I suspect that we’re not going to need the type of computation that requires containers for the website rebuilding.

Finally, it may make sense to transition to a dynamic site in the future, especially if the size of the data sets grows enough to cause high latency (e.g., if each page load requires downloading many MB of data). But this is something we could switch out later without affecting the users or the URL routing.

I was also going to add that JSON can be output in a readable or non-readable manner, and I don’t know how serializers in different programming languages handle this. We should check whether GitHub is smart about showing diffs of JSON objects. Separately, we could also consider YAML, which has very broad programming-language coverage, maintains all of JSON’s advantages (it is in fact a superset of JSON), and is generally far more human-readable, if that’s an important consideration.

Thanks everyone for your contributions to the discussion. I’m completely in agreement with @caschramm’s most recent comments.

OK, I’m listening, but perhaps this should become part of the schema discussion?

Thanks for your comments, @laserson, and I’m happy that you like the ways things are going.

We are using Wercker for the time being, but in practice it’s a small change if we want to switch to Travis in the future; it’s just running some Python scripts.

@cswarth has been working on this for a little while now and has been pushing it to his own private repository. Sooner or later as things mature it’ll be time to transfer the repos over to a shared GH organization. What do we want this to be? Do we want to start a germline DB organization, or use https://github.com/airr-community?

For the sake of not over-splintering, I would vote trying to centralize things at https://github.com/airr-community. Please let me know if you need access to the org.

Perhaps I’m quite late to the discussion, but here are my five cents. First, I must note that I like the idea of hosting the database on GitHub very much. I’ve had a great experience in terms of management, productivity, and troubleshooting with my own project (vdjdb), which uses issues to track submissions, pull requests to update the database, and continuous integration (Travis CI) to verify database integrity.

I have a couple of questions though:

  • Will the database be dedicated solely to B-cell receptors? I don’t think there would be any compatibility issues with TCRs
  • Will database submissions be handled only in the form of spreadsheets sent to a limited group of curators? I’m interested in whether someone will be able to make a pull request with a metadata file and a FASTA file with a specific header format that would then be mapped and processed into a set of annotated JSON records. Such a scheme seems easier for a bioinformatician to use than spreadsheets. What is the metadata format?
  • Compatibility with IMGT and ImmunoSeq nomenclature (I’m unsure whether those two are one-to-one compatible) should also be discussed. I think most RepSeq data to date uses this format; needless to say, lots of immunologists will be very confused by a new nomenclature. So there should be a way to lift over legacy data to the new database.

Another important aspect of storing the germline sequences is the germline itself. It would be ideal if each sequence could be traced directly to the corresponding genome contig, i.e. genome assembly ID and genomic coordinates (e.g. in GTF format). There are many tasks, such as looking at genomic features shaping V-J recombination, where this is quite important. I haven’t found anything related in the schema linked in the original post. Have you considered adding this feature?

Not dedicated to BCRs at all. @w.lees’ schema includes both. Did I slip up somewhere?

For the non-programmers, we must use spreadsheets. Bioinformaticians can cook up spreadsheets too, and there’s a big advantage to having a single way in which data enters the DB.

There are copyright issues with doing this, see this thread.

I refer to the title of this topic :slight_smile:

Spreadsheets can become really cumbersome due to column formats, merged columns, annotations, etc. If the spreadsheet can be easily compiled from a tab-delimited file, I think there will be no problems. I was thinking of someone who fetches germline sequences from, e.g., RNA-Seq data and would prefer to automate the submission generation process.

Even for a list of closest IMGT segment names? That would mean that any database/clonotype table containing IMGT identifiers is copyrighted. Sui generis database rights apply only when a substantial portion of the database is used. Anyway, I think a script that downloads IMGT on the user’s side and matches IMGT segments to AIRR database segments could be a feasible solution, at least for legacy TCR sequencing data.

There is a small chance that there is some kind of IP right on the names/naming scheme itself. I am currently in the process of ruling out this last 5% of uncertainty, but until then we should play it safe. Sui generis database rights are not a concern here, since we are building up the whole database de novo from free and open sources.

:grimacing: Fixed now.

Yes, certainly. By “spreadsheet” I don’t mean .xls necessarily. Everything will be submittable via CSV or TSV.
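Accepting both formats with one loader is straightforward: the standard library’s `csv.Sniffer` can guess the delimiter, so the same code handles CSV and TSV submissions. A sketch:

```python
# Sketch: load a submission that may be comma- or tab-delimited.
# csv.Sniffer guesses the delimiter, so one loader covers both formats.
import csv
import io

def load_submission(text):
    dialect = csv.Sniffer().sniff(text, delimiters=",\t")
    return list(csv.DictReader(io.StringIO(text), dialect=dialect))
```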