I’ve been using RethinkDB as a database backend, and it’s quite nice; changes to the database can trigger events.
If you’re going the Jinja2 route, why not use Flask; it makes routing nice and easy…
Thanks, Simon!
Our thinking has been to build a static site. In this case, we would be processing all of the data in one go and building all the pages, rather than querying a database. Would you still advocate RethinkDB in that case?
The same applies to Flask, which IIUC is for dynamic sites, though Brian has taken a look at Frozen Flask. My feeling is that if we are already going through the data to build simplified CSV/FASTA downloads, the same code path could be used to generate HTML directly.
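To sketch what that single code path might look like (the record fields and markup here are hypothetical, not our actual schema), one pass over the data can emit the CSV, FASTA, and HTML outputs together:

```python
import csv
import html
import io

def build_outputs(records):
    """Walk the records once and produce CSV, FASTA, and HTML strings.

    `records` is a list of dicts with hypothetical 'name' and 'sequence' keys.
    """
    csv_buf = io.StringIO()
    writer = csv.DictWriter(csv_buf, fieldnames=["name", "sequence"])
    writer.writeheader()

    fasta_lines = []
    html_rows = []
    for rec in records:
        writer.writerow(rec)  # simplified CSV download
        fasta_lines.append(f">{rec['name']}\n{rec['sequence']}")  # FASTA download
        html_rows.append(  # row for the static HTML page
            f"<tr><td>{html.escape(rec['name'])}</td>"
            f"<td>{html.escape(rec['sequence'])}</td></tr>"
        )

    page = "<table>\n" + "\n".join(html_rows) + "\n</table>"
    return csv_buf.getvalue(), "\n".join(fasta_lines), page

csv_text, fasta_text, html_text = build_outputs(
    [{"name": "IGHV1-2*02", "sequence": "CAGGTGCAGCTG"}]
)
```

In a real build the HTML row would presumably go through a Jinja2 template rather than string formatting, but the point is the same: one iteration over the data feeds all three outputs.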
But I’m all ears!
I agree with Erick - isn’t the whole point to have a master list in one place and then let more advanced users subset as they see fit? Then they can describe in their methods which version they pulled from us and what other processing they applied. Perhaps we could have a way for people to upload custom lists of rules so that they can easily be reused by others and/or applied to different versions/species, etc?
Dear @ematsen,
I wasn’t quite sure what you were thinking when it comes to generating the static files. If one was going to do this a lot, then it would make sense to have a database that triggered an event to process all the files. As for Flask, it makes the routing easier (it doesn’t have to be dynamic content, as in Frozen Flask). I used DocPad in the past, which seemed to work well…GitHub uses Jekyll…there are lots of options.
Hi Simon,
I’ve been working with Erick to prototype some of the ideas being discussed here. So far I’ve just been using Jinja to fill in some HTML templates to produce a static site. I agree that Flask makes it easier to structure the code in a way that reflects the hierarchy of the website, and Frozen Flask means we can get the benefits of routing decorators and still have a static site. I haven’t used Frozen Flask before but I’m interested to test it.
There are two potential user groups, I think. One group are experts, who will want to see every sequence deposited for a species and make their own minds up about which should be included in a particular analysis. The others are non-experts, who just want a good ‘general’ germline set to use for their analysis. Are we aiming our site at both, or are we aiming it at the experts, who will then in turn make recommendations about which subset should be used for particular purposes, and publish those subsets on their own site?
If we intend non-expert users to make use of our site, I think we will need some form of subsetting or selection in addition to filtering on the fields available. Suppose that a longer sequence for an existing gene is deposited. We don’t want to delete or replace the older sequence, because its origin, evidence and so on is important, and it will have been used in previous analyses. But we don’t want both sequences showing up in a germline set being sent to a parser. Or suppose that a previously deposited gene is found to be invalid, for some reason, or incorrectly sequenced.
I suppose we could add a flag to each gene to indicate whether it should be included in a germline set for parsers, but if it’s just one flag, who makes that call? And if we accept that experts are going to make subsets for their own purposes, don’t we think it’s possible that some of those subsets might be useful for the wider community?
I think it would be very helpful to have some input on this point from users and curators.
Thanks Erick.
For naming and numbering:
If we take the view that this number is unlikely to grow very quickly and that there is general community consensus around them, I could collapse the multiple tables in the schema, and just add a few more columns. If the community subsequently decided to support a different numbering scheme or even an additional name, it would be easy enough to add a new column.
Again I think this is an area where it would be very useful to hear from potential users and curators.
Are we aiming our site at both
Definitely.
If we intend non-expert users to make use of our site, I think we will need some form of subsetting or selection in addition to filtering on the fields available.
Agreed - I think this is included in the Design Goals that @ematsen mentioned at the top of the post. At the very least, there should be a (default) “high confidence” set and a “kitchen sink” set. Updated and deprecated genes will move in and out (respectively) of the high confidence set as necessary, which is why we need transparent version control. And the more inclusive levels will still retain this information explicitly for advanced users. It does involve a certain amount of us making a call, but that’s part of the mandate; it’s not a single opaque flag that constrains advanced users in any way. This is what I meant by people using (and sharing) lists of rules.
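As one hypothetical sketch of what such shareable rule lists could look like (the field names and thresholds below are invented for illustration, not part of any agreed schema), each rule could simply be a named predicate applied to every gene record:

```python
# Each rule is a (label, predicate) pair over a gene record (a dict).
# Field names like 'status' and 'evidence_level' are hypothetical.
HIGH_CONFIDENCE_RULES = [
    ("not deprecated", lambda g: g.get("status") != "deprecated"),
    ("strong evidence", lambda g: g.get("evidence_level", 0) >= 2),
]

def apply_rules(genes, rules):
    """Return the subset of genes passing every rule in the list."""
    return [g for g in genes if all(pred(g) for _, pred in rules)]

genes = [
    {"name": "IGHV1-2*02", "status": "active", "evidence_level": 3},
    {"name": "IGHV9-99*01", "status": "deprecated", "evidence_level": 1},
]
high_confidence = apply_rules(genes, HIGH_CONFIDENCE_RULES)
```

A rule list like this could be versioned and shared alongside the data, so that another group can apply the same subsetting to a different release or species.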
Suppose that a longer sequence for an existing gene is deposited. We don’t want to delete or replace the older sequence, because its origin, evidence and so on is important, and it will have been used in previous analyses.
This is an interesting scenario. My instinct is that we should have “superseded by” and “replaces” fields as part of the schema, and then retain both as separate entries in the “all inclusive” database; only the most recent/best supported version would be included in the “high confidence” database for general users, though. (Of course, this also touches on previous and ongoing discussions about naming conventions…)
I really like the GitHub idea. In line with providing easy-to-use curated sets, we could use git’s tagging mechanism to make “releases” for default versions people could use. We could also use git hashes to reference non-release versions that are also published to the database. (So if someone really needs to use the master version of the database, it should at least be reconstructable). We could also think of ways of using GitHub’s forking functionality to let people build their own custom germlines and make them available to others as well in a consistent manner.
At the moment, I am strongly in favor of a static site. Even if we have to rebuild the site once every few days, I think it’ll basically be a trivial task. That will also make routing simpler in my mind (because it’s static), and we could use GitHub pages to host the website reliably and for free.
Any thoughts on using Travis instead of Wercker? I get the feeling that it’s much more widely used, and I suspect that we’re not going to need the type of computation that requires containers for the website rebuilding.
Finally, it may make sense to transition to a dynamic site in the future, especially if the size of the data sets grows large enough to cause high latency (e.g., if each page load requires downloading many MB of data). But this is something we could switch out later without affecting the users or the URL routing.
I was also going to add that JSON can be output in either a compact or a human-readable (pretty-printed) manner, and I don’t know how serializers handle this across programming languages. We should check whether GitHub is smart about showing diffs of JSON objects. Separately, we could also consider YAML instead, as it also has very broad programming-language coverage, maintains all of JSON’s advantages (and in fact is a superset of JSON), and is generally far more human-readable, if that’s an important consideration.
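To illustrate the compact-versus-readable point with Python’s standard library (the YAML comparison is left as a comment, since it would need the third-party PyYAML package):

```python
import json

record = {"name": "IGHV1-2*02", "species": "human", "sequence": "CAGGTG"}

# Compact: one long line, hard for line-oriented diffs to show what changed.
compact = json.dumps(record, separators=(",", ":"), sort_keys=True)

# Pretty-printed: one key per line, so git/GitHub diffs highlight only the
# changed fields. sort_keys keeps the output stable across serializations.
readable = json.dumps(record, indent=2, sort_keys=True)

# PyYAML (third-party) would give similar line-per-key output:
#   yaml.safe_dump(record, default_flow_style=False)
```

Whichever format we pick, emitting it deterministically (sorted keys, fixed indentation) is what actually makes the GitHub diffs useful.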
Thanks everyone for your contributions to the discussion. I’m completely in agreement with @caschramm’s most recent comments.
- I think we would probably want to allow for at least two names - the name as it stands today, and a new name, assuming that the community decides to move away from the current naming, at least for some species.
- Likewise for numbering - really, field delineation and numbering - I think we would probably want to support at least the existing systems. I am aware of four, as detailed on bioinf.org.
OK, I’m listening, but perhaps this should become part of the schema discussion?
Thanks for your comments, @laserson, and I’m happy that you like the ways things are going.
Any thoughts on using Travis instead of Wercker? I get the feeling that it’s much more widely used, and I suspect that we’re not going to need the type of computation that requires containers for the website rebuilding.
We are using Wercker for the time being, but in practice it’s a small change if we want to use Travis in the future; it’s just running some Python scripts.
@cswarth has been working on this for a little while now and has been pushing it to his own private repository. Sooner or later as things mature it’ll be time to transfer the repos over to a shared GH organization. What do we want this to be? Do we want to start a germline DB organization, or use https://github.com/airr-community?
For the sake of not over-splintering, I would vote trying to centralize things at https://github.com/airr-community. Please let me know if you need access to the org.
Perhaps I’m quite late to the discussion, but here are my two cents. First, I must note that I very much like the idea of hosting the database at GitHub. I have had a great experience in terms of management, productivity, and troubleshooting with my own project (vdjdb), which uses issues to track submissions, pull requests to update the database, and continuous integration (Travis CI) to verify database integrity.
I have a couple of questions though:
Another important aspect of storing the germline sequences is the germline itself. It would be ideal if each sequence could be traced directly to the corresponding genome contig, i.e. genome assembly ID and genomic coordinates (e.g. in GTF format). There are many tasks where this is quite important, such as looking at genomic features shaping V-J recombination. I haven’t found anything related in the schema linked in the original post. Have you considered adding this feature?
Will the database be solely dedicated to B-cell receptors? I don’t think there should be any compatibility issues with TCRs.
Not dedicated to BCRs at all. @w.lees’ schema includes both. Did I slip up somewhere?
Will database submissions be handled only in the form of spreadsheets sent to a limited group of curators? I’m interested in whether someone would be able to make a pull request with a metadata file and a FASTA file with a specific header format, which would then be mapped and processed into a set of annotated JSON records. Such a scheme seems easier for a bioinformatician to use than spreadsheets. What is the metadata format?
For the non-programmers, we must use spreadsheets. Bioinformaticians can cook up spreadsheets too, and there’s a big advantage to having a single way in which data enters the DB.
Compatibility with IMGT and ImmunoSeq nomenclature (I’m unsure whether those two are one-to-one compatible) should also be discussed. I think most RepSeq data to date uses this format, and needless to say, lots of immunologists will be very confused by a new nomenclature. So there should be a way to lift over legacy data to the new database.
There are copyright issues with doing this, see this thread.
Not dedicated to BCRs at all. @w.lees’ schema includes both. Did I slip up somewhere?
I refer to the title of this topic
For the non-programmers, we must use spreadsheets. Bioinformaticians can cook up spreadsheets too, and there’s a big advantage to having a single way in which data enters the DB.
Spreadsheets can become really cumbersome due to column formats, merged columns, annotations, etc. If the spreadsheet can be easily compiled from a tab-delimited file I think there will be no problems. I was considering a person who fetches germline sequences from e.g. RNA-Seq data and would prefer to automate the submission generation process.
There are copyright issues with doing this, see this thread.
Even for a list of closest IMGT segment names? That would mean that any database/clonotype table containing IMGT identifiers is copyrighted. Sui generis database rights are only applicable when a substantial portion of the database is used. Anyway, I think a script that downloads IMGT on the user’s side and matches IMGT segments to AIRR database segments could be a feasible solution, at least for legacy TCR sequencing data.
Even for a list of closest IMGT segment names?
There is a small chance that there is some form of IP rights on the names/naming scheme itself. I am currently in the process of ruling out these last 5% of uncertainty, but until then we should play it safe. Sui generis database rights are not a concern here, since we are building up the whole database from free and open sources de novo.
I refer to the title of this topic
Fixed now.
If the spreadsheet can be easily compiled from a tab-delimited file I think there will be no problems.
Yes, certainly. By “spreadsheet” I don’t mean .xls necessarily. Everything will be submittable via CSV or TSV.
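As a small sketch of the automated route (the column names here are hypothetical), a programmatically generated TSV could be mapped straight into JSON-ready records:

```python
import csv
import io
import json

# A hypothetical two-column TSV submission, as a bioinformatician might
# generate programmatically from, e.g., an RNA-Seq pipeline.
tsv_submission = "name\tsequence\nIGHV1-2*02\tCAGGTGCAGCTG\n"

def tsv_to_records(tsv_text):
    """Parse a TSV submission into a list of JSON-ready record dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]

records = tsv_to_records(tsv_submission)
print(json.dumps(records, indent=2))
```

The same parser would accept CSV by swapping the delimiter, which keeps the single-entry-point property Erick mentioned.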
Another important aspect of storing the germline sequences is the germline itself. It would be ideal if each sequence could be traced directly to the corresponding genome contig, i.e. genome assembly ID and genomic coordinates (e.g. in GTF format). There are many tasks where this is quite important, such as looking at genomic features shaping V-J recombination. I haven’t found anything related in the schema linked in the original post. Have you considered adding this feature?
The schema started off as the minimum subset of information required by germline parsers but has grown into a schema definition for the whole germline set. I’ll be adding traceability to the germline this weekend, along with other related items suggested by @werner.muller. I would be grateful for further advice from you and others about what is required in the way of evidence and metadata, though, because I feel I understand what parsers require quite well but am less confident about this area. Please feel free to contribute to the standardization thread.
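As one possible sketch of what traceability fields could look like in a record (every field name and value below is hypothetical and for illustration only, not the actual schema):

```python
# Hypothetical traceability fields tying a germline sequence back to a
# genome assembly and coordinates; all values are placeholders, not real data.
record = {
    "name": "IGHV-example*01",
    "genome_assembly": "GRCh38.p13",
    "contig": "chr14",
    "start": 1000,   # 1-based, inclusive, as in GTF
    "end": 1295,
    "strand": "-",
}

def to_gtf_line(rec, feature="V_gene_segment"):
    """Render a record as a single GTF-style line (9 tab-separated fields)."""
    attrs = f'gene_id "{rec["name"]}";'
    return "\t".join([
        rec["contig"], "germline_db", feature,
        str(rec["start"]), str(rec["end"]),
        ".", rec["strand"], ".", attrs,
    ])

gtf_line = to_gtf_line(record)
```

Keeping assembly ID plus coordinates in the record means a GTF export for genome-browser use falls out almost for free.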
William
Dear @cswarth
Would you mind putting your Jinja2 templates up somewhere? I’m working on some similar problems (with a real-time HIV surveillance system, modelled on NextStrain, HIVtrace and Microreact), and it would be great to share code for more generic components.
Best
Simon
Simon, our prototype Jinja templates are in the igdbweb repo on GitHub:
https://github.com/cswarth/igdbweb/tree/master/igdbweb/templates
These really are crude prototypes. You’ll see some non-functional display elements like breadcrumbs that I’m in the process of replacing with real Flask code.
Flask and Frozen-Flask work like a dream.
– Chris