Germline database website design

ematsen · August 6, 2016, 11:51pm

Along with the much more important decisions concerning what sequences should actually go into a next generation database, @bussec has suggested that we should start considering what such a database would look like.

He has also suggested that our organizing principle should be to give people what they want as quickly as possible. This is, for most users, a Fasta file with the germline sequences for their species of interest along with data about the sequences (see this topic for a discussion of what this might look like). I really like this idea, and would extend this in the following ways:

Defaults should be sane. The easiest thing to get to should be a high confidence set of germline sequences.
It should be clear what people are downloading.
It should be easy to change to other levels of evidence for the sequences (e.g. direct germline sequencing only or a more inclusive set).
It should be easy to access previous releases of the database.
There should be some standardized means by which computers (not humans clicking) can access the latest version of the database. This could be as simple as fixing a URL (e.g. http://new-database.org/human/high-confidence/latest).
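A minimal sketch of how a script might use such a fixed address. The host and the species/evidence-level/release path layout are taken from the example URL above; the function names, the release label format, and the assumption that the address serves a Fasta file are all hypothetical, since no such service exists yet.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical base address, taken from the example URL in the post.
BASE_URL = "http://new-database.org"


def release_url(species, evidence_level="high-confidence", release="latest"):
    """Build the address of one release of one germline set.

    The species/evidence-level/release layout is an assumption that
    mirrors the example URL.
    """
    parts = [quote(p, safe="") for p in (species, evidence_level, release)]
    return "/".join([BASE_URL] + parts)


def fetch_release(species, evidence_level="high-confidence", release="latest"):
    """Download a release as text (assumes the address serves a Fasta file)."""
    with urlopen(release_url(species, evidence_level, release)) as response:
        return response.read().decode("utf-8")
```

With a layout like this, the sane default is just `release_url("human")`, while an analysis that has to be reproducible could pin a release label instead of `latest`.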

Please contribute what you would like to see! Then we can think about how to make those operations easy.

[Note: @bgaeta and @a.collins have stated that they don’t want to lead this effort by extending IgPdb, though I note that their site does very quickly shepherd you to a Fasta download link.]

a.collins · August 7, 2016, 12:54am

Should evidence be a part of the database, or should evidence be stored separately? Should the database facilitate access to evidence?

In IgPdb, we try to provide access to evidence of the existence of inferred alleles, but we only provide a taste. So for example, a study may have generated hundreds or thousands of alignments that point to the existence of a particular new allele. We might provide a small set of sequences that highlight the fact that the putative polymorphism has been seen in association with diverse D genes and J genes. In some cases we might further show that the sequence has been seen in more than one individual. In other cases we have shown that the same sequence has been inferred in studies from more than one laboratory.

There was a particular need for IgPdb to have associated evidence. As it has functioned as a site for the deposition of sequences that IMGT would not accept, we have been saying: “So you don’t believe us? Well here is the evidence.”

If we can establish a database that is accepted as having the general backing of the AIRR community, then the need for linked evidence is different. If a nomenclature committee were deciding on the existence or otherwise of particular alleles, then presumably they would have a database of evidence, and a record of decision making.

Nevertheless, it is worth pondering the place of evidence as we comment on Erick’s post.

werner.muller · August 12, 2016, 4:06am

It would be easy to host a new database on my web page, and I am happy to provide a search page like the one for the VBASE2 database. Please have a look and see if anything else would be required.

The schema we used may be outdated and you might want to expand the schema for your database.

bussec · August 15, 2016, 2:37pm

I fully agree with the points made by @ematsen. However, although these germline datasets constitute the core of the website, I think that there are a number of additional things that we should address even though they do not directly impact the quality of the germline sequences. I should also point out that I consider the website to be a front-end, which will query one or multiple underlying databases to produce the requested datasets (i.e. the website server is not necessarily identical to the database server).
In my opinion the important question to ask is: If you went to the database website for the first time, would you find it interesting/appealing enough that you would consider using our datasets or even want to contribute to the database? Or would you think that there is no apparent difference from IMGT, so you may as well save your time and stick to the “standard”? Thus, I consider it worthwhile to discuss the following points:

The website should start from a “clean slate”, i.e. new domain name, new interface using state-of-the-art web design.
“What is [name_of_new_germline_database]?” section: Tell users in one minute why we set up this database, in positive terms (i.e. no IMGT-bashing: people might share our feelings about IMGT, but they are more likely to be looking for solutions than wanting to have their problems spelled out):

We want to provide a free and open germline segment reference DB, which everyone can use, no strings attached
We represent a diverse community of both experimental as well as computational scientists, who use this database in our daily work to analyze AIRR data. Our joint experience has been and will continue to be integrated into the structure and the content of this database.
We have transparent criteria for inclusion of datasets into the database [link_to_criteria_catalog].
We have an open governance structure, everybody is welcome to join [link_to_contribute_section].

“How to use” section: Not everyone compiles his or her own BLAST databases on a daily basis, so we should provide detailed how-tos for scenarios like:

You want to use the database with the web version of IgBLAST
You want to use the database with a stand-alone version of IgBLAST
Currently available engines that come pre-packaged/pre-configured with a current version of the database

“More about the project” section: All the details that people might be interested in, but that are too much for the “What is [name_of_new_germline_database]?” section:

Inclusion criteria catalog
Technical infrastructure
Links to code base
Links to the individual databases
Roadmap

“How to contribute” section: You think this is a project worth contributing to? Great, here are a number of things we need help with:

Testing and validation of existing datasets
Curation of existing datasets
Creation of new datasets (new species)
Coding (database or API)

“Who we are” section: Names and contact data of at least three people.

ematsen · August 20, 2016, 5:01pm

Hear hear! How should we proceed?

I propose that we use Bootstrap 4, which is a modern web framework designed by Twitter. It is the most popular web framework. The only downside from my perspective is that it doesn’t support Internet Explorer 8, which is also no longer supported (even with security updates) by Microsoft. If you insist that people should be able to use Windows XP to browse the database, speak now!

I also propose that the website should be static, meaning that all pages are prepared in advance rather than computed on the fly. This will allow us to host the site very cheaply and reliably on Amazon S3 or any other such service, with zero software dependencies. One can still have quite complex interaction with static sites, as shown by my colleague Trevor Bedford’s website nextflu.org.

Of course, all source code for the website will be public, liberally licensed, and kept under the AIRR GitHub organization.

Feedback, please!

werner.muller · August 20, 2016, 6:08pm

As long as it is one page per germline gene with a unique link, it would be fine with me.
I could then easily build a dnaplot search page like the one for VBASE2.

VBASE2 uses a Postgres database (which is very fast) to store and provide the VBASE2 entries.
The Postgres database is filled by a software script. It would be very easy to generate
static pages once the structure is known.

jianye · August 22, 2016, 6:46pm

Are you suggesting storing the database content statically (e.g. in files)? I am not tech-savvy, but it looks to me like the proposed content of the germline database is not simple (i.e. you have sequences, metadata, germline gene evidence, perhaps even user submission of new genes, etc.)… This can get quite complicated, and I’d store these in something like a SQL database so that web pages can be generated dynamically. It also looks to me like this task needs significant effort if you want a full-fledged, stable website. Too many times I have seen a useful tool or resource disappear after a postdoc left or a PI moved to a different job.

jianye · August 22, 2016, 7:20pm

I’d say yes, evidence is always needed regardless… it’s just the nature of science.

ematsen · August 24, 2016, 2:40am

Thanks for your thoughts.

I was thinking that we would have one page per germline gene as @werner.muller suggests, plus pages for germline gene databases at various levels of resolution (e.g. high-confidence set, etc).

I am not thinking that we will have user submission through the website. If this is something that others (@ctwatson?) are thinking, let me know as soon as possible.

People can download the database in the format we decide and do whatever queries they want to do on it locally using their programming language of choice. I don’t see any reason to have this functionality available through the website.
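As a rough illustration of such a local query, the sketch below parses a downloaded Fasta file and groups alleles by gene. It assumes headers of the form `>GENE*ALLELE` (as in IMGT-style names); the real download format is still to be decided, and the sequences shown are made up.

```python
from collections import defaultdict


def parse_fasta(text):
    """Return a dict mapping each header's first word to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequences may be wrapped over several lines; join them.
            records[name] += line.upper()
    return records


def alleles_by_gene(records):
    """Group allele names by gene, splitting names like 'IGHV1-2*02' at '*'."""
    genes = defaultdict(list)
    for allele in records:
        genes[allele.split("*")[0]].append(allele)
    return dict(genes)


# Made-up sequences, for illustration only.
EXAMPLE = """>IGHV1-2*01
caggtgcag
ctggtgcag
>IGHV1-2*02
caggtgcagctggtgcaa
>IGHV3-23*01
gaggtgcagctgttggag
"""

records = parse_fasta(EXAMPLE)
genes = alleles_by_gene(records)
```

Anything more elaborate (filtering by species, length, or annotation) would depend on what metadata ends up in the download.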

You are totally right re stability, but the idea is to get stability through simplicity and high-quality open-source code from a range of contributors. That said, I would guess that a grant application would happen at some stage.

ctwatson · August 24, 2016, 11:27am

@ematsen, I was not thinking that we would have user submission via website.

bussec · August 31, 2016, 12:54pm

Given the current OS usage shares, I think we can live with it. And unless IE8 (or IE6) goes retro, time should be on our side.

Just so I get it right: Statistics would still be updated regularly (e.g. every day), just not for every user while on the site? Otherwise the Influenza site looks nice.

I also do not think that there is any need for a direct submission form, as long as it is straightforward for people to get in touch. New data that is not fully annotated by the user needs curation anyhow, and those who can provide finished datasets will likely have other means to integrate them.

ematsen · August 31, 2016, 1:08pm

The site would get regenerated every time there is a change to the underlying database. Does that make sense? Or do you have some other meaning of “statistics” that I’m not following?

bussec · August 31, 2016, 1:51pm

No, there isn’t, “statistics” was referring to information on database records.

ematsen · September 1, 2016, 11:48pm

The current plan is to write the site generator in Python using the Jinja2 templating library.

Again, feedback welcome.
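To make the static-site idea concrete, here is a dependency-free sketch of a generator that writes one page per germline gene. It uses the standard library's `string.Template` as a stand-in for Jinja2, so it is not the planned implementation; the page layout, record keys, and file naming are hypothetical.

```python
import html
import tempfile
from pathlib import Path
from string import Template

# Stand-in for a Jinja2 template; the page layout is hypothetical.
GENE_PAGE = Template(
    "<!DOCTYPE html>\n"
    "<html><head><title>$name</title></head>\n"
    "<body><h1>$name</h1><p>Species: $species</p>"
    "<pre>$sequence</pre></body></html>\n"
)


def build_site(records, out_dir):
    """Write one static HTML page per germline gene record.

    `records` is a list of dicts with the hypothetical keys
    'name', 'species' and 'sequence'. Returns the paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        # Escape values so that record contents cannot inject markup.
        fields = {key: html.escape(str(value)) for key, value in record.items()}
        # '*' is awkward in file names, so replace it in the page name.
        page = out_dir / (record["name"].replace("*", "_") + ".html")
        page.write_text(GENE_PAGE.substitute(fields), encoding="utf-8")
        written.append(page)
    return written


# Made-up record, for illustration only.
records = [{"name": "IGHV1-2*02", "species": "human", "sequence": "CAGGTGCAG"}]
pages = build_site(records, tempfile.mkdtemp())
```

Rerunning `build_site` whenever the underlying database changes would regenerate the pages, and each gene keeps a stable, unique link derived from its name.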

bgaeta · September 3, 2016, 2:58pm

FYI, IgPdb uses a simple MySQL/PHP framework. As it stands it is a good framework to provide access to the germline repertoire sequences. Where it gets complicated (and is currently unfinished) is in having a submission system, especially for evidence. Entering a new allele is easy, but entering 100s/1000s of supporting sequences requires an infrastructure it doesn’t currently have. Our philosophy when it was designed was very much to present the evidence and allow searching based on “weight of evidence”, so users can download repertoire sets that satisfy their requirements in terms of number of supporting sequences/studies, etc.
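A “weight of evidence” selection of this kind can be sketched as a simple threshold filter. This is not IgPdb's actual schema or code: the field names, thresholds, allele names, and counts below are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlleleEvidence:
    """Evidence summary for one allele; all field names are hypothetical."""
    name: str
    supporting_sequences: int
    supporting_studies: int


def filter_by_evidence(alleles, min_sequences=1, min_studies=1):
    """Keep alleles that meet both minimum evidence thresholds."""
    return [
        a for a in alleles
        if a.supporting_sequences >= min_sequences
        and a.supporting_studies >= min_studies
    ]


# Made-up names and counts, for illustration only.
alleles = [
    AlleleEvidence("IGHV1-2*02", supporting_sequences=1200, supporting_studies=4),
    AlleleEvidence("IGHV3-23*99", supporting_sequences=15, supporting_studies=1),
]
strict = filter_by_evidence(alleles, min_sequences=100, min_studies=2)
```

Here `strict` keeps only the allele seen in many sequences across several studies, while the default thresholds keep both.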