
Nested vs flat data for VDJ rearrangements

There seems to be general consensus on two things: the desire to use CSV because it’s “easy”, and the desire to represent VDJ rearrangements as nested records. These two desires are inherently incompatible if we want to follow best practices (and I think we do). We can discuss some possible solutions to this problem here.

As an initial note, the nesting of the rearrangement records is mainly due to the fact that aligners will spit out multiple V, D, and J alignments, all of which are currently getting stored. One potential way out of this pickle is to limit the number of alignments to, say, 3 V, 3 D, and 3 J alignments. The corresponding fields would each be, for example, suffixed with a _1, _2, or _3. This might cover 99% of the use cases.
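To make that concrete, the V portion of such a file might look roughly like this (tab-separated; the field names are placeholders, not a proposal, and unused slots are simply left empty):

```
sequence_id	v_call_1	v_call_2	v_call_3
seq1	IGHV1-69*01	IGHV1-69*02	
seq2	IGHV3-23*01		
```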

Other alternatives are to use a text-based format that can support nesting, like JSON/XML. Or simply allow the reporting of only the best VDJ alignments.

Looking at some sample data I had handy, keeping the top three V calls would in fact do slightly better than covering 99% of cases. Also, if the samples are from a subject whose genotype has been inferred and that information is used to limit the allele assignments, the fraction of sequences needing more than three calls drops to zero in all but an extreme hypothetical case (a subject heterozygous at duplicated loci where the positions differentiating the alleles are mutated or not sequenced). I am not as familiar with how this would look for D segment assignments, though.

Nested V(D)J assignments are the most common use case, but nesting is occasionally relevant for other annotations as well; isotype, for example.

I would say in the case of nesting within a columnar flat file, it’s not really problematic if you define the rule that any “nests” are restricted to positional arrays of the same data type.

So, a column in a TSV file may contain either one value or an array of values. We use commas to delimit elements in the array and each item in the array must be the same data type. Provided you stick to the rules, then you don’t get any bad behavior and it’s trivial to parse. Though, it’s not as computationally efficient as having multiple columns.
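As a minimal sketch of how simple the parsing stays under this rule (the file and column names are assumptions for illustration, not part of any spec):

```python
import csv

def parse_cell(value, cast=str):
    """Split a comma-delimited cell into a typed list; a single value becomes a one-element list."""
    return [cast(v) for v in value.split(",")] if value else []

# Hypothetical file and column names, for illustration only.
with open("rearrangements.tsv", newline="") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        v_calls = parse_cell(row["v_call"])            # e.g. ["IGHV1-69*01", "IGHV1-69*06"]
        v_scores = parse_cell(row["v_score"], float)   # every element of an array shares one type
```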

TSV is chosen over CSV because:

  • Tabs do not occur in the free-text fields output by IMGT/HighV-QUEST and IgBLAST, and are generally rare in free text anyway.
  • Both IMGT/HighV-QUEST and IgBLAST use commas to delimit embedded arrays already, so it simplifies parsing their output to use commas as the array delimiter.
  • It’s easier to translate from FASTA/Q headers, where existing formats (including ours) delimit fields by either pipe or space characters, with commas reserved for embedded lists.

After our WG meeting, I took some time to review the SAM specification and see what design decisions they made. One noticeable thing is that they did not really design SAM to be human-readable: even though it is text (thus technically human-readable), they use many short codes, bitwise flags, etc., primarily to keep the file size small. Our challenge could be even greater, as we potentially have many more columns (SAM has 11 mandatory columns plus a variable number of optional tag fields). Should we utilize these same features to reduce size, or should we rely upon compression?

Another observation is that the SAM specification supports a dual approach to nesting. They use additional rows, but also have a large set of optional tags that can be packed into extra columns. Reads can align to multiple places in the (genome) sequence, and each of these alignments can be provided as a separate row. They use the FLAG field to indicate whether a row is the “primary” alignment or a “secondary” alignment; however, the flags are bitwise, so this isn’t immediately obvious without mentally decoding the flag. What do you think about using the multiple-row approach for nesting?
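For reference, the primary/secondary distinction they encode is a single bit of the FLAG field (0x100, “not primary alignment”), so the mental decoding boils down to a bitwise test:

```python
SECONDARY_ALIGNMENT = 0x100  # SAM FLAG bit meaning "not primary alignment"

def is_primary(flag: int) -> bool:
    """Return True if a SAM row is a primary alignment rather than a secondary one."""
    return not (flag & SECONDARY_ALIGNMENT)

print(is_primary(0))    # True  -> primary alignment
print(is_primary(256))  # False -> secondary alignment (0x100)
```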

SAM uses a leading “@” code to define header rows; I think we could use the same design for our metadata.

I think the TSV should be a convenience format with a simple, human-readable data structure. It should be limited in scope, so it doesn’t get messy.

For complex nesting and other such things I think we should use a more appropriate format (e.g., XML/JSON), rather than try to shoehorn it into a TSV.

I agree that nesting information in [ct]svs is unproblematic. My particular solution is a column for the best match, and then another column with information about all additional matches (in the form of a python dict written with the csv module). I like this, since the vast majority of the time you only look at columns that are super simple, but the extra information is there if you need it, in other columns that are easy to ignore.
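Roughly like this, with invented column names (just a sketch of my own convention, not a proposed format):

```python
import csv

# One easy-to-read column for the best match, one easy-to-ignore column for everything else.
rows = [{
    "unique_id": "seq1",
    "v_gene": "IGHV1-69*01",                           # best match
    "v_all_matches": repr({"IGHV1-69*01": 0.92,
                           "IGHV1-69*06": 0.07}),      # all matches, as a python dict string
}]

with open("annotations.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["unique_id", "v_gene", "v_all_matches"])
    writer.writeheader()
    writer.writerows(rows)
```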

I would, however, argue that we should go with [ct]svs rather than json/xml. I think you make a really good point that the two options represent a tradeoff between human readability on the one hand and a more natural representation of more complex information on the other. But I think BCR annotations are simple enough, and all of the existing code bases are immature enough (and there are so many of them), that it would be better to bias ourselves toward a format that makes debugging and converting between software frameworks less error-prone, and thus faster.

Even SAM, which seems to be much more mature than anything we’re using for BCR annotation, was crashing on me only a couple of days ago, and it was easy to debug only because the SAM file is human-readable enough to figure out where in the file the issue was.

Those seem like reasonable reasons to prefer tabs over commas, but I think an equally reasonable argument can be made for the other choice:

  1. Since tabs are special characters, they require escaping in many contexts, and the escaping differs between, say, emacs, vim, sed, bash, and zsh, which means it’s frequently broken.
  2. Tabs are visually hard to distinguish from spaces, which makes TSVs impossible to parse by eye when there are spaces in the column headers.
  3. Given that most of the discussion on b-t.cr seems to center around how we can all avoid having to use IMGT or IgBLAST in the near future, I’m not sure that backwards compatibility with their particular output formats should be a big criterion around which to design our future format.

My impression is that we should be developing not a convenience format, but the reference implementation of a “comprehensive” format for long-term storage and a broad set of analytics. I think TSV could potentially serve that function if we don’t ask too much of it.

Another option for TSV would be to denormalize the data, as @schristley points out is done in BAM. That’s because the core object stored in BAM is an alignment rather than a read. It would be as if each row in our TSV file were instead one of the VDJML segment matches, with all the read-related data replicated into the row. An extra complication here is that there would need to be separate records for Vs, Ds, and Js.

Two additional options here:

  1. If we look at the resulting CSV file as a database, then the “read id” field does not necessarily have to be a primary key. That is, each row could store a V, D, or J alignment, and there could be multiple such alignments per “read”. This would require some kind of group-by operation in order to analyze each read (see the sketch after this list). One way to optimize this is to require that all rows associated with a given read be adjacent in the file. However, I strongly oppose this, as putting strong constraints on the order of rows is a very un-relational thing to do, and it makes it harder for distributed tools to work on the files.

  2. Combining the two approaches more closely mirrors the structure of VDJML. We could support two different kinds of AIRR files: one file has one row per “read” and includes, say, the top V/D/J alignments; another file looks more like the file described in #1 above and contains all the additional alignments. This would also work nicely with an aligner that assigns probabilities to each possible germline. This way, the main summary file can be carried forward for most applications, while people can consume the more detailed alignment file if their work needs it. In my mind, this solution brings the best of both worlds, and it doesn’t require us to violate best practices by making CSV a nested format.
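As a sketch of the group-by from option 1, with assumed file and column names (nothing here is a proposed schema):

```python
import pandas as pd

# Each row of the hypothetical "alignments file" is a single V, D, or J alignment,
# so read_id repeats across rows.
alignments = pd.read_csv("alignments.tsv", sep="\t")

# Keep the best-scoring alignment per read and segment type, recovering something
# like the one-row-per-read summary.
best = (alignments
        .sort_values("score", ascending=False)
        .groupby(["read_id", "segment"], as_index=False)  # segment is "V", "D", or "J"
        .first())
```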


One point raised by @Daniel_Gadala-Maria is that if we choose a single “top” V, D, and J assignment for the primary file, there needs to be a consistent way to deal with ties. Perhaps randomly choosing could also count as “consistent”? I’m not sure whether this should simply be implementation-dependent, though.

For problem scale: in my experience running IMGT/HighV-QUEST on data from blood, 10-20% of assignments include ties. Inferring a subject’s Ig genotype and restricting assignments to that set of alleles reduces this to 1-3%, though that can still be a large number of sequences, and ties would still need to be dealt with. Here are some options that come to mind, if we really don’t want to accept multiple assignments, though I hope we can come up with something better:

  • Randomly choosing. Not my preference, since re-running the same data through the same pipeline would produce different results.
  • Using SHM targeting models. One could compare the mutations from the germlines to guess which assignment is more likely, but this would be somewhat more time-consuming and would not resolve cases where no sequence is available to make the call (e.g., allele-differentiating positions are in the junction).
  • Taking the alphabetically first allele (see the sketch below). Not great in that it would introduce a consistent bias, though lower-numbered alleles tend to be the most common (presumably they were identified earlier because they are more common).
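For the last option, a deterministic tie-break is essentially a one-liner (purely illustrative):

```python
def break_tie(tied_calls):
    """Pick the alphabetically first allele from a set of equally-scored calls.
    Note: plain lexicographic order is not the same as numeric gene order,
    but it is at least reproducible across runs."""
    return sorted(tied_calls)[0]

break_tie(["IGHV1-69*06", "IGHV1-69*01"])  # -> "IGHV1-69*01"
```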

Thinking about it more, this specific issue seems to be a bit out of scope for a file format spec, IMO. Any given tool can choose to break ties however it wants, and this could even be an option in the software. For example, if IgBLAST allowed the user to specify random tie-breaking vs. alphabetical tie-breaking, it seems strange to me that one of those options would produce invalid output while the other would be valid (especially when the file format would, in principle, be structurally compatible with reporting the data either way).

That makes sense to me. So then, the official format would require that there only be one call, but it is up to the user to figure out how to break ties in creating the file.

I think it might be unduly restrictive to limit the format to a single call. Often the calls are very close: the winner can differ by as little as a single nucleotide from the runner-up, and, given the substantial divergence from germline one can see in B-cell sequences, a single nucleotide can mean very little. There’s certainly room for downstream tools to improve their discrimination; genotyping has been mentioned, and clonal inference might also help. As tools improve, they are going to need access to more comprehensive information on calls and likelihoods. I think the overall format should allow for that. The information can always be filtered later (or by the parser itself) if the user is only interested in a single call per sequence.

Defining a nested format in JSON/XML, and then providing a flattened implementation in CSV/TSV, seems a reasonable way to go and doesn’t enforce an inherent limit on the number of calls that can be made.

Just to make sure, @w.lees, does your comment take into account the suggestion that tools should output two files? The first “primary” file would contain just the “top” alignments, while the second file would contain all the alignments.

Or are you on board with the 2-file approach, but you mean that the primary file should still have room for, say, the top 2 or 3 alignments?

Sorry, I see the suggestion now. But it looks as though the two files you suggest could very easily have an identical format, in which case is it really necessary to mandate that a parser produce two files? The benefit seems slight compared to the additional complexity required for tools to be capable of reading and writing both of them, unless we’re suggesting that implementers just pick one format or the other, in which case there are compatibility issues that don’t really seem justified.

I’d suggest one file, and we leave it up to implementers to decide whether they will simply write out all the calls they make, or have some option to restrict the output to the top call, or top three, or whatever. Perhaps this makes things slightly more complicated for a consuming tool, if it’s only interested in the top call, but if that’s seen to be a problem we could have a flag to indicate the top call, and such a tool could simply ignore any unflagged records.

I don’t think they’re semantically identical. To make it easier, let me call the “primary” output file the “rearrangements file” and the “secondary” file with all the alternative alignments the “alignments file”.

In the rearrangements file, the read_id functions as a primary key and should be unique in the file, while in the alignments file, the read_id is not unique. I would also imagine that the schemas would likely not be identical (though they could be if we want) for a couple of reasons:

  • It would likely be wasteful to replicate the same read/rearrangement-associated fields into each row associated with the read. The alignments file may also be much larger, as there can be tens of rows per read. The alignments file is more “specialized” anyway, so it would not necessarily need the extra fields. And it should be easy enough to join the two files if necessary, e.g., using a data.frame in R, pandas in Python, or (Py)Spark for huge files (see the sketch after this list).
  • I could imagine that if we were defining different types for V, D, and J alignments, they might actually have slightly different schemas, so even having a single file designed for “alignments” is a bit of an abuse from a pure data modeling perspective. You could imagine you’d want a separate file for each. Put another way, each row in the alignments file would be either a V alignment, D alignment, or J alignment.
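Here is the kind of join I mean, sketched with pandas and assumed file and column names:

```python
import pandas as pd

rearrangements = pd.read_csv("rearrangements.tsv", sep="\t")  # read_id unique per row
alignments = pd.read_csv("alignments.tsv", sep="\t")          # read_id repeats per alignment

# Attach every alternative alignment to its read; overlapping columns from the
# alignments file get an "_alt" suffix.
merged = rearrangements.merge(alignments, on="read_id", how="left", suffixes=("", "_alt"))
```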

I would also prefer to have just a single file as output, but to successfully do that and capture the multiple alignments, the structure of the record for a single read would have to include 3 nested subtables (one each for V, D, and J), and this is not possible to do with CSV (unless you do things that are generally considered dangerous).
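For concreteness, this is the kind of per-read record I mean, written as a python dict (field names are only illustrative); JSON/XML can hold this directly, while a flat CSV row cannot:

```python
# One read's record with three nested subtables, one per segment type.
record = {
    "read_id": "seq1",
    "v_alignments": [{"call": "IGHV1-69*01", "score": 210.3},
                     {"call": "IGHV1-69*06", "score": 208.1}],
    "d_alignments": [{"call": "IGHD3-22*01", "score": 31.2}],
    "j_alignments": [{"call": "IGHJ4*02", "score": 83.0}],
}
```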

The way I imagine it, an analysis tool writing out this data would, by default, output only the rearrangements file; adding a flag would make it output both the rearrangements file and the alignments file. Any tool implementer would need to support both, but since it’s just simple CSV, that should be straightforward.

That said, if you do really think that many people would at least want access to the top 2 or 3 alignments for V, D, and J, that could easily be supported in the rearrangements file with fields like v_gene_1, v_gene_2, etc.

Finally, I agree that the rearrangements and alignments files could have identical schemas. It would just mean that people would have to be more careful about interpreting the results in such a file. You would also end up reproducing all the read-associated fields for each individual V, D, and J alignment for that read.

Great to see this active discussion.

@laserson, in your scheme of separating the rearrangements from the alignments, I’m not sure I understand how confidence would be represented. In my mind, a level of confidence for a given rearrangement is a function of the entire thing: the V/D/J genes, amount of trimming, etc. With your scheme, would we be able to assign levels of confidence to entire rearrangements? Sorry if I didn’t follow how it would work.

From the Change-O perspective, all this is really about is converting the current data structure into a relational one. So assuming we just kept all the Change-O fields, anything that would be representable there should be representable here.

I’m not sure that there is such a thing as ‘all’ the alignments. In the parsers I’m familiar with, mainly IMGT and IgBLAST, the user specifies how many alignments they want. In IMGT this can range from 1 to 20 and is 5 by default. In IgBLAST the range is 1 to 200, with 10 by default. The size of the output is entirely under the user’s control. I can’t see more than 10 being required, except in very special cases, where the number of reads would probably be small.

IMGT’s csv format only provides the highest-confidence alignment (it makes a fairly arbitrary choice if there are multiple alignments with the same highest confidence). Within that alignment, it’s possible that there are multiple V genes, say, with the same confidence. If so, it puts the candidates in a list within the cell. We can call that a table if we wish, but it fits pretty conveniently into the csv format as far as I’m concerned. The point of using the csv format is to make the file readable, and I think it works well from that point of view. Similarly, I don’t see an issue with multiple records having the same read_id. Again, it works from the point of view of readability.

If we were to have two file formats, I’d suggest that one is this human-readable format, with a variable number of records per read, and the other is a JSON format. We could mandate one, and make the other optional (and I bet someone would write converters between the two pretty quickly).

All the best

William

Even though V(D)J assignment is the base content for this format, there are other annotations to consider. VDJML is probably the “purest” in the sense that it includes assignment info and not much else; the extra stuff is metadata about the tool used, parameters, germline db, etc. Change-O’s file and VDJServer’s RepSum TSV both include many columns that are annotations above and beyond assignment. From the WG’s analysis of those three formats, there is some overlap, but each has unique fields.

When thinking about the two files suggested by @laserson, how should we handle these additional annotations? Should we provide them only for the “primary” assignment, or should they be provided for all assignments and thus exist in both files? What if the user decides that the primary assignment is not correct and wants to use a secondary assignment from the other file? If the secondary file doesn’t have those annotations, then there needs to be some mechanism to generate them and create a “primary” file with all the assignments the user wants to use for downstream analysis.

I’m tending to lean toward “simpler is better” with just a single file, avoiding multiple files with possibly different schemas, which might just make things harder for downstream tools.