Sponsored by the AIRR Community

Minimal core fields and derived fields for VDJ rearrangement data

Looking at some of the existing schemas for VDJ rearrangement data (Change-O, VDJML, etc.), it seems that many fields are “derived” fields, in that they could be computed from some smaller core set of fields. For example, if we have the coordinates of the start of CDR3 and the end of CDR3, we could compute the junction length. To what extent do we want to incorporate derived fields? Should we also mandate exactly how they are to be computed from the core fields?
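To make the core/derived distinction concrete, here is a minimal sketch in Python. The record type and its field names (`sequence`, `cdr3_start`, `cdr3_end`) are hypothetical and not taken from Change-O, VDJML, or any agreed schema. It also assumes 1-based, inclusive coordinates, which is exactly the kind of convention a mandated computation would have to pin down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreRearrangement:
    """Hypothetical core record; field names are illustrative only."""
    sequence: str    # nucleotide sequence of the rearrangement
    cdr3_start: int  # assumed 1-based, inclusive
    cdr3_end: int    # assumed 1-based, inclusive


def cdr3_length(rec: CoreRearrangement) -> int:
    """Derived field: length of the span between the two core coordinates."""
    return rec.cdr3_end - rec.cdr3_start + 1


def cdr3_sequence(rec: CoreRearrangement) -> str:
    """Derived field: the subsequence covered by the core coordinates."""
    return rec.sequence[rec.cdr3_start - 1:rec.cdr3_end]


rec = CoreRearrangement(sequence="ACGTACGTACGT", cdr3_start=4, cdr3_end=9)
print(cdr3_length(rec), cdr3_sequence(rec))
```

Whether a reported "junction length" should also count the conserved codons flanking CDR3 is a separate convention that this sketch does not settle. Under a 0-based or half-open convention the same two coordinates would give a different length, which is one argument for specifying the computation alongside the core fields.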

On a related note, most of the fields can be computed from annotations that are “lifted over” from the alignment to an associated germline sequence. Specifically, these are the locations of the boundaries of all the FWRs and CDRs. Are these few data points in fact sufficient to cover our needs? Is the germline group on board with ensuring that a “valid” germline set includes all the annotations we need?
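As a rough illustration of what "lifting over" could mean, the sketch below shifts germline region boundaries into query coordinates. It assumes a gap-free alignment and 1-based, inclusive coordinates. The function name `lift_over`, its parameters, and the boundary values are all hypothetical; a real implementation would have to handle insertions, deletions, and partial regions.

```python
def lift_over(germline_regions, germline_start, query_start):
    """Map germline region boundaries onto the query sequence.

    Assumes a gap-free alignment in which germline position
    `germline_start` is aligned to query position `query_start`.
    Regions that begin before the aligned portion are skipped.
    """
    offset = query_start - germline_start
    return {
        name: (start + offset, end + offset)
        for name, (start, end) in germline_regions.items()
        if start >= germline_start
    }


# Illustrative boundaries only, not taken from any germline set.
regions = {"FWR1": (1, 78), "CDR1": (79, 114)}
print(lift_over(regions, germline_start=1, query_start=11))
```

If the germline set guarantees these boundaries for every sequence, the per-rearrangement record only needs the alignment itself for the region coordinates to be recoverable.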

This is what we have done in VBASE2 and its associated VDJ rearrangement analysis program. All derived parameters are automatically computed on the fly and then sent back to the user. In the VBASE2 database we also store computed information so that users can query the database more quickly, but this information could also be generated on the fly without problems.

It therefore depends on which information you are after. For example, if you want to extract all CDR1 regions from all germline sequences, it is much faster to extract this information from the database than to compute all CDR1 sequences on the fly. If you only want this information for a few sequences, on-the-fly computation would be just as fast.
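The trade-off can be sketched as follows. This is not how VBASE2 is implemented: the records, names, and coordinates are made up, and a plain dictionary stands in for the database table of stored derived values.

```python
# Toy germline records; sequences and CDR1 coordinates are made up.
GERMLINES = {
    "IGHV-toy*01": {"sequence": "AAACCCGGGTTTAAACCC", "cdr1": (4, 9)},
    "IGHV-toy*02": {"sequence": "AAACCCGGATTTAAACCC", "cdr1": (4, 9)},
}


def cdr1_on_the_fly(name):
    """Derive CDR1 from the core fields each time it is requested."""
    record = GERMLINES[name]
    start, end = record["cdr1"]  # assumed 1-based, inclusive
    return record["sequence"][start - 1:end]


# Derive once and store, standing in for a precomputed database column.
CDR1_TABLE = {name: cdr1_on_the_fly(name) for name in GERMLINES}

# A bulk query reads the stored values; a single lookup can use either path.
print(sorted(CDR1_TABLE.values()))
print(cdr1_on_the_fly("IGHV-toy*01"))
```

Both paths return the same value, so storing a derived field is a performance choice rather than extra information, provided the stored copy is regenerated whenever the core fields change.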

In response to @laserson’s query, the germline groups are still working to define how we should deal with sequences supported by differing levels or differing quality of evidence, and we are still debating how high to set the bar for new polymorphisms inferred from VDJ datasets. I hope we can give some clarity in the next week or so.

After our call, it seems the consensus is to mandate reporting only of the core fields. We may endorse some derived fields, in which case we’ll mandate how to compute them from the core fields.

It seems like we’ll at least need to know the coordinates of all the FWR/CDR boundaries. I’m assuming this will be part of the germline spec?

The FR/CDR boundaries are indeed in the germline schema. I’ve done my best to ensure that all the information you would need is in the schema, but your review would be welcome.

Sorry I missed the call and discussion regarding core/derived fields @laserson, but here are my thoughts. I’m OK with mandating a set of core fields, but I would like to encourage the WG to standardize the name and format for as broad a set of derived fields as we can. If we don’t, then I think we will quickly get back to the same position as today, where each tool may define its own names, etc., with incompatible overlaps.
