Representation of alignments in VDJ data


We should decide on a format for storing alignments in the VDJ rearrangement format. Options include gapped alignments, CIGAR strings, and BTOP strings. What are the tradeoffs of each?


I like that BTOP includes the nucleotide mismatch information and, unlike CIGAR, does not need a separate associated piece of information (the start position). Either would probably save file space compared to a gapped alignment, but storing the gapped sequence seems like it would make it immediately useful for comparing against other sequences, as long as the gap scheme is conserved across segments (as in IMGT).
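To make the comparison concrete, here is a minimal sketch that encodes the same toy gapped alignment both ways. The function names and sequences are made up for illustration, and I am assuming the usual conventions: CIGAR written from the query's perspective with `M` covering both matches and mismatches, and BTOP mismatch/gap pairs written as query character followed by subject character.

```python
from itertools import groupby


def gapped_to_cigar(query: str, subject: str) -> str:
    """Encode a gapped pairwise alignment as CIGAR (assumed query perspective)."""
    if len(query) != len(subject):
        raise ValueError("gapped sequences must have equal length")
    ops = []
    for q, s in zip(query, subject):
        if q == "-":
            ops.append("D")  # base present only in the subject/germline
        elif s == "-":
            ops.append("I")  # base present only in the query
        else:
            ops.append("M")  # aligned column, match or mismatch
    # Collapse runs of identical operations into <count><op>.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def gapped_to_btop(query: str, subject: str) -> str:
    """Encode a gapped pairwise alignment as BTOP (assumed query-then-subject pairs)."""
    if len(query) != len(subject):
        raise ValueError("gapped sequences must have equal length")
    parts = []
    matches = 0
    for q, s in zip(query, subject):
        if q == s and q != "-":
            matches += 1  # extend the current run of identical columns
        else:
            if matches:
                parts.append(str(matches))
                matches = 0
            parts.append(q + s)  # mismatch or gap: keep both characters
    if matches:
        parts.append(str(matches))
    return "".join(parts)


# Toy alignment, not real VDJ data.
gapped_query = "ACGTTGCA-T"
gapped_subject = "ACGATG-AGT"
print(gapped_to_cigar(gapped_query, gapped_subject))  # 6M1I1M1D1M
print(gapped_to_btop(gapped_query, gapped_subject))   # 3TA2C-1-G1
```

Note that the CIGAR string says nothing about the T/A mismatch in column 4, while the BTOP string records both bases.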


As long as the storage format is not “lossy” and allows reconstructing the full gapped alignment unambiguously there’s no issue with using it for storage. But I do think that for human display purposes the full gapped alignment should be used (either pulled directly from the database if stored in that format or reconstructed on the fly from whatever compressed format is used). Comparing multiple sequences aligned to the same reference sequence is less prone to errors when eyeballing the full alignment.
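As a rough illustration of the "reconstructed on the fly" case, the sketch below rebuilds both gapped rows from a BTOP string plus the ungapped aligned portion of the query. The function name and example are hypothetical, and it assumes BTOP pairs are written query character first, then subject character, with `-` marking a gap.

```python
import re


def btop_to_gapped(btop: str, query: str):
    """Rebuild (gapped_query, gapped_subject) from BTOP and the ungapped aligned query.

    Assumes `query` starts at the first aligned query base.
    """
    # Tokens are either a run length of matches or a two-character pair.
    tokens = re.findall(r"\d+|[A-Za-z*-]{2}", btop)
    q_out, s_out = [], []
    pos = 0  # current position in the ungapped query
    for tok in tokens:
        if tok.isdigit():
            n = int(tok)
            segment = query[pos:pos + n]
            q_out.append(segment)  # matching columns: both rows are identical
            s_out.append(segment)
            pos += n
        else:
            q_char, s_char = tok
            q_out.append(q_char)
            s_out.append(s_char)
            if q_char != "-":
                pos += 1  # a query base was consumed (mismatch or insertion)
    return "".join(q_out), "".join(s_out)


# Toy example, not real VDJ data.
gapped_q, gapped_s = btop_to_gapped("3TA2C-1-G1", "ACGTTGCAT")
print(gapped_q)  # ACGTTGCA-T
print(gapped_s)  # ACGATG-AGT
```

The same reconstruction from CIGAR would need both the query and the germline sequence, since CIGAR does not carry the mismatched bases.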


I did a dry run of representing VDJ alignments in BTOP and CIGAR formats, comparing how each represents an actual VDJ sequence alignment. One question I have: should the alignments be represented from the perspective of the query sequence or the germline sequence? (Please see slides 8 and 9 for more clarification.)
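One way to see what the choice of perspective changes: under the usual conventions (an assumption here, not something taken from the slides), switching between query and germline perspective swaps `I` and `D` in CIGAR and swaps the two characters of each pair in BTOP. A small sketch with hypothetical helper names, handling only the `M`, `I`, `D`, `=` and `X` operations (no clipping):

```python
import re


def swap_cigar_perspective(cigar: str) -> str:
    """Swap the roles of query and reference in a CIGAR string."""
    swap = {"I": "D", "D": "I"}  # an insertion in one is a deletion in the other
    ops = re.findall(r"(\d+)([MID=X])", cigar)
    return "".join(f"{count}{swap.get(op, op)}" for count, op in ops)


def swap_btop_perspective(btop: str) -> str:
    """Swap the roles of query and subject in a BTOP string."""
    tokens = re.findall(r"\d+|[A-Za-z*-]{2}", btop)
    # Match counts are symmetric; two-character pairs are reversed.
    return "".join(tok if tok.isdigit() else tok[::-1] for tok in tokens)


print(swap_cigar_perspective("6M1I1M1D1M"))  # 6M1D1M1I1M
print(swap_btop_perspective("3TA2C-1-G1"))   # 3AT2-C1G-1
```

So either perspective carries the same information, but whichever is chosen needs to be fixed in the spec so that readers and writers agree.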


I have attached @Nishanth’s presentation on the topic as well.

VDJ_alignment_representations_BTOP_CIGAR_comparisons.pdf (256.5 KB)