I’m giving a course on repseq in Portugal, and I’d like just to use one pass of IgBLAST (rather than run migmap for vdjtools), then process the results in ChangeO and vdjtools. Matching up the fields, I have:
I’m assuming that CDR3nt is the CDR3_IMGT, and I can easily translate to get to the vdjtools CDR3aa. How does one go from Change-O to the Vend, Dstart, Dend, Jstart variables in vdjtools? I know it’s simple maths, but in my experience, 99% of bugs are due to indexing errors, and it looks a little fiddly.
In addition, it would also help in CDR3 analysis if these variables (Vstart, Dstart, Dend, Jstart) were added to the ChangeO format specification in some form, so naming is consistent.
In VDJtools, as well as in mitcr and migmap:
Vend - zero-based (in CDR3 coordinates, starting from first base of Cys) position of last V segment base
Dstart - zero-based position of first D segment base
Dend - zero-based position of last D segment base
Jend - zero-based position of first J segment base
You may choose to omit Vend..Jend fields while importing in VDJtools format. However this means that certain statistics (e.g. insert size) will return NA or -1.
Also note: in VDJtools format CDR3aa is translated in both V → J and J->V directions, stop codons are marked as “*” and incomplete codon (this will be in the mid of N-D-N region if there is a frameshift) is marked with “_”. See also: http://vdjtools-doc.readthedocs.io/en/latest/input.html#input
Can you please post an example file in Change-O format?
This is for the SEQUENCE_INPUT column (query). For the corresponding values in the germline sequences (reference) use the “GERM” columns instead of the “SEQ” columns.
N/P region end points are calculated in the same way, but we don’t have N/P start in the data standard, as these are just the V end or D end.
And yep, CDR3_IMGT is the CDR3 nucleotide sequence. We don’t have any translated columns in the data standard. If we need them, we just append them with alakazam::translateDNA.
That’s how I think the Change-O/VDJTools conversion should work. Though, I’d have to actually sit down with a VDJTools input file to make sure everything converts as I expect it to. I’m happy to do so if you post a VDJTools file as well.
I put a FASTQ from the Jiang et al. (2013) study (cleaned with pRESTO as per the example docs), along with the output of migmap, converted to vdjtools format as well as the IgBLAST and ChangeO files in a gist here.
Looks like I originally misunderstood what the columns VStart, DStart, DEnd, and JStart in VDJTools were. These fields are relative to the junction region, and not relative to the full length sequence, as I assumed. Meaning, the above conversions are wrong.
Also, in VDJTools the CDR3 refers to the junction region, not the IMGT CDR3 definition, so it includes the conserved residues flanking the IMGT CDR3. Hence, it’s two codons longer than the CDR3_IMGT column in Change-O. However, you can just map the JUNCTION column to the cdr3nt column to resolve this.
Here’s the conversion that seems to work. There are a few discrepancies in the conversion using the files @sdwfrost posted, but we think those are probably just D-segment alignment differences between MiGMAP and vanilla IgBLAST.
I posted a script to do the conversion on the Immcantation repo. Use like so:
> changeo2vdjtools.R file_db-pass.tsv
I didn’t do careful testing, so if you hit any snags let me know.
I’ll make an issue on the Change-O repo about adding this alignment info. There’s probably a way we can incorporate it as optional output without introducing too much clutter from redundant info.
Thanks for all the help and clarifications. I’ll do some further testing. In the meantime, can you clarify the differences in migmap vs. the IgBLAST output? I don’t quite see how migmap and IgBLAST could differ.
Migmap runs the analysis in parallel without using the BLAST -p option. The latter is to speed up the alignment of a single sequence when aligning against a large database. Migmap splits input in chunks and streams them to IgBlast, which appears to be faster than using -p option.