Convert between ChangeO and vdjtools formats?


#1

Dear All (esp @javh @mikhail.shugay)

I’m giving a course on repseq in Portugal, and I’d like just to use one pass of IgBLAST (rather than run migmap for vdjtools), then process the results in ChangeO and vdjtools. Matching up the fields, I have:

count=DUPCOUNT
frequency=DUPCOUNT/sum(DUPCOUNT)
CDR3nt=CDR3_IMGT
V=V_CALL
D=D_CALL
J=J_CALL

I’m assuming that CDR3nt is the CDR3_IMGT, and I can easily translate to get to the vdjtools CDR3aa. How does one go from Change-O to the Vend, Dstart, Dend, Jstart variables in vdjtools? I know it’s simple maths, but in my experience, 99% of bugs are due to indexing errors, and it looks a little fiddly.

In addition, it would also help in CDR3 analysis if these variables (Vstart, Dstart, Dend, Jstart) were added to the ChangeO format specification in some form, so naming is consistent.

Best
Simon


#2

Hello Simon,

In VDJtools, as well as in mitcr and migmap:
Vend - zero-based (in CDR3 coordinates, starting from first base of Cys) position of last V segment base
Dstart - zero-based position of first D segment base
Dend - zero-based position of last D segment base
Jend - zero-based position of first J segment base

You may choose to omit Vend…Jend fields while importing in VDJtools format. However this means that certain statistics (e.g. insert size) will return NA or -1.

Also note: in VDJtools format CDR3aa is translated in both V -> J and J->V directions, stop codons are marked as “*” and incomplete codon (this will be in the mid of N-D-N region if there is a frameshift) is marked with “_”. See also: http://vdjtools-doc.readthedocs.io/en/latest/input.html#input

Can you please post an example file in Change-O format?


#3

Dear @mikhail.shugay

Zero-based! Glad I asked. Thanks for the extra info about the translation too. If I post a FASTQ with IgBLAST and ChangeO output, will that be OK?

Best
Simon


#4

Hey @sdwfrost,

In Change-O, the indexing should be 1-based. Rather than start and end positions, we have start position and length. So:

Vstart ~ V_SEQ_START
Vend ~ V_SEQ_START + V_SEQ_LENGTH
Dstart ~ D_SEQ_START
Dend ~ D_SEQ_START + D_SEQ_LENGTH
Jstart ~ J_SEQ_START
Jend ~ J_SEQ_START + J_SEQ_LENGTH

This is for the SEQUENCE_INPUT column (query). For the corresponding values in the germline sequences (reference) use the “GERM” columns instead of the “SEQ” columns.

N/P region end points are calculated in the same way, but we don’t have N/P start in the data standard, as these are just the V end or D end.

And yep, CDR3_IMGT is the CDR3 nucleotide sequence. We don’t have any translated columns in the data standard. If we need them, we just append them with alakazam::translateDNA.

That’s how I think the Change-O/VDJTools conversion should work. Though, I’d have to actually sit down with a VDJTools input file to make sure everything converts as I expect it to. I’m happy to do so if you post a VDJTools file as well.


#5

Dear @javh @mikhail.shugay

I put a FASTQ from the Jiang et al. (2013) study (cleaned with pRESTO as per the example docs), along with the output of migmap, converted to vdjtools format as well as the IgBLAST and ChangeO files in a gist here.

Best
Simon


#6

Looks like I originally misunderstood what the columns VStart, DStart, DEnd, and JStart in VDJTools were. These fields are relative to the junction region, and not relative to the full length sequence, as I assumed. Meaning, the above conversions are wrong.

Also, in VDJTools the CDR3 refers to the junction region, not the IMGT CDR3 definition, so it includes the conserved residues flanking the IMGT CDR3. Hence, it’s two codons longer than the CDR3_IMGT column in Change-O. However, you can just map the JUNCTION column to the cdr3nt column to resolve this.

Here’s the conversion that seems to work. There are a few discrepancies in the conversion using the files @sdwfrost posted, but we think those are probably just D-segment alignment differences between MiGMAP and vanilla IgBLAST.

VEnd = V_GERM_LENGTH_IMGT - 312 + 3
DStart = VEnd + NP1_LENGTH
DEnd = DStart + D_GERM_LENGTH
JStart = DEnd + NP2_LENGTH

I posted a script to do the conversion on the Immcantation repo. Use like so:

> changeo2vdjtools.R file_db-pass.tsv

I didn’t do careful testing, so if you hit any snags let me know.

I’ll make an issue on the Change-O repo about adding this alignment info. There’s probably a way we can incorporate it as optional output without introducing too much clutter from redundant info.


#7

Dear @javh @mikhail.shugay

Thanks for all the help and clarifications. I’ll do some further testing. In the meantime, can you clarify the differences in migmap vs. the IgBLAST output? I don’t quite see how migmap and IgBLAST could differ.

Best
Simon


#8

There can be a few differences:

  1. Migmap runs IgBlast v 1.4.0, which is not the latest one. I’m still unsure what are the differences in 1.6.1 compared to 1.4.0, the changelog was not updated (ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/1.6.1/ChangeLog)

  2. Migmap runs the analysis in parallel without using the BLAST -p option. The latter is to speed up the alignment of a single sequence when aligning against a large database. Migmap splits input in chunks and streams them to IgBlast, which appears to be faster than using -p option.

  3. Migmap uses its own references (see https://github.com/mikessh/migmap/blob/master/src/main/resources/segments.txt). If not using --all-alleles option with Migmap it will align only to *01 alleles, so the results can differ in terms of V/J and segment boundaries.

  4. Migmap will filter out reads without CDR3, out-of-frame reads, reads with incomplete CDR3 and reads with low-quality mutations/N-nucleotides by default (see options here https://github.com/mikessh/migmap#general). So some reads and clonotypes can be dropped from the output.

Best regards,
Mike