Duplicate of: github.com/lindenb/jvarkit/issues/190
Hi can anyone help me resolve the issue I’m having with sam2tsv.
It is a nifty piece of software but I have been encountering issues with it regarding the numbering of nucleotides it shows for the reference sequence.
Here’s what sam2tsv tells me:
The nucleotide string marked in red CTGGCCGAGCTAG
is the read reporting the mutation T>A (line #469).
But the reference sequence listed by sam2tsv (green box) doesn’t match the read sequence at all. In fact it is correct sequence from the reference fasta, but it is right-shifted by 16
bases. With other reference files, this number is different, e.g. 20.
In fact, if I search the sequence ATGGAGACCCGCT
in my reference sequence it spans 356-368
. In contrast sam2tsv lists these residues in the range 339-352
(as seen in the green box). Off by exactly
To summarize:
1) The position listed in the green box (reference), corresponds to the sequence in the red box (read).
2) The reference sequence listed in the green box, is off by 16 bases.
Information:
PacBio HiFi CCS reads aligned using pbmm2
to custom reference (cDNA).
Files:
sam containing the read in question
Command used:
java -jar '~/Git/Others_Cloned/jvarkit/dist/sam2tsv.jar' -R ../../../reference/pBG-ERBB2/ERBB2.fa 1.ERBB2_Library.bam
Read more here: Source link