bam – Detect mutation context in a read of a sam file

That kind of custom fiddling with reads and variants is very cumbersome, non-standard and also error-prone. Do a standard variant callign pipeline and then filter for the mutations that you want. Then extract the variant position (so the coordinates) and get the variant context from the reference genome. Using individual reads for this is not advisable as only a properly called variant is reliable. The power of NGS is the validation of an observation by redundancy (=many reads convering a variant). With individual reads your variant can simply be noise, there is hardly a way to tell. Alignment will confidently map the read to the “context” as you call it, so in my opinion there is not really an advantage of calling the “context” directly from individual reads. You could better extract every confident C>A mutation, then get the three bases up/downstream and then make some kind of position frequency matrix to create a motif.

Example, just for illustration, imagine these were for variants (A in the middle) with three bases up/downstream:

CTAATTG
CTCATAG
CTTATAG
CCTAGTG

The motif would then be this, you can easily see whether there is some conservation in the “context” of your variant.ad

enter image description here

The variant calling you could do with an end-to-end pipeline such as nf-core sarek, this will eventually give you a VCF file(s) with the variant coordinates. From there simply filter for the variants you want, get the coordinate, add three bases up/downstream to start/end and then use something like bedtools getfasta to get the sequences from the reference genome.

Read more here: Source link