I’m looking at RNAseq data from CCLE. The data is paired-end.
Take the cell line Hs578T and the gene HRAS as an example.
The cell line carries a G12D mutation (c.35G>A), so the change in the CDS is:
ggc ggtgtgggca agagtgcgct g - Wildtype CDS
gAc ggtgtgggca agagtgcgct g - Mutant CDS
 ^
When I grep for the mutant CDS gAcggtgtgggcaagagtgcgctg, I do not get a match in my .bam file. But when I grep for the reverse complement (i.e., the anti-sense sequence), I get matches coming from both mates.
My question is: why are there reads in the .bam file that correspond to the anti-sense strand of the DNA? Shouldn’t all reads derived from mRNA match the CDS?
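I got the answer:
The punch line is that, for aligned reads, the sequence stored in a .bam file is ALWAYS reported on the ‘+’ strand of the reference, no matter which mate the read came from or which strand the gene sits on. HRAS sits on the ‘-’ strand, so its CDS is the reverse complement of the reference, and reads covering it appear in the .bam file as the reverse complement of the CDS.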
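To make this concrete, here is a small Python sketch of how the SAM FLAG bit 0x10 relates the stored sequence to the read as it came off the sequencer, and of a search that checks both orientations. The two (FLAG, SEQ) records are made up for illustration and are not taken from the CCLE data; the function names are my own.

```python
REVERSE_FLAG = 0x10  # SAM FLAG bit: SEQ is stored reverse-complemented

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def as_sequenced(flag: int, seq: str) -> str:
    """Undo the aligner's flip to recover the read as originally sequenced."""
    return reverse_complement(seq) if flag & REVERSE_FLAG else seq


def matches_either_strand(seq: str, motif: str) -> bool:
    """True if the motif or its reverse complement occurs in the stored SEQ."""
    seq, motif = seq.upper(), motif.upper()
    return motif in seq or reverse_complement(motif) in seq


mutant_cds = "GACGGTGTGGGCAAGAGTGCGCTG"

# Hypothetical (FLAG, SEQ) pairs for two mates covering the mutation site.
# Both SEQ values are on the '+' strand, so both contain the anti-sense motif.
records = [
    (99, "TTCAGCGCACTCTTGCCCACACCGTCGG"),   # first mate, aligned forward
    (147, "CAGCGCACTCTTGCCCACACCGTCGGCG"),  # second mate, aligned reverse
]

for flag, seq in records:
    stored_hit = mutant_cds in seq
    original_hit = mutant_cds in as_sequenced(flag, seq)
    print(flag, "stored:", stored_hit, "as sequenced:", original_hit,
          "either strand:", matches_either_strand(seq, mutant_cds))
```

In this toy example neither stored sequence contains the sense CDS, yet the mate flagged 0x10 does contain it once flipped back to its sequenced orientation. That is the same pattern as in the question: the match only appears for the reverse complement, and it appears for both mates.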