Parts of the sequences are given below.
Reference sequence (pre-alignment):
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGT
Reference sequence (post-alignment), with the sample sequence (post-alignment) below it:
--------------------------------------------------------------------------------------------------attaaa---------ggtt------------------tataccttc---------ccaggtaacaaa-------------ccaacc-----aactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgt
--------------------------------------------------------------------------------------------------------------------------------------------------------------taacaaa-------------ccaacc-----aactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgt
I’m adding an example that is simpler to interpret, as per a comment on this post.
Say the ref seq is aattaaatttgggggtttt and the sample seq is ttaaggggttaaatttgggggt--t. Then post-alignment, they will look like this:
--aa----ttaaatttgggggtttt
ttaaggggttaaatttgggggt--t
Now, my ref seq was indexed as follows:
index:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
ref:    a  a  t  t  a  a  a  t  t  t  g  g  g  g  g  t  t  t  t
I want the reference indexing to be conserved post-alignment as well, so that only the non-gap columns of the reference are numbered:
index:        0  1              2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
ref:    -  -  a  a  -  -  -  -  t  t  a  a  a  t  t  t  g  g  g  g  g  t  t  t  t
sample: t  t  a  a  g  g  g  g  t  t  a  a  a  t  t  t  g  g  g  g  g  t  -  -  t
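To make the desired mapping concrete, here is a minimal sketch of it. It assumes gaps are written as - and that both aligned strings have the same length. The function name ref_to_column_map is made up for this illustration and is not taken from any alignment library.

```python
def ref_to_column_map(aligned_ref, gap="-"):
    """Return a list whose i-th entry is the alignment column of reference base i."""
    return [col for col, base in enumerate(aligned_ref) if base != gap]


aligned_ref = "--aa----ttaaatttgggggtttt"
aligned_sample = "ttaaggggttaaatttgggggt--t"

col_of = ref_to_column_map(aligned_ref)

# Reference index 2 (the first t) sits in alignment column 8.
print(col_of[2])                      # 8

# Look up the sample base opposite each reference index.
sample_at_ref = [aligned_sample[c] for c in col_of]
print("".join(sample_at_ref))         # aattaaatttgggggt--t
```

The map is built in one pass over the aligned reference, and any sample aligned against the same reference can reuse it.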
I want to keep the indexing of the reference sequence conserved (i.e. the first base in the ref seq post-alignment is a, the second is t, the third is t, etc.), as standard software does, and then run some quick analysis on it, say to check for conservation of a (mono/di)-nucleotide at some positions. If anybody has some insight on how to do this in the most efficient way (memory-wise and time-wise), that’d be great. I use Python for my work.
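As a rough illustration of the kind of analysis meant here, the sketch below drops every column where the reference has a gap, so that position i of each projected string corresponds to reference base i, and then counts matches over a window of width 1 (mono) or 2 (di). It uses only the standard library, and an array-based version would likely be faster for large alignments. The helper names project_all and conservation are hypothetical, and the second sample is invented purely for the demonstration.

```python
from itertools import compress


def project_all(aligned_ref, aligned_samples, gap="-"):
    """Keep only the columns where the reference has a base (insertions are dropped)."""
    keep = [base != gap for base in aligned_ref]
    ref = "".join(compress(aligned_ref, keep))
    samples = ["".join(compress(s, keep)) for s in aligned_samples]
    return ref, samples


def conservation(ref, samples, ref_index, width=1):
    """Fraction of samples equal to the reference at ref_index .. ref_index + width - 1."""
    target = ref[ref_index:ref_index + width]
    hits = sum(s[ref_index:ref_index + width] == target for s in samples)
    return hits / len(samples)


aligned_ref = "--aa----ttaaatttgggggtttt"
aligned_samples = [
    "ttaaggggttaaatttgggggt--t",   # sample from the example above
    "--aa----ttaaatttgggggtttt",   # invented second sample, identical to the reference
]

ref, samples = project_all(aligned_ref, aligned_samples)
print(ref)                                       # aattaaatttgggggtttt
print(conservation(ref, samples, 2))             # 1.0  (mononucleotide at index 2)
print(conservation(ref, samples, 16, width=2))   # 0.5  (dinucleotide at indices 16-17)
```

Note that projecting onto the reference discards bases the sample has inserted relative to the reference, so this only answers questions phrased in reference coordinates.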