Getting coordinates for a given sequence motif

Getting coordinates for a given sequence motif

2

I would like to get all coordinates for a given series of short sequence motifs, e.g. GATTACA (and its reverse complementary TGTAATC) from a reference genome, but want to do it in a way that allows me to get the coordinates of the sampled motif (because I need to intersect these coordinates with other annotation).

My current approach is based on a strict blastn-short search of that motif. However, this is extremely inefficient, because the minimum length for the algorithm to work requires sequences of a minimum length of 15 bp, so I need to append all possible combinations of nucleotides to the flanks of the motif to reach that minimum length. As consequence, the blast search is not just on a couple or a few motifs, but on thousands of them. For shorter targets, more than inefficient is is plain impractical.

There must be a way to do this more straightforwardly, but have doubts for longer ones, and I am wondering if anybody here has a better idea on how to address this problem.

Thank you,

EDITED: what I want really is not sampling, but getting all genomic coordinates for these motifs.


genome


assembly

• 53 views

Read more here: Source link