Getting coordinates for a given sequence motif
I would like to get the coordinates of a series of short sequence motifs, e.g. GATTACA (and its reverse complement TGTAATC), in a reference genome, and I want to do it in a way that gives me the coordinates of each sampled motif (because I need to intersect these coordinates with other annotation).
My current approach is based on a strict blastn-short search for the motif. However, this is extremely inefficient: the algorithm needs query sequences of at least 15 bp to work, so I have to append all possible combinations of nucleotides to the flanks of the motif to reach that minimum length. As a consequence, the BLAST search is not on a handful of motifs but on thousands of padded queries. For shorter targets, this is not just inefficient, it is plainly impractical.
There must be a more straightforward way to do this, although I have doubts about how well it would work for longer motifs, and I am wondering if anybody here has a better idea of how to address this problem.
Thank you,
EDITED: what I really want is not sampling, but all genomic coordinates for these motifs.
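To make the goal concrete, here is a minimal Python sketch of the kind of output I am after. It assumes the sequence of one chromosome fits in memory as a plain string; the function names (`reverse_complement`, `find_motif_coords`), the toy sequence, and the `chrToy` label are all hypothetical, and I have not tried anything like this on a whole genome. Coordinates are 0-based, half-open (BED-style), with the strand on which the motif reads as given.

```python
import re

# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif_coords(chrom, sequence, motif):
    """Return sorted (chrom, start, end, strand) tuples for every exact
    occurrence of the motif or its reverse complement.
    Coordinates are 0-based, half-open."""
    sequence = sequence.upper()
    motif = motif.upper()
    targets = [(motif, "+")]
    rc = reverse_complement(motif)
    if rc != motif:  # a palindromic motif would otherwise be reported twice
        targets.append((rc, "-"))
    hits = []
    for target, strand in targets:
        # The lookahead makes the regex report overlapping occurrences too.
        for match in re.finditer("(?=" + re.escape(target) + ")", sequence):
            start = match.start()
            hits.append((chrom, start, start + len(target), strand))
    return sorted(hits)


# Toy example: one forward hit and one reverse-complement hit.
toy = "TTGATTACAGGTGTAATCAA"
for hit in find_motif_coords("chrToy", toy, "GATTACA"):
    print(*hit, sep="\t")
```

Tab-separated lines like these could then be intersected with other annotation, which is the part I actually need.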