Using python FlashText to do pattern matching in nucleotide sequences

Hi all,

I’m playing with the idea of using FlashText instead of regex to find patterns in nucleotide sequences. The idea came from the massive speed-up reported in the post below:

dev.to/vi3k6i5/regex-was-taking-5-days-to-run-so-i-built-a-tool-that-did-it-in-15-minutes-c98?ref=codebldr

My basic idea is this: given a sequence, say AGTCTCTCGCAGGTGCA, I want to scan through all reads in a FASTQ file and extract the coordinates where this sequence occurs. FlashText should be able to find it quickly and replace it with, say, -------------; I could then scan through the sequences and extract the start and end positions of those dashes (or something along those lines).
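
To make the coordinate-extraction step concrete, here is a minimal sketch that uses plain `str.find` rather than FlashText, so it runs without any third-party package. The function name `find_exact` and the two example reads are made up for illustration, and the reads are assumed to be already parsed out of the FASTQ file into a dict. It returns the start and end positions directly, so no replace-with-dashes pass is needed. One thing that may be worth checking before committing to FlashText: as far as I understand, it matches whole keywords delimited by non-word characters, so it may not report a motif embedded inside an unbroken read.

```python
def find_exact(pattern, read):
    """Return (start, end) pairs for every exact occurrence of pattern in read.

    Coordinates are 0-based and end-exclusive; overlapping hits are included.
    """
    hits = []
    start = read.find(pattern)
    while start != -1:
        hits.append((start, start + len(pattern)))
        # Resume one base further along so overlapping occurrences are found too.
        start = read.find(pattern, start + 1)
    return hits


# Hypothetical reads, assumed to be already extracted from a FASTQ file.
reads = {
    "read1": "TTAGTCTCTCGCAGGTGCAGG",
    "read2": "ACGTACGT",
}
pattern = "AGTCTCTCGCAGGTGCA"

coords = {name: find_exact(pattern, seq) for name, seq in reads.items()}
```

With these example reads, `coords["read1"]` holds a single hit and `coords["read2"]` is empty.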

But now comes the problem of fuzzy matching. FlashText can only handle exact matches (this is where regex wins), and as we know very well, nucleotide sequences can mutate (insertions, deletions, base changes). If FlashText really is that fast, however, maybe I could write a function that generates all possible combinations of indels and base changes (similar to, e.g., regex {e=5}) and then pass all of those strings to FlashText, in essence finding an exact match for every possible variant.
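
The variant-generation idea could be sketched as below, assuming a four-letter alphabet (ACGT, no N) and counting each substitution, insertion or deletion as one edit. The names `edits1` and `variants_within` are hypothetical, not taken from any package. The set grows very quickly with the number of allowed edits, so this is likely only practical for one or two edits on a short motif.

```python
ALPHABET = "ACGT"  # assumption: plain DNA, no ambiguity codes such as N


def edits1(seq):
    """Return every sequence reachable from seq by a single edit.

    The original sequence is included, because substituting a base
    with itself leaves it unchanged.
    """
    variants = set()
    for i in range(len(seq) + 1):
        left, right = seq[:i], seq[i:]
        for base in ALPHABET:
            variants.add(left + base + right)          # insertion before position i
        if right:
            variants.add(left + right[1:])             # deletion of position i
            for base in ALPHABET:
                variants.add(left + base + right[1:])  # substitution at position i
    return variants


def variants_within(seq, max_edits):
    """Return all sequences within max_edits edits of seq, including seq itself."""
    seen = {seq}
    frontier = {seq}
    for _ in range(max_edits):
        # Expand only the newly found sequences, and drop anything already seen.
        frontier = {v for s in frontier for v in edits1(s)} - seen
        seen |= frontier
    return seen


one_off = variants_within("AGTCTCTCGCAGGTGCA", 1)
```

Each string in `one_off` could then be handed to an exact matcher as a separate keyword, all mapped back to the original motif.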

My question is this: does such an approach sound feasible, and does anyone know of a Python package or other software that can generate these mutated strings?


python


bioinformatics


regex


flashtext
