Extract sequences from a fasta file with specific nucleotide repetition
In principle, regular expressions do not count occurrences exactly (at least easily) internally, you might think of /(ATC.*ATC){1,1}/
but that does not work.
So it is easier to use a perl “one-liner”, like so:
perl -e ' $x =()= "ATCGGGGGGATC" =~ /ATC/ig; print "pattern foundn" if $x == 2 '
This code is just for demonstration, you need to add file parsing and output behavior depending on your requirements.
In case the pattern could overlap, like in find exactly 2x ATCATC
and you want to include [ATC[ATC]ATC] as a valid hit, an additional check is required.
An alternative for making a regex that works in perl is to specify which pattern not to match in parts of the string, like so:
perl -e 'print "foundn" if "ATCATATC" =~ /^(?:(?!ATC).)*ATC(?:(?!ATC).)*ATC(?:(?!ATC).)*$/;'
where (?:(?!ATC).)*
matches everything except ATC.
You may need to take the reverse strand into consideration.
Both strands:
$ seqkit locate -i -p atc seqs.fa | column -t
seqID patternName pattern strand start end matched
Seq1 atc atc + 8 10 atc
Seq1 atc atc + 16 18 atc
Seq1 atc atc - 15 17 atc
Seq1 atc atc - 5 7 atc
Seq1 atc atc - 1 3 atc
Seq2 atc atc + 9 11 atc
Seq2 atc atc - 12 14 atc
Seq2 atc atc - 8 10 atc
Seq2 atc atc - 4 6 atc
Seq2 atc atc - 1 3 atc
$ seqkit locate -i -p atc seqs.fa
| csvtk freq -t -f seqID -nr
seqID frequency
Seq1 5
Seq2 5
# no desired sequences
Only forward strand:
$ seqkit locate -i -p atc --only-positive-strand seqs.fa | column -t
seqID patternName pattern strand start end matched
Seq1 atc atc + 8 10 atc
Seq1 atc atc + 16 18 atc
Seq2 atc atc + 9 11 atc
$ seqkit locate -i -p atc --only-positive-strand seqs.fa
| csvtk freq -t -f seqID -nr
seqID frequency
Seq1 2
Seq2 1
$ seqkit locate -i -p atc -P seqs.fa
| csvtk freq -t -f seqID -nr
| awk '$2 == 2'
| cut -f 1
| tee list.txt
Seq1
$ seqkit grep -f list.txt seqs.fa
[INFO] 1 patterns loaded from file
>Seq1
GATAGATATCGAATGATC
Traffic: 2439 users visited in the last hour
Read more here: Source link