Extract sequences from a fasta file with specific nucleotide repetition

Extract sequences from a fasta file with specific nucleotide repetition

2

I have a fasta file name seqs.fa with multiple sequences i.e.,

>Seq1
GATAGAT**ATC**GAATG**ATC**

>Seq2
GATGATAG**ATC**GATGC

I want grep/extract only those sequences having ATC repeated exactly 2 times like in Seq1.
How we can use grep/sed or {} method for this purpose?


linux


command

• 44 views

In principle, regular expressions do not count occurrences exactly (at least easily) internally, you might think of /(ATC.*ATC){1,1}/ but that does not work.

So it is easier to use a perl “one-liner”, like so:

 perl -e ' $x =()= "ATCGGGGGGATC" =~ /ATC/ig; print "pattern foundn" if $x == 2 ' 

This code is just for demonstration, you need to add file parsing and output behavior depending on your requirements.

In case the pattern could overlap, like in find exactly 2x ATCATC and you want to include [ATC[ATC]ATC] as a valid hit, an additional check is required.

An alternative for making a regex that works in perl is to specify which pattern not to match in parts of the string, like so:

perl -e 'print "foundn" if  "ATCATATC" =~ /^(?:(?!ATC).)*ATC(?:(?!ATC).)*ATC(?:(?!ATC).)*$/;'

where (?:(?!ATC).)* matches everything except ATC.

You may need to take the reverse strand into consideration.

Both strands:

$ seqkit locate -i -p atc seqs.fa | column -t 
seqID  patternName  pattern  strand  start  end  matched
Seq1   atc          atc      +       8      10   atc
Seq1   atc          atc      +       16     18   atc
Seq1   atc          atc      -       15     17   atc
Seq1   atc          atc      -       5      7    atc
Seq1   atc          atc      -       1      3    atc
Seq2   atc          atc      +       9      11   atc
Seq2   atc          atc      -       12     14   atc
Seq2   atc          atc      -       8      10   atc
Seq2   atc          atc      -       4      6    atc
Seq2   atc          atc      -       1      3    atc

$ seqkit locate -i -p atc seqs.fa 
    | csvtk freq -t -f seqID -nr
seqID   frequency
Seq1    5
Seq2    5

# no desired sequences

Only forward strand:

$ seqkit locate -i -p atc --only-positive-strand  seqs.fa | column -t 
seqID  patternName  pattern  strand  start  end  matched
Seq1   atc          atc      +       8      10   atc
Seq1   atc          atc      +       16     18   atc
Seq2   atc          atc      +       9      11   atc

$ seqkit locate -i -p atc --only-positive-strand  seqs.fa 
    | csvtk freq -t -f seqID -nr
seqID   frequency
Seq1    2
Seq2    1

$ seqkit locate -i -p atc -P  seqs.fa 
    | csvtk freq -t -f seqID -nr 
    | awk '$2 == 2' 
    | cut -f 1 
    | tee list.txt
Seq1

$ seqkit grep -f list.txt seqs.fa 
[INFO] 1 patterns loaded from file
>Seq1
GATAGATATCGAATGATC


Login
before adding your answer.

Traffic: 2439 users visited in the last hour

Read more here: Source link