Reverse complement of fasta file

Reading records separated by `>` is a nice idea, as it gives you a whole record at a time. However, here you want to process and merge the sequence lines but not the header, so you have to distinguish between kinds of lines. That is clearer when reading line by line.

A sequence line is specific: all caps and nothing else. A blank line separates the records. The remaining possibility is the header. The sequence is assembled by joining the lines that match its pattern, and once we hit a blank line it is processed and printed.

# Three-argument open with a lexical filehandle
open(my $fasta, '<', $file_name) or die "Can't open $file_name: $!";

# sequence, built by joining lines =~ /^[A-Z]+$/
my $sequence = "";

while (my $entry = <$fasta>)
{
    if ($entry =~ m/^[A-Z]+$/) {
        # Assemble the sequence from separate lines
        chomp($entry);
        $sequence .= $entry;
    }
    elsif ($entry =~ m/^\s*$/) { 
        # process and print the sequence and blank line, reset for next
        $sequence = reverse $sequence;
        $sequence =~ tr/ACGUacgu/UGCAugca/;
        print "$sequence\n";
        print "\n";
        $sequence="";
    }
    else { # header
        print $entry;
    }
}

# Print the last sequence if the file didn't end with blank line    
if (length $sequence) {
    $sequence = reverse $sequence;
    $sequence =~ tr/ACGUacgu/UGCAugca/;
    print "$sequence\n";
}

The `^` and `$` are anchors, for the beginning and end of the string. So the regex matching the sequence requires that the whole line be strictly caps. The other regex allows only optional whitespace, `\s*`, specifying a blank line.
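To see how the two anchored patterns sort input lines, here is a minimal sketch. The helper name `classify_line` and the sample lines are made up for illustration; only the two regexes come from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one line with the same tests, in the same order, as the loop
sub classify_line {
    my ($line) = @_;
    return 'sequence' if $line =~ /^[A-Z]+$/;   # capital letters only, start to end
    return 'blank'    if $line =~ /^\s*$/;      # empty or whitespace only
    return 'header';                            # anything else
}

# Made-up sample lines, each with its trailing newline as read from a file
for my $line (">seq1 sample\n", "ACGU\n", "ACGu\n", "   \n", "\n") {
    (my $shown = $line) =~ s/\n/\\n/;           # make the newline visible in the output
    print "\"$shown\" is classified as ", classify_line($line), "\n";
}
```

Note that `$` also matches just before a trailing newline, which is why the lines can be tested before `chomp`. The line `ACGu` falls through to the header branch because of its lowercase letter, so input with lowercase sequence lines would need a wider character class.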

The sequence processing is copied from the question.
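For reference, the two processing steps can be tried on their own. This is only a sketch: the sub name `reverse_complement_rna` and the sample string are hypothetical, while the `reverse` and `tr` lines are the ones used above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two steps used in the loop
sub reverse_complement_rna {
    my ($seq) = @_;
    my $rc = reverse $seq;          # scalar context, so the characters are reversed
    $rc =~ tr/ACGUacgu/UGCAugca/;   # swap A<->U and C<->G, preserving case
    return $rc;
}

print reverse_complement_rna("AACGU"), "\n";   # prints ACGUU
```

The assignment to a scalar matters: in list context `reverse` would reverse the order of its arguments instead of the characters. The `tr` list is written for RNA (U); for DNA input one would presumably use T in its place.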
