You really don’t need regular expressions for this.
header = None
length = 0
with open('file.fasta') as fasta:
for line in fasta:
# Trim newline
line = line.rstrip()
if line.startswith('>'):
# If we captured one before, print it now
if header is not None:
print(header, length)
length = 0
header = line[1:]
else:
length += len(line)
# Don't forget the last one
if length:
print(header, length)
This uses very basic Python idioms which should be easy to pick up from any introduction to text processing in Python. Very briefly, we remember what we have seen so far, and print what we remember before starting to remember a new record (which happens when we see a header line). A common bug is forgetting to print the last record when we reach the end of the file, which of course doesn’t have a new header line.
If you can guarantee that all sequences are on a single line, this could be simplified radically, but I wanted to present a general solution.
collections.Counter()
and a simple loop with state to the rescue.
I augmented your test data to have a multi-line sequence too, just so we can tell things work right.
import io
import collections
# Using a stringio to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())
counter = collections.Counter()
header = None
for line in data:
line = line.strip() # deal with newlines
if line.startswith(">"):
header = line[1:]
continue
counter[header] += len(line)
print(counter)
This prints out
Counter({'header1': 60, 'header4': 16, 'header2': 10})
Do not reinvent the wheel. There are many specialized bioinformatics tools to handle fasta files.
For example, use infoseq
utility from the EMBOSS
package:
infoseq -auto -only -name -length -noheading file.fasta
Prints:
header1 30
header2 10
header3 16
Install EMBOSS
, for example, using conda
:
conda create --name emboss emboss
From the man page:
infoseq -help
-only boolean [N] This is a way of shortening the command
line if you only want a few things to be
displayed. Instead of specifying:
'-nohead -noname -noacc -notype -nopgc
-nodesc'
to get only the length output, you can
specify
'-only -length'
-[no]heading boolean [Y] Display column headings
-name boolean [@(!$(only))] Display 'name' column
-length boolean [@(!$(only))] Display 'length' column
Read more here: Source link