How do I find all Sequence Lengths in a FASTA Dataset without using the Biopython

You really don’t need regular expressions for this.

header = None
length = 0
with open('file.fasta') as fasta:
    for line in fasta:
        # Trim newline
        line = line.rstrip()
        if line.startswith('>'):
            # If we captured one before, print it now
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
        else:
            length += len(line)
# Don't forget the last one
if length:
    print(header, length)

This uses very basic Python idioms which should be easy to pick up from any introduction to text processing in Python. Very briefly, we remember what we have seen so far, and print what we remember before starting to remember a new record (which happens when we see a header line). A common bug is forgetting to print the last record when we reach the end of the file, which of course doesn’t have a new header line.

If you can guarantee that all sequences are on a single line, this could be simplified radically, but I wanted to present a general solution.

collections.Counter() and a simple loop with state to the rescue.

I augmented your test data to have a multi-line sequence too, just so we can tell things work right.

import io
import collections

# Using a stringio to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())


counter = collections.Counter()

header = None
for line in data:
    line = line.strip()  # deal with newlines
    if line.startswith(">"):
        header = line[1:]
        continue
    counter[header] += len(line)

print(counter)

This prints out

Counter({'header1': 60, 'header4': 16, 'header2': 10})

Do not reinvent the wheel. There are many specialized bioinformatics tools to handle fasta files.

For example, use infoseq utility from the EMBOSS package:

infoseq -auto -only -name -length -noheading file.fasta

Prints:

header1        30     
header2        10     
header3        16     

Install EMBOSS, for example, using conda:

conda create --name emboss emboss

From the man page:

infoseq -help
   -only               boolean    [N] This is a way of shortening the command
                                  line if you only want a few things to be
                                  displayed. Instead of specifying:
                                  '-nohead -noname -noacc -notype -nopgc
                                  -nodesc'
                                  to get only the length output, you can
                                  specify
                                  '-only -length'
   -[no]heading        boolean    [Y] Display column headings
   -name               boolean    [@(!$(only))] Display 'name' column
   -length             boolean    [@(!$(only))] Display 'length' column

Read more here: Source link