biopython – Help to create a dataframe in Python from a FASTA file

I want to create a dataframe in Python from a FASTA file. For the toy FASTA file shown below, I wrote a program that returns a dataframe with four columns (id, sequence length, sequence, and animal name) and one row per record in the file.

However, I am trying to understand how to modify this code so that the classes Human and Dog have the same number of rows in the dataframe. In other words, I want to tell Python: “Append the id, sequence length, sequence and animal for Human to the records list (which starts out empty), but only as many times as there are records in the smallest class (here, Dog)”.

I think a while loop is needed, but I am having some trouble working out how to write it. Any suggestions?

Below are the Python code I wrote and the FASTA file I used.

import re

import pandas as pd
from Bio.SeqIO.FastaIO import SimpleFastaParser


def read_fasta(file_path, columns):
    records = []  # one list per FASTA record
    with open(file_path) as fasta_file:
        # SimpleFastaParser iterates over FASTA records as (title, sequence)
        # string tuples: the title line without the leading ">" and the
        # sequence with any whitespace removed.
        for title, sequence in SimpleFastaParser(fasta_file):
            # Data cleaning: split the title into its fields, e.g.
            # "Numer|Human|Hearth" becomes ["Numer", "Human", "Hearth"].
            title_splits = re.findall(r"[\w']+", title)

            record = [
                title_splits[0],     # first column: ID
                len(sequence),       # second column: sequence length
                " ".join(sequence),  # third column: residues separated by spaces
            ]

            # Fourth column: the species
            if "Human" in title_splits:
                record.append("Human")
            else:
                record.append("Dog")

            records.append(record)
    return pd.DataFrame(records, columns=columns)


# Pass the file name (or the full path if the FASTA file is not in the working
# directory) as the first argument and the column names as the second.
data = read_fasta("Proof.txt", columns=["id", "sequence_length", "sequence", "animal"])
data

The FASTA format file is this:

>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Dog|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK
>Numer|Human|Hearth
HSSFIEIVNIEHVIEHIVK

My code prints a dataframe like:

       id  sequence_length                               sequence animal
0   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
1   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
2   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
3   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
4   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
5   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
6   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
7   Numer               19  H S S F I E I V N I E H V I E H I V K    Dog
8   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
9   Numer               19  H S S F I E I V N I E H V I E H I V K  Human
10  Numer               19  H S S F I E I V N I E H V I E H I V K  Human
11  Numer               19  H S S F I E I V N I E H V I E H I V K  Human

But I would like the number of rows for Human to be the same as for Dog (in other words, I would like the same number of records for each class).
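One possible way to get this without a while loop is to build the full dataframe first and then trim every class to the size of the smallest one. The sketch below is only an illustration under some assumptions: the helper name balance_classes is hypothetical, the toy dataframe is a stand-in for the output shown above, and keeping the first rows of each class (rather than a random subset) is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the dataframe printed above (assumption: same columns and order).
data = pd.DataFrame({
    "id": ["Numer"] * 12,
    "sequence_length": [19] * 12,
    "sequence": ["H S S F I E I V N I E H V I E H I V K"] * 12,
    "animal": ["Human"] * 4 + ["Dog"] * 4 + ["Human"] * 4,
})


def balance_classes(df, label_column="animal"):
    """Hypothetical helper: keep the first n rows of each class,
    where n is the size of the smallest class."""
    # Number of rows in the least frequent class (4 for Dog in the toy data).
    n_min = df[label_column].value_counts().min()
    # head(n) on a groupby keeps the first n rows of every group and
    # preserves the original row order.
    return df.groupby(label_column, sort=False).head(n_min).reset_index(drop=True)


balanced = balance_classes(data)
print(balanced["animal"].value_counts())
```

If a random subset of each class is preferred over the first rows, replacing head(n_min) with sample(n=n_min, random_state=0) on the same groupby should work in pandas 1.1 or later.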

I hope this is clear. Thank you in advance.
