Extracting named fasta sequences according to list with Biopython
Hi all,
I’m trying to work out a quick script to extract a set of sequence fasta files from a multifasta and write them all to a new, single fasta file. To elaborate, I’ve got a proteome, and I want to extract a group of 15 or so proteins associated with a certain process, and write them to a new multifasta. To do so I want the script to read a document with a list of sequence names in it and sort the original multifasta using that list. I’m aiming to do this using Biopython.
This is my code so far:
from Bio import SeqIO
import sys
sample_file = open(str(sys.argv[2]), "r")
seq_list = []
outfile = open(str(sys.argv[3]), "w")
#This reads the guide document and turns each line into a list item in seq_list.
for line in sample_file:
stripped_line = line.strip()
line_list = stripped_line.split()
seq_list.append(line_list)
sample_file.close()
print(seq_list)
#This print function is to confirm that seq_list is indeed storing the names.
for record in SeqIO.parse(str(sys.argv[1]), "fasta"):
for n in seq_list:
if n == record.id:
SeqIO.write(record, outfile, "fasta")
outfile.close()
The main problem so far is that I can load the document of names into seq_list and print the list, but parsing SeqIO with it doesn’t seem to do anything. However, hardcoding the names into the code seemed to work fine. Any help would be greatly appreciated.
Thanks,
Lachlan
• 10 views
Read more here: Source link