python – Extract fasta files from ID list with Biopython

I am using Biopython to find sequences in a fasta file that match IDs from a .txt file comprising selected IDs. When searching for the ID names in the fasta file manually I do get hits, but the following script doesn’t find/extract any sequences:

#!/usr/bin/env python3
from Bio import SeqIO
wanted_ids = "transcript.orthogroup7.txt"
input_filename = "hq_isoseq_transcripts.fasta"
output_filename = "wanted_hq_isoseq_transcripts.fasta"
count = 0
total = 0
output_handle = open(output_filename, "w")
for record in SeqIO.parse(input_filename, "fasta"):
    total = total + 1
    if record.id in wanted_ids:
        count = count + 1
        SeqIO.write(record, output_handle, "fasta")
output_handle.close()
print(str(count) + " records selected out of " + str(total))

Here is an excerpt from the .fasta file

>20190807-P1_PBI19-RO01_8pM_HQ_transcript/0
GGTACGTATAGGAATTATAAGGGCTCATTGAGACAGTCTGAGCAAAAAAGTCTCGTTCAG
GCTCGCTAAAAGCCTTTGGTGACGCATCTGTCCGTGAGCGAGTCACATTTTGTGAGCGAG
TCACGAATCTCGGCGCCGGCGAATCATGAATCTCCGTAACAGTGAGTCGCGAATCTCCTA
AGCCGTGTGTTTTGCGCACGTCACTTCATTCCTCGGTACAGGGCTTCCTCTTAACAGGAA ... and so on
>20190807-P1_PBI19-RO01_8pM_HQ_transcript/1
GACTATATAGGAATTATAAGGGCTCATTGAGACAGTCTGAGCAAAAAAGTCTCGTTCAGG
CTCGCTAAAAGCCTTTGGTGACGCATCTGTCCGTGAGCGAGTCACATTTTGTGAGCGAGT
CACGAATCTCGGCGCCGGCGAATCATGAATCTCCGTAACAGTGAGTCGCGAATCTCCTAA
GCCGTGTGTTTTTGCGCACGTCACTTCATTCCTCGGTACAGGGCTTCCTCTCAGCAGGAA ... and so on 

and my ID .txt file

>20190807-P1_PBI19-RO01_8pM_HQ_transcript/0
>20190807-P1_PBI19-RO01_8pM_HQ_transcript/1

I execute the script in a conda environment and get the following unexpected message

(biopython_transcripts) $ python3 script.py
0 records selected out of 35629

When changing the “count” and “total” to for example 1000, I get

(biopython_transcripts) $ python3 script.py
1000 records selected out of 35629

So the script runs but doesn’t extract any sequences despite I am positive that the sequences exist and that the names in the .txt and the .fasta file match completely.

I am open to other solutions than Biopython but would really appreciate help with understanding why this doesn’t work and possible solutions.

Read more here: Source link