How to extract protein sequence of selected genes from gene bank like format

Hello everyone,

I used this program called antismash to predict the secondary metabolite clusters in my genome of interest. The output I got is gene bank like format whish is shown below:

LOCUS       scaffold_10            47160 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  scaffold_10.
ACCESSION   scaffold_10
VERSION     scaffold_10
KEYWORDS    .
SOURCE
  ORGANISM
            .
COMMENT     ##antiSMASH-Data-START##
            Version      :: 6.0.1-25e43a7(changed)
            Run date     :: 2021-11-05 17:39:45
            NOTE: This is a single cluster extracted from a larger record!
            Orig. start  :: 2049841
            Orig. end    :: 2097001
            ##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
     protocluster    1..47160
                     /aStool="rule-based-clusters"
                     /contig_edge="False"
                     /core_location="join{[2069841:2070128](+),
                     [2070175:2070449](+), [2070507:2071973](+),
                     [2072023:2074178](+), [2074232:2074784](+),
                     [2074838:2077001](+)}"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"
                     /tool="antismash"
     proto_core      join(20001..20287,20335..20608,20667..22132,22183..24337,
                     24392..24943,24998..27160)
                     /aStool="rule-based-clusters"
                     /tool="antismash"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"
     cand_cluster    1..47160
                     /SMILES="CC(=O)C(=O)O"
                     /candidate_cluster_number="1"
                     /contig_edge="False"
                     /detection_rules="cds(PKS_AT and (PKS_KS or ene_KS or
                     mod_KS or hyb_KS or itr_KS or tra_KS))"
                     /kind="single"
                     /product="T1PKS"
                     /protoclusters="1"
                     /tool="antismash"
     region          1..47160
                     /candidate_cluster_numbers="1"
                     /contig_edge="False"
                     /product="T1PKS"
                     /region_number="1"
                     /rules="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS or
                     hyb_KS or itr_KS or tra_KS))"
                     /tool="antismash"
     CDS             1899..2720
                     /ID="CDS_13110"
                     /NRPS_PKS="Domain: PKS_KR (26-149). E-value: 7.6e-08.
                     Score: 24.2. Matches aSDomain:
                     nrpspksdomains_jgi.p_FusspF11_1_766668_PKS_KR.1"
                     /NRPS_PKS="type: other"
                     /gene="jgi.p_FusspF11_1_766668"
                     /gene_functions="biosynthetic-additional (smcogs)
                     SMCOG1001:short-chain dehydrogenase/reductase SDR (Score:
                     161.6; E-value: 3.9e-49)"
                     /gene_kind="biosynthetic-additional"
                     /phase="0"
                     /source="prediction"
                     /transl_table=1
                     /translation="MTPPPKSALESDPAVGAQRRFSVTGNAVVTGGAGVLGLHACDALL
                     EHGLEGLMILDVNPAQSQGQITSLQNKFPRAKIMALKVDVTDENVVNAAMEETARVLGS
                     IDTLICFVGVVGCVETLDMPVPQWRKILDINTTGSFICAQAAARQMVKRGRGGSIVFVA
                     SISAHRVNYPQPQAAYNVSKSALLMLKSCLAAEWARYGIRTNSISPGYMDTILNEGDGI
                     AEHRKIWAEHNPSGRMGAPSELTGTVVLLASSAGSYINGADIVVDGGGIVL"
     aSDomain        1977..2345
                     /aSDomain="PKS_KR"
                     /aSTool="nrps_pks_domains"
                     /database="nrpspksdomains.hmm"
                     /detection="hmmscan"
                     /domain_id="nrpspksdomains_jgi.p_FusspF11_1_766668_PKS_KR.1
                     "
                     /evalue="7.60E-08"
                     /label="jgi.p_FusspF11_1_766668_PKS_KR.1"
                     /locus_tag="jgi.p_FusspF11_1_766668"
                     /protein_end="149"
                     /protein_start="26"
                     /score="24.2"
                     /specificity="KR activity: inactive"
                     /specificity="KR stereochemistry: C1"
                     /tool="antismash"
                     /translation="AVVTGGAGVLGLHACDALLEHGLEGLMILDVNPAQSQGQITSLQN
                     KFPRAKIMALKVDVTDENVVNAAMEETARVLGSIDTLICFVGVVGCVETLDMPVPQWRK
                     ILDINTTGSFICAQAAARQ"
     aSModule        1977..2345
                     /domains="nrpspksdomains_jgi.p_FusspF11_1_766668_PKS_KR.1"
                     /incomplete
                     /locus_tags="jgi.p_FusspF11_1_766668"
                     /tool="antismash"
                     /type="pks"
     CDS_motif       1980..2042
                     /aSTool="nrps_pks_domains"
                     /database="abmotifs"
                     /detection="hmmscan"
                     /domain_id="nrpspksmotif_jgi.p_FusspF11_1_766668_0001"
                     /evalue="1.90E-05"
                     /label="PKSI-KR_m1"
                     /locus_tag="jgi.p_FusspF11_1_766668"
                     /protein_end="48"
                     /protein_start="27"
                     /score="17.3"
                     /tool="antismash"
                     /translation="VVTGGAGVLGLHACDALLEHG"
     CDS             join(3252..3263,3321..3443,3493..4428)
                     /ID="CDS_13111"
                     /NRPS_PKS="Domain: PKS_ER (28-277). E-value: 6.4e-13.
                     Score: 40.2. Matches aSDomain:
                     nrpspksdomains_jgi.p_FusspF11_1_794272_PKS_ER.1"
                     /NRPS_PKS="type: other"
                     /gene="jgi.p_FusspF11_1_794272"
                     /gene_functions="biosynthetic-additional (smcogs)
                     SMCOG1040:alcohol dehydrogenase (Score: 275.7; E-value:
                     7.6e-84)"
                     /gene_kind="biosynthetic-additional"
                     /phase="0"
                     /source="prediction"
                     /transl_table=1
                     /translation="MSQTNLSCVLYGPGKARFENRPVPSLKDAHDVIIRISYVGVCGSD
                     VHFWTDGGFARKVSEDQPLVMGHEASGIVRSIGPDVTLLKPGDRVAIEPGFSCRRCKQC
                     KDGRYNLCPKMKFAADPPLTQGTLSRFFSIPEDFAYKIPDSLSLEEAVLVEPLAVAVHG
                     IRLAGLEVGQRVLVQGSGTIGLLTAAVAKAYGAKQVYITDVNLDKIKFAKKYLECSAFI
                     PDLGSTPEENAARFKTETGLDDGVDAVIECTGVEASTQTGLLALSAGGVLVQVGLGKPV
                     QAIPIHAMSEKEIVLKTSFRYGPGDYEIALELLESGKVSVRPLISSITPFEKATEAWEK
                     TRKGKGIKNLIRGVQD"

and so on……

Now from this file I want to extract protein sequence of the CDS having PKS domain ( could be PKS_ER, PKS_KS or PKS_anything). In the output file I want to include LOCUS Id and CDS ID and location. I have never used python but I found a similar issue in Biostar using python and have made few lines of codes based on that. Please help me build the code and correct me if I am wrong.

from Bio import SeqIO
from Bio.SeqFeature import SeqFeature

gbk = "mygenome_predicted_SMclusters.gbk"
fa = "mygenome_PKS_protein.fa"
input_handle = open(gbk, "r")
output_handle = open(fa, "w")

for record in SeqIO.parse(input_handle, "genbank"):
    features = [feature for feature in record.features if feature.type == "CDS"]
    for feature in features:
        if feature.qualifiers["Domain"][0] == "PKS_*":

Thanks

Read more here: Source link