Python – How to extract the protein sequences from a GenBank file using R or Biopython

Sorry for the question, I’m trying to extract the protein sequences from a GenBank file.

 gene            complement(516466..532086)
                 /gene="rtxA"
                 /locus_tag="VV1_RS17390"
                 /old_locus_tag="VV2_0479"
 CDS             complement(516466..532086)
                 /gene="rtxA"
                 /locus_tag="VV1_RS17390"
                 /old_locus_tag="VV2_0479"
                 /inference="COORDINATES: similar to AA
                 sequence:RefSeq:WP_011081430.1"
                 /note="Derived by automated computational analysis using
                 gene prediction method: Protein Homology."
                 /codon_start=1
                 /transl_table=11
                 /product="MARTX multifunctional-autoprocessing
                 repeats-in-toxin holotoxin RtxA"
                 /protein_id="WP_011081430.1"
                 /translation="MGKPFWRSVEYFFTGNYSADDGNNSIVAIGFGGEIHAYGGDDHV
                 TVGSIGATVYTGSGNDTVVGGSAYLRVEDTTGHLSVKGAAGYADINKSGDGNVSFAGA
                 AGGVSIDHLGNHGDVSYGGAAAYNGITRKGLSGNVTFKGAGGYNALWHETNQGNLSFA
                 GAGAGNKLDRTWFNRYQGSRGDVTFDGAGAANSISSRVETGNITFRGAGADNHLVRKG
                 KEGNHTANLANEDISSANGYHSMGKGGYSLSDLHYSVNAVRSTSETVADIDEYTDQTL
                 FKPATDSGESSGDVRFNGAGGGNVIKSNVTRGNVYFNGGGIANVILHSSQFGHTEFNG
                 GGAANVIVKSGEEGDLTFRGAGLANVLVHQSKQGKMDVYAGGAVNVLVRIGDGQYLAH"

I’m using code from a previous answer to extract the protein sequence and the locus_tag, but I want the FASTA header to include some extra fields, like:

>locus_tag|old_locus_tag|gene|product|complement|protein_id|chromosome|Accession|organism|strain
MGKPFWRSVEYFFTGNYSADDG ..........

The result should look like:

>VV1_RS17390|VV2_0479|rtxA|MARTX multifunctional-autoprocessing repeats-in-toxin holotoxin RtxA|516466..532086|WP_011081430.1|Chromosome-II|NC_004459.3|Vibrio vulnificus|CMCP6
MGKPFWRSVEYFFTGNYSADDG ..........

This is the code I’m using. How can I add those fields to the header? Alternatively, is there another way to do this using R?

from Bio import SeqIO

file_name = "CMCP6.gb"

# Collect every CDS entry that has a translation
all_entries = []
with open(file_name, "r") as gb_file:
    gb_cds = SeqIO.InsdcIO.GenBankCdsFeatureIterator(gb_file)
    for cds in gb_cds:
        if cds.seq is not None:
            cds.id = cds.name        # use the locus_tag as the FASTA ID
            cds.description = ""     # drop the rest of the header
            all_entries.append(cds)

# Write all entries to CMCP6.fasta
SeqIO.write(all_entries, "{}.fasta".format(file_name[:-3]), "fasta")

But the output from this code only contains the locus_tag and the sequence:

>VV1_RS17380
MFIKNLSIGKKIAAAFSIIAVINIAFGIFLSTELNTVKSELLNYTEDTLPAMEKVDAVRD
KISYWRRTQFAVFAMNDENQIKQTITRNEGIRREIETELAAYGKSVWPGEEEQTYNRLMS
LWSGYLSTMDKFNDALLAGDKDAAYPILTNSLSTFESIETEVNKLVMILKGAMDSNKNQI
LSSVNGLNTTAVISNIAIFAVMVIMTLLLTRIICGPLNIVVRQANAIARGDLSQDLDRNA
IGNDELGELADATIKMQDDLRQVIDNVIAAVTQLSSAVEEMTQISEMSASGMKDQQMQVT
LVATAMTEMKAAVADVARNTEDSASQAYDANRRTQDGAKETHQMVVSIEEVADIIAKAGD
TVAELEAQSNQINVVVDVIRGIADQTNLLALNAAIEAARAGESGRGFAVVADEVRTLAGR
TQDSTGEITSIIEKLQDLAKQAKHATEDSRTSIGACVEQGNNAQELMGSIEKSIANIADM
GAQIATACGQQDSVAEELSRNIENIHMASQEVAQGSQQTAQACRELTQLSVSLQDVMSRF
KLN
>VV1_RS17385
MNFKKTLLSIAIASASLTPAFSYSAPLLLDNTVHQTSQIAGANAWLEISLGQFKSNIEQF
KSHIAPQTKICAVMKADAYGNGIRGLMPTILEQQIPCVAIASNAEAKLVRESGFEGELIR
VRSASTSEIEQALSLDIEELIGSEQQARELASLAEKYSKTIKVHLALNDGGMGRNGIDMS
TERGPKEAVAIATHPSVAVVGIMTHFPNYNAEDVRTKLKSFNQHAQWLMESAGLKREEIT
LHVANSYTALNVPEAQLDMVRPGGVLYGDLPTNPEYPSIVAFKTRVASLHSLPAGSTVGY
DSTFTTANDAVMANLTVGYSDGYPRKMGNKAQVLINGQRANVVGVASMNTTMVDVSNIKG
VLPGDEVTLFGAQKNQHISVGEMEENAEVIFPELYTIWGTSNPRFYVK
>VV1_RS17390
MGKPFWRSVEYFFTGNYSADDGNNSIVAIGFGGEIHAYGGDDHVTVGSIGATVYTGSGND
TVVGGSAYLRVEDTTGHLSVKGAAGYADINKSGDGNVSFAGAAGGVSIDHLGNHGDVSYG
GAAAYNGITRKGLSGNVTFKGAGGYNALWHETNQGNLSFAGAGAGNKLDRTWFNRYQGSR
GDVTFDGAGAANSISSRVETGNITFRGAGADNHLVRKGKVGDITLQGAGASNRIERTRQA
EDVYAQTRGNIRFEGVGGYNSLYSDVAHGDIHFSGGGAYNTITRKGSGSSFDAQGMEYAK

Thanks so much
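
One possible way to get the extended header, as a rough sketch rather than a tested solution for this exact file: parse the file with SeqIO.parse(..., "genbank") instead of the CDS iterator, so that each CDS feature's qualifiers and the record-level information are both available. The helper names (first, build_header, write_annotated_fasta) are made up for illustration. The sketch assumes that chromosome, organism and strain are present as qualifiers of the "source" feature and that record.id holds the accession with version. Values are written exactly as they appear in the file, so the chromosome may come out as "II" rather than "Chromosome-II", and the organism qualifier may include the strain name.

```python
def first(qualifiers, key, default=""):
    """Return the first value of a qualifier (Biopython stores them as lists)."""
    return qualifiers.get(key, [default])[0]


def build_header(cds_q, span, source_q, accession):
    """Join the requested fields with '|' in the order given in the question."""
    fields = [
        first(cds_q, "locus_tag"),
        first(cds_q, "old_locus_tag"),
        first(cds_q, "gene"),
        first(cds_q, "product"),
        span,
        first(cds_q, "protein_id"),
        first(source_q, "chromosome"),
        accession,
        first(source_q, "organism"),
        first(source_q, "strain"),
    ]
    return "|".join(fields)


def write_annotated_fasta(gb_path, fasta_path):
    """Write one FASTA entry per translated CDS and return how many were written."""
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import SeqIO

    count = 0
    with open(fasta_path, "w") as out:
        for record in SeqIO.parse(gb_path, "genbank"):
            # Assumption: the "source" feature carries organism, strain, chromosome.
            source_q = next(
                (f.qualifiers for f in record.features if f.type == "source"), {}
            )
            for feature in record.features:
                # Skip non-CDS features and CDS entries without /translation.
                if feature.type != "CDS" or "translation" not in feature.qualifiers:
                    continue
                # Biopython start positions are 0-based; GenBank coordinates are 1-based.
                span = "{}..{}".format(
                    int(feature.location.start) + 1, int(feature.location.end)
                )
                header = build_header(feature.qualifiers, span, source_q, record.id)
                protein = feature.qualifiers["translation"][0]
                out.write(">{}\n{}\n".format(header, protein))
                count += 1
    return count
```

It could then be called as write_annotated_fasta("CMCP6.gb", "CMCP6.fasta"). Each protein is written on a single line rather than wrapped at 60 characters, and missing qualifiers (for example a CDS with no /gene) are left as empty fields so the column order stays fixed.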
