I am trying to programmatically retrieve whole genes (with the intron and exon structure as defined by the CDS features) using Biopython's Entrez esearch and efetch utilities.
from Bio import Entrez
Entrez.email = "myemail@gmail.com"
handle = Entrez.esearch(db="gene",retmax = "10",term="P53 AND Homo Sapiens [organism]")
record = Entrez.read(handle)
handle_first_record = Entrez.efetch(db="gene",id=record["IdList"][0],rettype="gb",retmode="text")
info = handle_first_record.read()
# Is there a more direct way of getting the start and stop from the Annotation field?
annot = info.split("\n")[6].split()
chrom = annot[3]
start_stop = annot[4].split("..")
start = start_stop[0][1:]
stop = start_stop[1][:-1]
print(f"Chromid: {chrom} Start:{start} Stop: {stop}")
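# E-utilities expects seq_start/seq_stop (not start/stop); rettype="gb" with retmode="text" requests GenBank flat-file text
gbfile_handle = Entrez.efetch(db="nuccore",id=chrom,seq_start=start,seq_stop=stop,rettype="gb",retmode="text")
# Need to figure out how to parse this record to get a Genbank file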
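One possible way to handle the last step is sketched here. This is not a confirmed recipe: it assumes the nuccore region is requested as GenBank text and then parsed with Bio.SeqIO. The helper names region_fetch_kwargs and fetch_region are hypothetical and introduced only for illustration. The strand argument (1 for plus, 2 for minus) is the E-utilities parameter that would correspond to the "complement" flag in the annotation.

```python
def region_fetch_kwargs(accession, start, stop, complement=False):
    """Build the efetch arguments for a sub-region of a nuccore record.

    Hypothetical helper: it only assembles parameters, no network access.
    """
    return {
        "db": "nuccore",
        "id": accession,
        "rettype": "gb",        # GenBank flat file
        "retmode": "text",
        "seq_start": int(start),
        "seq_stop": int(stop),
        "strand": 2 if complement else 1,  # 2 = minus strand
    }


def fetch_region(accession, start, stop, complement=False,
                 email="myemail@gmail.com"):
    """Fetch the region from NCBI and return it as a SeqRecord.

    Hypothetical helper; requires network access and Biopython.
    """
    # Imported here so region_fetch_kwargs stays usable without Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = email
    handle = Entrez.efetch(
        **region_fetch_kwargs(accession, start, stop, complement)
    )
    try:
        # Parse the GenBank text into a single SeqRecord with its features.
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()
```

With the TP53 coordinates from the annotation, fetch_region("NC_000017.11", 7668421, 7687490, complement=True) should return a record whose features list can be filtered for entries with type "CDS", and SeqIO.write(record, "TP53_region.gb", "genbank") would save it to a file (the filename is just an example). If the returned record turns out to lack sequence data, rettype="gbwithparts" may be worth trying instead.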
A typical annotation is given below:
1. TP53
Official Symbol: TP53 and Name: tumor protein p53 [Homo sapiens (human)]
Other Aliases: BCC7, BMFS5, LFS1, P53, TRP53
Other Designations: cellular tumor antigen p53; antigen NY-CO-13; mutant tumor protein 53; p53 tumor suppressor; phosphoprotein p53; transformation-related protein 53; tumor protein 53; tumor supressor p53
Chromosome: 17; Location: 17p13.1
Annotation: Chromosome 17 NC_000017.11 (7668421..7687490, complement)
MIM: 191170
ID: 7157
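As a somewhat more robust alternative to indexing into split lines, the Annotation line of a summary like the one above could be picked out with a regular expression. This is only a sketch: the pattern assumes the line always has the form "Annotation: Chromosome <name> <accession> (<start>..<stop>[, complement])", which has been checked against this single example only, and parse_annotation is a hypothetical helper name.

```python
import re

# Pattern for a line such as:
#   Annotation: Chromosome 17 NC_000017.11 (7668421..7687490, complement)
# Assumption: other gene summaries follow the same layout.
ANNOTATION_RE = re.compile(
    r"Annotation:\s+Chromosome\s+(?P<chrom>\S+)\s+"
    r"(?P<accession>[A-Z]{2}_\d+\.\d+)\s+"
    r"\((?P<start>\d+)\.\.(?P<stop>\d+)(?P<complement>,\s*complement)?\)"
)


def parse_annotation(text):
    """Extract accession, coordinates and strand flag from summary text."""
    match = ANNOTATION_RE.search(text)
    if match is None:
        raise ValueError("no Annotation line found")
    return {
        "chrom": match["chrom"],
        "accession": match["accession"],
        "start": int(match["start"]),
        "stop": int(match["stop"]),
        # True when the gene lies on the complementary (minus) strand.
        "complement": match["complement"] is not None,
    }
```

Because the pattern is searched for anywhere in the text, the position of the Annotation line within the summary no longer matters.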
Is there an easier way in Biopython to get the GenBank file for a human gene, starting from the name of the gene, than the approach above?