How to extract the genomic upstream region for a protein identified by its NCBI accession number?
I have a list of NCBI protein accession numbers, and I would like to extract the genomic region upstream of each corresponding gene’s nucleotide sequence. I would be grateful if someone could show me how to do this.
For example, here are some of the protein accession numbers:
EET74829.1
VEI24834.1
AYW77996.1
EJD65589.1
EFM49534.1
These appear to be protein accession numbers that point to various assemblies, so there are no direct gene associations. It may be best to do this as a three-step process.
Using Entrez Direct (EDirect):
Get the accession number of the nucleotide assembly/genome:
$ esearch -db protein -query AYW77996 | elink -target nuccore | efetch -format acc
CP033719.1
Get the nucleotide start/stop coordinates of the CDS:
$ efetch -db nuccore -id CP033719.1 -format fasta_cds_na | grep AYW77996
>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]
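If you want to script this step, the coordinates can be pulled out of the header line with a bit of text processing. The following is only a minimal sketch: `cds_location` is a hypothetical helper name, not part of EDirect, and it assumes the header carries a single `[location=...]` field as in the output above.

```shell
# Hypothetical helper (not part of EDirect): read fasta_cds_na header lines
# on stdin and print the value of the [location=...] field for each one.
cds_location() {
  # Capture everything between "[location=" and the next "]".
  sed -n 's/.*\[location=\([^]]*\)].*/\1/p'
}

# Header line copied from the efetch output above.
header='>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]'

printf '%s\n' "$header" | cds_location
# prints: 1885267..1887939
```

The value is printed as is, so a minus-strand gene comes out in the form complement(start..stop) and still has to be interpreted in the next step.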
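Use the location coordinates to get the sequence you want (e.g. 200 bp upstream). Pay attention to the strand: this CDS is on the plus strand, so its upstream region ends just before position 1885267. A CDS on the minus strand is reported as location=complement(start..stop); its upstream region lies after the stop coordinate and must be reverse-complemented. Note that the range below spans 201 bases and includes the first base of the CDS; use -seq_stop 1885266 for exactly 200 bp upstream.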
$ efetch -db nuccore -id CP033719.1 -format fasta -seq_start 1885067 -seq_stop 1885267
>CP033719.1:1885067-1885267 Propionibacterium acidifaciens strain FDAARGOS_576 chromosome, complete genome
GGCTCCGAGCACTGGCGCCAGGTGGGCGGCCTGGGCAACATCGCAGCCCTGCTCGGTCTCGTCGCCGTGG
CCGTCTGGTCGTCCGTGGTCCGGGACGCCGCCGAGGCCGAGCGGCCCCCGTCCGCGCGGGGCGGCCCCGG
CCCGGTCGGCGGGGGAGCCCCCGACAACCCGCCCGCCATGACGATCCCGAGGACCGACGCA
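The coordinate arithmetic for both strands can be sketched as a small shell function. This is an illustration under stated assumptions, not a complete solution: `upstream_window` is a hypothetical name, it handles only simple start..stop and complement(start..stop) locations (no join(...) or partial < > coordinates), and it does not know the sequence length or wrap around circular genomes.

```shell
# Hypothetical helper: upstream_window LOCATION LENGTH
# Prints "START STOP STRAND" for the LENGTH bases immediately upstream of a CDS.
upstream_window() {
  loc=$1
  len=$2
  strand=plus
  case $loc in
    "complement("*")")
      strand=minus
      loc=${loc#"complement("}   # strip the leading "complement("
      loc=${loc%")"}             # strip the trailing ")"
      ;;
  esac
  # Only plain "start..stop" ranges are supported in this sketch.
  case $loc in
    *[!0-9.]*) echo "unsupported location: $1" >&2; return 1 ;;
  esac
  cds_start=${loc%%..*}
  cds_stop=${loc##*..}
  if [ "$strand" = plus ]; then
    # Upstream lies before the first CDS base; clamp at position 1.
    start=$((cds_start - len))
    stop=$((cds_start - 1))
    if [ "$start" -lt 1 ]; then start=1; fi
    if [ "$stop" -lt 1 ]; then
      echo "no upstream bases before position 1" >&2
      return 1
    fi
  else
    # Upstream of a minus-strand CDS lies after its larger coordinate.
    start=$((cds_stop + 1))
    stop=$((cds_stop + len))
  fi
  echo "$start $stop $strand"
}

upstream_window 1885267..1887939 200
# prints: 1885067 1885266 plus
```

The START and STOP values can be passed to -seq_start and -seq_stop as in the efetch command above. For a minus-strand result the fetched sequence still needs to be reverse-complemented; efetch appears to offer a -strand 2 option for this, but check efetch -help for your EDirect version before relying on it.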