How to extract genomic upstream region of a protein identified by its NCBI accession number?


I have a list of NCBI protein accession numbers, and I would like to extract the genomic region upstream of the gene that encodes each protein. Could someone show me how to do this?

For example, here are some of the protein accession numbers:

EET74829.1

VEI24834.1

AYW77996.1

EJD65589.1

EFM49534.1


Tags: bedtools, extract_upstream_region, genomic_sequence, NCBI

55 minutes ago by mrj

These appear to be protein accession numbers that point to records in various genome assemblies, so there is no direct gene record to look up. It may be best to do this as a three-step process.

Using Entrez Direct (EDirect):

Get the accession number of the nucleotide record (assembly/genome) linked to the protein:

$ esearch -db protein -query AYW77996 | elink -target nuccore | efetch -format acc
CP033719.1

Get the start and stop coordinates of the CDS on that nucleotide record:

$ efetch -db nuccore -id CP033719.1 -format fasta_cds_na | grep AYW77996
>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]
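
For a list of accessions, the two lookups above could be wrapped in a small shell function. This is only a sketch: `locate_cds` is a made-up name, it assumes the first linked nucleotide record is the one you want, and it assumes `esearch` accepts the accession with its version suffix (the example above queried without it).

```shell
# Hypothetical helper: run the two lookups for one protein accession and
# print "<protein>\t<nucleotide accession>\t<location>" on a single line.
# Requires the EDirect tools (esearch, elink, efetch) on PATH.
locate_cds() {
    local prot nuc header loc
    prot=$1
    # Nucleotide record linked to the protein.
    # Assumption: the first linked record is the one wanted.
    nuc=$(esearch -db protein -query "$prot" |
          elink -target nuccore |
          efetch -format acc | head -n 1)
    [ -n "$nuc" ] || { echo "no nucleotide record for $prot" >&2; return 1; }
    # Header of the matching CDS; -F makes grep treat the accession
    # (dots, brackets) as a literal string.
    header=$(efetch -db nuccore -id "$nuc" -format fasta_cds_na |
             grep -F "[protein_id=$prot]" | head -n 1)
    [ -n "$header" ] || { echo "no CDS for $prot in $nuc" >&2; return 1; }
    # Text between "[location=" and the next "]"
    loc=$(printf '%s\n' "$header" | sed -n 's/.*\[location=\([^]]*\)\].*/\1/p')
    printf '%s\t%s\t%s\n' "$prot" "$nuc" "$loc"
}
```

With the accessions in a file, one per line (say `proteins.txt`, a placeholder name), something like `while read -r acc; do locate_cds "$acc"; sleep 1; done < proteins.txt` would build the table. The `sleep` is there because NCBI limits E-utilities to about three requests per second without an API key.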

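Use the location coordinates to get the sequence you want (e.g. 200 bp upstream). Pay attention to the strand: a location written as complement(start..stop) is on the minus strand, so its upstream region lies after the stop coordinate and has to be reverse-complemented. Note that the range below ends at 1885267, the first base of the CDS, so it returns 201 bases; use -seq_stop 1885266 to get only the 200 bp upstream.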

$ efetch -db nuccore -id CP033719.1 -format fasta -seq_start 1885067 -seq_stop 1885267
>CP033719.1:1885067-1885267 Propionibacterium acidifaciens strain FDAARGOS_576 chromosome, complete genome
GGCTCCGAGCACTGGCGCCAGGTGGGCGGCCTGGGCAACATCGCAGCCCTGCTCGGTCTCGTCGCCGTGG
CCGTCTGGTCGTCCGTGGTCCGGGACGCCGCCGAGGCCGAGCGGCCCCCGTCCGCGCGGGGCGGCCCCGG
CCCGGTCGGCGGGGGAGCCCCCGACAACCCGCCCGCCATGACGATCCCGAGGACCGACGCA
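
The strand bookkeeping can be done mechanically from the `[location=...]` field of the CDS header. The function below is a minimal sketch of that arithmetic, not part of EDirect: `upstream_window` is a made-up name, it handles only simple `start..stop` locations (with or without `complement(...)`), and it does not check whether a minus-strand window runs past the end of the record.

```shell
# Hypothetical helper (not part of EDirect): turn a fasta_cds_na header into
# an upstream window. Prints "<seq_start> <seq_stop> <strand>", where strand
# is 1 (plus) or 2 (minus). Only simple start..stop locations are handled.
upstream_window() {
    local header flank loc strand start stop up_start up_stop
    header=$1
    flank=${2:-200}
    # Text between "[location=" and the next "]"
    loc=$(printf '%s\n' "$header" | sed -n 's/.*\[location=\([^]]*\)\].*/\1/p')
    strand=1
    case $loc in
        complement\(*\))
            # complement(...) marks a CDS on the minus strand
            strand=2
            loc=${loc#"complement("}
            loc=${loc%")"}
            ;;
    esac
    # Drop partial-location markers, then reject anything that is not a
    # plain range (for example join(...) locations)
    loc=$(printf '%s\n' "$loc" | tr -d '<>')
    printf '%s\n' "$loc" | grep -Eq '^[0-9]+\.\.[0-9]+$' || return 1
    start=${loc%%..*}
    stop=${loc##*..}
    if [ "$strand" -eq 1 ]; then
        # Plus strand: window ends one base before the CDS start
        up_stop=$((start - 1))
        up_start=$((start - flank))
        [ "$up_start" -ge 1 ] || up_start=1
    else
        # Minus strand: window starts one base after the higher coordinate
        up_start=$((stop + 1))
        up_stop=$((stop + flank))
    fi
    [ "$up_stop" -ge "$up_start" ] || return 1
    printf '%s %s %s\n' "$up_start" "$up_stop" "$strand"
}
```

For the header shown earlier, `upstream_window "$header" 200` prints `1885067 1885266 1`. The first two values map onto `-seq_start` and `-seq_stop`. For a minus-strand gene the fetched sequence still needs reverse-complementing; `efetch` has a `-strand` option that I believe takes `2` for this, but check `efetch -help` for your EDirect version.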

