Dear all, I am working with a list of Ensembl accession codes for a desired group of proteins.
I have downloaded the protein annotations for the genome assembly GRCh38.
I fetched the genomic coordinates from the UniProtKB API service using the Ensembl accession codes. The service provides protein annotation records with the coordinates I need.
However, I would like to obtain the same coordinates by parsing the GRCh38 data locally instead of querying an online database. I think I found a way that involves the protein sequences FASTA file and the GTF annotation file for the GRCh38 genome assembly. Using the Ensembl protein codes (from the FASTA sequences), it should be possible to find the corresponding Ensembl gene codes in the GTF annotations and then, in the same annotations, the desired genomic coordinates. However, the GTF annotation file was last updated on 19-Mar-2021, while the protein sequences in FASTA format date from 27-Mar-2021 (today is 19-Sep-2021).
This discrepancy makes me doubt which source holds the most up-to-date information.
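For reference, the local lookup I have in mind could be sketched roughly as below. This is only a minimal sketch under some assumptions: that the CDS rows of the GTF carry `protein_id`, `gene_id` and `transcript_id` attributes, and that the GTF stores protein IDs without the version suffix that appears in the FASTA headers (so the suffix is stripped before matching). The function names `open_text` and `protein_coordinates` are my own, not part of any library.

```python
import gzip
import re

# Matches GTF attribute pairs such as: gene_id "ENSG00000186092";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def protein_coordinates(gtf_lines, wanted_ids):
    """Collect CDS coordinates for the requested Ensembl protein IDs.

    gtf_lines  : iterable of GTF lines (an open file or a list of strings)
    wanted_ids : Ensembl protein IDs; a trailing ".N" version is ignored
                 (assumption: the GTF stores unversioned protein_id values)
    Returns {protein_id: {"gene_id", "transcript_id", "chrom", "strand",
                          "cds": [(start, end), ...], "start", "end"}}.
    """
    wanted = {pid.split(".")[0] for pid in wanted_ids}
    found = {}
    for line in gtf_lines:
        if line.startswith("#"):  # header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        pid = attrs.get("protein_id")
        if pid not in wanted:
            continue
        rec = found.setdefault(pid, {
            "gene_id": attrs.get("gene_id"),
            "transcript_id": attrs.get("transcript_id"),
            "chrom": fields[0],
            "strand": fields[6],
            "cds": [],
        })
        rec["cds"].append((int(fields[3]), int(fields[4])))
    # Summarise each protein as the genomic span covering all its CDS segments.
    for rec in found.values():
        rec["cds"].sort()
        rec["start"] = min(start for start, _ in rec["cds"])
        rec["end"] = max(end for _, end in rec["cds"])
    return found


# Example call (file name as in the Ensembl release 104 FTP directory;
# my_protein_ids is a placeholder for the list of ENSP codes):
# with open_text("Homo_sapiens.GRCh38.104.chr.gtf.gz") as handle:
#     coords = protein_coordinates(handle, my_protein_ids)
```

Any requested ID that does not appear as a key in the returned dictionary was not found in the GTF, which would also be a simple way to see which protein codes obtained elsewhere are absent from this release.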
Now I am wondering:
If I query UniProtKB through an API service, is it possible to find protein annotations not yet included in the GTF annotation set for a specific genome assembly (in this case GRCh38 of 27-Mar-2021)?
In other words, could protein annotations fetched from UniProtKB be more recent than the 27-Mar-2021 GTF annotations for GRCh38?
Is it possible that UniProtKB stores protein codes with a correspondence in the Ensembl database (cross-link section of the UniProt web pages) that are not yet included in the GRCh38 GTF annotations downloadable through the Ensembl FTP service? (I mean the GTF file Homo_sapiens.GRCh38.104.chr.gtf.gz, in this repository: ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/)
I am asking because, if I am interested in the latest protein codes and annotations, I think I should take into account how many new codes and annotations are potentially submitted each month. In light of this, if the online databases update their content more frequently than the files released for the genome assembly, I will go for the API querying strategy.
Thanks for your answers.