Sequence (annotation) databases in 2021



Hi everyone,

I know there are already several threads on this topic (or tangentially related to it), but they are really old now. Things have probably changed quite significantly in the meantime.

So I would like to start a small discussion on the topic of sequence databases.

Here are some issues that I think the good folks here at Biostars could address.

  • How do the various sequence databases, such as UniProt, NCBI's
    various divisions (RefSeq, NR, what else is there?), and Ensembl(?),
    stack up in 2021?
  • What are some other useful databases one should be aware of?
  • Are there any popular databases that have gone sour? (But people still just use them anyway out of habit.)
  • Are there any new and upcoming sequence + sequence annotation resources in the near future?
  • What are some general gotchas, myths, and misunderstandings/misconceptions one should watch out for when it comes to sequence resources?










The primary sequence databases (NCBI/ENA/DDBJ) have been around for decades and are always current. They are the primary repositories of sequences and sync submissions with each other overnight. They carry annotations for some of their sequences; for others there may be none.
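Because the primary databases exchange submissions, the same accession should be retrievable from more than one of them. A minimal sketch of what that looks like, assuming the NCBI E-utilities `efetch` endpoint and the ENA Browser API FASTA endpoint; the helper names are hypothetical, the accession `U00096.3` is only an example, and the code builds the URLs without contacting either service:

```python
from urllib.parse import quote, urlencode

# Base URLs are assumptions based on the public NCBI E-utilities and
# ENA Browser API; check the current documentation before relying on them.
NCBI_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
ENA_FASTA = "https://www.ebi.ac.uk/ena/browser/api/fasta/"


def ncbi_fasta_url(accession):
    """Build an efetch URL that asks NCBI for a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return NCBI_EFETCH + "?" + urlencode(params)


def ena_fasta_url(accession):
    """Build the ENA Browser API URL for the same accession as FASTA."""
    return ENA_FASTA + quote(accession)


# Example accession only; any INSDC accession could be substituted.
example_accession = "U00096.3"
print(ncbi_fasta_url(example_accession))
print(ena_fasta_url(example_accession))
```

Fetching both URLs (for example with `urllib.request.urlopen`) and comparing the sequence lines is one way to check for yourself that the records agree; the header lines will differ between sources.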

The Genome Reference Consortium is the apex body that manages genome releases for important genomes (human, mouse, zebrafish, rat and chicken). It releases primary genome builds, which then get deposited into the appropriate sections of the primary databases. Organizations such as NCBI, Ensembl and UCSC offer annotation that they generate internally, but the underlying sequence is identical for a given genome build.
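One practical consequence is that files for the same build from different providers can be matched by sequence content even when the sequence names differ (for example `chr1` versus `1`). A rough sketch of that check, assuming plain FASTA input; the function names are hypothetical and the two tiny FASTA strings are made-up toy data, not real genome sequence:

```python
import hashlib


def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dict (name = first word of header)."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def sequence_digests(records):
    """Map the MD5 of each uppercased sequence to its name.

    Uppercasing ignores soft-masking (lowercase repeats), which can differ
    between providers even though the bases are the same.
    """
    return {
        hashlib.md5(seq.upper().encode("ascii")).hexdigest(): name
        for name, seq in records.items()
    }


def match_by_sequence(fasta_a, fasta_b):
    """Pair up sequence names from two FASTA texts whose sequences are identical."""
    a = sequence_digests(read_fasta(fasta_a))
    b = sequence_digests(read_fasta(fasta_b))
    return {a[digest]: b[digest] for digest in a if digest in b}


# Toy data standing in for two providers' files of the same build.
UCSC_STYLE = ">chr1\nACGTacgtNN\nACGT\n>chrM\nGATTACA\n"
ENSEMBL_STYLE = ">1 dna:chromosome\nACGTACGTNNACGT\n>MT dna:chromosome\nGATTACA\n"

pairs = match_by_sequence(UCSC_STYLE, ENSEMBL_STYLE)
print(pairs)
```

For real genomes you would stream the files rather than hold them in memory, but the idea is the same: if the digests agree, the underlying sequence is the same and only the naming and annotation differ.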

There are plenty of other derived or special-focus databases. UniProt covers all things protein, and PDB covers protein structures.

You will still find organism-specific databases around that originally provided sequence and annotations for those genomes. They were useful in the early days of genome sequencing, but as NGS took off they became subject to disappearing grant money. Some have turned to a subscription model (e.g. TAIR, BioCyc, KEGG) to support themselves. Parts of them may still be freely accessible, but other parts require a subscription. If you are lucky enough to have access then you are all set; otherwise you end up looking for other (perhaps less desirable) alternatives. Still, you should be able to find the info you need in some free form elsewhere.
