Context
I’m currently exploring the AlphaFold 2 dataset. The goal is to use deep learning to generate embeddings that represent the structures, then group structurally similar proteins together with a clustering algorithm.
I have a first pass at clustering the AlphaFold proteins. Assuming that structure and function are closely related, I’d like to see whether proteins that share similar functions ended up in more or less the same clusters.
Data I need
I’d like to find the known and verified functional labels for the proteins available in the AlphaFold dataset, so I can check whether any clusters have a concentration of proteins with particular functions. What is the best resource for getting these functional labels?
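To make the kind of check I have in mind concrete, here is a minimal sketch in Python. It assumes two hypothetical dictionaries: `cluster_of` (protein ID to cluster ID) and `labels_of` (protein ID to a set of functional labels). The protein IDs and labels below are toy placeholders, not real data, and the function name is only illustrative.

```python
from collections import Counter, defaultdict


def label_fractions_per_cluster(cluster_of, labels_of):
    """For each cluster, return the fraction of its members carrying each label.

    cluster_of: dict mapping protein ID -> cluster ID (hypothetical input)
    labels_of:  dict mapping protein ID -> set of functional labels (hypothetical input)
    Proteins without any label still count towards the cluster size.
    """
    sizes = Counter(cluster_of.values())
    counts = defaultdict(Counter)
    for protein, cluster in cluster_of.items():
        for label in labels_of.get(protein, ()):
            counts[cluster][label] += 1
    return {
        cluster: {label: n / sizes[cluster] for label, n in counts[cluster].items()}
        for cluster in sizes
    }


# Toy placeholder data: four proteins in two clusters, one of them unlabeled.
cluster_of = {"P1": 0, "P2": 0, "P3": 1, "P4": 1}
labels_of = {
    "P1": {"ATP binding"},
    "P2": {"ATP binding", "DNA binding"},
    "P3": {"DNA binding"},
}

fractions = label_fractions_per_cluster(cluster_of, labels_of)
for cluster, per_label in sorted(fractions.items()):
    print(cluster, per_label)
```

A label whose fraction is much higher in one cluster than across the whole dataset would be a candidate for a proper enrichment test; this sketch only reports raw fractions.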
What I tried and didn’t work
I downloaded the GO molecular function dataset from GSEA, then used the UniProt API (www.uniprot.org/uploadlists/) to pull all human UniProt protein IDs associated with each gene. Unfortunately, I found that some of the protein IDs mapped to GO molecular functions were not in my dataset at all. This puzzled me because I thought AlphaFold covered 98.5% of the human proteins (20,000), so I expected to find every previously known human protein with a functional label (such as those in GO) in the AlphaFold dataset.
For example, the gene name RAD50 was associated with the following 15 UniProt protein IDs: ['A0A494BZW0', 'A0A494BZX5', 'A0A494BZX8', 'A0A494C0Y7', 'A0A494C122', 'A0A494C1B7', 'A5D6Y3', 'C9JNH8', 'E7EN38', 'E7ESD9', 'E9PM98', 'H7C0P8', 'H7C0V2', 'Q32P42', 'RAD50']
But I could only find `C9JNH8` in AlphaFold.
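A coverage check like this can be sketched as follows. The sketch assumes the AlphaFold structures are stored as files named in the style `AF-<accession>-F<fragment>-model_v<version>...`; the file listing below is a hypothetical stand-in for a real directory listing, and the helper names are illustrative only.

```python
import re

# Assumed AlphaFold file-name pattern, e.g. "AF-C9JNH8-F1-model_v1.pdb.gz".
AF_NAME = re.compile(r"^AF-([A-Z0-9]+)-F\d+-model_v\d+")


def accessions_from_filenames(filenames):
    """Collect the UniProt accessions embedded in AlphaFold-style file names."""
    found = set()
    for name in filenames:
        match = AF_NAME.match(name)
        if match:
            found.add(match.group(1))
    return found


def split_by_coverage(candidate_ids, available):
    """Split candidate IDs into those present in `available` and those missing."""
    present = [i for i in candidate_ids if i in available]
    missing = [i for i in candidate_ids if i not in available]
    return present, missing


# The 15 IDs returned for RAD50, as listed above.
rad50_ids = [
    "A0A494BZW0", "A0A494BZX5", "A0A494BZX8", "A0A494C0Y7", "A0A494C122",
    "A0A494C1B7", "A5D6Y3", "C9JNH8", "E7EN38", "E7ESD9", "E9PM98",
    "H7C0P8", "H7C0V2", "Q32P42", "RAD50",
]

# Hypothetical listing; in practice this would come from os.listdir() on the download folder.
files = ["AF-C9JNH8-F1-model_v1.pdb.gz", "README.txt"]

present, missing = split_by_coverage(rad50_ids, accessions_from_filenames(files))
print("present:", present)
print("missing:", len(missing))
```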
Questions
- Should I try another database for functional labels? If so, which one? I heard about FunCat, but I am not sure how exactly it would differ from GO.
- Why are some of the proteins in UniProt (listed above, associated with gene RAD50) not available in AlphaFold?