What No One Is Talking About

In this next installment of our AlphaFold Series, we look at the potential drawbacks and limitations of the approach. There is no doubt that AlphaFold is a breakthrough in protein structure prediction, and we have commented on some of the exciting opportunities it presents. In mid-August 2021, two weeks after the AlphaFold2 structures were released, we announced that we had moved quickly to integrate AlphaFold2 into the proteome structure pipeline used for training MatchMaker, our deep learning model for predicting drug/target interactions. We are really excited about this work, as is the scientific community, judging by the feedback we have received. We also recognize that to use a new technology effectively, it is important to be aware of its limitations and the pitfalls that can result.

One of the most important things to know when using a machine learning model is that while the model may have been built with a specific goal in mind, such as predicting the structure of an individual protein chain given its sequence, what the model actually does is always defined by the nature of its training data. In this case, AlphaFold2 predicts what protein chains would look like if they were found in the PDB (Protein Data Bank), and it’s important when using AlphaFold2 to know that many of these structures are not actually the folded state of an individual protein.

Essentially, when people study proteins they don’t always study them to obtain individual protein structures. A large amount of the data in the PDB comes from people studying structures that only form in specific contexts. For example, the PDB is filled with proteins that only fold upon binding to other proteins, proteins that fold upon binding to substrates or metal ions, proteins that only fold when they are chemically modified, and proteins that fold directly into large complexes, such as the ribosome.

When using known structures, the context is directly available, both in the structures themselves and in the literature associated with them, and many structures are published with explicit descriptions of why a protein will not adopt that structure on its own. When structures are not known, AlphaFold allows you to obtain them at unprecedented accuracy, but the trade-off relative to earlier methods is that they come stripped of context.

As an example, we can illustrate how the human 60S ribosomal protein L19 shows up in the context of the PDB (PDB ID: 4UG0). The ribosome is a large molecular machine that the cell uses to print proteins, reading our genetic code as a blueprint, and it is built as a complex from a diverse set of proteins bound to large ribosomal RNA molecules. The structure that L19 forms in the PDB is therefore specific to the complex. We know that ribosomal protein L19 does not adopt that structure free in solution; it folds that way only during the construction of the ribosome.

When AlphaFold2 predicts the structure of L19 it provides a single chain, but it does not predict the fold that the protein would take on its own. Instead, it predicts the structure that this protein is likely to have when found in the PDB. This prediction is very accurate, but as Fig 1 shows, it recalls the structure without providing the context.

The nature of these predictions is important for users to understand because context is required for understanding protein behavior, and while AlphaFold2 provides a powerful tool for obtaining structures, it is up to the user to provide the context. Essentially, when using AlphaFold2, data pipelines need to be built that can bring the context back, and understanding how AlphaFold2 works is key to that process.
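
As a rough illustration of what bringing the context back could look like in a pipeline, the sketch below attaches context records to a predicted chain and flags predictions for which no known experimental structure shows the protein on its own. Every name in it (`ContextRecord`, `PredictedChain`, `needs_context_review`, and the field names) is hypothetical and invented for this example. It is not our production pipeline or any AlphaFold2 API, and it assumes the context records come from a separate lookup of experimental structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextRecord:
    """Hypothetical summary of one experimental structure containing the protein."""
    pdb_id: str
    in_complex: bool = False  # solved together with other protein or RNA chains
    bound_ligands: List[str] = field(default_factory=list)  # substrates, metal ions, ...

    def is_isolated(self) -> bool:
        # The chain was observed alone: no partner chains and no bound ligands.
        return not self.in_complex and not self.bound_ligands


@dataclass
class PredictedChain:
    """Hypothetical wrapper pairing a predicted chain with its known contexts."""
    name: str
    contexts: List[ContextRecord] = field(default_factory=list)


def needs_context_review(chain: PredictedChain) -> bool:
    """Return True when no known structure shows the chain folded on its own."""
    if not chain.contexts:
        # Nothing is known about the context, so treat the prediction with caution.
        return True
    return not any(record.is_isolated() for record in chain.contexts)


# The L19 example from the text: its experimental structure is part of the ribosome.
l19 = PredictedChain(
    name="60S ribosomal protein L19",
    contexts=[ContextRecord(pdb_id="4UG0", in_complex=True)],
)
print(l19.name, "needs context review:", needs_context_review(l19))
```

In this sketch, L19 is flagged because its only context record is the ribosome entry, which is the kind of signal a downstream method would need before treating the predicted chain as a free-standing fold.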

We believe that with the right amount of caution, the potential of AlphaFold2 to increase the accuracy of structure-based computational methods in medicine is enormous. We are particularly excited about using AlphaFold2 to access protein structure as easily as genetic sequence across the entire tree of life. We’ve already begun to use this to extend the reach of Cyclica’s Ligand Design into infectious disease, animal models, agriculture and other areas that benefit from working with non-human proteomes.

Fig 1: Molecular visualizations of Ribosomal Protein L19 as found in the solved structure of the Human 80S Ribosome on the left, and in the AlphaFold2 prediction set on the right. 

Dr. Robert Vernon, Senior Computational Scientist
Dr. Andreas Windemuth, Chief Science Officer

