Deep learning opens up protein science’s next frontiers

[Figure: Artistic rendering of a protein molecule. Credit: Ian Haydon, Institute for Protein Design]

At their heart, proteins are just polymers: linear chains of amino acids drawn from a library of 20 or so possibilities. But unlike synthetic polymers, which tend to contort randomly, proteins reliably fold into characteristic three-dimensional shapes. The diversity of those shapes gives rise to the complexity of the biological world.

Discovering protein structures, whether by theory or experiment, is a notoriously challenging problem. The first one wasn’t determined until the late 1950s, long after the basic rules governing the structures of smaller molecules were well understood. Today the Protein Data Bank contains structures of more than 180 000 different proteins. But even that vast resource barely makes a dent in the tens of millions of proteins known to be encoded by genes across all living species.

Last November, as part of the Critical Assessment of Structure Prediction (CASP) project, researchers at DeepMind in London showed that their AlphaFold 2 model had made astonishing headway. Given a protein’s amino-acid sequence, AlphaFold 2 could predict its structure with most atomic positions correct to within an angstrom—less than the length of a chemical bond.
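How is “correct to within an angstrom” quantified? CASP itself relies on scores such as the global distance test, but the simplest widely used measure is the root-mean-square deviation (RMSD) between predicted and experimental atomic coordinates after optimal rigid-body superposition. The sketch below illustrates that comparison with the Kabsch algorithm; the function name and the synthetic test data are invented for illustration and come from neither paper.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD (in the coordinates' units, e.g. angstroms) between two
    (N, 3) arrays of atom positions after optimal superposition."""
    P = P - P.mean(axis=0)                   # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                              # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation (Kabsch)
    return np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P))

# Sanity check: a rotated, slightly perturbed copy of a random "protein"
# should come back with a sub-angstrom RMSD.
rng = np.random.default_rng(0)
true_coords = rng.normal(scale=10.0, size=(100, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
pred_coords = true_coords @ rot.T + rng.normal(scale=0.3, size=(100, 3))
print(f"RMSD: {kabsch_rmsd(pred_coords, true_coords):.2f}")  # roughly 0.5
```

A prediction whose RMSD to the experimental structure is below about 1 in these units would match the accuracy the CASP assessors reported for AlphaFold 2’s best models.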

Inspired by AlphaFold 2 but having only a rough idea of the model’s architecture, Minkyung Baek, in David Baker’s group at the University of Washington, and her colleagues developed a similarly capable model, called RoseTTAFold, in time to publish their results concurrently with AlphaFold 2’s this summer.

Both AlphaFold 2 and RoseTTAFold use deep learning—a type of artificial intelligence—which means that their internal workings are largely a black box. But their guiding principles are well known, because they’re the same principles that have been guiding structural biologists for years.

All biological protein sequences are the result of evolution, and any given protein is evolutionarily related to thousands of others across the tree of life. A protein of interest likely has evolutionary cousins already catalogued in sequence databases, and those cousins can hold clues about its structure even if their own structures are unknown.

How can that happen? When one amino acid in a protein randomly mutates, it creates an evolutionary pressure for its neighbors in 3D space to mutate too, to preserve the overall folding energy and keep the structure stable. By looking at which pairs of amino acids most frequently mutate in tandem across a family of related proteins, one can glean information about which amino acids are near each other in the folded structure. The power of deep learning is that it can recognize those patterns and correlations more keenly than human observers or more straightforward algorithms.
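To make the idea concrete, here is a minimal sketch of the classic version of that analysis: score pairs of columns in a multiple sequence alignment by their mutual information, so that strongly covarying columns, which are candidate 3D contacts, rise to the top. The toy alignment and function name are invented for illustration; production methods add corrections for phylogenetic bias and use more powerful statistical models, and the deep networks learn such couplings implicitly rather than through an explicit calculation like this one.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

# Toy multiple sequence alignment: each row is one homologous sequence.
# Columns 0/2 and 1/3 are built to covary, mimicking contacts in the fold.
msa = [
    "ACDEG",
    "AVDKG",
    "SCNEG",
    "SVNKG",
    "ACDEG",
    "SVNKG",
]

columns = list(zip(*msa))  # transpose: one tuple per alignment column
scores = {
    (i, j): mutual_information(columns[i], columns[j])
    for i, j in combinations(range(len(columns)), 2)
}
for (i, j), mi in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"columns {i} and {j}: MI = {mi:.2f} bits")
```

Running this ranks the column pairs (0, 2) and (1, 3) at the top, exactly the pairs constructed to mutate in tandem; the invariant final column carries no coevolutionary signal at all.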

Precisely relating protein sequence to protein structure was a long-standing goal, but it’s only part of what biologists want to know. Proteins in living things aren’t isolated, static structures; almost all of what they do involves flexing their shapes and interacting with other molecules.

Deep-learning methods have made some progress toward solving the structures of multimolecular complexes. For example, the structure in the figure, found by RoseTTAFold, shows the signaling protein interleukin-12 bound to its receptor. But overall, predicting multiprotein structures is much more challenging than predicting single-protein structures. Mutation correlations provide far less information about which amino acids interact when the interactions span different molecules, especially when the molecules didn’t evolve in the same species, as in the case of a viral protein and the host protein it attacks. Understanding protein complexes and dynamics will continue to challenge biologists, chemists, physicists—and programmers too—for years to come. (J. Jumper et al., Nature, 2021, doi:10.1038/s41586-021-03819-2; M. Baek et al., Science, 2021, doi:10.1126/science.abj8754.)
