Machine Learning for Drug Discovery at ICLR 2022

For the last decade, the field of deep learning and AI has been dominated by applications to images and text. However, in the past two years, the field has seen an upsurge of chemical and biological applications. The International Conference on Learning Representations (ICLR) is the largest academic AI conference in the world, with an h5-index […]

Fragment Ligand Generation

Extremely data-efficient ligand generation

What is a sufficient number of data points to train a deep learning algorithm? 1,000? 1 million? 1 billion? Of course, it depends on the problem. But it also depends on the neural network architecture and training algorithm chosen to solve the problem. Powers et al. recently published a preprint describing a ligand […]

Team ZONTAL, April 21, 2022

A new state-of-the-art model for molecular conformer generation

In structure-based drug discovery, most methods rely on two key elements of accuracy: accurate protein structure modeling and accurate drug structure modeling. AlphaFold is able to predict protein structures with unprecedented accuracy. But drug structure modeling lags behind, with current models for conformer generation only providing 67% accuracy on a common molecular conformer benchmark. GeoDiff predicts drug conformations with […]

Team ZONTAL, April 14, 2022

Deep learning methods for designing proteins scaffolding functional sites

In a recent preprint from the Baker Lab, Jue Wang et al. outlined a framework for protein design that uses protein structure prediction neural networks. This framework defines several types of protein design problems and demonstrates multiple effective approaches to solve these problems. Here we condense this detailed study into its main points. Protein Design Tasks: The […]

Design of protein binding proteins from target structure alone

How do you design a protein that binds to another protein given only the target protein structure? Until recently, you could use Rosetta to manually craft a protein using expert heuristics. However, this process is laborious, expensive, and does not generalize. Researchers at the Institute for Protein Design recently published a groundbreaking work outlining a systematic process […]

Team ZONTAL, March 17, 2022

The Protein Folding Problem – is it Solved?

In CASP14, DeepMind presented the results of AlphaFold, a deep neural network designed for protein structure prediction. During the experiment, AlphaFold predicted structures with an average deviation of ~1 Å from the C-alpha atoms of experimentally solved structures. Now, ~1 Å is often also used as the resolution to denote high accuracy experimental structures (they […]

Team ZONTAL, February 10, 2022

Predicting Transition State Structures with Tensor Field Networks and Transfer Learning

The year 2021 was a Pandora’s Box for machine learning in chemistry. DeepMind put the chemistry world on notice when it published its approach to the protein folding problem [1]. I expect that we will continue to see machine learning approaches quickly dominate the well-defined, data-rich problems in chemistry. However, there are other challenges that are harder to […]

EquiDock

Today we break down a paper recently accepted for publication at ICLR 2022: “Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking” [1]. Docking is the problem of finding the position and orientation in which a ligand binds to a protein. Solving the computational docking problem would increase our understanding of biological interactions at the molecular level and catalyze a leap forward in drug discovery. But current docking methods are slow, preventing the use of docking at scale. The authors’ method, “EquiDock”, takes advantage of symmetries in 3D space to achieve an 80-500x speed-up in protein-protein docking. How did they do it? Let’s dive in.

“Inductive bias” is a term machine learning researchers use to describe design choices in the structure of a neural network that adapt it to a specific problem. For example, convolutional neural networks (CNNs) are designed to be approximately “translation invariant”: after being trained on images with dogs in the bottom left quadrant, a CNN will still recognize images with dogs in the top right quadrant. With a good inductive bias, the learning problem becomes easier for the network to solve, and the network can be expected to generalize better.

Similarly, when working with data in 3D Euclidean space, like proteins and other molecules, we can take known physical principles and build them into neural networks. In protein-protein docking, the docked pose is “equivariant” to the initial positions and orientations of the two proteins: rotating or shifting the inputs rotates and shifts the predicted pose in the same way. EquiDock predicts docking poses that are equivariant to the poses of the input proteins.
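To make the two terms concrete, here is a minimal NumPy sketch, not taken from the paper. It uses two toy functions of our own choosing: the centroid of a point cloud, which is equivariant (it moves with the input), and the matrix of pairwise distances, which is invariant (it does not change). The helper `random_rotation` and all variable names are illustrative assumptions.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper 3x3 rotation matrix (determinant +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs so the factorization is unique
    if np.linalg.det(q) < 0:     # flip one axis to turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

def centroid(x):
    """Equivariant output: it rotates and shifts together with the input."""
    return x.mean(axis=0)

def pairwise_distances(x):
    """Invariant output: unchanged when the input is rotated or shifted."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # five toy "atoms"
rot = random_rotation(rng)
shift = rng.normal(size=3)
x_moved = x @ rot.T + shift   # apply the same rigid motion to every atom

# Equivariance: f(R x + t) == R f(x) + t
equivariant_ok = np.allclose(centroid(x_moved), rot @ centroid(x) + shift)
# Invariance: g(R x + t) == g(x)
invariant_ok = np.allclose(pairwise_distances(x_moved), pairwise_distances(x))
print(equivariant_ok, invariant_ok)
```

A docking model built only from operations like these inherits the same guarantees by construction, rather than having to learn them from data.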

The algorithm is also pairwise independent, meaning that if you reverse the roles of protein 1 and protein 2, the result is the same.

Now let’s get into the algorithm. At the heart of EquiDock is an equivariant keypoint-prediction graph neural network, combined with a differentiable “keypoint alignment” algorithm.

First, the algorithm creates two graphs, {V1, E1} and {V2, E2}, one for each protein. It treats C-alpha atoms as “nodes”, with edges between C-alpha atoms that fall within a certain radius of each other. The network takes as input the C-alpha coordinates of the two proteins, as well as node features (e.g. amino acid identity) and edge features (e.g. distance between C-alpha atoms) that are invariant to the coordinate frame of reference.
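The graph construction described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the authors' code: the function name `build_radius_graph`, the 5.0 cutoff, and the toy coordinates are all assumptions, and the only edge feature computed is the distance.

```python
import numpy as np

def build_radius_graph(ca_coords, radius):
    """Return directed edges (i, j) between distinct nodes closer than `radius`,
    together with the length of each edge (a frame-invariant edge feature)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                           # (n, n) distance matrix
    mask = (dist < radius) & ~np.eye(len(ca_coords), dtype=bool)   # drop self-loops
    src, dst = np.nonzero(mask)
    edges = np.stack([src, dst], axis=1)
    edge_lengths = dist[src, dst]
    return edges, edge_lengths

# Toy chain of four C-alpha atoms spaced 3.8 units apart along the x axis.
ca = np.array([[0.0, 0.0, 0.0],
               [3.8, 0.0, 0.0],
               [7.6, 0.0, 0.0],
               [11.4, 0.0, 0.0]])
edges, lengths = build_radius_graph(ca, radius=5.0)
print(edges.tolist())
```

With this cutoff only consecutive atoms are connected, in both directions. Because the edge lengths depend only on distances, they stay the same however the protein is rotated or translated.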

The IEGMN (Independent E(3)-Equivariant Graph Matching Network) then transforms the coordinates, node features, and edge features, combining information from all of these to update the node features and coordinates. This is done recurrently, with shared weights for each layer. The last layer in the IEGMN predicts K “keypoints”, or predicted binding pocket points, each as a weighted sum of the updated node coordinates. The weights come from a multi-head attention mechanism based on the updated node features from each protein. Crucially, all updates to the coordinates, including this final keypoint layer, respect equivariance.
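Why does a weighted sum of coordinates stay equivariant? A rough sketch of the keypoint layer shows the idea. The attention form used here (one weight matrix per head, applied to the mean feature vector of the other protein) is a simplifying assumption for illustration, not a claim about the exact layer in the paper. The names `predict_keypoints` and `head_weights` are hypothetical.

```python
import numpy as np

def softmax(scores, axis):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def predict_keypoints(coords, feats, other_feats, head_weights):
    """coords: (n, 3), feats: (n, d), other_feats: (m, d), head_weights: (K, d, d).
    Returns K keypoints of shape (K, 3), each a convex combination of the node coordinates."""
    d = feats.shape[1]
    context = other_feats.mean(axis=0)       # (d,) summary of the other protein
    queries = head_weights @ context         # (K, d), one query per attention head
    scores = feats @ queries.T / np.sqrt(d)  # (n, K), built from invariant features only
    attn = softmax(scores, axis=0)           # each head's weights sum to 1 over the nodes
    return attn.T @ coords                   # (K, 3)

rng = np.random.default_rng(1)
n, m, d, K = 6, 4, 8, 3
coords = rng.normal(size=(n, 3))
feats, other_feats = rng.normal(size=(n, d)), rng.normal(size=(m, d))
head_weights = rng.normal(size=(K, d, d))

theta = 0.7  # rotate by 0.7 rad about the z axis, then shift
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
shift = np.array([1.0, -2.0, 0.5])

keypoints = predict_keypoints(coords, feats, other_feats, head_weights)
keypoints_moved = predict_keypoints(coords @ rot.T + shift, feats, other_feats, head_weights)
print(np.allclose(keypoints_moved, keypoints @ rot.T + shift))
```

The weights depend only on invariant node features and sum to one, so rotating and shifting the input coordinates rotates and shifts the keypoints in exactly the same way.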

The next step uses a differentiable singular value decomposition to recover the rotation that aligns the predicted keypoint coordinates (Y1) of one protein with the keypoint coordinates (Y2) of the other. The translation is then the vector that moves the mean of the rotated Y1 onto the mean of Y2.

This predicted rotation and translation for Y1 is then applied to the original coordinates X1, to obtain a final prediction for the coordinates of the docked protein. The main loss is the mean-squared error between this prediction and the true coordinates.
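The alignment and loss steps can be sketched with the standard SVD-based rigid alignment, often called the Kabsch algorithm. The NumPy version below is a plain, non-differentiable illustration under our own assumptions: the toy keypoints and coordinates are random, and an autodiff framework would be needed to backpropagate through the SVD as the paper describes.

```python
import numpy as np

def kabsch(y1, y2):
    """Find the rotation R and translation t minimizing sum_k ||R y1_k + t - y2_k||^2."""
    mu1, mu2 = y1.mean(axis=0), y2.mean(axis=0)
    cov = (y1 - mu1).T @ (y2 - mu2)         # 3x3 cross-covariance of centered keypoints
    u, _, vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against returning a reflection
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    trans = mu2 - rot @ mu1                 # moves the rotated mean of y1 onto the mean of y2
    return rot, trans

rng = np.random.default_rng(2)
theta = 1.1  # ground-truth rigid motion used to build the toy example
rot_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
trans_true = np.array([3.0, -1.0, 2.0])

y1 = rng.normal(size=(4, 3))     # toy keypoints on protein 1
y2 = y1 @ rot_true.T + trans_true  # matching keypoints on protein 2
x1 = rng.normal(size=(10, 3))    # toy coordinates of protein 1
x1_true = x1 @ rot_true.T + trans_true

rot, trans = kabsch(y1, y2)
x1_docked = x1 @ rot.T + trans   # apply the recovered motion to the full protein
mse = np.mean((x1_docked - x1_true) ** 2)
print(mse)
```

In this toy setup the keypoints match perfectly, so the recovered motion equals the true one and the mean-squared error is zero up to floating-point error. During training, the error measures how far the predicted keypoints are from giving the right alignment.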

The architecture is shown below:

[Figure: Independent SE(3)-equivariant models for end-to-end rigid protein docking]

Now, there are some unanswered questions left. What is to prevent the network from learning keypoints that don’t correspond to the true binding pocket? Does this network respect physical non-intersection constraints?

The authors address the first question by adding an auxiliary optimal transport loss that pushes the predicted keypoints to match the true binding pocket points.

To address the second, they define a function that describes the border of each protein and add an auxiliary loss penalizing atoms from the other protein that fall inside this border.
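One way such a penalty could look is sketched below. The border function used here, a smooth soft-minimum of squared distances to a protein's atoms, and the values of `sigma` and `gamma` are assumptions chosen for illustration. They should not be read as the paper's exact definition or hyperparameters.

```python
import numpy as np

def surface_function(points, atoms, sigma):
    """Smooth stand-in for the squared distance from each point to the nearest atom.
    Small values mean the point sits inside or very close to the protein."""
    sq_dist = ((points[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=-1)  # (p, a)
    z = -sq_dist / sigma
    z_max = z.max(axis=1, keepdims=True)  # stable log-sum-exp
    log_sum = z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1))
    return -sigma * log_sum

def intersection_penalty(atoms_a, atoms_b, sigma=8.0, gamma=4.0):
    """Penalize atoms of one protein whose surface value w.r.t. the other falls below gamma."""
    b_inside_a = np.clip(gamma - surface_function(atoms_b, atoms_a, sigma), 0.0, None)
    a_inside_b = np.clip(gamma - surface_function(atoms_a, atoms_b, sigma), 0.0, None)
    return b_inside_a.mean() + a_inside_b.mean()

atoms_a = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
overlapping_b = atoms_a + np.array([0.5, 0.0, 0.0])   # nearly on top of protein A
separated_b = atoms_a + np.array([0.0, 50.0, 0.0])    # far away from protein A

penalty_overlap = intersection_penalty(atoms_a, overlapping_b)
penalty_far = intersection_penalty(atoms_a, separated_b)
print(penalty_overlap > 0.0, penalty_far == 0.0)
```

The penalty is positive only when atoms of one protein sit inside the other's border, and it is built from smooth operations, so it can be added to the main loss and optimized by gradient descent.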

The model performs well, achieving better accuracy than many docking algorithms that take 100x longer, though lower accuracy than the HDock algorithm. But EquiDock achieves its accuracy much faster, enabling high-throughput docking.

What are some possible extensions of this research? The most natural is to apply the same model to drug-protein docking. The exact same algorithm can be used, using drug atom coordinates instead of C-alpha coordinates. The authors also plan to extend the work to flexible docking and molecular dynamics, allowing for flexibility in protein-ligand interactions.

What are some potential use cases of this model? With fast protein-protein docking, it is possible to computationally create much larger protein-protein interaction networks. These protein-protein interaction networks, paired with other types of interaction networks, can be mined for new scientific discoveries about biochemical pathways, biochemical causes of disease, causes of drug toxicity, and more.

[1] “Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking”, https://arxiv.org/pdf/2111.07786.pdf