Team ZONTAL April 21, 2022

A new state-of-the-art model for molecular conformer generation


In structure-based drug discovery, most methods rely on two key elements of accuracy: accurate protein structure modeling and accurate drug structure modeling. AlphaFold is able to predict protein structures with unprecedented accuracy, but drug structure modeling lags behind: current models for conformer generation provide only 67% accuracy on a common molecular conformer benchmark. GeoDiff predicts drug conformations with a neural network diffusion model, increasing accuracy to 89% on the same benchmark. Improvements in drug modeling cascade into better prediction of drug affinity, toxicity, and other pharmacokinetic properties, reducing drug development cost and time-to-market while increasing effectiveness.

 

GeoDiff approaches the problem of predicting the geometric conformation of a molecule from its molecular graph. It builds on a type of generative deep learning model called a “diffusion model”, which transforms a sample from a simple distribution, like a Gaussian distribution, into a sample from a more complicated distribution, like the Boltzmann distribution over molecular conformers. In this post, we will use the example of a conformer and the Boltzmann distribution even when referring to general diffusion models.


Figure 1: The diffusion process converting a Gaussian sample to a conformer. Forward diffusion goes right to left, backward diffusion goes left to right.

 

Diffusion models assume a relationship between the Boltzmann distribution and a Gaussian distribution: given enough added Gaussian noise, the Boltzmann distribution can be transformed into a Gaussian distribution. This part, which we call the “forward” direction, is easy to model. But diffusion models also presume the existence of a “backward” model that can remove noise from a Gaussian sample to recover a conformer from the Boltzmann distribution. They treat sampling as a Markov chain: starting from a Gaussian sample, each step removes a small amount of noise, and the final step yields a sample from the Boltzmann distribution.
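The forward/backward Markov chain above can be sketched in a few lines of NumPy. This is purely illustrative: the linear variance schedule and the `denoise_step` placeholder are toy stand-ins, not GeoDiff's actual schedule or network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # toy noise-variance schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal retained after t steps

def forward_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): add Gaussian noise in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def backward_sample(denoise_step, shape, rng):
    """Start from pure Gaussian noise and remove noise one Markov step at a time."""
    x = rng.standard_normal(shape)       # x_T ~ N(0, I)
    for t in reversed(range(T)):
        x = denoise_step(x, t)           # stands in for learned p_theta(x_{t-1} | x_t)
    return x
```

By the last step `alphas_bar[T-1]` is nearly zero, so `forward_sample` returns almost pure noise — exactly the Gaussian end of the chain that `backward_sample` starts from.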

That noise-removal process must be learned, and is learned by optimizing the “Evidence Lower Bound” (ELBO) popularized by the Variational Autoencoder paper by Kingma and Welling. Essentially, the ELBO is a Kullback–Leibler (KL) divergence loss ensuring that a distribution over latent variables matches a known distribution. In the case of diffusion models, the latent variables are the Markov steps in between a Gaussian sample and the conformer sample. The KL divergence term ensures that, at each step in the Markov chain, the distribution over Gaussian samples with noise removed (backward samples) matches the distribution of conformer samples with noise added (forward samples).


The ELBO objective for GeoDiff. q models forward diffusion, pθ models (learned) backward diffusion.
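For readers without the image, the standard diffusion ELBO (in the DDPM form; GeoDiff's objective is a variant of this, so treat it as a reference shape rather than the paper's exact formula) can be written as:

```latex
\mathcal{L}_{\mathrm{ELBO}}
  = \mathbb{E}_q\!\Big[
      D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)
    + \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)
    - \log p_\theta(x_0 \mid x_1)
  \Big]
```

Here q(x_{t-1} | x_t, x_0) is the forward-process posterior (noise added), and pθ is the learned backward denoising model; minimizing each KL term matches backward samples to forward samples at every step of the chain, as described above.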

This is a very high-level flyover of diffusion models; Lilian Weng has a great blog post deriving the diffusion model objective in more detail.

GeoDiff’s main innovation is a diffusion model designed to be equivariant, allowing it to operate on atom coordinates independent of their original position and orientation.

For its architecture, GeoDiff uses a “Graph Field Network” (GFN), which combines invariant graph features and invariant interatomic distances with equivariant relative position vectors to predict equivariant coordinate updates. This provides equivariance in the backward diffusion process.

However, the forward process of adding Gaussian noise to coordinates is not inherently equivariant. To overcome this, when considering the forward diffusion term during training, GeoDiff centers the coordinates to have zero center-of-mass at each step of the diffusion process and aligns the coordinates with a consistent frame of reference, making the forward diffusion target equivariant.
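The zero center-of-mass trick can be sketched directly (toy helper functions; GeoDiff's full procedure, including the frame alignment, is in the paper):

```python
import numpy as np

def center_zero_com(x):
    """Translate (N, 3) coordinates so their mean (unit-mass center of mass) is zero."""
    return x - x.mean(axis=0, keepdims=True)

def add_centered_noise(x, sigma, rng):
    """Add Gaussian noise projected onto the zero-center-of-mass subspace,
    so the noised coordinates stay translation-normalized at every step."""
    eps = rng.standard_normal(x.shape)
    eps -= eps.mean(axis=0, keepdims=True)   # project noise to zero CoM
    return center_zero_com(x) + sigma * eps
```

Subtracting the mean from both the coordinates and the noise keeps every step of the forward chain in the same translation-normalized frame, removing the global position as a degree of freedom the model would otherwise have to learn.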

As for results, GeoDiff achieves significant improvements (as much as 50%) in the COV-R, MAT-R, COV-P, and MAT-P metrics, which are variants of recall and precision designed to measure how well distributions overlap. For the sake of keeping this post short, I’ll refer you to the paper for a more detailed description of metrics and methods.
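The recall-flavored metrics can be sketched from a pairwise RMSD matrix between reference and generated conformers: COV-R asks how many reference conformers have a close generated match, MAT-R averages the closest-match RMSD. A toy sketch under that reading (see the paper for the exact definitions; the precision variants swap the roles of the two sets):

```python
import numpy as np

def cov_mat_recall(rmsd, delta):
    """rmsd: (n_ref, n_gen) pairwise RMSD matrix; delta: coverage threshold.

    COV-R: fraction of reference conformers with at least one generated
    conformer within delta.  MAT-R: mean, over reference conformers, of the
    RMSD to the closest generated conformer.
    """
    best = rmsd.min(axis=1)            # closest generated match per reference
    cov = float((best < delta).mean())
    mat = float(best.mean())
    return cov, mat
```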

This paper will be presented in a few months at ICLR 2022 in the Machine Learning for Drug Discovery Workshop, along with other exciting papers (see our previous posts on EquiDock and EquiBind).