Unlocking Nature’s Molecular Secrets: Introduction to How AI is Transforming Metabolome Research

What is Metabolomics, anyway? Metabolomics, a rapidly evolving field in scientific exploration, has emerged as a powerful tool for understanding the complex dynamics of biological systems [2]. By closely examining small molecules, including those generated by cellular processes and external sources, metabolomics goes beyond traditional boundaries, providing a comprehensive view of molecular landscapes. In the […]

Claude 3 – The Next Generation of AI Assistants

It took almost a year, but we finally have a challenger for GPT-4. In the rapidly evolving field of artificial intelligence, a new milestone has been reached with the introduction of Claude 3, the latest iteration of Anthropic’s groundbreaking AI assistant. Building upon the success of its predecessors, Claude 3 promises to revolutionize the way […]

Data Centricity: Key for the Successful Digital Journey towards a Digital Lab

While data meshes and data fabrics are often discussed, data centricity is often not implemented consistently, despite its significant implications. Data centricity places data at the core of operations and decision-making processes, going beyond mere data utilization in applications. To implement data centricity, three principles are crucial: recognizing data as a key asset, ensuring data […]

Disconnection-Aware Retrosynthesis

In a new paper, researchers at IBM Research recently presented a novel approach to retrosynthesis. In chemical synthesis, the retrosynthesis problem involves determining the optimal sequence of steps to synthesize a given molecule starting from readily available building blocks, known as precursors. In retrosynthesis, a chemist or computational model must first identify a suitable disconnection […]

DiffDock – A Diffusion Model for Molecular Docking

Molecular docking is a critical task in drug design, as it involves predicting the binding structure of a small molecule ligand to a protein. Traditional methods for molecular docking rely on search-based algorithms and scoring functions to estimate the correctness of a proposed structure. However, these methods can be slow and inaccurate, especially for high-throughput […]

RFDiffusion – Leveraging the Power of DDPMs to Generate Protein Sequences and Structures

RFDiffusion is a new method for protein design that leverages the power of denoising diffusion probabilistic models (DDPMs) to generate protein sequences and protein structures. This approach represents a significant advance in the field of protein design, as it allows for the design of complex protein architectures and functions from simple molecular specifications. Figure 1: RFDiffusion […]

MILCDock – Machine Learning Consensus Docking

Molecular docking tools are commonly used in drug discovery to computationally identify new molecules through virtual screening. However, these tools often suffer from inaccurate scoring functions that can vary in performance across different proteins. To address this issue, researchers at Brigham Young University have developed MILCDock, a machine learning consensus docking tool that uses predictions from […]

DALL·E 2, Imagen, and Applications to Chemistry

In the past two months, DALL·E 2 has taken over the internet. From Bart Simpson edited into Egyptian art to Donald Trump as the Lorax, text-to-image AI produces amazing results.

Caption: “Panda weaving a basket made of cyclohexane”, DALL·E 2

Are these an impressive-but-gimmicky party trick? Or can these innovations be harnessed for applications in scientific domains?

Many AI methods are developed in the laboratory before seeing practical adoption, which allows algorithms to be measurably improved before real-life application. For example, reinforcement learning algorithms that were tuned on video games are now used for robotics. Likewise, the Transformer architecture that was developed for text (admittedly a useful application in its own right) was recently adapted for AlphaFold, a model that has now predicted structures for nearly every protein known to science.

The backbone of text-to-image AI is a newly popular type of neural network called a “diffusion model”. These models gradually transform an image of random pixels into a high-resolution image. The DALL·E 2 (OpenAI) and Imagen (Google) models render photorealistic images from arbitrary text descriptions.
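To make "gradually transform random pixels" concrete, here is a minimal sketch of the denoising loop on a toy problem. It is not how DALL·E 2 or Imagen is implemented. The "images" are single numbers drawn from a Gaussian, the step count and linear noise schedule are assumed values, and `predict_noise` is an analytic stand-in for the trained neural network that a real model would use.

```python
import numpy as np

# Toy "dataset": 1-D values drawn from N(MU, S^2). All constants here are
# illustrative assumptions, not settings from DALL-E 2 or Imagen.
MU, S = 2.0, 0.5
T = 1000                                  # number of denoising steps (assumed)
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def predict_noise(x_t, t):
    """Analytic stand-in for a trained network: E[noise | x_t] for Gaussian data."""
    ab = alpha_bars[t]
    var_t = ab * S**2 + (1.0 - ab)        # variance of x_t after noising the data
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * MU) / var_t


def sample(n, rng):
    """Start from pure noise and remove a little of it at every step."""
    x = rng.standard_normal(n)
    for t in range(T - 1, -1, -1):
        eps_hat = predict_noise(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                         # no fresh noise on the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(n)
    return x


samples = sample(20_000, np.random.default_rng(0))
print(f"mean={samples.mean():.2f}, std={samples.std():.2f}")  # should be close to MU and S
```

The samples start as standard normal noise and end up distributed like the toy data. In an image model the same loop runs over every pixel at once, with a large neural network in place of `predict_noise`.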

Caption: “Elephant toothpaste explosion with foam in the shape of an elephant”, DALL·E 2

So how can diffusion models help chemistry?

First, we should differentiate between conditional diffusion models and unconditional diffusion models. Unconditional diffusion models generate images randomly sampled from the distribution of the training data. Conditional diffusion models, on the other hand, steer the generated image according to some other type of input. In the case of DALL·E 2 and Imagen, this other input is text.
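The difference can be sketched on a toy problem. In the example below, the training data is assumed to be an equal mix of two hypothetical "classes" of numbers, centred at -2 and +2, and the class label plays the role that text plays in DALL·E 2. The denoiser is again an analytic stand-in for a trained network, and the schedule values are assumptions.

```python
import numpy as np

# Hypothetical toy data: two "classes" of 1-D samples centred at -2 and +2.
MEANS = np.array([-2.0, 2.0])
S = 0.3
T = 1000                                  # assumed number of denoising steps
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def predict_noise(x_t, t, label=None):
    """Analytic stand-in for a trained denoiser.

    label=None -> unconditional: the data is an equal mixture of both classes.
    label=k    -> conditional: the data is restricted to class k.
    """
    ab = alpha_bars[t]
    var_t = ab * S**2 + (1.0 - ab)
    centres = np.sqrt(ab) * MEANS                       # class means after noising
    per_class = np.sqrt(1.0 - ab) * (x_t[:, None] - centres) / var_t
    if label is not None:
        return per_class[:, label]
    # Posterior probability of each class given x_t (softmax of log-likelihoods).
    logits = -0.5 * (x_t[:, None] - centres) ** 2 / var_t
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * per_class).sum(axis=1)


def sample(n, rng, label=None):
    """Denoise from pure noise, optionally steering with a class label."""
    x = rng.standard_normal(n)
    for t in range(T - 1, -1, -1):
        eps_hat = predict_noise(x, t, label)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(n)
    return x


rng = np.random.default_rng(0)
unconditional = sample(5_000, rng)           # lands near -2 or +2 at random
conditional = sample(5_000, rng, label=1)    # steered towards the class at +2
```

The sampling loop is identical in both cases. Only the denoiser changes, because it receives the extra input.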

Diffusion models have some constraints. The output must have a fixed size – for example, diffusion models can generate images of fixed resolution, but they cannot generate variable-length sentences. Additionally, diffusion model outputs must be continuously-valued (though it is possible to generate continuously-valued embeddings of discretely-valued data). Finally, diffusion models are generative models, meaning that they are designed for sampling from distributions. So, it makes most sense to use them when there is an interesting distribution over an output variable, as opposed to a single correct value.
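The first two constraints can be worked around in a simple way, sketched below under stated assumptions. Discrete tokens are padded to a fixed length and mapped to continuous vectors, and a continuous output is mapped back by snapping each vector to its nearest token embedding. The vocabulary, the maximum length, and the random embedding table are all hypothetical. A real system would learn the embeddings.

```python
import numpy as np

VOCAB = ["<pad>", "C", "N", "O", "=", "(", ")"]    # hypothetical toy vocabulary
MAX_LEN = 8                                         # fixed output size (assumed)
rng = np.random.default_rng(0)
EMBED = rng.standard_normal((len(VOCAB), 16))       # random stand-in for learned embeddings


def encode(tokens):
    """Pad to MAX_LEN and map each token to its continuous embedding vector."""
    if len(tokens) > MAX_LEN:
        raise ValueError("sequence is longer than the fixed output size")
    padded = tokens + ["<pad>"] * (MAX_LEN - len(tokens))
    ids = [VOCAB.index(tok) for tok in padded]
    return EMBED[ids]                               # shape (MAX_LEN, 16)


def decode(vectors):
    """Snap each continuous vector to the nearest embedding, then drop padding."""
    dists = np.linalg.norm(vectors[:, None, :] - EMBED[None, :, :], axis=-1)
    tokens = [VOCAB[i] for i in dists.argmin(axis=1)]
    return [tok for tok in tokens if tok != "<pad>"]


x = encode(list("CC(=O)O"))                         # acetic acid SMILES, 7 tokens
noisy = x + 0.1 * rng.standard_normal(x.shape)      # stands in for imperfect model output
recovered = "".join(decode(noisy))
```

A diffusion model would operate on arrays shaped like `x`, which always have the same size and contain real numbers, and `decode` turns its output back into a variable-length string.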

What chemistry prediction problems fall within these constraints? Two recent papers predicted molecular conformers using diffusion models conditioned on molecular SMILES strings. GeoDiff predicts molecular coordinates, and Torsional Diffusion predicts molecular torsion angles. Protein structure generation also fits within this framework, as another recent paper predicts protein atom coordinates conditioned on backbone constraints.

Caption: “Jedi uses a lightsaber to slice DNA in half”, DALL·E 2

Diffusion models produce samples with the high fidelity of generative adversarial networks (GANs) while maintaining a simpler architecture and being easier to train. Though chemistry applications are just beginning to emerge, diffusion models have yielded state-of-the-art accuracy in molecular conformer generation and offer a promising approach for other high-dimensional sampling problems.