SELFIES and the future of molecular string representations


Neural sequence models have recently produced astonishing results in domains ranging from natural language to proteins and biochemistry. Current sequence models trained on text can explain jokes, answer trivia, and even write code. AlphaFold is a sequence model trained to predict protein structure with near-experimental accuracy.

In the chemistry domain, sequence models have also been used for learning problems on small, drug-like molecules. However, the most common syntax for representing small molecules as sequences is prone to syntactic errors, limiting the usefulness of neural sequence models for generating new small molecules. In response, Krenn et al. developed a new syntax for representing small molecules called SELFIES, in which every string represents a valid molecule. This makes it possible to leverage the power of neural sequence models to generate new molecules, with promising implications for drug discovery.
In this post, we review not the original SELFIES paper but a follow-up review paper from 31 researchers across 29 institutions, advocating the use of SELFIES and outlining opportunities and open problems in this research direction.
The authors present 16 different research directions, but we’ll focus on those most relevant to drug discovery.
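To make the validity guarantee concrete before diving in, here is a minimal sketch of round-tripping a molecule through SELFIES using the open-source `selfies` Python package (the API shown assumes selfies 2.x):

```python
# Round-trip between SMILES and SELFIES with the `selfies` package
# (pip install selfies).
import selfies as sf

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin

# Encode SMILES -> SELFIES
encoded = sf.encoder(smiles)
print(encoded)  # e.g. "[C][C][=Branch1][C][=O][O][C][=C][C]..."

# Decode SELFIES -> SMILES. Decoding never fails on syntax:
# every SELFIES string corresponds to some valid molecule.
decoded = sf.decoder(encoded)
print(decoded)
```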


Benchmarks for generative molecular design

One research direction the authors propose is to create new benchmarks for generative molecular design. Current benchmarks involve either distribution matching or goal-oriented molecular design. However, current methods already achieve perfect scores on these benchmarks, highlighting the need for more difficult benchmarks to evaluate current models. Harder benchmarks might involve evaluating ADMET properties, synthesizability, and protein binding affinity.
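As a rough sketch of what a goal-oriented benchmark boils down to, the snippet below scores candidate molecules with RDKit's QED drug-likeness metric as a stand-in oracle; a real benchmark would swap in ADMET predictors, synthesizability scores, or docking-based binding affinities:

```python
# Goal-oriented benchmark loop: score "generated" molecules with an
# oracle. QED (drug-likeness) stands in for harder oracles here.
from rdkit import Chem
from rdkit.Chem import QED

def score(smiles: str) -> float:
    mol = Chem.MolFromSmiles(smiles)
    return QED.qed(mol) if mol is not None else 0.0

candidates = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # toy generated set
best = max(candidates, key=score)
print(best, score(best))
```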


Smoothness of generated molecules with respect to SELFIES-based representations

Representation learning is the task of learning vector representations of data to reduce noise and make downstream learning problems easier. With good representations, you can interpolate between the vector representations of two molecules, and the molecules corresponding to the interpolated vectors should transition smoothly from one molecule into the other. Previous approaches to representation learning of molecules have trained VAEs on SMILES molecular strings, but since not all representation vectors corresponded to valid SMILES strings, it was difficult to measure the smoothness of the transitions between molecules.

Figure: Interpolating between the latent vectors of two molecules, decoding each intermediate vector into a molecule.

If a is the vector representing the first molecule and b is the vector representing the last, then the vector representing each intermediate molecule is λa + (1 − λ)b for λ running from 1 to 0. A SELFIES-based latent space that is smooth might decode these intermediate vectors into something like the molecules above, where structures smoothly transition into each other.
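A minimal sketch of this interpolation, where `encode` and `decode` are hypothetical stand-ins for a trained SELFIES VAE's encoder and decoder:

```python
# Linear interpolation between two molecules in a learned latent
# space. `encode` and `decode` are hypothetical: they stand in for a
# trained SELFIES VAE's encoder and decoder.
import numpy as np

def interpolate(a: np.ndarray, b: np.ndarray, steps: int = 8):
    """Yield latent vectors lam*a + (1 - lam)*b for lam from 1 to 0."""
    for lam in np.linspace(1.0, 0.0, steps):
        yield lam * a + (1.0 - lam) * b

# a = encode("CCO"); b = encode("c1ccccc1")              # hypothetical
# molecules = [decode(z) for z in interpolate(a, b)]     # hypothetical
# With a SELFIES decoder, every z decodes to *some* valid molecule,
# so the smoothness of the transition can actually be measured.
```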


Smoothness of molecular properties with respect to SELFIES-based representations

In representation learning, the space of all representations is called the “latent space”. Similarly to the previous point, it would be helpful to know if the latent space is organized such that there is a smooth relationship between latent representation and molecular properties.

Why does smoothness matter? If molecules and molecular properties are smooth with respect to representations, we can apply gradient-based techniques to find a vector in the latent space that maximizes certain desirable molecular properties, and then translate that vector into a molecule with those properties.
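A minimal PyTorch sketch of that optimization loop, where `property_net` is a hypothetical differentiable property predictor over the latent space and `decode` is a hypothetical SELFIES decoder:

```python
# Gradient ascent on a latent vector to maximize a molecular property.
import torch

def optimize_latent(property_net, z0: torch.Tensor,
                    steps: int = 100, lr: float = 0.05) -> torch.Tensor:
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -property_net(z)  # ascend the property = descend its negative
        loss.backward()
        opt.step()
    return z.detach()

# z_best = optimize_latent(property_net, torch.randn(64))  # hypothetical net
# molecule = decode(z_best)  # hypothetical: translate vector -> molecule
```

This only works if the property prediction is (approximately) smooth with respect to the latent space, which is exactly why the smoothness questions above matter.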


Learning what the machine has learned in the latent space

It is common to see visualizations of a neatly organized latent space, though the actual smoothness could be quite different. Understanding the proximity of different molecules in the latent space is an important step in answering questions about smoothness. It could also provide insight about the structural motifs that lead to certain drug-like properties, aiding in new hypotheses about drug-like molecules.

Figure: An example visualization of a neatly organized 2-dimensional latent space, with smooth transitions between molecules.

Applications of SELFIES

The ideas above focus on research that builds on SELFIES rather than on applications of SELFIES. However, SELFIES also provide opportunities for improved models across many applications.

Chemical reaction modeling could be aided by the use of SELFIES. In reaction prediction, a product is predicted from the reactants, agents, and conditions. In retrosynthesis, reactants are predicted from the product. In reaction property prediction, the entire reaction is given and yield, energy profile, or other properties are predicted.
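As a hedged sketch of what this might look like in practice, a reaction could be tokenized into SELFIES symbols for a sequence-to-sequence model (the ">" separator below is illustrative, not a standard; `sf.split_selfies` is part of the selfies 2.x API):

```python
# Tokenize a reaction as SELFIES for a seq2seq reaction-prediction model.
import selfies as sf

reactants = ["CC(=O)O", "OCC"]   # acetic acid + ethanol
product = "CC(=O)OCC"            # ethyl acetate

src_tokens = []
for smi in reactants:
    src_tokens += list(sf.split_selfies(sf.encoder(smi))) + [">"]
tgt_tokens = list(sf.split_selfies(sf.encoder(product)))

print(src_tokens)  # model input:  reactant tokens separated by ">"
print(tgt_tokens)  # model output: product tokens
```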

SELFIES also lend themselves to generative modeling of drugs. Given a protein active site, you could train a model to generate potential binders in SELFIES format.

Robust molecular string representations could help in all of these tasks: they eliminate the need for ad hoc proposal-and-reject loops when generating molecular strings. If you are using a neural sequence model for small molecules, you should consider using the SELFIES representation.
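To make the proposal-rejection point concrete, here is a small sketch (again assuming the selfies 2.x API): random token strings decode to a valid molecule every time, with no rejection loop.

```python
# Sample random SELFIES token strings. Every sampled string decodes
# to a syntactically valid molecule, unlike random SMILES strings.
import random
import selfies as sf

alphabet = list(sf.get_semantic_robust_alphabet())

for _ in range(3):
    random_selfies = "".join(random.choices(alphabet, k=12))
    print(sf.decoder(random_selfies))  # always a valid SMILES string
```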