TEAM ZONTAL January 4, 2024 No Comments

From Imbalance to Harmony: Empowering Models with Synthetic Data

Who doesn’t care about larger and more balanced training data? While over and under-sampling are established methods to deal with the imbalance issue, Ye-Bin et al. propose an alternative way to overcome imbalances through generative models in a fresh arxiv paper ( In the realm of visual tasks, deep neural networks (DNNs) have exhibited impressive […]

Team ZONTAL August 15, 2023 No Comments

Bayesian Flow Networks: A Paradigm Shift in Generative Modeling

Generative modeling has undergone a transformative journey in recent times, thanks to the emergence of powerful neural networks. These networks possess an unprecedented ability to capture intricate relationships among diverse variables, revolutionizing our capacity to create coherent models for high-resolution images and complex data. This shift is attributed to the art of decomposing joint distributions […]

Beyond Traditional AI: Embracing Multimodal Challenges with Meta-Transformer

Introduction In the realm of pharmaceutical research, the quest to unlock groundbreaking insights often requires navigating through a vast sea of diverse data modalities. Imagine a powerful tool that can seamlessly process and integrate information from text, images, audio, 3D point clouds, video, graphs, and more, transcending the limitations of conventional AI approaches. Enter “Meta-Transformer: […]

Team ZONTAL September 1, 2022 No Comments

DALL·E 2, Imagen, and Applications to Chemistry

In the past two months, DALL·E 2 has taken over the internet. From Bart Simpson edited into Egyptian art to Donald Trump as the Lorax, text-to-image AI produces amazing results. Caption: “Panda weaving a basket made of cyclohexane”, DALL·E 2 Are these an impressive-but-gimmicky party trick? Or can these innovations be harnessed for applications in scientific domains? Many […]

Team ZONTAL August 23, 2022 No Comments

Making Chemistry Knowledge Machine-Actionable

The history of chemistry has been epitomized by individual chemists coming up with hypotheses, running experiments at lab-scale, and producing discoveries. But in 2022, chemistry data is generated at a scale previously unseen, computers can rapidly process that data, and the data can be widely distributed at relatively minimal cost. This new frontier of global-scale […]

Team ZONTAL August 12, 2022 No Comments

Transformer Retrosynthesis

In drug discovery, there are two main approaches to hit finding: 1) virtual screening of existing small molecule libraries and 2) generative design of new molecules. Generative molecule design can result in better binders, but it may be unknown how to synthesize them. The task of retrosynthesis – designing a synthesis pathway for a molecule […]

Coarse-grained Molecular Dynamics with Geometric Machine Learning

We live an a world where chemistry computation is increasingly competitive with experimentation. AlphaFold predicts protein structure with accuracy sufficient for many applications. In the limit scenario, computational chemists envision biochemistry simulations on a scale that allows them to trace exact mechanisms of disease. A recent pre-print achieves molecular simulation with nanosecond time steps, which is 1000 […]

SELFIES and the future of molecular string representations

Neural sequence models have recently produced astonishing results in domains ranging from natural language to proteins and biochemistry. Current sequence models trained on text can explain jokes, answer trivia, and even write code. AlphaFold is a sequence model trained to predict protein structure with near-experimental accuracy. In the chemistry domain, sequence models have also been used for learning problems on […]

Neural sequence models have recently produced astonishing results in domains ranging from natural language to proteins and biochemistry. Current sequence models trained on text can explain jokesanswer trivia, and even write code. AlphaFold is a sequence model trained to predict protein structure with near-experimental accuracy.

In the chemistry domain, sequence models have also been used for learning problems on small, drug-like molecules. However, the most common syntax for representing small molecules as sequences runs into syntactic errors, limiting the usefulness of neural sequence models for generating new small molecules. In response, Krenn et. al. developed a new syntax for representing small molecules called SELFIES, for which all strings represent valid molecules. This makes it possible to leverage the power of neural sequence models to generate new molecules, with promising implications for drug discovery.
In this post, we review not the original SELFIES paper, but a review paper from 31 researchers across 29 institutions, advocating the use of SELFIES and outlining opportunities and open problems related to this research direction.
The authors present 16 different research directions, but we’ll focus on those most relevant to drug discovery.


Benchmarks for generative molecular design

One research direction the authors propose is to create new benchmarks for generative molecular design. Benchmarks for generative molecular design currently involve distribution-matching or goal-oriented molecular design. However, current methods achieve perfect scores on these benchmarks, presenting the need for more difficult benchmarks to evaluate current models. These benchmarks might involve evaluation of ADMET properties, synthesizability, and protein binding affinity.


Smoothness of generated molecules with respect to SELFIES-based representations

Representation learning is the task of learning vector representations of data to reduce noise and make downstream learning problems easier. With good representations, you can interpolate between the vector representations of two molecules, and the molecules corresponding to the interpolated vectors should transition smoothly from one molecule into the other. Previous approaches to representation learning of molecules have trained VAEs on SMILES molecular strings, but since not all representation vectors corresponded to valid SMILES strings, it was difficult to measure the smoothness of the transitions between molecules.

Captura de pantalla 2022-06-13 141209

If a is the vector representing the first molecule, b is the vector representing the last, and the vector representing each intermediate molecule is a + (1-λ)b , SELFIES strings that are smooth with respect to the latent space might correspond to something like the molecules above, where structures smoothly transition into each other.


Smoothness of molecular properties with respect to SELFIES-based representations

In representation learning, the space of all representations is called the “latent space”. Similarly to the previous point, it would be helpful to know if the latent space is organized such that there is a smooth relationship between latent representation and molecular properties.

Why does smoothness matter? If molecules and molecular properties are smooth with respect to representations, we can apply gradient-based techniques to find a vector in the latent space that maximizes certain desirable molecular properties, and then translate that vector into a molecule with those properties.


Learning what the machine has learned in the latent space

It is common to see visualizations of a neatly organized latent space, though the actual smoothness could be quite different. Understanding the proximity of different molecules in the latent space is an important step in answering questions about smoothness. It could also provide insight about the structural motifs that lead to certain drug-like properties, aiding in new hypotheses about drug-like molecules.

Captura de pantalla 2022-06-13 141519An example visualization of a neatly organized 2-dimensional latent space, with smooth transitions between molecules.

Applications of SELFIES

These are all ideas that are focused on research that builds on SELFIES, not necessarily including applications of SELFIES. However, SELFIES provide opportunities for improved models for many applications.

Chemical reaction modeling could be aided by the use of SELFIES. In reaction prediction, a product is predicted from the reactants, agents, and conditions. In retrosynthesis, reactants are predicted from the product. In reaction property prediction, the entire reaction is given and yield, energy profile, or other properties are predicted.

SELFIES also lend themselves to generative modeling of drugs. Given a protein active site, you could train a model to generate potential binders in SELFIES format.

Robust molecular string representations could help in all these tasks. They eliminate the problem of hack-ish proposal-rejection methods for generating molecular strings. If you are using a neural sequence model for small molecules, you should consider using the SELFIES representation.