Fragment Ligand Generation

Extremely data-efficient ligand generation

What is a sufficient number of data points to train a deep learning algorithm? 1,000? 1 million? 1 billion?

Of course, it depends on the problem. But it also depends on the neural network architecture and training algorithm chosen to solve the problem.

Powers et al. recently published a preprint describing a ligand optimization scheme that generates drug-like molecules with high accuracy while training on only 4,000 protein-ligand examples.


How do they do it? Inductive bias. Inductive bias is real-world knowledge that is built into the neural network, making the learning problem simpler. Ideally, the inductive bias of the architecture reduces the space of learnable information to only things that humans do not know about the task. Good inductive bias reduces the difficulty of the learning problem, reduces the number of data points necessary for training, and increases the generality of trained models.

The problem of designing a ligand that binds to a protein does not seem difficult in principle. An expert could probably come up with some good heuristics: placing non-polar parts of the ligand against non-polar regions of the protein surface, placing complementary charged groups against polar regions of the protein surface, and choosing a geometry that maximizes contact. One could imagine building up a ligand from fragments such that it satisfies these properties.

Caption: The fragment optimization process. Above: The protein and starting fragment. Below: The fragment generation process.


The authors built an algorithm around much of that intuition. The starting problem was fragment-based ligand optimization: crafting a ligand with strong binding affinity from a starting fragment. They began by reducing this problem from a ligand generation problem to a fragment scoring problem. Rather than predicting an entire ligand at once, they recognized that ligand generation can be framed as the repeated addition of fragments to a starting molecule. This allowed their network to learn a much simpler task and to share neural network parameters across all fragment scoring steps. It also allowed them to augment their data by extracting 100,000 fragment addition steps from the original dataset of 4,000 protein-ligand pairs.
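To make the data augmentation concrete, here is a minimal sketch of how one ligand could be split into several training examples. It assumes a ligand is already available as an ordered list of fragments; the function name `fragment_addition_steps` and the fragment names are hypothetical and are not taken from the preprint.

```python
from typing import List, Tuple


def fragment_addition_steps(fragments: List[str]) -> List[Tuple[Tuple[str, ...], str]]:
    """Split one ligand, given as an ordered list of fragments, into training steps.

    Each step pairs the partial ligand built so far with the fragment added next.
    The first fragment is treated as the starting fragment, so it yields no step.
    """
    steps = []
    for i in range(1, len(fragments)):
        partial = tuple(fragments[:i])
        steps.append((partial, fragments[i]))
    return steps


# Hypothetical ligand decomposed into four fragments (illustrative names only).
ligand = ["benzene", "amide", "methyl", "hydroxyl"]
steps = fragment_addition_steps(ligand)
for partial, added in steps:
    print(partial, "->", added)

# Figures quoted in the post: 100,000 steps extracted from 4,000 pairs.
steps_per_pair = 100_000 / 4_000
print(steps_per_pair)  # 25.0
```

With the figures quoted above, each protein-ligand pair contributes about 25 fragment addition steps on average, so the model sees far more training examples than there are ligands.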

The fragment scoring model works in two steps. In step 1, the model scores the available fragment attachment locations, which are the ligand hydrogen atoms. In step 2, the model scores possible fragment-geometry pairs for the probability of binding at the chosen location. The same “Embedding Model” architecture is used for both steps; the two steps differ only in the final projection layers, which output a location score in step 1 and a fragment score in step 2.
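The two-step structure can be sketched as a shared embedding followed by two different scoring heads. This is a toy illustration only: `embed`, the head weights, and the feature vectors are all made-up stand-ins, not the authors' architecture or learned parameters.

```python
import math
from typing import Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed(features: Sequence[float]) -> List[float]:
    """Stand-in for the shared embedding model (here just a fixed nonlinearity)."""
    return [math.tanh(x) for x in features]


def pick_best(head: Sequence[float], candidates: Dict[str, Sequence[float]]) -> str:
    """Score every candidate with the shared embedding plus one head; return the best."""
    names = list(candidates)
    raw = [sum(w * e for w, e in zip(head, embed(candidates[n]))) for n in names]
    probs = softmax(raw)
    return names[max(range(len(names)), key=probs.__getitem__)]


# Hypothetical head weights; in a real model these would be learned.
LOCATION_HEAD = [1.0, -0.5]
FRAGMENT_HEAD = [0.3, 0.8]

# Step 1: score attachment locations (ligand hydrogens). Features are made up.
hydrogens = {"H1": [0.2, 0.9], "H2": [1.5, -0.3]}
location = pick_best(LOCATION_HEAD, hydrogens)

# Step 2: score fragment-geometry pairs for the chosen location. Features are made up.
fragment_geometries = {"methyl@0deg": [0.1, 0.2], "hydroxyl@120deg": [0.4, 1.2]}
fragment = pick_best(FRAGMENT_HEAD, fragment_geometries)

print(location, fragment)
```

Both steps call the same `embed` function and differ only in which head is applied, which is the parameter sharing described above.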


Caption: Left: Step 1, location scoring. Right: Step 2, fragment scoring.

In addition, the embedding model is invariant to orientation and position. This means that regardless of the starting orientation or location of the input atom coordinates, the same output is produced. This is another form of inductive bias, recognizing that the model should not predict different fragments if the inputs are presented in a different orientation.
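One simple way to see what this invariance means is to compute features that depend only on interatomic distances and check that they are unchanged after rotating and shifting the coordinates. The sketch below uses sorted pairwise distances purely as an illustration; this is an assumption for the example and is not necessarily how the authors' embedding model achieves invariance. The coordinates are made up.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Sorted list of all interatomic distances: unchanged by rotation or translation."""
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))


def rotate_z_and_translate(coords, angle, shift):
    """Rotate points about the z axis by `angle` radians, then add `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# Made-up atom coordinates, for illustration only.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.2, 0.3), (-0.7, 0.4, 1.1)]
moved = rotate_z_and_translate(atoms, angle=0.8, shift=(3.0, -2.0, 5.0))

original = pairwise_distances(atoms)
transformed = pairwise_distances(moved)

# The raw coordinates differ, but the distance-based features agree.
invariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, transformed))
print(invariant)
```

A model that only ever sees features of this kind cannot produce different outputs for the same molecule in two different poses, so it never has to learn that fact from data.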

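To summarize, the authors used multiple inductive biases that allowed their model to achieve high accuracy using a small amount of training data: reframing ligand generation as repeated fragment scoring with shared parameters, and making the embedding model invariant to orientation and position.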


They compared the characteristics of molecules generated by their model with those of molecules generated by a physics-based fragment optimization algorithm. Across 12 metrics, they found that the molecules generated by their model resembled the molecules in the test set much more closely than the molecules generated by the physics-based algorithm did.



It is clear that inductive bias is necessary and useful, especially in chemistry, where labeled data is extremely costly. The more researchers employ inductive bias to simplify the learning problem, the more chemistry problems will come within reach of deep learning.