AI Enhanced Molecular Discovery and Optimization

We’re on the verge of a new scientific method, one that utilizes the power of artificial intelligence to accelerate the development of scientific discoveries.

“The amount of data being generated by experiments and simulations has given rise to the fourth paradigm of science over the last few years, which is data driven science, and it unifies the first three paradigms of theory, experiment, and computation/simulation.” — Ankit Agrawal of Northwestern University

A visual representation of science (colorized, circa. 2003)

Science can often be like a stick in the mud, and the first two paradigms of theoretical and experimental science proved to be slow for two reasons:

  1. Lack of empirical evidence in theoretical knowledge (1st paradigm)
  2. The abundance of human bias in experiments (2nd paradigm)

Computers, however, changed the game for science, and we began to fully appreciate the accuracy and speed these incredible devices bring. The 3rd paradigm of computation and simulation was one of great promise, but it was limited by the technology of its time. This leads us to the 4th paradigm.

The 4th paradigm took advantage of a digital byproduct from the computation paradigm: data. Data science, while not an exact science, can produce models with performance and accuracy so high that they can often identify new methods, equations, and ideas that we have yet to discover with science.

This is how and why a computer can:

  1. Beat the world’s best Chess, Go, and DOTA players
  2. Discover new drugs, materials, and molecules
  3. Translate languages, drive cars, trade stocks

and the list grows every day. Computers can do these things better than we can, and do some things we can’t, with just data and a model/architecture. The only equation, intuition, or knowledge a model needs is generated by observing patterns in data.

Patterns are very prevalent in chemistry and material science. From the crystalline structures of diamonds to the branched chains of lipids, these patterns define various properties of molecules. It’s up to the A.I to recognize and learn from these patterns. Although there are setbacks like not enough data for training, A.I fits the problem of molecular discovery very nicely. Today, researchers at the bleeding edge of science are beginning to turn to A.I for more efficient, more accurate, and more diversified results.

AI can do it better (Photo by Franki Chamaki)

Why should I care?

In mere decades, the demand for stronger building materials, longer lasting batteries, and more personalized drugs will increase. Our ability to supply these demands depends on how fast researchers can supply the right materials to the companies that are trying to solve the world’s biggest problems. If a chain is only as strong as its weakest link, then:

“The rate of innovation and progress is only as fast as the rate at which we discover and optimize materials”

History has been defined by the fundamental materials that were predominantly used in each time period. The stone, bronze, and iron ages were the predecessors to the silicon age we live in today. Tomorrow we may find ourselves in the carbon age. Collectively, we can’t wait for these new scientific eras, but it still takes decades for a new molecule to reach the masses. AI is how we will make up the difference.

Every year, the number of research papers using the latest and greatest machine learning tools increases. From reinforcement learning enhanced recurrent neural networks to generative adversarial autoencoders, and other A.I shenanigans, it’s going to be a decade of wild times.

If this is your first introduction to A.I, awesome! Hopefully, this will give you a picture of where A.I will go in the near future. If you’re a chemist or physicist or biologist, excellent! One of the things everyone in this niche agrees on is the need for awareness of the possibilities. If you’re a developer or machine learner, hurrah! It will be you that shapes tomorrow’s world.

Awesome! But how do we get there?

Researchers at the forefront of their fields have been trying to use the existing tools we have on hand to solve this problem. There is a pattern in the modus operandi of current research, and the same general process applies to any A.I based science project.

Researchers are carpenters; these are the materials, tools, and instructions.

Crafting the perfect molecule (Photo by Barn Images)

First, we’ll need some high quality wood…

We need data. More importantly, we need clean, labeled, and plentiful data. The data will have to be represented in a format that can be interpreted by a computer. Thankfully, we have just the thing: simplified molecular-input line-entry system (SMILES) strings.

SMILES strings are one of 4 ways molecules can be represented for a computer. Listed in order from lowest level to highest (and therefore from the simplest to the most complex):

  1. Molecular Fingerprints (Numerical sequence of a specified length)
  2. String representations (Notation to describe structure and components)
  3. Molecular Graphs (Same concept as trees and graphs in data structures)
  4. Simulation (Next level stuff. Enough said)
Pick your poison: molecules like drugs reap the benefits of being represented by all 4 formats

The large majority of research projects use SMILES string representations, although there are arguably more applicable formats like SMARTS. This is because SMILES strikes a balance between accuracy (in terms of representing the nuances of given molecules) and complexity (in terms of how easy it is for a computer to interpret the data). The strings themselves look a bit like chemical gibberish: ethanol is CCO, aspirin is CC(=O)OC1=CC=CC=C1C(=O)O, and caffeine is CN1C=NC2=C1C(=O)N(C(=O)N2C)C.

They look a bit like the chemical formulas taught in a typical high school science class. This is because each SMILES string is generated by a depth first tree traversal of a given molecule’s molecular graph. So in a way, a SMILES string is the formula of a molecule, spelled out with more detail.
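To make the depth first traversal idea concrete, here is a minimal sketch that writes a SMILES-like string for a tiny molecule. It assumes the molecule has no rings and only single bonds, and it leaves hydrogens implicit; the function name tree_to_smiles and the atom/bond list format are made up for this illustration, and real toolkits handle far more (rings, bond orders, aromaticity, charges).

```python
def tree_to_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for an acyclic, single-bonded molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs between heavy atoms
    start: index of the atom where the traversal begins
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def visit(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        text = atoms[atom]
        # Every child except the last becomes a parenthesised branch;
        # the last child continues the main chain.
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)


ethanol_atoms, ethanol_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
isobutane_atoms, isobutane_bonds = ["C", "C", "C", "C"], [(0, 1), (1, 2), (1, 3)]

print(tree_to_smiles(ethanol_atoms, ethanol_bonds))           # CCO
print(tree_to_smiles(ethanol_atoms, ethanol_bonds, start=2))  # OCC
print(tree_to_smiles(isobutane_atoms, isobutane_bonds))       # CC(C)C
```

Starting the traversal from a different atom gives a different string for the same molecule, which is why one molecule can have several valid SMILES strings.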

SMILES string data is best used when it has been cleaned. This includes normalizing, canonicalizing (there can be multiple SMILES strings for the same molecule), and removing duplicates. You can think of it like removing splinters from the wood. That way it’s easier and safer to work with.
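A cleaning pass along these lines might look like the sketch below. The function clean_smiles is hypothetical, and the canonicalizer is passed in by the caller: in practice it would wrap a cheminformatics toolkit, while here a hand-written lookup that only knows ethanol stands in for it.

```python
def clean_smiles(raw_strings, canonicalize=None):
    """Strip whitespace, drop empty entries, canonicalize, and de-duplicate.

    The first-seen order of the surviving strings is preserved.
    """
    if canonicalize is None:
        canonicalize = lambda smiles: smiles  # no-op when no toolkit is supplied
    seen = set()
    cleaned = []
    for raw in raw_strings:
        smiles = raw.strip()
        if not smiles:
            continue
        smiles = canonicalize(smiles)
        if smiles not in seen:
            seen.add(smiles)
            cleaned.append(smiles)
    return cleaned


# Toy stand-in for a real canonicalizer: three spellings of ethanol map to one.
TOY_CANONICAL = {"OCC": "CCO", "C(O)C": "CCO"}

def toy_canonicalize(smiles):
    return TOY_CANONICAL.get(smiles, smiles)


raw = [" CCO", "OCC", "", "C(O)C\n", "CC(C)C", "CCO"]
print(clean_smiles(raw, toy_canonicalize))  # ['CCO', 'CC(C)C']
```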

Of course, a computer doesn’t think alphabetically; it thinks numerically. We must, therefore, turn all these letters, symbols and characters into either an integer representation, binary fingerprint, one-hot encoding, or any other numerical format that a computer can digest.
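As a rough illustration, the sketch below turns SMILES strings into integer sequences and one-hot vectors. It tokenizes one character at a time, which is a simplification: two-letter elements such as Cl or Br would be split into two tokens. The helper names are invented for this example.

```python
def build_vocab(smiles_list):
    """Map every character seen in the data to an integer index."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: idx for idx, ch in enumerate(chars)}

def to_integers(smiles, vocab):
    """Integer representation: one index per character."""
    return [vocab[ch] for ch in smiles]

def to_one_hot(smiles, vocab):
    """One-hot representation: one row per character, a single 1 per row."""
    rows = []
    for idx in to_integers(smiles, vocab):
        row = [0] * len(vocab)
        row[idx] = 1
        rows.append(row)
    return rows


vocab = build_vocab(["CCO", "CC(C)C"])
print(vocab)                     # {'(': 0, ')': 1, 'C': 2, 'O': 3}
print(to_integers("CCO", vocab)) # [2, 2, 3]
print(to_one_hot("CCO", vocab))  # [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```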

Now we’ll need a hammer and some nails…

We’re going to need tools to work our data. In machine learning, these tools are the various algorithms that each serve to predict, classify, generate, regress, etc. Every tool serves a different purpose, so it’s important to choose the right tool for the right problem. In the domain of science, the application of machine learning in any project can be categorized as either a:

Forward Model: For Property Prediction


Inverse Model: For Molecular Discovery/Optimization

The type of model varies, but the inputs and outputs remain the same. In a forward model, the molecule is the input and the property is the output. In an inverse model, this is turned on its head; the property is the input, and the molecule is the output.
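The difference between the two directions can be sketched with type signatures. Everything below is a toy: the "property" is a made-up scoring rule, not real chemistry, and the inverse model is built by screening a fixed library with the forward model, which is a simple stand-in for the generative models discussed later rather than the same technique.

```python
from typing import Callable, Sequence

ForwardModel = Callable[[str], float]  # molecule (SMILES) -> property
InverseModel = Callable[[float], str]  # target property -> molecule (SMILES)

def toy_forward_model(smiles: str) -> float:
    """Made-up property: count of 'C' characters plus twice the count of 'O'."""
    return smiles.count("C") + 2.0 * smiles.count("O")

def make_screening_inverse(forward: ForwardModel, library: Sequence[str]) -> InverseModel:
    """Build an inverse model that returns the library molecule whose
    predicted property is closest to the requested target."""
    def inverse(target: float) -> str:
        return min(library, key=lambda smiles: abs(forward(smiles) - target))
    return inverse


library = ["C", "CC", "CCO", "CC(C)C", "OCCO"]
inverse_model = make_screening_inverse(toy_forward_model, library)

print(toy_forward_model("CCO"))  # 4.0
print(inverse_model(6.0))        # OCCO
```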

There are relatively simple types of machine learning models like linear regressions and K-nearest neighbors, and there are more complex algorithms like decision trees and forests. Sometimes these tools suffice, but where there is enough data, a more powerful model could be considered: neural networks.
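For a feel of how little machinery the simpler tools need, here is a sketch of K-nearest neighbors property prediction. It assumes each molecule is already a fingerprint, stored as the set of its "on" bit positions, and it uses Tanimoto similarity (shared bits divided by total distinct bits) to decide which neighbors are nearest. The fingerprints and property values are invented for the example.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def knn_predict(query, training, k=3):
    """Average the property of the k training molecules most similar to query.

    training: list of (fingerprint, property_value) pairs
    """
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)


training = [
    ({0, 1, 2}, 1.0),
    ({0, 1, 3}, 2.0),
    ({4, 5, 6}, 10.0),  # shares no bits with the query below
    ({0, 2, 3}, 3.0),
]
print(knn_predict({0, 1, 2, 3}, training, k=3))  # 2.0
```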

Neural Networks are the Swiss-army-knife of machine learning algorithms. Neural networks can classify, predict, generate, reduce dimensionalities, and more, thanks to the magic of matrix multiplication and a super special learning algorithm called “back-propagation”.

Neural networks themselves come in different architectures, each optimized for a different purpose. Recurrent neural networks (RNNs) are well suited to sequential data like text or, in the case of molecules, SMILES string representations. RNNs are versatile, and can be used to predict properties (input a molecule’s SMILES string and output a property) or to generate molecules (input a property and output a molecule’s SMILES string).
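What makes an RNN suited to strings is that it reads one character at a time while carrying a hidden state forward. The sketch below runs the forward pass of a plain ("vanilla") RNN cell in pure Python. The weights are hand-picked and untrained, and the two-symbol vocabulary is a toy, so the numbers mean nothing chemically; the point is only that the final state depends on the order of the characters.

```python
import math

def rnn_final_state(one_hot_sequence, w_xh, w_hh, b_h):
    """Run a vanilla RNN over a sequence and return the last hidden state.

    w_xh: hidden_size x vocab_size input weights
    w_hh: hidden_size x hidden_size recurrent weights
    b_h:  hidden_size biases
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size  # the network's memory starts empty
    for x in one_hot_sequence:
        # New state mixes the current character with the previous state.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_xh[j], x))
                + sum(w * hi for w, hi in zip(w_hh[j], h))
                + b_h[j]
            )
            for j in range(hidden_size)
        ]
    return h


# Toy vocabulary: C -> [1, 0], O -> [0, 1]
C, O = [1, 0], [0, 1]
w_xh = [[1.0, -1.0]]  # hand-picked, untrained weights
w_hh = [[0.5]]
b_h = [0.0]

state_co = rnn_final_state([C, O], w_xh, w_hh, b_h)
state_oc = rnn_final_state([O, C], w_xh, w_hh, b_h)
print(state_co, state_oc)  # same characters, different order, different state
```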

We’re going to need the instructions to build this thing…

We now have both the materials and tools we need to build something new! We just need to know a couple more details. These details are known as hyperparameters, such as the number of neurons, the number of layers, the learning rate, etc. Hyperparameters tell us how we use our tools (machine learning algorithms) to work our wood (data). Most of the time, the instructions are up to the carpenter and their intuition. It’s your job as the architect to decide on the hyperparameter values; the model’s weights, by contrast, are learned during training.

But when you do, choose wisely. (courtesy of Daniel Shapiro)

These parameters have significant sway over the results of the model, so it’s up to the machine learner’s intuition and experience to pick a good starting point. Oftentimes the difference between success and failure lies in the change of just a single hyperparameter. Optimization is all about changing these hyperparameters until the resulting output of the model is as accurate as it can be.
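One common way to organize this trial and error is a grid search: try every combination from a small set of candidate values and keep the best one. The sketch below does this for a deliberately tiny problem, fitting y = w * x to three made-up points with gradient descent. For brevity it scores each setting on the training data itself; a real project would score on held-out validation data.

```python
from itertools import product

XS = [1.0, 2.0, 3.0]
YS = [2.0, 4.0, 6.0]  # toy data generated by y = 2x

def loss(w):
    """Mean squared error of the model y = w * x on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(XS, YS)) / len(XS)

def train(learning_rate, steps):
    """Plain gradient descent on w, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(XS, YS)) / len(XS)
        w -= learning_rate * grad
    return w

# Candidate values for each hyperparameter
grid = {"learning_rate": [0.001, 0.01, 0.1, 0.3], "steps": [5, 20]}

results = {}
for lr, steps in product(grid["learning_rate"], grid["steps"]):
    results[(lr, steps)] = loss(train(lr, steps))

best = min(results, key=results.get)
print(best)  # (0.1, 20)
```

Here a learning rate of 0.3 overshoots and diverges while 0.001 barely moves, which is the "single hyperparameter" effect in miniature.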

We’ll have to choose our loss function (the way we determine how well or how badly our model is doing). There are also plenty of activation functions to choose from (these transform the values passed between the layers of the neural network and give it the non-linearity it needs to learn complex patterns). Developing an intuition for how to choose all these hyperparameters requires an understanding of how and why each choice works, but sometimes you just gotta guess and check!
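For reference, here are minimal versions of one common loss function (mean squared error) and two common activation functions (ReLU and sigmoid). These are standard textbook definitions written in plain Python, not the specific choices of any paper mentioned in this article.

```python
import math

def mean_squared_error(predicted, actual):
    """Loss: average squared gap between predictions and true values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def relu(x):
    """Activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Activation: squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


print(mean_squared_error([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))  # 1.3333333333333333
print(relu(-2.0), relu(3.0))                                 # 0.0 3.0
print(sigmoid(0.0))                                          # 0.5
```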

I could really use a saw and some polisher…

A carpenter’s job isn’t just to build the thing; it is also their job to make it presentable. The final step is therefore to return the output of the model in a comprehensible format.

In a forward model, this means presenting the properties of a given material most accurately and with the appropriate units of measurement.

In an inverse model, this means presenting the generated molecule in correct SMILES string notation.
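Generative models often emit strings that are not well-formed SMILES, so a cheap first filter helps. The sketch below is only a syntactic screen under stated assumptions: it checks that parentheses balance and that single-digit ring-closure labels come in pairs, it skips anything inside square brackets, and it ignores two-digit ring closures written with %. Passing it does not mean the molecule is chemically valid; that check belongs to a cheminformatics toolkit. The function name is invented for this example.

```python
def looks_like_valid_smiles(smiles):
    """Cheap syntactic screen for a generated SMILES string."""
    if not smiles:
        return False
    depth = 0            # current nesting level of branch parentheses
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # True while inside a [...] atom
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue     # digits in brackets are isotopes, H counts, or charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a branch closed before it was opened
        elif ch.isdigit():
            open_rings ^= {ch}  # first sighting opens the ring, second closes it
    return depth == 0 and not open_rings and not in_bracket


for candidate in ["c1ccccc1", "CC(C)C", "CC(C", "C1CC", "[13CH4]"]:
    print(candidate, looks_like_valid_smiles(candidate))
```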

There are ways we can improve the finish of our product. When looking for potential drug candidates in particular, this is the most dangerous stage. There is no way to know if a molecule is stable without actually testing it, which is why confidence estimates and error rates are so important in science. This is where many researchers add another component to their original model. It’s very rare to find a one-shot solution and there is no silver bullet. Good solutions for molecular discovery combine multiple machine learning algorithms, an approach also known as ensembling.
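The simplest form of ensembling is to ask several models the same question, average their answers, and treat their disagreement as a rough error bar. The sketch below shows only that averaging idea; it is not how the architectures described next combine their components. The three "models" are made-up stand-ins that disagree by a fixed amount.

```python
from statistics import mean, stdev

def ensemble_predict(models, molecule):
    """Return the average prediction and the spread across the models.

    A large spread signals that the models disagree, so the estimate
    deserves less trust before any lab testing.
    """
    predictions = [model(molecule) for model in models]
    return mean(predictions), stdev(predictions)


# Stand-in models: each maps a SMILES string to a made-up property value.
toy_models = [
    lambda smiles: float(len(smiles)),
    lambda smiles: float(len(smiles) + 1),
    lambda smiles: float(len(smiles) - 1),
]

estimate, spread = ensemble_predict(toy_models, "CCO")
print(estimate, spread)  # 3.0 1.0
```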

The ReLeaSE architecture (Mariya Popova et al.) combines RNNs with reinforcement learning techniques. The RNN is composed of 2 distinct networks that work to generate valid molecules. The reinforcement training then biases these results towards desired properties.

The ECAAE architecture (Daniil Polykovskiy et al.) first separates the latent distribution code of the autoencoder from properties before modifying the latent code to match a prior distribution code. This is trained with an adversarial network until the discriminator can no longer distinguish the latent from the prior.

The ORGANIC architecture (Benjamin Sanchez-Lengeling et al.) uses generative adversarial networks (GANs) with reinforcement learning techniques. Similar to the ReLeaSE architecture, the GAN generates valid molecules before reinforcement learning (named “objective reinforcement”) shifts the output towards desired properties.

That’s about it


This generalized process comes not from intuition, but rather from the patterns developed in the numerous papers, projects, and case studies generously provided by 2 factions: academia and industry.

University Research

The aforementioned architectures like ReLeaSE, ECAAE, and ORGANIC are all state-of-the-art examples of supervised deep learning with a twist. The incredible institutions behind these innovations are some of the world’s top universities.

Harvard University

Papers like “What is high throughput virtual screening…”, and the aforementioned ORGANIC architecture come from the top ranked university in the world. Harvard’s Clean Energy Project is an example of research pushed to the bleeding edge of A.I. Contributors include people from fields such as chemistry, A.I, data science, and numerous others. This sort of collaboration is necessary if we are to continue growing the applications in these fields.

University of Cambridge

The simply titled “Machine learning for material science” is an in-depth paper that covers all the recent innovations in the space. Cambridge is also home to very specific applications, like the probabilistic design of alloys using neural networks. With companies like DeepMind stationed in the UK, it’s no surprise that Cambridge continues to put out quality content.

Northwestern University

The holistic idea of data driven science is a highlight of the work that Northwestern University has put out in recent times. Research ranging from high throughput-DFT for Molecular Discovery to the Prediction of the High-Dimensional Thermal History in Directed Energy Deposition Process via recurrent neural networks originates from Northwestern University.

Startups and Companies

My mentor, Navid Nathoo, gave me a piece of grounded advice:

“A problem becomes an opportunity when people are willing to pay for it to be solved; there must be an economic incentive.”

Without the money, everything up to this point is a fun science project that sounds cool but is of no interest from a business standpoint. That being said, here are some companies, big and small (the size of which should be telling of how much economic incentive there is), that are looking to shake things up.


One of the leaders of the industry specifically in cheminformatics

Citrine Informatics

This incredible company working out of San Francisco has made great strides in the molecular discovery and optimization research fields. I would place particular emphasis on their methods of operation. Citrine understands that the scientific community isn’t as privileged as other fields in terms of the size, quality, and consistency of data. We may have enormous datasets of images, text, and audio, but you’d be hard pressed to find a solid dataset of carbon molecules, let alone one that is cleaned or labeled.

Citrine cuts through the “small data” problem by taking advantage of as many techniques as possible. Techniques like data augmentation, transfer learning, and stacked architectures squeeze every ounce of value from existing datasets.

IBM Research

Little is known about the elusive research centers of Fortune 500 companies like Microsoft, Facebook, Google, and especially IBM (who still remembers what IBM stands for?). After losing the bid for computing and missing out on mobile, IBM has shifted its focus to what is to come instead of what’s happening now. IBM still fights to be relevant today, not as the computer company we once knew, but rather as a quantum computing, A.I researching, and technology innovating company that’s looking to make a comeback in the near future.

Recently, IBM released a free tool that predicts chemical reactions. As with the majority of such projects, SMILES strings are the chosen molecular representation. With 2 million datapoints on chemical reactions, the A.I managed to get considerably accurate results.

Google Research

Unsurprisingly, the world’s most influential company happens to also have its hand in the cookie jar. Google’s A.I research has a special team called Google Accelerated Science who are working on computational chemistry and biology, with goals to advance scientific research and accelerate scientific innovation. They’ve collaborated with DeepMind on several occasions, putting out mind-blowing work.

Rumor has it that their recent work involves #3 of the 4 possible molecular representations: molecular graphs. This is a natural result of their research, as Geometric Deep Learning is beginning to gain ground and its benefits become clearer. Google consistently puts out its research publications and sometimes the relevant code. Keep an eye out for news; if anyone is able to pull off the next big thing, it’s Google.

Key Takeaways

  • We are now entering the 4th paradigm of science, one that is data driven instead of theory, experiment, or computation
  • AI is the determining factor in how this change will impact science and what it means for society
  • Current research follows a process and is limited to the currently available tools in AI
  • There are many real world problems being solved by both academia and industry currently, and it’s only growing
The future is bright (courtesy of Science magazine)

What’s to come

Chamath Palihapitiya believes that while Google may be the master of search data, Facebook may be the master of communication data, and Amazon may be the master of consumerism data, there has yet to be a clear master of health care data, molecular data, and plenty of other growing fields.

Search, communication, and consumerist data is flashy and superficially important, but there aren’t enough people working on the world’s toughest problems.

Artificial intelligence can change that.

You can change that.

Original. Reposted with permission.

Opinions expressed by AI Time Journal contributors are their own.

About Flawnson Tong

Contributor Machine Learning Developer | Computational Chemistry Researcher
