- The journey of getting into data science and AI
- The most relevant breakthroughs in data science and AI
- The growth of JADBio, the JADBio platform, and advances in drug discovery
- How machine learning has helped tackle Covid-19
- A learning pathway for aspiring data scientists
#Getting Started with Data Science & AI
At what point did you realise that you wanted to pursue a career in data science (data & AI), and how did you get into it?
I was drawn to AI from a very early age. Practically as soon as I became serious about computer science in my teenage years, my mind would drift to how we build intelligent systems. I read the books “Gödel, Escher, Bach” and “The Mind’s I”, and that was a turning point for me: I would become an AI researcher. When I was ready for graduate studies, I was drawn to machine learning. The term “data science” had not been invented back then, and machine learning was quite different from other sciences that deal with data, like statistics and pattern recognition. But I was assigned to a supervisor whose expertise was in AI Planning.
AI Planning deals with deciding what actions a system, like a robot or a softbot, should take to achieve certain goals. I did find the subject interesting and stuck with it. I had an internship at NASA, and that inspired the topics and problems I tackled in my master’s and Ph.D. theses. But, at the back of my head, machine learning was my secret passion. When I graduated with a Ph.D., bioinformatics was on the rise. We had just decoded the human genome, so the hype was high. Applying machine learning to biology and the massive data it started producing seemed extremely exciting. I had an invitation to join a friend’s lab at Vanderbilt University and work on these problems. For better or worse, I declined an offer to join NASA and became a Professor at Vanderbilt.
Data science was not my expertise, so it was an audacious decision, but it turned out exceptionally well for me. I have no regrets.
In your opinion, what have been the most relevant breakthroughs in data science impacting our world in the last 1-2 years, and what trends do you see emerging going forward?
There are amazing inventions and discoveries coming up all the time in this very vibrant field, so it is hard to single out the most important breakthroughs. Deep Learning, of course, is a technology and methodology that keeps spawning breakthroughs at a constant rate. The newer types of Generative Adversarial Networks (GANs) that can translate between data distributions are the most exciting to me because they connect different groups and types of data. But there are other technologies coming up that we should keep an eye on.
Of course, I am very biased in my opinion, but I consider Causal Discovery and Automated Machine Learning some of the next big things in the field. Causal Discovery deals with inducing causal relations from data, not just predictive ones. The major difference is that predictive relations, that is, statistical correlations, tell you how to predict the future, while causal relations tell you how to influence the future. So medicine, biology, business, economics, and most of our sciences seek causal models, not just predictive ones. We need to know causality to make sensible decisions. There have recently been some major advances in this field, and many superstars of machine learning are now working on causal discovery problems.
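To make the difference between predicting and influencing concrete, here is a minimal sketch in Python. It is not a causal discovery algorithm and it is not taken from the interview; it is a toy simulation built on an assumed setup in which a hidden factor `z` drives both `x` and `y`, while `x` has no effect on `y` at all. All names and coefficients (`simulate`, `do_x`, the factor 2.0, the noise levels) are hypothetical choices for illustration.

```python
import random


def simulate(n, seed, do_x=None):
    """Draw n samples; if do_x is given, x is set by intervention."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)              # hidden common cause
        if do_x is None:
            x = z + rng.gauss(0.0, 0.5)      # observational: x follows z
        else:
            x = do_x                         # intervention: x is forced
        y = 2.0 * z + rng.gauss(0.0, 0.5)    # y depends on z only, never on x
        xs.append(x)
        ys.append(y)
    return xs, ys


def mean(values):
    return sum(values) / len(values)


def slope(xs, ys):
    """Least-squares slope of y on x (the predictive relation)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def effect_of_intervention(n, x_low, x_high):
    """Change in average y when x is forced from x_low to x_high."""
    _, y_low = simulate(n, seed=1, do_x=x_low)
    _, y_high = simulate(n, seed=2, do_x=x_high)
    return mean(y_high) - mean(y_low)


if __name__ == "__main__":
    xs, ys = simulate(5000, seed=0)
    print("predictive slope of y on x:", round(slope(xs, ys), 2))
    print("effect of forcing x from 0 to 3:",
          round(effect_of_intervention(5000, 0.0, 3.0), 2))
```

In the observational data the slope is clearly positive, so `x` is a useful predictor of `y`. Forcing `x` to a different value, however, leaves the average `y` essentially unchanged, because in this assumed setup the association comes entirely from the hidden factor. A purely predictive model cannot tell these two situations apart; a causal model is meant to.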
Finally, Automated Machine Learning (AutoML) is the next big thing. It can automate a large part of machine learning (though not everything yet) and give a big boost to productivity. It can put machine learning in the hands of everybody and democratize data science. I started my entrepreneurial efforts trying to build a start-up offering products that do Causal Discovery; I talked to investors and got the impression that the market was still immature. So, I founded a start-up on AutoML instead, but what I really want to build soon is automated causal discovery.
What lessons have you learned about getting the company’s buy-in to leading through data versus gut?
Well, I am trained to analyze data and rely on data for decisions. So, as early as we could afford it, we put in place mechanisms to collect data about the use of our product, marketing results, users’ behavior, and the like. But when you are a start-up, you have to take the first few steps based on gut feeling. It may seem arbitrary, but it is not exactly. I had 15 years of experience in the field, I had spoken at numerous conferences and demoed our prototype to numerous people, so I did have some data imprinted in my brain to guide me. They were not hard data, though. Nevertheless, when you do start analyzing the actual data on your clients, 99 out of 100 times you will be surprised. And surprises will keep coming up. So overall, you must collect data and base your decisions on them, but when you are venturing in new directions and have new ideas, there is no substitute for intuition.
#About JADBio, Machine Learning in Drug Discovery and Covid-19
“Leaders must change their attitude and acknowledge they do not know everything about their business and their clients; the data may still hold new knowledge and a few surprises for them.” Ioannis Tsamardinos
How was JADBio born and what is the vision of the JADBio platform?
The roots of JADBio go back a long, long time. While I was still a graduate student, I had discussions with my fellow Ph.D. students about combining Planning and Machine Learning, specifically about automatically creating plans for how to perform an analysis for a specific dataset and problem at hand. But I had to finish my Ph.D., so I didn’t act on it. Then I became a faculty member at Vanderbilt University, and we really needed to automate analyses. We created the Gene Expression Model Selector, or GEMS, which was actually the first AutoML tool ever; the name AutoML had not been coined yet. Unfortunately, it was exceedingly difficult to explain what it does and why it is useful to most biologists and life scientists at the time. The market was not ready.
Back then, we heard a lot about how you “cannot build a predictive model with more predictors than samples”, a position classical statisticians would advocate. We won an award at a major conference, but nothing more than that. It was a flop. But, 10 years later, I decided to give it another chance. We did not build an independent product, but an AutoML add-on for another company called CLC Bio. By the time we had the add-on ready, CLC had been sold to another company and, again, our product was a flop. Eventually, I decided to do it the right way: get some serious funding, get a business plan in place, and create a shiny new independent product. That is how JADBio was born.
Our vision is to empower scientists, organizations, and institutions with correct and high-quality data analysis. This will lead to discovering new science and making evidence-based decisions and policies. But, more than that, we want AutoML to become so accessible that it will enable the participation of the ordinary citizen in the upcoming revolution of a data-based world. We want everybody on board analyzing their own data or public data that they find interesting. To achieve such goals, we need to create software that is intelligent enough to know how to handle all types of data and problems.
For the moment, we focus on biomedicine. Our hope is to save lives and discover new medicine and biology. We plan to expand to other types of data to encompass an ever-growing range of problems in biomedicine involving sequence data, medical signals, images, medical notes, single-cell data, and any new type that emerging technologies bring our way. We want to augment JADBio with other types of analytics like Causal Discovery and Causal Analytics. We are about to introduce technologies that analyze your dataset not just in isolation but in the context of thousands of other publicly available datasets in biorepositories. There is no shortage of ideas or ambition.
How have advances in machine learning accelerated the process of drug discovery?
We typically do not realize it, but drug discovery, like many other fields, is profoundly affected by machine learning and data science at all levels and stages of the process. To understand the biological mechanisms and identify possible drug targets, we use machine learning. To predict whether a compound will bind to the target, and hence whether it is worth trying, some methodologies use machine learning. To analyze clinical trial data to identify possible adverse or toxic effects, we use machine learning, in addition of course to standard statistics. To do precision medicine and predict the effectiveness of treatment on an individual … machine learning. To explore drug repurposing, machine learning. Of course, machine learning is not always the standard tool in each of these problems, but it is employed more and more frequently. So, I believe it is safe to say that drug discovery would be much slower or less effective without machine learning. Think about that the next time you take a pill.
Could you shed some light on how data science and machine learning helped in the various aspects related to COVID-19?
In many countries, the progress of the disease was closely monitored and measured, and data were collected. Researchers around the world jumped at the opportunity to analyze such data and contribute to fighting the pandemic. As of now, there are about 500 scientific papers on COVID-19 using the keyword “machine learning” in 2021 alone, and of course, probably thousands more that employ other data science techniques without explicitly referring to machine learning. Predictive models, causal models, and mathematical models (i.e., based on differential equations) were built to predict the rate of spread of the epidemic not only in the general population but also in sub-groups.
Researchers also tried to do some causal attribution and figure out what is causing the spread of the disease in these sub-groups, decide the best course of action and interventions, and inform public policy. For example, fellow Greek MIT Professor Dimitris Bertsimas developed a predictive and causal model named DELPHI that has been accurately predicting the rate of disease in the US. The model suggests ventilator pooling interventions to reduce the mortality of COVID-19. Of course, public policy is not only a matter of science, but also of politics, economics, and other factors. Other types of models focused on the biology of the disease and on identifying the sub-groups that are susceptible to a severe reaction to COVID-19. Such models could be employed to protect the most sensitive parts of the population and save lives. Using JADBio, we have recently developed and published such a model.
What advice would you give to other business leaders who would like to step into realising data science use cases?
On one hand, one must take their business into the data science era to stay competitive. This means recording all the data of the business with the intention of analyzing them. The analysis should not be an afterthought; you must set up your dataset correctly right from the start. The data must be recorded systematically, with well-defined procedures, meaning, and structure. One must decide in advance what they would like to learn from the data, so that the right quantities are measured to allow predictions. If you do not learn from your business data, you simply will not survive the competition in the next decade. Business leaders must acknowledge that their data are valuable; mining and analyzing them will help them improve their business and make evidence-based business decisions. Gut and intuition are not enough. Leaders must change their attitude and acknowledge they do not know everything about their business and their clients; the data may still hold new knowledge and a few surprises for them.
On the other hand, they must be aware of the dangers, as there are some. First, data science is costly, data scientists are scarce, and setting up a quality data science team is not easy. Hopefully, the costs of analysis will go down as AutoML products reach maturity. In addition, the quality of data scientists out there varies greatly! Business people must be careful. Numerous people write “data scientist” on their resume just because they know how to call Python libraries, while they have no in-depth knowledge of mathematics, statistics, and machine learning. I have worked with analysts in businesses who built predictive models that were on par with random guessing! Of course, they didn’t know that, and their boss didn’t know that, because they were never taught the right methodology to measure the predictive performance of a model in certain atypical situations. It is not always as trivial as one may think. Subtle statistical phenomena lurk to threaten the analysis.
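One way a model can look good while being no better than guessing is class imbalance. The interview does not say which atypical situations were involved, so the following is only an assumed example: a hypothetical dataset with 5% positive cases and a "model" that always predicts the negative class. The function names `accuracy` and `balanced_accuracy` are written out here for illustration rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls; 0.5 means chance level for two classes."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Predictions made for the samples that truly belong to this class.
        members = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in members) / len(members))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced dataset: 50 positives, 950 negatives.
y_true = [1] * 50 + [0] * 950
# A useless "model" that ignores its input and always answers 0.
always_negative = [0] * len(y_true)

print("accuracy:", accuracy(y_true, always_negative))
print("balanced accuracy:", balanced_accuracy(y_true, always_negative))
```

Plain accuracy rewards this model for the sheer number of negatives, while balanced accuracy shows that it never finds a single positive case. Comparing any reported score against such a trivial baseline is a cheap sanity check before trusting a model.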
#Words for aspiring data scientists
What skills and attitudes do you look for when hiring data scientists?
They do not call Data Science a science by accident. Analyzing data always has a research component. You are discovering patterns; you are exploring. You do not necessarily always know what exactly you are looking for, or what is interesting, or what is the best way to visualize results. So, some major skills for a data scientist are to be inquisitive, take the initiative, and be research-oriented. Not necessarily in the sense of inventing new algorithms and methods, but in researching what the data are trying to tell you. As a supervisor, you cannot always give precise steps and algorithms to a data scientist and handhold them the whole way through. They must have the natural inclination to dig deeper, to try to understand, to visualize data in many ways, and to come up with ideas, explanations, and interpretations. It is an art and not an exact science. The ideal data scientist to me is a scientist, is an artist, is a researcher.
What are some resources that you would recommend young data scientists tap into?
I am not going to recommend the next Python library to use or an R package or a new programming language for machine learning. These change all the time and can easily be found with a Google search. I think a major problem for new scientists in the field is discovering the gems and throwing away the dirt, because there is just so much hype and so many articles and videos out there. I would recommend that young data scientists learn the statistics and machine learning theory behind the tools they use. I would recommend reading classic books like “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman. Do not get caught up in the hype of Deep Learning alone. One must have a more complete background that includes basic statistical and machine learning concepts and techniques. Learning causal discovery and modeling completely changes your view and perspective of data science and is definitely worth it. The book “Causation, Prediction, and Search” is a must on the subject.
Tag one or two data science leaders that you would like to see answer these questions.
Well, two people I respect, and whom I find not only knowledgeable and successful but also visionaries with a broad view of the field, are Bernhard Schölkopf at the Max Planck Institute and Yoshua Bengio. They recently co-authored an interesting article combining causal discovery and deep neural networks that points to an interesting direction at the heart of what could be the next breakthrough in the field.