GEN Biotechnology, Vol. 1, No. 3 | Asked & Answered | Free Access

A Data-Driven Lens to Understand Human Biology: An Interview with Daphne Koller

Daphne Koller (insitro, South San Francisco, California, USA) and Malorye A. Branca (Contributing Editor, GEN Biotechnology)

Published Online: 14 Jun 2022 | https://doi.org/10.1089/genbio.2022.29028.dko

Daphne Koller (Founder and CEO of insitro).

With healthy financing ($500 million and up) and big deals with pharma and Genomics England, insitro is one of the brightest lights among the growing number of companies seeking to deploy artificial intelligence (AI) in drug discovery and development. Part of that is surely due to its founder's passion for combining computational science with biology and chemistry.

Although she was not trained as a biologist, machine learning specialist Daphne Koller was drawn to drug discovery and development because it was more technological and aspirational than "telling spam from nonspam." Koller, who has a PhD in computer science, was a Stanford professor for ∼20 years and was also cofounder and co-CEO of the online learning site Coursera, among other endeavors.

But human health became her passion, and in 2018 she founded insitro. The data are there, Koller says. And insitro is combining induced pluripotent stem cells (iPSCs), genome editing, high-content cellular phenotyping, machine learning, and other data-generating tools to build in vitro models of disease that are maximally predictive of human clinical outcomes.

Now the company is applying that platform to data from partners such as Genomics England, which has ∼150,000 complete genome sequences with corresponding phenotypic data from rare disease patients and their families. Another partner is Bristol Myers Squibb, with which insitro has a 5-year collaboration for the discovery and development of novel therapies for amyotrophic lateral sclerosis (ALS) and frontotemporal dementia.

In this exclusive interview (originally recorded for GEN Edge and lightly edited for length and clarity), Malorye A. Branca talks to Koller about what is finally propelling AI in drug discovery and development and how insitro aims to become a leader in this emerging field.

There is a lot of excitement around AI. Why do you think it is coming to fruition? Is it more than just hype?

Koller: Boy, there is a lot to unpack in that question! Let me start with the "why now." I think that machine learning has made a tremendous amount of progress in the past decade—way more than I would have anticipated, and across multiple different domains. I think we are finally in a world where machine learning has demonstrated the promise that had been in place for decades; we are finally there.

As to machine learning and drug discovery, I think these are much earlier days, partly because the amount of data available for training models is much more abundant in areas such as natural language processing or image recognition. Biological data and chemical data are hard to create and hard to come by. I think there is a lot of potential, and we are starting to see some very large data sets—although maybe not at the scale of images on the web—that enable machine learning to be appropriately done.

Now, as to the question about hype, the answer is yes!
There is a tremendous amount of hype out there that is often quite hyperbolic and misleading to people in ways that are counterproductive. There is a lot of good work happening, but if you oversell the work that is happening with hyperbolic promises that are not likely to come true in the coming years—like saying we are going to have 1000 drugs in the clinic in the next 3 years—well, you are not.

Drug discovery is really hard. It is important to stay balanced, conveying the promise while also conveying the challenges of what is fundamentally a really hard problem for us to solve—AI notwithstanding.

What brought you into this field initially?

I have been working in this field for a little over 20 years; I got into it in 1999–2000. I was actually a fairly traditional machine learning person, to the extent that traditional machine learning people were around in the mid-1990s—I was one of the first people into the field. But I was not interested in biology when I started. I was mostly working on more standard applications, such as computer vision and robotics.

But the data sets that were available to machine learning people at the time were not nearly as interesting as what we have today. They were very small and, frankly, unaspirational—how excited can you get about classifying spam versus nonspam? I initially became interested in biology because it was more technologically interesting and also more aspirational than some of those other applications. Over time, I became interested in the field in its own right, despite not having any training in biology at the start, and just taught myself biology over the past 20 years.

As I started to get more and more into biology, my Stanford laboratory had a bifurcated existence. Half my laboratory did core machine learning and published in computer science venues, whereas the other half published in biology journals. My computer science friends did not even realize that I did biology. My biology friends did not imagine that I was in a computer science department. So those were interesting entry points into the field.

What has changed with the data?

I have seen change on two fronts. First, on the clinical side, we are seeing more and more high-quality, high-content clinical data acquired from people. The U.K. Biobank is a wonderful example of that and has unlocked so much value in terms of discovery. We are now seeing similar biobanks, even in the United States. For example, the All of Us project just recently released some of the early versions of its data set. We are starting to see the availability of electronic health records, certainly in the United Kingdom through the connection to the National Health Service, but it is even happening here in the United States.

I think that the amount of clinical data is growing quite dramatically, and we are only at the beginning of that inflection curve. Often those data are aligned with genetics, which really unlocks the capabilities for drug discovery.

The other form of data that is becoming more available is in vitro laboratory data. When I started working on machine learning and biomedical data sets—back in the late 1990s/early 2000s—a large data set was one that had 200 samples. Now we have data sets where people are doing single-cell RNA sequencing and you have hundreds of millions of cells that you are sequencing, imaging, or whatever.
In many cases, those are also much more relevant to human biology. We are no longer doing experiments in yeast cells or even in cancer lines. That is where we are starting to see the other side of data—laboratory data that are much more abundant and much more relevant to human biology.

You have a huge deal with the United Kingdom. Where else are you getting your data from?

Some of our data come through partnerships, including our partnership with Gilead. We were very excited about the deal that we had with Gilead in nonalcoholic steatohepatitis (NASH) to get access to some of their clinical trial data. The trials, as it happened, were not successful, but the data quality was incredible. We were able to extract many insights on the progression and genetics of NASH in that analysis. Even though those were not huge data sets by machine learning standards, they were still quite valuable because of the quality and density of the data, such as histopathology images from patients at the beginning and the end of the trial.

Fortunately, there are other public or nonprofit organizations that collect data with the promise of unlocking value for patients in many indications.

The other advance is that we are making our own data. At insitro, we have built a considerable wet laboratory infrastructure with automation, iPSCs, microscopy, transcriptomics, and more. We are generating data at scale that are specifically relevant to unlocking our understanding of the biology of the diseases that we are studying.

What would you consider a big enough data set?

People always ask that of machine learning people, and there is no single answer, because it depends on how subtle and complex the problem is that you are looking to solve. If the thing that separates your positives from negatives, or predicts your quantitative trait, is relatively straightforward to read from your data, you can make do with a few hundred data points. But if it is a really subtle, complex signature in a very convoluted space—which in many cases is true for chemistry, where the space of chemical compounds is said to be around 10^80, and you are trying to predict what makes a tiny molecule with some variable conformation bind to all the similarly moving pockets of a protein—you may need more data than that.

That is why we created a chemistry infrastructure at scale using DNA-encoded libraries, whose primary purpose is to create data to train machine-learning models on binding affinity.
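To make that concrete, a DNA-encoded library screen effectively yields a large table of compounds with a binding readout for a given target, and the modeling goal is to predict that readout for compounds that were never screened. The sketch below illustrates only that supervised setup; it is not insitro's pipeline. The fingerprints and affinities are synthetic stand-ins, and the random-forest model is an arbitrary illustrative choice.

```python
# Minimal sketch of learning binding affinity from screen-style data.
# All data here are synthetic stand-ins; illustrative only, not insitro's method.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_compounds, n_bits = 2000, 256                       # toy scale; real DEL screens are far larger
X = rng.integers(0, 2, size=(n_compounds, n_bits))    # binary "fingerprint" features per compound
w = rng.normal(size=n_bits)
y = X @ w + rng.normal(scale=0.5, size=n_compounds)   # synthetic binding signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 3))
```

In practice the compound representation, assay noise, and model class matter enormously; the point is only that a screen of this kind produces supervised training data at a scale individual assays cannot.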
What different types of data are you focused on?

That is a great question. I briefly alluded to most of the data types that we care about. In our own wet laboratory environment, we have efforts in both biology and chemistry, creating data using DNA-encoded libraries that allow us to measure, at incredible scale, which compounds bind to a particular protein target.

On the biology side, where most of our efforts have gone, we create cellular models of disease based on iPSCs, which carry the genetics of different people with or without disease. We phenotype those cells using a multitude of high-content modalities—fixed microscopy with stains, live-cell microscopy, single-cell transcriptomics, and multiple other readouts of those cells—to understand how disease genetics might manifest in cellular phenotypes.

That is all great, but ultimately, disease models are only as good as their ability to predict disease in humans. The other form of data, as mentioned, is high-content data from human clinical outcomes.

The part that we really care about—and it ties into the deal that we have with the United Kingdom—is high-content data from humans: not just the relatively limited and often subjective ascertainment of disease/no disease, but something that is measured objectively, with a lot of information about the underlying biology.

Histopathology data obtained from biopsy samples are one incredibly rich source. We also found that there is a lot of information in brain magnetic resonance imaging (MRI) that gets lost when people summarize the MRI output down to one or two summary statistics. There is also increased collection of things such as serum proteomics and transcriptomics, which measure molecular data from blood. All of those are data modalities that we think shed light on underlying biological processes, which we can then align to what we see in our cellular data. The experiments in the cell become translatable to what is likely to happen in the human.

But aren't the algorithms where the buck stops? How have they advanced?

In machine learning, 80% of the value is from having better data and 20% is from having a better algorithm. A great algorithm on a lousy data set can only go so far. We invested a lot of effort in data creation and data collection so that we can have really good data sets; once you have those, then better algorithms can unlock value, and we have made a very significant investment in better machine learning models.

For example, our live-cell microscopy, which is a highlight of our company's technology stack, is a really sophisticated microscope that shines light into cells at different angles on a very quick rotation. It turns out that light refracts in different ways depending on what exactly in the cell it hits. Although a person cannot make sense of that blur, we can use machine learning to create a much higher resolution, higher-content readout of what is happening in the cell. Then we can impute things such as cellular compartments, lipids, cell membranes, and all sorts of things that are simply not perceivable by the human eye.

Machine learning comes in all sorts of different places for us. It comes in the raw interpretation of data, as in this example. It also comes up in looking at cells—for example, at high-content data that come from, say, patients versus healthy individuals—and asking what it is that makes them different. Do we see a signature of disease that really is capturing the underlying pathogenic processes? With that, can we then search through some of our wet laboratory tools for something that seems to revert that disease signature closer to a healthy state?

That is another place where machine learning comes in—creating disease models that are unbiased and rich in terms of capturing the biological state of the patient.

How do you really know that they are unbiased?

Nothing is ever entirely unbiased. You decide what to measure, and that introduces a certain bias into the analysis. The good news is that with machine learning thinking, you can ask whether what you are measuring in the cellular system is truly predictive of human clinical outcome. You can say: I have learned a healthy-versus-disease signature in one subset of patients—to what extent does that actually predict disease state in a different subset of patients? That allows the machine to prove, in some sense, that it has learned something that is meaningful, versus something that is just imposed by human intuition.
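That held-out-patient test is straightforward to set up with standard tooling. The sketch below is illustrative rather than insitro's actual evaluation: the per-cell features and donor labels are synthetic, and the logistic-regression model is arbitrary. The point is the grouping, so that cross-validation folds are split by patient and the model is always scored on donors it never saw during training.

```python
# Illustrative patient-held-out evaluation: does a disease-vs-healthy signature
# learned on some donors predict disease state for donors the model never saw?
# Synthetic data; not insitro's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
n_patients, cells_per_patient, n_features = 40, 50, 100
patients = np.repeat(np.arange(n_patients), cells_per_patient)          # donor ID per cell
labels = np.repeat(rng.integers(0, 2, n_patients), cells_per_patient)   # disease status per donor
X = rng.normal(size=(labels.size, n_features))                          # per-cell phenotype features
X[labels == 1, :5] += 0.5                                               # synthetic "disease signature"

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, labels,
    groups=patients, cv=GroupKFold(n_splits=5), scoring="roc_auc",
)
print("patient-held-out AUROC per fold:", np.round(scores, 2))
```

If the signature were an artifact of a particular batch or donor, the per-fold scores would collapse toward chance on the unseen patients.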
What do you see as the major hurdles for you?

I think the data remain a challenge. Biology is really challenging—you are dealing with live systems, and everything influences something; breathing in a different way changes the experiment. So creating high-quality data that are not confounded, and that are sufficient and at scale, remains a challenge. It is something we have spent a lot of effort on and made progress on, but there is a long way to go.

Another important challenge in this space is the lack of availability of talent. That is an issue for machine learning in general—the war for talent in this space is unbelievably hard. We need a unique subset of those individuals who are either knowledgeable in, or want to become knowledgeable in, biology and chemistry, so we are drawing from a much smaller pool of talent. This is a place where academic institutions could be doing a much better job of creating a talent pool of what I call bilingual people—people who speak computing as well as biology and chemistry. Those people are really hard to come by, and having more of them would, I think, unlock a tremendous amount of value in the space of what I call digital biology—the ability to take a very data-driven lens to understanding biology and human disease.

Have you seen substantial progress in digital biology?

We have seen a lot of progress in the last few years. If we come back to some of the examples around the U.K. Biobank, many published articles have emerged from that. It is all enabled by computational methods that understand the connection between genetics and a whole array of very diverse and sometimes quite complex phenotypes. If we look at the work that has been happening around understanding cell biology and measuring things at the single-cell level, such as the Human Cell Atlas and understanding cellular states, all of that requires extensive computational methods.

There has been a huge amount of progress in this field, broadly construed, and some early successes on the drug discovery side. Although, as we know, in drug discovery the proof is really when you put the drug in a person and it works. That takes years. Going from a new insight to an approved drug is going to take a while.

But honestly, if you think about digital biology in the broader sense, even the work that happened during COVID-19 by companies such as Moderna and Pfizer-BioNTech on the one side, and on antibody design by companies such as Vir and AbCellera on the other, was really digital design of therapeutic matter. It was not that they just took something as it was in nature. There was a lot of fine-tuning of the compound in ways that really thought about it as a digital object. In that respect, even in drug discovery, while not the full promise of machine learning-discovered drugs, there was a lot of data science that went into the design of those specific compounds.

What are your goals for insitro and how are you going to achieve them?

We are really looking to use some of these high-content data sets to inform our understanding of human biological state and how that might manifest in disease.
Right now, I believe that our taxonomy of human disease is incredibly obsolete—it is derived from clinical symptoms that are not reflective in many ways of the underlying biology. It is also very coarse grained and filtered through the subjective lens of the patient, and often the clinician, so there is a lot of subjectivity in how you interpret what is actually happening to the patient's body.

That basically means that we are often taking things that are quite distinct biologies and calling them by the same name. We have seen in oncology how much power we get by understanding that breast cancer is not one thing—it is multiple different things, and each of those is best treated by a completely different therapeutic. Chemotherapy, which is the lowest common denominator, is really not very effective compared with these modern-day treatments.

We have not done that for [most] diseases. We have not understood the subtypes or the intervention nodes for each of those subtypes. We are taking many of these high-content data sets from both humans and cells and uncovering what the underlying biological processes are, and, for each of those, what the right intervention node is.

One of the places where we have done a lot of work in that way is in neuroscience. Tuberous sclerosis complex (TSC) is a monogenic disease—one of two genes [TSC1 or TSC2] has a mutation. Using some of our high-content phenotypes, we have identified new intervention nodes that are potentially modulators of the disease. That is something that we have put into drug discovery using our DNA-encoded library platform.

We also have a very exciting partnership with Bristol Myers Squibb in ALS that uses an extended version of what we did with TSC, in which we identify the high-penetrance variants that are known familial drivers of ALS. Those provide a clear signal of subpopulations that are actually quite different across the driver variants, and potentially point to treatments that might help those subsets of patients.

Then, with the ability to interrogate the phenotypic landscape using our cell-based systems, we might be able to ask: which subsets of patients, among the sporadic occurrences, are more similar to this familial variant versus that one? That way we can figure out how to expand the set of patients who are treated by each intervention.

We have made a tremendous amount of progress toward creating the infrastructure to do that, in terms of phenotyping, but also in creating arguably one of the largest banks of ALS-relevant human cell lines—isogenic lines: a wild type and then one that is exactly the same except with a familial mutation introduced. You can make a one-to-one comparison without a lot of variability thrown in. We have well over 100 ALS lines like that and are now asking "what does one of those familial variants do?" and "how can we revert the motor neurons back to a healthier state?"
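One way to picture what the isogenic design buys you: because the edited line differs from its wild-type parent only by the introduced familial variant, any shift in the high-content phenotype features can be attributed to that variant rather than to donor-to-donor variability. The toy sketch below illustrates such a one-to-one comparison with synthetic features and a plain per-feature t-test; it is not insitro's actual analysis.

```python
# Toy isogenic-pair comparison: wild-type cells vs the same genetic background
# with a familial ALS variant introduced. Synthetic data, illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_cells, n_features = 500, 50
wt = rng.normal(size=(n_cells, n_features))    # wild-type line, per-cell phenotype features
mut = rng.normal(size=(n_cells, n_features))   # isogenic line carrying the variant
mut[:, :3] += 0.8                              # synthetic variant effect on a few features

t, p = stats.ttest_ind(mut, wt, axis=0)        # per-feature comparison across the pair
shift = mut.mean(axis=0) - wt.mean(axis=0)
for i in np.argsort(p)[:5]:                    # features most affected by the variant
    print(f"feature {i}: shift={shift[i]:+.2f}, p={p[i]:.1e}")
```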
You have explained it beautifully, but can you boil it down? How does this change drug discovery?

Most of our drugs currently fail, typically in phase II or phase III, which is when people squint at the data and push something forward that should not have been advanced. They fail for lack of efficacy, because our understanding of disease biology is very limited and we rely either on our intuitive cartoon pathways that are drawn on the board, or sometimes on animal models that frankly do not get the disease in question—animals do not get Alzheimer's disease or ALS. Or we introduce a phenotypic copy into the animal and then pretend that we are curing a disease, when really what we are potentially curing is some kind of variant that is not translatable to humans.

What we hope to do is to use humans as a model for humans—to really focus on human biology as the basis for target selection. Hopefully, by doing so, we can reduce the probability of failure, which is currently 95%! Honestly, if you can reduce the probability of failure from 95% to 90%, you have doubled productivity, so there is a lot of headroom in terms of how much one can improve.

How high can we go? I do not know but, God, it is worth trying!
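The arithmetic behind that "doubled productivity" point is worth spelling out: a failure rate of 95% means roughly 5 of every 100 programs succeed, so cutting failure to 90% lifts that to 10 of every 100, twice the output for the same number of programs. A minimal check:

```python
# Quick check of the productivity arithmetic cited above.
for failure_rate in (0.95, 0.90):
    successes_per_100 = (1 - failure_rate) * 100
    print(f"failure {failure_rate:.0%} -> about {successes_per_100:.0f} successes per 100 programs")
```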
