Traditional synthetic biology relies on trial and error, which is inefficient and prone to getting trapped in local optima. Recent advances in high-throughput experimental techniques generate large amounts of biological data, which enables the use of machine learning to close the “design-build-test-learn” loop. Machine learning, especially deep learning, is a data-driven modeling method that extracts useful patterns from big data and then leverages the learned knowledge to tackle specific tasks. In this review, we aim to provide a brief primer on machine learning for synthetic biologists. Starting with a common taxonomy, we introduce representative methods, pipelines, and underlying principles of machine learning that can be applied in synthetic biology. We include typical methods such as support vector machines, deep neural networks, generative adversarial nets (GANs), transfer learning, and reinforcement learning. In particular, discriminative models, including convolutional neural networks and support vector machines, are appropriate for predicting sequence-function relationships. Generative models, including GANs and deep generative models for graph generation, are suitable for sequence or network design. Next, we review recent applications of machine learning in studying synthetic biology parts and modules, including promoters, bioactive peptides, enzymes, metabolic pathways, and genetic circuits. For example, DeePromoter combined a convolutional neural network with a long short-term memory network to achieve an accuracy as high as 90% when predicting promoter sequences. For enzyme design, a Gaussian process model was combined with Bayesian optimization using the upper confidence bound method, which resulted in the engineering of thermostable P450 enzymes. For antimicrobial peptides, a GAN enhanced with a feedback mechanism was trained to design peptide sequences with new functions. Finally, we conclude with future challenges and directions.
In particular, interpretable machine learning models are desirable to guide mechanistic investigation. Moreover, it is necessary to develop new machine learning methods that are more compatible with biological data, which are heterogeneous, multi-modal (such as sequence, network, image, and structure), and often lack proper labels. With the increasing availability of big biological data and the development of machine learning methods tailored for synthetic biology, we envision a paradigm shift towards a closed cycle of “design-build-test-learn” in creating artificial life with predictable functions.