Abstract

Machine learning-guided protein engineering is a new paradigm that enables the optimization of complex protein functions. Machine-learning methods use data to predict protein function without requiring a detailed model of the underlying physics or biological pathways. They accelerate protein engineering by learning from the information contained in all measured variants and using it to select variants that are likely to be improved. We begin with a review of the basics of machine learning, with a focus on applications to protein engineering and protein sequence-function datasets (Chapter 1). We then apply the entire machine learning-guided engineering paradigm to channelrhodopsin (ChR), an algal-derived light-gated channel that can be used to modulate neuronal activity with light. We build models that discover ChRs with strong plasma membrane localization in mammalian cells (Chapter 2) and with unprecedented light sensitivity and photocurrents for optogenetic applications (Chapter 3). Machine learning-guided evolution requires a model that learns the relationship between sequence and function. For such a model to learn from protein sequences, the sequences must be represented as vectors or matrices of numbers, and how each sequence is represented determines what can be learned. We learn continuous vector encodings of sequences from patterns in unlabeled sequences (Chapter 4). Learned encodings are low-dimensional, do not require alignments, and may improve performance by transferring information in unlabeled sequences to specific prediction tasks. Alternatively, we demonstrate an interpretable Gaussian process kernel tailored to biological sequences (Chapter 6). In addition to a model that predicts function from sequence, engineering requires a method that uses the model to choose sequences for the next round of evolution. Most machine learning-guided engineering strategies assume that selected sequences can be queried directly. However, in directed evolution it is common to design a library of sequences and then sample stochastic batches from that library. We propose a batched stochastic Bayesian optimization algorithm for iteratively designing and screening site-saturation mutagenesis libraries (Chapter 5).
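
To make the point about numerical representations concrete, the following is a minimal sketch of one-hot encoding, a common baseline that turns a sequence of length L into an L x 20 matrix. It is offered only as an illustration and is not the learned encoding of Chapter 4; the names `AMINO_ACIDS`, `one_hot_encode`, and `flatten`, as well as the example sequence, are hypothetical.

```python
# Illustrative sketch only: one-hot encoding of a protein sequence.
# All names here are hypothetical and not taken from the thesis.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix (list of lists) of 0/1 values."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        # Raises KeyError for residues outside the 20 canonical letters.
        row[AA_INDEX[residue]] = 1
        matrix.append(row)
    return matrix


def flatten(matrix):
    """Concatenate the rows into a single vector for models that expect one."""
    return [value for row in matrix for value in row]


encoded = one_hot_encode("MKV")  # toy three-residue sequence
vector = flatten(encoded)
print(len(encoded), len(encoded[0]), len(vector))  # prints: 3 20 60
```

A one-hot matrix treats every pair of distinct amino acids as equally dissimilar and its size grows with sequence length, which is one reason a lower-dimensional learned encoding may be attractive.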
