Abstract

Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e., fitness) through expensive and time-consuming screening or selection over a large mutational sequence space. Machine learning-assisted directed evolution (MLDE), which screens sequence properties in silico, can accelerate the optimization and reduce the experimental burden. This work introduces an MLDE framework, cluster learning-assisted directed evolution (CLADE), that combines hierarchical unsupervised clustering sampling and supervised learning to guide protein engineering. The clustering sampling selectively picks and screens variants in targeted subspaces, which guides the subsequent generation of diverse training sets. In the last stage, accurate predictions from supervised learning models improve the final outcomes. By sequentially screening 480 sequences out of 160,000 in a four-site combinatorial library, in five equal experimental batches, CLADE achieves global maximal fitness hit rates of up to 91.0% and 34.0% on the GB1 and PhoQ datasets, respectively, compared with 18.6% and 7.2% for random-sampling-based MLDE.
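The screening budget quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes that each of the four sites can take any of the 20 canonical amino acids, which is consistent with the stated library size of 160,000. The variable names are hypothetical.

```python
# Assumption: each mutated site may hold any of the 20 canonical amino acids.
N_AMINO_ACIDS = 20
N_SITES = 4

library_size = N_AMINO_ACIDS ** N_SITES      # size of the combinatorial library
n_screened = 480                             # total sequences screened (from the abstract)
n_batches = 5                                # equal experimental batches (from the abstract)

batch_size = n_screened // n_batches         # sequences screened per batch
fraction_screened = n_screened / library_size

print(library_size, batch_size, f"{fraction_screened:.2%}")
# prints: 160000 96 0.30%
```

Under that assumption, each batch holds 96 sequences, and the whole campaign covers 0.3% of the library.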

Highlights

  • Directed evolution, a strategy for protein engineering, optimizes protein properties through expensive and time-consuming screening or selection over a large mutational sequence space

  • The cluster learning-assisted directed evolution (CLADE) framework is a two-stage procedure consisting of three components: experimental screening, unsupervised clustering and supervised learning

  • Similar search approaches that use a hierarchical tree, such as hierarchical optimistic optimization (HOO)[47], deterministic optimistic optimization (DOO) and simultaneous optimistic optimization (SOO)[48], were previously proposed to optimize a smooth black-box function defined on a continuous space
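The two-stage procedure named in the highlights (cluster-guided screening followed by supervised learning) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the four-letter alphabet, the `toy_fitness` function standing in for an experiment, the `cluster_of` rule standing in for unsupervised clustering, the choice to weight clusters by the mean fitness of their screened members, and the additive per-site model used as the supervised learner. The hierarchical refinement of clusters is omitted.

```python
import itertools
import random

ALPHABET = "ACDE"  # toy alphabet; a real library would use 20 amino acids
N_SITES = 3        # toy size; the library in the source has four sites


def toy_fitness(seq):
    # Hypothetical ground truth standing in for an experimental screen.
    return seq.count("A") + (2.0 if seq.startswith("AC") else 0.0)


def cluster_of(seq):
    # Hypothetical stand-in for unsupervised clustering: group by first residue.
    return seq[0]


def cluster_guided_sampling(library, n_batches, batch_size, rng):
    """Stage 1: screen in batches, favoring clusters with high observed fitness."""
    clusters = {}
    for seq in library:
        clusters.setdefault(cluster_of(seq), []).append(seq)
    screened = {}
    for _ in range(n_batches):
        # Cluster weights are fixed at the start of each batch.
        weights = {}
        for cid, members in clusters.items():
            vals = [screened[s] for s in members if s in screened]
            # Unexplored clusters get a neutral weight of 1.0 (an assumption).
            weights[cid] = (sum(vals) / len(vals) if vals else 1.0) + 1e-6
        batch = []
        for _ in range(batch_size):
            open_ids = [cid for cid, members in clusters.items()
                        if any(s not in screened and s not in batch for s in members)]
            if not open_ids:
                break
            cid = rng.choices(open_ids, weights=[weights[c] for c in open_ids])[0]
            pool = [s for s in clusters[cid] if s not in screened and s not in batch]
            batch.append(rng.choice(pool))
        for s in batch:
            screened[s] = toy_fitness(s)  # the simulated "experiment"
    return screened


def fit_additive_model(screened):
    """Stage 2: toy supervised model, summing mean fitness per (site, residue)."""
    sums, counts = {}, {}
    for seq, y in screened.items():
        for i, a in enumerate(seq):
            sums[(i, a)] = sums.get((i, a), 0.0) + y
            counts[(i, a)] = counts.get((i, a), 0) + 1
    mean_all = sum(screened.values()) / len(screened)

    def predict(seq):
        # Unseen (site, residue) pairs fall back to the overall mean.
        return sum(sums[k] / counts[k] if k in counts else mean_all
                   for k in enumerate(seq))

    return predict


def run_clade_sketch(seed=0, n_batches=4, batch_size=6, n_top=5):
    rng = random.Random(seed)
    library = ["".join(p) for p in itertools.product(ALPHABET, repeat=N_SITES)]
    screened = cluster_guided_sampling(library, n_batches, batch_size, rng)
    predict = fit_additive_model(screened)
    unscreened = [s for s in library if s not in screened]
    top = sorted(unscreened, key=predict, reverse=True)[:n_top]
    return library, screened, top
```

Calling `run_clade_sketch()` returns the toy library, the screened sequences with their measured fitness, and the unscreened sequences that the toy model ranks highest. In a real campaign the last group would be the candidates sent for final experimental validation.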


Summary

Introduction

Directed evolution, a strategy for protein engineering, optimizes protein properties (that is, fitness) through expensive and time-consuming screening or selection over a large mutational sequence space. Active learning is a popular approach in MLDE, in which the sequential selection of sequences is decided by the combination of a surrogate model and an acquisition function. The former learns the sequence-to-fitness map from labeled data, and the latter uses the predictions of the surrogate model to prioritize a set of sequences to be screened in the next round of experiments[37]. Rather than relying on many sequential experimental iterations, focused training MLDE was proposed to reduce the experimental burden to only two iterations[2]. This approach uses unsupervised zero-shot predictors[19,22,40,41] to predict fitness without experiments, and these predictions restrict training set selection to a small informative subset.
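The interplay between a surrogate model and an acquisition function can be sketched as a small loop. This is a minimal illustration under stated assumptions, not a method from the cited works: the surrogate is a hypothetical distance-weighted average over labeled sequences, its uncertainty is simply the Hamming distance to the nearest labeled sequence, the acquisition function is an upper-confidence-bound style score with an arbitrary `beta`, and `toy_oracle` stands in for the experiment.

```python
import itertools
import random


def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))


def surrogate(labeled, seq):
    """Toy surrogate: distance-weighted mean fitness plus a crude uncertainty."""
    dists = {s: hamming(s, seq) for s in labeled}
    w = {s: 1.0 / (1 + d) for s, d in dists.items()}
    mean = sum(w[s] * labeled[s] for s in labeled) / sum(w.values())
    uncertainty = min(dists.values())  # far from all labeled data = uncertain
    return mean, uncertainty


def ucb(mean, uncertainty, beta=0.5):
    # Acquisition function: trade off predicted fitness against uncertainty.
    return mean + beta * uncertainty


def toy_oracle(seq):
    # Hypothetical stand-in for an experimental fitness measurement.
    return seq.count("A") + (2.0 if seq.startswith("AC") else 0.0)


def active_learning(library, oracle, n_rounds, batch_size, rng):
    # Round 1: no labeled data yet, so start from a random batch.
    labeled = {s: oracle(s) for s in rng.sample(library, batch_size)}
    for _ in range(n_rounds - 1):
        pool = [s for s in library if s not in labeled]
        # Rank the unlabeled pool by acquisition score and screen the top batch.
        pool.sort(key=lambda s: ucb(*surrogate(labeled, s)), reverse=True)
        for s in pool[:batch_size]:
            labeled[s] = oracle(s)
    return labeled


library = ["".join(p) for p in itertools.product("ACDE", repeat=3)]
labeled = active_learning(library, toy_oracle, n_rounds=4, batch_size=5,
                          rng=random.Random(0))
best = max(labeled, key=labeled.get)
```

After the loop, `labeled` holds every screened sequence with its measured fitness, and `best` is the fittest sequence found. Raising `beta` pushes the selection toward unexplored regions, while lowering it favors sequences the surrogate already predicts to be fit.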

