Abstract

Protein engineering increasingly relies on machine learning models to computationally pre-screen variants and identify those that meet the target requirements. Although machine learning approaches have proven effective, their performance on prospective screening data has room for improvement, and prediction accuracy can vary greatly from one variant to the next. So far, it is unclear what characterizes the variants associated with large model error. To answer this question, we designed and generated a dataset that can be stratified by four structural characteristics: buriedness, number of contact residues, proximity to the active site, and presence of secondary structure. We found that variants with multiple mutations at sites that are buried, closely connected with other residues, or close to the active site, which we call challenging mutations, are harder to model than their counterparts (i.e. exposed, loosely connected, or far from the active site). This effect emerges only for variants with multiple challenging mutations; variants with a single mutation at these sites were not harder to model. Our findings indicate that variants with challenging mutations are appropriate benchmarking targets for assessing model quality, and that stratified dataset design can be leveraged to highlight areas of improvement for machine learning guided protein engineering.
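
As an illustration of the stratified evaluation described in the abstract, the following minimal sketch groups prediction errors by a structural stratum and by the number of mutations per variant, then reports the mean absolute error for each group. It is not the authors' pipeline: the record layout, the stratum labels, the function name `stratified_mae`, and all numeric values are hypothetical placeholders chosen only to show the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (stratum label, number of mutations, measured, predicted).
# The values are made-up placeholders, not data from the study.
records = [
    ("buried", 1, 0.80, 0.75),
    ("buried", 3, 0.40, 0.90),
    ("exposed", 1, 0.85, 0.80),
    ("exposed", 3, 0.70, 0.65),
]


def stratified_mae(rows):
    """Return the mean absolute error per (stratum, mutation count) group."""
    errors = defaultdict(list)
    for stratum, n_mut, measured, predicted in rows:
        errors[(stratum, n_mut)].append(abs(measured - predicted))
    return {key: mean(vals) for key, vals in errors.items()}


mae_by_group = stratified_mae(records)
for key in sorted(mae_by_group):
    print(key, round(mae_by_group[key], 3))
```

Comparing the per-group errors side by side is one simple way to check whether a model degrades on a particular stratum, for example buried sites with several mutations, rather than relying on a single aggregate score.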
