Rstoolbox - a Python library for large-scale analysis of computational protein design data and structural bioinformatics

Jaume Bonet, Fabian Sesterhenn, Andreas Scheck, Zander Harteveld, Bruno E Correia

doi:10.1186/s12859-019-2796-3

Abstract

Background: Large-scale datasets of protein structures and sequences are becoming ubiquitous in many domains of biological research. Experimental approaches and computational modelling methods are generating biological data at an unprecedented rate. The detailed analysis of structure-sequence relationships is critical to unveil the governing principles of protein folding, stability and function. Computational protein design (CPD) has emerged as an important structure-based approach to engineer proteins for novel functions. Generally, CPD workflows rely on the generation of large numbers of structural models to search for the optimal structure-sequence configurations. As such, an important step of the CPD process is the selection of a small subset of sequences to be experimentally characterized. Given the limitations of current CPD scoring functions, multi-step design protocols and elaborate analysis of the decoy populations have become essential for the selection of sequences for experimental characterization and for the success of CPD strategies.

Results: Here, we present rstoolbox, a Python library for the analysis of large-scale structural data tailored for CPD applications. rstoolbox is oriented towards both CPD software users and developers, and is easily integrated into analysis workflows. For users, it offers the ability to profile and select decoy sets, either to guide multi-step design protocols or for follow-up experimental characterization. rstoolbox provides intuitive solutions for the visualization of large sequence/structure datasets (e.g. logo plots and heatmaps) and facilitates the analysis of experimental data obtained through traditional biochemical techniques (e.g. circular dichroism and surface plasmon resonance) and high-throughput sequencing. For CPD software developers, it provides a framework to easily benchmark and compare different CPD approaches. Here, we showcase rstoolbox in both types of applications.

Conclusions: rstoolbox is a library for the evaluation of protein structure datasets tailored for CPD data. It provides interactive access through seamless integration with IPython, while still being suitable for high-performance computing. In addition to its functionalities for data analysis and graphical representation, the inclusion of rstoolbox in protein design pipelines will make it easy to standardize the selection of design candidates and will improve the overall reproducibility and robustness of CPD selection processes.

Highlights

Large-scale datasets of protein structures and sequences are becoming ubiquitous in many domains of biological research.
The fast-increasing amounts of biomolecular structural data are enabling an unprecedented level of analysis to unveil the principles that govern structure-function relationships in biological macromolecules. This wealth of structural data has catalysed the development of computational protein design (CPD) methods, which have become popular tools for the structure-based design of proteins with novel functions and optimized properties [1].
The OSPREY design suite, which combines Dead-End Elimination theorems with A* search (DEE/A*) [4], is one of the most widely used software packages relying on this approach.

Summary

Introduction

Large-scale datasets of protein structures and sequences are becoming ubiquitous in many domains of biological research. Computational protein design (CPD) has emerged as an important structure-based approach to engineer proteins for novel functions. The fast-increasing amounts of biomolecular structural data are enabling an unprecedented level of analysis to unveil the principles that govern structure-function relationships in biological macromolecules. This wealth of structural data has catalysed the development of CPD methods, which have become popular tools for the structure-based design of proteins with novel functions and optimized properties [1]. Deterministic algorithms provide a sorted, continuous list of results. This means that, according to their energy function, one will find the best possible solution for a design problem. Despite notable successes [7,8,9], the time requirements of deterministic design algorithms when working with large proteins or de novo design approaches limit their applicability, prompting the need for alternative approaches to CPD.
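To make the notion of a sorted, continuous list of results concrete, the following is a minimal sketch of a deterministic search on a toy problem. It uses brute-force enumeration rather than DEE/A*, which reaches the same global minimum while pruning the search space. The residue alphabet, the energy values and the function names (`energy`, `enumerate_designs`) are hypothetical and chosen only for illustration; they do not come from OSPREY, rstoolbox or any real scoring function.

```python
from itertools import product

# Hypothetical toy alphabet and per-residue energies (illustration only).
RESIDUES = "AVLK"
SELF_ENERGY = {"A": -0.5, "V": -1.0, "L": -1.2, "K": 0.3}


def pair_energy(a, b):
    """Toy pairwise term: penalize identical neighbouring residues."""
    return 0.8 if a == b else -0.1


def energy(seq):
    """Sum of single-residue terms and nearest-neighbour pair terms."""
    single = sum(SELF_ENERGY[r] for r in seq)
    pairs = sum(pair_energy(a, b) for a, b in zip(seq, seq[1:]))
    return single + pairs


def enumerate_designs(n_positions):
    """Score every possible sequence and return (sequence, energy) pairs
    sorted from lowest to highest energy."""
    scored = [
        ("".join(seq), energy(seq))
        for seq in product(RESIDUES, repeat=n_positions)
    ]
    return sorted(scored, key=lambda item: item[1])


ranked = enumerate_designs(3)
best_sequence, best_energy = ranked[0]
```

Because every sequence is scored, the first entry of `ranked` is guaranteed to be the best solution under this energy function, and no sequence is missing from the list. The cost is the size of the search space: 4^3 = 64 sequences here, but 20^n for n designable positions with all 20 amino acids, which is why exhaustive approaches become impractical for large proteins.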

Results

Discussion

Conclusion