Abstract

Intrinsically disordered proteins (IDPs) constitute a significant fraction of the proteins that exist and act in the cells of living organisms. IDPs play key roles in central cellular processes, and some of them are closely related to human diseases such as cancer and neurodegenerative disorders. Identifying IDPs and studying their structural characteristics have become an important part of structural bioinformatics and structural genomics. However, the growing number of genomic and protein sequences in public repositories puts pressure on existing methods for identifying IDPs. Large volumes of protein amino acid sequences need to be analyzed for their propensity to form disordered regions, and this task requires novel tools and scalable platforms to cope with this big biological data challenge. In this paper, we show how the identification of disordered regions of 3D protein structures can be efficiently accelerated with an Apache Spark cluster established and scaled on a public cloud. For this purpose, we propose a Spark-based meta-predictor (Spark-IDPP), which enables efficient prediction of disordered regions of proteins on a large scale. Our performance tests show that, for large data sets, the method achieves almost linear speedup when the computations are scaled out on a 32-node Spark cluster located in the Azure cloud. This shows that appropriate partitioning of data and a higher degree of parallelism can significantly improve the efficiency of IDP predictions. Additionally, by using several basic predictors, aggregating their ranks in various consensus modes, and filtering the final outcome with a dedicated fuzzy filter, Spark-IDPP increases the quality of predictions.
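
To make the idea of a consensus over several basic predictors more concrete, the following is a minimal sketch of one possible consensus mode: a per-residue vote with a configurable threshold. It is not the Spark-IDPP implementation. The predictor names, the 0/1 label encoding, and the `min_votes` parameter are illustrative assumptions, a simple vote count stands in for the rank aggregation mentioned in the abstract, and the fuzzy filter is omitted.

```python
from typing import Dict, List


def consensus(predictions: Dict[str, List[int]], min_votes: int) -> List[int]:
    """Combine per-residue binary disorder calls (1 = disordered, 0 = ordered).

    A residue is called disordered when at least `min_votes` predictors agree.
    This voting rule is an assumed stand-in, not the paper's aggregation scheme.
    """
    tracks = list(predictions.values())
    if not tracks:
        return []
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all predictors must cover the same sequence length")
    # Sum the votes position by position, then apply the threshold.
    return [1 if sum(column) >= min_votes else 0 for column in zip(*tracks)]


# Hypothetical outputs of three base predictors for a 6-residue sequence.
example = {
    "predictor_a": [1, 1, 0, 0, 1, 0],
    "predictor_b": [1, 0, 0, 1, 1, 0],
    "predictor_c": [0, 1, 0, 0, 1, 1],
}
majority = consensus(example, min_votes=2)   # at least two of three agree
unanimous = consensus(example, min_votes=3)  # all three must agree
```

Because each sequence is scored independently of the others, a function of this kind could in principle be applied record by record in a distributed map, which is the sort of data parallelism the abstract relies on.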

Highlights

  • International efforts focused on understanding living organisms at various levels of molecular organization, including the genomic, proteomic, metabolomic, and cell signaling levels, have led to a huge proliferation of biological data collected in dedicated, and frequently public, repositories

  • The Spark cluster used for most of the experiments was established on the Microsoft Azure public cloud as an HDInsight service (HDI 3.4) hosted on D13v2-sized virtual machines (VMs) running the Linux operating system

  • Prediction of disordered regions for protein amino acid sequences has become an important branch of 3D protein structure prediction and modeling

Introduction

International efforts focused on understanding living organisms at various levels of molecular organization, including the genomic, proteomic, metabolomic, and cell signaling levels, have led to a huge proliferation of biological data collected in dedicated, and frequently public, repositories. Since deep insight into 3D protein structures is key to understanding the molecular mechanisms of many diseases of civilization and to producing effective drugs, structural genomics tries to determine and describe the 3D structure of every protein that is encoded by a given sequenced genome. This is done by combining traditional experimental methods, like X-ray crystallography or Nuclear Magnetic Resonance (NMR), with computational modeling approaches that use various prediction methods for structure determination [18, 54, 76, 80]. Determining the 3D structures of intrinsically disordered proteins with these traditional methods is difficult, for example because of the lack of electron density in crystal structures (which is marked in PDB files describing protein macromolecules with REMARK 465 records). For this reason, IDP predictors have come to play an important role in the determination of unstructured regions. These tools should be able to scale the computational procedure in order to accommodate the growing volume of DNA and protein sequences.
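
As an illustration of how the missing-residue records mentioned above can be read, the following sketch collects residues listed in REMARK 465 lines of a PDB file. It is a simplified, whitespace-based reader written for this example, not part of the described method: the function name `missing_residues` is hypothetical, the parser assumes every data line carries a non-blank chain identifier, and it drops the optional model number and insertion code.

```python
from typing import List, Tuple


def missing_residues(pdb_text: str) -> List[Tuple[str, str, int]]:
    """Return (residue name, chain, sequence number) for REMARK 465 entries.

    Assumes whitespace-separated fields and a non-blank chain identifier;
    header and free-text REMARK 465 lines are skipped.
    """
    residues = []
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 465"):
            continue
        fields = line[10:].split()
        # Data lines have 3 fields, or 4 when a model number comes first.
        if len(fields) not in (3, 4):
            continue
        res_name, chain, seq = fields[-3:]
        # Drop a trailing insertion code such as the "A" in "27A".
        if seq and seq[-1].isalpha():
            seq = seq[:-1]
        try:
            number = int(seq)
        except ValueError:
            continue  # column-header line, not a residue entry
        residues.append((res_name, chain, number))
    return residues


# Small hand-written fragment in PDB style, used only for illustration.
sample = "\n".join([
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     GLY A     2",
    "ATOM      1  N   SER A   3      11.104  13.207   2.100  1.00 20.00           N",
])
found = missing_residues(sample)
```

Residues collected this way are commonly treated as experimentally unobserved and can serve as reference labels when disorder predictions are evaluated; a production parser should follow the fixed-column layout of the PDB format rather than splitting on whitespace.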

