Abstract

Reducing the amino acid alphabet can significantly simplify protein complexity, which can improve computational efficiency, decrease information redundancy and reduce the chance of overfitting. Although some reduced alphabets have been proposed, different classification rules can produce distinct results in protein sequence analysis. A systematic framework for reduced alphabets is therefore urgently needed. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning applications by integrating reduced alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for dealing with specific protein problems. Users can easily select the desired RAACs from a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of a protein. The tool produces the K-tuple reduced amino acid composition from three user-defined correlation parameters (K-tuple, g-gap, λ-correlation). The results are visualized as a sequence alignment, a merged RAA composition, a feature distribution and a logo of the reduced sequence. (iii) A machine learning server is provided to train protein classification models based on the K-tuple RAAC. The optimal model can be selected according to evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook offers a powerful and user-friendly service for protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook.

Database URL: http://bioinfor.imu.edu.cn/raacbook
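
To make the idea of a K-tuple reduced amino acid composition concrete, the following is a minimal sketch, not the RAACBook implementation. The six-cluster scheme in `CLUSTERS` is a hypothetical grouping chosen only for illustration and is not claimed to be one of the 673 RAACs; the function names are likewise assumptions. The g-gap and λ-correlation parameters are omitted because the abstract does not define them, so only contiguous K-tuples are counted.

```python
from collections import Counter
from itertools import product

# Hypothetical reduction scheme (an assumption for illustration only,
# not taken from RAACBook): each string is one cluster of residues.
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]

# Map every residue to the first letter of its cluster, which serves as
# the cluster's representative symbol in the reduced sequence.
MAPPING = {aa: cluster[0] for cluster in CLUSTERS for aa in cluster}
REPRESENTATIVES = sorted(cluster[0] for cluster in CLUSTERS)


def reduce_sequence(seq, mapping=MAPPING):
    """Rewrite a protein sequence in the reduced alphabet.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(mapping[aa] for aa in seq.upper())


def ktuple_composition(reduced, letters=REPRESENTATIVES, k=2):
    """Return the frequency of every contiguous K-tuple over `letters`.

    The result has len(letters) ** k entries, so a smaller alphabet
    gives a much shorter feature vector than the full 20-letter one.
    """
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values())
    keys = ["".join(p) for p in product(letters, repeat=k)]
    return {key: (counts[key] / total if total else 0.0) for key in keys}


if __name__ == "__main__":
    reduced = reduce_sequence("MKVLAD")
    features = ktuple_composition(reduced, k=2)
    print(reduced, len(features))
```

With six clusters and K = 2 the feature vector has 36 entries, compared with 400 for dipeptide composition over the full alphabet, which illustrates the reduction in dimensionality that the abstract refers to.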

Highlights

  • With the development of various biotechnologies, the number of protein sequences is growing at a rapid pace

  • RAACBook is an online repository of reduced amino acid alphabets

  • The reduced amino acid clusters (RAACs) database provides a comprehensive resource of reduced amino acid alphabets

Introduction

With the development of various biotechnologies, the number of protein sequences is growing at a rapid pace. However, the three-dimensional structures and functions of most proteins are still undetermined, and the gaps between structure data, function data and protein sequences are widening quickly. Although X-ray crystallography is a powerful tool for determining these structures, it is time-consuming and expensive, and not all proteins can be successfully crystallized. Few membrane protein structures have been determined. NMR is a very powerful tool for determining the 3D structures of membrane proteins [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21], but it is also time-consuming and costly. Efficient computational methods based on sequence information are therefore urgently needed to identify biological features in primary protein sequences rapidly and accurately.
