Abstract
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of bacteriophage virion proteins is the main area of interest, so it is important to distinguish bacteriophage virion proteins from non-phage virion proteins accurately. Extracting comprehensive and effective sequence features plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, that is, the set with the strongest correlation to the class labels and low redundancy among features. Because the random forest classification algorithm randomizes both the samples it selects and the features considered at each node, a random forest method is employed to classify bacteriophage proteins and is evaluated with 10-fold cross-validation. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
Highlights
In the biological world, bacteriophages are ubiquitous, with different genomes and lifestyles
Our results show that, among the eight physicochemical properties of amino acids, the charge property has the greatest influence on the classification of bacteriophage proteins
Using the feature extraction methods described in the Feature extraction section, we extracted 188-dimensional and 400-dimensional feature sets based on sequence information, and a 473-dimensional feature set based on sequence and secondary structure information, to represent the entire bacteriophage protein sequence dataset
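As an illustration of how a 400-dimensional sequence feature can arise, the sketch below computes dipeptide composition: the frequency of each of the 20 x 20 = 400 ordered amino acid pairs in a sequence. This is an assumption made for illustration only. The text here does not say how the 400-dimensional set was built, and the function name `dipeptide_composition` and the example sequence are hypothetical.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered pairs, in a fixed order so that every protein maps to
# a vector with the same layout.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide frequencies.

    Assumption: this is one common way to obtain a 400-dimensional
    sequence feature; it is not confirmed to be the method used in the
    study. Pairs containing non-standard residues are skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short, or no valid pairs: return an all-zero vector.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical example sequence, not taken from the study's dataset.
features = dipeptide_composition("MKTAYIAKQR")
print(len(features))
```

Because every sequence is mapped onto the same 400 positions, proteins of different lengths become directly comparable vectors that a feature selector or classifier can consume.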
Summary
Bacteriophages are ubiquitous, with different genomes and lifestyles. Faced with a large volume of data, traditional biological experimental methods could no longer keep pace in the post-genomic era (Chen W. et al., 2016; Cheng et al., 2019; Mrozek et al., 2016; Hu et al., 2018). For this reason, researchers introduced different machine learning algorithms into bacteriophage classification and prediction research. The random forest algorithm (Breiman, 2001; Yao et al., 2017) combines multiple weak classifiers to produce a final result that has higher accuracy and better generalization performance than any single one of them.
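The idea of combining weak classifiers can be sketched in a few lines. The example below is a deliberately simplified stand-in for a random forest, not the model used in the study: each "tree" is a one-split decision stump trained on a bootstrap sample and a random subset of features, and the ensemble predicts by majority vote. All function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature_indices):
    """Pick the (feature, threshold, labels) split with the fewest training errors."""
    best = None
    for j in feature_indices:
        for threshold in sorted({row[j] for row in X}):
            for left_label, right_label in ((0, 1), (1, 0)):
                errors = sum(
                    (left_label if row[j] <= threshold else right_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_label, right_label = stump
    return left_label if row[j] <= threshold else right_label


def fit_ensemble(X, y, n_estimators=25, n_sub_features=1, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap: draw n_samples rows with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Random feature subset for this stump's single split.
        feats = rng.sample(range(n_features), n_sub_features)
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return stumps


def predict_ensemble(stumps, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]


# Toy two-feature data (illustrative only): class 1 has larger values.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.25],
     [0.8, 0.9], [0.9, 0.8], [0.7, 0.7], [0.85, 0.75]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

stumps = fit_ensemble(X, y)
print(predict_ensemble(stumps, [0.05, 0.05]), predict_ensemble(stumps, [0.95, 0.95]))
```

A single stump trained on an unlucky bootstrap sample can be wrong, but the vote across many independently randomized stumps is far more stable, which is the intuition behind the improved generalization the paragraph describes. A full random forest grows deeper trees and resamples the candidate features at every node.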