Abstract
Advances in high-throughput techniques lead to evolving a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and drug repositioning. Protein subcellular localization is imperative for the functional characterization of protein sequences. Diverse techniques are used on protein sequences for feature extraction. However, many times a single feature extraction technique leads to poor prediction performance. In this paper, two feature augmentations are described through sequence induced, physicochemical, and evolutionary information of the amino acid residues. While augmented features preserve the sequence-order-information and protein-residue-properties. Two bacterial protein datasets Gram-Positive (G +) and Gram-Negative (G-) are utilized for the experimental work. After performing essential preprocessing on protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train the different individual and ensembles such as decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF) with fivefold cross-validation. Prediction results of the model demonstrate that overall accuracy reported by C4.5 is highest 99.57% on G + and 97.47% on G- datasets with known protein sequences. Similarly, for theUPS overall accuracy of G + is 85.17% with SVM and 82.45% with G- dataset using MLP.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.