Abstract

Background

Bacteria infect their hosts by delivering effector proteins through dedicated machinery known as secretion systems. To date, no method for identifying effector proteins from their 3D structure has been reported in the literature. Identifying effector proteins in this way requires extracting features from their 3D structure. However, effector protein datasets are highly imbalanced, and state-of-the-art oversampling algorithms cannot handle such datasets: they usually eliminate some samples as noise, and they do not ensure that synthetic samples are generated strictly in the vicinity of the minority class samples. In effector protein datasets, deleting any sample as noise would lose crucial information. Furthermore, generating synthetic minority class samples in the vicinity of majority class samples would degrade the classifier.

Method

In this paper, we introduce an algorithm called the Cluster Quality based Non-Reductional (CQNR) oversampling technique. Its novelty lies in generating new samples in proportion to the distribution of samples of the minority classes, without eliminating any sample as noise. Using CQNR, we develop a novel Effector Protein Predictor based on the 3D structure of proteins (EPP3D). EPP3D is trained on a feature set, balanced by CQNR, comprising features derived from the 3D structure of experimentally verified effector proteins, namely convex hull layer count, surface atom composition, radius of gyration, packing density and compactness.

Result

F-score and G-mean show that CQNR outperforms some well-established oversampling methods by approximately 3–5% in classification accuracy on five benchmark datasets and three other highly imbalanced, synthetically generated datasets. Likewise, for the classification of pathogenic effector proteins, applying CQNR followed by EPP3D yields a significant improvement of 7–9% in accuracy. Moreover, EPP3D shows an improvement of 2–4% when classifying effector proteins based on their 3D structure, compared with classification based on their amino acid sequences. The software for CQNR and EPP3D is available at http://projectphd.droppages.com/CQNR.html.
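
As an illustration of the two evaluation measures named above, the following Python sketch computes F-score and G-mean from the counts of a binary confusion matrix, with the minority (effector) class treated as positive. It uses the standard textbook definitions (F-score as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity). The function name and the example counts are hypothetical and are not taken from the CQNR or EPP3D software.

```python
from math import sqrt


def f_score_and_g_mean(tp, fp, fn, tn):
    """Return (F-score, G-mean) for the positive (minority) class.

    tp, fp, fn, tn are the confusion-matrix counts. Ratios with a zero
    denominator are defined as 0.0 so the function never divides by zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    g_mean = sqrt(recall * specificity)
    return f_score, g_mean


# Hypothetical, illustrative counts: 10 minority and 90 majority samples.
f, g = f_score_and_g_mean(tp=8, fp=4, fn=2, tn=86)
print(f"F-score = {f:.3f}, G-mean = {g:.3f}")
```

With these counts, plain accuracy is 94/100 although two of the ten minority samples are missed. Both measures depend on minority class recall, which is why they are commonly preferred over accuracy on imbalanced data.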
