Intelligent computational model for classification of sub-Golgi protein using oversampling and fisher feature selection methods.

Jamal Ahmad,Maqsood Hayat,Faisal Javed

doi:10.1016/j.artmed.2017.05.001

Abstract

Golgi is one of the core proteins of a cell, constitutes in both plants and animals, which is involved in protein synthesis. Golgi is responsible for receiving and processing the macromolecules and trafficking of newly processed protein to its intended destination. Dysfunction in Golgi protein is expected to cause many neurodegenerative and inherited diseases that may be cured well if they are detected effectively and timely. Golgi protein is categorized into two parts cis-Golgi and trans-Golgi. The identification of Golgi protein via direct method is very hard due to limited available recognized structures. Therefore, the researchers divert their attention toward the sequences from structures. However, owing to technological advancement, exploration of huge amount of sequences was reported in the databases. So recognition of large amount of unprocessed data using conventional methods is very difficult. Therefore, the concept of intelligence was incorporated with computational model. Intelligence based computational model obtained reasonable results, but the gap of improvement is still under consideration. In this regard, an intelligent automatic recognition model is developed in order to enhance the true classification rate of sub-Golgi proteins. In this approach, discrete and evolutionary feature extraction methods are applied on the benchmark Golgi protein datasets to excerpt salient, propound and variant numerical descriptors. After that, an oversampling technique Syntactic Minority over Sampling Technique is employed to balance the data. Hybrid spaces are also generated with combination of these feature spaces. Further, Fisher feature selection method is utilized to reduce the extra noisy and redundant features from feature vector. Finally, k-nearest neighbor algorithm is used as learning hypothesis. Three distinct cross validation tests are used to examine the stability and efficiency of the proposed model. The predicted outcomes of proposed model are better than the existing models in the literature so far. Finally, it is anticipated that the proposed model will provide the foundation to pharmaceutical industry in drug design and research community to innovate new ideas in the area of computational biology and bioinformatics.

Full Text