A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification

Abdulaziz Yousef, Nasrollah Moghadam Charkari

doi:10.1016/j.jbi.2015.06.018

Open Access

https://doi.org/10.1016/j.jbi.2015.06.018

Journal: Journal of Biomedical Informatics	Publication Date: Jul 2, 2015
Citations: 58	License type: publisher-specific-oa

Affiliation: Tarbiat Modares University

Abstract

Identifying the genes that cause disease is one of the most challenging problems in establishing a diagnosis and treatment quickly. Over the past decade, several interesting methods have been introduced for disease gene identification. In general, these methods differ mainly in the type of data used as prior knowledge and in the machine learning (ML) methods used for identification. ML methods have commonly treated disease gene identification as a binary classification problem (whether or not a gene is a disease gene). However, the nature of the data creates a major problem that affects the results: no negative data are available for training the learners. In this paper, a sequence-based, one-class classification method is introduced to assign genes to the disease class (yes or no). First, to generate the feature vector, the protein (gene) sequences are transformed into numerical vectors using the physicochemical properties of amino acids. Second, as there is no definite approach for defining non-disease genes (negative data), we model solely the disease genes (positive data) and make predictions by employing the Support Vector Data Description algorithm. The experimental results confirm the efficiency of the method, with precision, recall and F-measure of 79.3%, 82.6% and 80.9%, respectively.
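The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption: the abstract does not say which physicochemical properties or descriptors are used, so the sketch takes the Kyte-Doolittle hydropathy scale and three hypothetical summary features. In place of Support Vector Data Description, which solves an optimisation problem for a minimum enclosing sphere, it uses a much cruder stand-in: a sphere centred on the mean of the positive examples. The sequences are toy strings, not real proteins. The final function checks that the reported F-measure is the harmonic mean of the reported precision and recall.

```python
import math
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy index: one physicochemical scale among many.
# Choosing it here is an assumption; the abstract does not name the properties.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = set("DEKR")


def encode(seq):
    """Map a protein sequence to a fixed-length numeric vector.

    Hypothetical features: mean hydropathy, hydropathy spread,
    and the fraction of charged residues.
    """
    values = [HYDROPATHY[aa] for aa in seq]
    charged_fraction = sum(aa in CHARGED for aa in seq) / len(seq)
    return [mean(values), pstdev(values), charged_fraction]


class CentroidSphere:
    """Crude one-class model: a sphere around the mean of the positives.

    A simplified stand-in for SVDD, not SVDD itself. The radius is the
    distance that encloses a `keep` fraction of the training vectors.
    """

    def fit(self, vectors, keep=0.9):
        dims = len(vectors[0])
        self.center = [mean(v[i] for v in vectors) for i in range(dims)]
        dists = sorted(math.dist(v, self.center) for v in vectors)
        self.radius = dists[max(0, math.ceil(keep * len(dists)) - 1)]
        return self

    def predict(self, vector):
        """Return True if the vector falls inside the sphere."""
        return math.dist(vector, self.center) <= self.radius


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy positive examples only (made-up strings); no negatives are needed to fit.
toy_positives = ["MKTLLVAGAV", "MALWIRLLPL", "MGLSVKAAIF", "MEVLAGKLFT"]
model = CentroidSphere().fit([encode(s) for s in toy_positives])

print(model.predict(encode("MKTLLVAGAV")))  # a training sequence
print(model.predict(encode("DDDDDDDDDD")))  # a very different sequence
print(round(f_measure(0.793, 0.826), 3))    # F-measure from the reported P and R
```

The point of the design is that fitting uses positive examples only; a new gene is then accepted or rejected according to whether its feature vector falls inside the learned boundary. A real SVDD would additionally allow kernels and slack variables to obtain a tighter, more flexible boundary.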

Full Text