Abstract

Identifying the genes that cause disease is one of the most challenging problems in establishing a rapid diagnosis and treatment. Several interesting methods for disease gene identification have been introduced over the past decade. In general, these methods differ mainly in the type of data used as prior knowledge and in the machine learning (ML) methods used for identification. ML methods have commonly treated disease gene identification as a binary classification problem (whether a given gene is a disease gene or not). However, the nature of the data creates a major problem that affects the results: no negative data are available for training the learners. In this paper, a sequence-based, one-class classification method is introduced to assign genes to the disease class (yes or no). First, to generate the feature vector, the protein (gene) sequences are transformed into numerical vectors using the physicochemical properties of amino acids. Second, as there is no definitive way to define non-disease genes (negative data), we model only the disease genes (positive data) and make predictions with the Support Vector Data Description algorithm. The experimental results confirm the efficiency of the method, with precision, recall, and F-measure of 79.3%, 82.6%, and 80.9%, respectively.
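
The pipeline described above can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: the three features (mean Kyte-Doolittle hydropathy, fraction of charged residues, fraction of aromatic residues) are hypothetical stand-ins for the physicochemical encoding, the toy sequences are invented for illustration, and the centroid-plus-radius sphere is only a crude substitute for Support Vector Data Description, which would instead solve an optimization problem for the enclosing boundary. The final helper shows how an F-measure follows from precision and recall.

```python
import math

# Kyte-Doolittle hydropathy index per amino acid (one physicochemical property).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = set("DEKR")
AROMATIC = set("FWY")


def encode(sequence):
    """Map a protein sequence to a small numerical feature vector.

    The feature choice is an assumption made for this sketch, not the
    encoding used in the paper.
    """
    residues = [aa for aa in sequence.upper() if aa in KD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    return [
        sum(KD[aa] for aa in residues) / n,         # mean hydropathy
        sum(aa in CHARGED for aa in residues) / n,  # fraction charged
        sum(aa in AROMATIC for aa in residues) / n, # fraction aromatic
    ]


def fit_sphere(vectors):
    """Describe positive-only data by a sphere: centroid and largest distance.

    This is a simplified stand-in for SVDD, not SVDD itself.
    """
    dim = len(vectors[0])
    center = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    radius = max(math.dist(center, v) for v in vectors)
    return center, radius


def is_inside(center, radius, vector):
    """Predict 'disease class' when the vector falls inside the sphere."""
    return math.dist(center, vector) <= radius


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy positive examples, invented for illustration only.
    positives = ["MKTLLVAGFW", "MKSLIVAGYW", "MRTLLIAGFY"]
    center, radius = fit_sphere([encode(s) for s in positives])
    print(is_inside(center, radius, encode("MKTLLIAGFW")))
    print(round(f_measure(0.793, 0.826), 3))  # 0.809, matching the reported 80.9%
```

Training on positives alone means the boundary is set by the disease genes themselves, so no assumption about which genes are truly negative is needed.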
