Prediction of Cancer Proteins by Integrating Protein Interaction, Domain Frequency, and Domain Interaction Data Using Machine Learning Algorithms

Chien-Hung Huang, Huai-Shun Peng, Ka-Lok Ng

doi:10.1155/2015/312047

Open Access

Abstract

Many proteins are known to be associated with cancer, yet their precise functional role in disease pathogenesis often remains unclear. One strategy for better understanding the function of these proteins is to combine different types of proteomics data. In this study, we extended Aragues's method by employing protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency scores (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performance was benchmarked in three kinds of experiments: (I) using individual algorithms, (II) combining algorithms, and (III) combining algorithms of the same classification type. Compared with Aragues's method, our proposed methods, that is, machine learning algorithms and majority voting, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm achieves hit ratios of 89.4% and 72.8% for the lung cancer dataset and the lung cancer microarray study, respectively. We anticipate that this research could help in understanding disease mechanisms and in diagnosis.
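
As an illustration of the majority-voting idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. The function names, the 0/1 label encoding (1 = cancer, 0 = noncancer), and the rule that ties fall to the noncancer class are all assumptions made for this example.

```python
def majority_vote(votes):
    """Combine binary votes (1 = cancer, 0 = noncancer) from several classifiers.

    Assumption: a tie is resolved in favour of the noncancer class (0).
    """
    return 1 if sum(votes) * 2 > len(votes) else 0


def vote_all(per_classifier_predictions):
    """Apply majority voting protein by protein.

    per_classifier_predictions holds one list of 0/1 labels per classifier,
    all in the same protein order.
    """
    # zip(*...) regroups the labels so that each tuple holds all votes for one protein
    return [majority_vote(v) for v in zip(*per_classifier_predictions)]


# Hypothetical predictions from three classifiers for three proteins
predictions = [
    [1, 0, 1],  # classifier A
    [1, 1, 0],  # classifier B
    [0, 0, 1],  # classifier C
]
combined = vote_all(predictions)
```

With an odd number of classifiers no tie can occur, which is one reason such ensembles are often built from three or five members.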

Highlights

It has been known for a long time that cancer is a result of loss of cell cycle control.
Given the cancerous protein-protein interactions (PPI), a set of domain-domain interaction (DDI) rules for cancer proteins is derived. In addition to this set of DDI rules, we considered other features: the weighted domain frequency scores (DFS), namely DFS_C for cancer proteins and DFS_X for noncancer proteins, and the cancer linker degree (CLD) score.
The weighted DFS is adopted to measure the propensity of domain occurrence in cancer and noncancer proteins.
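
To make the notion of domain propensity concrete, here is a minimal sketch of a frequency-based domain score. The weighting used in the paper is not given in this excerpt, so the plain relative frequency below, the helper names, and the toy protein and domain identifiers (P1, D1, and so on) are assumptions for illustration only.

```python
from collections import Counter


def domain_frequencies(protein_domains):
    """Fraction of proteins in a set that contain each domain.

    protein_domains maps a protein identifier to a list of its domains.
    """
    counts = Counter()
    for domains in protein_domains.values():
        counts.update(set(domains))  # count each domain once per protein
    n = len(protein_domains)
    return {domain: count / n for domain, count in counts.items()}


def protein_score(domains, freq):
    """Mean frequency of a protein's domains; unseen domains contribute 0."""
    domains = set(domains)
    if not domains:
        return 0.0
    return sum(freq.get(d, 0.0) for d in domains) / len(domains)


# Hypothetical toy data: two cancer proteins and two noncancer proteins
cancer = {"P1": ["D1", "D2"], "P2": ["D1"]}
noncancer = {"P3": ["D3"], "P4": ["D2", "D3"]}

freq_c = domain_frequencies(cancer)     # stands in for a DFS_C-like table
freq_x = domain_frequencies(noncancer)  # stands in for a DFS_X-like table

query = ["D1", "D2"]
score_c = protein_score(query, freq_c)
score_x = protein_score(query, freq_x)
```

In this toy case the query protein scores higher against the cancer set than against the noncancer set, which is the kind of contrast such a pair of scores is meant to expose as classifier features.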

Summary

Introduction

It has been known for a long time that cancer is a result of loss of cell cycle control. This loss of control results from a series of genetic mutations involving the activation of proto-oncogenes to oncogenes and the inactivation of tumor suppressor genes. Oncogenes and tumor suppressors may cause cancer by altering transcription factors, such as the p53 and ras oncoproteins, which in turn control the expression of other genes. Predicting novel cancer-related proteins is an important topic in biomedical research, and experimental techniques such as microarrays are being used to characterize cancer. Nagaraj and Reverter [1] proposed a Boolean logic based approach to predict colorectal cancer genes. Li et al. [2] took GO enrichment scores and KEGG enrichment scores as features to predict retinoblastoma-related genes. We extended Aragues’s study by employing PPI data and domain information to attain improved performance.

Results

Discussion

Conclusion