Abstract

Disease classification based on machine learning has become a crucial research topic in genetics and molecular biology. Disease classification is generally framed as supervised learning; i.e., it requires a large number of labelled samples to achieve good classification performance. In most cases, however, labelled samples are hard to obtain, so the amount of training data is limited. At the same time, many unclassified (unlabelled) sequences have been deposited in public databases and may help the training procedure. Learning from labelled and unlabelled samples together is called semi-supervised learning and is useful in many applications. Self-training can be implemented using samples ordered from high to low confidence, which prevents noisy samples from affecting the robustness of semi-supervised learning during training. The deep forest method with the hyperparameter settings used in this paper can achieve excellent performance. Therefore, in this work, we propose a novel approach that combines a deep learning model with semi-supervised self-training to improve performance in disease classification. The approach uses unlabelled samples together with an update mechanism designed to increase the number of high-confidence pseudo-labelled samples. The experimental results show that our proposed model can achieve good performance in disease classification and disease-causing gene identification.
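The self-training idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. A simple nearest-centroid classifier stands in for the deep forest model, the data are synthetic two-dimensional points, and the names `self_train`, `threshold` and `max_rounds`, as well as the fixed confidence cutoff of 0.9, are assumptions made purely for illustration.

```python
import math
import random


def centroids(X, y):
    """Return the mean feature vector of each class (the stand-in 'model')."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_proba(cents, x):
    """Softmax over negative distances to the centroids: class -> confidence."""
    scores = {c: -math.dist(x, m) for c, m in cents.items()}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}


def self_train(X_lab, y_lab, X_unl, threshold=0.9, max_rounds=10):
    """Repeatedly move confident pseudo-labelled samples into the training set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_rounds):
        cents = centroids(X_lab, y_lab)
        remaining, added = [], 0
        for x in pool:
            proba = predict_proba(cents, x)
            label, conf = max(proba.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                # High-confidence prediction: accept it as a pseudo-label.
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                # Low-confidence sample: leave it unlabelled for a later round.
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return centroids(X_lab, y_lab), len(pool)


def blob(cx, cy, n):
    """Draw n synthetic 2-D points around (cx, cy)."""
    return [(random.gauss(cx, 1.0), random.gauss(cy, 1.0)) for _ in range(n)]


random.seed(0)
X_lab = blob(0, 0, 5) + blob(6, 6, 5)      # few labelled samples
y_lab = [0] * 5 + [1] * 5
X_unl = blob(0, 0, 50) + blob(6, 6, 50)    # many unlabelled samples

model, n_left = self_train(X_lab, y_lab, X_unl)
print("unlabelled samples left below the threshold:", n_left)
```

Samples whose confidence stays below the cutoff are never pseudo-labelled, which is one simple way to keep noisy samples out of the training set. Any classifier that reports per-class confidence could replace the nearest-centroid stand-in.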

Highlights

  • Bioinformatics technologies have provided efficient ways to diagnose diseases, and machine learning methods applied in bioinformatics have achieved remarkable breakthroughs in the field of disease diagnosis [1]

  • Four common methods are used for comparison to assess the performance of our approach: deep neural networks (DNNs), logistic regression (LR), support vector machine (SVM) and random forest (RF)

  • We propose an approach that combines deep forest with semi-supervised learning based on self-training

Summary

Introduction

Bioinformatics technologies have provided efficient ways to diagnose diseases, and machine learning methods applied in bioinformatics have achieved remarkable breakthroughs in the field of disease diagnosis [1]. Disease classification based on gene expression levels can efficiently distinguish disease-causing genes, so it has become an effective method for disease diagnosis and for assessing gene expression levels under different conditions [2,3,4]. The combination of data preprocessing and machine learning is an essential approach that improves the performance of many computer-aided diagnosis applications [5,6], for example when log-count normalized data are used in linear modelling [7]. A multiple feature evaluation approach (MFEA) of a multi-agent system has been proposed to improve the diagnosis of Parkinson’s disease [8]. Supervised discriminative sparse principal component analysis (SDSPCA) has been used to study the pathogenesis of diseases and for gene selection [10].
