Observing the Effect of the Choice of Classifier on Bioinformatics Data with Varying Levels of Data Quality and Class Balance

Alireza Fazelpour, David J Dittman, Ahmad Abu Shanab, Taghi M Khoshgoftaar

doi:10.1109/iri.2015.63

Abstract

Noise, meaning erroneous or missing data, is a prominent challenge in many bioinformatics datasets. The presence of noise in gene expression datasets has adverse effects on machine-learning techniques, such as supervised classification algorithms and feature selection techniques. Identifying and quantifying noise are themselves challenging tasks, and a proper mechanism for managing noise is needed to improve the performance of classifiers and feature selection methods. In this study, our motivation is to investigate the effects of class noise on the classification performance of various learners using multiple derived datasets with varying degrees of data quality and class imbalance. Class imbalance is another challenging characteristic; it occurs when one class has many more instances than the other class(es). To this end, we conducted experiments using a filter-based subset selection method applied to multiple derived datasets, generated by injecting artificial class noise in a controlled manner to create three levels of data quality: High-Quality, Average-Quality, and Low-Quality. Our results, along with statistical analysis, show that Random Forest outperforms the other learners, without exception, at all levels of class balance and data quality. Therefore, we recommend Random Forest as a noise-tolerant and robust classifier when dealing with bioinformatics datasets of varying quality.
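The abstract does not spell out how the class noise was injected, so the following is only a minimal sketch of one common approach: flipping the labels of a fixed fraction of randomly chosen instances in a binary dataset. The function names, the `noise_level` parameter, the uniform sampling scheme, and the three example noise levels are all assumptions made for illustration, not the authors' procedure.

```python
import random

def inject_class_noise(labels, noise_level, seed=0):
    """Flip a fixed fraction of binary labels (0/1), chosen uniformly at random.

    noise_level is the fraction of instances whose label is flipped.
    Both the parameter and the uniform sampling are assumptions of this sketch.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(noise_level * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]

def minority_fraction(labels):
    """Fraction of instances in the smaller class (0.5 means perfectly balanced)."""
    positives = sum(labels)
    return min(positives, len(labels) - positives) / len(labels)

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
clean = [0] * 90 + [1] * 10

# Hypothetical noise levels standing in for the three quality tiers.
for name, level in [("High-Quality", 0.05),
                    ("Average-Quality", 0.15),
                    ("Low-Quality", 0.30)]:
    noisy = inject_class_noise(clean, level, seed=42)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(name, "flipped:", changed,
          "minority fraction:", round(minority_fraction(noisy), 2))
```

Note that uniform flipping also shifts the class balance of an imbalanced dataset, since most flipped instances come from the majority class; a study that controls balance and quality separately would need to account for that.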

Similar Papers

Reinforcement learning-based cell selection in sparse mobile crowdsensing
Wenbin Liu ... Daqing Zhang
Computer Networks | VOL. 161
12 Jun 2019

Evaluation of quality control techniques utilized by soil testing laboratories
K Topper
Communications in Soil Science and Plant Analysis | VOL. 21
01 Aug 1990

Towards Process Patterns for Processing Data Having Various Qualities
Agung Wahyudi ... Marijn Janssen
01 Jan 2015

Is Data Sampling Required When Using Random Forest for Classification on Imbalanced Bioinformatics Data?
David J Dittman ... Amri Napolitano
01 Jan 2015
