Abstract

Noise, meaning erroneous or missing data, is a prominent challenge in many bioinformatics datasets. In gene expression datasets, noise adversely affects machine-learning techniques such as supervised classification algorithms and feature selection techniques. Identifying and quantifying noise is itself difficult, and a proper mechanism for managing it is needed to improve the performance of classifiers and feature selection methods. In this study, we investigate the effects of class noise on the classification performance of various learners using multiple derived datasets with varying degrees of data quality and class imbalance, where class imbalance occurs when one class has many more instances than the other class(es). To this end, we conducted experiments using a filter-based subset selection method applied to multiple derived datasets, generated by injecting artificial class noise in a controlled manner to create three levels of data quality: High-Quality, Average-Quality, and Low-Quality. Our results, along with statistical analysis, show that Random Forest outperforms the other learners, without exception, at all levels of balance and data quality. We therefore recommend Random Forest as a noise-tolerant and robust classifier for bioinformatics datasets of varying quality.
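The controlled injection of class noise described above can be illustrated with a minimal sketch. The abstract does not specify the injection procedure, so everything here is an assumption made for illustration: the function name `inject_class_noise`, the binary 0/1 labels, the uniform random choice of instances to flip, the 90/10 class split, and the noise rates attached to the three quality levels.

```python
import random


def inject_class_noise(labels, noise_rate, seed=0):
    """Return a copy of binary 0/1 `labels` with a fraction `noise_rate` flipped.

    Hypothetical helper: instances are chosen uniformly at random, which may
    differ from the procedure used in the study.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # local generator keeps the injection reproducible
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Hypothetical imbalanced label vector: 90 majority-class, 10 minority-class instances.
clean = [0] * 90 + [1] * 10

# Assumed noise rates per quality level; the abstract does not state the actual values.
levels = {"High-Quality": 0.0, "Average-Quality": 0.1, "Low-Quality": 0.3}

derived = {
    name: inject_class_noise(clean, rate, seed=42) for name, rate in levels.items()
}

for name, noisy in derived.items():
    flipped = sum(a != b for a, b in zip(clean, noisy))
    print(f"{name}: {flipped} of {len(clean)} labels flipped")
```

Fixing the seed makes each derived dataset reproducible, so any change in a learner's performance across quality levels can be attributed to the noise rate and not to a different random draw.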
