Abstract
Many classification problems in bioinformatics involve datasets characterized by class imbalance. This unequal class distribution can degrade classification performance on the minority class, producing a very high rate of false negatives, and the minority class is usually the class of interest. Although this challenge is prevalent in bioinformatics datasets, most practitioners and researchers have focused their efforts on a different problem, namely high dimensionality (too many independent variables). As a result, class imbalance has been almost completely neglected. In this work, we investigate the importance of alleviating class imbalance (by applying data sampling) for classification problems on bioinformatics datasets. To do so, we compare the classification performance obtained when data sampling is combined with feature selection against the performance obtained with feature selection alone. We employ six widely used classification algorithms and three major forms of feature selection. Our results show that classification models built with feature selection alone perform worse than those built when data sampling is combined with feature selection. Statistical analysis shows that the performance gain from adding data sampling to feature selection is significant. It is therefore essential to give special attention to class imbalance in bioinformatics, and this experiment shows why techniques such as data sampling should be applied to alleviate it.
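To make the idea of data sampling concrete, the following is a minimal sketch of one common form, random undersampling, which discards majority-class instances until the classes are balanced. The abstract does not say which sampling technique was used, so the choice of undersampling, the function name `random_undersample`, and the toy data are all assumptions made for illustration only.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Hypothetical helper: keep a random subset of each class so that
    every class has as many instances as the smallest (minority) class."""
    rng = random.Random(seed)

    # Group the instances by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # The minority class size sets the target count for every class.
    target = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        kept = rng.sample(members, target)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy imbalanced dataset (assumed): 18 negative and 2 positive instances.
rows = [[i, i * 2] for i in range(20)]
labels = [1 if i < 2 else 0 for i in range(20)]

balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(labels))
print(Counter(balanced_labels))
```

In a pipeline like the one described above, a step of this kind would be applied to the training data together with feature selection, before the classifier is built. The test data should keep its original class distribution so that the reported performance reflects the real imbalance.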