Abstract
Artificial sampling is one of the main approaches to dealing with imbalanced data. However, despite a vast amount of research on sampling techniques, little is known about how to choose the optimal sampling ratio, a choice that can significantly improve classification accuracy. In this paper, we attempt to fill this gap in the literature through both mathematical and numerical analysis. Concretely, we conduct a large-scale empirical study of the relationship between the sampling ratio and classification accuracy. In addition, we investigate the theoretical sampling ratio using a Bayesian approach and obtain an optimal ratio of 1/√e ≈ 0.6065, which is in line with the results of the numerical experiments. We find that while factors such as the original imbalance ratio or the number of features do not play a discernible role in determining the optimal ratio, the number of samples in the dataset may have a tangible effect. We hope that the insights revealed in this study will help researchers and practitioners select the optimal sampling ratio when dealing with imbalanced data.