Most statistical and machine learning models used for binary data modeling and classification assume that the data are balanced. However, this assumption can lead to poor predictive performance and bias in parameter estimation when there is an imbalance in the data due to the threshold election for the binary classification. To address this challenge, several authors suggest using asymmetric link functions in binary regression, instead of the traditional symmetric functions such as logit or probit, aiming to highlight characteristics that would help the classification task. Therefore, this study aims to introduce new classification functions based on the Lomax distribution (and its variations; including power and reverse versions). The proposed Bayesian functions have proven asymmetry and were implemented in a Stan program into the R workflow. Additionally, these functions showed promising results in real-world data applications, outperforming classical link functions in terms of metrics. For instance, in the first example, comparing the reverse power double Lomax (RPDLomax) with the logit link showed that, regardless of the data imbalance, the RPDLomax model assigns effectively lower mean posterior predictive probabilities to failure and higher probabilities to success (21.4% and 63.7%, respectively), unlike Logistic regression, which does not clearly distinguish between the mean posterior predictive probabilities for these two classes (36.0% and 39.5% for failure and success, respectively). That is, the proposed asymmetric Lomax approach is a competitive model for differentiating binary data classification in imbalanced tasks against the Logistic approach.
Read full abstract