A Monte Carlo fuzzy logistic regression framework against imbalance and separation
This article proposes a new fuzzy logistic regression framework with high classification performance against imbalance and separation while keeping the interpretability of classical logistic regression. Separation and imbalance are two core problems in logistic regression, which can result in biased coefficient estimates and inaccurate predictions. Existing research on fuzzy logistic regression primarily focuses on developing possibilistic models instead of using a logit link function that converts log-odds ratios to probabilities. At the same time, little consideration is given to issues of separation and imbalance. Our study aims to address these challenges by proposing new methods of fuzzifying binary variables and classifying subjects based on a comparison against a fuzzy threshold. We use combinations of fuzzy and crisp predictors, output, and coefficients to understand which combinations perform better under imbalance and separation. Numerical experiments with synthetic and real datasets are conducted to demonstrate the usefulness and superiority of the proposed framework. Seven crisp machine learning models are implemented for benchmarking in the numerical experiments. The proposed framework shows consistently strong performance results across datasets with imbalance or separation and performs equally well when such issues are absent. Meanwhile, the considered machine learning methods are significantly impacted by the imbalanced datasets.
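To make the two problems named in the abstract concrete, the following is a minimal sketch and not the authors' framework. It shows the logit link turning log-odds into a probability, a check for complete separation on a single predictor, and a measure of class imbalance. The function names `inverse_logit`, `is_completely_separated`, and `minority_share` are hypothetical and chosen only for illustration.

```python
import math


def inverse_logit(log_odds):
    """Map a log-odds value to a probability in (0, 1), written to avoid overflow."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def is_completely_separated(x, y):
    """Return True if one cut on the single predictor x splits the 0/1 labels y perfectly.

    This is an illustrative one-predictor check, not a general separation test.
    """
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        return False  # only one class present, so there is nothing to separate
    return max(x0) < min(x1) or max(x1) < min(x0)


def minority_share(y):
    """Fraction of observations that belong to the rarer of the two classes."""
    ones = sum(y)
    return min(ones, len(y) - ones) / len(y)
```

When `is_completely_separated` returns True for a predictor, maximum likelihood estimates for classical logistic regression do not exist as finite values, which is one way the biased coefficient estimates mentioned above can arise.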
- Research Article
- 10.1016/j.ecoinf.2025.103091
- Jul 1, 2025
- Ecological Informatics
Metric learning unveiling disparities: A novel approach to recognize false trigger images in wildlife monitoring
- Research Article
4
- 10.1186/s12874-024-02270-x
- Jul 5, 2024
- BMC Medical Research Methodology
Background: In binary classification for clinical studies, an imbalanced distribution of cases across classes and an extreme association between the binary dependent variable and a subset of independent variables can create significant classification problems. These issues, namely class imbalance and complete separation, lead to classification inaccuracy and biased results in clinical studies. Method: To deal with class imbalance and complete separation, we propose using a fuzzy logistic regression framework for binary classification. Fuzzy logistic regression incorporates combinations of triangular fuzzy numbers for the coefficients, inputs, and outputs and produces crisp classification results. The fuzzy logistic regression framework shows strong classification performance due to fuzzy logic’s better handling of imbalance and separation issues. Hence, classification accuracy is improved, mitigating the risk of misclassified conditions and biased insights for clinical study patients. Results: The performance of the fuzzy logistic regression model is assessed on twelve binary classification problems with clinical datasets. The model has consistently high sensitivity, specificity, F1, precision, and Matthews correlation coefficient scores across all clinical datasets. There is no evidence of impact from the imbalance or separation that exists in the datasets. Furthermore, we compare the fuzzy logistic regression classification performance against two versions of classical logistic regression and six different benchmark sources in the literature. These six sources provide a total of ten different proposed methodologies, and the comparison is made by calculating the same set of classification performance scores for each method. Either imbalance or separation impacts seven out of ten methodologies. The remaining three produce better classification performance in their respective clinical studies. However, these are all outperformed by the fuzzy logistic regression framework. Conclusion: Fuzzy logistic regression showcases strong performance against imbalance and separation, providing accurate predictions and, hence, informative insights for classifying patients in clinical studies.
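The scores named in this abstract can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name `classification_scores` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not details taken from the study.

```python
import math


def classification_scores(tp, fp, tn, fn):
    """Compute common binary classification scores from confusion-matrix counts.

    Any score whose denominator is zero is reported as 0.0 (an assumed convention).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the positive class
    specificity = ratio(tn, tn + fp)   # recall on the negative class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)  # Matthews correlation coefficient
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, a classifier that labels every case in a 95-to-5 dataset as negative (`tp=0, fp=0, tn=95, fn=5`) reaches an accuracy of 0.95 while its sensitivity and Matthews correlation coefficient are both 0.0, which is why imbalance studies report more than accuracy.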
- Conference Article
- 10.1063/5.0222468
- Jan 1, 2024
Prediction of cardiovascular disease based on logistic regression model
- Research Article
- 10.1038/s41598-025-01064-5
- May 21, 2025
- Scientific Reports
Power Line Communication (PLC) uses power cables to transmit data. One issue is that sending data to unavailable nodes is time-consuming. Machine learning has addressed this by predicting a node with optimum readings. The more the machine learning models learn, the more accurate they become, because the model stays updated with the node’s continuous availability status; for this reason, self-training algorithms have been used. A dataset of 2000 instances from one node of an implemented 500-node PLC network has been collected. The features of each instance are CINR (Carrier-to-Interference plus Noise Ratio), SNR (Signal-to-Noise Ratio), and RSSI (Received Signal Strength Indicator), and the label is whether the node is Up or Down. The dataset has been split into 85% for training and 15% for testing, and 15% of the training data are unlabeled. A self-training classifier has been used to allow Light Gradient Boosting Machine (LGBM) and Support Vector Machine (linear and non-linear kernels) to behave in a self-training manner, alongside the training of label propagation and label spreading algorithms. Supervised learning algorithms (Random Forest and logistic regression) have been trained on the dataset for comparison. The best model is Label Spreading, with an accuracy of 94.67%, an F1-score of 0.947, a precision of 0.946, and a recall of 0.947, with a training time of 0.018 s and a memory consumption of 0.99 MB.
- Research Article
7
- 10.1016/j.iswa.2024.200378
- Apr 26, 2024
- Intelligent Systems with Applications
A learning system-based soft multiple linear regression model
- New
- Research Article
- 10.1016/j.engappai.2025.111319
- Nov 1, 2025
- Engineering Applications of Artificial Intelligence
A novel framework for assessing determinant risk factors on cyber (dis)trust behaviors of netizens in deepfakes
- Research Article
1
- 10.2139/ssrn.4997557
- Jan 1, 2024
- SSRN Electronic Journal
Elementary Operations with Gaussian Fuzzy Numbers
- Research Article
- 10.1007/s44196-024-00723-1
- Jan 20, 2025
- International Journal of Computational Intelligence Systems
Over the last two decades, the panel data model has become a focus of applied research. While there are numerous proposals for soft regression models in the literature, only a few linear regression models have been proposed based on fuzzy panel data. However, these models have serious limitations. This study is an attempt to propose a kind of two-way fuzzy panel regression model with crossed effects, fuzzy responses and crisp predictors to overcome the shortcomings of these models in real applications. The corresponding parameter estimation is provided based on a three-step procedure. For this purpose, the conventional least absolute error technique is employed. Two real data sets are analyzed to investigate the fitting and predictive capabilities of the proposed fuzzy panel regression model. These real data applications demonstrate that our proposed model has good fitting accuracy and predictive performance.
- Research Article
5
- 10.1186/s13040-024-00384-y
- Sep 4, 2024
- BioData Mining
Objective: Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios. Methods: We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness. Results: The logistic model’s performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes. Conclusions: The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
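The core idea behind SMOTE, one of the oversampling methods this abstract recommends, is to create synthetic minority cases by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified standard-library sketch of that idea and not the implementation used in the study. The function name `smote_like_oversample` and its parameters `n_new`, `k`, and `seed` are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    minority: list of equal-length numeric tuples (minority-class feature vectors).
    Simplified sketch of the SMOTE idea; real use would rely on a tested library.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the chosen base point (Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority points, the method adds cases without simply duplicating rows, which is the property that distinguishes it from plain random oversampling.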
- Research Article
15
- 10.1016/j.nut.2013.08.008
- Jan 30, 2014
- Nutrition
Effect of folic acid on appetite in children: Ordinal logistic and fuzzy logistic regressions
- Research Article
2
- 10.3233/idt-150247
- Mar 1, 2016
- Intelligent Decision Technologies
In some practical situations, it is not possible to categorize samples into one of two response categories because of the vague nature of the response variable. Statistical logistic regression models are, therefore, not appropriate for modeling such response variables. Moreover, the small sample size in most cases limits the use of statistical logistic regression models. Fuzzy logistic regression models, instead, can overcome these problems. In order to investigate the use of fuzzy logistic regression, the present study is designed and implemented to evaluate the relationship between dietary pattern and a set of risk factors of interest. Since it is not possible to define a healthy dietary pattern precisely, the possibility of having a healthy diet is reported for each subject as a number between zero and one. The conventional logistic model is not appropriate and fails in dealing with such imprecise data; hence, a possibilistic approach is used to model the available data and to estimate the fuzzy parameters of the model. For evaluating the model, a goodness-of-fit index and an appropriate predictive capability criterion with a cross-validation technique are developed. The logistic model investigated here is found to be general and inclusive enough to be recommended for modeling vague observations or ambiguous relations in any field of medical sciences.
- Research Article
2
- 10.1007/s40815-019-00615-z
- Apr 2, 2019
- International Journal of Fuzzy Systems
Logistic regression analysis is a popular method for describing the relation between variables. However, when there is a large number of variables in the regression model, the selection of the best model becomes a major problem. In this situation, the question is which subset of predictors can best predict the response pattern, and which process can be used to find such a subset. This article answers this question for fuzzy logistic regression models. To this end, based on the existing criteria for regression models, three goodness-of-fit criteria, namely MSEF, AICF, and $C_{p}^{\text{F}}$, are proposed. These criteria help to select the best-fitted model among all possible fuzzy logistic regression models with fuzzy covariates and responses. In addition, based on the concepts of efficiency level and MSEF, a forward model selection method for fuzzy logistic regression is proposed. The proposed method is justified by simulation studies, which indicate its good performance and efficiency. In addition, we applied the presented methods in a clinical trial study.
- Research Article
- 10.2139/ssrn.1595710
- Apr 27, 2010
- SSRN Electronic Journal
Bootstrapping Fuzzy-GARCH Regressions on the Day of the Week Effect in Stock Returns: Applications in MATLAB
- Research Article
16
- 10.1016/j.asej.2021.06.004
- Jan 1, 2022
- Ain Shams Engineering Journal
A novel technique for parameter estimation in intuitionistic fuzzy logistic regression model
- Research Article
27
- 10.13189/ms.2021.090320
- May 1, 2021
- Mathematics and Statistics
An imbalanced data problem occurs in the absence of a good class distribution between classes. Imbalanced data will cause the classifier to be biased to the majority class as the standard classification algorithms are based on the belief that the training set is balanced. Therefore, it is crucial to find a classifier that can deal with imbalanced data for any given classification task. The aim of this research is to find the best method among AdaBoost, XGBoost, and Logistic Regression to deal with imbalanced simulated datasets and real datasets. The performances of these three methods in both simulated and real imbalanced datasets are compared using five performance measures, namely sensitivity, specificity, precision, F1-score, and g-mean. The results of the simulated datasets show that logistic regression performs better than AdaBoost and XGBoost in highly imbalanced datasets, whereas in the real imbalanced datasets, AdaBoost and logistic regression demonstrated similarly good performance. All methods seem to perform well in datasets that are not severely imbalanced. Compared to AdaBoost and XGBoost, logistic regression is found to predict better for datasets with severe imbalanced ratios. However, all three methods perform poorly for data with a 5% minority, with a sample size of n = 100. In this study, it is found that different methods perform the best for data with different minority percentages.
- Research Article
1
- 10.1088/1742-6596/1524/1/012124
- Apr 1, 2020
- Journal of Physics: Conference Series
Diabetes mellitus is a disease of long-term abnormal metabolism, arising because the pancreas cannot produce enough insulin or because the body cannot effectively use the insulin that is produced. A stroke occurs if the flow of oxygen-rich blood to a portion of the brain is blocked. Without oxygen, brain cells start to die after a few minutes. Sudden bleeding in the brain can also cause a stroke if it damages brain cells. The objectives are to find the significant factors that cause diabetes mellitus and to determine an ordinal regression model. The ordinal regression model is used to obtain the probability and reliability functions of a patient having a stroke. The method uses three link functions: the logit, normit, and cloglog link functions. The homogeneity of the prediction results across link functions is tested with a linear hypothesis test. The factors that cause diabetes mellitus are body mass index, high density lipoprotein, and albuminuria. Knowledge of the factors behind diabetes mellitus and stroke could be used to prevent these diseases, so that people stay healthy and happy. The results show that the probability of stroke is greater for a patient with macroalbuminuria than for one with microalbuminuria, and greater for a patient with microalbuminuria than for one with normal albuminuria. For a patient with macroalbuminuria, the probability decreases across the logit, normit, and cloglog link functions, in that order. For a patient with microalbuminuria, the probability increases across the logit, normit, and cloglog link functions, in that order. The reliability for a patient having a stroke decreases across macroalbuminuria, normal, and microalbuminuria, in that order. For a patient with macroalbuminuria, the reliability increases across the logit, normit, and cloglog link functions, in that order. For a patient with microalbuminuria, the reliability under the logit, normit, and cloglog link functions is zero. All link functions yield the same estimated probability values. The AIC values of the logit, normit, and cloglog link functions are 167.6826, 168.3965, and 169.6107, respectively. These results agree with the linear hypothesis analysis, which finds that the AIC values are not different even though they are not exactly equal. Therefore, the logit, normit, and cloglog models can all be used to predict the probability, with almost the same results.
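The three link functions compared in this abstract differ only in how they map a linear predictor to a probability. The sketch below gives the standard inverse links, taking normit as another name for the probit link, together with the usual AIC formula. The function names are hypothetical, and nothing here reproduces the study's ordinal model or its fitted values.

```python
import math


def inv_logit(eta):
    """Inverse logit link: p = 1 / (1 + exp(-eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def inv_normit(eta):
    """Inverse normit (probit) link: the standard normal CDF evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta):
    """Inverse complementary log-log link: p = 1 - exp(-exp(eta))."""
    # Cap eta so math.exp does not overflow; the result is already 1.0 there.
    return 1.0 - math.exp(-math.exp(min(eta, 700.0)))


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood
```

The logit and normit links are symmetric around a probability of 0.5, while the cloglog link is not, so fitted probabilities can differ most in the tails even when AIC values are close.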
- Dissertation
- 10.53846/goediss-8574
- Feb 21, 2022
Imbalance Learning and Its Application on Medical Datasets
- Research Article
42
- 10.1016/j.camwa.2011.08.050
- Sep 18, 2011
- Computers & Mathematics with Applications
Fuzzy logistic regression based on the least squares approach with application in clinical studies
- Research Article
125
- 10.1016/j.asoc.2016.02.025
- Feb 24, 2016
- Applied Soft Computing
Technology credit scoring model with fuzzy logistic regression
- Research Article
1
- 10.22037/jps.v8i1.11921
- Jan 17, 2017
- Journal of paramedical sciences
Chest tube removal pain is one of the important complications after open heart surgery. The removal of a chest tube is a painful and frightening experience and should be managed with as little pain and distress as possible. The aim of this study is to assess the effect of beloved person’s voice on chest tube removal pain in patients undergoing open heart surgery. 128 patients were randomly assigned to two groups: one group listened to beloved person’s voice during the procedure, and the other did not. Since pain was measured by linguistic terms, a fuzzy logistic regression was applied for modeling. After controlling for the potential confounders, based on fuzzy logistic regression, the beloved person’s voice reduced the risk of pain. Therefore, using beloved person’s voice could be effective, inexpensive and safe for distraction and reduction of pain.
- Research Article
7
- 10.1088/1742-6596/1306/1/012027
- Aug 1, 2019
- Journal of Physics: Conference Series
Hypertension, known as the silent killer, is the number one non-infectious disease causing death in the world every year. There were 185,857 cases recorded in Indonesia in 2018. In this study, we model hypertension risk by considering age, heart rate, family history of hypertension, eating salty foods, and smoking or exposure to cigarette smoke as the factors influencing hypertension risk. A cross-sectional survey was conducted in August 2018 at the Haji Hospital of Surabaya. Logistic regression is used to analyse the influence of various risk factors on hypertension and non-hypertension. In addition, we compare the logit and gompit link functions in logistic regression for modelling hypertension risk factors, based on the accuracy of the classification model. With the logit and gompit link functions, we obtain classification accuracies of 85.2% and 81.5%, respectively. This means that the logit link function is better than the gompit link function for modelling hypertension risk factors. For both link functions, the significant factors that influence hypertension are age and heart rate.
- Research Article
10
- 10.1016/j.eij.2020.07.001
- Jul 28, 2020
- Egyptian Informatics Journal
The fuzzy common vulnerability scoring system (F-CVSS) based on a least squares approach with fuzzy logistic regression
- Research Article
73
- 10.1016/j.ins.2007.03.002
- Mar 23, 2007
- Information Sciences
Fuzzy nonparametric regression based on local linear smoothing technique