Imbalanced Data Classification Research Articles

Introduction: Children being treated for acute lymphoblastic leukemia (ALL) are frequently affected by asparaginase-associated pancreatitis. Pancreatitis is among the most frequent and troublesome side effects of asparaginase therapy and is a significant contributor to early drug discontinuation and poor outcomes. The odds ratios of known risk factors, such as asparaginase dosage, advanced age, and single nucleotide polymorphisms, are inadequate to predict pancreatitis occurrence. The goal of this study was to use machine learning to develop a predictive model for asparaginase-induced pancreatitis in pediatric ALL patients.

Methods: Data were collected from 711 patients who had childhood ALL and received asparaginase. Pancreatitis was defined as serum amylase and/or lipase levels greater than three times the upper limit of normal, or acute pancreatitis on abdominal images. The one-month period following an asparaginase administration was defined as one “timestep”, and any asparaginase administered thereafter began a new timestep. Each timestep was treated as one training case, and a case in which pancreatitis occurred at that timestep was defined as an event. In total, 3193 training cases were defined across the 711 patients. Physical measurement results, prescription codes, blood test results, and blood transfusion history were collected from electronic health records (EHR) over the entire treatment period of each patient. The predictive variables included age, body mass index, body surface area, gender, type of asparaginase (Native, Erwinia, or Pegylated), previous history of pancreatitis, cumulative number of asparaginase administrations, and asparaginase change history before the current time point. The results of 47 blood tests on the start date of asparaginase in each timestep were also used as predictive variables (Figure 1). Using logistic regression, Random Forest, and XGBoost as machine learning methods, we assessed models predicting asparaginase-associated pancreatitis through 5-fold cross-validation. Because the data are imbalanced, the binary classification was evaluated with the area under the receiver operating characteristic curve (AUC), the Precision Recall (PR) score, the F0.5 score, and the F2 score. Model selection was based on two criteria: the mean of the F0.5 and F2 scores, written here as the (F0.5 + F2)/2 score, and the PR score.

Results: When the (F0.5 + F2)/2 score was the basis for model selection, the logistic regression model demonstrated an AUC of 81% (PR 32.86%, (F0.5 + F2)/2 score 23.23%). The XGBoost model exhibited an AUC of 79% (PR 33.7%, (F0.5 + F2)/2 score 32.07%), while the Random Forest model achieved an AUC of 84% (PR 33.34%, (F0.5 + F2)/2 score 39.48%). Among these models, the Random Forest model demonstrated the highest predictive power. When the model was chosen using the PR score, the logistic regression model achieved an AUC of 80% (PR 34.97%, (F0.5 + F2)/2 score 22.08%), whereas the XGBoost model achieved an AUC of 79% (PR 31.58%, (F0.5 + F2)/2 score 31.6%). The Random Forest model was reported to have the best performance across all metrics, with an AUC of 85%, a PR score of 32.26%, and an (F0.5 + F2)/2 score of 36.4% (Figure 2, left). According to Shapley values, greater lipase levels, higher cumulative asparaginase dosages, higher amylase levels, higher glucose levels, and older age contributed significantly to the occurrence of asparaginase-associated pancreatitis (Figure 2, right).

Conclusions: A machine learning model successfully forecast the occurrence of acute pancreatitis following the administration of asparaginase in pediatric patients with ALL. This study focused specifically on predicting pancreatitis within a month, based on the test results obtained at the commencement of asparaginase treatment. This approach offers the potential for promptly predicting the development of pancreatitis. In later stages, following external validation and prospective observational clinical trials, the prediction model has the potential to be included in the EHR and serve as a Clinical Decision Support System (CDSS).
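The F0.5 and F2 scores used above are instances of the standard F-beta score, which weights recall beta times as heavily as precision. The following is a minimal sketch of how the averaged score could be computed from binary labels. The function names `fbeta` and `mean_f05_f2` are hypothetical, the labels are made-up toy values, and nothing here is taken from the study's own code.

```python
def fbeta(y_true, y_pred, beta):
    """Standard F-beta score for binary labels (1 = event, 0 = no event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return (1 + b2) * tp / denom if denom else 0.0


def mean_f05_f2(y_true, y_pred):
    """Average of F0.5 (favours precision) and F2 (favours recall)."""
    return (fbeta(y_true, y_pred, 0.5) + fbeta(y_true, y_pred, 2.0)) / 2


if __name__ == "__main__":
    # Toy, imbalanced example: 2 events among 10 cases (assumed data).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(round(fbeta(y_true, y_pred, 0.5), 4))
    print(round(fbeta(y_true, y_pred, 2.0), 4))
    print(round(mean_f05_f2(y_true, y_pred), 4))
```

Averaging the two scores balances a precision-weighted view against a recall-weighted one, which may be why it is used alongside the PR score when events are rare.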

For imbalanced data, classification efficiency degrades significantly because information about the positive class is missing, and existing sampling schemes do not consider the distributions of samples. Additionally, the global parameters of fuzzy neighborhoods are set manually. These defects reduce the effectiveness of the classifier. To address these problems, we propose an adaptive fuzzy multi-neighborhood feature selection methodology with intercluster distance-based hybrid sampling for class-imbalanced data. First, the number of clusters is defined in terms of the number of samples in the negative or positive class. The initial centers of the clusters are determined according to the number of clusters, and the dissimilarity and similarity measures are calculated by using the intercluster distances between samples. The cluster center, fuzzy membership matrix, and intercluster distance are then studied, and the optimization objective function is designed. The hybrid sampling scheme combines the generated positive class samples and negative class samples to obtain a class-balanced system. Second, according to the sample distribution, the standard deviation and a set of adaptive fuzzy multi-neighborhood radii are designed. A fuzzy multi-neighborhood similarity relation is defined by introducing a Gaussian kernel model to obtain a fuzzy multi-neighborhood granule, and an improved fuzzy multi-neighborhood rough set model is provided. Uncertainty measures of fuzzy neighborhood systems are evaluated by the positive region and dependency. Third, by integrating fuzzy dependence with fuzzy complementary condition entropy, fuzzy multi-neighborhood complementary mutual information is provided from the two viewpoints of algebra and information. Finally, a heuristic feature subset selection methodology for imbalanced classification with hybrid sampling using fuzzy c-means clustering is developed to obtain a high-quality feature subset. Experiments on 26 imbalanced datasets show the effectiveness of the designed algorithm.
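The fuzzy membership matrix mentioned above is a core quantity in fuzzy c-means clustering. As a minimal sketch, the textbook membership update for fixed cluster centers can be written as below. This is the standard fuzzy c-means formula, not the paper's intercluster distance-based variant, and the function name `fcm_memberships`, the fuzzifier value `m=2.0`, and the sample points are all assumptions made for illustration.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Textbook fuzzy c-means membership update for fixed centers.

    u[i][k] = 1 / sum_j (d(x_i, c_k) / d(x_i, c_j)) ** (2 / (m - 1)),
    where m > 1 is the fuzzifier. Each row sums to 1.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: share full
            # membership among those centers and give zero to the rest.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
               for d_k in dists]
        memberships.append(row)
    return memberships


if __name__ == "__main__":
    # Hypothetical 2-D samples and two fixed cluster centers.
    points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
    centers = [(0.0, 0.0), (4.0, 0.0)]
    for row in fcm_memberships(points, centers):
        print([round(u, 3) for u in row])
```

In a full fuzzy c-means loop, this update would alternate with recomputing the centers as membership-weighted means until the objective stops improving.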

Articles published on Imbalanced Data Classification

Domain adaptation with label-aligned sampling (DALAS) for cross-domain fault diagnosis of rotating machinery under class imbalance

VGAN-BL: imbalanced data classification based on generative adversarial network and biased loss

HSNF: Hybrid sampling with two-step noise filtering for imbalanced data classification

Modeling of class imbalance handling with optimal deep learning enabled big data classification model

Chinese relation extraction in military field based on multi-grained lattice transformer and imbalanced data classification

Learning from class-imbalanced data using misclassification-focusing generative adversarial networks

Irrelevant attribute resistance approach to binary classification for imbalanced data

Adaptive SV-Borderline SMOTE-SVM algorithm for imbalanced data classification

Enhancing anomaly detection accuracy and interpretability in low-quality and class imbalanced data: A comprehensive approach

A Machine Learning Approach for Predicting the Occurrence of Asparaginase-Associated Pancreatitis in Pediatric Patients with Acute Lymphoblastic Leukemia

A Hybrid Resampling Approach for Multiclass Skewed Datasets and Experimental Analysis with Diverse Classifier Models

The Performance Comparison between C4.5 Tree and One-Dimensional Convolutional Neural Networks (CNN1D) with Tuning Hyperparameters for the Classification of Imbalanced Medical Data

Adaptive fuzzy multi-neighborhood feature selection with hybrid sampling and its application for class-imbalanced data

Class-imbalanced time series anomaly detection method based on cost-sensitive hybrid network

How Far Have We Progressed in the Sampling Methods for Imbalanced Data Classification? An Empirical Study

A sparrow search algorithm-optimized convolutional neural network for imbalanced data classification using synthetic minority over-sampling technique

Crack segmentation of imbalanced data: The role of loss functions

Data-level Hybrid Strategy Selection for Disk Fault Prediction Model Based on Multivariate Gan

A Survey of Methods for Handling Disk Data Imbalance

Imbalanced data classification using improved synthetic minority over-sampling technique
