Cluster-based Undersampling Research Articles

Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer's, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types (AAC, KMER, DR, PWAA, and EBGW) were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model's accuracy and robustness.
For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields. A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
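The cluster-based undersampling step named in the abstract above can be illustrated with a small sketch. It is not the Glypred implementation: it assumes the common cluster-centroids idea, in which the majority class is grouped by k-means into as many clusters as there are minority samples and each cluster is replaced by its centroid. The function names, the toy data, and the choice of plain Lloyd's k-means are all illustrative assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans_centroids(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_point(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def cluster_centroid_undersample(X, y, majority_label, seed=0):
    """Replace the majority class by as many k-means centroids as there
    are minority samples, so both classes end up the same size.
    Assumes at least one minority sample and no more minority than
    majority samples."""
    majority = [x for x, label in zip(X, y) if label == majority_label]
    minority = [(x, label) for x, label in zip(X, y) if label != majority_label]
    k = len(minority)
    centroids = kmeans_centroids(majority, k, seed=seed)
    X_balanced = [list(x) for x, _ in minority] + centroids
    y_balanced = [label for _, label in minority] + [majority_label] * k
    return X_balanced, y_balanced


# Toy data: six majority (label 0) and two minority (label 1) samples.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
     [2.0, 2.0], [3.0, 3.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = cluster_centroid_undersample(X, y, majority_label=0)
print(len(X_bal), y_bal)
```

Because the retained majority points are centroids rather than original samples, this kind of undersampling summarizes the majority class instead of discarding samples at random, which is the usual motivation for preferring it over random undersampling.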

More than 1 billion people worldwide suffer from chronic respiratory diseases, which account for more than 4 million deaths annually. Inhaled corticosteroids (ICS) are a common medication for treating chronic respiratory diseases. Their side effects include decreased bone mineral density and osteoporosis. The aims of this study are to investigate the association between inhaled corticosteroids and fracture and to design a clinical decision support system for fracture prediction. Data on patients aged 20 years and older who had visited healthcare centers and been prescribed inhaled corticosteroids between 2002 and 2010 were retrieved from the National Health Insurance Research Database (NHIRD). After excluding patients diagnosed with hip or vertebral fractures before using inhaled corticosteroids, a total of 11645 patients receiving inhaled corticosteroid therapy were included in this study. Among them, 1134 (9.7%) were diagnosed with hip or vertebral fracture. The statistical results showed that demographic information, chronic respiratory diseases and comorbidities, and corticosteroid-related variables (cumulative dose, mean exposed daily dose, follow-up duration, and exposed duration) differed significantly between fracture and nonfracture patients. The clinical decision support systems (CDSSs) were designed by integrating a genetic algorithm (GA) with a support vector machine (SVM). The models were trained and validated on balanced training sets obtained by random and cluster-based undersampling and tested on the imbalanced NHIRD dataset. Two different objective functions were adopted to obtain the models with the best predictive performance. The CDSSs achieved a sensitivity of 69.84–77.00% and an AUC of 0.7495–0.7590. It was concluded that long-term use of inhaled corticosteroids may induce osteoporosis and lead to a higher incidence of hip or vertebral fractures. The accumulated dose of inhaled (ICS) and oral (OCS) corticosteroid therapies should be continuously monitored, especially in older patients and postmenopausal women, to avoid exceeding the maximum dosage.
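The two metrics reported in the abstract above, sensitivity and AUC, can be computed as in the following sketch. The labels and scores are made-up toy values, not NHIRD data, and the 0.5 decision threshold is an assumption; only the fracture count (1134 of 11645) is taken from the abstract.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN), with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs in which the positive sample scores
    higher, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced test set: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens = sensitivity(y_true, y_pred)
area = auc(y_true, scores)

# Share of fracture patients stated in the abstract.
fracture_rate = 1134 / 11645

print(f"sensitivity={sens:.4f} auc={area:.4f} fracture_rate={fracture_rate:.1%}")
```

With fewer than 10% positive cases, plain accuracy would reward a model that predicts "no fracture" for everyone, which is one reason a study like this one reports sensitivity and AUC on the imbalanced test set.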

Articles published on Cluster-based Undersampling

Glypred: Lysine Glycation Site Prediction via CCU-LightGBM-BiLSTM Framework with Multi-Head Attention Mechanism.

Highly Imbalanced Railway Station Structural Damage Monitoring Based on Cluster-Based Undersampling and Siamese Artificial Neural Network

New boosting approaches for improving cluster-based undersampling in problems with imbalanced data

A Cluster-based Undersampling Technique for Multiclass Skewed Datasets

Bankruptcy prediction using synthetic sampling

A resampling approach to disaggregate analysis of bus-involved crashes using panel data with excessive zeros

HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems

The Empirical Comparison of Machine Learning Algorithm for the Class Imbalanced Problem in Conformational Epitope Prediction

Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm

A Boosting-Aided Adaptive Cluster-Based Undersampling Approach for Treatment of Class Imbalance Problem

Enhancement of conformational B-cell epitope prediction using CluSMOTE.

Multi-Feature Approach for Bug Severity Assignment

Prediction of Breast Cancer from Imbalance Respect Using Cluster-Based Undersampling Method.

Design of a Clinical Decision Support System for Fracture Prediction Using Imbalanced Dataset.

Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction
