Machine learning-based biomarkers for breast cancer molecular subtypes HER2+ and TNBC

Seyma Yasar

doi:10.5455/medscience.2023.11.219

Abstract

Breast cancer is the most common cancer and the leading cause of cancer-related death in women in Türkiye and worldwide. Because breast tumor cells differ in their genetic characteristics, diagnosis and treatment options vary by molecular subtype. The primary objective of this study was to classify the human epidermal growth factor receptor 2-positive (HER2+, n=71) and triple-negative breast cancer (TNBC, n=25) molecular subtypes with two machine learning algorithms, using tissue proteomics data from 96 breast cancer patients. The secondary aim was to identify candidate protein biomarkers that could inform the diagnosis and treatment of these subtypes. Upsampling (oversampling of the minority class) was used to address the class imbalance in the data, and the least absolute shrinkage and selection operator (Lasso) was used for variable selection. In the modeling phase, Extreme Gradient Boosting (XGBoost) and Bootstrap Aggregating Classification and Regression Trees (Bagged CART) were applied. XGBoost achieved the best classification performance, with accuracy, balanced accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, Matthews correlation coefficient (MCC), and G-mean of 96.4%, 96.4%, 100%, 92.9%, 93.3%, 100%, 96.6%, 93.1%, and 96.6%, respectively. In the optimal XGBoost model, the three proteins most important for distinguishing the two subtypes had the accession codes P02042, P00441, and P20231. These three proteins may be clinically useful markers for early diagnosis and individualized treatment.
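The nine performance measures listed above all derive from a single 2x2 confusion matrix, so their relationship can be checked with a few lines of arithmetic. The following is a minimal sketch, not the author's code. The abstract does not report test-set counts; the counts used here (14 true positives, 0 false negatives, 1 false positive, 13 true negatives) are hypothetical, chosen only because they happen to reproduce the reported percentages after rounding. The function name `binary_metrics` is likewise illustrative. G-mean is assumed here to be the geometric mean of positive predictive value and sensitivity; with these counts, the alternative definition based on sensitivity and specificity would give 96.4% rather than 96.6%.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute common binary-classification measures from confusion-matrix counts.

    No zero-division guards are included; every denominator must be nonzero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Assumption: G-mean taken as sqrt(PPV * sensitivity), not sqrt(sens * spec).
    g_mean = math.sqrt(ppv * sensitivity)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "mcc": mcc,
        "g_mean": g_mean,
    }


# Hypothetical counts (not reported in the abstract), used for illustration only.
metrics = binary_metrics(tp=14, fn=0, fp=1, tn=13)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

With these assumed counts, a single misclassified sample out of 28 accounts for every value below 100%, which shows how sensitive such percentages are to individual predictions on a small test set.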

Full Text