Abstract

This study proposes a hybrid model for breast cancer classification that combines K-Means clustering with Random Forest classification. The aim is to combine the strengths of unsupervised clustering and supervised classification to improve the accuracy and robustness of the classifier. The dataset was preprocessed by handling missing values, normalizing features, and selecting features using a variance threshold or correlation analysis. K-Means clustering was then applied to the preprocessed data to assign a cluster label to each sample. A Random Forest classifier was trained on mixed features consisting of the raw gene expression values and the cluster labels. The cluster labels give the classifier supplementary information about the structure of the data, which helps it capture more intricate patterns. The Random Forest hyperparameters (number of trees, maximum depth, and minimum number of samples required to split a node) were chosen through parameter tuning. In the experiments, the hybrid model outperformed the standalone Random Forest classifier, achieving higher accuracy, precision, recall, and F1 score. These results suggest that K-Means clustering exposes underlying structure in the data that enhances the classifier's discriminatory ability.
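The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the study's implementation. Several choices are assumptions made only for the example: the bundled Wisconsin breast cancer dataset stands in for the gene expression data, median imputation and a zero variance threshold stand in for the unspecified preprocessing methods, two clusters are used, the hyperparameter grid is arbitrary, and K-Means is fitted on the training split only to avoid leaking test information. The helper names `add_cluster_labels`, `fit_forest`, and `report` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler


def add_cluster_labels(X_train, X_test, n_clusters=2, seed=0):
    """Append the K-Means cluster label of each sample as one extra feature column."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    train_labels = km.fit_predict(X_train)  # clusters learned from training data only
    test_labels = km.predict(X_test)        # test samples assigned to nearest centroid
    return (np.column_stack([X_train, train_labels]),
            np.column_stack([X_test, test_labels]))


def fit_forest(X_train, y_train, seed=0):
    """Tune number of trees, maximum depth, and minimum split size by grid search."""
    grid = {  # assumed grid, chosen only to keep the example fast
        "n_estimators": [50, 100],
        "max_depth": [None, 5],
        "min_samples_split": [2, 5],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=seed), grid, cv=3, scoring="f1")
    search.fit(X_train, y_train)
    return search.best_estimator_


def report(model, X_test, y_test):
    """Return the four metrics named in the abstract."""
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }


# Stand-in data (assumption): the study used gene expression features instead.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing: impute missing values, drop constant features, then standardize.
imputer = SimpleImputer(strategy="median")
selector = VarianceThreshold(threshold=0.0)
scaler = StandardScaler()
X_train_p = scaler.fit_transform(selector.fit_transform(imputer.fit_transform(X_train)))
X_test_p = scaler.transform(selector.transform(imputer.transform(X_test)))

# Hybrid features: preprocessed values plus the cluster label column.
X_train_h, X_test_h = add_cluster_labels(X_train_p, X_test_p)

baseline_scores = report(fit_forest(X_train_p, y_train), X_test_p, y_test)
hybrid_scores = report(fit_forest(X_train_h, y_train), X_test_h, y_test)
print("standalone Random Forest:", baseline_scores)
print("K-Means + Random Forest: ", hybrid_scores)
```

Here the cluster label is appended as a single integer column, which tree-based models can split on directly; one-hot encoding the label is an alternative when more than two clusters are used. Whether the hybrid features improve on the baseline depends on the data, so the sketch prints both sets of scores for comparison instead of assuming a result.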
