Abstract

Breast cancer is a neoplastic disease that seriously threatens women’s health. It is regarded as the most common cause of cancer death in women. Accurate detection and effective treatment are vital to lowering the death rate of breast cancer. In recent years, machine learning techniques have been considered effective for the accurate diagnosis of various diseases, and Random Forest (RF) has been widely applied among them. However, decision trees with poor classification performance and high similarity may be generated during training, which degrades the overall classification performance of the model. In this paper, a Hierarchical Clustering Random Forest (HCRF) model is developed. The similarity among all the decision trees is measured, and hierarchical clustering is then used to group the trees into clusters. Representative trees are selected from the resulting clusters to construct a hierarchical clustering random forest with low similarity and high accuracy. In addition, the Variable Importance Measure (VIM) method is used to optimize the number of selected features for breast cancer prediction. The Wisconsin Diagnosis Breast Cancer (WDBC) database and the Wisconsin Breast Cancer (WBC) database from the UCI (University of California Irvine) Machine Learning Repository are employed in this study. The performance of the proposed method is evaluated using accuracy, precision, sensitivity, specificity and AUC (Area Under the ROC Curve). Experimental results indicate that classification based on the HCRF algorithm, with VIM as the feature selection method, achieves the highest accuracy, 97.05% and 97.76%, on the WDBC and WBC datasets, outperforming Decision Tree, AdaBoost and Random Forest. The method proposed in this study is an effective tool for diagnosing breast cancer.

Highlights

  • Breast cancer is one of the most important problems in women's health and has become the malignant tumor with the highest incidence among women globally [1, 2]

  • Fine Needle Aspiration biopsy (FNA) is a minimally invasive pathological diagnosis method based on cell morphology [7], which has great potential to provide diagnoses with high accuracy and a low false positive rate

  • Experimental results show that classification based on the Hierarchical Clustering Random Forest (HCRF) algorithm, with Variable Importance Measure (VIM) as the feature selection method, is a practical approach to the early diagnosis of breast cancer


Summary

INTRODUCTION

Breast cancer is one of the most important problems in women's health and has become the malignant tumor with the highest incidence among women globally [1, 2]. A breast cancer diagnosis methodology that uses VIM for feature selection and Hierarchical Clustering Random Forest (HCRF) for classification is proposed. Experimental results show that classification based on the HCRF algorithm, with VIM as the feature selection method, is a practical approach to the early diagnosis of breast cancer. Hierarchical clustering is introduced to improve the diversity and classification ability of the decision trees in the random forest. The proposed method also has considerable reference value for designing structural diversity with other types of base learners or other ensemble learning algorithms. Assuming that the training set has N features {N1, N2, ..., NN}, the "Gini Index" is used to select the optimal partitioning feature at each node when constructing the decision trees in the random forest. Here K is the number of decision trees in the random forest and N is the number of input features in the training set.
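The Gini-based choice of a partitioning feature mentioned above can be illustrated with a minimal sketch. It assumes binary splits at midpoints between sorted feature values, which is the usual CART convention and not necessarily the exact procedure of the paper; the function names `gini`, `gini_index` and `best_feature` are hypothetical and chosen only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(values, labels, threshold):
    """Weighted Gini impurity after splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(X, y):
    """Return (feature index, threshold, Gini index) of the lowest-impurity split.

    Assumption: candidate thresholds are midpoints between consecutive
    distinct values of each feature.
    """
    best = None
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        distinct = sorted(set(column))
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2
            score = gini_index(column, y, threshold)
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Toy data (made up for illustration): feature 0 separates the classes cleanly.
X_toy = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y_toy = [0, 0, 1, 1]
print(best_feature(X_toy, y_toy))
```

In a random forest, this search would be run at each node over a random subset of the N features, not over all of them as in the sketch.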

HIERARCHICAL CLUSTERING RANDOM FOREST CLASSIFIER
12: Update the matrix Sim
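The tree-selection idea behind HCRF (measure similarity between trees, cluster them, keep one representative per cluster) can be sketched as follows. This is only an illustration under stated assumptions: each tree is represented by its predictions on a held-out validation set, similarity is taken to be the fraction of matching predictions, clusters are merged with average linkage, and the representative is the most accurate tree in each cluster. The paper's actual similarity measure and update rule for the matrix Sim may differ, and all function names here are hypothetical.

```python
from itertools import combinations


def agreement(p, q):
    """Assumed similarity: fraction of samples where two trees agree."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


def accuracy(pred, y):
    """Fraction of validation samples a tree classifies correctly."""
    return sum(a == b for a, b in zip(pred, y)) / len(y)


def cluster_trees(preds, n_clusters):
    """Agglomerative (average-linkage) clustering of tree indices."""
    # Pairwise similarity between individual trees, computed once.
    sim = {(i, j): agreement(preds[i], preds[j])
           for i, j in combinations(range(len(preds)), 2)}

    def linkage(a, b):
        # Average similarity over all cross-cluster pairs of trees.
        total = sum(sim[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(len(preds))]
    while len(clusters) > n_clusters:
        # Merge the two most similar clusters; a < b, so deleting b is safe.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def select_representatives(preds, y, n_clusters):
    """Keep the most accurate tree from each cluster."""
    clusters = cluster_trees(preds, n_clusters)
    return sorted(max(c, key=lambda i: accuracy(preds[i], y)) for c in clusters)


# Toy validation labels and per-tree predictions (made up for illustration).
y_val = [0, 1, 0, 1, 1, 0]
tree_preds = [
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
]
print(select_representatives(tree_preds, y_val, n_clusters=2))
```

The selected trees would then vote as an ordinary random forest. Reducing each cluster to a single tree lowers redundancy among similar trees, while picking the most accurate member keeps weak trees out of the final ensemble.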
CONCLUSION