Abstract

Background: Breast cancer is one of the leading causes of death among women in the United States and one of the most malignant cancers among women worldwide. Earlier, more accurate detection of breast cancer enables extended longevity at a reduced cost. Towards this, analyzing the available big data using tools such as machine learning-based decision support systems can improve the speed and accuracy of early detection of breast cancer. In this paper, we examined the prediction performance of various state-of-the-art machine learning models and of a decision support system based on these models that provides the predicted category along with a prediction confidence measure.

Methods: The machine learning (ML) algorithms applied include Decision Tree, Naïve Bayes, k-Nearest Neighbors (kNN), Support Vector Machine (SVM) and Logistic Regression. We also analyzed the effect of multiple feature selection approaches on the prediction performance. We used the Breast Cancer Wisconsin Dataset from Wisconsin Prognostic Breast Cancer (WPBC), which comprises 569 digitized images of fine needle aspirates (FNA) of breast masses, each described by 10 real-valued features. The performance of each ML model was evaluated using ten-fold cross-validation and also on a prediction set comprising 20% of the data, with the models trained on the remaining 80%. Sensitivity and specificity were used as the primary measures of performance.

Results: Among all five machine learning methods, SVM had the best performance. Except for kNN, the other three algorithms (Logistic Regression, Naïve Bayes and Decision Tree) performed close to SVM. The decision support system outperformed every individual ML model on cases where its prediction confidence was “High” or “Medium”.

Conclusion: We found that feature selection improved the performance and reduced the computation cost of all ML models. By building the ML-based decision support system with the optimal feature subset, the prediction performance for breast cancer can be improved to 96%, which means it can provide powerful assistance to doctors and patients. On the other hand, as the size of the data set increases, processing data with many features can increase the computation cost as well as the possibility of classification errors.

Key words: Breast cancer, Data analysis, Machine learning, Feature selection, Decision support system.
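
As an illustration of the two quantities the abstract relies on, the sketch below computes sensitivity and specificity from binary predictions and derives a confidence label from the agreement among several models. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the confidence measure is computed, so the agreement rule (all models agree is "High", all but one agree is "Medium", otherwise "Low"), the function names, and the encoding of malignant as 1 are all hypothetical.

```python
from collections import Counter


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Assumption: `positive` marks the malignant class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against an empty class to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def vote_with_confidence(model_predictions):
    """Majority vote over one sample's predictions from several models.

    The confidence thresholds are a hypothetical rule chosen for
    illustration; the source does not specify them.
    """
    n_models = len(model_predictions)
    label, votes = Counter(model_predictions).most_common(1)[0]
    if votes == n_models:
        confidence = "High"
    elif votes >= n_models - 1:
        confidence = "Medium"
    else:
        confidence = "Low"
    return label, confidence


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(sensitivity_specificity(y_true, y_pred))
    # One sample scored by five hypothetical models.
    print(vote_with_confidence([1, 1, 0, 1, 1]))
```

Restricting evaluation to samples labelled "High" or "Medium" by such a rule is one way a combined system could score better than any single model on that subset, at the cost of leaving the "Low" cases for manual review.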
