Abstract

Breast cancer is the second most commonly diagnosed cancer in women worldwide. Its incidence is rising, especially in developing countries, where most cases are discovered late. Breast cancer develops when cells in the breast grow abnormally and form cancerous tumors. The absence of accurate prognostic models to help physicians recognize symptoms early makes it difficult to develop treatment plans that would help patients live longer. Machine learning techniques have recently been used to improve the accuracy and speed of breast cancer diagnosis, and the more accurate the model, the more useful it is as a diagnostic aid. Nevertheless, the primary difficulty for machine-learning-based detection systems is attaining the highest classification accuracy and selecting the most predictive features. As a result, breast cancer prognosis remains a challenge. This research addresses a weakness of existing techniques, which struggle to classify continuous-valued data accurately and to select optimal features for breast cancer prediction. To that end, the study examines the impact of outlier removal and feature reduction on the Wisconsin Diagnostic Breast Cancer Dataset, evaluated with seven machine learning algorithms. The results show that, once outliers are removed from the dataset, the Logistic Regression, Random Forest, and AdaBoost classifiers achieve the highest accuracy, 99.12%. Applying feature selection to this filtered dataset yields the highest accuracies of 100% and 99.12% with the Random Forest and Gradient Boost classifiers, respectively. Both proposed strategies outperformed the unfiltered data in terms of accuracy and were also compared with other state-of-the-art approaches. The proposed architecture could be a useful tool for radiologists to reduce the number of false negatives and false positives, thereby improving the efficiency of breast cancer diagnosis.

Highlights

  • Data Pre-Processing: Cleaning and transforming the dataset is a necessary pre-processing step before applying machine learning (ML) algorithms

  • Breast cancer is the second most commonly diagnosed cancer in women throughout the world

  • The instances remaining after outlier removal were subjected to Pearson Correlation Feature Selection, which selected 10 features; this combination is known as the OCFS approach


Summary

Data Pre-Processing

Cleaning and transforming the dataset is a necessary pre-processing step before applying ML algorithms. The performance and accuracy of a predictive model depend not only on the algorithms used but also on the quality of the dataset and its pre-processing. The pre-processing phases used in this investigation are described in the sections that follow. The comparison covers the unfiltered (conventional technique), filtered (outliers approach), and Outliers Correlation Feature Selection (OCFS) datasets. The findings were assessed and compared using seven classifiers: Logistic Regression (LR), K-Nearest Neighbor (KNN), Support Vector Machines (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boost (GB), and AdaBoost (AB).
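The three-way comparison can be illustrated with a minimal sketch in Python using scikit-learn, whose `load_breast_cancer` loader ships the Wisconsin Diagnostic Breast Cancer data. Several details are assumptions made only for illustration, since this summary does not state them: the outlier rule (a z-score cut-off of 3), ranking features by absolute Pearson correlation with the label, the 80/20 stratified split, and default hyperparameters. The helper names `remove_outliers`, `top_correlated_features`, and `evaluate` are hypothetical, and the sketch is not expected to reproduce the reported accuracies.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def remove_outliers(X, y, z_max=3.0):
    """Drop rows where any feature is more than z_max standard deviations
    from its mean. ASSUMPTION: the paper's outlier rule is not given here."""
    z = (X - X.mean()) / X.std(ddof=0)
    keep = (z.abs() <= z_max).all(axis=1)
    return X[keep], y[keep]


def top_correlated_features(X, y, k=10):
    """Return the k feature names with the largest absolute Pearson
    correlation with the label. ASSUMPTION: ranking against the label."""
    corr = X.corrwith(y).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()


def make_classifiers(seed=0):
    """The seven classifier families named in the text, default settings.
    Scale-sensitive models are wrapped with a StandardScaler."""
    return {
        "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "DT": DecisionTreeClassifier(random_state=seed),
        "RF": RandomForestClassifier(random_state=seed),
        "GB": GradientBoostingClassifier(random_state=seed),
        "AB": AdaBoostClassifier(random_state=seed),
    }


def evaluate(X, y, k=None, seed=0):
    """Hold-out accuracy per classifier; if k is set, keep only the k
    features selected on the training split (avoids test-set leakage)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    if k is not None:
        cols = top_correlated_features(X_tr, y_tr, k)
        X_tr, X_te = X_tr[cols], X_te[cols]
    return {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
            for name, clf in make_classifiers(seed).items()}


X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_f, y_f = remove_outliers(X, y)

results = {
    "unfiltered": evaluate(X, y),         # conventional technique
    "filtered": evaluate(X_f, y_f),       # outliers approach
    "OCFS": evaluate(X_f, y_f, k=10),     # outliers + correlation selection
}
print(pd.DataFrame(results).round(4))
```

The printed table has one row per classifier and one column per dataset variant. Selecting features on the training split only is a design choice of this sketch; the accuracies it prints depend on the split and on the assumed outlier rule.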

Missing Values Checking
Encoding data
Dimension Reduction
Linear regression
Performance evaluation metric
Random Forest
Result and Discussion
Comparison between Different Machine Learning Algorithms Based on Accuracy
Findings
Conclusion