Breast cancer is a widespread and serious disease that threatens women's health worldwide and is a major contributor to mortality. Machine learning tools play a critical role in both its early detection and its effective management. Feature selection (FS) methods identify the most informative features for breast cancer diagnosis: they reduce data dimensionality, eliminate irrelevant information, enhance learning accuracy, and improve the comprehensibility of results. However, the increasing complexity and dimensionality of cancer data challenge many existing FS methods, reducing their efficiency and effectiveness. Nature-inspired optimization (NIO) algorithms, which mimic natural processes to solve complex optimization problems efficiently, have been shown in numerous studies to succeed across various domains and offer a way to overcome these challenges. Building on these advances, we propose an approach that combines NIO-based feature selection with a soft voting classifier. The NIO techniques employed include the Genetic Algorithm, Cuckoo Search, Salp Swarm, Jaya, Flower Pollination, Whale Optimization, Sine Cosine, Harris Hawks, and Grey Wolf Optimization algorithms. The soft voting classifier integrates various machine learning models, including Support Vector Machines, Gaussian Naive Bayes, Logistic Regression, Decision Tree, and Gradient Boosting, to improve the effectiveness and accuracy of breast cancer diagnosis. The proposed approach is evaluated empirically using F1 score, precision, recall, accuracy, and Area Under the Curve (AUC), and its performance is compared with that of the individual machine learning techniques.
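To illustrate the soft voting step, the following is a minimal sketch, not the authors' implementation. It assumes that each base model outputs a probability for each class and that the ensemble averages these probabilities with equal weights before picking the most probable class. The function names `soft_vote` and `hard_vote`, the class ordering (0 = benign, 1 = malignant), and the probability values are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def soft_vote(
    probabilities: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[int, List[float]]:
    """Average per-class probabilities across models and return
    (predicted class index, averaged probabilities)."""
    if weights is None:
        weights = [1.0] * len(probabilities)  # assumption: equal weights
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total_weight
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


def hard_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Majority vote over each model's own most probable class."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical outputs of five base models for one sample,
# given as [P(benign), P(malignant)].
model_probs = [
    [0.60, 0.40],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.52, 0.48],
    [0.20, 0.80],
]

soft_label, averaged = soft_vote(model_probs)
hard_label = hard_vote(model_probs)
print("soft vote:", soft_label, averaged)
print("hard vote:", hard_label)
```

In this made-up case three of the five models lean weakly towards benign, so a majority vote returns class 0, while the averaged probabilities favour class 1 because the two dissenting models are far more confident. This sensitivity to confidence is the usual motivation for soft over hard voting.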
The results demonstrate that the soft voting ensemble, particularly when combined with Jaya-based feature selection, outperforms all individual classifiers on the breast cancer dataset. It achieves the highest accuracy, precision, recall, F1 score, and AUC, with values of 99.6%, 99.21%, 100%, 99.6%, and 99.6%, respectively. Feature selection based on the Genetic Algorithm and on Salp Swarm follows closely, with each scoring 99.5%, 100%, 99%, 99.5%, and 99.5% on the same measures, respectively. Furthermore, the approach's effectiveness in identifying the minimal set of relevant features has been evaluated. Given the critical importance of selecting relevant features in complex cancer datasets, the proposed method is a valuable tool for early detection and improved interpretability. It can also support healthcare professionals in making well-informed decisions about cancer diagnosis and treatment strategies.
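As a quick consistency check on the reported figures, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall values quoted above. The helper name `f1_from_precision_recall` is hypothetical, and the calculation assumes the reported percentages are rounded values of the standard binary-classification metrics.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, as fractions.
f1_jaya = f1_from_precision_recall(0.9921, 1.00)       # Jaya-based FS
f1_ga_salp = f1_from_precision_recall(1.00, 0.99)      # GA and Salp Swarm FS

print(f"Jaya:              F1 = {100 * f1_jaya:.1f}%")
print(f"GA and Salp Swarm: F1 = {100 * f1_ga_salp:.1f}%")
```

Both recomputed values round to the reported F1 scores (99.6% and 99.5%), so the stated precision, recall, and F1 figures are mutually consistent.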