Abstract

Breast cancer is one of the most common cancers diagnosed in women. For preventive diagnosis, feature selection is an essential step in constructing a breast cancer classifier. The features of a real breast cancer dataset are usually a mix of discrete and continuous ones. In addition, the Area Under the Curve (AUC) of the receiver operating characteristic receives particular attention in this medical field. Existing research does not adequately account for both the hybrid nature of the features and this specific classification objective. We propose a wrapper method, i.e., an integrated framework in which Bayesian classifiers are embedded, for the feature selection of breast cancer datasets. To handle both feature types, we adopt the naive approach for the discrete features and kernel probability density estimation for the continuous ones, which leads to feature-type-aware hybrid Bayesian classifiers. The classifiers are fed with different feature subsets and evaluated with their AUC values as the fitness indexes. With the genetic algorithm, we can thus obtain a near-optimal feature subset whose corresponding classifier yields a good AUC. Moreover, the one-class F-score is used to help improve the convergence of the algorithm. Experiments are conducted on both the continuous Wisconsin diagnostic breast cancer dataset and a real breast cancer dataset for Chinese women. The results show that the proposed wrapper is feasible, accurate and efficient compared with related genetic algorithm based approaches.

Highlights

  • Breast cancer has become the most commonly diagnosed cancer in women, especially in women older than 40 [1], [2]

  • According to the latest statistics from the International Agency for Research on Cancer (IARC), an agency affiliated with the World Health Organization (WHO), the number of new breast cancer cases in women worldwide exceeds 2,080,000, accounting for 24.2% of cancers in women [3]

  • Karabatak [27] presented a new weighted naive Bayesian classifier to reinforce the effects of the crucial features, and the results showed that the classifier performed better than the traditional naive Bayesian classifier


Summary

INTRODUCTION

Breast cancer has become the most commonly diagnosed cancer in women, especially in women older than 40 [1], [2]. The fitness functions of the existing GA-based wrappers consider only the accuracy of the classifiers, e.g., the misclassification probability. As a result, characteristics that affect the minority class of real, lopsided datasets cannot be fully reflected. We propose a Bayesian classifier-embedded Integrated Generic-driven Framework for feature selection based on Kernel probability density estimation (BIG-F). We use AUC, one of the most typical metrics, as the fitness function so that it reflects both the overall accuracy and the diagnostic performance on the extremely lopsided and large real data collected from an epidemiological survey for breast cancer diagnosis. Experimental results show that the proposed algorithm performs well in feature selection for breast cancer data and is more efficient than the traditional GA-based approaches.
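The idea of an AUC-driven genetic wrapper around a hybrid Bayesian classifier can be illustrated with a small sketch. This is not the authors' BIG-F implementation: the function names, the fixed kernel bandwidth, the Laplace smoothing, the single hold-out validation split, and the simple truncation-selection GA are all assumptions made for illustration, and the one-class F-score step is omitted.

```python
import math
import random

def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kde_density(x, samples, bandwidth):
    """Gaussian kernel density estimate at x (bandwidth is an assumed constant)."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

def posterior_score(row, train_X, train_y, subset, is_discrete, bandwidth=0.5):
    """Log-odds of class 1 versus class 0 under conditional independence."""
    log_odds = 0.0
    for label, sign in ((1, 1.0), (0, -1.0)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        log_p = math.log(len(rows) / len(train_X))  # class prior
        for j in subset:
            col = [r[j] for r in rows]
            if is_discrete[j]:
                # Discrete feature: Laplace-smoothed relative frequency.
                levels = len({r[j] for r in train_X})
                p = (col.count(row[j]) + 1) / (len(col) + levels)
            else:
                # Continuous feature: kernel density estimate.
                p = kde_density(row[j], col, bandwidth)
            log_p += math.log(max(p, 1e-300))  # guard against log(0)
        log_odds += sign * log_p
    return log_odds

def fitness(mask, train, valid, is_discrete):
    """AUC on the validation split for the feature subset encoded by a 0/1 mask."""
    subset = [j for j, bit in enumerate(mask) if bit]
    if not subset:
        return 0.0  # an empty subset cannot classify anything
    (train_X, train_y), (valid_X, valid_y) = train, valid
    scores = [posterior_score(r, train_X, train_y, subset, is_discrete) for r in valid_X]
    return auc_score(valid_y, scores)

def select_features(train, valid, is_discrete, pop_size=12, generations=15,
                    p_mut=0.1, seed=0):
    """Toy GA: keep the better half, refill by one-point crossover and bit-flip mutation.

    Requires at least two features. Fitness is recomputed on every comparison,
    which is wasteful but keeps the sketch short.
    """
    rng = random.Random(seed)
    n = len(is_discrete)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    score = lambda m: fitness(m, train, valid, is_discrete)
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = elite + children
    best = max(pop, key=score)
    return best, score(best)

# Hypothetical toy data: feature 0 is continuous and informative,
# feature 1 is discrete and carries no class information.
train = ([[0.0, 0], [0.2, 1], [0.4, 0], [3.0, 0], [3.2, 1], [3.4, 0]],
         [0, 0, 0, 1, 1, 1])
valid = ([[0.1, 0], [0.3, 1], [3.1, 0], [3.3, 1]], [0, 0, 1, 1])
is_discrete = [False, True]
best_mask, best_auc = select_features(train, valid, is_discrete)
```

Scoring each candidate subset by AUC rather than plain accuracy is what lets the search reward subsets that rank the minority class well, which is the motivation stated above for lopsided data.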

RELATED WORKS
FEATURE SELECTION FRAMEWORK
KERNEL-BASED BAYESIAN CLASSIFIER
PROBLEM FORMULATION
ONE-CLASS F-SCORE
EXPERIMENTS WITH WISCONSIN DIAGNOSTIC BREAST CANCER DATASET
FEASIBILITY AND ACCURACY OF THE ALGORITHM
CONCLUSION