A Comparative Analysis of Data Mining Techniques on Breast Cancer Diagnosis Data using WEKA Toolbox

Majdah Alshammari, Mohammad Mezher

doi:10.14569/ijacsa.2020.0110829

Abstract

Breast cancer is considered the second most common cancer in women. It is fatal in less than half of all cases and is the main cause of mortality in women. It accounts for 16% of all cancer mortalities worldwide. Early diagnosis of breast cancer increases the chance of recovery, and data mining techniques can support such early diagnosis. In this paper, an academic experimental breast cancer dataset is used to perform a practical data mining experiment with the Waikato Environment for Knowledge Analysis (WEKA) tool. The WEKA Java application provides a rich set of performance metrics for evaluating experiments. Pre-processing and feature extraction are used to optimize the data. The classification process used in this study was summarized through thirteen experiments, ten of which used different classification algorithms: Naive Bayes, Logistic Regression, Lazy IBK (Instance-Based learning with parameter K), Lazy Kstar, Lazy Locally Weighted Learner, Rules ZeroR, Decision Stump, Decision Trees J48, Random Forest and Random Trees. The process of producing a predictive model was automated using classification accuracy. Further, several classification experiments on the Wisconsin Diagnostic Breast Cancer and Wisconsin Breast Cancer datasets were conducted to compare the success rates of the different methods. The results show that the Lazy IBK (k-NN) classifier can achieve 98% accuracy compared with the other classifiers. The main advantages of the study are its compact comparison of 13 data mining models across 10 performance measurements and its plots of classification errors.

Highlights

Worldwide, breast cancer has become one of the most common cancers [1]
The classification process was summarized through 13 experiments: three using the Bayes Net algorithm with three different search mechanisms, and ten using the classification algorithms Naïve Bayes (NB), Logistic Regression, Lazy IBK (Instance-Based learning with parameter K), Lazy Kstar, Lazy Locally Weighted Learner (LWL), Rules ZeroR, Decision Stump, Decision Trees J48, Random Forest, and Random Trees. The aim was to create a predictive model that can be tested with new records, to obtain its classification accuracy, and to compare the results of the different algorithms against the slow IBK algorithm, k-Nearest Neighbors (k-NN)
The results showed that classification algorithms performed better than clustering: among the tested algorithms, Support Vector Machine (SVM) and Decision Trees (DT) (C5.0) had the best accuracy, 81%, while Fuzzy c-means had the lowest, 37%
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol 11, No 8, 2020, www.ijacsa.thesai.org

Summary

INTRODUCTION

Breast cancer has become one of the most common cancers [1]. It originates in the area of the breast tissue that has a concentration of milk ducts. The rapid spread of breast cancer and the inability to accurately diagnose and recognize its presence represent a challenge for researchers and developers in biomedical engineering [2]. This challenge motivates the deployment of new data mining techniques. This paper introduces a data mining-based experiment on breast cancer classification using different types of classifiers. Besides identifying the classifier model that yields the highest classification accuracy on the predefined dataset used in this study, the data mining process applies pre-processing operations to, and extracts features from, the specified data records using WEKA. The literature shows many efforts to apply data mining to breast cancer datasets; however, previous studies lack a comparison in WEKA across different parameter values and attributes.
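To make the kind of comparison described above concrete, the following is a minimal sketch in plain Python of two of the classifier families named in this paper: an instance-based k-NN classifier (the idea behind WEKA's IBK) and the ZeroR majority-class baseline, both scored by classification accuracy. It is not the WEKA implementation and not the authors' experimental setup. The function names, the two features per record, and all data values are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    # Sort training records by Euclidean distance to the query, keep the k closest.
    neighbours = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def zero_r_predict(train):
    """ZeroR baseline: always predict the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy records (made-up feature values, not the Wisconsin data).
TRAIN = [
    ((1.0, 1.2), "benign"),
    ((1.5, 0.8), "benign"),
    ((0.9, 1.0), "benign"),
    ((1.2, 1.4), "benign"),
    ((4.0, 4.2), "malignant"),
    ((4.5, 3.9), "malignant"),
    ((3.8, 4.4), "malignant"),
]
TEST = [
    ((1.1, 1.1), "benign"),
    ((4.2, 4.0), "malignant"),
    ((3.9, 4.1), "malignant"),
    ((1.3, 0.9), "benign"),
]

test_labels = [label for _, label in TEST]
knn_preds = [knn_predict(TRAIN, features, k=3) for features, _ in TEST]
baseline_label = zero_r_predict(TRAIN)
zero_r_preds = [baseline_label] * len(TEST)

knn_acc = accuracy(knn_preds, test_labels)
zero_r_acc = accuracy(zero_r_preds, test_labels)
print(f"k-NN accuracy:  {knn_acc:.2f}")
print(f"ZeroR accuracy: {zero_r_acc:.2f}")
```

ZeroR ignores the features entirely, so it serves as a floor against which the other classifiers can be judged. The sketch uses a single train/test split for brevity, which is an assumption of this example rather than the paper's evaluation protocol.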

RELATED WORK

CLASSIFICATION ALGORITHM

METHODOLOGY

COMPARISON WITH OTHER ALGORITHMS

Findings

CONCLUSIONS