Computer Aided Breast Cancer Detection Using Ensembling of Texture and Statistical Image Features.

Soumya Deep Roy,Ram Sarkar,Friedhelm Schwenker,Devroop Kar,Soham Das

doi:10.3390/s21113628

Soumya Deep Roy, Ram Sarkar + Show 3 more

Open Access

https://doi.org/10.3390/s21113628

Copy DOI

Abstract

Breast cancer, like most forms of cancer, is a fatal disease that claims more than half a million lives every year. In 2020, breast cancer overtook lung cancer as the most commonly diagnosed form of cancer. Though extremely deadly, the survival rate and longevity increase substantially with early detection and diagnosis. The treatment protocol also varies with the stage of breast cancer. Diagnosis is typically done using histopathological slides from which it is possible to determine whether the tissue is in the Ductal Carcinoma In Situ (DCIS) stage, in which the cancerous cells have not spread into the encompassing breast tissue, or in the Invasive Ductal Carcinoma (IDC) stage, wherein the cells have penetrated into the neighboring tissues. IDC detection is extremely time-consuming and challenging for physicians. Hence, this can be modeled as an image classification task where pattern recognition and machine learning can be used to aid doctors and medical practitioners in making such crucial decisions. In the present paper, we use an IDC Breast Cancer dataset that contains 277,524 images (with 78,786 IDC positive images and 198,738 IDC negative images) to classify the images into IDC(+) and IDC(-). To that end, we use feature extractors, including textural features, such as SIFT, SURF and ORB, and statistical features, such as Haralick texture features. These features are then combined to yield a dataset of 782 features. These features are ensembled by stacking using various Machine Learning classifiers, such as Random Forest, Extra Trees, XGBoost, AdaBoost, CatBoost and Multi Layer Perceptron followed by feature selection using Pearson Correlation Coefficient to yield a dataset with four features that are then used for classification. From our experimental results, we found that CatBoost yielded the highest accuracy (92.55%), which is at par with other state-of-the-art results—most of which employ Deep Learning architectures. The source code is available in the GitHub repository.

Highlights

With the widespread digitization of health records, computer aided disease detection (CADD) systems that employ data mining and Machine Learning (ML) techniques have become increasingly commonplace
Considering the monstrosity of the disease, it comes as little surprise that the earliest efforts in CADD [1] started with mammography for the detection of breast cancer and were later extended to other types of cancer as well
It is possible to determine whether the tissue is in the Ductal Carcinoma In Situ (DCIS) stage or in the Invasive Ductal Carcinoma (IDC) stage from the slide

Summary

Introduction

With the widespread digitization of health records, computer aided disease detection (CADD) systems that employ data mining and Machine Learning (ML) techniques have become increasingly commonplace. Considering the monstrosity of the disease, it comes as little surprise that the earliest efforts in CADD [1] started with mammography for the detection of breast cancer and were later extended to other types of cancer as well. As is the case in most types of cancer, the treatment of breast cancer is greatly aided by early detection but involves surgical or intensive medical procedures if not diagnosed early. Since treatment depends on the stage of cancer, one of the preliminary tasks of any pathologist involves a visual analysis of a histopathological slide stained with hematoxylin and eosin (H&E). It is possible to determine whether the tissue is in the Ductal Carcinoma

Methods

Results

Conclusion