Abstract

Automated machine learning (AutoML) has been recognized as a powerful tool for building systems that automate the design of machine learning (ML) pipelines and optimize model selection. In this study, we present the tree-based pipeline optimization tool (TPOT) as a method for identifying ML pipelines for breast cancer diagnosis that combine strong performance with low complexity. Pre-processors and ML models are represented as expression trees, and pipelines are optimized with genetic programming (GP), a stochastic search technique. Radiomics features guided the TPOT-based selection of ML pipelines for the breast cancer data set. Breast cancer data were used to compare the TPOT-generated ML pipelines with selected ML classifiers optimized by grid search (GS). The pipeline combining principal component analysis (PCA) with a random forest (RF) classifier proved to be the most reliable and had the lowest complexity. TPOT model selection outperformed GS optimization. Combined with only two pre-processors, the RF classifier performed best among the models, with a precision of 0.83. By comparison, the GS-optimized support vector machine (SVM) classifier differed by 12%, while the other two classifiers, naïve Bayes (NB) and artificial neural network-multilayer perceptron (ANN-MLP), differed by almost 39%. Performance was evaluated using sensitivity, specificity, accuracy, precision, and receiver operating characteristic (ROC) curve analysis.
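For readers unfamiliar with the evaluation measures named above, the following minimal sketch shows how sensitivity, specificity, accuracy, and precision are conventionally derived from the counts of a binary confusion matrix. The function name `diagnostic_metrics` and the example counts are hypothetical and chosen for illustration only; they are not results from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: malignant cases classified correctly/incorrectly (assuming
    "malignant" is the positive class); tn/fp: benign cases classified
    correctly/incorrectly. Assumes every denominator is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate (recall)
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),              # positive predictive value
    }


# Hypothetical counts, for illustration only (not data from the study).
example = diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

ROC analysis extends these quantities by tracing sensitivity against (1 - specificity) as the decision threshold of the classifier is varied.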

Highlights

  • Breast cancer has been recorded as the most frequently diagnosed type of cancer among women

  • The results showed that naïve Bayes (NB)-grid search (GS) performed worst, below support vector machine (SVM)-GS and artificial neural network-multilayer perceptron (ANN-MLP)-GS

  • We showed that the default tree-based pipeline optimization tool (TPOT) model for the selected data set produced classification pipelines that exceeded the performance of the controlled TPOT configuration-based model and grid search optimization-based model
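The kind of comparison described in these highlights can be sketched with scikit-learn. This is an illustrative sketch under stated assumptions, not the authors' code: it uses scikit-learn's bundled Wisconsin breast cancer data set rather than the radiomics data from the study, it builds a PCA + RF pipeline by hand instead of running TPOT's genetic programming search, and the scaler, number of components, and SVM parameter grid are all arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's built-in data set stands in for the study's data.
# In this data set, label 1 means benign and label 0 means malignant.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hand-built pipeline: two pre-processors (scaling, PCA) followed by RF.
# The specific pre-processors and settings are assumptions for illustration.
pca_rf = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pca_rf.fit(X_train, y_train)
rf_precision = precision_score(y_test, pca_rf.predict(X_test))

# Grid-search baseline: an SVM tuned over a small, arbitrary parameter grid.
svm_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
    cv=5,
    scoring="precision",
)
svm_search.fit(X_train, y_train)
svm_precision = precision_score(y_test, svm_search.predict(X_test))

print(f"PCA + RF precision:       {rf_precision:.3f}")
print(f"Grid-search SVM precision: {svm_precision:.3f}")
print(f"Best SVM parameters:       {svm_search.best_params_}")
```

The numbers this prints depend on the data set and settings assumed here and should not be compared with the figures reported in the study. In a TPOT workflow, the hand-built pipeline would be replaced by one discovered automatically through the genetic programming search.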


Summary

Introduction

Breast cancer has been recorded as the most frequently diagnosed type of cancer among women. Imaging techniques and assisted cancer diagnosis approaches have been extensively developed to detect and treat breast cancer early and thereby reduce mortality rates [1]. Data mining and computer-aided techniques for detecting and classifying breast cancer involve several stages: pre-processing, feature extraction, and classification [2,3,4]. Feature extraction is highly important in breast cancer detection because it helps to differentiate benign from malignant tumors. Although health researchers are well acquainted with clinical data, they often lack the ML expertise needed to apply these techniques to big data sources. The interaction between data scientists and healthcare researchers requires a large amount of time and effort from both sides.

