Abstract

Machine learning methods are increasingly applied to medical data analysis to reduce human effort and improve our understanding of disease propagation. When the data are complicated and unstructured, shallow learning methods may not be suitable or feasible. Deep neural networks such as the multilayer perceptron (MLP) and the convolutional neural network (CNN) have been incorporated into medical diagnosis and prognosis to improve health care practice. For a binary outcome, these learning methods directly output predicted probabilities for a patient's health condition. Investigators still need to choose an appropriate decision threshold to split the predicted probabilities into positive and negative regions. We review methods for selecting the cut-off value, including relatively automatic methods that optimize a criterion on the ROC curve and utility-based methods that use a net benefit curve. In particular, decision curve analysis (DCA) is now acknowledged in medical studies as a good complement to ROC analysis for decision making. In this paper, we provide R code to illustrate how to fit the statistical learning methods, select a decision threshold to yield the binary prediction, and evaluate the accuracy of the resulting classification. This article will help medical decision makers understand the different classification methods and use them in real-world scenarios.
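The two families of cut-off selection named above can be sketched briefly. The paper itself supplies R code; the Python below is only an illustrative stand-in, not the authors' implementation. The toy labels and probabilities are made up, the function names (`confusion_counts`, `youden_threshold`, `net_benefit`) are hypothetical, and Youden's J is assumed as one example of an ROC-based criterion. The net benefit follows the standard decision curve definition, TP/n - (FP/n) * pt/(1 - pt), where pt is the threshold probability.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when predicting positive if prob >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def youden_threshold(y_true, y_prob):
    """Return the cut-off maximizing J = sensitivity + specificity - 1.

    Candidate cut-offs are the observed probabilities; ties keep the
    smallest cut-off. Assumes both classes are present in y_true.
    """
    best_t, best_j = None, float("-inf")
    for t in sorted(set(y_prob)):
        tp, fp, tn, fn = confusion_counts(y_true, y_prob, t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def net_benefit(y_true, y_prob, pt):
    """Net benefit of the model at threshold probability pt (0 < pt < 1)."""
    tp, fp, _, _ = confusion_counts(y_true, y_prob, pt)
    n = len(y_true)
    return tp / n - (fp / n) * (pt / (1 - pt))


# Made-up toy data: 1 = disease present, 0 = disease absent.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.65, 0.70, 0.80, 0.90]

cutoff, j_stat = youden_threshold(y_true, y_prob)
print("Youden cut-off:", cutoff, "J:", round(j_stat, 3))

# A decision curve evaluates net benefit over a range of threshold probabilities.
for pt in (0.2, 0.5, 0.8):
    print("pt =", pt, "net benefit =", round(net_benefit(y_true, y_prob, pt), 3))
```

The ROC-based choice yields a single cut-off from the data alone, whereas the net benefit is read across threshold probabilities that reflect how the decision maker weighs a false positive against a false negative.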

Highlights

  • Data science has expanded quickly due to the increase in data storage capacities and exploration of computational technologies and algorithms

  • We review methods to select the cut-off values, including the relatively automatic methods based on optimization of the Receiver Operating Characteristic (ROC) curve criteria and the utility-based methods with a net benefit curve

  • While the Pima Indian diabetes data allow shallow learning, we focus on a case study with deep learning


Introduction

Data science has expanded quickly due to the increase in data storage capacities and the exploration of computational technologies and algorithms. Data mining techniques help extract significant information from patient health data and make promising predictions. When data are in a standard format, e.g., accessible via an Excel sheet, most shallow learning tools can be readily applied, including the familiar logistic regression and classification trees. These methods are traditionally covered in the curriculum of most graduate programs in statistics and biostatistics. On the other hand, when the data become complicated, we may ...

Received May 10, 2021; Accepted August 18, 2021
