Abstract

Healthcare data collected in electronic health record systems is expected to increase by 36% per year from 2018 to 2025.1 While many traditional analytical methods have remained standards in the statistician's toolbox (e.g., χ² tests, linear and logistic regression), these techniques were not developed for this volume or complexity of data. Significant advances in computing power have opened doors to analyzing big data in new ways. These newer analytic methods (e.g., machine learning [ML] techniques) aim to develop models that capture complex relationships in data and rely heavily on novel applications of statistical concepts. What is ML? The Oxford English Dictionary defines it as “the capacity of computers to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and infer from patterns in data.”2 ML methods have already changed the face of several industries. Banks have become more adept at fraud detection, using neural network-based prediction models to improve detection rates and decrease the “false alarms” that used to plague the industry. Netflix uses ML-based algorithms to personalize suggested content for users and rapidly adapt to user preferences. There is also a fair chance that many readers of this article found it using ML-based Google searches. ML, despite its great promise, should still be considered a complement to, rather than a replacement for, traditional statistical methods. Such was the case for a recent report in the Journal of Hospital Medicine from Yaeger et al.,3 who developed a predictive model to detect invasive bacterial infections in febrile infants using traditional logistic regression and an ML model (a super learner) and compared their performance. In this Methods Progress Note, we aim to describe several classes of ML techniques that hold promise for performing research with large, complex data sets.
We will compare these methods with traditional statistical methods and provide some guidance for interpretation for the general clinician. Models describe how a dependent variable (DV; a.k.a. response or outcome variable) depends on changes in the values of the independent variables (IVs; a.k.a. explanatory variables). They can also be used to predict the DV for new data based on the values of the IVs. The primary purpose of the model should be considered when selecting between traditional statistical and ML models. Table 1 details the strengths and weaknesses of traditional statistical models compared to ML methods. Traditional statistical models typically provide information regarding how the IVs are related to the DV, whereas ML methods focus less on explaining these relationships and prioritize improved prediction for future data.4 Linear and logistic regression are two common statistical models used in medical research. Linear regression is a statistical model used to predict the value of a continuous DV using any number of IVs. However, several assumptions must be met for linear regression to produce reliable estimates: normality (the model residuals [the differences between the observed values and the model-predicted values] must be normally distributed), homoskedasticity (the model residuals must have constant variance, or spread, across the prediction range), independence (the observations, and therefore the residuals, must be independent of one another), and linearity (the DV must have a linear relationship with the IVs). Logistic regression seeks to predict a categorical variable (e.g., mortality) rather than a continuous one (e.g., hospital length of stay). It uses a logistic (S-shaped) curve to calculate the probability of the DV falling into a category and estimates a coefficient for each IV.5 As with linear regression, the coefficients have a practical interpretation regarding the relationship between the DV and the IVs.
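To make the coefficient interpretation concrete, the following is a minimal sketch in Python using scikit-learn on simulated data. The variable names (age, comorbidity count, length of stay, mortality) and effect sizes are invented for illustration and are not drawn from any study discussed here.

```python
# Illustrative sketch only: synthetic data with known "true" effects, so the
# fitted coefficients can be compared against the values used to generate it.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(60, 15, n)          # IV: simulated patient age (years)
comorbidities = rng.poisson(2, n)    # IV: simulated comorbidity count

# Linear regression: continuous DV (length of stay, in days).
# True model: each comorbidity adds 0.8 days on average.
los = 2.0 + 0.05 * age + 0.8 * comorbidities + rng.normal(0, 1, n)
X = np.column_stack([age, comorbidities])
lin = LinearRegression().fit(X, los)
print(lin.coef_)  # each coefficient: change in LOS per one-unit change in IV

# Logistic regression: binary DV (in-hospital mortality).
logit = -6.0 + 0.06 * age + 0.4 * comorbidities
p = 1 / (1 + np.exp(-logit))         # the logistic (S-shaped) curve
died = rng.binomial(1, p)
log_model = LogisticRegression().fit(X, died)
print(np.exp(log_model.coef_))       # exponentiated coefficients ≈ odds ratios
```

Because the data are simulated with known effects, the fitted linear coefficient for comorbidities lands near 0.8, and the exponentiated logistic coefficients can be read as approximate odds ratios.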
One of the main drawbacks of traditional models is that a human must prespecify the form of the relationships. In other words, the IVs to be included, the interactions between the IVs, and how they are related to the DV must all be specified in the model. While there are tools to help with this process (e.g., stepwise variable selection), it is subject to bias. ML methods make no underlying assumptions about the data or the form of the relationship between the IVs and the DV, which has the potential to remove biases and find relationships between variables not apparent with traditional methods. In addition, these methods can continue to improve and “learn” from additional data and are more adaptable to new kinds of data. There are two main categories of ML methods: supervised and unsupervised. Table 2 describes several common ML methods. Supervised methods are more like traditional methods in that they include a DV (e.g., length of stay, cost, mortality). Unsupervised methods are unique in that there is no DV; instead, observations are grouped together based on mathematical distance in multiple dimensions, or “similarity” in the variables.6 Additionally, supervised methods typically make use of a training data set to build the model and a separate validation data set to test the model's performance, while unsupervised methods do not. There are several common ML methods used in healthcare research. Support vector machines (SVMs) are powerful tools for classification prediction.7 SVMs attempt to find the best dividing line (or, in multiple dimensions, hyperplane) by maximizing the distance between the categorical observations to predict a binary outcome, such as malignancy or mortality. Classification and regression trees (CARTs) are commonly used ML methods in current healthcare research.8 CART models recursively split the data into two groups (known as child nodes) based on the IV that does the best job of creating “purity” in the two child nodes.
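The supervised workflow described above (train on one data set, test on a held-out set) can be sketched with scikit-learn's SVM and decision tree classifiers. The data here are synthetic, with a simple linear boundary built in, purely to show the mechanics.

```python
# Illustrative sketch: two supervised classifiers on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 2))              # two standardized IVs
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # binary DV with a linear boundary

# Supervised methods build the model on training data and evaluate it on a
# separate held-out set, as described in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

svm = SVC(kernel="linear").fit(X_tr, y_tr)                  # maximum-margin divider
tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)  # recursive binary splits

print("SVM accuracy:", svm.score(X_te, y_te))
print("Tree accuracy:", tree.score(X_te, y_te))
```

Because the true boundary is linear, the SVM's dividing line recovers it almost exactly, while the shallow tree must approximate the diagonal boundary with axis-aligned splits and scores somewhat lower.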
In classification trees (i.e., the DV is binary), purity means that as many observations as possible with one level of the DV are placed in one child node and as few as possible in the other. In regression trees (i.e., the DV is continuous), the means of the two child nodes are made as different as possible. CART searches through the IVs to find the best split and then repeats the process within each of the two resulting child nodes until told to stop. Random forests and gradient-boosted machines are extensions of CART known as ensemble methods; they involve resampling the data and fitting hundreds to thousands of individual decision trees, which are then combined into a model that provides more accurate predictions than any individual tree. Overall, these methods are adaptable, flexible, and retain some measure of interpretability. Neural networks are the most adaptable to large amounts of data and have high predictive power. However, they are also the least interpretable models and are often referred to as “black boxes” because they contain hidden layers that create complex interrelationships between the IVs and the DV that may make sense mathematically but have no clinical meaning. Finally, there are additional ensemble models like the super learner used by Yaeger et al.3 The super learner builds an ML model using training data, cross-validates the model using resampling methods for optimization, and stores the results. The algorithm then creates another ML model and processes it in the same way. The algorithm next selects either the best-performing individual model or a mixture of models. This technique holds promise as new ML methods are developed. However, because each model is only tested on the training data set, there is a risk of overfitting, where the selected model performs well on training data but has lower performance when applied to a new data set.
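The two ideas above, ensembling and overfitting, can be shown together in a short sketch. On noisy synthetic data, a single unconstrained decision tree memorizes its training set (perfect training accuracy) yet generalizes poorly, while a random forest built from hundreds of resampled trees holds up better on held-out data. The data and settings are illustrative assumptions, not from any cited study.

```python
# Illustrative sketch: ensemble methods vs. a single overfit decision tree.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(800, 10))
# Noisy binary DV: only the first two IVs carry signal.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 800) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # grows until pure
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("tree   train/test:", deep_tree.score(X_tr, y_tr), deep_tree.score(X_te, y_te))
print("forest train/test:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```

The single tree's perfect training score paired with a much lower test score is exactly the overfitting pattern described above; averaging many resampled trees smooths away the memorized noise.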
Hierarchical and K-means clustering are unsupervised ML methods that attempt to group observations together based on similarity across a large number of variables. Agglomerative hierarchical clustering starts with each observation as its own cluster and then successively merges the most similar clusters, whereas K-means clustering assigns each observation to the nearest of a prespecified number of cluster centers. These methods can suggest how many unique/distinct groups there are in the data and provide measures of how similar members are within a group and how different groups are from each other. They are often used to discover novel disease phenotypes.9, 10 For example, Seymour et al. used K-means clustering to identify four novel phenotypes of adult sepsis patients that had different laboratory marker patterns as well as different outcomes, including overall mortality.11 These methods are also useful for grouping similar hospitals together based on their characteristics or utilization patterns.12 With the gamut of statistical methods available, how should clinician researchers decide which to use? The first consideration should be the research question the study is attempting to answer. If quantifying relationships is important, such as, “Is vasopressin use associated with a reduced risk of mortality?” then traditional statistical methods may be the most appropriate, as they will provide the most information regarding the effect of vasopressin on mortality. If the question is, “Can I use administrative health data to accurately predict a patient's hospital length of stay?” then ML methods may be better, as they are more adaptable to complex relationships between the IVs and to large volumes of data. Second, a researcher should assess whether the assumptions of traditional models are reasonably met. For example, if the researcher is modeling length of stay, but the data are highly skewed and cannot be normalized using a transformation, then the assumptions of linear regression are not met, and an ML method may be more appropriate.
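A minimal K-means sketch illustrates the idea of grouping without a DV. The three simulated "patient groups" below are built into the synthetic data for demonstration; they are not the sepsis phenotypes from the cited study.

```python
# Illustrative sketch: unsupervised K-means clustering on synthetic data
# containing three built-in groups with different "marker" patterns.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
centers = [[0, 0], [4, 0], [0, 4]]   # three hypothetical group profiles
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

# No DV is provided; observations are grouped purely by distance.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # each center summarizes one discovered group
print(km.inertia_)           # total within-cluster distance: lower = tighter groups
```

In practice, the number of clusters is not known in advance; researchers typically fit models over a range of cluster counts and compare within-cluster tightness (inertia) or related measures to choose how many distinct groups the data support.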
Additionally, if the researcher prefers not to specify how the IVs are related to the DV, an ML model may be a better option so that these relationships can be learned from the data. Next, the researcher should consider the volume and complexity of the data. If the available data contain only a small number of variables (typically fewer than 20–30) and/or a small number of observations (typically fewer than a thousand), then ML methods may not perform well, and traditional statistical methods are likely preferred. Yaeger et al. compared the use of a traditional statistical method (logistic regression) with an ML method (super learner) to predict bacterial infections in febrile infants. The ML method was superior in this case, likely because the large number of observations and available variables was more conducive to an ML approach. Traditional statistical models and ML methods are tools in a toolbox; which tool is best depends on the research goals and the available data. The dramatic increase in accessible healthcare data stored in electronic health records is providing researchers with new opportunities for innovative research that was not possible 20, or even 10, years ago. As more research is conducted using these methods, practicing clinicians will benefit from understanding what these methods are and how they can be effectively applied in healthcare research. The authors declare no conflict of interest.
