Abstract

The access to increasing volumes of scientific and clinical data, particularly with the implementation of electronic health records, has reignited an enthusiasm for artificial intelligence and its application to the health sciences. This interest has reached a crescendo in the past few years with the development of several machine learning– and deep learning–based medical technologies. The impact on research and clinical practice within gastroenterology and hepatology has already been significant, but the near future promises only further integration of artificial intelligence and machine learning into this field. The concepts underlying artificial intelligence and machine learning initially seem intimidating, but with increasing familiarity, they will become essential skills in every clinician's toolkit. In this review, we provide a guide to the fundamentals of machine learning, a concentrated area of study within artificial intelligence that has been built on a foundation of classical statistics. The most common machine learning methodologies, including those involving deep learning, are also described.
Since the start of the 21st century, there has been an increased impetus to integrate the field of artificial intelligence (AI), including machine learning (ML), into the medical sciences. This interest has gained momentum over the last few years with myriad discoveries in AI-based methodologies for clinical practice and decision support1-3; significantly, the impact of these technologies has been especially broad in gastroenterology and hepatology.4-7 While clinical applications of AI have garnered special attention, basic science disciplines have also readily adopted ML techniques,8-10 using these tools to expand into new facets of genomics11 and proteomics.12

The relationship between AI and medicine, however, is not new; it extends back many decades. The origins of AI can be traced to the fictional writings of Isaac Asimov13 in the 1940s and the seminal work of Alan Turing on computing machines during World War II.14 The term "artificial intelligence", however, was not used until 1956, when John McCarthy organized the conference that established the field of AI: the Dartmouth Summer Research Project on Artificial Intelligence. Medical schools in the United States were quick to partner with pioneers in this nascent field, and health science research became the driving force for AI innovation in the 1970s with the development of several "expert systems" to assist with scientific15 and clinical decision-making.16-18 Hindered by rigid, rule-based architectures, these AI-based "expert systems" failed to generalize, precluding widespread adoption and leading to a divergence between the fields of AI and medicine until the turn of the century.

AI, in the current day, is a loosely defined term applied to a broad area of study within computer science devoted to computing systems that can perform skills normally thought to require human intelligence, such as problem-solving, visual perception, and reasoning. ML, in contrast, is a specific set of techniques within AI that are predicated on "learning" to model patterns in data using mathematical functions. This mathematical foundation of ML is largely built on the concepts of traditional statistics (Figure 1), and thus, ML is often referred to as "statistical learning".19 By leveraging its origins in computer science, ML diverges from classical statistics in its ability to apply higher-dimensional mathematical operations to much larger data sets to decipher complex, nonlinear relationships. As a result, ML algorithms have proved very useful in medicine for discriminating between groups of interest or predicting specific outcomes. In fact, models such as the Bhutani nomogram,20 the Model for End-stage Liver Disease score,21, 22 and the Glasgow-Blatchford risk score23 could be considered some of the earliest examples of predictive ML models in the field of gastroenterology and hepatology.

The progression of computing processing power, the ability to reliably store immense amounts of data, and the development of statistics-based ML techniques, combined with the implementation of electronic health records (EHRs), have heightened the interest in medical AI. The potential applications of ML algorithms in research and the practice of gastroenterology and hepatology remain vast. As this technology, and its acceptance, continues to advance, such algorithms will have an ever-increasing role in every facet of gastroenterology. Thus, a working understanding of the basic concepts of AI and ML has become a necessary part of every clinician's skillset. Here, we aim to provide a primer on the foundations of ML and introduce some of the most common model architectures used in medicine.

The Basics of ML

Data

ML techniques are designed to understand and mathematically represent the patterns present in data. As a result, the key to building accurate and applicable ML algorithms lies in both the size and quality of the data used. The omics revolution and widespread deployment of the EHR have provided innumerable data sets with previously unfathomable amounts of experimental and clinical information that have been crucial for ML applications in the health sciences. However, the components and organization of these large volumes of data present a challenge in understanding and verifying their underlying quality.
The wide variety of building blocks in medical data includes pixels that make up radiological and histopathological images, words in clinical documentation, time-based values from remote sensors, nucleotide bases from next-generation sequencing, and, at the simplest level, rows from descriptive tables. Some of these base constituents can easily be stored in known formats and organized into a table or sets of relational tables and, thus, are called structured data (Table 1). Although organization into structured data does not directly indicate the quality or interpretability of the information contained within, the ability to use a structured architecture to index and search for specific instances allows for easier verification of quality. Alternatively, the components of images, clinical documentation text, or even audio recordings have no predefined relational organization and are considered unstructured data. Occasionally, these sources of unstructured data can be organized at a higher level using metadata (information describing where, when, and how the data were created); such data are referred to as semi-structured data. Although more advanced ML techniques such as deep learning (DL) can, in some cases, utilize unstructured data, traditional ML methodologies tend to require structured data.

Table 1. Glossary of Terms

Structured data: Information that has been stored in a defined, known framework, such as a database, so that it can be indexed, referenced, or searched easily and accurately. Structured data are usually quantitative and composed of numerical values, dates, or short text strings. EHRs often store laboratory results and flowsheet information (eg, vitals) in a structured format. This term purely reflects the organization of data and does not reflect on its content.

Unstructured data: Data that are not stored in a well-defined framework that can be referenced. Most clinical data that are accumulated are unstructured; this includes the text from clinical documentation, radiology images, endoscopy videos, and scanned reports. Because of the lack of a storage framework, unstructured data often require manual extraction of information. For example, one cannot reference the liver in computed tomography images just by selecting a filter for "liver pixels"; this usually requires manual identification for each set of images.

Features: The input variables in a data set, also known as the independent variables, predictors, or simply the variables. Features can be categorical values or continuous numerical ranges, or more complex components such as groups of pixels in an image. New features can be created by combination or transformation; this is known as feature engineering.

Ground truth: The outcome or output variable that is used to train or to test a model's prediction or classification. This is usually a measured variable or one that has been determined by domain experts and is considered the gold or reference standard.

Class/label: The model's output variable in a classification problem. If the outputs are mutually exclusive, they are known as classes; if not, they are referred to as labels. A model to determine whether a polyp was cancerous would output "cancerous" or "non-cancerous" as classes, whereas a model to identify various structures on a liver biopsy slide could output several labels for each slide, such as "portal vein", "central vein", "steatosis", and "fibrosis".

Loss/cost function: A mathematical function that calculates the difference between the ground truth and the model's predicted values during the training of a model. This function is minimized to optimize the model's predictions.

Parameters: The main adjustable factors available to a model to optimize its performance during training. These are analogous to coefficients in statistical regression and can also be considered the weights applied to each feature.

Hyperparameters: Adjustable factors, also known as tuning parameters, that determine how advanced models are set up to learn from the data. These are adjusted before training only and can be evaluated based on performance on the tuning set. Examples include the regularization function in ridge regression, the k value in k-nearest neighbors, and the learning rate and depth in neural networks.

Training set: The subset of the data that is used to train a model. These are the data on which various combinations of parameters are adjusted to minimize the loss function and establish an optimal model.

Test set: Data that are used to evaluate a model's performance. These should be data that the model has not been trained or tuned on at any point; they can be a small held-out subset of the data or a completely external set of similar data. The test set should usually be at least 20% of the training set in size.

Tuning set: Sometimes referred to as the "validation" set, this is a small subset of the data available for model development. These data are used in advanced machine learning models to adjust the hyperparameters of the model to ensure that the model is not overfitting or underfitting data that were not used for training.

Overfitting: The situation in which a model has been trained to be very specific to the data contained in the training set and is thus not generalizable. This is reflected by excellent performance metrics on the training data but poor performance on tuning or test sets. Overfitting can occur if too many features are used in a model.

Underfitting: A model is underfitting if it continues to perform poorly on the training set despite all hyperparameter optimization. This indicates that the model framework is a poor candidate for representing the relationships in the data.

Recall: The sensitivity or true positive rate of the model's predictions.

Precision: The positive predictive value of the model's predictions.

Accuracy: Used to evaluate the performance of a binary classifier. Defined as the ratio of correct predictions to the total number of predictions.

F-score: Another measure of the accuracy of a binary classifier. The traditional version of this metric is the F1-score, which represents the harmonic mean of the precision and recall.

c-statistic: A term for the value of the area under the receiver operating characteristic curve (AUROC or AUC). Used to evaluate the performance of a binary classifier.

The composition and organization of data sets can take a variety of forms, but in general, a data set is considered a collection of unique points or observations, each defined by the values of several variables. When modeling data, using either classical statistics or ML, certain variables are considered candidate input variables, also known as independent variables or predictors, but referred to as features in ML. If a variable representing the output of the model is present, it is referred to as the dependent or output variable. Features, depending on the property they describe, can be categorical, either as a simple binary or as a set of discrete values, or can span a range of continuous numerical values. In addition, features can be created anew by combination, mathematical transformation, or conversion from continuous to categorical, in a process known as feature engineering.

Model Considerations

The "learning" aspect of ML corresponds to the initial training phase of building a model, in which an ML model is trained on, or "learns" from, a representative data set. This can occur in two main frameworks: supervised learning or unsupervised learning.
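The notions of features and feature engineering introduced above can be illustrated with a short sketch. Here a continuous laboratory value is converted into a categorical feature, and two existing features are combined into a new one. The variable names and the cutoff are hypothetical, chosen only for illustration, and carry no clinical meaning:

```python
# Minimal sketch of feature engineering on structured data.
# Feature names and the 1.2 cutoff are hypothetical examples only.

patients = [
    {"age": 62, "bilirubin": 1.1, "albumin": 4.0},
    {"age": 47, "bilirubin": 3.5, "albumin": 2.9},
]

for p in patients:
    # Conversion from continuous to categorical: flag an elevated value.
    p["bilirubin_elevated"] = p["bilirubin"] > 1.2
    # Combination of two features into a new engineered feature.
    p["bilirubin_albumin_ratio"] = p["bilirubin"] / p["albumin"]

print(patients[1]["bilirubin_elevated"])  # True
```

Both derived columns would then be offered to a model as candidate features alongside the original ones.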
Data that have been labeled with an output variable according to a predetermined gold or reference standard, known as the ground truth, whether by subject matter experts or through actual measurement, can be used to train a supervised model. Supervised learning leverages the ground truth as an anchoring endpoint to guide the training of a model. Conversely, unsupervised techniques are designed for data sets without ground truth; lacking this predefined output, unsupervised learning relies on identifying distinctive groups of patterns within the features provided. A third approach, semi-supervised learning, combines these two frameworks and is used when only a relatively small amount of data has available ground truth. This technique learns both from the subset with ground truth and from the patterns in the remaining data to build a series of iterative models that assign every instance a tentative output variable; these derived outputs are then used to train a final model. Lacking the information contained in an established endpoint, techniques that utilize semi-supervised or unsupervised learning require much larger quantities of data to achieve the same level of performance as those using supervised learning. As a result, most high-performing ML models in the health sciences tend to use supervised learning techniques.

Regardless of the learning framework, the applicability of all ML models is heavily dependent on the sources of the data used. This is especially impactful for ML models in health care, as these data are often plentiful but usually drawn from large, academic referral centers in industrialized, western countries. These populations are often imbalanced in terms of disease severity and demographics, which in turn can result in similarly skewed model predictions.
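The iterative, pseudo-labeling flavor of semi-supervised learning described above can be sketched in a few lines. This toy example uses a deliberately simple one-feature threshold "model"; real implementations would use a proper classifier, and all values here are arbitrary illustrative numbers:

```python
# Toy self-training (semi-supervised) sketch with a one-feature
# threshold classifier. All values are arbitrary illustrative numbers.

labeled = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]  # (feature, ground truth)
unlabeled = [1.5, 2.5, 7.5, 8.5]

def fit_threshold(points):
    """'Train' by placing the decision boundary midway between class means."""
    zeros = [x for x, y in points if y == 0]
    ones = [x for x, y in points if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

# Step 1: train an initial model on the small labeled subset.
threshold = fit_threshold(labeled)

# Step 2: assign tentative (pseudo) labels to the unlabeled data.
pseudo = [(x, int(x > threshold)) for x in unlabeled]

# Step 3: train the final model on labeled plus pseudo-labeled data.
final_threshold = fit_threshold(labeled + pseudo)
print(final_threshold)  # 5.0
```

In practice this loop is repeated, often keeping only pseudo-labels the interim model is confident about, until the labels stabilize.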
Without the presence of ground truth, skewed data can unduly influence models created with unsupervised learning methodologies, and extra scrutiny is required to determine the scope of such models. Supervised learning models, however, are not immune to this phenomenon, as both the source of the features and the source of the ground truth (eg, from one expert or measurement vs from a consensus of multiple experts or measurements) can affect the model's results. The accumulation of more representative data is usually more labor intensive and often involves combining several sources. This trade-off between model applicability and data collection effort is one of several competing interests to consider when choosing how to build an ML model.

ML can be used to understand a data set with two overarching, yet competing, goals in mind: pattern inference or outcome prediction. Methodologies that optimize one usually do so at the cost of the other. Models with good pattern inference simplify the overarching relationships between candidate features and outcomes and, as a result, tend to make less accurate predictions. Conversely, the most accurate predictive models can rely on deducing very complex relationships between features and outcomes, rendering an understandable explanation very difficult. Another way to view the trade-offs between these competing paradigms is to compare discriminative and generative methodologies. Discriminative models concentrate on calculating the most efficient boundary between different outcomes, almost completely ignoring the overall distribution of the outcomes. Generative models, on the other hand, build a full representation of the distribution of each outcome without specifically focusing on separating them. These trade-offs between predictability and interpretability are important to recognize in some of the most common ML algorithms applied to medicine.
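The discriminative-generative contrast above can be made concrete on a single feature. In this sketch the discriminative "model" learns only a boundary between the groups, whereas the generative one fits a full Gaussian distribution per class and classifies by likelihood; the observations are invented for illustration:

```python
import math

# Toy contrast of discriminative vs generative views on one feature.
class0 = [1.0, 2.0, 3.0]  # observations with outcome 0 (invented values)
class1 = [7.0, 8.0, 9.0]  # observations with outcome 1 (invented values)

# Discriminative view: learn only the boundary between the groups.
boundary = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def discriminative_predict(x):
    return int(x > boundary)

# Generative view: model each class's full distribution (Gaussian here),
# then classify by which distribution makes the point more likely.
def gaussian_fit(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

params = {0: gaussian_fit(class0), 1: gaussian_fit(class1)}

def generative_predict(x):
    return max(params, key=lambda c: gaussian_pdf(x, *params[c]))

print(discriminative_predict(4.0), generative_predict(4.0))  # 0 0
```

Both approaches agree here, but only the generative model retains an estimate of how each outcome is distributed, which is what makes it more interpretable and less narrowly optimized for separation.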
Supervised algorithms that are used to place data points into discrete groupings are known as classifiers and are generally described as solving a "classification problem". If the groupings output by a classifier are mutually exclusive, they are known as classes; otherwise, they are called labels. Classifiers are prototypical discriminative models and can often lack information on the features driving the discrimination. The solutions to "regression problems", by contrast, are more interpretable models, using classical and nonlinear regression methodologies, that output continuous numerical values from a set of features while balancing interpretability and predictive ability. Unsupervised clustering algorithms, which are used to group unlabeled data points, also tend to balance discriminative and generative features.

Training a Model

The foundation of building an ML model lies in optimizing the mathematical operations applied to the input features so that the model's outcomes or predictions come as close as possible to the ground truth. In the classic linear regression model, this simplifies to choosing the coefficients for each independent variable such that the deviation between the model prediction and the ground truth is minimized. This deviation, in regression analysis, is measured using the mean squared error function, which calculates the average squared difference between the actual and predicted outcome values. In more advanced ML models, higher-level functions such as cross-entropy loss estimate this deviation better than mean squared error. In general, the functions that capture the difference between model predictions and ground truth are known as loss functions or cost functions. During training, an ML model learns by varying its parameters, adjustable values specific to each feature and analogous to coefficients in regression models, to minimize its loss function.
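The two loss functions named above can be written out directly. The following sketch computes mean squared error for a regression-style output and binary cross-entropy for a classification-style output; the numbers are illustrative only, not from any real model:

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob):
    """Cross-entropy between binary ground truth and predicted probabilities."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

# Regression-style deviation between predictions and ground truth.
print(mean_squared_error([2.0, 4.0, 6.0], [2.5, 3.5, 6.0]))  # 0.1666...

# Classification-style deviation: confident, correct predictions give low loss.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]))
```

Training amounts to adjusting the model's parameters so that the chosen loss, evaluated over the training set, becomes as small as possible.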
This process occurs on a subset of the full data set that has been prespecified as the training set. Once the optimal parameters have been chosen, the performance of the model can be evaluated on a test set. A test set is data structured exactly like the training set but that the model has never seen, either a held-out subset of the initial data set or data from a completely external source. Traditionally, the test set should be at least 20% the size of the training set. Poor performance on the test set can be indicative of either underfitting or overfitting. Underfitting usually manifests as poor performance on both the training and test sets and indicates that the model framework, even when optimized, is a poor candidate for representing the relationships present in the data. Overfitting manifests as excellent performance on the training set but poor performance on the test set; this usually indicates that the model has been overly optimized for the specific data points in the training set and has thus lost generalizability to data it has not seen.

While traditional regression models are limited in how they are trained, more advanced regression and ML models can adapt how they learn based on their tuning parameters. These tuning parameters, or hyperparameters, are also adjustable values, but they do not form part of the model itself and are established before training. Nevertheless, hyperparameters have a large impact on the final parameters and model performance and thus form an important part of optimizing the model. In advanced regression models, required hyperparameters include the regularization function, which allows for feature selection, whereas other ML and DL models can have several hyperparameters, including the learning rate as well as the size and complexity of the network architecture. As hyperparameters are not adjusted during training, their effects need to be specifically monitored.
This is usually performed using a tuning set before the model is finalized and evaluated on a test set.

Evaluating a Model

As described previously, ML models are evaluated based on their performance on a test set. This evaluation is very similar to traditional biostatistical assessment, albeit with slightly different nomenclature. The most common metrics include recall (sensitivity) and precision (positive predictive value). In addition, the ratio of correct predictions to total predictions (known as accuracy) and the F-score are used to quantify an ML model's accuracy. Visually, model performance is often shown as a receiver operating characteristic curve, with additional reporting of the area under the curve, also known as the c-statistic, as a quantitative measure of performance. Increasingly, the precision-recall plot and its area under the curve are also being reported, particularly in the setting of imbalanced data sets.24

Types of Models

Classical Supervised ML Algorithms

Regression models, as alluded to previously, form the crucial link between ML and statistics. Linear regression (Table 2) is the simplest and most well-known of these models, with the lowest computational cost. Although linear regressions offer great interpretability and produce quantitative rather than class predictions, they cannot account for the many nonlinear relationships found in medical data. However, several nonlinear regression techniques, including polynomial and stepwise regressions as well as regression splines, have been developed to enhance their flexibility. In addition to nonlinear regressions, techniques such as ridge,25 elastic net,26 and least absolute shrinkage and selection operator (LASSO)27 regressions are frameworks that allow for hyperparameter tuning to assist with feature selection.

Table 2. Classical Supervised Machine Learning Techniques

Linear regression: Classical statistical model that calculates a "line of best fit" between inputs and output. Can be made more flexible to estimate limited nonlinear relationships by using stepwise regressions and splines.

Advanced regression (ridge, LASSO, elastic net): Expansion of linear regression models with a regularization hyperparameter. Allows for feature selection but, like linear regression, cannot estimate complex relationships.

Support vector machine: Discriminative classification technique for both linear and nonlinear relationships. Uses a kernel function to mathematically transform each data point into a higher-dimensional feature space so that a hyperplane (high-dimensional geometrical plane) separates the groups.

Decision trees: Simple, yet versatile, model that uses several levels of branched decision points (nodes) based on feature values that end in groupings of terminal nodes called leaves. The simplicity allows for straightforward interpretation and feature selection, but it is difficult to model complex relationships.
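The evaluation metrics defined in the Evaluating a Model section above (recall, precision, accuracy, and the F1-score) can all be computed from a binary confusion matrix in a few lines. The ground truth and predictions below are invented solely to illustrate the arithmetic:

```python
# Compute common binary classification metrics from the confusion matrix.
# Ground truth and predictions are invented illustrative values.

y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

recall = tp / (tp + fn)             # sensitivity / true positive rate
precision = tp / (tp + fp)          # positive predictive value
accuracy = (tp + tn) / len(y_true)  # correct predictions / all predictions
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(recall, precision, accuracy, f1)  # 0.75 0.75 0.75 0.75
```

Established libraries provide the same quantities, along with ROC and precision-recall curves, but the underlying arithmetic is no more than this.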
