Abstract

In the Beginning…

Artificial intelligence (AI) is a term coined by John McCarthy, a computer scientist who spent most of his career at Stanford University, where he received many awards and honors, including the 1971 Turing Award for his contributions to AI [3, 72]. AI has expanded into the healthcare sector with the promise of enhanced predictive, diagnostic, and decision-making capabilities. The allure of AI in orthopaedics has been fueled by growing datasets, integration of medical records with images and financial transactions, systematic recording of patient outcomes, increasing storage capacity, and a steep rise in affordable computational power [8]. Since the mid-20th century, AI has seen much success with rule-based approaches in medicine, including electrocardiogram interpretation [35], disease diagnosis [17, 45], treatment selection [66], and even early forms of clinical reasoning [4]. Some of the more exciting recent advances in medical AI have occurred in medical image diagnostic systems, a domain previously limited to human experts, including devices that achieve physician-level accuracy across many fields [28, 40, 49, 55, 65, 68, 74, 75].

The Argument

In orthopaedic surgery, among other specialties, clinical decision-making and image interpretation often have subjective components that depend on reviewer expertise. Part of the appeal of AI lies in its objectivity, reproducibility, and ability to incorporate large amounts of data into each decision. In addition, unlike humans viewing images or slides, these machines do not make errors associated with carelessness or fatigue [38]. Naturally, these useful tools have found their way into orthopaedics to an increasing extent. Recently, Cabitza et al. [11] reported that the number of articles within the orthopaedic literature mentioning machine learning has increased by a factor of 10 since 2010. With this increase in volume, and given the complexity of the methods involved, some have argued that many studies have had poor methodology and that machine learning and deep learning techniques provide no advantage over traditional statistics; we do not believe this is true [12, 16, 48, 72]. How AI algorithms work and how they are defined and validated are critically important to clinicians who rely on them [41], as is the question of which clinical settings these tools may be applied in to help patients. The tools themselves vary, as does the degree to which they have been validated for clinical use; clinicians and scientists continue to debate which types of models are most likely to be effective and which ones are ready for real-world use. To begin to answer these questions, an overview of AI, machine learning (ML), and deep learning that focuses on differences among algorithms and best practices for model implementation in clinical medicine seems important.

Essential Elements

We began with a search of MEDLINE and Google Scholar, using the terms "artificial intelligence," "machine learning," and "deep learning" and limiting the publication dates to January 2016 through December 2020. This resulted in approximately 8000 titles; after removing duplicate results, eliminating irrelevant articles based on the title and abstract, and discarding less clinically relevant topics such as image segmentation and gait analysis, there were 485 articles to be further screened. Most of these would have informed our section (below) "AI in Orthopaedics," and in all likelihood, more than 100 would have met reasonable quality criteria for that section alone.
In the interest of keeping this review of modest length, focused, and relevant for the clinician approaching this broad topic for the first time, we narrowed our review subjectively to those articles we believed were of the best quality and that also met general-interest and readability standards for an audience of nonspecialists in AI. As such, we focused on articles published in general medicine and orthopaedic journals such as JAMA, Nature, Clinical Orthopaedics and Related Research®, and the Journal of Bone and Joint Surgery, American Volume, as well as subspecialty journals such as The Spine Journal and the Journal of Arthroplasty. Given the breadth of the topic and the number of articles involved, we did not conduct a formal systematic review or a formal study-by-study assessment of quality with tools such as the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis guidelines (TRIPOD), the Grading of Recommendations, Assessment, Development, and Evaluations criteria (GRADE), or the Methodological Index for Nonrandomized Studies checklist (MINORS). Readers should interpret our comments and recommendations here with an understanding of this limitation.

What We (Think) We Know

Orthopaedics has seen a substantial increase in research leveraging the power of AI. The following questions may help orient readers to understand, reproduce, investigate, and question the growing AI resources.

What are AI, Machine Learning, and Deep Learning, and How Do They Differ?

The term AI has been used to describe any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals [10]. Meanwhile, ML is a subset of AI used to describe systems that learn from past data [46] (Fig. 1). As a differentiating example, electrocardiogram interpretation systems may be considered AI, but not ML, because they were programmed to classify rhythm strips according to a set of rules rather than being modeled directly from prior data and outcomes [9]. In the field of orthopaedics, the Mako system may be considered AI because it constrains the cutting system to the stereotactic boundary but does not require training data before it has decision-making capabilities [70].

Fig. 1. This figure shows definitions of artificial intelligence, machine learning (ML), and deep learning.

These rule-based AI systems are costly to build and are often slow and computationally demanding because of the algorithmic decision tree the software is forced through [80]. Creating such software requires thorough communication among experts in diverse fields, further complicating its development. As a result, more recent AI systems have leveraged ML algorithms, which may account for complex interactions by identifying patterns among the data and performing these calculations in a timely and hands-free manner [18]. Even with advances in ML, the more broadly defined, rule-based AI applications, now termed "symbolic AI," have remained in healthcare to handle and improve complex tasks [8]. ML, a subfield of AI, uses statistical, data-driven, and automated rules derived from a large set of samples to transform given inputs into given outputs [21]. ML is a natural extension of traditional statistical approaches [6]. The term was first introduced at IBM as a method of achieving AI [15, 63]; the "learning" is achieved computationally by incremental optimization of a mathematical problem [11].
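To make "learning as incremental optimization" concrete, the following is a minimal sketch in Python (using only NumPy; the synthetic data and variable names are ours for illustration, not drawn from any cited study) of a model whose weights are improved step by step with gradient descent, the basic recipe that underlies much of ML and deep learning.

```python
# Minimal sketch: "learning" as incremental optimization (gradient descent).
# Hypothetical data: predict a binary outcome from two synthetic patient features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # 200 synthetic "patients", 2 features
true_w = np.array([1.5, -2.0])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

w = np.zeros(2)                                     # model weights, initially uninformative
learning_rate = 0.1

for step in range(1000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))              # sigmoid: predicted probability
    gradient = X.T @ (p - y) / len(y)               # gradient of the logistic loss
    w -= learning_rate * gradient                   # small incremental update: the "learning"

p = 1.0 / (1.0 + np.exp(-(X @ w)))
print("learned weights:", w, " training accuracy:", np.mean((p > 0.5) == y))
```

Each pass through the loop nudges the weights slightly toward values that better reproduce the observed outcomes; no rules are hand-coded.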
Ongoing efforts in ML have targeted identifying algorithms that are applicable to a wider array of clinical questions. The foundation of artificial neural network research stems from attempts to mirror the learning power of the brain, with the theory that nature has already evolved a powerful learning algorithm (Fig. 2). The functional unit, the artificial neuron, uses the calculations commonly performed in ML, such as traditional regression techniques, and compounds these results with subsequent neurons and calculations [43]. Here, a neuron may excite or inhibit the neurons it communicates with, and the network "learns" in part by adjusting the weights and importance of neurons and inputs. This communication network develops an intricate organization of mathematical functions, allowing neural networks to better define complex relationships between inputs and outcomes than traditional ML algorithms do [41]. The term "deep learning" stems from the exploration of the communication network's depth (hence, "deep" learning) in addition to the design architecture and mathematical operations performed [42].

Fig. 2. This example of an artificial neural network shows the relationship among input, hidden, and output layers as well as the fluctuating network width and depth. The left side represents a simple artificial neural network, where a single hidden layer is employed. The right side represents a deep artificial neural network with multiple layers. The developer may change the width (number of neurons) in a given layer, the depth (number of layers), and the connections between neurons.

A classic example of deep learning involves classifying pictures of handwritten digits between 0 and 9. Early artificial neural network layers may function to identify vertical and horizontal lines in the picture. Further layers may focus on the pattern or position of such vertical and horizontal lines as they relate to each other (Fig. 3). Given that every person has unique handwriting, there must be a complex interaction of this pattern recognition. If the writing is tilted or italicized, the interpretation of vertical and horizontal lines may not have the same relationship to the correctly identified answer or digit. To account for these interactions mathematically, the model may incorporate a nonlinear function. In other words, there may be multiple combinations of vertical and horizontal lines that account for a single digit. Historically, each neuron in a neural network has had a nonlinear sigmoid activation function, similar to neurons in our brain, which are activated after reaching a certain action potential [67]. In its simplest form, this binary classification performed by a single neuron is known in statistics as logistic regression; thus, deep learning may be considered a compounded ML approach [20]. As the algorithm trains on the dataset, the computer tells the neural network which neurons should and should not have been activated and adjusts them according to parameters the user controls.

Fig. 3. This figure shows an example of an artificial neural network classifying handwritten digits. The input layer (the handwritten number 8) shows the red-green-blue values for each pixel. In this example, early hidden layers may begin to function by identifying patterns in the image, such as vertical lines, horizontal lines, or curves. Later layers build on this, and the final output classifies the image as the appropriate digit; in this case, the number 8.
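To illustrate how simple units compound into a classifier, here is a minimal sketch of a small multilayer network trained on the 8 x 8 handwritten-digit images bundled with scikit-learn. The library, layer sizes, and parameters are our own choices for illustration; the article does not describe a specific implementation.

```python
# Minimal sketch: a small neural network classifying 8x8 images of handwritten digits.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)            # each image is a 64-pixel grid, flattened
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of "artificial neurons"; each applies a weighted sum and a
# sigmoid (logistic) activation, compounding simple calculations into a deeper model.
model = MLPClassifier(hidden_layer_sizes=(32, 16), activation="logistic",
                      max_iter=1000, random_state=0)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```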
The Growth of Deep Learning

Over the past few decades, there has been substantial growth in a computer's ability to record, store, and process data. The advent of graphics processing units, originally designed for video games, greatly improved the speed of matrix multiplication. Any grid of datapoints, such as a table of numbers, may be represented as a matrix; an image may also be represented as a matrix, since a picture is simply a table of red, green, and blue values. These numbers may be continuous variables such as patient weights or categorical variables such as diagnoses, for which a number may correspond to a specific disease. The second change driving the success of deep learning is the rapid growth in data storage and in the size of recorded datasets. Artificial neural network performance improves with larger datasets, while traditional ML performance tends to plateau [71] (Fig. 4). However, this improvement depends on appropriate training parameters and on a training dataset that is representative of the population to which the model will later be applied [1, 12, 44]. As one can imagine, training an algorithm on a population dissimilar to the one in which it will be applied may be detrimental if the algorithm is deployed in a healthcare setting [22, 48, 52].

Fig. 4. This graph shows the relationship between the amount of data and the performance of traditional ML algorithms and of neural networks with few (shallow) and many (deep) layers.

Finally, the complexity of clinical problems often requires a degree of flexibility among ML models. Here, neural networks may adjust their architecture and the communication network among neurons to fit a specific problem.

What is Supervised and Unsupervised Learning?

ML algorithms can be classified as using either supervised or unsupervised learning. Supervised learning methods involve collecting a large number of training cases (for example, the image of an animal) with desired output labels (such as the word "dog") [6]. Meanwhile, unsupervised learning involves identifying subclusters of the original data by inferring underlying patterns in unlabeled data [80]. With enough example images, the algorithm might eventually infer the existence of several classes of animals without the need to explicitly label them. However, unsupervised learning is frequently difficult to control. For example, the user may find it useful to differentiate between pictures of a cat and a dog, while an unsupervised approach may result in the algorithm differentiating between breeds of dogs or sizes of animals instead.
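The contrast can be made concrete with a short sketch that reuses the bundled digits dataset as a stand-in (again assuming scikit-learn; this is our illustration, not a method from the cited studies): the supervised classifier is given the correct labels, while the clustering algorithm must infer groupings on its own.

```python
# Minimal sketch: supervised vs. unsupervised learning on the same dataset.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, adjusted_rand_score

X, y = load_digits(return_X_y=True)

# Supervised: the algorithm is shown the correct label for every training example.
clf = LogisticRegression(max_iter=5000).fit(X, y)
print("supervised training accuracy:", accuracy_score(y, clf.predict(X)))

# Unsupervised: the algorithm sees only the images and must infer groupings itself.
# It finds 10 clusters, but nothing guarantees they match the intended digit classes.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print("agreement of clusters with true digits:", adjusted_rand_score(y, clusters))
```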
What Are the Advantages to Using Deep Learning?

Deep learning algorithms have been shown to have greater flexibility in solving an array of problems, and their performance improves with larger datasets. Among the pivotal results supporting the use of neural networks is the universal approximation theorem of Cybenko [14], who proved that there is a neural network with a single hidden layer that can approximate any continuous function within a given error constraint. Put another way, a network with enough input parameters can answer almost any real-world question [14, 31]; however, the amount of data required to appropriately train the algorithm to a given accuracy without overfitting also increases [59-62]. Montúfar et al. [47] added to this conclusion by showing that a neural network's ability to model a given problem grows exponentially with network depth (the number of layers) and only polynomially with network width (the number of neurons in a given layer) (Fig. 5). Neural network depth and width can both be adjusted (see Fig. 2). Knowing that exponential growth will always eventually outperform polynomial growth, the attention of neural network design shifted toward networks with greater depth and more layers. This understanding is the foundation of deep learning, in which the depth of the network (number of layers) is explored in addition to the width (number of neurons) at each layer.

Fig. 5. This graph shows how an exponential relationship will always eventually outperform a polynomial relationship. In this figure, although the polynomial has the larger exponent (5) and the exponential has the smaller base (2), the exponential still eventually surpasses the polynomial.
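As a simple numerical illustration of that crossover, using the base of 2 and the exponent of 5 mentioned in the Figure 5 caption (our worked example, not a reproduction of the figure):

```python
# Minimal sketch: an exponential (2**n) eventually overtakes a polynomial (n**5),
# even though the polynomial is far larger for small n.
for n in [1, 5, 10, 20, 23, 30]:
    poly, expo = n ** 5, 2 ** n
    leader = "exponential" if expo > poly else "polynomial"
    print(f"n={n:2d}  n^5={poly:>13,}  2^n={expo:>13,}  larger: {leader}")
```

For these values the exponential term overtakes the polynomial at n = 23 and then grows away from it rapidly.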
What Clinical Questions Will Benefit from Each Approach?

Tasks that surgeons find repetitive and mundane may be those most suited for AI; the expression "the machine doesn't tire" captures this idea well. Repetitive, high-volume tasks can result in medical errors when handled by humans, and using machines for these tasks allows surgeons to focus on clinical questions that benefit from human insight and gives them more time with their patients [7, 19, 29]. As an example, AI algorithms were recently shown to classify knee osteoarthritis on plain radiographs as accurately as fellowship-trained arthroplasty surgeons [64]. Similarly, recent studies have demonstrated that even expert surgeons are no more likely than chance to predict which patients will find meaningful improvement after knee replacement [25], while a computer algorithm did fairly well [24]. Further, as noted, these machines do not make errors associated with fatigue, burnout, or carelessness. Use of these models may improve patient safety and the efficacy of care; several studies have shown that clinician time devoted to imaging interpretation decreased when aided by ML models [26, 30, 39].

Is AI Better Than a Top-of-the-line Clinician?

In short, for some applications, yes. All AI algorithms are developed from past data. Diagnosing osteoarthritis on a radiograph using AI may not outperform clinicians, given that the algorithm relies on clinicians to label radiographs according to the severity of osteoarthritis. However, an algorithm predicting postoperative outcomes may outperform clinicians, given that the outcomes may refer to laboratory values, patient satisfaction scores, or other factors that do not directly rely on the clinician's interpretation. To further this point, a recent study by Ghomrawi et al. [25] showed that even high-volume surgeons were no better than chance at discriminating which patients may benefit from THA. To answer this question more definitively, future studies should involve prospective randomized clinical trials, since most current studies are retrospective, observational studies and, as such, are at high risk of bias [50].

Some algorithms may surpass human expertise; several of these involve an area of deep learning called reinforcement learning. Here, the algorithm often is initially trained on available data to gain a preliminary understanding of its purpose, then it subsequently competes against itself and "learns" based on a predefined reward function and a clearly defined environment. On a superficial level, this may appear simple; however, the methodology has typically been limited to board games and environments governed by only basic physics, such as gravity, because real-world environments are incredibly complex and ever-changing. As an example, the Google DeepMind AlphaGo algorithm [69] learned the game of Go and beat the 18-time world champion, Lee Sedol [78]. Unlike the popular success the IBM supercomputer Deep Blue had in beating chess champion Garry Kasparov [76], the game of Go has more possible board positions than there are atoms in the universe, meaning the computer could not consider every possible board position and instead had to rely on training experience to make decisions. As impressive as the Go victory was, its success was in one specific game with a strict, narrow set of predefined rules.

AI in Orthopaedics

Diagnoses based on medical imaging and treatment outcome prediction are among the most studied topics, while deep learning and support vector machines are among the most frequently applied algorithms [11]. Several studies have developed models for fracture detection, two of which also classified the fractures [2, 5, 13, 34, 39, 53, 73, 79]. Robotic systems controlled by AI are routinely used for assembly-line practices and in biomedical laboratories [77]. However, the expansion and adoption of autonomous systems in surgery has been considerably slower, with most systems, such as the FDA-approved da Vinci surgical system (Intuitive Surgical, Inc.), simply translating surgeon hand movements as a form of robotically assisted surgery [27]. Autonomous knot-tying robots have been developed recently; knot tying is one of the most commonly performed tasks during surgery [77]. Other successful applications of AI include optimization of hospital scheduling. The University of Colorado Health implemented an AI-based surgical scheduler, increasing revenue by USD 15 million (4%) and the number of blocks by 47% [37]; it did so 10% faster and retained six additional surgeons without changing the total staffed operating room block allocation. Similarly, New York Presbyterian Hospital used AI scheduling optimization to reduce patient wait times in the clinic by 50% [36]. These examples show the pragmatic and effective delivery of AI today. Recent advances in patient-specific modeling of cost, length of stay, and discharge disposition have used historical data in ways that could be applied prospectively, with areas under the curve between 0.72 and 0.89 [33, 51, 56-58]. Because current bundled payment models frequently do not account for differences in preoperative risk, AI techniques may allow more reliable preoperative prediction of outcomes [29].
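For readers unfamiliar with the area under the receiver operating characteristic curve (AUC) figures cited above, the following minimal sketch shows how the metric is computed for a model predicting a binary postoperative outcome. It assumes scikit-learn, and the dataset and model are entirely synthetic stand-ins rather than anything from the cited studies.

```python
# Minimal sketch: computing the area under the ROC curve (AUC) for a
# hypothetical model predicting a binary postoperative outcome.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a tabular clinical dataset (e.g., age, BMI, laboratory values).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
predicted_risk = model.predict_proba(X_test)[:, 1]   # predicted probability of the outcome

# An AUC of 0.5 means discrimination no better than chance; 1.0 is perfect.
print("AUC:", roc_auc_score(y_test, predicted_risk))
```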
Knowledge Gaps and Unsupported Practices

Several ethical considerations should be identified and mitigated when using AI algorithms, including patient privacy, confidentiality, and consent. The use and potential risks of AI should be discussed with the patient [56]. Surgeons must understand the data from which an algorithm was developed, because those data may not be representative of their own patients [81]. For example, an algorithm trained on inpatient data may not translate correctly to the outpatient setting. Similarly, care must be taken when preparing and describing data to ensure their appropriate implementation. Further, widespread implementation of AI models leaves medical systems potentially vulnerable to adversarial attacks, an advanced technique used to subvert an otherwise reliable AI model [23]. Consequently, caution must be taken when relying on AI algorithms because, as Finlayson et al. [23] pointed out, it is always easier to break systems than to protect them.

Barriers and How to Overcome Them

Careful data collection that minimizes sources of bias and ensures relevance to contemporary practice is critically important. Developing the systems of the future on data from today is inherently limited, which underscores the need to report details such as the period from which the data were drawn and the patient population they describe. As an example, an AI tool built on data collected from earlier implants may not apply to those in common use today; it is critical for the user to understand this limitation when making decisions. Further, one must take care to avoid statistical shortcomings that derive from sampling problems or measurement error in prediction variables, as well as heterogeneity of effects. Erroneous claims related to social factors, gender, and race may also result from improper data collection or modeling; these have the potential to cause real-world harm, and a critical methodological eye is called for when papers on these themes are evaluated. As an example, smaller subgroups may be lumped into larger subgroups for sampling or statistical convenience (sometimes labeled "other"), such as when the number of individuals in a smaller ethnic group is too small to analyze individually and those patients are included in, for example, a group labeled "non-white." This may undermine the model's ability to make predictions for the groups that could not be analyzed individually, and assuming that those patients parallel the larger groups to which they were assigned may not be accurate. To combat these issues, we believe that models of the future will be able to use AI to identify bias and alert the clinician should scenarios like these arise. An additional tactic to mitigate bias is uniform data collection that removes the possibility of selection bias; noninvasive tools such as fitness trackers may make this more feasible for endpoints such as vital signs. Further, developers may preferentially train algorithms on data from randomized controlled trials as opposed to observational data [54].

With regular use of computers and electronic health records in clinical settings, computational power is not a barrier. However, whether computation is performed locally or in the "cloud" has implications for patient privacy and for the maintenance of programs and hardware, and it requires careful consideration [41]. Similar to postmarket surveillance of medications, continued monitoring of ML systems is essential to help detect unexpected problems, including adversarial attacks, changes in practice, or changes in patient populations [32]. Without continued monitoring of these ML systems for adversarial attacks, or malicious attempts to subvert an otherwise useful model, their widespread implementation may carry serious health consequences.
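To illustrate why a model trained on one population may not translate to another (the inpatient-to-outpatient and older-implant examples above), here is a minimal sketch using entirely synthetic data and scikit-learn: a simple model is fit on one simulated cohort and then evaluated on a second cohort whose feature distribution differs. The cohorts, features, and thresholds are hypothetical.

```python
# Minimal sketch: a model trained on one synthetic cohort can fail on another
# whose characteristics differ, even though the underlying (nonlinear) rule
# generating the outcome is the same in both.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate_cohort(n, mean_feature_value):
    """Synthetic cohort: outcome risk rises with the square of a single feature."""
    x = rng.normal(loc=mean_feature_value, scale=1.0, size=(n, 1))
    risk_score = x[:, 0] ** 2 + rng.normal(scale=1.0, size=n)
    y = (risk_score > mean_feature_value ** 2).astype(int)
    return x, y

X_a, y_a = simulate_cohort(2000, mean_feature_value=2.0)     # training cohort
X_b, y_b = simulate_cohort(2000, mean_feature_value=-2.0)    # different population

model = LogisticRegression().fit(X_a, y_a)                   # learns a simple linear rule

print("AUC in the training cohort:", roc_auc_score(y_a, model.predict_proba(X_a)[:, 1]))
print("AUC in the new cohort:     ", roc_auc_score(y_b, model.predict_proba(X_b)[:, 1]))
# The second AUC is far worse: the linear rule learned where x is near +2 points
# the wrong way where x is near -2, where higher x means lower x**2 and lower risk.
```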
5-year Forecast

AI is a powerful tool with great potential in the medical field. We speculate that the implementation of these technologies will continue to increase, given the continued expansion of data storage and processing power. These algorithms incorporate large amounts of data into each decision, are reproducible, improve in accuracy as data grow, are flexible in architecture and in the clinical questions to which they may be applied, and, importantly, do not fatigue. However, clinicians should verify the validity and impact of such methods, as they would for any other diagnostic or prognostic tool. Finally, these algorithms are no substitute for a clinician; we believe it is likely that clinicians will never be replaced. Despite sometimes grandiose claims, AI is still only useful in narrowly defined contexts. We anticipate that over the next 5 years, informed clinicians will incorporate AI into their practices as a tool to improve outcomes and reduce complications and burnout, but that it will not be used as a substitute for clinical reasoning or expertise.
