INTRODUCTION Since John McCarthy introduced the term “artificial intelligence (AI)” in 1955,1 AI research has been growing. “AI” is an umbrella term that encompasses a vast degree of computer technologies (eg, expert systems, computer vision, robotics, and machine learning) (Figure 1A) as well as a concept of a machine-imitating human intelligence2,3 (Table 1). Modern AI is defined as a system’s ability to (1) perceive the current world, ie, data; (2) to cause and compare different approaches to achieve specific goals based on given data; (3) to tune their performance and apply to unseen data; and (4) to repeat the previous processes multiple times and update the previous learning.4 When reviewing results from AI models, it is therefore critical to understand whether they are appropriately developed and validated (Figure 1B). TABLE 1. - Definition of artificial intelligence and its subfields AI □ An umbrella term that encompasses a vast degree of computer technologies (eg, expert systems, computer vision, robotics, and machine learning) as well as the concept of a machine imitating human intelligence□ A system with the ability to perceive data, to cause and compare different algorithms to achieve specific goals, to analyze their performance and tune them, and then apply to unseen data, and repeat the previous process and update the previous learning Expert system □ Subset of AI□ Rule-based systems built with explicit coding of decision rules Machine learning □ Subset of AI□ Training a computer model to solve problems (eg, prediction) by using statistical theories or identifying specific patterns in the data (eg, phenotyping) Deep learning □ Subset of machine learning□ Algorithms to process multiple layers of information to model intricate relationships among data Decision tree □ Supervised machine learning algorithm□ Flowchart structure like a tree that has internal nodes, branches, and leaves○Internal nodes contain questions such as whether a patient has a fever >100.4F○Branches represent the answer (ie, yes or no)○Leaves represent final class labels□ Random forest is an ensemble of decision trees K-nearest neighbor □ Supervised machine learning algorithm□ Is used for classification and regression tasks based on similarities (ie, proximity or distance) between available data and new data Naïve Bayes □ Supervised machine learning algorithm□ Probabilistic classifiers based on Bayes’ theorem with an assumption of independence among predictor variables K-means clustering □ Unsupervised machine learning algorithm□ Identify similar characteristics in the dataset and partition into subgroups AI, artificial intelligence. FIGURE 1.: History of artificial intelligence and methodological evolution of artificial intelligence-related literature. A, A brief history of artificial intelligence: artificial intelligence was introduced in the 1950s along with machine learning.1 , 3 Since the 1970s, expert systems showed some success, such as in electrocardiogram interpretation.3 In the 2010s, substantial advances were achieved with deep learning, including image classification.3 B, Methodological evaluation of artificial intelligence algorithms. C, Area under the receiver operating characteristic. AUROC, area under the receiver operating characteristic.The Quality and Quantity of Input Data Like conventional parametric and semiparametric diagnostic or prognostic models, the quality of the conclusions drawn from the AI-based algorithms relies on the characteristics of the dataset, which is used to train the model. If the model is trained on/based on a biased dataset, the model itself is likely to be biased.5 AI might assume that probabilities are static or outcomes within the dataset are optimized, which might not be the case. Hence, similar to conventional methods, it is essential to look at the exclusion and inclusion criteria of the study. For example, a model for postliver transplant (LT) graft survival, which focuses on matching LT donor and recipient pairs, might not be accurate for recipients with hepatocellular carcinoma (HCC) if the input dataset did not include enough HCC patients. Also, an imbalanced distribution of covariates (eg, race) or outcomes (eg, rejection) within a dataset, can affect the usefulness of the algorithm. If black patients are underrepresented in the dataset, the model might not accurately provide a diagnostic or prognostic prediction for black patients. The input dataset for a disease diagnostic model needs to have a representative epidemiological spectrum of disease to be predictive.6 A recent report on the overall accuracy of diagnosing kidney allograft rejection was reported to be 92.9%.7 The authors used a deep learning–based computer-aided diagnostic system using diffusion-weighted magnetic resonance imaging combined with creatinine clearance and serum creatinine.7 But the grades of rejection and how rejection was confirmed, was not reported. Thus, for instance, if the input dataset mainly included Banff II or III rejection, the model may not accurately detect Banff I rejection. Similarly, if the outcome variable is not balanced, it is unlikely that the model, which is trained on an imbalanced dataset, truly represents the performance of the outcome of interest. Model Development and Validation Appropriate Reference Standard and Outcomes of Interest There are 2 broad divisions of machine learning (ML), which have different objectives: supervised and unsupervised learning. Supervised learning trains a model to predict a known status or group.5 The outcomes of interest need to be very clearly defined. Many algorithms are readily available such as decision tree, K-nearest neighbor, and Naïve Bayes (Table 1).8 For example, for kidney allograft rejection, the precise definition of rejection is paramount, which many papers have failed to do.7,9 Reeve et al10 used outcomes confirmed by 3 pathologists as the reference standard to verify the diagnosis of kidney allograft T cell–mediated rejection and antibody-mediated rejection reliably. In contrast, unsupervised learning models do not require clearly defined outcomes,5 and the computer will identify similar patterns within the dataset, such as the clustering of subgroups with similar characteristics, for example, K-means clustering8 (see Table 1). This method can be useful for hypothesis generation and to identify clusters of patients (phenotyping) or events that are more similar to each other. Model Development We will focus mainly on supervised learning from this point on since most literature in transplantation is based on supervised learning. Many ML models divide data into a training set and a test set during the model development.5 After the initial training of the model, the model undergoes internal validation to assess its performance. Sensitivity and regularization parameters can be fine-tuned at this stage to help optimize prediction without overfitting the data. For internal validation, K-fold cross-validation and bootstrapping are commonly used methods.5,6 For example, in 10-fold cross-validation, the dataset is split into 10 sets, and then 1 set is used for validation and 9 sets are used for training. This provides an average performance of 10 sets to reduce selection bias and to obtain more stable error estimates.11 Bootstrapping is a method of random sampling, and it is commonly used for very small datasets to get inferences about the dataset, such as standard error, confidence interval.11 If manuscripts about AI or ML models do not describe the internal validation or cross-validation process, the readers should view the results with caution.5 Overfitting refers to a condition where the model fits too well to the training set, but then the results likely do not apply/fit well to an external validation set.6 For example, a model could simply “memorize” the labels of the input dataset, which would cause the accuracy to be 100%. However, this model would have very poor generalizability when predicting new data.5 This can be mitigated partially by regularization or reducing the “flexibility” of the algorithm,11 paradoxically reducing the “learning” ability but increasing the potential generalizability to unseen data. Given that the chief benefit of AI is flexibility above traditional models, comparing the final model to traditional methods with an appropriate statistical test is very important. Discrimination and Calibration Discrimination is defined as a model’s ability to distinguish different conditions,12 such as rejection or not. The C-statistic or the area under the receiver operating characteristic (AUROC) is the most commonly reported model performance metrics for classification problems. The AUROC curve indicates the relationship between false-positive rate (1-specificity) and true-positive rate (sensitivity)6 (Figure 1C). A greater AUROC indicates that the model has higher sensitivity with the same false-positive rate and therefore possesses better discriminatory power. In general, an AUROC <0.60 discriminates poorly, an AUROC 0.60–0.75 is regarded as possibly helpful discrimination, and an AUROC >0.75 is considered useful.12 The AUROC can also assist in choosing optimal cutoff values depending on the desire to optimize sensitivity, specificity, disease prevalence, clinical setting, and cost of misclassification.6 However, the use of AUROC does not reflect a model’s performance well when the proportion of negative outcomes is very high.13 For example, data in which 95% have no rejection will give the same AUROC as with a balanced data set although the false-positive rate using the unbalance data will be high. Calibration or goodness of fit refers to how accurately a model can predict the absolute risk estimates.12 Hence, to adequately interpret the results, discrimination and calibration need to be assessed at the same time.12 For example, if a model predicts the risk of dying much lower or higher than the actual risk of dying, then the calibration of that model is poor. The calibration curve and the Hosmer-Lemeshow test are commonly used to assess the degree calibration.6 Other Measures Accuracy, precision (or positive predictive value [PPV]), recall (or sensitivity), specificity, negative predictive value, and F1 scores are also commonly used to report on model performance. These measures can be easily obtained from a confusion matrix and should all be evaluated (Table 2). The confusion matrix, or error matrix, is a table summarizing the model performance.14 TABLE 2. - A, example of a confusion matrix—data. B, example of a confusion matrix—result A Actual “rejection” Actual “no rejection” Model A Predicted “rejection” 170 (TP) 70 (FP) Predicted “no rejection” 10 (FN) 50 (TN) Model B Predicted “rejection” 120 (TP) 10 (FP) Predicted “no rejection” 60 (FN) 110 (TN) B Model A Sensitivity 0.94 0.67 Specificity 0.42 0.92 Accuracy 0.73 0.77 Model B PPV 0.71 0.92 NPV 0.83 0.65 F1 Score 0.81 0.77 For example, in a set of 300 biopsy results, there are 180 cases of “rejection” and 120 cases of “no rejection.” Two machine learning models were trained and produced these predictions.Equations14 and calculations for model A: FN, false negative; FP, false positive; NPV, negative predictive value; PPV, positive predictive value; TN, true negative; TPR, true-positive rate; TR, true positive. Accuracy refers to the total fraction of correct predictions.14 Model A correctly predicted 170 true rejections and 50 true nonrejections in a cohort of 300, leading to an accuracy of 0.73 (Table 2). While model A and model B have similar accuracy (0.73, 0.77), their abilities to detect rejection vary. For example, model A can predict 170 cases as rejection out of 180 total rejection, while model B predicted only 120 cases (sensitivity 94% versus 67%). Hence, accuracy alone can give a false impression of a model, especially if the dataset is imbalanced. In datasets with very high true-negatives rejection (and therefore low true-positives rejection), accuracy can be high, yet it is not useful to detect rejection. Positive predictive value (PPV) or precision (true positives/[true positives + false positives]) refers to the ratio of correctly predicted positive cases to all predicted positive cases. It estimates the probability that a positive result implies the actual presence of the disease or outcomes.15 When a model has a high PPV, the positive result indicates the patient almost certainly has the disease. Unlike sensitivity, PPV is affected by the prevalence of the disease or the outcome.15 For example, the PPV in models A and B is 0.71 and 0.92, respectively (Table 2). If the prevalence of rejection is very high, model B is useful to detect a disease even with low sensitivity due to the high PPV.15 Negative predictive value (true negatives/[false negatives + true negatives])14 is defined as the correct negative predictions from the total of negative predictions (Table 2). Similar to PPV, a high-negative predictive value (NPV) indicates that the patient with a negative predicted outcome is highly likely not to have a disease. Unlike specificity, NPV is influenced by the prevalence.15 F1 score accounts for both precision and recall (or sensitivity) but is not influenced by true negative rates.14,16 Therefore, the F1 score is a more suitable performance metric when the data are imbalanced with a large number of true negatives.16 As mentioned above, high accuracy does not reflect the model’s performance adequately if the input dataset is imbalanced. An F1 score ranges from 0 to 1, and a higher value denotes better classification performance14 (Table 2). Imbalanced data should be accounted for in model development. When the positive (rejection) and negative (no rejection) outcomes are not properly balanced (usually <10% positive outcomes) in training data, the model performance may suffer. This can be mitigated by synthetically balancing the data using algorithms such as Synthetic Minority Over-Sampling Technique, using measures such as balanced accuracy, which is defined as an average of sensitivity and specificity,17 or a Bayesian modeling approach (ie, adding weight to the smaller subset to balance out the dominant subset).18 In any case, the algorithm trained and then tested on an imbalanced dataset needs to be retrained and retested after balancing. External Validation/Generalization Ideally, AI or ML models should be validated in a different patient population, at another site or in prospective cohorts, much like in traditional statistical models. Because the model performance could drop in a new data environment such as different computed tomography image resolution with different computed tomography scanner, different electronic medical health records system. It is, therefore, essential to pay attention to whether a model is externally validated. AI IN TRANSPLANTATION The impact of AI in transplantation has not been extensively studied but has gained more attention in recent years. We will introduce some recently published AI-related papers in transplantation and use the above-mentioned quality tests to describe the papers. One-year postheart transplant survival was investigated using the United Network of Organ Sharing data from 1987 to 2014.19 The authors tested multiple ML algorithms (eg, neural network) and traditional statistical models (eg, logistic regression).19 Overall, the study was well designed and followed proper methodology. For example, 80% of the data was used for training, and 20% was used for validation. Serial bootstrapping was performed to examine the stability of the C statistics. Calibration was assessed with calibration curves and the Hosmer-Lemeshow goodness of fit. The neural network showed similar C statistics (0.66) compared with logistic regression C statistics (0.65). However, the calibration of the neural network was lower than the traditional statistical models. The authors concluded that the ML algorithms trained on this dataset did not outperform traditional methods. However, they suggested that ML might outperform traditional statistical methods if augmented by data from electronic health records. Reeve et al10 reported on ensembles ML models providing automated kidney biopsy reports in conjunction with molecular measurement for allograft rejection. Their random forest-based algorithms demonstrated a similar level of balanced accuracy compared with pathologists (92% for T cell–mediated rejection and 94% antibody-mediated rejection). The study has multiple strengths: a very large dataset with a diverse population and a gold-standard reference such as confirmed rejection by pathologists. Also, the paper reported on the accuracy, sensitivity, specificity, PPV, NPV, and balanced accuracy for the readers to understand the model’s performance. This study would be strengthened by external validation. Bertsimas et al20 developed an optimized prediction of mortality model (OPOM) to predict LT candidate’s 3-month waitlist mortality or removal from the waitlist. During the model development, the authors used a very large dataset (1 618 966 observations) from the standard transplant analysis dataset in patients 12 years or older. The impact of OPOM was assessed with the liver simulated allocation model and compared with the model for end-stage liver disease (MELD). Allocation based on the OPOM model showed 417.96 fewer deaths per year compared with MELD, and more LT in female patients. The AUROC of OPOM was highest at 0.859 (0.841 MELD-Na). Overall, the study was well designed and improved on the current system, such as MELD. Interestingly, the authors included the pediatric population, which might impact the prediction results for adults. The authors carefully designed the study to be useful for HCC and non-HCC patients. The OPOM should be tested in a new patient group for external validation. CONCLUSION The use of AI in transplantation to answer pertinent questions is increasing and undoubtedly will continue to provide valuable information. As with any method or tool, it is essential to understand its limits and to be able to evaluate its application critically. Is the input dataset appropriate to answer the question? Is the internal and external validation of the model appropriate? How well did the algorithm performs discrimination, accuracy, calibration, F1 score, NPV, precision? Users should know the strength and limitations of the AI to avoid overreliance, which can lead to spurious conclusions or neglect clinical warning signs. Since AI algorithms may struggle to detect very rare diseases or unusual cases, clinicians’ discretion is important for those cases. Therefore, having a basic understanding of how to evaluate AI/ML papers will ensure that this promising methodology is appropriately applied.

Full Text

Published Version
Open DOI Link

Get access to 115M+ research papers

Discover from 40M+ Open access, 2M+ Pre-prints, 9.5M Topics and 32K+ Journals.

Sign Up Now! It's FREE

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call