Abstract

The International Coalition of Medicines Regulatory Authorities recently announced a collaboration to address the expanded use of Real-World Data (RWD) to support regulatory decision-making.1 Drug safety and comparative effectiveness studies using electronic health records and medical claims databases are increasingly used because, compared with randomized clinical trials (RCTs), they can offer larger populations and be completed more quickly at a fraction of the cost. While database studies deliver results with impressive efficiency, this efficiency can come at a scientific cost. Instead of designing data collection to meet the scientific needs of a particular study, database studies analyze (secondary) data not designed to support scientific research. Consequently, database studies are more vulnerable than RCTs to biases arising from inaccurate data. A challenge in using database studies to guide health policy decisions, therefore, is ensuring that they are not only efficient but also accurate.

One potential source of error in database studies is outcome misclassification. Diagnoses recorded in health encounter databases are influenced by factors such as variation in patient care-seeking behavior, access to care, quality of care, the clinical challenges of differential diagnosis, and diagnoses that can indicate suspected or rule-out conditions. To address uncertainties in identifying health outcomes, database researchers often identify outcomes using combinations of diagnoses, procedures, treatments, medical specialties, place of service, and other factors in the form of an algorithm (also called a computable phenotype or operational definition).2, 3 Owing to limitations in source data, however, even the best algorithms are unlikely to identify every case of interest (false negative errors) and can identify as cases individuals who do not have the outcome (false positive errors). Because errors in identifying outcomes can distort study results, it has been proposed that “validation of algorithms is essential to avoid misclassification bias.”4 A recent FDA Guidance for RWD advises that “FDA expects validation of the outcome variable to minimize outcome misclassification.”5

Regulators impose high scientific standards, and although validation is not widely practiced in epidemiology, it is routinely requested for studies submitted to regulators. Many public health researchers receive little formal training in validation, however, and validation studies are not covered thoroughly in texts.6 Consequently, many researchers are guided by convention, and validation tends to be conducted perfunctorily. Even as regulators contemplate wider use of database studies, and despite routine use of validation, the impact of outcome misclassification bias in these studies is rarely assessed and remains largely unknown. Analytic methods to adjust results for misclassification have been described but have not been widely adopted.7-17

In this Commentary, we highlight shortcomings in popular validation methods and strategies for reducing misclassification bias. Basic concepts of validation have been described recently for epidemiology4, 6 and specifically as practiced in pharmacoepidemiology.3 In brief, outcome validation entails comparing outcome classification in the database with outcome classification in a gold (or reference) standard (e.g., medical records or disease registry data).
Because the gold standard is typically more expensive and time-consuming to obtain than the information in the database used for the main study, the gold standard is usually applied to a sample of the study population to assess the performance of the case-identifying algorithm.2, 3, 16 Algorithm performance is assessed by parameters such as positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity. The conventional strategy is to report the proportion of potential cases that was confirmed by the gold standard, or PPV (Table 1).2-4, 16-18 A high PPV means the outcome of interest is confirmed for a large proportion of potential cases and is taken to suggest the algorithm is fit for purpose, with a PPV of 70%–80% or greater considered to indicate a high-performing algorithm.3, 16, 18

For a couple of reasons, the conventional practice of reporting an effect estimate and a PPV does not avoid or minimize bias from outcome misclassification. First, a PPV reflects only false positive errors (its complement, 1 − PPV, is the proportion of people identified as cases who are noncases); it does not address cases the algorithm failed to identify (false negative errors). While the importance of each kind of error varies by setting, each type of error can bias study results.7-17 Second, understanding the impact of misclassification on study results requires parameters that estimate false positive (specificity or PPV) and false negative (sensitivity or NPV) errors in each of the comparison groups.7-17 Differences between comparison groups are readily seen as important in the context of confounding bias, but are less easily recognized for misclassification bias. Even in the presence of a high PPV, small differences in misclassification errors between comparison groups can introduce important bias and lead to incorrect conclusions.9, 16 A simulation study demonstrated that a result showing a negative association (RR = 0.87 (95% CI 0.76, 0.99)) with an outcome PPV of 93% could be the result of outcome misclassification masking an even stronger association (RR = 0.70 (95% CI 0.60, 0.81)).16 Alternatively, the negative result could be entirely spurious, with misclassification bias creating an association where in fact there was none (RR = 1.0 (95% CI 0.88, 1.13)).16 Simply put, reporting an effect estimate accompanied by a high PPV does not indicate the presence, direction, or magnitude of outcome misclassification bias.

FDA recommends including a quantitative bias analysis to demonstrate whether and how outcome misclassification might affect study results. Using validation to inform bias analyses has implications for the design of validation studies.6 The conventional approach to validation in database studies is to calculate a single PPV, whereas to support bias analysis, validation studies would need to be stratified by exposure status, calculating a PPV for each comparison group.6-17 Although conceptually straightforward, stratifying validation samples by exposure status would necessitate larger validation samples to maintain precision. In addition, unless there are sound reasons to disregard false negative errors, estimates of sensitivity or NPV are also needed.7-17 Quantitative bias analyses are usually presented using estimates of sensitivity and specificity, which can be challenging to estimate in database studies.
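To make the point above concrete, the following is a minimal forward simulation of how a modest, differential gap in algorithm sensitivity between comparison groups can bias a risk ratio even when the PPV is above 90% in both groups. All population sizes, risks, sensitivities, and specificities below are hypothetical illustration values; they are not taken from the cited simulation study.

```python
# Minimal sketch: differential outcome misclassification can bias a risk ratio
# even when the algorithm's PPV is high in both comparison groups.
# All inputs below are hypothetical illustration values.

def observed_counts(n, true_risk, sensitivity, specificity):
    """Expected algorithm-identified cases and PPV for a group of size n,
    given the true outcome risk and the algorithm's sensitivity/specificity."""
    true_cases = n * true_risk
    true_noncases = n * (1 - true_risk)
    true_positives = sensitivity * true_cases
    false_positives = (1 - specificity) * true_noncases
    identified = true_positives + false_positives
    ppv = true_positives / identified
    return identified, ppv

# Hypothetical scenario: no true association (true RR = 1.0), but sensitivity
# differs slightly between the exposed and unexposed groups.
n_exposed, n_unexposed = 100_000, 100_000
true_risk = 0.010  # same true risk in both groups

id_exp, ppv_exp = observed_counts(n_exposed, true_risk, sensitivity=0.70, specificity=0.9995)
id_unexp, ppv_unexp = observed_counts(n_unexposed, true_risk, sensitivity=0.80, specificity=0.9995)

observed_rr = (id_exp / n_exposed) / (id_unexp / n_unexposed)
print(f"PPV exposed:   {ppv_exp:.2f}")    # ~0.93
print(f"PPV unexposed: {ppv_unexp:.2f}")  # ~0.94
print(f"Observed RR:   {observed_rr:.2f} (true RR = 1.00)")  # ~0.88
```

In this hypothetical scenario, a 10-percentage-point difference in sensitivity produces an apparent protective association (observed RR of about 0.88) where none exists, even though both PPVs exceed 90%.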
Bias analysis methods have also been developed for predictive values,6-12, 16, 17 but while estimates of PPV are commonplace, estimates of NPV in studies of rare outcomes can require large sample sizes and are rarely reported. It is sometimes thought that PPV is the only bias parameter that can be estimated in database studies.16 Brenner and Gefeller8 developed a simple bias analysis that uses PPVs in each comparison group and so is well suited to correcting for outcome misclassification bias in pharmacoepidemiology. Using only PPVs requires that sensitivity be nondifferential, however, which is a tenuous assumption rarely supported by data.19-23 They also present an equation for a simple bias analysis that replaces the nondifferentiality assumption with estimates of PPV and sensitivity in each comparison group.8, 17

Estimating sensitivity in a database study is not as straightforward as estimating PPV. To estimate sensitivity, we need as a denominator a sample of true cases. When studying rare outcomes, a random sample of the study population would need to be quite large to identify a useful number of true cases. By excluding noncases using conditional sampling, however, it may be possible to target a sample enriched in people more likely to be cases.24 For instance, one could design a “screening” algorithm aimed at capturing all (or nearly all) of the cases so that it has near-perfect sensitivity, while still being specific enough to exclude many noncases.25 Validating a random sample of people identified by such a screening algorithm could generate sufficient cases to estimate the sensitivity of algorithms designed to have high PPVs (Table 1).

Suppose one were interested in the occurrence of serious acute nonfatal myocardial infarction (AMI). A primary algorithm might require a primary hospital discharge diagnosis of AMI.26, 27 To assess the sensitivity of this algorithm in the study population, a more sensitive screening algorithm could be developed by relaxing the requirement that the AMI diagnosis be in the primary position, thus including a discharge diagnosis of AMI in any position.28 One might also broaden the list of diagnoses to include diagnoses associated with AMI, such as unstable angina. Such a screening algorithm might identify all AMI cases of interest and include a larger proportion of AMI cases than a simple random sample of the population. One could then obtain gold standard information on AMI diagnosis for a random sample of people identified by the screening algorithm. Because the population identified by the screening algorithm is enriched in AMI cases, a much smaller sample is needed to obtain a desired number of AMI cases than would be needed from a sample of the entire study population. After outcome classification using the gold standard, the confirmed AMI cases could then serve as the denominator for estimating the sensitivity of the primary algorithm (e.g., a primary discharge diagnosis of AMI). By designing the validation sample to include people from each comparison group, one could estimate the PPV and sensitivity of the primary algorithm in each comparison group. These bias parameter estimates could then be used in a bias analysis to correct the RR estimate for outcome misclassification bias to the accuracy of the gold standard, as sketched below.8, 17
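The sketch below shows how such exposure-stratified validation results might feed a simple bias analysis. Sensitivity is estimated among gold-standard-confirmed cases drawn from the screening-algorithm validation sample (assuming the screening algorithm captures essentially all true cases), and the risk ratio is then corrected using the identity that true cases equal identified cases times PPV divided by sensitivity; with nondifferential sensitivity this reduces to a PPV-only correction of the kind described by Brenner and Gefeller. All counts, parameter values, and function names are hypothetical and for illustration only.

```python
# Minimal sketch of using exposure-stratified validation results to correct a
# risk ratio for outcome misclassification. All counts and parameter values
# are hypothetical; they do not come from any real validation study.

def ppv(confirmed_primary_positives, sampled_primary_positives):
    """PPV of the primary algorithm: gold-standard-confirmed cases divided by
    algorithm-identified cases in the validation sample."""
    return confirmed_primary_positives / sampled_primary_positives

def sensitivity(confirmed_meeting_primary, confirmed_total):
    """Sensitivity of the primary algorithm, estimated among gold-standard-
    confirmed cases found in the screening-algorithm validation sample
    (assumes the screening algorithm captures essentially all true cases)."""
    return confirmed_meeting_primary / confirmed_total

def corrected_risk_ratio(rr_observed, ppv1, ppv0, se1, se0):
    """Correct an observed risk ratio using group-specific PPV and sensitivity.
    Based on: true cases = identified cases * PPV / sensitivity, so
    RR_true = RR_observed * (PPV1 / PPV0) * (Se0 / Se1)."""
    return rr_observed * (ppv1 / ppv0) * (se0 / se1)

# Hypothetical validation results, stratified by exposure group.
ppv1 = ppv(confirmed_primary_positives=180, sampled_primary_positives=200)  # exposed
ppv0 = ppv(confirmed_primary_positives=185, sampled_primary_positives=200)  # unexposed
se1 = sensitivity(confirmed_meeting_primary=140, confirmed_total=175)       # exposed
se0 = sensitivity(confirmed_meeting_primary=155, confirmed_total=180)       # unexposed

rr_observed = 0.87  # hypothetical effect estimate from the main database study
rr_corrected = corrected_risk_ratio(rr_observed, ppv1, ppv0, se1, se0)
print(f"PPV exposed {ppv1:.2f}, unexposed {ppv0:.2f}")
print(f"Sensitivity exposed {se1:.2f}, unexposed {se0:.2f}")
print(f"Observed RR {rr_observed:.2f} -> corrected RR {rr_corrected:.2f}")
```

A fuller analysis would also propagate the sampling uncertainty in these validation estimates (for example, through probabilistic bias analysis) rather than apply a single deterministic correction.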
Conventional practice in safety and effectiveness studies that use validation is to report an effect estimate along with a PPV to assess algorithm performance. A problem with this approach is that we cannot infer from a PPV the impact of outcome misclassification errors on study results. A high PPV does not mean outcome misclassification bias is negligible, and a low PPV does not mean that misclassification bias is important. A preferable approach is to use validation in RWD to inform quantitative bias analysis that corrects effect estimates for misclassification bias to the accuracy of the gold standard. Such a strategy requires larger validation samples and additional resources to complete database studies, but it would not substantially affect the efficiency advantage of database studies compared with RCTs.

Validation for RWD studies is common only for studies submitted to regulators, mainly because regulators request it. The proposed approach of using validation to inform bias analysis is likewise likely to be adopted only if regulators request it. We recommend that regulators strongly consider requesting validation studies to inform quantitative bias analyses. Studies that use quantitative bias analysis to correct results for misclassification bias would strengthen the validity of RWD studies so that they can be used more confidently to support societal and personal health policy decisions.

The authors declare no conflict of interest.
