Abstract

We appreciate Hennessy and Leonard's [1] comments on our paper and their strong support for the need to carefully characterize the performance of epidemiologic methods and analysis choices (which we collectively refer to as analyses). The work performed as part of the Observational Medical Outcomes Partnership (OMOP) is but a first step in this journey. We particularly appreciate the authors making the point that "Problematically implemented studies do not invalidate the underlying research designs, just those implementations". We fully agree with this assertion and with the importance of beginning to systematically answer questions about what makes analyses problematic. We also share their belief that the empirical assessment of the performance of analyses applied to observational datasets is an essential prerequisite for understanding the reliability of any evidence developed from observational studies.

Every measurement approach is limited in precision and accuracy, and the OMOP investigators have consistently acknowledged that the performance evaluation framework we used has limitations [2]. However, without any measure of performance, we are limited to making assumptions based on our theories. Science advances primarily by testing such theories and improving them based on empirical evidence. Ideally, we would measure the performance of analyses with a reference set of positive controls with known effect sizes and negative controls without an effect. Unfortunately, such a collection does not exist, so we developed a reference set based on a systematic evaluation of multiple knowledge sources (including literature, product labeling, and a systematic review) that provide insight into others' assessment of a causal relationship between selected exposures and outcomes [3]. The evidence supporting these controls falls short of the desired, but inherently unknowable, gold standard, and is further limited by the fact that most positive controls may have been known to clinicians before our data were generated [4]. It is important to note that we relied on positive controls only for measures of discrimination, while estimates of error and calibration are based only on negative controls, which suffer less from these shortcomings.

Simulated data, whether fully synthetic or created by injecting a known signal into real data, are another approach to creating a 'gold standard' [5]. This approach suffers from its own limitations, including the validity of its assumptions and the degree to which it reflects the relevant complexities of the real world. Experiments based on simulated or real-world data alone are unlikely to be sufficient given their respective limitations; rather, the two approaches offer complementary evidence on empirical performance. We measured the performance of analyses using fully synthetic data and obtained essentially the same results as when using the real data [6].

Despite these limitations, it is clear to us that empirically measuring the performance of analyses for a particular question on a specific dataset (not generic performance as a one-size-fits-all solution) and using the empirical operating characteristics to calibrate the result is essential. Expert opinion and subjective arguments about theoretical beliefs around potential bias are not a sound foundation on which to develop evidence that is intended to inform medical decisions, whether at the population or individual patient level. As is the case for controls, there is no gold standard for the definition of health outcomes of interest (HOIs).
Correspondence: J. M. Overhage, Siemens Medical Solutions, Malvern, PA, USA. E-mail: marc.overhage@siemens.com
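For readers unfamiliar with how a reference set of controls can be used to quantify performance, the following is a minimal, illustrative sketch of the two measurements distinguished above: discrimination assessed with both positive and negative controls, and calibration derived from an empirical null fitted to negative-control estimates only. This is not the OMOP implementation; the numbers, variable names, and the simple normal null are assumptions made here for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

# Hypothetical log hazard-ratio estimates produced by one analysis run against
# a reference set: negative controls (assumed true log HR = 0) and positive
# controls (assumed true effect > 0). All values are invented for illustration.
neg_control_log_hr = np.array([0.05, -0.10, 0.22, 0.15, -0.03, 0.30, 0.08, -0.12])
pos_control_log_hr = np.array([0.55, 0.80, 0.40, 1.10, 0.65])

# Discrimination: how well does the analysis separate positive from negative
# controls? Here summarized as the area under the ROC curve over point estimates.
labels = np.concatenate([np.zeros_like(neg_control_log_hr),
                         np.ones_like(pos_control_log_hr)])
scores = np.concatenate([neg_control_log_hr, pos_control_log_hr])
auc = roc_auc_score(labels, scores)

# Calibration: fit an empirical null distribution to the negative-control
# estimates (a plain normal fit here; a fuller treatment would also account
# for the sampling error of each estimate) and use it to calibrate a new result.
null_mean = neg_control_log_hr.mean()
null_sd = neg_control_log_hr.std(ddof=1)

def calibrated_p_value(log_hr_estimate):
    """Two-sided p-value of an estimate under the fitted empirical null."""
    z = (log_hr_estimate - null_mean) / null_sd
    return 2 * stats.norm.sf(abs(z))

print(f"AUC over reference set: {auc:.2f}")
print(f"Calibrated p-value for log HR = 0.45: {calibrated_p_value(0.45):.3f}")
```

The point of the sketch is that both quantities are properties of a specific analysis applied to a specific dataset: rerunning it on a different database or with different analysis choices would generally yield different operating characteristics, which is why performance is measured per question rather than assumed generically.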
