Quantitative Models for Causal Analysis in the Era of Genome Wide Association Studies

Steven S Coughlin

doi:10.2174/1874924001003010118

Abstract

Causal inference in health research is a complex endeavor, partly because the biomedical enterprise involves researchers from many disciplines, including clinical medicine, epidemiology, genetics, basic sciences such as pathology and cell biology, and the behavioral sciences. A multidisciplinary approach is often needed to study health concerns and interpret findings, drawing upon expertise from epidemiologists, statisticians, physicians, nurses, geneticists, psychologists, and other practicing clinicians and researchers. In addition to the diversity of scientific disciplines and professions represented in many study groups, the range of health topics that can be studied is large. It includes physical injuries such as traumatic brain injury; pain syndromes and other neurological conditions; chronic health conditions such as obesity, cancer, respiratory illnesses, and cardiovascular disease; gastrointestinal illnesses such as irritable bowel syndrome; infectious diseases such as H1N1 influenza and hepatitis C; psychiatric conditions such as post-traumatic stress syndrome, depression, and suicide; adverse reproductive outcomes; and other health problems and concerns. Another feature of health research is that researchers employ a range of study designs, including surveillance systems, observational studies with a case-control or cohort design, cross-sectional surveys, and randomized controlled trials. In recent years, observational studies have come to include the large platforms of cases and controls that are identified for genome-wide association studies [1, 2]. In addition to statistical geneticists, the researchers who analyze data from genome-wide association studies and proteomics research often include persons with expertise in bioinformatics or machine learning techniques.
These three features of health research (diversity of scientific disciplines, wide variety of health topics of interest, and alternative study designs) create both challenges and opportunities for researchers attempting to identify causal associations with possible etiologic agents and new therapeutic targets, so that research findings can be translated into targeted clinical interventions and evidence-based therapies. For example, in studies with an observational design, where assignment of exposures is not under the control of the investigators, assessments of causality can be more challenging than in randomized trials [3, 4]. Investigations into the distribution and determinants of health conditions attempt to gain new knowledge through observation and inductive logic. Causal criteria commonly cited in epidemiology include temporal order of exposure and disease, biologic gradient or dose-response curve, biologic plausibility, biologic coherence, and consistency of findings, although some authors have recommended subsets of the criteria or refined definitions [5-7]. The strength of the observed association is also important in some assessments of causality. Criteria for causal inference are widely used as a heuristic aid for assessing whether associations observed in epidemiologic research are causal. However, criteria-based methods provide only general guidelines for assessing the causality of associations rather than a strict checklist for identifying a causal relationship [3, 4]. The model of sufficient component causes [8] is widely used in epidemiology as a framework for teaching and understanding multicausality. A sufficient cause is made up of a number of component causes, no one of which is sufficient for the disease or adverse health condition on its own [4, 8]. Diseases and adverse health conditions can be caused by more than one causal mechanism, and each causal mechanism involves the combined action of several component causes.
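The sufficient-component cause model described above can be illustrated with a minimal Python sketch. This is only a toy illustration: the component names (`genetic_variant`, `environmental_exposure`, `older_age`, `unknown_factor_u`) and the two causal mechanisms are hypothetical and are not taken from any study cited here. Each sufficient cause is represented as a set of component causes, and disease is taken to occur when every component of at least one sufficient cause is present.

```python
# Hypothetical sufficient causes (illustrative assumption, not from the article):
# each set lists the component causes that jointly complete one causal mechanism.
SUFFICIENT_CAUSES = [
    frozenset({"genetic_variant", "environmental_exposure"}),
    frozenset({"environmental_exposure", "older_age", "unknown_factor_u"}),
]


def disease_occurs(present, sufficient_causes=SUFFICIENT_CAUSES):
    """Return True if all components of at least one sufficient cause are present."""
    present = set(present)
    return any(cause <= present for cause in sufficient_causes)


def necessary_components(sufficient_causes=SUFFICIENT_CAUSES):
    """Return the components shared by every sufficient cause (necessary causes)."""
    return frozenset.intersection(*sufficient_causes)


if __name__ == "__main__":
    # A single component on its own does not complete any mechanism.
    print(disease_occurs({"genetic_variant"}))
    # The same component combined with its partner completes the first mechanism.
    print(disease_occurs({"genetic_variant", "environmental_exposure"}))
    # Components that appear in every mechanism are necessary for disease.
    print(necessary_components())
```

In this toy setup, no single component produces disease by itself, while a component that appears in every sufficient cause behaves as a necessary cause.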
For example, both genetic factors and environmental exposures may have a role in the development of neurologic conditions such as amyotrophic lateral sclerosis. Other examples of diseases caused by interactions between genes and environment include complex, common diseases such as cancer, coronary heart disease, and diabetes [1]. A large and growing literature has dealt with statistical modeling approaches for estimating causal parameters or identifying causal associations using data from observational studies [9-13]. However, much of this important literature has not dealt directly with the special challenges that arise in causal assessments of data from genome-wide association studies, including studies that collect information about environmental exposures. Recent advances in genetics have challenged traditional frameworks for causal inference in observational research [14]. The goal of this article is to consider challenges that arise in causal assessments of data from genome-wide association studies, which use high-throughput genotyping technologies to analyze biological specimens collected from large numbers of cases and controls for up to one million single nucleotide polymorphisms (SNPs) [1, 2]. Before considering those challenges, I briefly discuss key developments in quantitative models for causal analysis: counterfactual analysis, graphical causal models, and structural equation modeling. I then provide a summary of quantitative techniques for analyzing data from genome-wide association studies and related gene expression and proteomic data, and offer some recommendations for causal assessments of results from such studies.

Full Text