Abstract

Genome-wide association studies generate large volumes of results. While the strongest signals are the focus of most reports, full online publication of thousands or millions of association results has been encouraged. These aggregate results may be valuable for future scientific research, but analysts have recently shown that they may also reveal information about participants. Specifically, if study participants’ genetic information is available, large-scale reporting of population-level variant-disease associations enables easy reconstruction of individuals’ disease states. That individual-level information can be obtained from aggregate association results may be surprising. In fact, individual data can be recovered with high precision. For example, in a study of 1000 cases and 1000 controls, reporting separate variant-disease counts for 5000 variants could enable anyone who knew a study participant’s genotype to determine his or her disease status with 99% sensitivity and specificity. Aggregate regression results also may reveal information; reporting the default (additive model) odds ratios for 10 000 variants and disease gives the same prediction accuracy. The phenomenon is not limited to binary disease states (ie, either having or not having a disease). In a typical recent genome-wide association study, associations between left ventricular mass and 2.5 million genetic variants were studied in 12 612 individuals. A full report of these associations would provide 2.5 million regression estimates. With data from a single variant, the corresponding regression estimate can be used to give a very weak prediction of a participant’s left ventricular mass; for example, multiplying the regression coefficient by a person’s number of copies of the variant gives a weak prediction of how far his or her disease state is from the sample average.
With 2.5 million variants, the average of these predictions yields a very precise determination of an individual’s left ventricular mass. The Figure illustrates the phenomenon, giving within-sample predictions of left ventricular mass based on just 35 000 variants. The correlation between predicted and measured left ventricular mass is 0.86, a value typically seen in test-retest variability of left ventricular mass measurement. In other words, for predicting the original measurement, using genotypes and aggregate results performs at least as well as obtaining another actual echocardiogram. Because the clinical predictive ability of common genetic variants has been disappointing, it may seem paradoxical that individual outcomes can be determined so well from a set of association results. However, the targets of prediction are different. Clinicians want to predict disease in new patients, whereas here the genetic data are being used to reconstruct the original disease status that went into the published regression estimates. In other applications of predictive models, this distinction between in-sample and out-of-sample prediction is well known and has motivated bias-correcting techniques such as replication, cross-validation, and resampling. The ability to infer the disease states of genome-wide association studies’ participants raises important issues of consent. Even if participants have agreed to the release of their genetic data, typically they will not have consented to the release of sensitive disease information. In this situation, publishing aggregate results for thousands of variants could be seen as breaching the limits of consent. Problems also arise when genetic data are not public. If a participant has used one of the increasing number of commercial genotyping services, naive publishing of aggregate results could disclose his or her disease information to that third party. Given these concerns, publishing complete genome-wide aggregate results is not safe.
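The in-sample reconstruction described above (multiply each published regression coefficient by a participant's centered allele count, then average across variants) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' analysis: the function name `simulate_reconstruction`, the sample sizes, the allele-frequency range, and the choice of a phenotype with no genetic effect at all are all hypothetical, picked only to isolate the in-sample effect from any real biology.

```python
import numpy as np


def simulate_reconstruction(n_people=200, n_variants=5000, seed=0):
    """Return (in-sample, out-of-sample) correlation between predicted and
    actual phenotype, using only per-variant regression slopes.

    Hypothetical setup: independent variants and a phenotype that is pure
    noise, so any predictive ability comes from in-sample leakage alone.
    """
    rng = np.random.default_rng(seed)

    # Allele frequencies and genotypes (0, 1, or 2 copies of each variant).
    freqs = rng.uniform(0.1, 0.5, size=n_variants)
    geno = rng.binomial(2, freqs, size=(n_people, n_variants)).astype(float)

    # Phenotype unrelated to genotype (assumption made for illustration).
    pheno = rng.normal(size=n_people)

    # "Published" aggregate results: one simple-regression slope per variant.
    geno_mean = geno.mean(axis=0)
    centered = geno - geno_mean
    slopes = centered.T @ (pheno - pheno.mean()) / (centered ** 2).sum(axis=0)

    # Reconstruction for study participants: average of slope * allele count.
    in_sample_pred = (centered * slopes).mean(axis=1)

    # Same slopes applied to new people who were not in the study.
    new_geno = rng.binomial(2, freqs, size=(n_people, n_variants)).astype(float)
    new_pheno = rng.normal(size=n_people)
    out_sample_pred = ((new_geno - geno_mean) * slopes).mean(axis=1)

    r_in = np.corrcoef(in_sample_pred, pheno)[0, 1]
    r_out = np.corrcoef(out_sample_pred, new_pheno)[0, 1]
    return r_in, r_out


if __name__ == "__main__":
    r_in, r_out = simulate_reconstruction()
    print(f"in-sample correlation:     {r_in:.2f}")
    print(f"out-of-sample correlation: {r_out:.2f}")
```

With many more variants than participants, the in-sample correlation in this sketch should be close to 1 while the out-of-sample correlation stays near 0, which mirrors the distinction drawn above between reconstructing the original measurements and predicting outcomes in new patients.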
Because genotypes at a few thousand independent variants appear to be the minimum data required for accurate prediction, reporting only the highly significant results from most genome-wide association studies will typically not disclose disease information. However, the compromise position of publishing all associations that reach intermediate levels of significance (such as P 10) will often allow unacceptably accurate predictions. In the post-genome era, traditional boundaries between individual and aggregate data have become blurred. Although further work is required to identify situations in which individual information will be disclosed, as an interim measure, the current authors recommend that genome-wide
