Abstract

Integration of independent data resources across -omics platforms offers a transformative opportunity for novel clinical and biological discoveries. However, applying emerging analytic methods in the presence of selection bias is a noteworthy and pervasive challenge. We hypothesize that combining differentially selected samples for integrated transcriptome analysis will lead to bias in the estimated association between predicted expression and the trait. Our results are based on in silico investigations and a case example focused on body mass index across four well-described cohorts apparently derived from markedly different populations. Our findings suggest that integrative analysis can lead to substantial relative bias in the estimate of association between predicted expression and the trait. The average estimate of association ranged from 51.3% less than to 96.7% greater than the true value for the biased sampling scenarios considered, while the average error was −2.7% for the unbiased scenario. The corresponding 95% confidence interval coverage rate ranged from 46.4% to 69.5% under biased sampling, and was equal to 75% for the unbiased scenario. Inverse probability weighting with observed and estimated weights is applied as one corrective measure and appears to reduce the bias and improve coverage. These results highlight a critical need to address selection bias in integrative analysis and to use caution in interpreting findings in the presence of different sampling mechanisms between groups.

Highlights

  • We evaluate the magnitude and direction of bias through an in silico case study in which data are derived from four established cohorts, namely: (1) the Genotype-Tissue Expression (GTEx) project cohort[9] and independently generated data from (2) the National Health and Nutrition Examination Survey (NHANES)[10], a population-based cohort; (3) the Chronic Renal Insufficiency Cohort (CRIC)[11], an example “sick” cohort; and (4) the Genetics of Niacin and Endotoxemia (GENE) study cohort[12], a representative “healthy” cohort

  • The distributions of body mass index (BMI) across cohorts are described and compared as one marker to indicate whether the cohorts were derived from similar populations

  • These results are stratified by sex and race/ethnicity because of the established modifying role of sex and race/ethnicity in genetic associations with BMI[39–41], and are limited to individuals aged 21 to 70 for consistency across the GTEx, CRIC and NHANES cohorts


Summary

Results

In all cases, the Wilcoxon rank-sum test rejects the null hypothesis that the medians of the BMI distributions are equal. These results are consistent for Black/non-Hispanic women and men (Supplement Table S1), with the exception that we are unable to detect a difference in the BMI distribution for Black/non-Hispanic men between the GTEx and NHANES cohorts. As one corrective measure to address selective sampling, we apply inverse probability weighting (IPW)[27] in the first-stage model fitting procedure, using known and estimated sampling weights (see Methods) and data simulated according to scenario 2. IPW is an established approach for single-cohort analysis that accounts for differences due to non-random sampling from a target population; it involves applying a weight to each observation equal to the inverse of the probability that the observation was selected into the sample. The IPW coverage rates are 75.1% and 75.8% for known and estimated weights, respectively, representing a marked improvement in coverage compared to scenario 2 without IPW (69.5%), and are comparable to the RS scenario, in which we see 75.0% coverage.
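
The weighting idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy genotype and expression values, and the selection probabilities below are all hypothetical, and a single-predictor weighted least-squares fit stands in for whatever first-stage prediction model is actually used. Each observation receives a weight equal to the inverse of its assumed selection probability, and the slope is then estimated with those weights.

```python
def ipw_weights(selection_probs):
    """Return inverse probability weights (1 / p) for sampled observations."""
    if any(p <= 0 or p > 1 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    return [1.0 / p for p in selection_probs]


def weighted_slope(x, y, w):
    """Weighted least-squares slope of y on x, with an intercept."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return s_xy / s_xx


# Hypothetical toy data: genotype dosage (x), measured expression (y), and an
# assumed probability that each individual was selected into the sample.
dosage = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.4, 2.1, 2.9, 3.2, 4.6]
p_selected = [0.8, 0.4, 0.8, 0.4, 0.8, 0.2]

unweighted = weighted_slope(dosage, expression, [1.0] * len(dosage))
weighted = weighted_slope(dosage, expression, ipw_weights(p_selected))
print("unweighted slope:", unweighted)
print("IPW slope:", weighted)
```

Individuals who were unlikely to be sampled receive larger weights, so the weighted fit approximates the fit that would be obtained in the target population. When the selection probabilities are estimated rather than known, the same functions apply with the estimated values passed in.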

Discussion
Scenarios: 1: RS; 2: GTEx-RS; 3: GTEx-CRIC; 4: GTEx-GENE
Methods