Abstract
The use of electronic health record (EHR)-linked biobanks, including All of Us (AOU), the Michigan Genomics Initiative (MGI), and the UK Biobank (UKB), has become increasingly common in cancer research. However, each has a different participant recruitment strategy, resulting in non-probability samples, which can lead to selection bias. An unanswered question is whether researchers should use selection weights when analyzing these data. Using colorectal cancer as the test case, we investigate the impact of using such weights on descriptive and analytic tasks in these three biobanks. We curated sociodemographic and clinical diagnosis data for 726,841 individuals (n = 244,071, 81,243, and 401,167 in AOU, MGI, and UKB, respectively). EHR ICD (diagnosis) code data were mapped to 2,042 broader codes, called phecodes, using the new phecode X mapping table developed by researchers at Vanderbilt University. Selection weights were constructed in AOU and MGI to make them more representative of the US adult population, using data from the 2019 National Health Interview Survey; previously described weights for the UKB were used. We estimated phenomewide prevalences, pairwise correlations, and phenome dimensionality (via principal components analysis [PCA]). To investigate the role of weighting on conclusions from hypothesis testing and association estimation, we conducted a colorectal cancer phenomewide association study (PheWAS) and estimated the sex-colorectal cancer log-odds ratio, respectively. We found that phecode prevalences in AOU and MGI decreased following weighting (median prevalence ratio [MPR]: 0.82 and 0.61, respectively), while those in UKB increased (MPR: 1.06). MGI was enriched for phecodes compared to AOU (MPR: 1.15) and UKB (MPR: 6.28). Weighting the PCA had minimal impact on phenome dimensionality (e.g., 732 PCs explaining 95% of cumulative variation in AOU vs. 711 after weighting).
Weighted PheWAS for colorectal cancer identified 21 hits not identified in unweighted PheWAS, but only one was from a new disease category. Weighted estimates of the female-colorectal cancer log-odds ratio overlapped with the benchmark range in MGI and UKB but indicated a null association in AOU. Weighting had limited impact on dimensionality estimation and hypothesis testing but was important to consider for prevalence and association estimation. Results from untargeted analyses should be followed up with targeted analyses using curated weights. The importance of weights depends on the estimates obtained and the inference goals, and weighting can improve the representativeness of results for cancer-related outcomes based on EHR-linked biobank data. Importantly, EHR-linked biobanks should be explicit in reporting recruitment and selection mechanisms and, when possible, supply selection weights to researchers for population-based inference along with a clear definition of the target population.
Citation Format: Maxwell Salvatore, Ritoban Kundu, Xu Shi, Christopher R. Friese, Seunggeun Lee, Lars G. Fritsche, Alison M. Mondul, David Hanauer, Celeste L. Pearce, Bhramar Mukherjee. To weight or not to weight? Studying the effect of selection bias in three EHR-linked biobanks with applications to colorectal cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4866.
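As an illustration of the median prevalence ratio (MPR) used in the abstract to compare weighted and unweighted phecode prevalences, the following is a minimal sketch rather than the authors' implementation. It assumes a binary participant-by-phecode indicator matrix and one selection weight per participant; the function names, the toy data, and the choice to skip phecodes with zero unweighted prevalence are assumptions made for this example.

```python
import numpy as np

def weighted_prevalence(indicators, weights):
    """Weighted prevalence per phecode: sum_i(w_i * y_ij) / sum_i(w_i)."""
    indicators = np.asarray(indicators, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ indicators / weights.sum()

def median_prevalence_ratio(indicators, weights):
    """Median over phecodes of (weighted prevalence / unweighted prevalence)."""
    indicators = np.asarray(indicators, dtype=float)
    unweighted = indicators.mean(axis=0)
    weighted = weighted_prevalence(indicators, weights)
    # Assumption: phecodes never observed in the sample are excluded,
    # because their ratio is undefined.
    observed = unweighted > 0
    return float(np.median(weighted[observed] / unweighted[observed]))

# Hypothetical toy data: 4 participants (rows) by 3 phecodes (columns).
toy_indicators = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])
# Hypothetical selection weights that down-weight the first two participants.
toy_weights = np.array([0.5, 0.5, 1.5, 1.5])

mpr = median_prevalence_ratio(toy_indicators, toy_weights)
print(f"Median prevalence ratio (weighted / unweighted): {mpr:.2f}")
```

In this toy example the participants carrying the most diagnoses receive the smallest weights, so the weighted prevalences fall and the MPR is below 1, the same direction reported for AOU and MGI.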